What is dplyr?

There’s the joke that 80 percent of data science is cleaning the data and 20 percent is complaining about cleaning the data.

Anthony Goldbloom, Founder and CEO of Kaggle

Clean data is essential in any data science project, because the results can only be as good as the underlying data. Cleaning data is also the part that usually consumes the most time and causes the most pain for data scientists. R already offers a broad set of tools and functions to manipulate data frames. However, due to its long history, the available base R tool set is fragmented and hard for new users to learn.

The dplyr package facilitates the data transformation process through a consistent collection of functions. These functions support different transformations on data frames, including:

  • filter rows
  • select columns
  • sort data
  • aggregate data
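As a minimal sketch, each of the transformations listed above maps to one dplyr function: filter(), select(), arrange() and summarise() (combined with group_by()). The sales data frame and its column names are made up for illustration and do not come from the original text.

```r
library(dplyr)

# Hypothetical example data, invented for this sketch
sales <- data.frame(
  region  = c("north", "south", "north", "west"),
  product = c("a", "b", "c", "a"),
  revenue = c(120, 80, 200, 50)
)

# Filter rows: keep only sales with revenue above 60
large_sales <- filter(sales, revenue > 60)

# Select columns: keep only region and revenue
region_revenue <- select(sales, region, revenue)

# Sort data: order by revenue, largest first
sorted_sales <- arrange(sales, desc(revenue))

# Aggregate data: total revenue per region
revenue_by_region <- summarise(group_by(sales, region), total = sum(revenue))
```

Every function takes a data frame as its first argument and returns a new data frame, which is what makes the collection feel consistent.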

Multiple data frames can also be joined together by common attribute values.
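A join on a common attribute could look like the following sketch. The orders and customers data frames and the customer_id key are assumptions made up for this example; inner_join() and left_join() are two of the join functions dplyr provides.

```r
library(dplyr)

# Hypothetical example data, invented for this sketch
orders <- data.frame(
  customer_id = c(1, 2, 2, 3),
  amount      = c(10, 25, 5, 40)
)
customers <- data.frame(
  customer_id = c(1, 2, 4),
  name        = c("Ada", "Ben", "Cy")
)

# inner_join keeps only rows whose customer_id occurs in both data frames
matched <- inner_join(orders, customers, by = "customer_id")

# left_join keeps every order and fills name with NA where no customer matches
all_orders <- left_join(orders, customers, by = "customer_id")
```

Here the order for customer 3 is dropped by the inner join but kept, with a missing name, by the left join.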

The consistency of dplyr functions improves usability and enables users to chain transformations together to form data pipelines. These pipelines can also be seen as a high-level query language, much like SQL for database queries. It is even possible to translate such pipelines to other back-ends, including databases.
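To illustrate the pipeline idea, the sketch below chains several steps with the pipe operator %>%, which dplyr makes available. The sales data is again hypothetical, and the SQL in the comment is only a rough hand-written equivalent to show the resemblance, not the output of any translation tool.

```r
library(dplyr)

# Hypothetical example data, invented for this sketch
sales <- data.frame(
  region  = c("north", "south", "north", "west"),
  revenue = c(120, 80, 200, 50)
)

# Roughly comparable SQL:
#   SELECT region, SUM(revenue) AS total
#   FROM sales
#   WHERE revenue > 60
#   GROUP BY region
#   ORDER BY total DESC
top_regions <- sales %>%
  filter(revenue > 60) %>%           # keep the larger sales
  group_by(region) %>%               # one group per region
  summarise(total = sum(revenue)) %>% # aggregate within each group
  arrange(desc(total))               # largest total first
```

Each step passes its result to the next, so the pipeline reads top to bottom in the order the transformations are applied. Translating a pipeline like this to a database is typically done through the separate dbplyr package, which is not shown here.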

Introduction to dplyr