
Aggregating and analyzing data with dplyr | R Language


In this article, we will discuss how to aggregate and analyze data with the dplyr package in the R Programming Language.

What is dplyr package in R?

The dplyr package is used in the R Programming Language to manipulate and transform data. It can be installed into the working space using the following command.

install.packages("dplyr")

The dplyr package in R is like a toolbox for working with data. It lets you do different things with a data frame, like picking specific columns, filtering out rows you don't need, grouping similar things together, finding averages, and combining information from different tables.

Here are the main functions of the dplyr package:

  1. select()
  2. filter()
  3. arrange()
  4. mutate()
  5. group_by() and summarize()
  6. join functions such as inner_join()
  7. distinct()
  8. rename()

There are a large number of inbuilt methods in the dplyr package that can be used for aggregating and analyzing data. Some of these methods are as follows.

Filtering rows

The filter method in the dplyr package in R is used to select a subset of rows of the original data frame, keeping only the rows for which the specified condition holds true. The condition may use any logical or comparison operator to filter the necessary values.

Syntax : filter(data, cond)

Arguments:

data- the data frame to be manipulated

cond- the condition to be checked to filter the values

In the following code snippet, we select all the rows whose subject is Maths. Two such rows are returned from the original data frame.

R




#loading the required library
library(dplyr)
#creating a data frame
data_frame = data.frame(subject = c("Maths","Hindi","English","English","Hindi",
                                   "Maths","Hindi"),
                        marks = c(34,23,41,11,35,67,87))
print("Original Data frame")
print(data_frame)
print("Data frame with maths subject")
filter(data_frame, subject == "Maths")


Output : 

[1] "Original Data frame"

subject marks
1 Maths 34
2 Hindi 23
3 English 41
4 English 11
5 Hindi 35
6 Maths 67
7 Hindi 87

[1] "Data frame with maths subject"

subject marks
1 Maths 34
2 Maths 67

Mutate Method 

The mutate method in the dplyr package is used to add, modify, or delete columns of a data frame. A new column can be added by specifying the new column name and the formula used to compute the values within this column.

Syntax : mutate(data, new_col_name = formula)

Arguments:

data- the data frame to be manipulated

new_col_name- the name of the new column to be added

formula- the formula to compute the values of the newly added column

In the following code snippet, a new column named new_marks is added to the data frame, in which 10 grace marks are added to the existing marks of each student.

R




#loading the required library
library(dplyr)
#creating a data frame
data_frame = data.frame(subject = c("Maths","Hindi","English","English","Hindi",
                                   "Maths","Hindi"),
                        marks = c(34,23,41,11,35,67,87))
print("Original Data frame")
print(data_frame)
print("Data frame with 10 grace marks added to all marks")
#adding 10 grace marks to all marks
data_frame %>%
  mutate(new_marks = marks+10)


Output : 

[1] "Original Data frame"

subject marks
1 Maths 34
2 Hindi 23
3 English 41
4 English 11
5 Hindi 35
6 Maths 67
7 Hindi 87

[1] "Data frame with 10 grace marks added to all marks"

subject marks new_marks
1 Maths 34 44
2 Hindi 23 33
3 English 41 51
4 English 11 21
5 Hindi 35 45
6 Maths 67 77
7 Hindi 87 97
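mutate can also overwrite an existing column or remove one. The sketch below assumes the same data frame; the passed column and its threshold of 40 are hypothetical and used only for illustration. Expressions inside a single mutate call are evaluated in order, so passed is computed from the updated marks.

```r
library(dplyr)

data_frame <- data.frame(
  subject = c("Maths", "Hindi", "English", "English", "Hindi", "Maths", "Hindi"),
  marks = c(34, 23, 41, 11, 35, 67, 87)
)

# overwrite marks in place, then derive a new column from the updated values
# (the pass threshold of 40 is a made-up value for this example)
updated_df <- data_frame %>%
  mutate(marks = marks + 10,
         passed = marks >= 40)
print(updated_df)

# assigning NULL to a column removes it
trimmed_df <- updated_df %>%
  mutate(passed = NULL)
print(trimmed_df)
```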

Arrange() Method

The arrange() function in the dplyr package is used to reorder rows based on one or more columns.

R




# loading the required library
library(dplyr)
 
# creating a data frame
data_frame <- data.frame(
  subject = c("Maths", "Hindi", "English", "English", "Hindi", "Maths", "Hindi"),
  marks = c(34, 23, 41, 11, 35, 67, 87)
)
 
# print the original data frame
cat("Original Data frame:\n")
print(data_frame)
 
# arrange the data frame based on the 'marks' column in ascending order
arranged_df <- arrange(data_frame, marks)
 
# print the arranged data frame
cat("\nArranged Data frame based on marks (ascending order):\n")
print(arranged_df)


Output:

Original Data frame:

subject marks
1 Maths 34
2 Hindi 23
3 English 41
4 English 11
5 Hindi 35
6 Maths 67
7 Hindi 87

Arranged Data frame based on marks (ascending order):

subject marks
1 English 11
2 Hindi 23
3 Maths 34
4 Hindi 35
5 English 41
6 Maths 67
7 Hindi 87
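arrange sorts in ascending order by default. As a small sketch on the same data, wrapping a column in desc() reverses the order, and listing several columns sorts by the first and breaks ties with the next. The variable names top_df and sorted_df are illustrative.

```r
library(dplyr)

data_frame <- data.frame(
  subject = c("Maths", "Hindi", "English", "English", "Hindi", "Maths", "Hindi"),
  marks = c(34, 23, 41, 11, 35, 67, 87)
)

# highest marks first
top_df <- arrange(data_frame, desc(marks))
print(top_df)

# sort by subject, and within each subject by marks in descending order
sorted_df <- arrange(data_frame, subject, desc(marks))
print(sorted_df)
```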

Selecting columns

The select method in the dplyr package is used to select the specified columns from the data frame. The columns are retrieved in the order in which they are listed in the call, and all the rows are retained for these columns.

Syntax : select(data, list-of-columns-to-be-retrieved)

In the following code snippet, the columns name and marks are extracted from the data frame, in that order, so that the name column appears before the marks column.

R




#loading the required library
library(dplyr)
#creating a data frame
data_frame = data.frame(subject = c("Maths","Hindi","English","English","Hindi",
                                   "Maths","Hindi"),
                        marks = c(34,23,41,11,35,67,87),
                        name = c("A","V","B","D","S","Y","M"))
print("Original Data frame")
print(data_frame)
print("Data frame with name and marks columns")
#selecting name and marks from the data frame
data_frame %>%
  select(name,marks)


Output : 

[1] "Original Data frame"

subject marks name
1 Maths 34 A
2 Hindi 23 V
3 English 41 B
4 English 11 D
5 Hindi 35 S
6 Maths 67 Y
7 Hindi 87 M

[1] "Data frame with name and marks columns"

name marks
1 A 34
2 V 23
3 B 41
4 D 11
5 S 35
6 Y 67
7 M 87
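Columns can also be chosen by exclusion or by a name pattern. The following sketch assumes the same three-column data frame; a minus sign drops a column, and the starts_with() helper picks columns whose names begin with a given prefix. The variable names are illustrative.

```r
library(dplyr)

data_frame <- data.frame(
  subject = c("Maths", "Hindi", "English", "English", "Hindi", "Maths", "Hindi"),
  marks = c(34, 23, 41, 11, 35, 67, 87),
  name = c("A", "V", "B", "D", "S", "Y", "M")
)

# keep every column except subject
no_subject <- select(data_frame, -subject)
print(no_subject)

# keep only the columns whose names start with "ma"
ma_cols <- select(data_frame, starts_with("ma"))
print(ma_cols)
```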

Using group_by and summarise

The group_by method is used to divide the rows of the data frame into groups based on the values of the specified columns. The group_by method may take one or more columns.

Syntax : group_by(data, list-of-columns-to-be-used-for-grouping)

In the following code snippet, the subject column is used to group the data.

The grouped data frame is then passed to the summarise operation, which creates a new column holding one value per group. Here the inbuilt n() function is used to count the entries in each group: summarise(new-col-name = n()).

The n() function returns the number of rows falling in each group.

In the output, for instance, two students study English, so 2 is displayed for the subject English.

R




#loading the required library
library(dplyr)
#creating a data frame
data_frame = data.frame(subject = c("Maths","Hindi","English","English","Hindi",
                                   "Maths","Hindi"),
                        marks = c(34,23,41,11,35,67,87),
                        name = c("A","V","B","D","S","Y","M"))
print("Original Data frame")
print(data_frame)
print("Calculating students in each subject")
#grouping the data frame by subject
data_frame %>%
  group_by(subject) %>%
  summarise(count = n())


Output : 

[1] "Original Data frame"

subject marks name
1 Maths 34 A
2 Hindi 23 V
3 English 41 B
4 English 11 D
5 Hindi 35 S
6 Maths 67 Y
7 Hindi 87 M

[1] "Calculating students in each subject"

subject count
<chr> <int>
1 English 2
2 Hindi 3
3 Maths 2
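Counting rows per group is common enough that dplyr offers count() as a shortcut for group_by() followed by summarise(n = n()). A minimal sketch on the same data, with illustrative variable names:

```r
library(dplyr)

data_frame <- data.frame(
  subject = c("Maths", "Hindi", "English", "English", "Hindi", "Maths", "Hindi"),
  marks = c(34, 23, 41, 11, 35, 67, 87),
  name = c("A", "V", "B", "D", "S", "Y", "M")
)

# one row per subject; the count is stored in a column named n
subject_counts <- count(data_frame, subject)
print(subject_counts)

# sort = TRUE puts the largest groups first
subject_counts_sorted <- count(data_frame, subject, sort = TRUE)
print(subject_counts_sorted)
```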

Using Aggregate Functions

Instead of n(), aggregate functions like sum() or mean() can be used inside summarise to provide statistical information about the data. For instance, in the following snippet the summarise function is called with sum(marks), so the sum of the marks falling in each subject is displayed as the output.

R




#loading the required library
library(dplyr)
#creating a data frame
data_frame = data.frame(subject = c("Maths","Hindi","English","English","Hindi",
                                   "Maths","Hindi"),
                        marks = c(34,23,41,11,35,67,87),
                        name = c("A","V","B","D","S","Y","M"))
print("Original Data frame")
print(data_frame)
print("Calculating sum of marks of students in each subject")
#grouping the data frame by subject
data_frame %>%
  group_by(subject) %>%
  summarise(sum_marks = sum(marks))


Output :

[1] "Original Data frame"

subject marks name
1 Maths 34 A
2 Hindi 23 V
3 English 41 B
4 English 11 D
5 Hindi 35 S
6 Maths 67 Y
7 Hindi 87 M

[1] "Calculating sum of marks of students in each subject"

subject sum_marks
<chr> <dbl>
1 English 52
2 Hindi 145
3 Maths 101
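A single summarise call can compute several statistics at once. The sketch below assumes the same data frame; the output column names total, average, highest, and students are chosen here for illustration.

```r
library(dplyr)

data_frame <- data.frame(
  subject = c("Maths", "Hindi", "English", "English", "Hindi", "Maths", "Hindi"),
  marks = c(34, 23, 41, 11, 35, 67, 87)
)

# several aggregates per subject in one call
subject_stats <- data_frame %>%
  group_by(subject) %>%
  summarise(
    total = sum(marks),      # sum of marks in the group
    average = mean(marks),   # mean of marks in the group
    highest = max(marks),    # best mark in the group
    students = n()           # number of rows in the group
  )
print(subject_stats)
```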

Join Methods

The join functions in the dplyr package, such as inner_join(), left_join(), right_join(), and full_join(), are used to combine two data frames based on common columns.

R




# loading the required library
library(dplyr)
 
# creating a data frame
data_frame1 <- data.frame(
  subject = c("Maths", "Hindi", "English", "English", "Hindi", "Maths", "Hindi"),
  marks = c(34, 23, 41, 11, 35, 67, 87)
)
 
data_frame2 <- data.frame(
  subject = c("Maths", "Hindi", "English", "Science"),
  teacher = c("Mr. Smith", "Mrs. Patel", "Ms. Johnson", "Mr. Brown")
)
 
# print the original data frames
print(data_frame1)
print(data_frame2)
 
# inner join the two data frames based on the 'subject' column
joined_df <- inner_join(data_frame1, data_frame2, by = "subject")
 
# print the joined data frame
print(joined_df)


Output:

  subject marks
1 Maths 34
2 Hindi 23
3 English 41
4 English 11
5 Hindi 35
6 Maths 67
7 Hindi 87

subject teacher
1 Maths Mr. Smith
2 Hindi Mrs. Patel
3 English Ms. Johnson
4 Science Mr. Brown


subject marks teacher
1 Maths 34 Mr. Smith
2 Hindi 23 Mrs. Patel
3 English 41 Ms. Johnson
4 English 11 Ms. Johnson
5 Hindi 35 Mrs. Patel
6 Maths 67 Mr. Smith
7 Hindi 87 Mrs. Patel

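Here inner_join() combines the two data frames based on the ‘subject’ column. The Science row of data_frame2 is absent from the result because an inner join keeps only the subjects present in both data frames. We can adjust the by parameter to specify which columns should be used for the join.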
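Other join types treat unmatched rows differently. As a sketch using the same two data frames, left_join() keeps every row of its first argument and fills the missing values with NA, while anti_join() returns the rows of the first argument that have no match in the second. The variable names are illustrative.

```r
library(dplyr)

data_frame1 <- data.frame(
  subject = c("Maths", "Hindi", "English", "English", "Hindi", "Maths", "Hindi"),
  marks = c(34, 23, 41, 11, 35, 67, 87)
)

data_frame2 <- data.frame(
  subject = c("Maths", "Hindi", "English", "Science"),
  teacher = c("Mr. Smith", "Mrs. Patel", "Ms. Johnson", "Mr. Brown")
)

# keep every teacher; Science has no marks, so its marks value is NA
left_df <- left_join(data_frame2, data_frame1, by = "subject")
print(left_df)

# subjects that have a teacher but no marks recorded
unmatched_df <- anti_join(data_frame2, data_frame1, by = "subject")
print(unmatched_df)
```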

Distinct Function

The distinct function in the dplyr package is used to extract unique rows from a data frame.

R




# loading the required library
library(dplyr)
 
# creating a data frame
data_frame <- data.frame(
  subject = c("Maths", "Hindi", "English", "English", "Hindi", "Maths", "Hindi"),
  marks = c(34, 23, 41, 41, 35, 34, 87)
)
 
# print the original data frame
cat("Original Data frame:\n")
print(data_frame)
 
# extract distinct rows based on all columns
distinct_df <- distinct(data_frame)
 
# print the data frame with distinct rows
cat("\nData frame with distinct rows:\n")
print(distinct_df)


Output:

Original Data frame:

subject marks
1 Maths 34
2 Hindi 23
3 English 41
4 English 41
5 Hindi 35
6 Maths 34
7 Hindi 87

Data frame with distinct rows:

subject marks
1 Maths 34
2 Hindi 23
3 English 41
4 Hindi 35
5 Hindi 87

distinct(data_frame) extracts the unique rows from the original data frame. To consider only specific columns when identifying uniqueness, pass those column names as additional arguments to the distinct function.
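A short sketch of distinct on a single column, assuming the same data frame. By default only the named column is returned; setting .keep_all = TRUE keeps the remaining columns from the first row found for each value. The variable names are illustrative.

```r
library(dplyr)

data_frame <- data.frame(
  subject = c("Maths", "Hindi", "English", "English", "Hindi", "Maths", "Hindi"),
  marks = c(34, 23, 41, 41, 35, 34, 87)
)

# unique values of the subject column only
subjects_only <- distinct(data_frame, subject)
print(subjects_only)

# first row for each subject, with all columns kept
first_per_subject <- distinct(data_frame, subject, .keep_all = TRUE)
print(first_per_subject)
```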

Rename Method

The rename function in the dplyr package is used to change the names of columns in a data frame.

R




# loading the required library
library(dplyr)
 
# creating a data frame
data_frame <- data.frame(
  subject = c("Maths", "Hindi", "English", "English", "Hindi", "Maths", "Hindi"),
  marks = c(34, 23, 41, 11, 35, 67, 87)
)
 
# print the original data frame
cat("Original Data frame:\n")
print(data_frame)
 
# rename the 'subject' column to 'course'
renamed_df <- rename(data_frame, course = subject)
 
# print the data frame with renamed column
cat("\nData frame with renamed column:\n")
print(renamed_df)


Output:

Original Data frame:

subject marks
1 Maths 34
2 Hindi 23
3 English 41
4 English 11
5 Hindi 35
6 Maths 67
7 Hindi 87

Data frame with renamed column:

course marks
1 Maths 34
2 Hindi 23
3 English 41
4 English 11
5 Hindi 35
6 Maths 67
7 Hindi 87

In this example, rename(data_frame, course = subject) changes the name of the ‘subject’ column to ‘course’. Multiple columns can be renamed in the same call by supplying several new_name = old_name pairs.
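As a sketch of renaming more than one column, the call below renames both columns at once; the new name score is made up for this example. rename_with() applies a function to the column names instead, here toupper.

```r
library(dplyr)

data_frame <- data.frame(
  subject = c("Maths", "Hindi", "English", "English", "Hindi", "Maths", "Hindi"),
  marks = c(34, 23, 41, 11, 35, 67, 87)
)

# rename two columns in one call (new_name = old_name)
renamed_both <- rename(data_frame, course = subject, score = marks)
print(names(renamed_both))

# apply a function to every column name
upper_df <- rename_with(data_frame, toupper)
print(names(upper_df))
```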



Last Updated : 19 Dec, 2023