How to get summary statistics by group in R

Last Updated : 23 Aug, 2021

In this article, we will learn how to compute summary statistics by group in the R programming language.

Sample dataframe in use:

   grpBy num
1      A  20
2      A  30
3      A  40
4      B  50
5      B  50
6      C  70
7      C  80
8      C  25
9      C  35
10     D  45
11     E  55
12     E  65
13     E  75
14     E  85
15     E  95
16     E 105

Method 1: Using tapply()

The tapply() function in R applies a function to subsets of a vector, where the subsets are defined by one or more factors. Here it takes three arguments. The first is the data column. The second is the column by which the data is grouped (in this example, the letters in grpBy). The third is the function applied to each group. We pass summary() because we want the summary statistics for each group.

Syntax: tapply(df$data, df$groupBy, summary)

Parameters:

  • df$data: the column on which the summary function is applied
  • df$groupBy: the column by which the data is grouped
  • summary: the function applied to each group

Example: R program to get summary statistics by group

# Build the sample data frame
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Apply summary() to num within each level of grpBy
tapply(df$num, df$grpBy, summary)


Output:

$A
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     20      25      30      30      35      40 
$B
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     50      50      50      50      50      50 
$C
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   25.0    32.5    52.5    52.5    72.5    80.0 
$D
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     45      45      45      45      45      45 
$E
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   55.0    67.5    80.0    80.0    92.5   105.0 
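The list returned above can be awkward to work with further. As a minimal sketch (the names group_means and summary_matrix are illustrative, not part of the original example), tapply() returns a named vector when the function yields a single value per group, and the list of summaries can be row-bound into a matrix with one row per group:

```r
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
df <- data.frame(grpBy = factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6))), num = num)

# A function that returns one value per group gives a named vector
group_means <- tapply(df$num, df$grpBy, mean)

# summary() returns six values per group, so tapply() returns a list;
# row-binding its elements gives a 5 x 6 numeric matrix
summary_matrix <- do.call(rbind, tapply(df$num, df$grpBy, summary))

print(group_means)
print(summary_matrix)
```

The matrix form is convenient when the statistics need to be indexed by column name, for example summary_matrix[, "Median"].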

Method 2: Using data.table

In this approach, we first load the data.table package with library(). We then convert the data.frame to a data.table with setDT(). A data.table is an enhanced version of a data.frame, popular for its fast execution and concise syntax. Finally, we compute the summary statistics for each group: summary(num) is evaluated within each group named in by, and as.list() spreads the six statistics across separate columns.

Syntax:

setDT(df)

df[, as.list(summary(num)), by = grpBy]

Parameters:

  • df: dataframe object
  • num: data column
  • grpBy: column according to which grouping is to be done
  • summary(): function applied on each group

Example: R program to get summary statistics by group

library(data.table)

# Build the sample data frame
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Convert df to a data.table in place, then summarize num by group
setDT(df)
df[, as.list(summary(num)), by = grpBy]


Output:

   grpBy Min. 1st Qu. Median Mean 3rd Qu. Max.
1:     A   20    25.0   30.0 30.0    35.0   40
2:     B   50    50.0   50.0 50.0    50.0   50
3:     C   25    32.5   52.5 52.5    72.5   80
4:     D   45    45.0   45.0 45.0    45.0   45
5:     E   55    67.5   80.0 80.0    92.5  105
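The same grouping syntax accepts any expressions, not only summary(). The sketch below assumes the data.table package is installed. The column names n, avg and std_dev are chosen for illustration. It uses as.data.table(), which returns a copy, instead of setDT(), which converts the original object in place:

```r
library(data.table)

num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
df <- data.frame(grpBy = factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6))), num = num)

# as.data.table() copies, so df itself remains a plain data.frame
dt <- as.data.table(df)

# .N is the number of rows in the current group; sd() is NA for a group of size 1
stats <- dt[, .(n = .N, avg = mean(num), std_dev = sd(num)), by = grpBy]
print(stats)
```

Group D contains a single observation, so its standard deviation is reported as NA.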

Method 3: Using split() function and purrr package

The split() function in R divides data into groups defined by a factor. We load the purrr package, a functional programming toolkit, with library(). Its map() function applies a function to each element of a list and returns the results as a list, which replaces an explicit for loop and makes the code easier to read. Because split() is applied here to the whole data frame, summary() runs on every column of each piece, so the output also reports the counts of the grpBy column.

Syntax: df %>% split(.$grpBy) %>% map(summary)

Parameters:

  • df: dataframe object
  • grpBy: dataframe column according to which it should be grouped

Example: R program to get summary statistics by group

library(purrr)

# Build the sample data frame
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Split df into one data frame per group, then summarize each piece
df %>% split(.$grpBy) %>% map(summary)


Output:

$A
 grpBy      num    
 A:3   Min.   :20  
 B:0   1st Qu.:25  
 C:0   Median :30  
 D:0   Mean   :30  
 E:0   3rd Qu.:35  
       Max.   :40  
$B
 grpBy      num    
 A:0   Min.   :50  
 B:2   1st Qu.:50  
 C:0   Median :50  
 D:0   Mean   :50  
 E:0   3rd Qu.:50  
       Max.   :50  
$C
 grpBy      num      
 A:0   Min.   :25.0  
 B:0   1st Qu.:32.5  
 C:4   Median :52.5  
 D:0   Mean   :52.5  
 E:0   3rd Qu.:72.5  
       Max.   :80.0  
$D
 grpBy      num    
 A:0   Min.   :45  
 B:0   1st Qu.:45  
 C:0   Median :45  
 D:1   Mean   :45  
 E:0   3rd Qu.:45  
       Max.   :45  
$E
 grpBy      num       
 A:0   Min.   : 55.0  
 B:0   1st Qu.: 67.5  
 C:0   Median : 80.0  
 D:0   Mean   : 80.0  
 E:6   3rd Qu.: 92.5  
       Max.   :105.0  
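If only the numeric column is of interest, one option is to split that column alone rather than the whole data frame. This is a sketch under that assumption; by_group, stats and medians are illustrative names. map_dbl() is the purrr variant of map() that returns a numeric vector instead of a list:

```r
library(purrr)

num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
df <- data.frame(grpBy = factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6))), num = num)

# Split only the num column, giving a named list of numeric vectors
by_group <- split(df$num, df$grpBy)

# One summary per group, without the grpBy counts
stats <- map(by_group, summary)

# One number per group, returned as a named numeric vector
medians <- map_dbl(by_group, median)

print(stats)
print(medians)
```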

Method 4: Using dplyr

The group_by() function groups the data frame by the given variable. The summarize() function then computes the minimum, first quartile, median, mean, third quartile and maximum for each group. These are the same values that summary() produces. The only difference is that here we call each function explicitly inside summarize(). Each expression reduces the num values of a group to a single value.

Syntax:

df %>%
  group_by(grpBy) %>%
  summarize(min = min(num), q1 = quantile(num, 0.25), median = median(num), mean = mean(num), q3 = quantile(num, 0.75), max = max(num))

Parameters:

  • df: dataframe object
  • grpBy: column according to which grouping is to be done

Example: R program to get summary statistics by group

library(dplyr)

# Build the sample data frame
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Group by grpBy, then compute each statistic explicitly
df %>%
  group_by(grpBy) %>%
  summarize(min = min(num),
            q1 = quantile(num, 0.25),
            median = median(num),
            mean = mean(num),
            q3 = quantile(num, 0.75),
            max = max(num))


Output:

  grpBy   min    q1 median  mean    q3   max
  <fct> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 A        20  25     30    30    35      40
2 B        50  50     50    50    50      50
3 C        25  32.5   52.5  52.5  72.5    80
4 D        45  45     45    45    45      45
5 E        55  67.5   80    80    92.5   105
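When the same set of functions has to be applied to a column, writing each one out can be shortened with across(). The sketch below assumes dplyr 1.0.0 or later, where across() and the .groups argument are available. The result name stats is illustrative, and the output columns follow the default column_function naming pattern:

```r
library(dplyr)

num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
df <- data.frame(grpBy = factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6))), num = num)

# Apply a named list of functions to num within each group;
# .groups = "drop" returns an ungrouped tibble
stats <- df %>%
  group_by(grpBy) %>%
  summarize(across(num, list(min = min, median = median, max = max)),
            .groups = "drop")

print(stats)
```

The resulting columns are named num_min, num_median and num_max, so adding more columns to across() scales without repeating each function call.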
