Pandas is one of the most popular libraries in Python. It provides data structures, a large collection of built-in methods, and operations for data analysis, and it is designed to make working with relational or labeled data easy and intuitive. Many of its built-in methods let you quickly perform operations on large datasets. In this article, we will study how you can efficiently count the number of rows in each group of a pandas groupby using built-in pandas functions, along with examples and output. So, let's get started!
What is Groupby in Pandas?
When dealing with data science projects, you’ll often work with large amounts of data and repeat the same operations on subsets of it over and over. This is where groupby comes into the picture. You can define groupby as the ability to split data into groups based on the values of one or more columns and then compute a result for each group. The groupby concept mainly refers to:
- Splitting the dataset into groups based on some criterion
- Applying the given function to each group independently
- Combining the results from all the groups into a single data structure
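The three steps above can be sketched by hand to show what groupby() does for you. This is only an illustration under assumptions: the small dataframe, the "Marks" column, and the names manual_counts and grouped_counts are made up for this sketch, and pandas does not implement groupby this way internally.

```python
import pandas as pd

# Hypothetical sample data, used only for this sketch
df = pd.DataFrame({
    "Subjects": ["Maths", "Economics", "Maths", "Science"],
    "Marks": [80, 65, 90, 70],
})

# Split: collect the rows belonging to each distinct subject
groups = {
    subject: df[df["Subjects"] == subject]
    for subject in df["Subjects"].unique()
}

# Apply: run a function (here len) on every group independently
manual_counts = {subject: len(rows) for subject, rows in groups.items()}

# Combine: gather the per-group results into one Series, sorted by subject
manual_counts = pd.Series(manual_counts).sort_index()

# The same three steps expressed as a single groupby() call
grouped_counts = df.groupby("Subjects").size()

print(manual_counts)
print(grouped_counts)
```

Both prints should list the same per-subject row counts, which is why the rest of the article relies on groupby() rather than manual filtering.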
Since pandas groupby divides a dataset into individual groups, what if you wish to count the number of rows present in each of these groups? Counting them manually is impractical for anything but a tiny dataset, so let us study some efficient methods that can help you with this task.
How to Count Rows in Each Group of Pandas Groupby?
Below are two methods by which you can count the number of rows in each group of a pandas groupby:
1) Using pandas groupby size() method
The simplest way to count rows per group is the built-in pandas groupby method size(). It returns a pandas Series containing the number of rows in each group. The size() method works much like calling len() on each group, and hence it counts every row in a group even if some of its values are NaN.
For better understanding, let us go through an example below:
Consider a dataframe consisting of students' names along with the subjects they study.
import pandas as pd

data = {
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics", "Computers"]
}

# load data into a DataFrame object:
df = pd.DataFrame(data)

print(df)
Output:
   Students    Subjects
0       Ray       Maths
1      John   Economics
2      Mole     Science
3     Smith       Maths
4       Jay  Statistics
5     Milli  Statistics
6       Tom  Statistics
7      Rick   Computers
Now, let us group the above dataframe by the column “Subjects” and find the number of rows in each group using the groupby size() method.
For example:
import pandas as pd

data = {
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics", "Computers"]
}

# load data into a DataFrame object:
df = pd.DataFrame(data)

print(df.groupby('Subjects').size())
Output:
Subjects
Computers     1
Economics     1
Maths         2
Science       1
Statistics    3
dtype: int64
As a result, the output for the above example displays the number of rows in each group of the dataframe, one group per subject.
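Because size() returns a Series, it is sometimes handier to turn the result into a regular two-column dataframe. One possible way, sketched here with the same data, is to chain reset_index(); the column name "row_count" and the variable name counts are arbitrary choices for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths",
                 "Statistics", "Statistics", "Statistics", "Computers"],
})

# size() gives a Series indexed by the group key;
# reset_index(name=...) moves the key into a column and names the counts
counts = df.groupby("Subjects").size().reset_index(name="row_count")

print(counts)
```

The resulting dataframe has one row per subject, which makes it easy to merge back into other data or to sort by the count column.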
2) Using pandas groupby count() method
Instead of the size() method, you can also use the pandas groupby count() method, which counts the non-NaN values of each column in each group. Note that these counts are equal to the group sizes as long as there are no NaN values in the dataframe. Check out the below example for a better understanding of the pandas groupby count() method:
For example:
import pandas as pd

data = {
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics", "Computers"]
}

# load data into a DataFrame object:
df = pd.DataFrame(data)

print(df.groupby('Subjects').count())
Output:
            Students
Subjects
Computers          1
Economics          1
Maths              2
Science            1
Statistics         3
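count() reports one column of counts for every non-key column in the dataframe. If only one column matters, selecting it before calling count() gives a Series shaped like the size() result. A minimal sketch using the same data (the variable name student_counts is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths",
                 "Statistics", "Statistics", "Statistics", "Computers"],
})

# Select a single column from the grouped object, then count its non-NaN values
student_counts = df.groupby("Subjects")["Students"].count()

print(student_counts)
```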
Apart from this, you can also use the value_counts() method if you are grouping the dataframe by a single column. Note that it sorts the result by count, in descending order, rather than by group name.
For example:
import pandas as pd

data = {
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics", "Computers"]
}

# load data into a DataFrame object:
df = pd.DataFrame(data)

print(df['Subjects'].value_counts())
Output:
Statistics    3
Maths         2
Economics     1
Science       1
Computers     1
Name: Subjects, dtype: int64
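To check that value_counts() and groupby size() agree on this data, the two results can be compared after putting them in the same order. This is a small sketch; the variable names by_value_counts, by_groupby, and same_counts are assumptions made for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths",
                 "Statistics", "Statistics", "Statistics", "Computers"],
})

# value_counts() orders by count, so sort by label to match groupby's ordering
by_value_counts = df["Subjects"].value_counts().sort_index()

# groupby() sorts the group keys by default
by_groupby = df.groupby("Subjects").size()

# Compare the counts label by label
same_counts = by_value_counts.to_dict() == by_groupby.to_dict()
print(same_counts)
```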
Difference between size() and count() Methods
Looking at the above examples, you might conclude that size() and count() can be used interchangeably while working with pandas groupby. However, the two methods are distinct. The count() method returns the number of non-NaN values in each column of each group, which may or may not be equal to the number of rows, because any NaN values are ignored. The size() method, on the other hand, returns the actual number of rows in each group of the dataframe, irrespective of NaN values. Let’s understand this using an example:
For example:
import numpy as np
import pandas as pd

# create a dataframe with one missing subject
data = {
    "Students": ["Ray", "John", "Mole", "John", "John", "John", "Ray", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", np.nan, "Statistics", "Statistics", "Computers"]
}
df = pd.DataFrame(data)

# display the number of rows per student
print(df.groupby('Students').size())
Output:
Students
John    4
Mole    1
Ray     2
Rick    1
dtype: int64
Now use the count() method on the same dataframe, again grouped by the “Students” column:
For example:
import numpy as np
import pandas as pd

# create a dataframe with one missing subject
data = {
    "Students": ["Ray", "John", "Mole", "John", "John", "John", "Ray", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", np.nan, "Statistics", "Statistics", "Computers"]
}
df = pd.DataFrame(data)

# display the number of non-NaN subjects per student
print(df.groupby('Students').count())
Output:
          Subjects
Students
John             3
Mole             1
Ray              2
Rick             1
Looking at the above example, John has four rows but only three non-NaN subjects, so size() reports 4 and count() reports 3. If you wish to count the total number of rows in each group, use the size() method on groupby, and if you wish to count only the non-null values, use the pandas groupby count() method.
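The two results can also be placed side by side in one table. The sketch below assumes the same dataframe as above and passes both method names to agg(); the variable name comparison is just an illustrative choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Students": ["Ray", "John", "Mole", "John", "John", "John", "Ray", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", np.nan,
                 "Statistics", "Statistics", "Computers"],
})

# "size" counts every row in the group; "count" skips the NaN subject
comparison = df.groupby("Students")["Subjects"].agg(["size", "count"])

print(comparison)
```

Only the row for John should show different values in the two columns, since his group is the only one containing a NaN subject.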
Conclusion
Pandas is an open-source Python library that provides high-performance data manipulation and data analysis tools. However, to use it efficiently, you must be familiar with its large collection of built-in methods, which enable you to perform operations on large datasets. In this article, we studied how you can count the number of rows in each group of a pandas groupby using built-in methods, making your programming easier and more efficient while working with massive data. If you want to practice more about pandas, try these exercises for beginners.