Python, data munging

GroupBy and Aggregate Using Pandas

Joel McCune

Apr 12, 2021 • 5 min read

Remove duplicates in one column, and create a list of tags in another column using Pandas.

Consider this. I have a table of data. With me, yes, I know, big surprise. This table of data has a column with values, and some of these values repeat. This is the names column. Another column has categories the names belong to. Some of the names belong to more than one category, but the category currently does not have more than one value.

	name	alias	description	data_collection
0	MALE35_FY	2025 Males Age 35-39	2025 Male Population Age 35-39 (Esri)	5yearincrements
1	MOEMEDCRNT	2018 Median Contract Rent MOE (ACS 5-Yr)	2018 Median Contract Rent (Households Paying Cash Rent) MOE (ACS 5-Yr)	ACS_Housing_Summary_rep
2	REL35DIRHI	2018 Pop 35-64: Direct-Purch Health Ins REL (ACS 5-Yr)	2018 Population 35-64: Direct-Purchase Health Ins Only REL (ACS 5-Yr)	ACS_Population_Summary_rep
3	REL35VAHCR	2018 Pop 35-64: VA Health Care Only REL (ACS 5-Yr)	2018 Population 35-64: VA Health Care Only REL (ACS 5-Yr)	ACS_Population_Summary_rep
4	REL35DIRHI	2018 Pop 35-64: Direct-Purch Health Ins REL (ACS 5-Yr)	2018 Population 35-64: Direct-Purchase Health Ins Only REL (ACS 5-Yr)	Health
5	REL35VAHCR	2018 Pop 35-64: VA Health Care Only REL (ACS 5-Yr)	2018 Population 35-64: VA Health Care Only REL (ACS 5-Yr)	Health
6	ACS65HI2MM	2018 Pop 65+: Medicare & Medicaid (ACS 5-Yr)	2018 Population 65+: Medicare & Medicaid Coverage (ACS 5-Yr)	Health
7	X4037_I	2020 Index: Housekeeping Supply-Other HH Product	2020 Housekeeping Supplies - Other HH Products: Index	House_and_Home_Expenditures_rep
8	X4061_I	2020 Index: Wall Units/Cabinets/Other LR/FR/RR Furniture	2020 Furniture -Wall Units/Cabinets/Other LR/FR/RR Furniture: Index	House_and_Home_Expenditures_rep
9	X4072_I	2020 Index: Sewing Machines/Misc Appliances	2020 Major Appliances - Sewing Machines/Misc Appliances: Index	House_and_Home_Expenditures_rep
10	X2001_I	2020 Index: Alcoholic Beverages	2020 Alcoholic Beverages: Index	Household_Budget_Expenditures_rep
11	X4037_I	2020 Index: Housekeeping Supply-Other HH Product	2020 Housekeeping Supplies - Other HH Products: Index	HousingHousehold
12	X4061_I	2020 Index: Wall Units/Cabinets/Other LR/FR/RR Furniture	2020 Furniture -Wall Units/Cabinets/Other LR/FR/RR Furniture: Index	HousingHousehold
13	X4072_I	2020 Index: Sewing Machines/Misc Appliances	2020 Major Appliances - Sewing Machines/Misc Appliances: Index	HousingHousehold
14	X2001_I	2020 Index: Alcoholic Beverages	2020 Alcoholic Beverages: Index	Retail_Goods_and_Services_Expenditures_rep
15	N38_BUS	2020 Other Service excl Pub Admin Bus (NAICS)	2020 Other Srv excl Public Admin (NAICS81) Businesses	businesses
16	X2001_I	2020 Index: Alcoholic Beverages	2020 Alcoholic Beverages: Index	food
17	MALE35_FY	2025 Males Age 35-39	2025 Male Population Age 35-39 (Esri)	gender
18	REL35DIRHI	2018 Pop 35-64: Direct-Purch Health Ins REL (ACS 5-Yr)	2018 Population 35-64: Direct-Purchase Health Ins Only REL (ACS 5-Yr)	healthinsurancecoverage
19	REL35VAHCR	2018 Pop 35-64: VA Health Care Only REL (ACS 5-Yr)	2018 Population 35-64: VA Health Care Only REL (ACS 5-Yr)	healthinsurancecoverage
20	ACS65HI2MM	2018 Pop 65+: Medicare & Medicaid (ACS 5-Yr)	2018 Population 65+: Medicare & Medicaid Coverage (ACS 5-Yr)	healthinsurancecoverage
21	MOEMEDCRNT	2018 Median Contract Rent MOE (ACS 5-Yr)	2018 Median Contract Rent (Households Paying Cash Rent) MOE (ACS 5-Yr)	housingcosts
22	N38_BUS	2020 Other Service excl Pub Admin Bus (NAICS)	2020 Other Srv excl Public Admin (NAICS81) Businesses	industrybynaicscode

I want to change this so none of the names repeat, and the categories are in a column with a list, similar to how tags are stored. Thankfully, Pandas has a really handy way to do this - one I forget most of the time and have to look up. Hence, I am documenting it here so you and I both can find it easily. The magic sauce is this little snippet.

groupby_column = 'name'
aggregate_column = 'data_collection'
agg_df = df.groupby(groupby_column).aggregate({aggregate_column: list})

This outputs a very simple dataframe (agg_df) with the names as unique values and the categories now as a list.

name	data_collection
MOEMEDCRNT	['ACS_Housing_Summary_rep', 'housingcosts']
REL35DIRHI	['ACS_Population_Summary_rep', 'Health', 'healthinsurancecoverage']
REL35VAHCR	['ACS_Population_Summary_rep', 'Health', 'healthinsurancecoverage']
X4061_I	['House_and_Home_Expenditures_rep', 'HousingHousehold']
X4037_I	['House_and_Home_Expenditures_rep', 'HousingHousehold']
ACS65HI2MM	['Health', 'healthinsurancecoverage']
X2001_I	['Household_Budget_Expenditures_rep', 'Retail_Goods_and_Services_Expenditures_rep', 'food']
MALE35_FY	['5yearincrements', 'gender']
X4072_I	['House_and_Home_Expenditures_rep', 'HousingHousehold']
N38_BUS	['businesses', 'industrybynaicscode']

To get the final result, we need to add the aliases and description columns back on. After all, who knows what X4072_I is, anyway? We accomplish this by first creating a dataframe to add back on without the aggregation column (`drop`), and with the grouping column as the index. Next, after joining this new dataframe on, we pull the grouping column out of the index, remove any duplicate names, and clean up the index so it is sequential.

df_alias = df.drop(columns=aggregate_column).set_index(groupby_column)

agg_df.join(df_alias).reset_index(groupby_column).drop_duplicates(groupby_column).reset_index(drop=True)

The output, initially 21 rows, is now only 10.

	name	data_collection	alias	description
0	ACS65HI2MM	['Health', 'healthinsurancecoverage']	2018 Pop 65+: Medicare & Medicaid (ACS 5-Yr)	2018 Population 65+: Medicare & Medicaid Coverage (ACS 5-Yr)
1	MALE35_FY	['5yearincrements', 'gender']	2025 Males Age 35-39	2025 Male Population Age 35-39 (Esri)
2	MOEMEDCRNT	['ACS_Housing_Summary_rep', 'housingcosts']	2018 Median Contract Rent MOE (ACS 5-Yr)	2018 Median Contract Rent (Households Paying Cash Rent) MOE (ACS 5-Yr)
3	N38_BUS	['businesses', 'industrybynaicscode']	2020 Other Service excl Pub Admin Bus (NAICS)	2020 Other Srv excl Public Admin (NAICS81) Businesses
4	REL35DIRHI	['ACS_Population_Summary_rep', 'Health', 'healthinsurancecoverage']	2018 Pop 35-64: Direct-Purch Health Ins REL (ACS 5-Yr)	2018 Population 35-64: Direct-Purchase Health Ins Only REL (ACS 5-Yr)
5	REL35VAHCR	['ACS_Population_Summary_rep', 'Health', 'healthinsurancecoverage']	2018 Pop 35-64: VA Health Care Only REL (ACS 5-Yr)	2018 Population 35-64: VA Health Care Only REL (ACS 5-Yr)
6	X2001_I	['Household_Budget_Expenditures_rep', 'Retail_Goods_and_Services_Expenditures_rep', 'food']	2020 Index: Alcoholic Beverages	2020 Alcoholic Beverages: Index
7	X4037_I	['House_and_Home_Expenditures_rep', 'HousingHousehold']	2020 Index: Housekeeping Supply-Other HH Product	2020 Housekeeping Supplies - Other HH Products: Index
8	X4061_I	['House_and_Home_Expenditures_rep', 'HousingHousehold']	2020 Index: Wall Units/Cabinets/Other LR/FR/RR Furniture	2020 Furniture -Wall Units/Cabinets/Other LR/FR/RR Furniture: Index
9	X4072_I	['House_and_Home_Expenditures_rep', 'HousingHousehold']	2020 Index: Sewing Machines/Misc Appliances	2020 Major Appliances - Sewing Machines/Misc Appliances: Index

If we want to consolidate this into a succicent function, one easily copied and pasted, here is what it looks like.

import pandas as pd

def aggregate_column(df: pd.DataFrame, groupby_column: str = 'name', 
                     aggregate_column: str = 'data_collection') -> pd.DataFrame:
    
    # make sure the columns are in the dataframe
    assert groupby_column in df.columns, f'"groupby_column", {groupby_column}, '
                                         'does not appear to be in the input '
                                         'DataFrame columns.'
    assert aggregate_column in df.columns, f'"aggregate_column", {aggregate_column}, '
                                           'does not appear to be in the input '
                                           'DataFrame columns.'
    
    # create aggregated dataframe
    agg_df = df.groupby(groupby_column).aggregate({aggregate_column: list})
    
    # create a dataframe of values without the aggregation column ready to join
    df_alias = 	df.drop(columns=aggregate_column).\
        set_index(groupby_column)
    
    # join the aliases on and clean up the result
    out_df = agg_df.join(df_alias).\
        reset_index(groupby_column).\
        drop_duplicates(groupby_column).\
        reset_index(drop=True)
    
    return out_df

Yes, this post likely is somewhat harder to read than most, but hopefully all of the details make it easier to get the point of you really are interesed in how to accomplish this, and with the entire thing cosolodated into a succicent function at the end, easy to copy and paste if you do not want to worry about the details.

Have fun and happy Monday!