GroupBy and Aggregate Using Pandas

GroupBy and Aggregate Using Pandas

Remove duplicates in one column, and create a list of tags in another column using Pandas.

Consider this. I have a table of data. With me, yes, I know, big surprise. This table of data has a column with values, and some of these values repeat. This is the names column. Another column has categories the names belong to. Some of the names belong to more than one category, but the category currently does not have more than one value.

name alias description data_collection
0 MALE35_FY 2025 Males Age 35-39 2025 Male Population Age 35-39 (Esri) 5yearincrements
1 MOEMEDCRNT 2018 Median Contract Rent MOE (ACS 5-Yr) 2018 Median Contract Rent (Households Paying Cash Rent) MOE (ACS 5-Yr) ACS_Housing_Summary_rep
2 REL35DIRHI 2018 Pop 35-64: Direct-Purch Health Ins REL (ACS 5-Yr) 2018 Population 35-64: Direct-Purchase Health Ins Only REL (ACS 5-Yr) ACS_Population_Summary_rep
3 REL35VAHCR 2018 Pop 35-64: VA Health Care Only REL (ACS 5-Yr) 2018 Population 35-64: VA Health Care Only REL (ACS 5-Yr) ACS_Population_Summary_rep
4 REL35DIRHI 2018 Pop 35-64: Direct-Purch Health Ins REL (ACS 5-Yr) 2018 Population 35-64: Direct-Purchase Health Ins Only REL (ACS 5-Yr) Health
5 REL35VAHCR 2018 Pop 35-64: VA Health Care Only REL (ACS 5-Yr) 2018 Population 35-64: VA Health Care Only REL (ACS 5-Yr) Health
6 ACS65HI2MM 2018 Pop 65+: Medicare & Medicaid (ACS 5-Yr) 2018 Population 65+: Medicare & Medicaid Coverage (ACS 5-Yr) Health
7 X4037_I 2020 Index: Housekeeping Supply-Other HH Product 2020 Housekeeping Supplies - Other HH Products: Index House_and_Home_Expenditures_rep
8 X4061_I 2020 Index: Wall Units/Cabinets/Other LR/FR/RR Furniture 2020 Furniture -Wall Units/Cabinets/Other LR/FR/RR Furniture: Index House_and_Home_Expenditures_rep
9 X4072_I 2020 Index: Sewing Machines/Misc Appliances 2020 Major Appliances - Sewing Machines/Misc Appliances: Index House_and_Home_Expenditures_rep
10 X2001_I 2020 Index: Alcoholic Beverages 2020 Alcoholic Beverages: Index Household_Budget_Expenditures_rep
11 X4037_I 2020 Index: Housekeeping Supply-Other HH Product 2020 Housekeeping Supplies - Other HH Products: Index HousingHousehold
12 X4061_I 2020 Index: Wall Units/Cabinets/Other LR/FR/RR Furniture 2020 Furniture -Wall Units/Cabinets/Other LR/FR/RR Furniture: Index HousingHousehold
13 X4072_I 2020 Index: Sewing Machines/Misc Appliances 2020 Major Appliances - Sewing Machines/Misc Appliances: Index HousingHousehold
14 X2001_I 2020 Index: Alcoholic Beverages 2020 Alcoholic Beverages: Index Retail_Goods_and_Services_Expenditures_rep
15 N38_BUS 2020 Other Service excl Pub Admin Bus (NAICS) 2020 Other Srv excl Public Admin (NAICS81) Businesses businesses
16 X2001_I 2020 Index: Alcoholic Beverages 2020 Alcoholic Beverages: Index food
17 MALE35_FY 2025 Males Age 35-39 2025 Male Population Age 35-39 (Esri) gender
18 REL35DIRHI 2018 Pop 35-64: Direct-Purch Health Ins REL (ACS 5-Yr) 2018 Population 35-64: Direct-Purchase Health Ins Only REL (ACS 5-Yr) healthinsurancecoverage
19 REL35VAHCR 2018 Pop 35-64: VA Health Care Only REL (ACS 5-Yr) 2018 Population 35-64: VA Health Care Only REL (ACS 5-Yr) healthinsurancecoverage
20 ACS65HI2MM 2018 Pop 65+: Medicare & Medicaid (ACS 5-Yr) 2018 Population 65+: Medicare & Medicaid Coverage (ACS 5-Yr) healthinsurancecoverage
21 MOEMEDCRNT 2018 Median Contract Rent MOE (ACS 5-Yr) 2018 Median Contract Rent (Households Paying Cash Rent) MOE (ACS 5-Yr) housingcosts
22 N38_BUS 2020 Other Service excl Pub Admin Bus (NAICS) 2020 Other Srv excl Public Admin (NAICS81) Businesses industrybynaicscode

I want to change this so none of the names repeat, and the categories are in a column with a list, similar to how tags are stored. Thankfully, Pandas has a really handy way to do this - one I forget most of the time and have to look up. Hence, I am documenting it here so you and I both can find it easily. The magic sauce is this little snippet.

groupby_column = 'name'
aggregate_column = 'data_collection'
agg_df = df.groupby(groupby_column).aggregate({aggregate_column: list})

This outputs a very simple dataframe (agg_df) with the names as unique values and the categories now as a list.

name data_collection
MOEMEDCRNT ['ACS_Housing_Summary_rep', 'housingcosts']
REL35DIRHI ['ACS_Population_Summary_rep', 'Health', 'healthinsurancecoverage']
REL35VAHCR ['ACS_Population_Summary_rep', 'Health', 'healthinsurancecoverage']
X4061_I ['House_and_Home_Expenditures_rep', 'HousingHousehold']
X4037_I ['House_and_Home_Expenditures_rep', 'HousingHousehold']
ACS65HI2MM ['Health', 'healthinsurancecoverage']
X2001_I ['Household_Budget_Expenditures_rep', 'Retail_Goods_and_Services_Expenditures_rep', 'food']
MALE35_FY ['5yearincrements', 'gender']
X4072_I ['House_and_Home_Expenditures_rep', 'HousingHousehold']
N38_BUS ['businesses', 'industrybynaicscode']

To get the final result, we need to add the aliases and description columns back on. After all, who knows what X4072_I is, anyway? We accomplish this by first creating a dataframe to add back on without the aggregation column (`drop`), and with the grouping column as the index. Next, after joining this new dataframe on, we pull the grouping column out of the index, remove any duplicate names, and clean up the index so it is sequential.

df_alias = df.drop(columns=aggregate_column).set_index(groupby_column)

agg_df.join(df_alias).reset_index(groupby_column).drop_duplicates(groupby_column).reset_index(drop=True)

The output, initially 21 rows, is now only 10.

name data_collection alias description
0 ACS65HI2MM ['Health', 'healthinsurancecoverage'] 2018 Pop 65+: Medicare & Medicaid (ACS 5-Yr) 2018 Population 65+: Medicare & Medicaid Coverage (ACS 5-Yr)
1 MALE35_FY ['5yearincrements', 'gender'] 2025 Males Age 35-39 2025 Male Population Age 35-39 (Esri)
2 MOEMEDCRNT ['ACS_Housing_Summary_rep', 'housingcosts'] 2018 Median Contract Rent MOE (ACS 5-Yr) 2018 Median Contract Rent (Households Paying Cash Rent) MOE (ACS 5-Yr)
3 N38_BUS ['businesses', 'industrybynaicscode'] 2020 Other Service excl Pub Admin Bus (NAICS) 2020 Other Srv excl Public Admin (NAICS81) Businesses
4 REL35DIRHI ['ACS_Population_Summary_rep', 'Health', 'healthinsurancecoverage'] 2018 Pop 35-64: Direct-Purch Health Ins REL (ACS 5-Yr) 2018 Population 35-64: Direct-Purchase Health Ins Only REL (ACS 5-Yr)
5 REL35VAHCR ['ACS_Population_Summary_rep', 'Health', 'healthinsurancecoverage'] 2018 Pop 35-64: VA Health Care Only REL (ACS 5-Yr) 2018 Population 35-64: VA Health Care Only REL (ACS 5-Yr)
6 X2001_I ['Household_Budget_Expenditures_rep', 'Retail_Goods_and_Services_Expenditures_rep', 'food'] 2020 Index: Alcoholic Beverages 2020 Alcoholic Beverages: Index
7 X4037_I ['House_and_Home_Expenditures_rep', 'HousingHousehold'] 2020 Index: Housekeeping Supply-Other HH Product 2020 Housekeeping Supplies - Other HH Products: Index
8 X4061_I ['House_and_Home_Expenditures_rep', 'HousingHousehold'] 2020 Index: Wall Units/Cabinets/Other LR/FR/RR Furniture 2020 Furniture -Wall Units/Cabinets/Other LR/FR/RR Furniture: Index
9 X4072_I ['House_and_Home_Expenditures_rep', 'HousingHousehold'] 2020 Index: Sewing Machines/Misc Appliances 2020 Major Appliances - Sewing Machines/Misc Appliances: Index

If we want to consolidate this into a succicent function, one easily copied and pasted, here is what it looks like.

import pandas as pd

def aggregate_column(df: pd.DataFrame, groupby_column: str = 'name', 
                     aggregate_column: str = 'data_collection') -> pd.DataFrame:
    
    # make sure the columns are in the dataframe
    assert groupby_column in df.columns, f'"groupby_column", {groupby_column}, '
                                         'does not appear to be in the input '
                                         'DataFrame columns.'
    assert aggregate_column in df.columns, f'"aggregate_column", {aggregate_column}, '
                                           'does not appear to be in the input '
                                           'DataFrame columns.'
    
    # create aggregated dataframe
    agg_df = df.groupby(groupby_column).aggregate({aggregate_column: list})
    
    # create a dataframe of values without the aggregation column ready to join
    df_alias = 	df.drop(columns=aggregate_column).\
        set_index(groupby_column)
    
    # join the aliases on and clean up the result
    out_df = agg_df.join(df_alias).\
        reset_index(groupby_column).\
        drop_duplicates(groupby_column).\
        reset_index(drop=True)
    
    return out_df

Yes, this post likely is somewhat harder to read than most, but hopefully all of the details make it easier to get the point of you really are interesed in how to accomplish this, and with the entire thing cosolodated into a succicent function at the end, easy to copy and paste if you do not want to worry about the details.

Have fun and happy Monday!