GroupBy and Aggregate Using Pandas
Remove duplicates in one column, and create a list of tags in another column using Pandas.
Consider this. I have a table of data. With me, yes, I know, big surprise. This table of data has a column with values, and some of these values repeat. This is the names column. Another column has categories the names belong to. Some of the names belong to more than one category, but the category currently does not have more than one value.
name | alias | description | data_collection | |
---|---|---|---|---|
0 | MALE35_FY | 2025 Males Age 35-39 | 2025 Male Population Age 35-39 (Esri) | 5yearincrements |
1 | MOEMEDCRNT | 2018 Median Contract Rent MOE (ACS 5-Yr) | 2018 Median Contract Rent (Households Paying Cash Rent) MOE (ACS 5-Yr) | ACS_Housing_Summary_rep |
2 | REL35DIRHI | 2018 Pop 35-64: Direct-Purch Health Ins REL (ACS 5-Yr) | 2018 Population 35-64: Direct-Purchase Health Ins Only REL (ACS 5-Yr) | ACS_Population_Summary_rep |
3 | REL35VAHCR | 2018 Pop 35-64: VA Health Care Only REL (ACS 5-Yr) | 2018 Population 35-64: VA Health Care Only REL (ACS 5-Yr) | ACS_Population_Summary_rep |
4 | REL35DIRHI | 2018 Pop 35-64: Direct-Purch Health Ins REL (ACS 5-Yr) | 2018 Population 35-64: Direct-Purchase Health Ins Only REL (ACS 5-Yr) | Health |
5 | REL35VAHCR | 2018 Pop 35-64: VA Health Care Only REL (ACS 5-Yr) | 2018 Population 35-64: VA Health Care Only REL (ACS 5-Yr) | Health |
6 | ACS65HI2MM | 2018 Pop 65+: Medicare & Medicaid (ACS 5-Yr) | 2018 Population 65+: Medicare & Medicaid Coverage (ACS 5-Yr) | Health |
7 | X4037_I | 2020 Index: Housekeeping Supply-Other HH Product | 2020 Housekeeping Supplies - Other HH Products: Index | House_and_Home_Expenditures_rep |
8 | X4061_I | 2020 Index: Wall Units/Cabinets/Other LR/FR/RR Furniture | 2020 Furniture -Wall Units/Cabinets/Other LR/FR/RR Furniture: Index | House_and_Home_Expenditures_rep |
9 | X4072_I | 2020 Index: Sewing Machines/Misc Appliances | 2020 Major Appliances - Sewing Machines/Misc Appliances: Index | House_and_Home_Expenditures_rep |
10 | X2001_I | 2020 Index: Alcoholic Beverages | 2020 Alcoholic Beverages: Index | Household_Budget_Expenditures_rep |
11 | X4037_I | 2020 Index: Housekeeping Supply-Other HH Product | 2020 Housekeeping Supplies - Other HH Products: Index | HousingHousehold |
12 | X4061_I | 2020 Index: Wall Units/Cabinets/Other LR/FR/RR Furniture | 2020 Furniture -Wall Units/Cabinets/Other LR/FR/RR Furniture: Index | HousingHousehold |
13 | X4072_I | 2020 Index: Sewing Machines/Misc Appliances | 2020 Major Appliances - Sewing Machines/Misc Appliances: Index | HousingHousehold |
14 | X2001_I | 2020 Index: Alcoholic Beverages | 2020 Alcoholic Beverages: Index | Retail_Goods_and_Services_Expenditures_rep |
15 | N38_BUS | 2020 Other Service excl Pub Admin Bus (NAICS) | 2020 Other Srv excl Public Admin (NAICS81) Businesses | businesses |
16 | X2001_I | 2020 Index: Alcoholic Beverages | 2020 Alcoholic Beverages: Index | food |
17 | MALE35_FY | 2025 Males Age 35-39 | 2025 Male Population Age 35-39 (Esri) | gender |
18 | REL35DIRHI | 2018 Pop 35-64: Direct-Purch Health Ins REL (ACS 5-Yr) | 2018 Population 35-64: Direct-Purchase Health Ins Only REL (ACS 5-Yr) | healthinsurancecoverage |
19 | REL35VAHCR | 2018 Pop 35-64: VA Health Care Only REL (ACS 5-Yr) | 2018 Population 35-64: VA Health Care Only REL (ACS 5-Yr) | healthinsurancecoverage |
20 | ACS65HI2MM | 2018 Pop 65+: Medicare & Medicaid (ACS 5-Yr) | 2018 Population 65+: Medicare & Medicaid Coverage (ACS 5-Yr) | healthinsurancecoverage |
21 | MOEMEDCRNT | 2018 Median Contract Rent MOE (ACS 5-Yr) | 2018 Median Contract Rent (Households Paying Cash Rent) MOE (ACS 5-Yr) | housingcosts |
22 | N38_BUS | 2020 Other Service excl Pub Admin Bus (NAICS) | 2020 Other Srv excl Public Admin (NAICS81) Businesses | industrybynaicscode |
I want to change this so none of the names repeat, and the categories are in a column with a list, similar to how tags are stored. Thankfully, Pandas has a really handy way to do this - one I forget most of the time and have to look up. Hence, I am documenting it here so you and I both can find it easily. The magic sauce is this little snippet.
groupby_column = 'name'
aggregate_column = 'data_collection'
agg_df = df.groupby(groupby_column).aggregate({aggregate_column: list})
This outputs a very simple dataframe (agg_df
) with the names as unique values and the categories now as a list.
name | data_collection |
---|---|
MOEMEDCRNT | ['ACS_Housing_Summary_rep', 'housingcosts'] |
REL35DIRHI | ['ACS_Population_Summary_rep', 'Health', 'healthinsurancecoverage'] |
REL35VAHCR | ['ACS_Population_Summary_rep', 'Health', 'healthinsurancecoverage'] |
X4061_I | ['House_and_Home_Expenditures_rep', 'HousingHousehold'] |
X4037_I | ['House_and_Home_Expenditures_rep', 'HousingHousehold'] |
ACS65HI2MM | ['Health', 'healthinsurancecoverage'] |
X2001_I | ['Household_Budget_Expenditures_rep', 'Retail_Goods_and_Services_Expenditures_rep', 'food'] |
MALE35_FY | ['5yearincrements', 'gender'] |
X4072_I | ['House_and_Home_Expenditures_rep', 'HousingHousehold'] |
N38_BUS | ['businesses', 'industrybynaicscode'] |
To get the final result, we need to add the aliases and description columns back on. After all, who knows what X4072_I
is, anyway? We accomplish this by first creating a dataframe to add back on without the aggregation column (`drop`), and with the grouping column as the index. Next, after joining this new dataframe on, we pull the grouping column out of the index, remove any duplicate names, and clean up the index so it is sequential.
df_alias = df.drop(columns=aggregate_column).set_index(groupby_column)
agg_df.join(df_alias).reset_index(groupby_column).drop_duplicates(groupby_column).reset_index(drop=True)
The output, initially 21 rows, is now only 10.
name | data_collection | alias | description | |
---|---|---|---|---|
0 | ACS65HI2MM | ['Health', 'healthinsurancecoverage'] | 2018 Pop 65+: Medicare & Medicaid (ACS 5-Yr) | 2018 Population 65+: Medicare & Medicaid Coverage (ACS 5-Yr) |
1 | MALE35_FY | ['5yearincrements', 'gender'] | 2025 Males Age 35-39 | 2025 Male Population Age 35-39 (Esri) |
2 | MOEMEDCRNT | ['ACS_Housing_Summary_rep', 'housingcosts'] | 2018 Median Contract Rent MOE (ACS 5-Yr) | 2018 Median Contract Rent (Households Paying Cash Rent) MOE (ACS 5-Yr) |
3 | N38_BUS | ['businesses', 'industrybynaicscode'] | 2020 Other Service excl Pub Admin Bus (NAICS) | 2020 Other Srv excl Public Admin (NAICS81) Businesses |
4 | REL35DIRHI | ['ACS_Population_Summary_rep', 'Health', 'healthinsurancecoverage'] | 2018 Pop 35-64: Direct-Purch Health Ins REL (ACS 5-Yr) | 2018 Population 35-64: Direct-Purchase Health Ins Only REL (ACS 5-Yr) |
5 | REL35VAHCR | ['ACS_Population_Summary_rep', 'Health', 'healthinsurancecoverage'] | 2018 Pop 35-64: VA Health Care Only REL (ACS 5-Yr) | 2018 Population 35-64: VA Health Care Only REL (ACS 5-Yr) |
6 | X2001_I | ['Household_Budget_Expenditures_rep', 'Retail_Goods_and_Services_Expenditures_rep', 'food'] | 2020 Index: Alcoholic Beverages | 2020 Alcoholic Beverages: Index |
7 | X4037_I | ['House_and_Home_Expenditures_rep', 'HousingHousehold'] | 2020 Index: Housekeeping Supply-Other HH Product | 2020 Housekeeping Supplies - Other HH Products: Index |
8 | X4061_I | ['House_and_Home_Expenditures_rep', 'HousingHousehold'] | 2020 Index: Wall Units/Cabinets/Other LR/FR/RR Furniture | 2020 Furniture -Wall Units/Cabinets/Other LR/FR/RR Furniture: Index |
9 | X4072_I | ['House_and_Home_Expenditures_rep', 'HousingHousehold'] | 2020 Index: Sewing Machines/Misc Appliances | 2020 Major Appliances - Sewing Machines/Misc Appliances: Index |
If we want to consolidate this into a succicent function, one easily copied and pasted, here is what it looks like.
import pandas as pd
def aggregate_column(df: pd.DataFrame, groupby_column: str = 'name',
aggregate_column: str = 'data_collection') -> pd.DataFrame:
# make sure the columns are in the dataframe
assert groupby_column in df.columns, f'"groupby_column", {groupby_column}, '
'does not appear to be in the input '
'DataFrame columns.'
assert aggregate_column in df.columns, f'"aggregate_column", {aggregate_column}, '
'does not appear to be in the input '
'DataFrame columns.'
# create aggregated dataframe
agg_df = df.groupby(groupby_column).aggregate({aggregate_column: list})
# create a dataframe of values without the aggregation column ready to join
df_alias = df.drop(columns=aggregate_column).\
set_index(groupby_column)
# join the aliases on and clean up the result
out_df = agg_df.join(df_alias).\
reset_index(groupby_column).\
drop_duplicates(groupby_column).\
reset_index(drop=True)
return out_df
Yes, this post likely is somewhat harder to read than most, but hopefully all of the details make it easier to get the point of you really are interesed in how to accomplish this, and with the entire thing cosolodated into a succicent function at the end, easy to copy and paste if you do not want to worry about the details.
Have fun and happy Monday!