By Tampa


2013-11-05 20:24:35 8 Comments

Below is my dataframe. I applied some transformations to create the category column and dropped the original column it was derived from. Now I need a group-by to roll up the duplicates, e.g. the Love and Fashion rows can each be collapsed with a groupby sum.

df.columns = array([category, clicks, revenue, date, impressions, size], dtype=object)
df.values=
[[Love 0 0.36823 2013-11-04 380 300x250]
 [Love 183 474.81522 2013-11-04 374242 300x250]
 [Fashion 0 0.19434 2013-11-04 197 300x250]
 [Fashion 9 18.26422 2013-11-04 13363 300x250]]

Here is the index that was generated when I created the dataframe:

print df.index
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48])

I assume I want to drop the index, set date and category as a MultiIndex, and then do a groupby sum of the metrics. How do I do this with a pandas dataframe?

df.head(15).to_dict()= {'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'}, 'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229}, 'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}}

Python is 2.7 and pandas is 0.7.0 on Ubuntu 12.04. Running the code below produces the error that follows it.

import pandas
print pandas.__version__
df = pandas.DataFrame.from_dict(
    {
     'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'}, 
     'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229}, 
     'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
    }
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()


Traceback (most recent call last):
  File "/home/ubuntu/workspace/devops/reports/groupby_sub.py", line 9, in <module>
    df.set_index(['date', 'category'], inplace=True)
  File "/usr/lib/pymodules/python2.7/pandas/core/frame.py", line 1927, in set_index
    raise Exception('Index has duplicate keys: %s' % duplicates)
Exception: Index has duplicate keys: [('2013-11-04', 'Celebs'), ('2013-11-04', 'Fashion'), ('2013-11-04', 'Health'), ('2013-11-04', 'Love'), ('2013-11-04', 'Movies')]

1 comments

@Paul H 2013-11-05 20:31:39

You can create the index on the existing dataframe. With the subset of data provided, this works for me:

import pandas
df = pandas.DataFrame.from_dict(
    {
     'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'}, 
     'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229}, 
     'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
    }
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()

If you're having duplicate index issues with the full dataset, you'll need to clean up the data a bit. Remove the duplicate rows if that's acceptable. If the duplicate rows are valid, then what sets them apart from each other? If you can add that to the dataframe and include it in the index, that's ideal. If not, just create a dummy column that defaults to 1, but can be 2 or 3 or ... N in the case of N duplicates -- and then include that field in the index as well.
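As a rough sketch of the two cleanup options above (not tested against pandas 0.7.0; `cumcount` needs a considerably newer pandas), the dummy counter can be built per key and added to the index. The small frame and the column name `dup_id` are made up for illustration:

```python
import pandas as pd

# Made-up frame with a repeated (date, category) pair
df = pd.DataFrame({
    "date": ["2013-11-04"] * 4,
    "category": ["Love", "Love", "Fashion", "Hair"],
    "clicks": [0, 183, 0, 0],
})

# Option 1: drop rows that are exact duplicates across all columns
deduped = df.drop_duplicates()

# Option 2: number the repeats within each (date, category) key,
# starting at 1, and make that counter part of the index.
# "dup_id" is a hypothetical column name.
df["dup_id"] = df.groupby(["date", "category"]).cumcount() + 1
indexed = df.set_index(["date", "category", "dup_id"])
```

With the counter included, every index entry is unique, so the duplicate-key problem no longer applies.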

Alternatively, I'm pretty sure you can skip the index creation and directly groupby with columns:

df.groupby(by=['date', 'category']).sum()

Again, that works on the subset of data that you posted.
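For reference, here is a minimal version of the column-based groupby on the first four rows from the question, assuming a recent pandas. Selecting the metric columns explicitly is my addition; it keeps string columns such as `size` out of the sum:

```python
import pandas as pd

# First four rows from the question (revenue column omitted for brevity)
df = pd.DataFrame({
    "date": ["2013-11-04"] * 4,
    "category": ["Love", "Love", "Fashion", "Fashion"],
    "clicks": [0, 183, 0, 9],
    "impressions": [380, 374242, 197, 13363],
    "size": ["300x250"] * 4,
})

# Group directly on the columns; no set_index needed, so duplicate
# (date, category) pairs are not a problem.
totals = df.groupby(["date", "category"])[["clicks", "impressions"]].sum()

# The result is indexed by (date, category); reset_index turns the
# keys back into ordinary columns if that is more convenient.
flat = totals.reset_index()
```

Each category collapses to one row per date, which is the roll-up the question asks for.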

@Tampa 2013-11-05 20:44:30

raise Exception('Index has duplicate keys: %s' % duplicates) Exception: Index has duplicate keys: [('2013-11-04', 'Beauty'), ('2013-11-04', 'Celebs'), ('2013-11-04', 'Diet'), ('2013-11-04', 'Fashion'), ('2013-11-04', 'Health'), ('2013-11-04', 'Inspiration'), ('2013-11-04', 'Lifestyle'), ('2013-11-04', 'Love'), ('2013-11-04', 'Movies'), ('2013-11-04', 'Parenting')]

@Paul H 2013-11-05 20:58:45

@Tampa Looks like you might need to clean up your data a bit. The portion you posted works for me (see my edits).

@Tampa 2013-11-05 21:14:30

this worked... df.groupby(by=['date', 'category']).sum() Thanks!
