By Adriano Almeida


2013-03-14 13:50:03 8 Comments

I am using pandas as a db substitute as I have multiple databases (oracle, mssql, etc) and I am unable to make a sequence of commands to a SQL equivalent.

I have a table loaded in a DataFrame with some columns:

YEARMONTH, CLIENTCODE, SIZE, .... etc etc

In SQL, to count the amount of different clients per year would be:

SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;

And the result would be

201301    5000
201302    13245

How can I do that in pandas?

8 comments

@Vivek Payasi 2020-04-20 10:47:51

Distinct of column along with aggregations on other columns

To get the distinct number of values for any column (CLIENTCODE in your case), we can use nunique. We can pass the input as a dictionary in agg function, along with aggregations on other columns:

grp_df = df.groupby('YEARMONTH').agg({'CLIENTCODE': ['nunique'],
                                      'other_col_1': ['sum', 'count']})

# to flatten the multi-level columns
grp_df.columns = ["_".join(col).strip() for col in grp_df.columns.values]

# if you wish to reset the index
grp_df.reset_index(inplace=True)

@Ramon 2020-02-03 00:40:45

Here an approach to have count distinct over multiple columns. Let's have some data:

data = {'CLIENT_CODE':[1,1,2,1,2,2,3],
        'YEAR_MONTH':[201301,201301,201301,201302,201302,201302,201302],
        'PRODUCT_CODE': [100,150,220,400,50,80,100]
       }
table = pd.DataFrame(data)
table

CLIENT_CODE YEAR_MONTH  PRODUCT_CODE
0   1       201301      100
1   1       201301      150
2   2       201301      220
3   1       201302      400
4   2       201302      50
5   2       201302      80
6   3       201302      100

Now, list the columns of interest and use groupby in a slightly modified syntax:

columns = ['YEAR_MONTH', 'PRODUCT_CODE']
table[columns].groupby(table['CLIENT_CODE']).nunique()

We obtain:

YEAR_MONTH  PRODUCT_CODE CLIENT_CODE        
1           2            3
2           2            3
3           1            1

@Vivek Ananthan 2019-10-02 14:58:40

With new pandas version, it is easy to get as dataframe

unique_count = pd.groupby(['YEARMONTH'], as_index=False).agg(uniq_CLIENTCODE =('CLIENTCODE',pd.Series.count))

@Gangaraju 2019-08-01 09:38:19

I am also using nunique but it will be very helpful if you have to use an aggregate function like 'min', 'max', 'count' or 'mean' etc.

df.groupby('YEARMONTH')['CLIENTCODE'].transform('nunique') #count(distinct)
df.groupby('YEARMONTH')['CLIENTCODE'].transform('min')     #min
df.groupby('YEARMONTH')['CLIENTCODE'].transform('max')     #max
df.groupby('YEARMONTH')['CLIENTCODE'].transform('mean')    #average
df.groupby('YEARMONTH')['CLIENTCODE'].transform('count')   #count

@YOBEN_S 2018-11-23 15:16:20

Using crosstab, this will return more information than groupby nunique

pd.crosstab(df.YEARMONTH,df.CLIENTCODE)
Out[196]: 
CLIENTCODE  1  2  3
YEARMONTH          
201301      2  1  0
201302      1  2  1

After a little bit modify ,yield the result

pd.crosstab(df.YEARMONTH,df.CLIENTCODE).ne(0).sum(1)
Out[197]: 
YEARMONTH
201301    2
201302    3
dtype: int64

@Murtaza Haji 2020-04-30 01:22:18

How can I export this as two column YEARMONTH and count. Also can i set the count in descending order?

@StatguyUser 2017-07-02 11:16:54

Here is another method, much simple, lets say your dataframe name is daat and column name is YEARMONTH

daat.YEARMONTH.value_counts()

@user5569985 2017-12-18 23:38:21

I like this answer. How can I use this method if my column name has a '.' in it (e.g. 'ck.Class')? Thanks

@StatguyUser 2017-12-20 02:18:05

daat['ck.Class'].value_counts()

@Aaron Schumacher 2018-01-16 19:21:41

This does not address the question asked.

@Jason Goal 2018-10-22 20:36:02

this counting the number of observations within each group, not the unique value of a certain column each group has.

@Corey Levinson 2019-01-23 20:03:58

This is the incorrect answer; it does not reflect the DISTINCT requirement from the question! Moreover, it does not include counts of NaN!

@Roman Kh 2014-05-05 02:59:28

Interestingly enough, very often len(unique()) is a few times (3x-15x) faster than nunique().

@user4015990 2015-12-03 00:24:05

You mean this? .CLIENTCODE.apply(lambda x: len(x.unique())), from here

@3novak 2018-04-17 18:59:34

@user32185 you'd have to drop it into an apply call with a lambda. For instance, df.groupby('YEARMONTH')['CLIENTCODE'].apply(lambda x: x.unique().shape[0]).

@mlh351 2018-09-24 20:27:57

Syntax isn't completely clear, I used len(df['column'].unique()) no need for lambda function

@Jason Goal 2018-10-22 20:44:20

I got TypeError: object of type 'method' has no len() from Chen's comment, 3novak's worked for me.

@Dan Allan 2013-03-14 14:09:06

I believe this is what you want:

table.groupby('YEARMONTH').CLIENTCODE.nunique()

Example:

In [2]: table
Out[2]: 
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]: 
YEARMONTH
201301       2
201302       3

@ErnestScribbler 2017-10-02 08:10:41

What if I have multiple columns that I want to be unique together, like in .drop_duplicates(subset=['col1','col2'])?

@Tarun Khaneja 2018-05-03 12:08:09

How to access this unique count . As there is no column name

@Mehdi Kazemi 2018-07-22 08:44:01

Thanks lot, I used this style on output of resample. df_watch_record.resample('M').user.nunique() counts the number of unique users who have watched movie per month.

@wllbll 2018-09-10 06:03:13

and sort them with table.groupby('YEARMONTH').CLIENTCODE.nunique().sort_values(‌​ascending=False)

@Josh Hansen 2019-05-09 04:56:57

Is it possible to retrieve the group identifier after nunique? Try as I might I cannot find a way, since the result of this answer is a Series, not a DataFrame.

@Indrajit 2019-07-17 07:11:59

Is it possible to apply this for multiple columns? Right now in the example, only one column is selected.

@Ramon 2020-04-06 00:54:15

@Indrajit, see my answer below to apply this for multiple columns.

Related Questions

Sponsored Content

27 Answered Questions

[SOLVED] Renaming columns in pandas

17 Answered Questions

[SOLVED] Selecting multiple columns in a pandas dataframe

24 Answered Questions

[SOLVED] How can I count the occurrences of a list item?

  • 2010-04-08 13:30:00
  • weakish
  • 1715699 View
  • 1543 Score
  • 24 Answer
  • Tags:   python list count

15 Answered Questions

[SOLVED] Delete column from pandas DataFrame

22 Answered Questions

[SOLVED] How to iterate over rows in a DataFrame in Pandas

14 Answered Questions

[SOLVED] How do I get the row count of a pandas DataFrame?

19 Answered Questions

[SOLVED] Get list from pandas DataFrame column headers

20 Answered Questions

[SOLVED] LINQ's Distinct() on a particular property

18 Answered Questions

[SOLVED] Count the number occurrences of a character in a string

  • 2009-07-20 20:00:36
  • Mat
  • 979375 View
  • 961 Score
  • 18 Answer
  • Tags:   python string count

21 Answered Questions

Sponsored Content