By Adriano Almeida


2013-03-14 13:50:03 8 Comments

I am using pandas as a database substitute, as I have multiple databases (Oracle, MSSQL, etc.) and I am unable to make a sequence of commands equivalent to SQL.

I have a table loaded in a DataFrame with some columns:

YEARMONTH, CLIENTCODE, SIZE, .... etc etc

In SQL, counting the number of distinct clients per YEARMONTH would be:

SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;

And the result would be

201301    5000
201302    13245

How can I do that in pandas?

6 comments

@Vivek Ananthan 2019-10-02 14:58:40

With newer pandas versions, it is easy to get the result as a DataFrame using named aggregation:

unique_count = df.groupby(['YEARMONTH'], as_index=False).agg(uniq_CLIENTCODE=('CLIENTCODE', pd.Series.nunique))
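
For instance, on a small invented frame (the column names mirror the question, the data is only for illustration):

import pandas as pd

df = pd.DataFrame({
    'YEARMONTH': [201301, 201301, 201302, 201302, 201302],
    'CLIENTCODE': [1, 1, 2, 2, 3],
})

# named aggregation keeps YEARMONTH as a regular column and names the result
unique_count = df.groupby('YEARMONTH', as_index=False).agg(
    uniq_CLIENTCODE=('CLIENTCODE', 'nunique')
)
print(unique_count)
#    YEARMONTH  uniq_CLIENTCODE
# 0     201301                1
# 1     201302                2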

@Gangaraju 2019-08-01 09:38:19

I also use nunique, but this pattern is very helpful if you need other aggregate functions such as 'min', 'max', 'count' or 'mean' as well:

df.groupby('YEARMONTH')['CLIENTCODE'].transform('nunique') #count(distinct)
df.groupby('YEARMONTH')['CLIENTCODE'].transform('min')     #min
df.groupby('YEARMONTH')['CLIENTCODE'].transform('max')     #max
df.groupby('YEARMONTH')['CLIENTCODE'].transform('mean')    #average
df.groupby('YEARMONTH')['CLIENTCODE'].transform('count')   #count
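
Note that transform broadcasts the per-group value back onto every row of the original frame, which is handy for adding a column; to get one row per group (as in the SQL in the question), agg is the usual choice. A minimal sketch of the difference, using an invented sample frame:

import pandas as pd

df = pd.DataFrame({
    'YEARMONTH': [201301, 201301, 201302, 201302, 201302],
    'CLIENTCODE': [1, 1, 2, 2, 3],
})

# transform: one value per original row (useful as a new column)
df['N_CLIENTS'] = df.groupby('YEARMONTH')['CLIENTCODE'].transform('nunique')

# agg: one value per group, matching the SQL GROUP BY result
per_group = df.groupby('YEARMONTH')['CLIENTCODE'].agg('nunique')
print(per_group)
# YEARMONTH
# 201301    1
# 201302    2
# Name: CLIENTCODE, dtype: int64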

@YO and BEN_W 2018-11-23 15:16:20

Using crosstab will return more information than groupby nunique:

pd.crosstab(df.YEARMONTH,df.CLIENTCODE)
Out[196]: 
CLIENTCODE  1  2  3
YEARMONTH          
201301      2  1  0
201302      1  2  1

After a little modification, this yields the result:

pd.crosstab(df.YEARMONTH,df.CLIENTCODE).ne(0).sum(1)
Out[197]: 
YEARMONTH
201301    2
201302    3
dtype: int64

@Enthusiast 2017-07-02 11:16:54

Here is another, much simpler method. Let's say your DataFrame is named daat and the column name is YEARMONTH:

daat.YEARMONTH.value_counts()

@user5569985 2017-12-18 23:38:21

I like this answer. How can I use this method if my column name has a '.' in it (e.g. 'ck.Class')? Thanks

@Enthusiast 2017-12-20 02:18:05

daat['ck.Class'].value_counts()

@Aaron Schumacher 2018-01-16 19:21:41

This does not address the question asked.

@Jason Goal 2018-10-22 20:36:02

This counts the number of observations within each group, not the number of unique values of a given column in each group.

@Corey Levinson 2019-01-23 20:03:58

This is the incorrect answer; it does not reflect the DISTINCT requirement from the question! Moreover, it does not include counts of NaN!

@Roman Kh 2014-05-05 02:59:28

Interestingly enough, very often len(unique()) is a few times (3x-15x) faster than nunique().

@user4015990 2015-12-03 00:24:05

You mean this? .CLIENTCODE.apply(lambda x: len(x.unique())), from here

@rpanai 2018-04-17 16:03:52

@3novak This doesn't seem to work with a groupby.

@3novak 2018-04-17 18:59:34

@user32185 you'd have to drop it into an apply call with a lambda. For instance, df.groupby('YEARMONTH')['CLIENTCODE'].apply(lambda x: x.unique().shape[0]).

@mlh351 2018-09-24 20:27:57

The syntax isn't completely clear. I used len(df['column'].unique()); no need for a lambda function.

@Jason Goal 2018-10-22 20:44:20

I got TypeError: object of type 'method' has no len() from Chen's comment, 3novak's worked for me.
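
For anyone who wants to check the speed claim above on their own data, a rough timing sketch along these lines will do; the sample data below is invented, and the actual ratio depends on data size, dtypes and pandas version:

import numpy as np
import pandas as pd
from timeit import timeit

# invented sample data, only for a rough comparison
df = pd.DataFrame({
    'YEARMONTH': np.random.choice([201301, 201302, 201303], size=100_000),
    'CLIENTCODE': np.random.randint(0, 5_000, size=100_000),
})

grouped = df.groupby('YEARMONTH')['CLIENTCODE']

print(timeit(lambda: grouped.nunique(), number=20))
print(timeit(lambda: grouped.apply(lambda s: len(s.unique())), number=20))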

@Dan Allan 2013-03-14 14:09:06

I believe this is what you want:

table.groupby('YEARMONTH').CLIENTCODE.nunique()

Example:

In [2]: table
Out[2]: 
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]: 
YEARMONTH
201301       2
201302       3

@ErnestScribbler 2017-10-02 08:10:41

What if I have multiple columns that I want to be unique together, like in .drop_duplicates(subset=['col1','col2'])?

@Tarun Khaneja 2018-05-03 12:08:09

How do I access this unique count, given that there is no column name?

@MeHdi 2018-07-22 08:44:01

Thanks a lot. I used this style on the output of resample: df_watch_record.resample('M').user.nunique() counts the number of unique users who have watched a movie per month.

@wllbll 2018-09-10 06:03:13

And sort them with table.groupby('YEARMONTH').CLIENTCODE.nunique().sort_values(ascending=False).

@Josh Hansen 2019-05-09 04:56:57

Is it possible to retrieve the group identifier after nunique? Try as I might I cannot find a way, since the result of this answer is a Series, not a DataFrame.

@Indrajit 2019-07-17 07:11:59

Is it possible to apply this for multiple columns? Right now in the example, only one column is selected.
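
Following up on the last two comments: the nunique result is a Series, but it can be turned into a DataFrame that keeps the group key, and it extends to several columns. A minimal sketch, where the frame extends the answer's example with a SIZE column like the one mentioned in the question (the values are invented):

import pandas as pd

table = pd.DataFrame({
    'YEARMONTH':  [201301, 201301, 201301, 201302, 201302, 201302, 201302],
    'CLIENTCODE': [1, 1, 2, 1, 2, 2, 3],
    'SIZE':       [10, 10, 20, 10, 20, 30, 30],
})

# one column: reset_index turns the Series into a DataFrame with the group key
print(table.groupby('YEARMONTH')['CLIENTCODE'].nunique().reset_index(name='DISTINCT_CLIENTS'))

# several columns at once: one distinct count per selected column
print(table.groupby('YEARMONTH')[['CLIENTCODE', 'SIZE']].nunique())

# distinct over a combination of columns: drop duplicates within each group, then count
print(table.groupby('YEARMONTH').apply(lambda g: len(g[['CLIENTCODE', 'SIZE']].drop_duplicates())))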
