By Zhubarb


2015-04-23 16:31:53 8 Comments

I have the Pandas (version 0.15.2) dataframe below. I want to make the code column an ordered variable of type Categorical after the df creation as below.

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': range(1, 9),
                   'code': ['one', 'one', 'two', 'three',
                            'two', 'three', 'one', 'two'],
                   'amount': np.random.randn(8)},
                  columns=['id', 'code', 'amount'])

df.code = df.code.astype('category')
>> 0      one
>> 1      one
>> 2      two
>> 3    three
>> 4      two
>> 5    three
>> 6      one
>> 7      two
>> Name: code, dtype: category
>> Categories (3, object): [one < three < two]

So this works, but only partially: I cannot impose the order. Everything below, which is demonstrated on the documentation webpage, throws an error for me:

df.code = df.code.astype('category', categories=['one','two','three'], ordered=True)
>> error: astype() got an unexpected keyword argument 'categories'

Or even:

df.code.ordered
>> error: 'Series' object has no attribute 'ordered'
df.code.categories
>> error: 'Series' object has no attribute 'categories'

1) This is annoying. I cannot even get the categories (levels) of my Categorical variable. Am I doing something wrong, or is the web documentation out of date or inconsistent?

2) Also, do you know whether the type Categorical has a distance notion, i.e. does Pandas know that based on the ordering above, one is closer to two than three? I plan to use this for (dis)similarity calculation.

2 comments

@CT Zhu 2015-04-23 17:07:05

I don't think you can specify an order. pd.factorize appears to give that option, but it is not implemented; see here.

Based on what you described, you want to encode the code variable as an ordinal variable, not a categorical one; the two are slightly different.

If you can assume the difference between 'one' and 'two' is equal to that between 'two' and 'three', I guess you can just code them into ints (0, 1, 2, ...).
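The integer coding suggested above can be sketched roughly as follows. The dictionary name CODE_TO_INT and the specific integer values are assumptions made for illustration; equal spacing between the levels is exactly the assumption stated in the paragraph above.

```python
import pandas as pd

# Hypothetical mapping: the values encode an assumed equal spacing.
CODE_TO_INT = {'one': 0, 'two': 1, 'three': 2}

df = pd.DataFrame({'code': ['one', 'one', 'two', 'three',
                            'two', 'three', 'one', 'two']})

# Series.map looks each label up in the dict; labels missing
# from the dict would come back as NaN.
df['code_int'] = df['code'].map(CODE_TO_INT)

# With plain integers, a difference is ordinary arithmetic.
# Row 0 is 'one' and row 3 is 'three'.
gap = abs(df.loc[0, 'code_int'] - df.loc[3, 'code_int'])
```

Once the column is numeric, sorting, ranking and distance calculations behave like they do for any other integer column.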

If you use patsy, then there is a nice example for ordinal variables

@Zhubarb 2015-04-23 18:15:11

Thank you. Categorical definitely allows order specification; I am more confused about the implementation and syntax.

@JohnE 2015-04-23 18:22:25

Here's a short example with an ordered categorical variable and (to me) a surprising result from using rank() (as a sort of distance measure):

df = pd.DataFrame({ 'code':['one','two','three','one'], 'num':[1,2,3,1] }) 
df.code = df.code.astype('category', categories=['one','two','three'], ordered=True)

    code  num
0    one    1
1    two    2
2  three    3
3    one    1

df.sort('code')

    code  num
0    one    1
3    one    1
1    two    2
2  three    3

So sort() works as expected, in the order specified. But rank() doesn't do what I would have guessed: it ranks lexicographically and ignores the ordering of the categorical variable.

 df.sort('code').rank()

   code  num
0   1.5  1.5
3   1.5  1.5
1   4.0  3.0
2   3.0  4.0

All of which is perhaps a longer way of asking: Maybe you just want an integer type? I mean, you could make up some kind of distance function here post-sorting, but ultimately that's going to be a lot more work than what you could do with a standard int or float (and possibly problematic if you look at how rank() handles an ordered categorical).
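If a distance is still wanted from the categorical itself, one possible sketch is to read the integer position of each value via .cat.codes and take differences. This assumes a pandas release where pd.Categorical accepts categories and ordered arguments, and it assumes equal spacing between adjacent levels; dist_to_first is a made-up name for illustration.

```python
import pandas as pd

df = pd.DataFrame({'code': ['one', 'two', 'three', 'one'],
                   'num': [1, 2, 3, 1]})

# Build the ordered categorical directly from the values.
df['code'] = pd.Categorical(df['code'],
                            categories=['one', 'two', 'three'],
                            ordered=True)

# .cat.codes is each value's position in the category list
# (one=0, two=1, three=2); widen from int8 before doing arithmetic.
codes = df['code'].cat.codes.astype('int64')

# Hypothetical distance: absolute difference in position
# relative to the first row ('one').
dist_to_first = (codes - codes.iloc[0]).abs()
```

This is the same integer coding as before, just derived from the category order instead of a hand-written mapping.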

edit to add: Part of the above may not work for pandas 0.15.2 but I believe you can still do this to specify order:

df['code'].cat.categories = ['one','two','three']

As I understand it, in 0.15.2 ordered will be True by default (but False in version 0.16.0), and the order will be lexicographical rather than as specified in the constructor. I'm not sure though, and I am working in 0.16.0, so you'll have to observe how your version behaves. Remember that Categorical is still fairly new...
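One caveat, as far as I can tell from the pandas documentation: assigning a list to .cat.categories renames the existing categories position by position rather than reordering them, so on a column whose inferred categories are one, three, two it would relabel rows. A sketch that reorders without touching the values, assuming a pandas release where .cat.set_categories accepts ordered:

```python
import pandas as pd

df = pd.DataFrame({'code': ['one', 'one', 'two', 'three']})

# Categories are inferred from the data, in sorted order.
df['code'] = df['code'].astype('category')

# set_categories returns a new Series; each row keeps its own label
# and only the category order and the ordered flag change.
df['code'] = df['code'].cat.set_categories(['one', 'two', 'three'],
                                           ordered=True)

values_after = df['code'].tolist()
categories_after = df['code'].cat.categories.tolist()
```

Checking values_after against the original column is a quick way to confirm that no row was relabelled.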

@Zhubarb 2015-04-24 07:20:20

I am using Pandas version 0.15.2, which version are you using? This syntax does not work on my machine: df.cat.astype('category', categories=['one','two','three'], ordered=True). It throws: error: astype() got an unexpected keyword argument 'categories'.

@Zhubarb 2015-04-24 11:27:32

Yes, I think so. I think it is just disappointing how unhelpful pd.Categorical is at this stage. I am not sure what use cases it can solve as it stands.

@JohnE 2015-04-24 11:37:06

It looks like the ability to use arguments with astype is new in 0.16.0, but I'm pretty sure you can still do ordered in 0.15.2. The docs mention that ordered was the default in 0.15 and now unordered is the default.
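For anyone reading this on a later pandas release than the two discussed here: to my knowledge the keyword form of astype was eventually dropped, and the categories and order are passed as a CategoricalDtype object instead. A minimal sketch, assuming a release that exposes pd.CategoricalDtype (code_type is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'code': ['one', 'two', 'three', 'one']})

# The dtype object carries both the category list and the ordered flag.
code_type = pd.CategoricalDtype(categories=['one', 'two', 'three'],
                                ordered=True)
df['code'] = df['code'].astype(code_type)

# Sorting follows the declared category order, not the alphabet.
sorted_codes = df['code'].sort_values().tolist()
```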

@Zhubarb 2015-04-24 11:40:06

Unfortunately version 0.15.2 certainly does not recognise either df.code.ordered or df.code.categories. It just throws the syntax errors in my question.

@JohnE 2015-04-24 11:43:46

I think it needs to be df.code.cat.categories. I think I may have accidentally chosen a very poor column name in 'cat' (I'll change that). It looks like you need to use 'cat' as an accessor, like 'str' or 'dt'.
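A small sketch of the accessor pattern described above, assuming a pandas release that provides the .cat accessor on categorical Series. The attributes live on Series.cat rather than on the Series itself, which would explain the "no attribute" errors in the question.

```python
import pandas as pd

df = pd.DataFrame({'code': ['one', 'one', 'two', 'three']})
df['code'] = df['code'].astype('category')

# Category metadata is reached through the .cat accessor,
# much like .str for strings or .dt for datetimes.
category_list = list(df['code'].cat.categories)
is_ordered = df['code'].cat.ordered
```

Whether is_ordered comes back True or False for an inferred categorical depends on the pandas version, as discussed in the comments above.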

@Zhubarb 2015-04-24 11:54:51

I tried the syntax as it appears on the web documentation and it just doesn't work... I am giving up on using this, will go with manually converting all ordinals to integers :(
