By user7289


2013-11-11 18:52:06 8 Comments

I have a dataframe along the lines of the below:

    Type       Set
1    A          Z
2    B          Z           
3    B          X
4    C          Y

I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (i.e. with an equal number of records/rows), which sets the colour 'green' if Set == 'Z' and 'red' otherwise.

What's the best way to do this?


@HKRC 2019-03-06 17:51:55

This may only have become possible with newer versions of Pandas, but I think the following is the shortest and perhaps best answer to the question so far. You can use one condition or several, depending on your need; a multi-condition variant is sketched after the output below.

import pandas as pd

df = pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[df['Set'] == "Z", 'Color'] = "green"
print(df)

# result: 
  Type Set  Color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red
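
For several conditions, the same .loc pattern extends by combining boolean masks with &. A minimal sketch, assuming the same data as above (the extra 'yellow' rule is purely illustrative and not part of the original answer):

import pandas as pd

df = pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"                                                    # default
df.loc[df['Set'] == "Z", 'Color'] = "green"                            # single condition
df.loc[(df['Set'] == "Z") & (df['Type'] == "A"), 'Color'] = "yellow"   # two conditions combined with &
print(df)   # row 0 becomes yellow, row 1 stays green, rows 2-3 stay red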

@unutbu 2013-11-11 19:03:15

If you only have two choices to select from:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red

If you have more than two conditions then use np.select. For example, if you want color to be

  • yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
  • otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
  • otherwise purple when (df['Type'] == 'B')
  • otherwise black,

then use

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

which yields

  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black

@Matt O'Brien 2015-09-21 20:57:07

Can't plus-one it enough. A real bread-and-butter line of code.

@Amol Sharma 2016-01-18 13:34:43

This doesn't work if I put two conditions inside the where clause with and.

@unutbu 2016-01-18 14:10:28

@AmolSharma: Use & instead of and. See stackoverflow.com/q/13589390/190597

@denson 2016-10-19 16:48:02

df['color'] = list(np.where(df['Set']=='Z', 'green', 'red')) will suppress the pandas warning: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead".

@Alejandro 2017-07-25 18:26:16

'green' and 'red' can also be replaced with column arithmetic. e.g., df['foo'] = np.where(df['Set']=='Z', df['Set'], df['Type'].shift(1))
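
A small runnable sketch of the column-based replacement Alejandro describes (shift(1) pulls the previous row's Type; the column name 'foo' is just illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})
# where Set == 'Z', take the value from Set; otherwise take the previous row's Type
df['foo'] = np.where(df['Set'] == 'Z', df['Set'], df['Type'].shift(1))
print(df)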

@vvv 2018-02-13 01:17:22

does np.where create a new column? I used this code and when I do df.color.head() I get: 'numpy.ndarray' object has no attribute 'head'

@unutbu 2018-02-13 01:25:29

Did you use the assignment df.color = np.where(...) to create a new column? If you did, then no new column was created. Instead, df gets an attribute named color which is assigned a NumPy array. In contrast, df['color'] = np.where(...) creates a new DataFrame column. Then df.color would return the Series df['color']. This asymmetry in the behavior of df['color'] vs df.color is (I believe) a reason never to use the dot syntax to access columns. Saving a few keystrokes is not worth this mental confusion.
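
A small sketch of that asymmetry, assuming df has no existing color column (recent pandas versions also emit a UserWarning on the attribute assignment):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

df.color = np.where(df['Set'] == 'Z', 'green', 'red')     # sets an attribute, not a column
print('color' in df.columns)                              # False

df['color'] = np.where(df['Set'] == 'Z', 'green', 'red')  # creates a real column
print(type(df['color']))                                  # <class 'pandas.core.series.Series'>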

@unutbu 2018-02-13 01:29:27

But anyway, you are correct: np.where always returns a NumPy array. When you assign values to a DataFrame column using df['color'] = ... and later retrieve them with df['color'], you get back a Series instead.

@unutbu 2018-02-13 01:33:53

<rant>The other reason I find the dot syntax distasteful is that you can't access columns named last or mean (to name just a few) the same way as foo, because last and mean are method names. df.last rightly returns the method. So if you use the dot syntax you have to have perfect recall of all DataFrame attribute names, or else you'll constantly be writing buggy code, possibly overwriting methods with column values....</rant> :)
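
For example (a column hypothetically named 'mean', purely to illustrate the clash):

import pandas as pd

df = pd.DataFrame({'mean': [1, 2, 3]})
print(df.mean)     # returns the DataFrame.mean method, not the column
print(df['mean'])  # returns the 'mean' column as a Series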

@user989762 2018-08-02 09:46:29

This is a great solution. I wonder though, is there also a "pure" pandas solution rather than a solution that mixes numpy and pandas? Would that use df.Set.map(lambda x: ...) as proposed in the answer by @cheekybastard below?

@unutbu 2018-08-02 12:02:13

@user989762: Pandas is built on top of NumPy, so while df['Set'].map(lambda x: ...) is "pure Pandas" I don't really see the attraction in replacing np.where with a lambda function. (Under the hood, map has to call the lambda function in a loop. Multiple calls to a lambda function tend to be slower than 1 call to np.where.) To me, the really interesting part of cheekybastard's answer is that a pure Python list comprehension beats np.where for a wide range of inputs.

@Datanovice 2018-11-06 22:27:53

It took me so long to find this. Thanks!

@KMFR 2019-01-06 15:32:30

Is np.where also good for situations where I want to check if the value CONTAINS a string but doesn't need to match it (which is the case in OP's example)? Or should I use something else?

@unutbu 2019-01-06 15:36:37

To check if a Pandas Series contains a string, use Series.str.contains.
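
A hedged sketch combining Series.str.contains with np.where (the substring 'Z' and the sample values are only illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Set': ['Z1', 'ZZ', 'X3', 'Y4']})
# green wherever the string contains 'Z', red otherwise
df['color'] = np.where(df['Set'].str.contains('Z'), 'green', 'red')
print(df)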

@Vishal 2019-04-21 01:43:59

Lovely answer. Side question -- PyCharm warns about "Expected type 'Union[int, float, complex, None]' got 'str' instead" when I use np.select(conditions, choices, default="foo"). Why is PyCharm warning if str is a valid np.ScalarType?

@unutbu 2019-04-21 10:09:11

@Vishal: PyCharm attempts to guess what type np.select will return, but made a mistake in this case. I don't use PyCharm so I can't test this, but you might want to look here for ways to suppress this warning.

@cheekybastard 2015-07-02 00:27:06

List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.

Example list comprehension:

df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]

%timeit tests:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop

@blacksite 2017-04-20 16:45:28

Note that, with much larger dataframes (think pd.DataFrame({'Type':list('ABBC')*100000, 'Set':list('ZZXY')*100000})-size), numpy.where outpaces map, but the list comprehension is king (about 50% faster than numpy.where).
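
A rough sketch of how that comparison might be reproduced in IPython (timings will vary by machine and by pandas/NumPy version, so the ordering is not guaranteed):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type': list('ABBC') * 100000, 'Set': list('ZZXY') * 100000})

%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set'] == 'Z', 'green', 'red')
%timeit df['color'] = df['Set'].map(lambda x: 'red' if x == 'Z' else 'green')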

@Mappi 2019-01-01 06:42:24

Can the list comprehension method be used if the condition needs information from multiple columns? I am looking for something like this (this does not work): df['color'] = ['red' if (x['Set'] == 'Z') & (x['Type'] == 'B') else 'green' for x in df]

@cheekybastard 2019-01-14 01:38:23

Use df.iterrows(), then you can access multiple columns via row: ['red' if (row['Set'] == 'Z') & (row['Type'] == 'B') else 'green' for index, row in df.iterrows()]
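
A runnable version of that one-liner, for illustration (note that iterrows is much slower than np.where or a plain list comprehension on large frames):

import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})
# 'and' works here because each comparison is a plain scalar boolean per row
df['color'] = ['red' if (row['Set'] == 'Z') and (row['Type'] == 'B') else 'green'
               for index, row in df.iterrows()]
print(df)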

@blacksite 2017-02-15 21:43:16

Here's yet another way to skin this cat, using a dictionary to map new values onto the keys in the column:

import pandas as pd

def map_values(row, values_dict):
    return values_dict[row]

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})

df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args=(values_dict,))

What's it look like:

df
Out[2]: 
  INDICATOR  VALUE  NEW_VALUE
0         A     10          1
1         B      9          2
2         C      8          3
3         D      7          4

This approach can be very powerful when you have many if/else-type statements to make (i.e. many unique values to replace).

And of course you could always do this:

df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)

But that approach is more than three times as slow as the apply approach from above, on my machine.

And you could also do this, using dict.get:

df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]
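
Applied back to the original question's data, the same dictionary idea might look like the sketch below (the mapping dict is the only part you would need to change; fillna supplies the default for any unmapped value):

import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})
color_dict = {'Z': 'green'}
df['color'] = df['Set'].map(color_dict).fillna('red')   # unmapped values default to 'red'
print(df)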

@Monica Heddneck 2018-06-21 00:16:24

I like this answer because it shows how to do multiple replacements of values

@bli 2017-02-08 13:04:29

The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.

Simple example using just the "Set" column:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Example with more colours and more columns taken into account:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C   blue

@acharuva 2014-06-06 04:43:52

Another way in which this could be achieved is

df['color'] = df.Set.map(lambda x: 'red' if x == 'Z' else 'green')
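
To match the colours the question actually asks for (green when Set == 'Z'), the branches would simply be swapped; a quick sketch:

import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})
df['color'] = df.Set.map(lambda x: 'green' if x == 'Z' else 'red')
print(df)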
