By user7289


2013-11-11 18:52:06 8 Comments

I have a dataframe along the lines of the below:

    Type       Set
1    A          Z
2    B          Z           
3    B          X
4    C          Y

I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (i.e. with an equal number of records/rows), which holds the colour 'green' if Set == 'Z' and 'red' otherwise.

What's the best way to do this?


@bli 2017-02-08 13:04:29

The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.

Simple example using just the "Set" column:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Example with more colours and more columns taken into account:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C   blue

Edit (21/06/2019): Using plydata

It is also possible to use plydata to do this kind of thing (this seems even slower than using assign and apply, though).

from plydata import define, if_else

Simple if_else:

df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Nested if_else:

df = define(df, color=if_else(
    'Set=="Z"',
    '"red"',
    if_else('Type=="C"', '"green"', '"blue"')))

print(df)                            
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B   blue
3   Y    C  green

@HKRC 2019-03-06 17:51:55

Maybe this has only become possible with newer versions of Pandas, but I think the following is the shortest and perhaps the best answer to the question so far. You can use one condition or several, depending on your need.

df = pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[df['Set'] == "Z", 'Color'] = "green"
print(df)

# result: 
  Type Set  Color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red
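The same `.loc` pattern extends to several conditions: combine the boolean masks with `&`, parenthesizing each comparison. A minimal sketch building on the example above:

```python
import pandas as pd

df = pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
# One condition
df.loc[df['Set'] == "Z", 'Color'] = "green"
# Several conditions, combined elementwise with & (each mask parenthesized)
df.loc[(df['Set'] == "Z") & (df['Type'] == "B"), 'Color'] = "blue"
print(df)
```

Later assignments overwrite earlier ones where their masks overlap, so order the statements from most general to most specific.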

@unutbu 2013-11-11 19:03:15

If you only have two choices to select from:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red

If you have more than two conditions then use np.select. For example, if you want color to be

  • yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
  • otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
  • otherwise purple when (df['Type'] == 'B')
  • otherwise black,

then use

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

which yields

  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black

@Amol Sharma 2016-01-18 13:34:43

Doesn't work if I put two conditions inside the where clause with and

@unutbu 2016-01-18 14:10:28

@AmolSharma: Use & instead of and. See stackoverflow.com/q/13589390/190597
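To illustrate the point in that comment: combining conditions inside np.where requires the elementwise `&` operator with each comparison parenthesized; Python's `and` tries to collapse each Series to a single bool and raises "The truth value of a Series is ambiguous". A short sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})
# Elementwise & works; `and` would raise ValueError here
df['color'] = np.where((df['Set'] == 'Z') & (df['Type'] == 'B'), 'green', 'red')
print(df)
```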

@denson 2016-10-19 16:48:02

df['color'] = list(np.where(df['Set']=='Z', 'green', 'red')) will suppress the pandas warning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

@Alejandro 2017-07-25 18:26:16

'green' and 'red' can also be replaced with column arithmetic. e.g., df['foo'] = np.where(df['Set']=='Z', df['Set'], df['Type'].shift(1))

@vvv 2018-02-13 01:17:22

does np.where create a new column? I used this code and when I do df.color.head() I get: 'numpy.ndarray' object has no attribute 'head'

@unutbu 2018-02-13 01:25:29

Did you use the assignment df.color = np.where(...) to create a new column? If you did, then no new column was created. Instead, df gets an attribute named color which is assigned a NumPy array. In contrast, df['color'] = np.where(...) creates a new DataFrame column. Then df.color would return the Series df['color']. This asymmetry in the behavior of df['color'] vs df.color is (I believe) a reason never to use the dot syntax to access columns. Saving a few keystrokes is not worth this mental confusion.

@unutbu 2018-02-13 01:29:27

But anyway, you are correct: np.where always returns a NumPy array. When you assign values to a DataFrame column using df['color'] = ..., and later retrieve it using df['color'] you get back a Series instead.
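A small sketch of that distinction (np.where returns a plain ndarray; assigning it to a column with bracket syntax wraps it in a Series):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Set': list('ZZXY')})
arr = np.where(df['Set'] == 'Z', 'green', 'red')
print(type(arr))          # plain numpy.ndarray -- no .head() method

df['color'] = arr         # creates a real DataFrame column
print(type(df['color']))  # pandas Series -- .head() works
```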

@unutbu 2018-02-13 01:33:53

<rant>The other reason I find the dot syntax distasteful is that you can't access columns named last or mean (to name just a few) the way you would foo, because last and mean are method names. df.last rightly returns the method. So if you use the dot syntax you have to have perfect recall of all DataFrame attribute names, or else you'll constantly be writing buggy code, possibly overwriting methods with column values....</rant> :)

@user989762 2018-08-02 09:46:29

This is a great solution. I wonder though, is there also a "pure" pandas solution rather than a solution that mixes numpy and pandas? Would that use df.Set.map(lambda x: ...) as proposed in the answer by @cheekybastard below?

@unutbu 2018-08-02 12:02:13

@user989762: Pandas is built on top of NumPy, so while df['Set'].map(lambda x: ...) is "pure Pandas" I don't really see the attraction in replacing np.where with a lambda function. (Under the hood, map has to call the lambda function in a loop. Multiple calls to a lambda function tend to be slower than one call to np.where.) To me, the really interesting part of cheekybastard's answer is that a pure Python list comprehension beats np.where for a wide range of inputs.

@KMFR 2019-01-06 15:32:30

Is np.where also good for situations where I want to check if the value CONTAINS a string but doesn't need to match it (which is the case in OP's example)? Or should I use something else?

@unutbu 2019-01-06 15:36:37

To check if a Pandas Series contains a string, use Series.str.contains.
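For example, Series.str.contains produces a boolean mask that feeds directly into np.where (the values below are illustrative; na=False guards against missing entries):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Set': ['Z1', 'Z2', 'X1', 'Y1']})
# True wherever the string contains 'Z' anywhere, not just on an exact match
df['color'] = np.where(df['Set'].str.contains('Z', na=False), 'green', 'red')
print(df)
```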

@Vishal 2019-04-21 01:43:59

Lovely answer. Side question -- PyCharm warns about "Expected type 'Union[int, float, complex, None]' got 'str' instead" when I use np.select(conditions, choices, default="foo"). Why is PyCharm warning if str is a valid np.ScalarType?

@unutbu 2019-04-21 10:09:11

@Vishal: PyCharm attempts to guess what type np.select will return, but made a mistake in this case. I don't use PyCharm so I can't test this, but you might want to look here for ways to suppress this warning.

@Lena 2019-06-27 11:06:51

I try this with conditions being whether a certain value in a row of a column is also found in a certain list: conditions = [(mydata['column'] in someList), (mydata['column'] in someOtherList)] However, then I get the following error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Why would my solution not work for me?

@unutbu 2019-06-27 20:23:41

The Python language specification demands that the in operator return a single True/False value. Presumably you want (mydata['column'] in someList) to return an array-like of True/False values -- one boolean for each value in mydata['column']. For this, use the Series.isin method: conditions = [(mydata['column'].isin(someList)), (mydata['column'].isin(someOtherList))]
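A minimal sketch of isin-based conditions feeding np.select (the lists and labels are illustrative, standing in for someList and someOtherList from the question):

```python
import pandas as pd
import numpy as np

mydata = pd.DataFrame({'column': list('ABCD')})
someList, someOtherList = ['A', 'B'], ['C']

# Each .isin call yields an elementwise boolean Series, as np.select expects
conditions = [mydata['column'].isin(someList),
              mydata['column'].isin(someOtherList)]
choices = ['first', 'second']
mydata['match'] = np.select(conditions, choices, default='neither')
print(mydata)
```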

@unutbu 2019-06-27 20:30:17

For more on why operations that try to map an entire Series to a single boolean value raise ValueError: The truth value of an array is ambiguous see this post.

@cheekybastard 2015-07-02 00:27:06

List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.

Example list comprehension:

df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]

%timeit tests:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop

@blacksite 2017-04-20 16:45:28

Note that, with much larger dataframes (think pd.DataFrame({'Type':list('ABBC')*100000, 'Set':list('ZZXY')*100000})-size), numpy.where outpaces map, but the list comprehension is king (about 50% faster than numpy.where).

@Mappi 2019-01-01 06:42:24

Can the list comprehension method be used if the condition needs information from multiple columns? I am looking for something like this (this does not work): df['color'] = ['red' if (x['Set'] == 'Z') & (x['Type'] == 'B') else 'green' for x in df]

@cheekybastard 2019-01-14 01:38:23

Iterate over the dataframe with df.iterrows(); then you can access multiple columns via row: ['red' if (row['Set'] == 'Z') & (row['Type'] == 'B') else 'green' for index, row in df.iterrows()]
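Put together, a runnable sketch of the multi-column list comprehension over df.iterrows() (since each row value here is a scalar, plain `and` works fine; note iterrows is considerably slower than the single-column comprehension above):

```python
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})
df['color'] = ['red' if (row['Set'] == 'Z') and (row['Type'] == 'B') else 'green'
               for index, row in df.iterrows()]
print(df)
```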

@blacksite 2017-02-15 21:43:16

Here's yet another way to skin this cat, using a dictionary to map existing values onto new ones:

def map_values(row, values_dict):
    return values_dict[row]

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})

df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))

What's it look like:

df
Out[2]: 
  INDICATOR  VALUE  NEW_VALUE
0         A     10          1
1         B      9          2
2         C      8          3
3         D      7          4

This approach can be very powerful when you have many if/else-type statements to make (i.e. many unique values to replace).

And of course you could always do this:

df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)

But that approach is more than three times as slow as the apply approach from above, on my machine.

And you could also do this, using dict.get:

df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]

@Monica Heddneck 2018-06-21 00:16:24

I like this answer because it shows how to do multiple replacements of values

@acharuva 2014-06-06 04:43:52

Another way in which this could be achieved is

df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
