By user7289


2013-11-11 18:52:06 8 Comments

I have a dataframe along the lines of the below:

      Type Set
    1    A   Z
    2    B   Z
    3    B   X
    4    C   Y

I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (i.e. with an equal number of records/rows), which sets the colour to 'green' if Set equals 'Z' and to 'red' otherwise.

What's the best way to do this?


@Hossein 2019-03-06 17:51:55

Maybe this has become possible with newer versions of Pandas (tested with pandas 1.0.5), but I think the following is the shortest and maybe the best answer to the question so far. You can use the .loc method with one or several conditions, depending on your need.

Code Summary:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"

#practice!
df.loc[((df['Set']=="Z") & (df['Type']=="B")) | (df['Type']=="C"), 'Color'] = "purple"

Explanation:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

# df so far: 
  Type Set  
0    A   Z 
1    B   Z 
2    B   X 
3    C   Y

Add a 'Color' column and set all values to "red":

df['Color'] = "red"

Apply your single condition:

df.loc[(df['Set']=="Z"), 'Color'] = "green"


# df: 
  Type Set  Color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

or multiple conditions if you want:

df.loc[((df['Set']=="Z") & (df['Type']=="B")) | (df['Type']=="C"), 'Color'] = "purple"

You can read about Pandas logical operators and conditional selection here: Logical operators for boolean indexing in Pandas
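As a side note on the combined condition above: in Python, & binds more tightly than |, so the mask parses as (Set is Z AND Type is B) OR (Type is C). A minimal sketch, using the example frame from this thread, with the grouping written out explicitly:

```python
import pandas as pd

df = pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = 'red'

# & binds more tightly than |, so a & b | c parses as (a & b) | c.
# Writing the parentheses explicitly makes the intent unambiguous:
mask = ((df['Set'] == 'Z') & (df['Type'] == 'B')) | (df['Type'] == 'C')
df.loc[mask, 'Color'] = 'purple'
print(df)
```

Rows 1 (B/Z) and 3 (C/Y) match the mask and become purple; the others stay red.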

@Salvador Vigo 2019-08-26 15:00:40

The best one so far. For more conditions you could add further statements, e.g. df.loc[(df['Set']=="Z") & (df['Type']=="A"), 'Color'] = "green"

@AMC 2019-10-16 14:42:01

This should be the accepted answer. Actually idiomatic and extensible.

@unutbu 2013-11-11 19:03:15

If you only have two choices to select from:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red

If you have more than two conditions then use np.select. For example, if you want color to be

  • yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
  • otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
  • otherwise purple when (df['Type'] == 'B')
  • otherwise black,

then use

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

which yields

  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black

@Amol Sharma 2016-01-18 13:34:43

doesn't work if I put two conditions inside the where clause with and

@denson 2016-10-19 16:48:02

df['color'] = list(np.where(df['Set']=='Z', 'green', 'red')) will suppress the pandas warning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
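(That warning usually means df is itself a slice of another frame; rather than wrapping the array in list(), an explicit .copy() is arguably the cleaner fix. A minimal sketch, using the example frame from this thread:)

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

# Assigning into a slice can trigger SettingWithCopyWarning;
# an explicit .copy() makes the intent clear and avoids it.
subset = df[df['Type'] == 'B'].copy()
subset['color'] = np.where(subset['Set'] == 'Z', 'green', 'red')
print(subset)
```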

@Alejandro 2017-07-25 18:26:16

'green' and 'red' can also be replaced with column arithmetic. e.g., df['foo'] = np.where(df['Set']=='Z', df['Set'], df['Type'].shift(1))

@vvv 2018-02-13 01:17:22

does np.where create a new column? I used this code and when I do df.color.head() I get: 'numpy.ndarray' object has no attribute 'head'

@unutbu 2018-02-13 01:25:29

Did you use the assignment df.color = np.where(...) to create a new column? If you did, then no new column was created. Instead, df gets an attribute named color which is assigned to a NumPy array. In contrast, df['color'] = np.where(...) creates a new DataFrame column. Then df.color would return the Series df['color']. This asymmetry in the behavior of df['color'] vs df.color is (I believe) a reason never to use the dot syntax to access columns. Saving a few keystrokes is not worth this mental confusion.

@unutbu 2018-02-13 01:29:27

But anyway, you are correct: np.where always returns a NumPy array. When you assign values to a DataFrame column using df['color'] = ..., and later retrieve it using df['color'] you get back a Series instead.

@unutbu 2018-02-13 01:33:53

<rant>The other reason I find the dot syntax distasteful is that you can't access columns named last or mean (to name just a few) the way you can foo, because last and mean are method names. df.last rightly returns the method. So if you use the dot syntax you have to have perfect recall of all DataFrame attribute names or else you'll constantly be writing buggy code, possibly overwriting methods with column values....</rant> :)

@user989762 2018-08-02 09:46:29

This is a great solution. I wonder though, is there also a "pure" pandas solution rather than a solution that mixes numpy and pandas? Would that use df.Set.map(lambda x: ...) as proposed in the answer by @cheekybastard below?

@unutbu 2018-08-02 12:02:13

@user989762: Pandas is built on top of NumPy, so while df['Set'].map(lambda x: ...) is "pure Pandas" I don't really see the attraction in replacing np.where with a lambda function. (Under the hood, map has to call the lambda function in a loop. Multiple calls to a lambda function tend to be slower than 1 call to np.where.) To me, the really interesting part of cheekybastard's answer is that a pure Python list comprehension beats np.where for a wide range of inputs.

@KMFR 2019-01-06 15:32:30

Is np.where also good for situations where I want to check if the value CONTAINS a string but doesn't need to match it (which is the case in OP's example)? Or should I use something else?

@unutbu 2019-01-06 15:36:37

To check if a Pandas Series contains a string, use Series.str.contains.
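A minimal sketch combining Series.str.contains with np.where for substring matching (the 'Zone'/'Area' values here are invented for illustration, not taken from the OP's data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Set': ['Zone-1', 'Zone-2', 'Area-1', 'Area-2']})

# Substring match instead of exact equality:
df['color'] = np.where(df['Set'].str.contains('Zone'), 'green', 'red')
print(df)
```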

@Vishal 2019-04-21 01:43:59

Lovely answer. Side question -- PyCharm warns about "Expected type 'Union[int, float, complex, None]' got 'str' instead" when I use np.select(conditions, choices, default="foo"). Why is PyCharm warning if str is a valid np.ScalarType?

@unutbu 2019-04-21 10:09:11

@Vishal: PyCharm attempts to guess what type np.select will return, but made a mistake in this case. I don't use PyCharm so I can't test this, but you might want to look here for ways to suppress this warning.

@Lena 2019-06-27 11:06:51

I try this with conditions being whether a certain value in a row of a column is also found in a certain list: conditions = [(mydata['column'] in someList), (mydata['column'] in someOtherList)] However, then I get the following error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Why would my solution not work for me?

@unutbu 2019-06-27 20:23:41

The Python language specification demands that the in operator return a single True/False value. Presumably you want (mydata['column'] in someList) to return an array-like of True/False values -- one boolean for each value in mydata['column']. For this, use the Series.isin method: conditions = [(mydata['column'].isin(someList)), (mydata['column'].isin(someOtherList))]

@unutbu 2019-06-27 20:30:17

For more on why operations that try to map an entire Series to a single boolean value raise ValueError: The truth value of an array is ambiguous see this post.
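Putting the two previous comments together, a minimal sketch of Series.isin feeding np.select (the lists and labels here are made up for illustration):

```python
import pandas as pd
import numpy as np

mydata = pd.DataFrame({'column': list('ABBC')})
someList = ['A', 'B']
someOtherList = ['C']

# .isin returns one boolean per row, which is what np.select expects:
conditions = [mydata['column'].isin(someList),
              mydata['column'].isin(someOtherList)]
choices = ['first', 'second']
mydata['label'] = np.select(conditions, choices, default='other')
print(mydata)
```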

@Harper 2019-08-15 09:26:10

It's a shame i can't upvote this multiple times. One upvote doesn't seem enough.

@Kaisar 2019-09-19 21:03:39

Just want to add the extreme case: If you have so many conditions that coding them with np.select() gets intractable, you can create a csv with the columns: CONDITION_A, CONDITION_B,..., CONDITION_Z, CHOICE. Then read this table as a dataframe and merge it on the right of your original dataframe.

@Yaakov Bressler 2019-12-08 18:42:01

If you're working with massive data, a memoized approach would be best:

# First create a dictionary of manually stored values
color_dict = {'Z':'red'}

# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}

# Next, merge the two
color_dict.update(color_dict_other)

# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)

This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when: data_size > 10**4 & n_distinct < data_size/4

E.g., memoize in a case of 10,000 rows with 2,500 or fewer distinct values.

@AMC 2020-02-10 02:31:57

Alright, so with only 2 distinct values to map, 100,000,000 rows, it takes 6.67 seconds to run without "memoization", and 9.86 seconds with.

@AMC 2020-02-10 02:48:11

100,000,000 rows, 52 distinct values, where 1 of those maps to the first output value, and the other 51 all correspond to the other: 7.99 seconds without memoization, 11.1 seconds with.

@Yaakov Bressler 2020-02-10 04:14:54

Are your values in random order? Or are they back to back? High speed of pandas could be due to caching @AMC

@AMC 2020-02-10 22:56:03

Are your values in random order? Or are they back to back? Values are random, selected using random.choices().

@Jaroslav Bezděk 2019-10-10 14:30:03

A one-liner using the .apply() method is the following:

df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')

After that, df data frame looks like this:

>>> print(df)
  Type Set  color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

@bli 2017-02-08 13:04:29

The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.

Simple example using just the "Set" column:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Example with more colours and more columns taken into account:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C   blue

Edit (21/06/2019): Using plydata

It is also possible to use plydata to do this kind of thing (this seems even slower than using assign and apply, though).

from plydata import define, if_else

Simple if_else:

df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Nested if_else:

df = define(df, color=if_else(
    'Set=="Z"',
    '"red"',
    if_else('Type=="C"', '"green"', '"blue"')))

print(df)                            
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B   blue
3   Y    C  green

@cheekybastard 2015-07-02 00:27:06

List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.

Example list comprehension:

df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]

%timeit tests:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop

@blacksite 2017-04-20 16:45:28

Note that, with much larger dataframes (think pd.DataFrame({'Type':list('ABBC')*100000, 'Set':list('ZZXY')*100000})-size), numpy.where outpaces map, but the list comprehension is king (about 50% faster than numpy.where).

@Mappi 2019-01-01 06:42:24

Can the list comprehension method be used if the condition needs information from multiple columns? I am looking for something like this (this does not work): df['color'] = ['red' if (x['Set'] == 'Z') & (x['Type'] == 'B') else 'green' for x in df]

@cheekybastard 2019-01-14 01:38:23

Iterate over the dataframe with iterrows, then you can access multiple columns via row: ['red' if (row['Set'] == 'Z') & (row['Type'] == 'B') else 'green' for index, row in df.iterrows()]

@Paul Rougieux 2019-09-17 15:28:34

Note this nice solution will not work if you need to take replacement values from another series in the data frame, such as df['color_type'] = np.where(df['Set']=='Z', 'green', df['Type'])

@AMC 2020-02-10 01:51:14

@cheekybastard Or don't, since .iterrows() is notoriously sluggish and the DataFrame shouldn't be modified while iterating.
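(A zip-based comprehension reads the columns directly and sidesteps iterrows entirely; a minimal sketch, using the example frame from this thread:)

```python
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

# zip pairs up values from both columns row by row, without
# materializing a Series per row the way iterrows does.
df['color'] = ['red' if s == 'Z' and t == 'B' else 'green'
               for s, t in zip(df['Set'], df['Type'])]
print(df)
```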

@PythonWizard 2020-04-02 07:21:15

For me, df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']] is working.

@blacksite 2017-02-15 21:43:16

Here's yet another way to skin this cat, using a dictionary to map new values onto keys in the dataframe:

def map_values(row, values_dict):
    return values_dict[row]

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})

df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))

What it looks like:

df
Out[2]: 
  INDICATOR  VALUE  NEW_VALUE
0         A     10          1
1         B      9          2
2         C      8          3
3         D      7          4

This approach can be very powerful when you have many if/else-type statements to make (i.e. many unique values to replace).

And of course you could always do this:

df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)

But that approach is more than three times as slow as the apply approach from above, on my machine.

And you could also do this, using dict.get:

df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]

@Monica Heddneck 2018-06-21 00:16:24

I like this answer because it shows how to do multiple replacements of values

@AMC 2020-02-10 02:05:33

But that approach is more than three times as slow as the apply approach from above, on my machine. How did you benchmark these? From my quick measurements, the .map() solution is ~10 times faster than .apply().

@AMC 2020-02-10 02:18:59

Update: On 100,000,000 rows, 52 string values, .apply() takes 47 seconds, versus only 5.91 seconds for .map().

@acharuva 2014-06-06 04:43:52

Another way in which this could be achieved is

df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

@Yaakov Bressler 2019-12-08 18:25:42

Good approach; this can be memoized for better efficiency (on larger datasets), though it would require an additional step.
