By TheChymera


2013-11-27 18:56:58 8 Comments

I have a dictionary which looks like this: di = {1: "A", 2: "B"}

I would like to apply it to the "col1" column of a dataframe similar to:

     col1   col2
0       w      a
1       1      2
2       2    NaN

to get:

     col1   col2
0       w      a
1       A      2
2       B    NaN

How can I best do this? For some reason googling terms relating to this only shows me links about how to make columns from dicts and vice-versa :-/


@dorien 2019-05-17 09:14:54

A nice complete solution that keeps a map of your class labels:

labels = features['col1'].unique()
labels_dict = dict(zip(labels, range(len(labels))))
features = features.replace({"col1": labels_dict})

This way, you can at any point refer to the original class label from labels_dict.
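
A minimal sketch of that lookup (the inverse dictionary below is my own addition, not part of the snippet above):

inverse_labels = {code: label for label, code in labels_dict.items()}
inverse_labels[0]   # recovers the class label that was encoded as 0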

@JohnE 2017-01-16 14:40:56

map can be much faster than replace

If your dictionary has more than a couple of keys, using map can be much faster than replace. There are two versions of this approach, depending on whether your dictionary exhaustively maps all possible values (and also whether you want non-matches to keep their values or be converted to NaNs):

Exhaustive Mapping

In this case, the form is very simple:

df['col1'].map(di)       # note: if the dictionary does not exhaustively map all
                         # entries then non-matched entries are changed to NaNs

Although map most commonly takes a function as its argument, it can alternatively take a dictionary or Series: see the documentation for pandas.Series.map.

Non-Exhaustive Mapping

If you have a non-exhaustive mapping and wish to retain the existing variables for non-matches, you can add fillna:

df['col1'].map(di).fillna(df['col1'])

as in @jpp's answer here: Replace values in a pandas series via dictionary efficiently
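
Applied to the OP's frame, where 'w' has no entry in di, this keeps 'w' intact while still mapping 1 and 2:

df['col1'].map(di).fillna(df['col1'])
# 0    w
# 1    A
# 2    B
# Name: col1, dtype: object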

Benchmarks

Using the following data with pandas version 0.23.1:

di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H"}
df = pd.DataFrame({'col1': np.random.choice(range(1, 9), 100000)})

and testing with %timeit, it appears that map is approximately 10x faster than replace.
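
The timing calls themselves aren't shown above; a minimal sketch of how such a comparison might be run (my sketch, not the original benchmark code; exact numbers will vary with your data and pandas version):

%timeit df['col1'].map(di)
%timeit df.replace({'col1': di})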

Note that your speedup with map will vary with your data. The largest speedup appears to be with large dictionaries and exhaustive replaces. See @jpp's answer (linked above) for more extensive benchmarks and discussion.

@griffinc 2017-05-11 03:29:52

The last block of code for this answer is certainly not the most elegant, but this answer deserves some credit. It is orders of magnitude faster for large dictionaries and doesn't use up all of my RAM. It remapped a 10,000 line file using a dictionary that had about 9 million entries in half a minute. The df.replace function, while tidy and useful for small dicts, crashed after running for 20 minutes or so.

@JohnE 2018-07-03 10:40:52

@griffinc Thanks for the feedback and note that I have since updated this answer with a much simpler way to do the non-exhaustive case (thanks to @jpp)

@StatsScared 2018-08-27 18:14:40

Is it correct to say that map, unlike replace, changes all values in a column and not only those you are interested in changing?

@JohnE 2018-08-27 18:48:51

@StatsScared Either one will only change the value if it matches the key in the dictionary. Note that map + fillna as shown above behaves exactly the same as replace

@Max Ghenis 2019-03-15 16:42:46

map also works on an Index, whereas I couldn't figure out a way to do that with replace.
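
A minimal sketch of what that might look like (my own example, not from the comment):

df.index.map(di)   # works directly on the Index; entries missing from di become NaN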

@DSM 2013-11-27 19:06:53

You can use .replace. For example:

>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
  col1 col2
0    w    a
1    1    2
2    2  NaN
>>> df.replace({"col1": di})
  col1 col2
0    w    a
1    A    2
2    B  NaN

or directly on the Series, i.e. df["col1"].replace(di, inplace=True).
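
A caveat worth adding (my note, not part of the original 2013 answer): in recent pandas versions, an inplace replace on a selected column may not propagate back to the DataFrame under copy-on-write, so assigning the result back is the more robust pattern:

df['col1'] = df['col1'].replace(di)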

@Pengju Zhao 2017-08-02 04:54:25

It doesn't work for me when the values in col1 are tuples. The error is: Cannot compare types 'ndarray(dtype=object)' and 'tuple'.

@PrestonH 2017-11-21 17:01:23

It looks like this doesn't work anymore at all, which isn't surprising given the answer was from 4 years ago. This question needs a new answer given how general the operation is...

@Dan 2017-12-06 09:47:55

@PrestonH It works perfectly for me. Running: '3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]'

@famargar 2018-01-03 10:52:56

It works for me. But what if I want to replace values in ALL columns?

@Dirigo 2018-03-09 18:17:11

The only method that worked for me of the answers shown was to do a direct replacement on the Series. Thanks!

@GuSuku 2018-12-14 05:27:18

@PengjuZhao is right: replace does not work for me when the values to replace are tuples, neither on the DataFrame nor directly on the Series. Only map worked for me in this case.
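
A small sketch of the map route for tuple values (my own example, not from the thread):

di_tuples = {(1, 2): "A", (3, 4): "B"}
df['col1'].map(lambda x: di_tuples.get(x, x))   # rows whose tuple has no entry keep their value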

@U9-Forward 2018-09-16 00:31:56

Or do apply:

df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))

Demo:

>>> df['col1'] = df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x, x))
>>> df
  col1 col2
0    w    a
1    A    2
2    B  NaN

@FaCoffee 2019-02-12 12:53:27

What happens when your di dict is a dict of lists? How can you map only one value in the list?

@wordsforthewise 2018-08-08 16:43:57

DSM has the accepted answer, but the code doesn't seem to work for everyone. Here is one that works with the current version of pandas (0.23.4 as of 8/2018):

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2, 3, 1],
                   'col2': ['negative', 'positive', 'neutral', 'neutral', 'positive']})

conversion_dict = {'negative': -1, 'neutral': 0, 'positive': 1}
df['converted_column'] = df['col2'].replace(conversion_dict)

print(df.head())

You'll see it looks like:

   col1      col2  converted_column
0     1  negative                -1
1     2  positive                 1
2     2   neutral                 0
3     3   neutral                 0
4     1  positive                 1

See the documentation for pandas.DataFrame.replace for details.

@JohnE 2018-08-31 08:42:06

I never had a problem getting DSM's answer to run, and I'd guess, given the high vote total, most other people didn't either. You might want to be more specific about the problem you are having. Maybe it has to do with your sample data, which is different from DSM's?

@wordsforthewise 2018-09-02 01:10:53

Hmm, perhaps a versioning issue. Nevertheless, both answers are here now.

@Amirhos Imani 2017-12-29 17:34:35

A more general, regex-based approach is to apply a replacement function, as below:

import re

def multiple_replace(mapping, text):
  # Build one regular expression that matches any of the dictionary keys
  regex = re.compile("(%s)" % "|".join(map(re.escape, mapping.keys())))

  # For each match, look up the corresponding value in the dictionary
  return regex.sub(lambda mo: mapping[mo.group()], text)

Once you have defined the function, you can apply it to your DataFrame. Note that this operates on strings, so both the dictionary keys and the column values need to be strings:

di = {"1": "A", "2": "B"}
df['col1'] = df.apply(lambda row: multiple_replace(di, str(row['col1'])), axis=1)

@Nico Coallier 2017-12-06 18:37:45

Adding to this, if you ever have more than one column to remap in a DataFrame:

def remap(data, dict_labels):
    """
    This function takes in a dictionary of labels, dict_labels,
    and replaces the (previously label-encoded) values with their string labels.

    ex: dict_labels = {'col1': {1: 'A', 2: 'B'}}
    """
    for field, values in dict_labels.items():
        print("I am remapping %s" % field)
        data.replace({field: values}, inplace=True)
    print("DONE")

    return data
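
For example, with the mapping from the docstring:

dict_labels = {'col1': {1: 'A', 2: 'B'}}
df = remap(df, dict_labels)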

Hope it can be useful to someone.

Cheers

@unutbu 2013-11-27 19:04:34

There is a bit of ambiguity in your question. There are at least three interpretations:

  1. the keys in di refer to index values
  2. the keys in di refer to df['col1'] values
  3. the keys in di refer to index locations (not the OP's question, but thrown in for fun.)

Below is a solution for each case.


Case 1: If the keys of di are meant to refer to index values, then you could use the update method:

df['col1'].update(pd.Series(di))

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1,2,0])
#   col1 col2
# 1    w    a
# 2   10   30
# 0   20  NaN

di = {0: "A", 2: "B"}

# The value at the 0-index is mapped to 'A', the value at the 2-index is mapped to 'B'
df['col1'].update(pd.Series(di))
print(df)

yields

  col1 col2
1    w    a
2    B   30
0    A  NaN

I've modified the values from your original post so it is clearer what update is doing. Note how the keys in di are associated with index values. The order of the index values -- that is, the index locations -- does not matter.


Case 2: If the keys in di refer to df['col1'] values, then @DanAllan and @DSM show how to achieve this with replace:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1,2,0])
print(df)
#   col1 col2
# 1    w    a
# 2   10   30
# 0   20  NaN

di = {10: "A", 20: "B"}

# The values 10 and 20 are replaced by 'A' and 'B'
df['col1'].replace(di, inplace=True)
print(df)

yields

  col1 col2
1    w    a
2    A   30
0    B  NaN

Note how in this case the keys in di were changed to match values in df['col1'].


Case 3: If the keys in di refer to index locations, then you could use

df['col1'].put(di.keys(), di.values())

since

df = pd.DataFrame({'col1':['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1,2,0])
di = {0: "A", 2: "B"}

# The values at the 0 and 2 index locations are replaced by 'A' and 'B'
df['col1'].put(di.keys(), di.values())
print(df)

yields

  col1 col2
1    A    a
2   10   30
0    B  NaN

Here, the first and third rows were altered, because the keys in di are 0 and 2, which with Python's 0-based indexing refer to the first and third locations.
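
As an aside (my addition, not part of the original answer): Series.put has since been deprecated and removed from pandas, so in current versions a positional assignment along these lines would play the same role:

df.iloc[list(di.keys()), df.columns.get_loc('col1')] = list(di.values())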

@Dan Allan 2013-11-27 19:06:58

replace is equally good, and maybe a better word for what is happening here.

@Dan Allan 2013-11-27 19:11:10

@DSM took care of it.

@DSM 2013-11-27 20:35:32

Doesn't the OP's posted target dataframe eliminate the ambiguity? Still, this answer is useful, so +1.

@unutbu 2013-11-27 20:47:38

@DSM: Oops, you are right, there is no possibility of Case 3, but I don't think the OP's target dataframe distinguishes Case 1 from Case 2, since the index values equal the column values.

@DSM 2013-11-27 20:59:05

@unutbu: ah, point taken.
