By LondonRob


2013-11-13 17:11:07 8 Comments

How can I achieve the equivalents of SQL's IN and NOT IN?

I have a list with the required values. Here's the scenario:

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']

# pseudo-code:
df[df['countries'] not in countries]

My current way of doing this is as follows:

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = pd.DataFrame({'countries':['UK','China'], 'matched':True})

# IN
df.merge(countries,how='inner',on='countries')

# NOT IN
not_in = df.merge(countries,how='left',on='countries')
not_in = not_in[pd.isnull(not_in['matched'])]

But this seems like a horrible kludge. Can anyone improve on it?

6 comments

@Sam Henderson 2017-07-13 03:12:17

I wanted to filter out dfbc rows that had a BUSINESS_ID that was also in the BUSINESS_ID of dfProfilesBusIds

dfbc = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]

@cricket_007 2017-07-19 12:17:45

You can negate the isin (as done in the accepted answer) rather than comparing to False

@MaxU 2017-07-19 12:19:40

Alternative solution that uses .query() method:

In [5]: df.query("countries in @countries")
Out[5]:
  countries
1        UK
3     China

In [6]: df.query("countries not in @countries")
Out[6]:
  countries
0        US
2   Germany

@LondonRob 2017-07-19 14:49:26

Note that this is currently marked as "experimental" in the docs...

@cs95 2019-04-07 00:43:44

How to implement in and not in for a pandas DataFrame?

Pandas offers two methods: Series.isin and DataFrame.isin for Series and DataFrames, respectively. Here is the mapping of the titular python operators to their equivalent pandas operations.

╒════════╤══════════════════════╤══════════════════════╕
│        │ Python               │ Pandas               │
╞════════╪══════════════════════╪══════════════════════╡
│ in     │ item in sequence     │ sequence.isin(item)  │
├────────┼──────────────────────┼──────────────────────┤
│ not in │ item not in sequence │ ~sequence.isin(item) │
╘════════╧══════════════════════╧══════════════════════╛

To implement "not in", you must invert the result of isin.

Also note that, in the pandas case, "sequence" can either refer to a Series or DataFrame, and "item" can itself be an iterable (more on this soon).


Filter DataFrame Based on ONE Column (also applies to Series)

The most common scenario is applying an isin condition on a specific column to filter rows in a DataFrame.

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
df
  countries
0        US
1        UK
2   Germany
3     China

c1 = ['UK', 'China']             # list
c2 = {'Germany'}                 # set
c3 = pd.Series(['China', 'US'])  # Series
c4 = np.array(['US', 'UK'])      # array

Series.isin accepts various types as inputs. The following are all valid ways of getting what you want:

df['countries'].isin(c1)

0    False
1     True
2    False
3    False
4     True
Name: countries, dtype: bool

# `in` operation
df[df['countries'].isin(c1)]

  countries
1        UK
4     China

# `not in` operation
df[~df['countries'].isin(c1)]

  countries
0        US
2   Germany
3       NaN

# Filter with `set` (tuples work too)
df[df['countries'].isin(c2)]

  countries
2   Germany

# Filter with another Series
df[df['countries'].isin(c3)]

  countries
0        US
4     China

# Filter with array
df[df['countries'].isin(c4)]

  countries
0        US
1        UK

Filter on MANY Columns

Sometimes, you will want to apply an 'in' membership check with some search terms over multiple columns,

df2 = pd.DataFrame({
    'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})
df2

   A    B  C
0  x    w  0
1  y    a  1
2  z  NaN  2
3  q    x  3

c1 = ['x', 'w', 'p']

To apply the isin condition to both columns "A" and "B", use DataFrame.isin:

df2[['A', 'B']].isin(c1)

      A      B
0   True   True
1  False  False
2  False  False
3  False   True

From this, to retain rows where at least one column is True, we can use any along the first axis:

df2[['A', 'B']].isin(c1).any(axis=1)

0     True
1    False
2    False
3     True
dtype: bool

df2[df2[['A', 'B']].isin(c1).any(axis=1)]

   A  B  C
0  x  w  0
3  q  x  3

Similarly, to retain rows where ALL columns are True, use all in the same manner as before.

df2[df2[['A', 'B']].isin(c1).all(axis=1)]

   A  B  C
0  x  w  0

Notable Mentions: numpy.isin, query, list comprehensions (string data)

In addition to the methods described above, you can also use the numpy equivalent: numpy.isin.

# `in` operation
df[np.isin(df['countries'], c1)]

  countries
1        UK
4     China

# `not in` operation
df[np.isin(df['countries'], c1, invert=True)]

  countries
0        US
2   Germany
3       NaN

Why is it worth considering? NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead. Since this is an elementwise operation that does not depend on index alignment, there are very few situations where this method is not an appropriate replacement for pandas' isin.

Pandas routines are usually iterative when working with strings, because string operations are hard to vectorise. There is a lot of evidence to suggest that list comprehensions will be faster here.. We resort to an in check now.

c1_set = set(c1) # Using `in` with `sets` is a constant time operation... 
                 # This doesn't matter for pandas because the implementation differs.
# `in` operation
df[[x in c1_set for x in df['countries']]]

  countries
1        UK
4     China

# `not in` operation
df[[x not in c1_set for x in df['countries']]]

  countries
0        US
2   Germany
3       NaN

It is a lot more unwieldy to specify, however, so don't use it unless you know what you're doing.

Lastly, there's also DataFrame.query which has been covered in this answer. numexpr FTW!

@WeNYoBen 2019-04-07 01:45:11

Nice answer and do not why got down vote

@LondonRob 2019-04-17 11:42:52

This is pretty hard to beat for completeness. Enjoyed the discussion of the different methods and, crucially, when each might be efficient.

@DSM 2013-11-13 17:13:39

You can use pd.Series.isin.

For "IN": somewhere.isin(something) (read: is something there in somewhere?)

Or for "NOT IN": ~somewhere.isin(something)

As an example:

>>> df
  countries
0        US
1        UK
2   Germany
3     China
>>> countries
['UK', 'China']
>>> df.countries.isin(countries)
0    False
1     True
2    False
3     True
Name: countries, dtype: bool
>>> df[df.countries.isin(countries)]
  countries
1        UK
3     China
>>> df[~df.countries.isin(countries)]
  countries
0        US
2   Germany

@Kos 2013-11-13 17:15:09

isin is not inverse sin()? :D

@TomAugspurger 2013-11-13 18:07:15

Just an FYI, the @LondonRob had his as a DataFrame and yours is a Series. DataFrame's isin was added in .13.

@LondonRob 2013-11-13 18:41:17

Any suggestions for how to do this with pandas 0.12.0? It's the current released version. (Maybe I should just wait for 0.13?!)

@TomAugspurger 2013-11-13 19:41:07

If you're actually dealing with 1-dimensional arrays (like in you're example) then on you're first line use a Series instead of a DataFrame, like @DSM used: df = pd.Series({'countries':['US','UK','Germany','China']})

@DSM 2013-11-14 16:10:41

@TomAugspurger: like usual, I'm probably missing something. df, both mine and his, is a DataFrame. countries is a list. df[~df.countries.isin(countries)] produces a DataFrame, not a Series, and seems to work even back in 0.11.0.dev-14a04dd.

@Arthur D. Howland 2018-04-09 17:39:46

is it possible to use "is not in" for IF statements, example: if '*' ~in filename:

@ifly6 2018-05-18 22:20:18

This answer is confusing because you keep reusing the countries variable. Well, the OP does it, and that's inherited, but that something is done badly before does not justify doing it badly now.

@Vlady Veselinov 2018-05-28 16:05:03

if only Python libraries had regard for its own naming standards @kos

@Mr_and_Mrs_D 2018-09-21 20:53:44

Should countries = ['UK','China'] be a set here? Does it make any difference?

@Ioannis Nasios 2018-04-04 11:51:01

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']

implement in:

df[df.countries.isin(countries)]

implement not in as in of rest countries:

df[df.countries.isin([x for x in np.unique(df.countries) if x not in countries])]

@Kos 2013-11-13 17:14:32

I've been usually doing generic filtering over rows like this:

criterion = lambda row: row['countries'] not in countries
not_in = df[df.apply(criterion, axis=1)]

@Jeff 2013-11-13 17:47:58

FYI, this is much slower than @DSM soln which is vectorized

@Kos 2013-11-14 07:42:50

@Jeff I'd expect that, but that's what I fall back to when I need to filter over something unavailable in pandas directly. (I was about to say "like .startwith or regex matching, but just found out about Series.str that has all of that!)

Related Questions

Sponsored Content

16 Answered Questions

[SOLVED] Selecting multiple columns in a pandas dataframe

33 Answered Questions

[SOLVED] Renaming columns in pandas

18 Answered Questions

[SOLVED] Get list from pandas DataFrame column headers

38 Answered Questions

[SOLVED] How do I check whether a file exists without exceptions?

17 Answered Questions

[SOLVED] How to iterate over rows in a DataFrame in Pandas?

14 Answered Questions

[SOLVED] Select rows from a DataFrame based on values in a column in pandas

39 Answered Questions

[SOLVED] How to merge two dictionaries in a single expression?

23 Answered Questions

[SOLVED] Adding new column to existing DataFrame in Python pandas

18 Answered Questions

[SOLVED] Add one row to pandas DataFrame

15 Answered Questions

[SOLVED] Delete column from pandas DataFrame by column name

Sponsored Content