By euforia


2012-07-05 18:57:34 8 Comments

I have a DataFrame with 4 columns of which 2 contain string values. I was wondering if there was a way to select rows based on a partial string match against a particular column?

In other words, a function or lambda function that would do something like

re.search(pattern, cell_in_question) 

returning a boolean. I am familiar with the syntax of df[df['A'] == "hello world"] but can't seem to find a way to do the same with a partial string match say 'hello'.

Would someone be able to point me in the right direction?

@cs95 2019-03-25 09:58:20

How do I select by partial string from a pandas DataFrame?

This post is meant for readers who want to

  • search for a substring in a string column (the simplest case)
  • search for multiple substrings (similar to isin)
  • match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay")
  • match multiple whole words

...and would like to know more about what methods should be preferred over others.

(P.S.: I've seen a lot of questions on similar topics, I thought it would be good to leave this here.)


Basic Substring Search

df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1

      col
0     foo
1  foobar
2     bar
3     baz

To select all rows containing "foo", use str.contains:

df1[df1['col'].str.contains('foo')]

      col
0     foo
1  foobar

Note that this is a pure substring search, so you can safely disable regex-based matching.

df1[df1['col'].str.contains('foo', regex=False)]

      col
0     foo
1  foobar

Performance-wise, this makes a difference.

df2 = pd.concat([df1] * 1000, ignore_index=True)

%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]

6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Avoid using regex-based search if you don't need it.

Note
Partial substring searches that are anchored at the start or end of strings can be done using str.startswith or str.endswith respectively.

Additionally, for regex-based searches anchored at the start, use str.match.
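For example, on the df1 defined above, the anchored variants behave like this (a quick sketch):

```python
import pandas as pd

df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})

# rows whose strings start with "foo"
print(df1[df1['col'].str.startswith('foo')])   # foo, foobar
# rows whose strings end with "bar"
print(df1[df1['col'].str.endswith('bar')])     # foobar, bar
# regex match anchored at the start of each string
print(df1[df1['col'].str.match('ba')])         # bar, baz
```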

Regex-based Search
Most str methods support regular expressions. For example, to find rows in df1 which contain "foo" followed by something else, we can use

df1[df1['col'].str.contains(r'foo(?!$)')]

      col
1  foobar

Multiple Substring Search

This is most easily achieved through a regex search using the regex OR pipe.

# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4

          col
0     foo abc
1  foobar xyz
2       bar32
3      baz 45

df4[df4['col'].str.contains(r'foo|baz')]

          col
0     foo abc
1  foobar xyz
3      baz 45

You can also create a list of terms, then join them:

terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]

          col
0     foo abc
1  foobar xyz
3      baz 45

Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters...

. ^ $ * + ? { } [ ] \ | ( )

Then, you'll need to use re.escape to escape them:

import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]

          col
0     foo abc
1  foobar xyz
3      baz 45

re.escape has the effect of escaping the special characters so they're treated literally.

re.escape(r'.foo^')
# '\\.foo\\^'
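To see why escaping matters, consider a made-up frame where one search term contains a metacharacter:

```python
import re
import pandas as pd

df = pd.DataFrame({'col': ['foo.bar', 'fooxbar', 'baz']})
terms = ['foo.bar']  # "." is a regex metacharacter

# Unescaped: "." matches ANY character, so 'fooxbar' also matches.
unescaped = df[df['col'].str.contains('|'.join(terms))]
# Escaped: only the literal string 'foo.bar' matches.
escaped = df[df['col'].str.contains('|'.join(map(re.escape, terms)))]
print(unescaped)
print(escaped)
```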

Matching Entire Word(s)

By default, the substring search searches for the specified substring/pattern regardless of whether it is full word or not. To only match full words, we will need to make use of regular expressions here—in particular, our pattern will need to specify word boundaries (\b).

For example,

df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3

                     col
0        the sky is blue
1  bluejay by the window

Now consider,

df3[df3['col'].str.contains('blue')]

                     col
0        the sky is blue
1  bluejay by the window

v/s

df3[df3['col'].str.contains(r'\bblue\b')]

               col
0  the sky is blue

Multiple Whole Word Search

Similar to the above, except we add a word boundary (\b) to the joined pattern.

p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]

       col
0  foo abc
3   baz 45

Where p looks like this,

p
# '\\b(?:foo|baz)\\b'

A Great Alternative: Use List Comprehensions!

Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.

Instead of,

df1[df1['col'].str.contains('foo', regex=False)]

Use the in operator inside a list comp,

df1[['foo' in x for x in df1['col']]]

      col
0     foo
1  foobar

Instead of,

regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]

Use re.compile (to cache your regex) + Pattern.search inside a list comp,

p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]

      col
1  foobar

If "col" has NaNs, then instead of

df1[df1['col'].str.contains(regex_pattern, na=False)]

Use,

def try_search(p, x):
    try:
        return bool(p.search(x))
    except TypeError:
        return False

p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]

      col
1  foobar

More Options for Partial String Matching: np.char.find, np.vectorize, DataFrame.query.

In addition to str.contains and list comprehensions, you can also use the following alternatives.

np.char.find
Supports substring searches (read: no regex) only.

df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]

          col
0     foo abc
1  foobar xyz

np.vectorize
This is a wrapper around a loop, but with less overhead than most pandas str methods.

f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True,  True, False, False])

df1[f(df1['col'], 'foo')]

      col
0     foo
1  foobar

Regex solutions possible:

regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]

      col
1  foobar

DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.

df1.query('col.str.contains("foo")', engine='python')

      col
0     foo
1  foobar
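As a sketch of dynamic query generation (the helper name filter_contains is made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})

def filter_contains(df, column, substring):
    # The query string is built at runtime; "@" references the local
    # variable rather than splicing the substring into the expression.
    return df.query(f'{column}.str.contains(@substring)', engine='python')

print(filter_contains(df1, 'col', 'foo'))  # foo, foobar
```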

More information on the query and eval family of methods can be found at Dynamic Expression Evaluation in pandas using pd.eval().


Recommended Usage Precedence

  1. (First) str.contains, for its simplicity
  2. List comprehensions, for their performance
  3. np.vectorize
  4. (Last) df.query

@Philipp Schwarz 2016-10-12 21:04:32

If anyone wonders how to solve a related problem: "Select column by partial string"

Use:

df.filter(like='hello')  # select columns which contain the word hello

And to select rows by partial string matching, pass axis=0 to filter:

# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)  

@elPastor 2017-06-17 21:53:57

This can be distilled to: df.loc[:, df.columns.str.contains('a')]

@Ted Petrou 2017-10-25 02:57:07

which can be further distilled to df.filter(like='a')

@Garrett 2012-07-17 21:52:18

Based on github issue #620, it looks like you'll soon be able to do the following:

df[df['A'].str.contains("hello")]

Update: vectorized string methods (i.e., Series.str) are available in pandas 0.8.1 and up.

@Wes McKinney 2012-08-08 01:57:55

This is implemented now

@LonelySoul 2013-06-27 16:41:13

How do I search for "Hello" and "Britain" with an "OR" condition?

@Garrett 2013-06-27 19:20:45

Since str.* methods treat the input pattern as a regular expression, you can use df[df['A'].str.contains("Hello|Britain")]

@zyxue 2017-03-01 17:25:43

Is it possible to convert .str.contains to use the .query() API?

@François Leblanc 2018-02-13 20:22:43

Use df[df['value'].astype(str).str.contains('1234.+')] to avoid errors when the column also contains non-string values.

@NeStack 2018-11-30 17:32:40

What about an "AND" condition - how do you look for multiple strings at the same time, so not just "hello", but ["hello", "hey", "hi"]?

@Garrett 2018-11-30 19:35:34

To "AND" substrings when the order of the substrings is important/known, you can use df[df.A.str.contains("STR1.*STR2")]. If the order is unimportant/unknown, use df[df.A.str.contains("STR1") & df.A.str.contains("STR2")].
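A small sketch of both variants, with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['hello world', 'world hello', 'hello only']})

# Order-sensitive AND: "hello" must appear before "world"
ordered = df[df.A.str.contains('hello.*world')]
# Order-independent AND: both substrings anywhere in the string
both = df[df.A.str.contains('hello') & df.A.str.contains('world')]
print(ordered)  # 'hello world' only
print(both)     # 'hello world' and 'world hello'
```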

@cs95 2019-03-25 20:20:10

@NeStack I've added more information about multiple substring searches here.

@Mike 2014-11-10 19:26:27

Say you have the following DataFrame:

>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
       a            b
0  hello  hello world
1   abcd         defg

You can always use the in operator in a lambda expression to create your filter.

>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0     True
1    False
dtype: bool

The trick here is to use the axis=1 option in the apply to pass elements to the lambda function row by row, as opposed to column by column.
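The resulting boolean Series can then be passed straight back to the indexer to filter the rows:

```python
import pandas as pd

df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']],
                  columns=['a', 'b'])

# Row-wise check: is the value in 'a' a substring of the value in 'b'?
mask = df.apply(lambda x: x['a'] in x['b'], axis=1)
print(df[mask])  # keeps only the first row
```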

@ComplexData 2016-10-18 20:23:28

How do I modify the above so that x['a'] matches only at the beginning of x['b']?

@cs95 2019-03-25 10:27:59

apply is a bad idea here in terms of performance and memory. See this answer.

@sharon 2014-11-10 17:05:17

I am using pandas 0.14.1 on macos in ipython notebook. I tried the proposed line above:

df[df['A'].str.contains("Hello|Britain")]

and got an error:

"cannot index with vector containing NA / NaN values"

but it worked perfectly when an "==True" condition was added, like this:

df[df['A'].str.contains("Hello|Britain")==True]

@joshlk 2015-07-02 14:00:41

Or you can do: df[df['A'].str.contains("Hello|Britain", na=False)]

@Dirk 2019-05-20 13:27:27

Both with and without ==True worked fine for me.

@Christian 2014-04-10 15:36:14

Quick note: if you want to do selection based on a partial string contained in the index, try the following:

df['stridx']=df.index
df[df['stridx'].str.contains("Hello|Britain")]

@Yury Bayda 2015-05-08 21:27:16

You can just use df[df.index.to_series().str.contains('LLChit')]

@euforia 2012-07-06 17:08:46

Here's what I ended up doing for partial string matches. If anyone has a more efficient way of doing this please let me know.

import re
import pandas as pd

def stringSearchColumn_DataFrame(df, colName, regex):
    newdf = pd.DataFrame()
    for idx, record in df[colName].iteritems():
        if re.search(regex, record):
            newdf = pd.concat([df[df[colName] == record], newdf], ignore_index=True)
    return newdf

@MarkokraM 2014-04-10 13:56:53

Should be 2x to 3x faster if you compile the regex before the loop: regex = re.compile(regex), and then use if regex.search(record)

@Teepeemm 2018-06-20 19:36:04

@MarkokraM docs.python.org/3.6/library/re.html#re.compile says that the most recent regexs are cached for you, so you don't need to compile yourself.

@cs95 2019-03-25 10:26:10

Do not use iteritems to iterate over a DataFrame; it ranks last in terms of pandorability and performance.
