By szli


2013-06-12 17:42:05 8 Comments

How to select rows from a DataFrame based on values in some column in Python Pandas?

In SQL, I would use:

SELECT *
FROM table
WHERE colume_name = some_value

I tried to look at pandas documentation but did not immediately find the answer.

10 comments

@Erfan 2019-08-03 12:05:15

More flexibility using .query with pandas >= 0.25.0:

August 2019 updated answer

Since pandas >= 0.25.0 we can use the query method to filter dataframes with pandas methods and even column names which have spaces. Normally the spaces in column names would give an error, but now we can solve that using a backtick (`) see GitHub:

# Example dataframe
df = pd.DataFrame({'Sender email':['[email protected]', "[email protected]", "[email protected]"]})

     Sender email
0  [email protected]
1  [email protected]
2    [email protected]

Using .query with method str.endswith:

df.query('`Sender email`.str.endswith("@shop.com")')

Output

     Sender email
1  [email protected]
2    [email protected]

Also we can use local variables by prefixing it with an @ in our query:

domain = 'shop.com'
df.query('`Sender email`.str.endswith(@domain)')

Output

     Sender email
1  [email protected]
2    [email protected]

@piRSquared 2017-09-11 22:14:28

There are several ways to select rows from a pandas data frame:

  1. Boolean indexing (df[df['col'] == value] )
  2. Positional indexing (df.iloc[...])
  3. Label indexing (df.xs(...))
  4. df.query(...) API

Below I show you examples of each, with advice when to use certain techniques. Assume our criterion is column 'A' == 'foo'

(Note on performance: For each base type, we can keep things simple by using the pandas API or we can venture outside the API, usually into numpy, and speed things up.)


Setup
The first thing we'll need is to identify a condition that will act as our criterion for selecting rows. We'll start with the OP's case column_name == some_value, and include some other common use cases.

Borrowing from @unutbu:

import pandas as pd, numpy as np

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

1. Boolean indexing

... Boolean indexing requires finding the true value of each row's 'A' column being equal to 'foo', then using those truth values to identify which rows to keep. Typically, we'd name this series, an array of truth values, mask. We'll do so here as well.

mask = df['A'] == 'foo'

We can then use this mask to slice or index the data frame

df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.


2. Positional indexing

Positional indexing (df.iloc[...]) has its use cases, but this isn't one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.

mask = df['A'] == 'foo'
pos = np.flatnonzero(mask)
df.iloc[pos]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

3. Label indexing

Label indexing can be very handy, but in this case, we are again doing more work for no benefit

df.set_index('A', append=True, drop=False).xs('foo', level=1)

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

4. df.query() API

pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.

df.query('A == "foo"')

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

My preference is to use the Boolean mask

Actual improvements can be made by modifying how we create our Boolean mask.

mask alternative 1
Use the underlying numpy array and forgo the overhead of creating another pd.Series

mask = df['A'].values == 'foo'

I'll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask

%timeit mask = df['A'].values == 'foo'
%timeit mask = df['A'] == 'foo'

5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Evaluating the mask with the numpy array is ~ 30 times faster. This is partly due to numpy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding pd.Series object.

Next, we'll look at the timing for slicing with one mask versus the other.

mask = df['A'].values == 'foo'
%timeit df[mask]
mask = df['A'] == 'foo'
%timeit df[mask]

219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The performance gains aren't as pronounced. We'll see if this holds up over more robust testing.


mask alternative 2
We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe—you must take care of the dtypes when doing so!

Instead of df[mask] we will do this

pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. Thus requiring the astype(df.dtypes) and killing any potential performance gains.

%timeit df[m]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, if the data frame is not of mixed type, this is a very useful way to do it.

Given

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

d1

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5    

%%timeit
mask = d1['A'].values == 7
d1[mask]

179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Versus

%%timeit
mask = d1['A'].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)

87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We cut the time in half.


mask alternative 3
@unutbu also shows us how to use pd.Series.isin to account for each element of df['A'] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely 'foo'. But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.

mask = df['A'].isin(['foo'])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

However, as before, we can utilize numpy to improve performance while sacrificing virtually nothing. We'll use np.in1d

mask = np.in1d(df['A'].values, ['foo'])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

Timing
I'll include other concepts mentioned in other posts as well for reference.
Code Below

Each Column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.

res.div(res.min())

                         10        30        100       300       1000      3000      10000     30000
mask_standard         2.156872  1.850663  2.034149  2.166312  2.164541  3.090372  2.981326  3.131151
mask_standard_loc     1.879035  1.782366  1.988823  2.338112  2.361391  3.036131  2.998112  2.990103
mask_with_values      1.010166  1.000000  1.005113  1.026363  1.028698  1.293741  1.007824  1.016919
mask_with_values_loc  1.196843  1.300228  1.000000  1.000000  1.038989  1.219233  1.037020  1.000000
query                 4.997304  4.765554  5.934096  4.500559  2.997924  2.397013  1.680447  1.398190
xs_label              4.124597  4.272363  5.596152  4.295331  4.676591  5.710680  6.032809  8.950255
mask_with_isin        1.674055  1.679935  1.847972  1.724183  1.345111  1.405231  1.253554  1.264760
mask_with_in1d        1.000000  1.083807  1.220493  1.101929  1.000000  1.000000  1.000000  1.144175

You'll notice that fastest times seem to be shared between mask_with_values and mask_with_in1d

res.T.plot(loglog=True)

enter image description here

Functions

def mask_standard(df):
    mask = df['A'] == 'foo'
    return df[mask]

def mask_standard_loc(df):
    mask = df['A'] == 'foo'
    return df.loc[mask]

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_values_loc(df):
    mask = df['A'].values == 'foo'
    return df.loc[mask]

def query(df):
    return df.query('A == "foo"')

def xs_label(df):
    return df.set_index('A', append=True, drop=False).xs('foo', level=-1)

def mask_with_isin(df):
    mask = df['A'].isin(['foo'])
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

Testing

res = pd.DataFrame(
    index=[
        'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc',
        'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d'
    ],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

for j in res.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:a
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)

Special Timing
Looking at the special case when we have a single non-object dtype for the entire data frame. Code Below

spec.div(spec.min())

                     10        30        100       300       1000      3000      10000     30000
mask_with_values  1.009030  1.000000  1.194276  1.000000  1.236892  1.095343  1.000000  1.000000
mask_with_in1d    1.104638  1.094524  1.156930  1.072094  1.000000  1.000000  1.040043  1.027100
reconstruct       1.000000  1.142838  1.000000  1.355440  1.650270  2.222181  2.294913  3.406735

Turns out, reconstruction isn't worth it past a few hundred rows.

spec.T.plot(loglog=True)

enter image description here

Functions

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

def reconstruct(df):
    v = df.values
    mask = np.in1d(df['A'].values, ['foo'])
    return pd.DataFrame(v[mask], df.index[mask], df.columns)

spec = pd.DataFrame(
    index=['mask_with_values', 'mask_with_in1d', 'reconstruct'],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

Testing

for j in spec.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in spec.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)

@posdef 2018-03-06 13:49:43

Fantastic answer! 2 questions though, i) how would .iloc(numpy.where(..)) compare in this scheme? ii) would you expect the rankings to be the same when using multiple conditions?

@jpp 2018-06-17 19:08:55

For performance of pd.Series.isin, note it does use np.in1d under the hood in a specific scenario, uses khash in others, and implicitly applies a trade-off between cost of hashing versus performance in specific situations. This answer has more detail.

@smci 2019-07-13 00:13:59

At 9 screenfuls, this is way overload for a new or even intermediate user. You can and should self-summarize the tl;dr in the first paragraph.

@user3666197 2019-08-06 10:01:20

@piRSquared Scaling matters Would you mind, @piRSquared, to also post your experience about how well the actual [{P|EXP}TIME]- and [{C|P|EXP}SPACE]-costs of using the above proposed forms of block-syntax ( processing top-down the whole dataframes at once ) grow, namely when scaled to some ~1E6, ~1E9, ~1E12 row counts? Thanks for showing us the whole picture, Sir. Quantitative benchmark readings with [min, Avg, MAX, StDev] are always welcome, as both the min and MAX values accompany the Mean/StDev relief of the batch.

@unutbu 2013-06-12 17:44:20

To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses

df['column_name'] >= A & df['column_name'] <= B

is parsed as

df['column_name'] >= (A & df['column_name']) <= B

which results in a Truth value of a Series is ambiguous error.


To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df.loc[~df['column_name'].isin(some_values)]

For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

yields

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

If you have multiple values you want to include, put them in a list (or more generally, any iterable) and use isin:

print(df.loc[df['B'].isin(['one','three'])])

yields

     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use df.loc:

df = df.set_index(['B'])
print(df.loc['one'])

yields

       A  C   D
B              
one  foo  0   0
one  bar  1   2
one  foo  6  12

or, to include multiple values from the index use df.index.isin:

df.loc[df.index.isin(['one','two'])]

yields

       A  C   D
B              
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12

@szli 2013-06-12 18:12:41

In fact, df[df['colume_name']==some_value] also works. But my first attempt, df.where(df['colume_name']==some_value) does not work... not sure why...

@unutbu 2013-06-12 18:19:53

When you use df.where(condition), the condition has to have the same shape as df.

@tremendows 2014-05-27 07:32:27

@Shane 2015-08-01 00:18:45

FYI: If you want to select a row based upon two (or more) labels (either requiring both or either), see stackoverflow.com/questions/31756340/…

@Cedric Zoppolo 2016-11-01 15:04:14

For older pandas (like version 0.9.0) you have to use df.ix[df['column_name'] == some_value] instead

@Aditya 2016-11-29 07:10:36

Is it possible to specify a regex here for the value that is checked. Suppose the value in a row for a particular column in the table is 'hello world foo bar' and I need to return this row if the string 'foo' is present in the column.

@unutbu 2016-11-29 10:11:48

@aditya: Use str.contains: e.g. df.loc[df['A'].str.contains('foo')].

@Prof Mo 2017-04-01 13:23:27

isin only works on DataFrame and not Series. So if you do x = df.loc[mask], x could be a DataFrame or a Series. Therefore, x[x['column_'name].isin(A)] could raise AttributeError: 'str' object has no attribute 'isin'

@lucianopaz 2017-12-14 11:29:37

This answer is from June 2013. In January 2014, pandas introduced the query method that works very much the same as queries in SQL. I think this answer should be updated to mention this. Actually, @fredcallaway's answer shows how the query method could be used.

@wwii 2018-04-13 14:31:04

Seems there are many Pandas ways of doing things: current Pandas 0.22.0 docs DataFrame.isin() example shows specifying column criteria with a dictionary combined with all(1) (or .any(1)) , used as an index to select rows. For this answer's example: df[df.isin({'B':['one','three']}).any(1)]. The isin example in this answer does still work.

@qqqwww 2018-06-20 21:05:21

Since df[df['column_name'] == some_value] works, why do we need add .loc here?

@unutbu 2018-06-20 21:20:08

df[[col1, col2]] returns the sub-DataFrame with columns col1 and col2. Right? Well, not always. If col1 and col2 are booleans, then df[[col1, col2]] selects rows from df. I dislike one syntax having two very different behaviors. So to make it easy for me to always know what this code is doing (just by looking at the syntax and not its values), I like to reserve df[...] for selecting columns, and df.loc[boolean_mask] for selecting rows. Since we're selecting rows here, I chose to use df.loc[...].

@unutbu 2018-06-20 21:29:03

df[boolean_mask] dispatches to df.loc[boolean_mask]. So writing df.loc[boolean_mask] is ever so slightly faster to execute than df[boolean_mask]. So there is even a minuscule performance advantage to using df.loc. But that is not the main reason I use it. It is the cognitive dissonance (my own mental foible) I wish to avoid.

@unutbu 2018-06-22 14:40:00

See this post for an example when df[boolean_mask] raises a ValueError (and fails to select rows) but df.loc[boolean_mask] works.

@L S 2018-06-30 15:09:58

I was trying to use an expression like df['x'] in ['a', 'b'], which didn't work. It's a shame pandas doesn't support regular Python operators like in and and, etc.

@unutbu 2018-06-30 15:47:22

@LS: It's not that Pandas does not support Python's in and and -- it's that Python's language specification for in and and require that they return a single boolean value. Since we want df['x'].isin(['a', 'b']) to return a Series of booleans, Pandas is forced to use a syntax other than df['x'] in ['a', 'b'].

@L S 2018-07-12 13:15:30

Thanks, @unutbu. I was under the impression that operators like in and and could be overridden (overloaded?) to work this way. That is, if Pandas' definition of DataFrame included code to override them. OTOH, since those operators are meant to return a single Boolean, doing this would break the principle of least astonishment. (As well as probably be very un-Pythonic.)

@unutbu 2018-09-13 21:29:54

@BowenLiu: In each of the expressions shown above, df.loc[mask] is used with mask being boolean-valued. To use df.iloc we would need an integer-valued index, idx, of positional (ordinal) indices. Given mask, we could obtain idx using idx = np.flatnonzero(mask), then select rows using df.iloc[idx]. Given that we are selecting rows based on some_value, I don't think there is any direct route to obtaining idx without first obtaining mask. So while it is possible, it just makes every expression longer.

@unutbu 2018-09-14 16:27:33

@BowenLiu: Correction: you could use idx = [i for i,val in enumerate(df['column_name']) if val == some_value] and then select rows with df.iloc[idx]. For small-to-medium sized DataFrames, this can be faster than using df.loc[df['column_name'] == som_value], but is slower as the length of the DataFrame increases. Since performance is mainly an issue for larger DataFrames, there is usually no point in optimizing for small dataframes. Since, in addition, df.loc[df['column_name'] == som_value] is more readable, using df.loc would be my preferred approach.

@asmgx 2019-03-09 10:11:18

Can we do it in the other way around i mean something like this df.iloc['some value'.isin(df[:])] ?

@unutbu 2019-03-09 12:35:13

@asmgx: No, because the inputs and output don't match.'some value' is a string. It has no isin method. Even if it did, Series.isin expects an iterable and returns a boolean Series. df[:] is iterable, but it just returns the column labels, not the contents of df. And df.iloc expects ordinal integers as input, not booleans. Constructing expressions is like fitting pipes together. These pipes don't fit.

@unutbu 2019-03-09 12:40:52

@asmgx: I agree with you that 'some value'.isin(df) reads more like English, but in Python everything is an object and each object has its own methods. The isin method is implemented by Series and DataFrames, but not strings, so it must be called with df[col].isin(...) or df.isin(...), not the other way around.

@cs95 2019-06-30 05:58:01

@qqqwww Because if you're not careful, you might run into the SettingWithCopyWarning.

@user3666197 2019-08-06 09:55:40

@unutbu Scaling matters Would you mind, @unutbu, to also post your experience about how well the actual [{P|EXP}TIME]- and [{C|P|EXP}SPACE]-costs of using this proposed block-syntax ( processing top-down the whole dataframes at once ) grow, namely when scaled to some ~1E6, ~1E9, ~1E12 row counts? Thanks for showing us the whole picture, Sir.

@Parth Tamane 2019-09-13 21:17:14

Is there a way to do this when the name of the column is not known but only it's index is known?

@unutbu 2019-09-13 22:56:26

@ParthTamane: Yes, use df.iloc[...] to select rows and/or columns by integer position.

@Vahidn 2018-12-07 17:38:58

You can also use .apply:

df.apply(lambda row: row[df['B'].isin(['one','three'])])

It actually works row-wise (i.e., applies the function to each row).

The output is

   A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

The results is the same as using as mentioned by @unutbu

df[[df['B'].isin(['one','three'])]]

@imolit 2015-07-08 15:17:38

tl;dr

The pandas equivalent to

select * from table where column_name = some_value

is

table[table.column_name == some_value]

Multiple conditions:

table[(table.column_name == some_value) | (table.column_name2 == some_value2)]

or

table.query('column_name == some_value | column_name2 == some_value2')

Code example

import pandas as pd

# Create data set
d = {'foo':[100, 111, 222], 
     'bar':[333, 444, 555]}
df = pd.DataFrame(d)

# Full dataframe:
df

# Shows:
#    bar   foo 
# 0  333   100
# 1  444   111
# 2  555   222

# Output only the row(s) in df where foo is 222:
df[df.foo == 222]

# Shows:
#    bar  foo
# 2  555  222

In the above code it is the line df[df.foo == 222] that gives the rows based on the column value, 222 in this case.

Multiple conditions are also possible:

df[(df.foo == 222) | (df.bar == 444)]
#    bar  foo
# 1  444  111
# 2  555  222

But at that point I would recommend using the query function, since it's less verbose and yields the same result:

df.query('foo == 222 | bar == 444')

@Berk U. 2018-04-23 17:26:14

query is the only answer here that is compatible with method chaining. It seems like it's the pandas analog to filter in dplyr.

@user2739472 2018-06-28 12:40:05

Hi, in your third example (multiple columns) I think you need square brackets [ not round brackets ( on the outside.

@O95 2018-11-07 09:32:01

at first I thought that | was for AND, but of course it is OR-operator...

@Ritwik 2019-08-09 17:42:34

for multiple conditions using AND, one can do df[condition1][condition2]

@SP001 2017-12-07 10:39:50

For selecting only specific columns out of multiple columns for a given value in pandas:

select col_name1, col_name2 from table where column_name = some_value.

Options:

df.loc[df['column_name'] == some_value][[col_name1, col_name2]]

or

df.query['column_name' == 'some_value'][[col_name1, col_name2]]

@shivsn 2017-07-05 16:34:57

Faster results can be achieved using numpy.where.

For example, with unubtu's setup -

In [76]: df.iloc[np.where(df.A.values=='foo')]
Out[76]: 
     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

Timing comparisons:

In [68]: %timeit df.iloc[np.where(df.A.values=='foo')]  # fastest
1000 loops, best of 3: 380 µs per loop

In [69]: %timeit df.loc[df['A'] == 'foo']
1000 loops, best of 3: 745 µs per loop

In [71]: %timeit df.loc[df['A'].isin(['foo'])]
1000 loops, best of 3: 562 µs per loop

In [72]: %timeit df[df.A=='foo']
1000 loops, best of 3: 796 µs per loop

In [74]: %timeit df.query('(A=="foo")')  # slowest
1000 loops, best of 3: 1.71 ms per loop

@TuanDT 2016-11-18 12:10:42

To append to this famous question (though a bit too late): You can also do df.groupby('column_name').get_group('column_desired_value').reset_index() to make a new data frame with specified column having a particular value. E.g.

import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)

b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1) 
#NOTE: the final drop is to remove the extra index column returned by groupby object
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)

Run this gives:

Original dataframe:
     A      B
0  foo    one
1  bar    one
2  foo    two
3  bar  three
4  foo    two
5  bar    two
6  foo    one
7  foo  three
Sub dataframe where B is two:
     A    B
0  foo  two
1  foo  two
2  bar  two

@fredcallaway 2016-02-09 01:36:49

I find the syntax of the previous answers to be redundant and difficult to remember. Pandas introduced the query() method in v0.13 and I much prefer it. For your question, you could do df.query('col == val')

Reproduced from http://pandas.pydata.org/pandas-docs/version/0.17.0/indexing.html#indexing-query

In [167]: n = 10

In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [169]: df
Out[169]: 
          a         b         c
0  0.687704  0.582314  0.281645
1  0.250846  0.610021  0.420121
2  0.624328  0.401816  0.932146
3  0.011763  0.022921  0.244186
4  0.590198  0.325680  0.890392
5  0.598892  0.296424  0.007312
6  0.634625  0.803069  0.123872
7  0.924168  0.325076  0.303746
8  0.116822  0.364564  0.454607
9  0.986142  0.751953  0.561512

# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

You can also access variables in the environment by prepending an @.

exclude = ('red', 'orange')
df.query('color not in @exclude')

@MERose 2016-03-13 09:16:24

You only need package numexpr installed.

@smerlung 2017-08-10 18:34:20

In my case I needed quotation because val is a string. df.query('col == "val"')

@DataByDavid 2013-06-13 11:49:00

Here is a simple example

from pandas import DataFrame

# Create data set
d = {'Revenue':[100,111,222], 
     'Cost':[333,444,555]}
df = DataFrame(d)


# mask = Return True when the value in column "Revenue" is equal to 111
mask = df['Revenue'] == 111

print mask

# Result:
# 0    False
# 1     True
# 2    False
# Name: Revenue, dtype: bool


# Select * FROM df WHERE Revenue = 111
df[mask]

# Result:
#    Cost    Revenue
# 1  444     111

Related Questions

Sponsored Content

19 Answered Questions

[SOLVED] How to iterate over rows in a DataFrame in Pandas?

23 Answered Questions

[SOLVED] Adding new column to existing DataFrame in Python pandas

10 Answered Questions

[SOLVED] Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3?

13 Answered Questions

[SOLVED] Delete column from pandas DataFrame

19 Answered Questions

[SOLVED] Add one row to pandas DataFrame

18 Answered Questions

[SOLVED] Get list from pandas DataFrame column headers

15 Answered Questions

[SOLVED] Selecting multiple columns in a pandas dataframe

14 Answered Questions

[SOLVED] "Large data" work flows using pandas

13 Answered Questions

[SOLVED] How to join (merge) data frames (inner, outer, left, right)

Sponsored Content