By Roman


2013-05-10 07:04:49 8 Comments

I have a DataFrame from pandas:

import pandas as pd
inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
df = pd.DataFrame(inp)
print(df)

Output:

   c1   c2
0  10  100
1  11  110
2  12  120

Now I want to iterate over the rows of this frame. For every row I want to be able to access its elements (values in cells) by the name of the columns. For example:

for row in df.rows:
    print(row['c1'], row['c2'])

Is it possible to do that in pandas?

I found this similar question. But it does not give me the answer I need. For example, it is suggested there to use:

for date, row in df.T.iteritems():

or

for row in df.iterrows():

But I do not understand what the row object is and how I can work with it.


@cs95 2019-04-07 10:03:54

Q: How to iterate over rows in a DataFrame in Pandas?

Don't!

Iteration in pandas is an anti-pattern, and is something you should only want to do when you have exhausted every other option possible. You should not consider using any function with "iter" in its name for anything more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().

Do you want to compute something? In that case, search for methods in this order (list modified from here):

  1. Vectorization
  2. Cython routines
  3. List Comprehensions (for loop)
  4. DataFrame.apply()
    i.  Reductions that can be performed in cython
    ii. Iteration in python space
  5. DataFrame.itertuples() and iteritems()
  6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used only in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority
The docs page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].


Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom cython extensions.
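As a rough illustration of what "vectorised" means here, the sketch below adds two columns of the question's frame in one expression and compares that with an explicit loop. The column name total and the variable looped are made up for this example.

```python
import pandas as pd

# Same data as the frame in the question.
df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# Vectorised: a single expression operates on whole columns at once.
df['total'] = df['c1'] + df['c2']

# The same computation as an explicit loop, shown only for comparison.
looped = [row.c1 + row.c2 for row in df.itertuples(index=False)]

print(df['total'].tolist())
print(looped)
```

Both approaches give the same numbers; the vectorised form simply avoids running Python code once per row.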


Next Best Thing: List Comprehensions

If you are iterating because there is no vectorized solution available, and performance is important (but not important enough to go through the hassle of cythonizing your code), use a list comprehension as the next best/simplest option.

To iterate over rows using a single column, use

result = [f(x) for x in df['col']]

To iterate over rows using multiple columns, you can use

# two column format
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]

# many column format
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].values]

If you need an integer row index while iterating, use enumerate:

result = [f(...) for i, row in enumerate(df[...].values)]

(where df.index[i] gets you the index label.)

If you can turn it into a function, you can use list comprehension. You can make arbitrarily complex things work through the simplicity and speed of raw python.
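The patterns above can be sketched with concrete names. This is only an illustration: the function f and the result names doubled, combined and labelled are hypothetical, and the frame is the one from the question.

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

def f(x, y):
    # Hypothetical per-row function; any plain Python callable works here.
    return x * 2 + y

# Single column.
doubled = [x * 2 for x in df['c1']]

# Two columns, paired up with zip.
combined = [f(x, y) for x, y in zip(df['c1'], df['c2'])]

# Several columns: loop over the underlying NumPy array, using enumerate
# for the integer position and df.index[i] for the index label.
labelled = [(df.index[i], f(row[0], row[1]))
            for i, row in enumerate(df[['c1', 'c2']].to_numpy())]

print(doubled)
print(combined)
print(labelled)
```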

@Grag2015 2017-11-02 10:33:40

for ind in df.index:
    print(df['c1'][ind], df['c2'][ind])

@Bazyli Debowski 2018-09-10 12:41:05

how is the performance of this option when used on a large dataframe (millions of rows for example)?

@Grag2015 2018-10-25 13:52:28

Honestly, I don’t know exactly. I think the elapsed time will be about the same as for the best answer, because both cases use a "for" construct. Memory use may differ in some cases, though.

@cs95 2019-04-18 23:19:43

This is chained indexing. Do not use this!

@Justin Malinchak 2018-07-10 15:05:42

Why complicate things?

Simple.

import pandas as pd
import numpy as np

# Here is an example dataframe
df_existing = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

for idx,row in df_existing.iterrows():
    print(row['A'], row['B'], row['C'], row['D'])

@moi 2018-07-30 07:39:53

How is this different than the accepted answer??

@Justin Malinchak 2018-11-02 18:21:13

I guess I prefer it when a coder can quickly copy the entire code block, run it, and have it parse fine. The accepted answer requires piecing blocks together. This is a timesaver.

@Zach 2018-06-27 18:48:28

Sometimes a useful pattern is:

import pandas as pd

# Borrowing @KutalmisB df example
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
# The to_dict call results in a list of dicts
# where each row_dict is a dictionary with k:v pairs of columns:value for that row
for row_dict in df.to_dict(orient='records'):
    print(row_dict)

Which results in:

{'col1':1.0, 'col2':0.1}
{'col1':2.0, 'col2':0.2}

@mjr2000 2019-03-16 22:33:02

This example uses iloc to isolate each value in the data frame.

import pandas as pd

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

mjr = pd.DataFrame({'a':a, 'b':b})

size = mjr.shape

# Visit every cell by integer position: rows first, then columns
for i in range(size[0]):
    for j in range(size[1]):
        print(mjr.iloc[i, j])

@HKRC 2019-02-27 00:29:49

For both viewing and modifying values, I would use iterrows(). In a for loop and by using tuple unpacking (see the example: i, row), I use the row for only viewing the value and use i with the loc method when I want to modify values. As stated in previous answers, here you should not modify something you are iterating over.

for i, row in df.iterrows():
    if row['A'] == 'Old_Value':
        df.loc[i,'A'] = 'New_Value'

Here the row in the loop is a copy of that row, and not a view of it. Therefore, you should NOT write something like row['A'] = 'New_Value', it will not modify the DataFrame. However, you can use i and loc and specify the DataFrame to do the work.
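A small sketch of this behaviour, assuming a frame with mixed column types (text in A, integers in B); the names unchanged and changed are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['Old_Value', 'keep'], 'B': [1, 2]})

# Writing to the yielded row does not reach the DataFrame here,
# because each row is a separate Series built from the data.
for i, row in df.iterrows():
    row['A'] = 'ignored'

unchanged = df['A'].tolist()

# Writing through df.loc with the index label does modify the DataFrame.
for i, row in df.iterrows():
    if row['A'] == 'Old_Value':
        df.loc[i, 'A'] = 'New_Value'

changed = df['A'].tolist()

print(unchanged)
print(changed)
```

For a simple replacement like this one, a boolean mask such as df.loc[df['A'] == 'Old_Value', 'A'] = 'New_Value' would avoid the loop entirely.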

@shubham ranjan 2019-01-19 06:53:51

There are many ways to iterate over the rows of a pandas DataFrame. One simple and intuitive way is:

import pandas as pd

df=pd.DataFrame({'A':[1,2,3], 'B':[4,5,6],'C':[7,8,9]})
print(df)
for i in range(df.shape[0]):
    # Print the second column
    print(df.iloc[i,1])
    # Print more than one column
    print(df.iloc[i,[0,2]])

@waitingkuo 2013-05-10 07:07:58

DataFrame.iterrows is a generator which yields both the index and the row (as a Series):

for index, row in df.iterrows():
    print(row['c1'], row['c2'])

Output: 
   10 100
   11 110
   12 120
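Since the question asks what the row object is, here is a minimal sketch that inspects the first pair yielded by iterrows for the question's frame. Nothing beyond pandas itself is assumed.

```python
import pandas as pd

inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
df = pd.DataFrame(inp)

# Take just the first (index, row) pair from the generator.
index, row = next(df.iterrows())

print(type(row).__name__)    # the row is a pandas Series
print(index)                 # the index label of that row
print(list(row.index))       # the Series is labelled by the column names
print(row['c1'], row['c2'])  # so values can be looked up by column name
```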

@viddik13 2016-12-07 16:24:19

Note: "Because iterrows returns a Series for each row, it does not preserve dtypes across the rows." Also, "You should never modify something you are iterating over." According to pandas 0.19.1 docs

@Aziz Alto 2017-09-05 16:30:10

@viddik13 that's a great note, thanks. Because of that I ran into a case where numerical values like 431341610650 were read as 4.31E+11. Is there a way to preserve the dtypes?

@Axel 2017-09-07 11:45:50

@AzizAlto use itertuples, as explained below. See also pandas.pydata.org/pandas-docs/stable/generated/…
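To illustrate the dtype point raised in these comments, the sketch below uses a hypothetical two-column frame (big holds a large integer, ratio a float) and compares what iterrows and itertuples hand back for the integer.

```python
import pandas as pd

df = pd.DataFrame({'big': [431341610650], 'ratio': [0.5]})

# iterrows packs each row into one Series, so the integer is upcast to float.
_, series_row = next(df.iterrows())

# itertuples takes each value from its own column, so the integer stays an integer.
tuple_row = next(df.itertuples(index=False))

print(series_row.dtype)
print(type(series_row['big']))
print(type(tuple_row.big))
```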

@Prateek Agrawal 2017-10-05 18:22:13

How does the row object change if we don't use the index variable while iterating? Do we have to use row[0], row[1] instead of the labels in that case?

@James L. 2017-12-01 16:14:20

Do not use iterrows. Itertuples is faster and preserves data type. More info

@beep_check 2018-05-03 16:55:25

if you don't need to preserve the datatype, iterrows is fine. @waitingkuo's tip to separate the index makes it much easier to parse.

@KutalmisB 2018-04-23 14:53:49

To loop all rows in a dataframe and use values of each row conveniently, namedtuples can be converted to ndarrays. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])

Iterating over the rows:

for row in df.itertuples(index=False, name='Pandas'):
    print(np.asarray(row))

results in:

[ 1.   0.1]
[ 2.   0.2]

Please note that if index=True, the index is added as the first element of the tuple, which may be undesirable for some applications.

@Lucas B 2018-01-17 09:41:29

I was looking for how to iterate over rows AND columns and ended up here, so:

for i, row in df.iterrows():
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for j, column in row.items():
        print(column)

@cs95 2019-04-09 18:37:07

Iterating over rows is bad enough. Why on Earth would you want to do this?

@viddik13 2016-12-07 16:41:28

To iterate through a DataFrame's rows in pandas, one can use DataFrame.iterrows() or DataFrame.itertuples().

itertuples() is supposed to be faster than iterrows()
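As a sketch of the two methods on the question's frame (the list name pairs is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# iterrows: yields (index label, Series) pairs.
for index, row in df.iterrows():
    print(index, row['c1'], row['c2'])

# itertuples: yields namedtuples; with index=True the index is the first field.
for row in df.itertuples(index=True, name='Pandas'):
    print(row.Index, row.c1, getattr(row, 'c2'))

# Attribute access and getattr return the same values.
pairs = [(row.c1, getattr(row, 'c2')) for row in df.itertuples()]
```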

But be aware, according to the docs (pandas 0.21.1 at the moment):

  • iterrows: dtype might not match from row to row

    Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames).

  • iterrows: Do not modify rows

    You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

    Use DataFrame.apply() instead:

    new_df = df.apply(lambda x: x * 2)
    
  • itertuples:

    The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
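The renaming described in the last point can be seen in a small sketch. The column names below are made up: one contains a space, one starts with an underscore, and one is a valid identifier.

```python
import pandas as pd

df = pd.DataFrame({'unit price': [1.5], '_hidden': [2], 'ok': [3]})

row = next(df.itertuples(index=False))

# Invalid names are replaced by positional ones; valid names are kept.
print(row._fields)

# Renamed fields are still reachable by position.
print(row[0], row[1], row.ok)
```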

@Raul Guarini 2018-01-26 13:16:04

Just a small question from someone reading this thread long after its completion: how does df.apply() compare to itertuples in terms of efficiency?

@Brian Burns 2018-06-29 07:29:55

Note: you can also say something like for row in df[['c1','c2']].itertuples(index=True, name=None): to include only certain columns in the row iterator.

@viraptor 2018-08-13 06:20:31

Instead of getattr(row, "c1"), you can use just row.c1.

@Noctiphobia 2018-08-24 10:34:12

I am about 90% sure that if you use getattr(row, "c1") instead of row.c1, you lose any performance advantage of itertuples, and if you actually need to get to the property via a string, you should use iterrows instead.

@Marlo 2018-12-06 05:39:21

When I tried this it only printed the column values but not the headers. Are the column headers excluded from the row attributes?

@James L. 2017-12-01 17:49:50

You can also do numpy indexing for even greater speed ups. It's not really iterating but works much better than iteration for certain applications.

subset = row['c1'][0:5]
all = row['c1'][:]

You may also want to cast it to an array. These indexes/selections are supposed to act like Numpy arrays already but I ran into issues and needed to cast

np.asarray(all)
imgs[:] = cv2.resize(imgs[:], (224,224) ) #resize every image in an hdf5 file

@piRSquared 2017-11-07 04:15:19

You can write your own iterator that implements namedtuple

from collections import namedtuple

def myiter(d, cols=None):
    if cols is None:
        v = d.values.tolist()
        cols = d.columns.values.tolist()
    else:
        j = [d.columns.get_loc(c) for c in cols]
        v = d.values[:, j].tolist()

    n = namedtuple('MyTuple', cols)

    for line in iter(v):
        yield n(*line)

This is directly comparable to pd.DataFrame.itertuples. I'm aiming at performing the same task with more efficiency.


For the given dataframe with my function:

list(myiter(df))

[MyTuple(c1=10, c2=100), MyTuple(c1=11, c2=110), MyTuple(c1=12, c2=120)]

Or with pd.DataFrame.itertuples:

list(df.itertuples(index=False))

[Pandas(c1=10, c2=100), Pandas(c1=11, c2=110), Pandas(c1=12, c2=120)]

A comprehensive test
We test making all columns available and subsetting the columns.

from timeit import timeit

import numpy as np
import pandas as pd

def iterfullA(d):
    return list(myiter(d))

def iterfullB(d):
    return list(d.itertuples(index=False))

def itersubA(d):
    return list(myiter(d, ['col3', 'col4', 'col5', 'col6', 'col7']))

def itersubB(d):
    return list(d[['col3', 'col4', 'col5', 'col6', 'col7']].itertuples(index=False))

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='iterfullA iterfullB itersubA itersubB'.split(),
    dtype=float
)

for i in res.index:
    d = pd.DataFrame(np.random.randint(10, size=(i, 10))).add_prefix('col')
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);

[Figures: two log-log timing plots produced by the code above, one for the "full" functions and one for the "sub" functions]

@James L. 2017-12-01 16:06:25

For people who don't want to read the code: the blue line is itertuples, the orange line is a list built from an iterator using a yield block. iterrows is not compared.

@Pedro Lobito 2017-03-11 22:44:39

To loop all rows in a dataframe you can use:

for x in range(len(date_example.index)):
    print(date_example['Date'].iloc[x])

@cs95 2019-04-18 23:20:05

This is chained indexing. I do not recommend doing this.

@Pedro Lobito 2019-04-19 01:42:15

@cs95 What would you recommend instead?

@cs95 2019-04-19 01:57:05

If you want to make this work, call df.columns.get_loc to get the integer index position of the date column (outside the loop), then use a single iloc indexing call inside.
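A minimal sketch of that suggestion, assuming a stand-in date_example frame with made-up values, since the original data is not shown in the thread:

```python
import pandas as pd

# Hypothetical stand-in for the date_example frame used in the answer.
date_example = pd.DataFrame({'Date': ['2019-04-18', '2019-04-19'],
                             'Value': [1, 2]})

# Look up the integer position of the column once, outside the loop.
col = date_example.columns.get_loc('Date')

dates = []
for x in range(len(date_example)):
    # A single iloc call with (row position, column position): no chaining.
    dates.append(date_example.iloc[x, col])

print(dates)
```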

@PJay 2016-09-07 12:56:04

You can use the df.iloc function as follows:

for i in range(0, len(df)):
    print(df.iloc[i]['c1'], df.iloc[i]['c2'])

@Pedro Lobito 2017-04-06 08:51:22

Using 0 in range is pointless, you can omit it.

@rocarvaj 2017-10-05 14:50:50

I know that one should avoid this in favor of iterrows or itertuples, but it would be interesting to know why. Any thoughts?

@Ken Williams 2018-01-18 19:22:11

This is the only valid technique I know of if you want to preserve the data types, and also refer to columns by name. itertuples preserves data types, but gets rid of any name it doesn't like. iterrows does the opposite.

@Sean Anderson 2018-09-19 12:13:47

Spent hours trying to wade through the idiosyncrasies of pandas data structures to do something simple AND expressive. This results in readable code.

@Kim Miller 2018-12-14 18:18:13

While for i in range(df.shape[0]) might speed this approach up a bit, it's still about 3.5x slower than the iterrows() approach above for my application.

@Bastiaan 2019-01-03 22:07:53

On large DataFrames this seems better, as my_iter = df.itertuples() takes double the memory and a lot of time to copy it. The same goes for iterrows().

@cs95 2019-04-18 23:20:59

This is chained indexing. Do not use!

@e9t 2015-09-20 13:52:48

While iterrows() is a good option, sometimes itertuples() can be much faster:

from numpy.random import randn, randint
import pandas as pd

df = pd.DataFrame({'a': randn(1000), 'b': randn(1000),'N': randint(100, 1000, (1000)), 'x': 'x'})

%timeit [row.a * 2 for idx, row in df.iterrows()]
# => 10 loops, best of 3: 50.3 ms per loop

%timeit [row[1] * 2 for row in df.itertuples()]
# => 1000 loops, best of 3: 541 µs per loop

@Alex 2015-09-20 17:00:05

Much of the time difference in your two examples seems like it is due to the fact that you appear to be using label-based indexing for the .iterrows() command and integer-based indexing for the .itertuples() command.

@harbun 2015-10-19 13:03:55

For a finance data based dataframe (timestamp, and 4x float), itertuples is 19.57 times faster than iterrows on my machine. Only for a,b,c in izip(df["a"],df["b"],df["c"]): is almost equally fast.

@Abe Miessler 2017-01-10 22:05:47

Can you explain why it's faster?

@miradulo 2017-02-13 17:30:29

@AbeMiessler iterrows() boxes each row of data into a Series, whereas itertuples() does not.

@Brian Burns 2017-11-05 17:29:14

Note that the order of the columns is actually indeterminate, because df is created from a dictionary, so row[1] could refer to any of the columns. As it turns out though the times are roughly the same for the integer vs the float columns.

@Alex 2018-09-28 21:57:29

@jeffhale the times you cite are exactly the same, how is that possible? Also, I meant something like row.iat[1] when I was referring to integer-based indexing.

@jeffhale 2018-09-28 23:33:16

@Alex that does look suspicious. I just reran it a few times and itertuples took 3x longer than iterrows. With pandas 0.23.4. Will delete the other comment to avoid confusion.

@jeffhale 2018-09-28 23:40:16

Then running on a much larger DataFrame, more like a real-world situation, itertuples was 100x faster than iterrows. Itertuples for the win.

@Ajasja 2018-11-07 20:53:59

I get a >50 times increase as well i.stack.imgur.com/HBe9o.png (while changing to attr accessor in the second run).

@cheekybastard 2015-06-01 06:24:44

You can also use df.apply() to iterate over rows and access multiple columns for a function.

docs: DataFrame.apply()

def valuation_formula(x, y):
    return x * y * 0.5

df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)
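For a runnable version of the snippet above, the sketch below supplies a small frame with made-up x and y values. It also shows that, because this particular formula is plain arithmetic, it can be applied to whole columns directly; the column name price_vectorised is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [2.0, 4.0], 'y': [3.0, 5.0]})

def valuation_formula(x, y):
    return x * y * 0.5

# Row-wise: the lambda receives each row as a Series (axis=1).
df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)

# Column-wise: the same function called once on whole columns.
df['price_vectorised'] = valuation_formula(df['x'], df['y'])

print(df)
```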

@SRS 2015-07-01 17:55:54

Does df['price'] refer to a column name in the data frame? I am trying to create a dictionary with unique values from several columns in a csv file. I used your logic to create a dictionary with unique keys and values and got an error stating TypeError: ("'Series' objects are mutable, thus they cannot be hashed", u'occurred at index 0')

@SRS 2015-07-01 17:57:17

Code: df['Workclass'] = df.apply(lambda row: dic_update(row), axis=1) end of line id = 0 end of line def dic_update(row): if row not in dic: dic[row] = id id = id + 1

@SRS 2015-07-01 19:06:51

Never mind, I got it. Changed the function call line to df_new = df['Workclass'].apply(same thing)

@zthomas.nc 2017-11-29 23:58:47

Having the axis default to 0 is the worst

@gented 2018-04-04 13:44:53

Notice that apply doesn't "iterate" over rows; rather, it applies a function row-wise. The above code wouldn't work if you really do need iterations and indices, for instance when comparing values across different rows (in that case you can do nothing but iterate).
