By Roman


2013-05-10 07:04:49

I have a DataFrame from Pandas:

import pandas as pd
inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
df = pd.DataFrame(inp)
print(df)

Output:

   c1   c2
0  10  100
1  11  110
2  12  120

Now I want to iterate over the rows of this frame. For every row I want to be able to access its elements (values in cells) by the name of the columns. For example:

for row in df.rows:
   print(row['c1'], row['c2'])

Is it possible to do that in Pandas?

I found this similar question. But it does not give me the answer I need. For example, it is suggested there to use:

for date, row in df.T.iteritems():

or

for row in df.iterrows():

But I do not understand what the row object is and how I can work with it.


@waitingkuo 2013-05-10 07:07:58

DataFrame.iterrows is a generator which yields both the index and row (as a Series):

import pandas as pd
import numpy as np

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

for index, row in df.iterrows():
    print(row['c1'], row['c2'])
Output:

10 100
11 110
12 120

@viddik13 2016-12-07 16:24:19

Note: "Because iterrows returns a Series for each row, it does not preserve dtypes across the rows." Also, "You should never modify something you are iterating over." According to pandas 0.19.1 docs

@Aziz Alto 2017-09-05 16:30:10

@viddik13 that's a great note, thanks. Because of that I ran into a case where numerical values like 431341610650 were read as 4.31E+11. Is there a workaround to preserve the dtypes?

@Axel 2017-09-07 11:45:50

@AzizAlto use itertuples, as explained below. See also pandas.pydata.org/pandas-docs/stable/generated/…

@Prateek Agrawal 2017-10-05 18:22:13

How does the row object change if we don't use the index variable while iterating? Do we have to use row[0], row[1] instead of the labels in that case?

@James L. 2017-12-01 16:14:20

Do not use iterrows. itertuples is faster and preserves data types. More info

@beep_check 2018-05-03 16:55:25

If you don't need to preserve the datatype, iterrows is fine. @waitingkuo's tip to separate out the index makes it much easier to parse.

@cs95 2019-05-28 05:00:44

From the documentation: "Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed[...]". Your answer is correct (in the context of the question) but does not mention this anywhere, so it isn't a very good one.

@dbaumann 2020-01-01 20:32:43

I voted for this answer because I think it answers the original question as directly as possible. This particular approach may not be such a bad idea. softwareengineering.stackexchange.com/a/80092

@AMC 2020-01-09 23:33:53

@dbaumann Are you saying that worrying about the performance of .iterrows() is premature optimization? Performance aside, what about writing code which is straightforward and idiomatic?

@AMC 2020-01-11 02:59:10

@dbaumann "and I think most of the other answers sacrifice readability for performance without actually measuring where the performance difference starts to become noticeable" They may "sacrifice readability" on this trivial example, but on anything more complex the tools provided by Pandas will be far simpler. I would like to see some of those examples which sacrifice readability you're referring to, actually, since at a glance most of the answers here seem alright.

@dbaumann 2020-01-11 03:01:41

@AMC Yes that's what I'm saying. Most of the other answers would be correct if the question was "What's the fastest way to iterate over rows in pandas?". To start by choosing the most efficient method is definitely the wrong approach in my opinion.

@AMC 2020-01-11 03:05:52

@dbaumann "To start by choosing the most efficient method is definitely the wrong approach in my opinion." Hmmmm. When you consider the fact that you aren't supposed to modify the rows during iteration with .iterrows(), I can't think of a realistic example where the more idiomatic methods would be less readable.

@Golden Lion 2020-06-05 11:31:43

You can use df.iloc[index, 1] inside a for loop as an equivalent; iloc and loc access combined with a loop amounts to iteration.

@cs95 2020-07-03 21:10:12

@GoldenLion Or better still, search for a method that solves your problem and doesn't involve a loop.

@Golden Lion 2020-07-06 14:45:58

The question implied looping when he asked about iteration.

@cs95 2019-04-07 10:03:54

How to iterate over rows in a DataFrame in Pandas?

Answer: DON'T*!

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().
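For example:

print(df.to_string())  # renders every row; plain print(df) truncates large frames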

Do you want to compute something? In that case, search for methods in this order (list modified from here):

  1. Vectorization
  2. Cython routines
  3. List Comprehensions (vanilla for loop)
  4. DataFrame.apply(): i)  Reductions that can be performed in Cython, ii) Iteration in Python space
  5. DataFrame.itertuples() and iteritems()
  6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It's actually a little more complicated than "don't". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you're not sure whether you need an iterative solution, you probably don't. PS: To know more about my rationale for writing this answer, skip to the very bottom.


Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.
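For instance, the question's own frame needs no loop at all; a quick sketch staying inside the vectorized API:

import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

df['total'] = df['c1'] + df['c2']  # arithmetic on whole columns at once
mask = df['c2'] >= 110             # vectorized comparison
print(df[mask])                    # boolean indexing, no Python-level loop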

If none exists, feel free to write your own using custom Cython extensions.


Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[['col1', ..., 'coln']].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df['col1'], ..., df['coln'])]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don't have NaNs, but this cannot always be guaranteed.

  1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
  2. When dealing with mixed data types you should iterate over zip(df['A'], df['B'], ...) instead of df[['A', 'B']].to_numpy() as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, to_numpy() will cast the entire array to string, which may not be what you want. Fortunately zipping your columns together is the most straightforward workaround to this.
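To make the second caveat concrete, here is a small sketch of the upcast (column names are illustrative):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 1.5]})

print(df[['A', 'B']].to_numpy().dtype)  # float64 - the integers were upcast

for a, b in zip(df['A'], df['B']):
    print(type(a), type(b))             # int stays int, float stays float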

*Your mileage may vary for the reasons outlined in the Caveats section above.


An Obvious Example

Let's demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).
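The benchmark itself is linked rather than inlined; as a minimal sketch of the kind of comparison it runs (the names vec and vec_numpy follow the text above, the remaining functions are illustrative):

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 2)), columns=['A', 'B'])

def vec(df):        # stay inside the pandas API
    return df['A'] + df['B']

def vec_numpy(df):  # "numpandas": drop down to raw NumPy arrays
    return df['A'].to_numpy() + df['B'].to_numpy()

def list_comp(df):
    return [a + b for a, b in zip(df['A'], df['B'])]

def apply_rows(df):
    return df.apply(lambda row: row['A'] + row['B'], axis=1)

def iterrows_loop(df):
    return [row['A'] + row['B'] for _, row in df.iterrows()]

for f in (vec, vec_numpy, list_comp, apply_rows, iterrows_loop):
    print(f.__name__, timeit.timeit(lambda: f(df), number=10))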

I should mention, however, that it isn't always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.


Further Reading

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.


Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?", showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning whether iteration is the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I'm not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

@viddik13 2019-05-30 11:56:06

Note that there are important caveats with iterrows and itertuples. See this answer and pandas docs for more details.

@LinkBerest 2019-05-30 14:26:45

This is the only answer that focuses on the idiomatic techniques one should use with pandas, making it the best answer for this question. Learning to get the right answer with the right code (instead of the right answer with the wrong code - i.e. inefficient, doesn't scale, too fit to specific data) is a big part of learning pandas (and data in general).

@Imperishable Night 2019-06-24 00:58:23

I think you are being unfair to the for loop, though, seeing as they are only a bit slower than list comprehension in my tests. The trick is to loop over zip(df['A'], df['B']) instead of df.iterrows().

@sdbbs 2019-11-20 13:57:34

Ok, I get what you're saying, but if I need to print each row (with numeric data) of a table, sorted ascending - I guess there is no other way but to loop through the rows, right?

@cs95 2019-11-20 15:37:17

@sdbbs there is, use sort_values to sort your data, then call to_string() on the result.
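A sketch of that suggestion, using the question's frame:

print(df.sort_values('c1').to_string())  # sorted ascending, printed in full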

@David Wasserman 2020-01-16 20:44:39

Under List Comprehensions, the "iterating over multiple columns" example needs a caveat: DataFrame.values will convert every column to a common data type. DataFrame.to_numpy() does this too. Fortunately we can use zip with any number of columns.

@cs95 2020-01-16 20:52:44

@DavidWasserman that's a fantastic remark, thanks for your comments. Indeed that is something to watch out for with mixed columns unless you convert to object first (which, why would you)!

@c z 2020-01-29 18:00:34

Interesting, since iterrows, apply and list comprehension all seem to tend towards O(n) scalability, I'd avoid any micro-optimisations and go with the most readable. A dataset too slow with any method is more likely in need of time spent finding a solution that isn't Pandas, rather than trying to shave milliseconds off a for loop.

@cs95 2020-01-29 19:08:58

@cz the plot is logarithmic. The difference in perf for larger datasets is on the order of seconds and minutes, not milliseconds.

@bug_spray 2020-03-24 05:24:46

I know I'm late to the answering party, but if you convert the dataframe to a NumPy array and then use vectorization, it's even faster than Pandas dataframe vectorization (and that includes the time to turn it back into a dataframe series). For example:

def np_vectorization(df):
    np_arr = df.to_numpy()
    return pd.Series(np_arr[:, 0] + np_arr[:, 1], index=df.index)

def just_np_vectorization(df):
    np_arr = df.to_numpy()
    return np_arr[:, 0] + np_arr[:, 1]

@cs95 2020-03-24 06:01:55

@AndreRicardo why not post that in an answer where it becomes more visible?

@Mike_K 2020-05-11 02:05:03

This is actually what I was having a hard time finding going down the google path described in the answer. Thanks for it!

@Aleksandr Panzin 2020-05-20 23:20:36

Unfortunately, some of us don't have the option to follow your suggestion, because some libraries just force the use of DataFrame unnecessarily. (I got here trying to iterate over a Parquet file in Python without Spark and transform the data to JSON, and I'm forced to use DataFrame.) If you write libraries - please remember not to push Pandas on us.

@cs95 2020-07-26 04:46:23

@Dean I get this response quite often and it honestly confuses me. It's all about forming good habits. "My data is small and performance doesn't matter so my use of this antipattern can be excused" ..? When performance actually does matter one day, you'll thank yourself for having prepared the right tools in advance.

@Dean 2020-07-27 05:34:13

@cs95 I thank you already (actually I deleted my comment because I thought I was nitpicking). The reason you get this kind of response too often is that the question was not about forming a good habit. If you don't want critical response, perhaps change "DON'T" to "do it, but keep in mind it's a bad habit."

@artoby 2020-06-01 16:22:44

In short

  • Use vectorization if possible
  • If an operation can't be vectorized - use list comprehensions
  • If you need a single object representing the entire row - use itertuples
  • If the above is too slow - try swifter.apply (a sketch follows below)
  • If it's still too slow - try a Cython routine

Benchmark: iteration over rows in a pandas DataFrame (chart omitted).
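A minimal sketch of the swifter step mentioned above (this assumes the third-party swifter package, which registers a .swifter accessor on pandas objects; the column names are illustrative):

import pandas as pd
import swifter  # third-party: pip install swifter

df = pd.DataFrame({'x': range(100000)})
df['y'] = df['x'].swifter.apply(lambda v: v ** 2)  # parallelizes/vectorizes where it can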

@Romain Capron 2019-12-19 16:02:14

How to iterate efficiently

If you really have to iterate a Pandas dataframe, you will probably want to avoid using iterrows(). There are different methods and the usual iterrows() is far from being the best. itertuples() can be 100 times faster.

In short:

  • As a general rule, use df.itertuples(name=None), in particular when you have a fixed number of columns and fewer than 255 columns. See point (3)
  • Otherwise, use df.itertuples(), except if your columns have special characters such as spaces or '-'. See point (2)
  • It is possible to use itertuples() even if your dataframe has strange column names, by using the last example. See point (4)
  • Only use iterrows() if you cannot use the previous solutions. See point (1)

Different methods to iterate over rows in a Pandas dataframe:

Generate a random dataframe with a million rows and 4 columns:

import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
print(df)

1) The usual iterrows() is convenient, but damn slow:

start_time = time.perf_counter()
result = 0
for _, row in df.iterrows():
    result += max(row['B'], row['C'])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))

2) The default itertuples() is already much faster, but it doesn't work with column names such as My Col-Name is very Strange (you should avoid this method if your columns are repeated or if a column name cannot be simply converted to a Python variable name):

start_time = time.perf_counter()
result = 0
for row in df.itertuples(index=False):
    result += max(row.B, row.C)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

3) Using itertuples() with name=None is even faster, but not really convenient, as you have to define a variable per column.

start_time = time.perf_counter()
result = 0
for (_, col1, col2, col3, col4) in df.itertuples(name=None):
    result += max(col2, col3)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

4) Finally, the named itertuples() is slower than the previous point, but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange.

start_time = time.perf_counter()
result = 0
for row in df.itertuples(index=False):
    result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))

Output:

         A   B   C   D
0       41  63  42  23
1       54   9  24  65
2       15  34  10   9
3       39  94  82  97
4        4  88  79  54
...     ..  ..  ..  ..
999995  48  27   4  25
999996  16  51  34  28
999997   1  39  61  14
999998  66  51  27  70
999999  51  53  47  99

[1000000 rows x 4 columns]

1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519

This article is a very interesting comparison between iterrows and itertuples

@shubham ranjan 2019-01-19 06:53:51

There are so many ways to iterate over the rows in Pandas dataframe. One very simple and intuitive way is:

df = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]})
print(df)
for i in range(df.shape[0]):
    # For printing the second column
    print(df.iloc[i, 1])

    # For printing more than one columns
    print(df.iloc[i, [0, 2]])

@Lucas B 2018-01-17 09:41:29

I was looking for How to iterate on rows and columns and ended here so:

for i, row in df.iterrows():
    for j, column in row.iteritems():
        print(column)

@Romain Capron 2020-07-20 09:00:51

When possible, you should avoid using iterrows(). I explain why in the answer How to iterate efficiently

@James L. 2017-12-01 17:49:50

You can also do NumPy indexing for even greater speed ups. It's not really iterating but works much better than iteration for certain applications.

subset = df['c1'][0:5]
all_values = df['c1'][:]

You may also want to cast it to an array. These indexes/selections are supposed to act like NumPy arrays already, but I ran into issues and needed to cast

all_values = np.asarray(all_values)
imgs[:] = cv2.resize(imgs[:], (224, 224))  # Resize every image in an HDF5 file

@bug_spray 2020-03-24 17:57:16

cs95 shows that Pandas vectorization far outperforms other Pandas methods for computing stuff with dataframes.

I wanted to add that if you first convert the dataframe to a NumPy array and then use vectorization, it's even faster than Pandas dataframe vectorization, (and that includes the time to turn it back into a dataframe series).

If you add the following functions to cs95's benchmark code, this becomes pretty evident:

def np_vectorization(df):
    np_arr = df.to_numpy()
    return pd.Series(np_arr[:,0] + np_arr[:,1], index=df.index)

def just_np_vectorization(df):
    np_arr = df.to_numpy()
    return np_arr[:,0] + np_arr[:,1]


@viddik13 2016-12-07 16:41:28

First consider if you really need to iterate over rows in a DataFrame. See this answer for alternatives.

If you still need to iterate over rows, you can use methods below. Note some important caveats which are not mentioned in any of the other answers.

itertuples() is supposed to be faster than iterrows()

But be aware, according to the docs (pandas 0.24.2 at the moment):

  • iterrows: dtype might not match from row to row

    Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally much faster than iterrows()

  • iterrows: Do not modify rows

    You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

    Use DataFrame.apply() instead:

    new_df = df.apply(lambda x: x * 2)
    
  • itertuples:

    The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.

See pandas docs on iteration for more details.
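The dtype caveat is easy to demonstrate; a small sketch with one int and one float column:

import pandas as pd

df = pd.DataFrame({'int': [1, 2], 'float': [0.1, 0.2]})

row = next(df.iterrows())[1]  # a Series: values upcast to a common dtype
print(row['int'])             # 1.0 - the int became a float

tup = next(df.itertuples())   # a namedtuple: original dtypes survive
print(tup.int)                # 1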

@Raul Guarini 2018-01-26 13:16:04

Just a small question from someone reading this thread so long after its completion: how does df.apply() compare to itertuples in terms of efficiency?

@Brian Burns 2018-06-29 07:29:55

Note: you can also say something like for row in df[['c1','c2']].itertuples(index=True, name=None): to include only certain columns in the row iterator.

@viraptor 2018-08-13 06:20:31

Instead of getattr(row, "c1"), you can use just row.c1.

@Noctiphobia 2018-08-24 10:34:12

I am about 90% sure that if you use getattr(row, "c1") instead of row.c1, you lose any performance advantage of itertuples, and if you actually need to get to the property via a string, you should use iterrows instead.

@Marlo 2018-12-06 05:39:21

When I tried this it only printed the column values but not the headers. Are the column headers excluded from the row attributes?

@viddik13 2019-05-30 12:32:52

I have stumbled upon this question because, although I knew there's split-apply-combine, I still really needed to iterate over a DataFrame (as the question states). Not everyone has the luxury to improve with numba and cython (the same docs say that "It’s always worth optimising in Python first"). I wrote this answer to help others avoid (sometimes frustrating) issues as none of the other answers mention these caveats. Misleading anyone or telling "that's the right thing to do" was never my intention. I have improved the answer.

@Confounded 2019-12-16 17:36:36

And what if I want to loop through a dataframe with a step size greater than 1, e.g. select only every 3rd row? Thank you

@viddik13 2019-12-16 22:39:30

@Confounded A quick google reveals that you can use iloc to preprocess the dataframe: df.iloc[::5, :] will give you every 5th row. See this question for more details.

@Hossein 2019-02-27 00:29:49

For both viewing and modifying values, I would use iterrows(). In a for loop and by using tuple unpacking (see the example: i, row), I use the row for only viewing the value and use i with the loc method when I want to modify values. As stated in previous answers, here you should not modify something you are iterating over.

for i, row in df.iterrows():
    if row['A'] == 'Old_Value':       # read via the row (a copy)
        df.loc[i, 'A'] = 'New_value'  # write via loc on the DataFrame itself

Here the row in the loop is a copy of that row, and not a view of it. Therefore, you should NOT write something like row['A'] = 'New_Value'; it will not modify the DataFrame. However, you can use i and loc and specify the DataFrame to do the work.

@Wes McKinney 2012-05-24 14:24:52

You should use df.iterrows(). Though iterating row-by-row is not especially efficient since Series objects have to be created.

@vgoklani 2012-10-07 12:26:26

Is this faster than converting the DataFrame to a numpy array (via .values) and operating on the array directly? I have the same problem, but ended up converting to a numpy array and then using cython.

@Phillip Cloud 2013-06-15 21:06:43

@vgoklani If iterating row-by-row is inefficient and you have a non-object numpy array then almost surely using the raw numpy array will be faster, especially for arrays with many rows. you should avoid iterating over rows unless you absolutely have to

@Richard Wong 2015-12-16 11:41:15

I have done a bit of testing on the time consumption for df.iterrows(), df.itertuples(), and zip(df['a'], df['b']) and posted the result in the answer of another question: stackoverflow.com/a/34311080/2142098

@morganics 2019-12-10 09:36:45

Some libraries (e.g. a Java interop library that I use) require values to be passed in a row at a time, for example, if streaming data. To replicate the streaming nature, I 'stream' my dataframe values one by one. I wrote the below, which comes in handy from time to time.

from typing import Dict, List


class DataFrameReader:
  def __init__(self, df):
    self._df = df
    self._row = None
    self._columns = df.columns.tolist()
    self.reset()
    self.row_index = 0

  def __getattr__(self, key):
    return self.__getitem__(key)

  def read(self) -> bool:
    self._row = next(self._iterator, None)
    self.row_index += 1
    return self._row is not None

  def columns(self):
    return self._columns

  def reset(self) -> None:
    self._iterator = self._df.itertuples()

  def get_index(self):
    return self._row[0]

  def index(self):
    return self._row[0]

  def to_dict(self, columns: List[str] = None):
    return self.row(columns=columns)

  def tolist(self, cols) -> List[object]:
    return [self.__getitem__(c) for c in cols]

  def row(self, columns: List[str] = None) -> Dict[str, object]:
    cols = set(self._columns if columns is None else columns)
    return {c : self.__getitem__(c) for c in self._columns if c in cols}

  def __getitem__(self, key) -> object:
    # the df index of the row is at index 0
    try:
        if type(key) is list:
            ix = [self._columns.index(k) + 1 for k in key]
        else:
            ix = self._columns.index(key) + 1
        return self._row[ix]
    except BaseException as e:
        return None

  def __next__(self) -> 'DataFrameReader':
    if self.read():
        return self
    else:
        raise StopIteration

  def __iter__(self) -> 'DataFrameReader':
    return self

Which can be used:

for row in DataFrameReader(df):
  print(row.my_column_name)
  print(row.to_dict())
  print(row['my_column_name'])
  print(row.tolist())

And it preserves the values/name mapping for the rows being iterated. Obviously, it is a lot slower than using apply and Cython as indicated above, but it is necessary in some circumstances.

@Zeitgeist 2019-10-17 15:26:30

There is a way to iterate through rows while getting a DataFrame in return, and not a Series. I don't see anyone mentioning that you can pass index as a list for the row to be returned as a DataFrame:

for i in range(len(df)):
    row = df.iloc[[i]]

Note the usage of double brackets. This returns a DataFrame with a single row.
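A quick sketch of the difference:

row_frame = df.iloc[[0]]   # DataFrame with a single row
row_series = df.iloc[0]    # Series

print(type(row_frame), type(row_series))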

@Jason Harrison 2019-12-03 05:23:02

This was very helpful for getting the nth largest row in a data frame after sorting. Thanks!

@Grag2015 2017-11-02 10:33:40

for ind in df.index:
    print(df['c1'][ind], df['c2'][ind])

@Bazyli Debowski 2018-09-10 12:41:05

how is the performance of this option when used on a large dataframe (millions of rows for example)?

@Grag2015 2018-10-25 13:52:28

Honestly, I don't know exactly. I think that in comparison with the best answer, the elapsed time will be about the same, because both cases use the "for" construction. But the memory may be different in some cases.

@cs95 2019-04-18 23:19:43

This is chained indexing. Do not use this!

@Zach 2018-06-27 18:48:28

Sometimes a useful pattern is:

# Borrowing @KutalmisB df example
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
# The to_dict call results in a list of dicts
# where each row_dict is a dictionary with k:v pairs of columns:value for that row
for row_dict in df.to_dict(orient='records'):
    print(row_dict)

Which results in:

{'col1':1.0, 'col2':0.1}
{'col1':2.0, 'col2':0.2}

@mjr2000 2019-03-16 22:33:02

This example uses iloc to visit each value in the data frame, one cell at a time.

import pandas as pd

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

mjr = pd.DataFrame({'a': a, 'b': b})

size = mjr.shape

for i in range(size[0]):
    for j in range(size[1]):
        print(mjr.iloc[i, j])

@Herpes Free Engineer 2018-04-23 14:53:49

To loop all rows in a dataframe and use values of each row conveniently, namedtuples can be converted to ndarrays. For example:

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])

Iterating over the rows:

import numpy as np

for row in df.itertuples(index=False, name='Pandas'):
    print(np.asarray(row))

results in:

[ 1.   0.1]
[ 2.   0.2]

Please note that if index=True, the index is added as the first element of the tuple, which may be undesirable for some applications.

@piRSquared 2017-11-07 04:15:19

You can write your own iterator that implements namedtuple

from collections import namedtuple
from timeit import timeit

import numpy as np
import pandas as pd

def myiter(d, cols=None):
    if cols is None:
        v = d.values.tolist()
        cols = d.columns.values.tolist()
    else:
        j = [d.columns.get_loc(c) for c in cols]
        v = d.values[:, j].tolist()

    n = namedtuple('MyTuple', cols)

    for line in iter(v):
        yield n(*line)

This is directly comparable to pd.DataFrame.itertuples. I'm aiming at performing the same task with more efficiency.


For the given dataframe with my function:

list(myiter(df))

[MyTuple(c1=10, c2=100), MyTuple(c1=11, c2=110), MyTuple(c1=12, c2=120)]

Or with pd.DataFrame.itertuples:

list(df.itertuples(index=False))

[Pandas(c1=10, c2=100), Pandas(c1=11, c2=110), Pandas(c1=12, c2=120)]

A comprehensive test
We test making all columns available and subsetting the columns.

def iterfullA(d):
    return list(myiter(d))

def iterfullB(d):
    return list(d.itertuples(index=False))

def itersubA(d):
    return list(myiter(d, ['col3', 'col4', 'col5', 'col6', 'col7']))

def itersubB(d):
    return list(d[['col3', 'col4', 'col5', 'col6', 'col7']].itertuples(index=False))

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='iterfullA iterfullB itersubA itersubB'.split(),
    dtype=float
)

for i in res.index:
    d = pd.DataFrame(np.random.randint(10, size=(i, 10))).add_prefix('col')
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);

(Two log-log timing plots: one for iterating the full frame, one for the column subset.)

@James L. 2017-12-01 16:06:25

For people who don't want to read the code: the blue line is itertuples, the orange line is a list of an iterator through a yield block. iterrows is not compared.

@CONvid19 2017-03-11 22:44:39

To loop all rows in a dataframe you can use:

for x in range(len(date_example.index)):
    print(date_example['Date'].iloc[x])

@cs95 2019-04-18 23:20:05

This is chained indexing. I do not recommend doing this.

@CONvid19 2019-04-19 01:42:15

@cs95 What would you recommend instead?

@cs95 2019-04-19 01:57:05

If you want to make this work, call df.columns.get_loc to get the integer index position of the date column (outside the loop), then use a single iloc indexing call inside.
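A sketch of that suggestion, reusing the names from the answer above:

date_loc = date_example.columns.get_loc('Date')  # integer position, computed once
for x in range(len(date_example)):
    print(date_example.iloc[x, date_loc])        # one positional lookup, no chaining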

@PJay 2016-09-07 12:56:04

You can use the df.iloc indexer as follows:

for i in range(0, len(df)):
    print(df.iloc[i]['c1'], df.iloc[i]['c2'])

@rocarvaj 2017-10-05 14:50:50

I know that one should avoid this in favor of iterrows or itertuples, but it would be interesting to know why. Any thoughts?

@Ken Williams 2018-01-18 19:22:11

This is the only valid technique I know of if you want to preserve the data types, and also refer to columns by name. itertuples preserves data types, but gets rid of any name it doesn't like. iterrows does the opposite.

@Sean Anderson 2018-09-19 12:13:47

Spent hours trying to wade through the idiosyncrasies of pandas data structures to do something simple AND expressive. This results in readable code.

@Kim Miller 2018-12-14 18:18:13

While for i in range(df.shape[0]) might speed this approach up a bit, it's still about 3.5x slower than the iterrows() approach above for my application.

@Bastiaan 2019-01-03 22:07:53

On large DataFrames this seems better, as my_iter = df.itertuples() takes double the memory and a lot of time to copy it. The same goes for iterrows().

@e9t 2015-09-20 13:52:48

While iterrows() is a good option, sometimes itertuples() can be much faster:

from numpy.random import randint, randn

df = pd.DataFrame({'a': randn(1000), 'b': randn(1000), 'N': randint(100, 1000, (1000)), 'x': 'x'})

%timeit [row.a * 2 for idx, row in df.iterrows()]
# => 10 loops, best of 3: 50.3 ms per loop

%timeit [row[1] * 2 for row in df.itertuples()]
# => 1000 loops, best of 3: 541 µs per loop

@Alex 2015-09-20 17:00:05

Much of the time difference in your two examples seems like it is due to the fact that you appear to be using label-based indexing for the .iterrows() command and integer-based indexing for the .itertuples() command.

@harbun 2015-10-19 13:03:55

For a finance-data-based dataframe (timestamp and 4x float), itertuples is 19.57 times faster than iterrows on my machine. Only for a, b, c in izip(df["a"], df["b"], df["c"]): is almost equally fast.

@Abe Miessler 2017-01-10 22:05:47

Can you explain why it's faster?

@miradulo 2017-02-13 17:30:29

@AbeMiessler iterrows() boxes each row of data into a Series, whereas itertuples() does not.

@Brian Burns 2017-11-05 17:29:14

Note that the order of the columns is actually indeterminate, because df is created from a dictionary, so row[1] could refer to any of the columns. As it turns out though the times are roughly the same for the integer vs the float columns.

@Alex 2018-09-28 21:57:29

@jeffhale the times you cite are exactly the same, how is that possible? Also, I meant something like row.iat[1] when I was referring to integer-based indexing.

@jeffhale 2018-09-28 23:33:16

@Alex that does look suspicious. I just reran it a few times and itertuples took 3x longer than iterrows. With pandas 0.23.4. Will delete the other comment to avoid confusion.

@jeffhale 2018-09-28 23:40:16

Then running on a much larger DataFrame, more like a real-world situation, itertuples was 100x faster than iterrows. Itertuples for the win.

@Ajasja 2018-11-07 20:53:59

I get a >50 times increase as well i.stack.imgur.com/HBe9o.png (while changing to attr accessor in the second run).

@cheekybastard 2015-06-01 06:24:44

You can also use df.apply() to iterate over rows and access multiple columns for a function.

docs: DataFrame.apply()

def valuation_formula(x, y):
    return x * y * 0.5

df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)

@SRS 2015-07-01 17:55:54

Does df['price'] refer to a column name in the data frame? I am trying to create a dictionary with unique values from several columns in a CSV file. I used your logic to create a dictionary with unique keys and values and got an error stating TypeError: ("'Series' objects are mutable, thus they cannot be hashed", u'occurred at index 0')

@SRS 2015-07-01 17:57:17

Code:

df['Workclass'] = df.apply(lambda row: dic_update(row), axis=1)

id = 0

def dic_update(row):
    if row not in dic:
        dic[row] = id
        id = id + 1

@SRS 2015-07-01 19:06:51

Never mind, I got it. Changed the function call line to df_new = df['Workclass'].apply(same thing)

@zthomas.nc 2017-11-29 23:58:47

Having the axis default to 0 is the worst

@gented 2018-04-04 13:44:53

Notice that apply doesn't "iteratite" over rows, rather it applies a function row-wise. The above code wouldn't work if you really do need iterations and indices, for instance when comparing values across different rows (in that case you can do nothing but iterate).

@cs95 2019-06-29 20:54:54

@gented ...where did you see the word "iteratite" here?

@dhruvm 2020-07-25 20:14:39

this is the appropriate answer for pandas
