By PhE


2012-05-23 08:12:31 8 Comments

I understand that pandas is designed to load a fully populated DataFrame, but I need to create an empty DataFrame and then add rows one by one. What is the best way to do this?

I successfully created an empty DataFrame with :

res = DataFrame(columns=('lib', 'qty1', 'qty2'))

Then I can add a new row and fill a field with :

res = res.set_value(len(res), 'qty1', 10.0)

It works, but it seems very odd :-/ (and it fails when adding a string value).

How can I add a new row to my DataFrame (with columns of different types)?

19 comments

@RockStar 2019-10-21 07:26:08

You can use a generator object to create the DataFrame, which is more memory-efficient than building a list first.

num = 10

# Generator function to generate generator object
def numgen_func(num):
    for i in range(num):
        yield ('name_{}'.format(i), (i*i), (i*i*i))

# Generator expression to generate generator object (single-use: once consumed, it cannot be reused)
numgen_expression = (('name_{}'.format(i), (i*i), (i*i*i)) for i in range(num) )

df = pd.DataFrame(data=numgen_func(num), columns=('lib', 'qty1', 'qty2'))

To add a row to an existing DataFrame you can use the append method.

df = df.append([{ 'lib': "name_20", 'qty1': 20, 'qty2': 400  }])
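Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions the same row addition can be written with pd.concat. A minimal sketch, reusing the column names above with made-up starting data:

```python
import pandas as pd

# Small starting frame with illustrative values.
df = pd.DataFrame(
    [("name_0", 0, 0), ("name_1", 1, 1)],
    columns=["lib", "qty1", "qty2"],
)

# Wrap the new row in a one-row DataFrame, then concatenate.
new_row = pd.DataFrame([{"lib": "name_20", "qty1": 20, "qty2": 400}])
df = pd.concat([df, new_row], ignore_index=True)

print(df)
```

Like append, concat returns a new object, so the result has to be assigned back.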

@Armali 2019-08-22 12:39:32

We often see the construct df.loc[subscript] = … used to assign to one DataFrame row. Mikhail_Sam posted benchmarks that include this construct as well as the method that collects dicts and creates the DataFrame at the end, and he found the latter to be the fastest by far. But if we replace df3.loc[i] = … (with a preallocated DataFrame) in his code with df3.values[i] = …, the outcome changes significantly: that method then performs similarly to the dict-based one. So df.values[subscript] = … deserves consideration more often. Note, however, that .values takes a zero-based positional subscript, which may differ from the labels in DataFrame.index.
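Whether an assignment through .values reaches the DataFrame depends on .values returning a writable view of the data, which may not hold for mixed dtypes or with Copy-on-Write enabled. A sketch of a related variant that does not rely on this (it is not the df.values assignment itself): fill a preallocated NumPy array by position and wrap it in a DataFrame at the end. The column names and values are placeholders.

```python
import numpy as np
import pandas as pd

n_rows = 4
columns = ["A", "B", "C"]

# Preallocate the whole block once; row assignment writes into it in place.
data = np.empty((n_rows, len(columns)), dtype=np.int64)
for i in range(n_rows):
    data[i] = [i, i * 2, i * i]  # zero-based position, not an index label

# Attach the column names only at the end.
df = pd.DataFrame(data, columns=columns)
print(df)
```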

@Mikhail_Sam 2017-12-26 14:02:37

It's been a long time, but I faced the same problem too and found a lot of interesting answers here, so I was unsure which method to use.

When adding a lot of rows to a DataFrame, I am interested in speed. So I tried the four most popular methods and measured how fast each one is.

UPDATED IN 2019 using new versions of packages. Also updated after @FooBar's comment.

SPEED PERFORMANCE

  1. Using .append (NPE's answer)
  2. Using .loc (fred's answer)
  3. Using .loc with preallocating (FooBar's answer)
  4. Using dict and create DataFrame in the end (ShikharDua's answer)

Results (in secs):

|------------|-------------|-------------|-------------|
|  Approach  |  1000 rows  |  5000 rows  | 10 000 rows |
|------------|-------------|-------------|-------------|
| .append    |    0.69     |    3.39     |    6.78     |
|------------|-------------|-------------|-------------|
| .loc w/o   |    0.74     |    3.90     |    8.35     |
| prealloc   |             |             |             |
|------------|-------------|-------------|-------------|
| .loc with  |    0.24     |    2.58     |    8.70     |
| prealloc   |             |             |             |
|------------|-------------|-------------|-------------|
|  dict      |    0.012    |   0.046     |   0.084     |
|------------|-------------|-------------|-------------|

Thanks also to @krassowski for the useful comment; I updated the code accordingly.

Based on these results, I use the dictionary approach myself.


Code:

import pandas as pd
import numpy as np
import time

numOfRows = 1000

# .append (requires pandas < 2.0, where DataFrame.append still exists)
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
    df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)

# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
    df2.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)

# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
    df3.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)

# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
    row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
    dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)

df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)

P.S. I believe my implementation isn't perfect, and there may be room for optimization.

@krassowski 2019-01-23 20:44:29

The use of df2.index.max() for .loc needlessly increases computational complexity. Simple df2.loc[i] = ... would do. For me it reduced the time from 10s to 8.64s

@FooBar 2019-07-29 18:27:58

Please remove my name from the list, since you're not following my approach in your test: You're not preallocating the memory by providing an index of suitable size.

@Mikhail_Sam 2019-07-30 08:17:47

@FooBar Hi! I'm glad that you, as the author, saw my answer :) You are right, I missed this important point. I prefer to add one more row to my results table, since your approach gives a different result!

@FabioSpaghetti 2019-08-11 12:36:36

@Mikhail_Sam How would you use pivot-table to write it on an excel file using the fastest method, dict ?

@rayzinnz 2019-08-12 08:33:58

Why use a list of dicts instead of just hstacking the numpy into a 2D array, and converting and adding column names at the end?

@Mikhail_Sam 2019-08-16 08:53:44

@FabioSpaghetti With the dict method we get a df at the end, just as with all the other methods, so there is no problem creating a pivot table from it. Here, for example, you can find an example of reading/writing to Excel: pivot-tables on Excel

@flow2k 2019-09-29 09:03:51

@Mikhail_Sam For the last, dict approach, what's the rationale behind using two loops, for i in range (0,5): and for i in range( 1,numOfRows-4):?

@Brian Burns 2017-10-13 12:16:23

You can also build up a list of lists and convert it to a dataframe -

import pandas as pd

columns = ['i','double','square']
rows = []

for i in range(6):
    row = [i, i*2, i*i]
    rows.append(row)

df = pd.DataFrame(rows, columns=columns)

giving

   i  double  square
0  0       0       0
1  1       2       1
2  2       4       4
3  3       6       9
4  4       8      16
5  5      10      25

@fred 2014-07-22 13:10:25

>>> import pandas as pd
>>> from numpy.random import randint

>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
...     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))

>>> df
     lib qty1 qty2
0  name0    3    3
1  name1    2    4
2  name2    2    8
3  name3    2    1
4  name4    9    6

@FooBar 2014-07-23 14:22:36

Consider adding the index to preallocate memory (see my answer)

@fred 2014-10-23 19:17:30

@MaximG: I strongly recommend an upgrade. Current Pandas version is 0.15.0.

@hobs 2015-09-25 23:21:20

.loc is referencing the index column, so if you're working with a pre-existing DataFrame with an index that isn't a continuous sequence of integers starting with 0 (as in your example), .loc will overwrite existing rows, or insert rows, or create gaps in your index. A more robust (but not fool-proof) approach for appending to an existing nonzero-length dataframe would be: df.loc[df.index.max() + 1] = [randint(... or prepopulating the index as @FooBar suggested.

@fred 2015-09-28 13:18:25

@hobs: I completely agree with you. Thanks for your input. However, it's a different scenario from that proposed in the original question. If you know, a priori, the size of your data frame it's certainly faster to allocate memory.

@Rishabh Agrahari 2017-11-04 19:43:55

What if each row is a flattened image and the number of pixels may vary, i.e., a different number of columns for each row? What should I do in that case?

@Mohit Motwani 2018-08-01 10:37:48

Instead of [np.random.randint(-1,1) for n in range(3)], you could use np.random.randint(-1, 1, size =3).

@flow2k 2019-04-24 01:30:11

@hobs df.index.max() is nan when the DataFrame is empty.

@hobs 2019-04-24 21:31:02

@flow2k good catch! The only solution I can think of is a try/except (on the first row insertion only) with a pd.DataFrame() constructor call. Do you know any better ways?

@flow2k 2019-04-25 21:17:24

@hobs One solution I thought of is using the ternary operator: df.loc[0 if pd.isnull(df.index.max()) else df.index.max() + 1]

@tomatom 2017-10-13 17:48:52

This will take care of adding an item to an empty DataFrame. The issue is that df.index.max() == nan for the first index:

import math
import pandas as pd

df = pd.DataFrame(columns=['timeMS', 'accelX', 'accelY', 'accelZ', 'gyroX', 'gyroY', 'gyroZ'])

df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = [x for x in range(7)]
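The same idea can be written without testing max() for NaN by checking the length of the index explicitly. A minimal sketch, assuming an integer index; append_row is a hypothetical helper name and the sample values are made up:

```python
import pandas as pd

def append_row(df, values):
    """Add `values` under the next integer label (hypothetical helper)."""
    # An empty index has no usable max(), so start at 0 in that case.
    label = 0 if len(df.index) == 0 else df.index.max() + 1
    df.loc[label] = values
    return df

# Works on a frame that has columns but no rows yet...
empty = pd.DataFrame(columns=["timeMS", "accelX", "accelY"])
append_row(empty, [0, 1, 2])
append_row(empty, [10, 3, 4])

# ...and on a frame whose index does not start at 0.
gappy = pd.DataFrame({"timeMS": [5], "accelX": [6], "accelY": [7]}, index=[10])
append_row(gappy, [8, 9, 10])

print(empty)
print(gappy)
```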

@Shivam Agrawal 2018-09-05 19:30:54

Here is a way to add/append a row to a pandas DataFrame:

def add_row(df, row):
    df.loc[-1] = row
    df.index = df.index + 1  
    return df.sort_index()

df = add_row(df, [1,2,3])

It can be used to insert/append a row in an empty or populated pandas DataFrame. The sorted result is returned, so assign it back.

@Parthiban Rajendran 2018-10-13 17:33:30

this is adding with index in descending order

@Qinsi 2018-08-30 03:19:43

Figured out a simple and nice way:

>>> df
     A  B  C
one  1  2  3
>>> df.loc["two"] = [4,5,6]
>>> df
     A  B  C
one  1  2  3
two  4  5  6

@waterproof 2019-07-25 16:43:28

Note that this will copy the entire DataFrame under the hood. The underlying arrays can't be extended so they have to be copied.

@ShikharDua 2013-07-05 20:38:13

In case you can get all data for the data frame upfront, there is a much faster approach than appending to a data frame:

  1. Create a list of dictionaries in which each dictionary corresponds to an input data row.
  2. Create a data frame from this list.

I had a similar task for which appending to a data frame row by row took 30 min, and creating a data frame from a list of dictionaries completed within seconds.

rows_list = []
for row in input_rows:
    dict1 = {}
    # get input row in dictionary format
    # key = col_name
    dict1.update(blah..)

    rows_list.append(dict1)

df = pd.DataFrame(rows_list)
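A runnable sketch of the two steps above. The shape of input_rows (a list of tuples) and the column names are assumptions made for illustration, not part of the original answer:

```python
import pandas as pd

# Hypothetical input: each raw row is a (name, quantity, price) tuple.
input_rows = [("apple", 3, 0.5), ("pear", 1, 0.75), ("fig", 12, 0.2)]

rows_list = []
for name, quantity, price in input_rows:
    # One dict per row; the keys are the column names.
    rows_list.append({"lib": name, "qty1": quantity, "qty2": price})

# Build the DataFrame once, after the loop.
df = pd.DataFrame(rows_list, columns=["lib", "qty1", "qty2"])
print(df)
```

Passing columns explicitly fixes the column order regardless of how the dicts were built.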

@fantabolous 2014-08-13 12:19:38

I've moved to doing this as well for any situation where I can't get all the data up front. The speed difference is astonishing.

@thikonom 2015-12-25 22:01:27

Copying from pandas docs: It is worth noting however, that concat (and therefore append) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension. (pandas.pydata.org/pandas-docs/stable/…)
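The list-comprehension advice in that quote can be sketched as follows: build all the pieces first and call concat once. The make_chunk function and its data are hypothetical stand-ins.

```python
import pandas as pd

def make_chunk(i):
    """Hypothetical stand-in for whatever produces one piece of data."""
    return pd.DataFrame({"lib": [f"name_{i}"], "qty1": [i], "qty2": [i * i]})

# A single concat over the whole list copies the data once,
# instead of once per appended piece.
df = pd.concat([make_chunk(i) for i in range(5)], ignore_index=True)
print(df)
```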

@user5359531 2016-08-09 21:36:40

This works great! Except when I created the data frame, the columns names were all in the wrong order...

@ShikharDua 2016-08-10 20:31:56

@user5359531 You can use ordered dict in that case

@Marcello Grechi Lins 2017-01-27 22:26:48

@user5359531 You can manually specify the columns and the order will be preserved. pd.DataFrame(rows_list, columns=['C1', 'C2','C3']) will do the trick

@avg 2018-01-11 10:31:01

@ShikharDua Can you explain how to structure the data, i.e. when you say # get input row in dictionary format # key = col_name what do you mean? my use case: I am fetching rows from a MySQL table and each row object is coming as a tuple containing 10 strings

@ShikharDua 2019-01-29 00:29:36

@avg I am guessing each tuple is supposed to be a row in the pandas table. In that case, it can be an element of the row_list instead of dict1. Once all the data is in the form [tuple1, tuple2, tuple3, ...], you can create the final dataset.

@Eric Ed Lohmar 2019-06-17 16:10:03

@avg, it's probably better to use the read_sql method in that case. Then you don't have to worry about parsing it on your own.

@hkyi 2017-08-06 05:06:38

For the sake of a Pythonic way, here is my answer:

res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1':10.0}], ignore_index=True)
print(res.head())

   lib  qty1  qty2
0  NaN  10.0   NaN

@Vineet Jain 2017-08-25 15:47:04

Keep it simple: take a list as input and append it as a row to the DataFrame:

import pandas as pd  
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))  
for i in range(5):  
    res_list = list(map(int, input().split()))  
    res = res.append(pd.Series(res_list,index=['lib','qty1','qty2']), ignore_index=True)

@Nasser Al-Wohaibi 2014-04-30 17:31:04

For efficient appending see How to add an extra row to a pandas dataframe and Setting With Enlargement.

Add rows through loc on a non-existent index key (the original text also mentions ix, which was removed in pandas 1.0), e.g.:

In [1]: se = pd.Series([1,2,3])

In [2]: se
Out[2]: 
0    1
1    2
2    3
dtype: int64

In [3]: se[5] = 5.

In [4]: se
Out[4]: 
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64

Or:

In [1]: dfi = pd.DataFrame(np.arange(6).reshape(3,2),
   .....:                 columns=['A','B'])
   .....: 

In [2]: dfi
Out[2]: 
   A  B
0  0  1
1  2  3
2  4  5

In [3]: dfi.loc[:,'C'] = dfi.loc[:,'A']

In [4]: dfi
Out[4]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4

In [5]: dfi.loc[3] = 5

In [6]: dfi
Out[6]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5

@Guilherme Felipe Reis 2019-02-21 15:38:49

The user asked how to add a new row. Here we see how to add a row at a defined index or add a column.

@PirateApp 2019-03-06 17:15:52

any benchmarks on how this works out compared to the dict method

@waterproof 2019-07-25 16:41:39

this is not efficient as it actually copies the entire DataFrame.

@qed 2016-11-11 18:18:09

Another way to do it (probably not very performant):

# add a row
def add_row(df, row):
    colnames = list(df.columns)
    ncol = len(colnames)
    assert ncol == len(row), "Length of row must be the same as width of DataFrame: %s" % row
    return df.append(pd.DataFrame([row], columns=colnames))

You can also enhance the DataFrame class like this:

import pandas as pd
def add_row(self, row):
    self.loc[len(self.index)] = row
pd.DataFrame.add_row = add_row

@Jack Daniel 2016-07-18 09:54:03

Create a new record (a DataFrame) and add it to old_data_frame.
Pass a list of values and the corresponding column names to create new_record (a DataFrame):

new_record = pd.DataFrame([[0,'abcd',0,1,123]],columns=['a','b','c','d','e'])

old_data_frame = pd.concat([old_data_frame,new_record])

@user3250815 2016-07-13 09:49:18

This is not an answer to the OP's question but a toy example to illustrate the answer of @ShikharDua above, which I found very useful.

While this fragment is trivial, in the actual data I had 1,000s of rows and many columns, and I wished to be able to group by different columns and then perform the stats below for more than one target column. So having a reliable method for building the data frame one row at a time was a great convenience. Thank you @ShikharDua!

import pandas as pd 

BaseData = pd.DataFrame({ 'Customer' : ['Acme','Mega','Acme','Acme','Mega','Acme'],
                          'Territory'  : ['West','East','South','West','East','South'],
                          'Product'  : ['Econ','Luxe','Econ','Std','Std','Econ']})
BaseData

columns = ['Customer','Num Unique Products', 'List Unique Products']

rows_list=[]
for name, group in BaseData.groupby('Customer'):
    RecordtoAdd={} #initialise an empty dict 
    RecordtoAdd.update({'Customer' : name}) #
    RecordtoAdd.update({'Num Unique Products' : len(pd.unique(group['Product']))})      
    RecordtoAdd.update({'List Unique Products' : pd.unique(group['Product'])})                   

    rows_list.append(RecordtoAdd)

AnalysedData = pd.DataFrame(rows_list)

print('Base Data : \n',BaseData,'\n\n Analysed Data : \n',AnalysedData)

@W.P. McNeill 2016-02-23 16:43:07

You can append a single row as a dictionary using the ignore_index option.

>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
  Animal Color
0    cow  blue
1  horse   red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
  Animal  Color
0    cow   blue
1  horse    red
2  mouse  black

@Blairg23 2016-05-28 03:57:05

You might also mention that f.append(<stuff>) creates a new object, rather than simply appending to the current object in place, so if you're trying to append to a dataframe in a script, you need to say f = f.append(<stuff>)

@lol 2016-11-08 03:48:15

is there a way to do this in place?

@waterproof 2019-07-25 16:42:48

@lol no. see github.com/pandas-dev/pandas/issues/2801 - the underlying arrays can't be extended so they have to be copied.

@Lydia 2015-06-24 21:06:31

import pandas as pd

mycolumns = ['A', 'B']
df = pd.DataFrame(columns=mycolumns)
rows = [[1,2],[3,4],[5,6]]
for row in rows:
    df.loc[len(df)] = row

@jhin 2016-03-09 00:00:04

This! I've been searching for quite a while, and this is the first post that really shows how to assign particular values to a row! Bonus question: Which is the syntax for column-name/value pairs? I guess it must be something using a dict, but I can't seem to get it right.
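On the bonus question about column-name/value pairs: one option that should work is to assign a pandas Series built from a dict, since a Series is aligned on its labels rather than on position. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2]})

# The Series labels are matched to the column names,
# so the order of the keys does not matter.
df.loc[len(df)] = pd.Series({"B": 4, "A": 3})
print(df)
```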

@waterproof 2019-07-25 16:42:15

this is not efficient as it actually copies the entire DataFrame when you extend it.

@FooBar 2014-07-23 14:21:45

If you know the number of entries ex ante, you should preallocate the space by also providing the index (taking the data example from a different answer):

import pandas as pd
import numpy as np
# we know we're gonna have 5 rows of data
numberOfRows = 5
# create dataframe
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('lib', 'qty1', 'qty2') )

# now fill it up row by row
for x in np.arange(0, numberOfRows):
    #loc or iloc both work here since the index is natural numbers
    df.loc[x] = [np.random.randint(-1,1) for n in range(3)]
In[23]: df
Out[23]: 
   lib  qty1  qty2
0   -1    -1    -1
1    0     0     0
2   -1     0    -1
3    0    -1     0
4   -1     0     0

Speed comparison

In[30]: %timeit tryThis() # function wrapper for this answer
1000 loops, best of 3: 1.23 ms per loop
In[31]: %timeit tryOther() # function wrapper without index (see, for example, @fred)
100 loops, best of 3: 2.31 ms per loop

And, as noted in the comments, with a size of 6000 the speed difference becomes even larger:

Increasing the size of the array (12) and the number of rows (500) makes the speed difference more striking: 313ms vs 2.29s
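The tryThis and tryOther wrappers are not shown in the answer. A hypothetical reconstruction, under different names, might look like the sketch below; the row values and sizes are placeholders, and the timings will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

COLUMNS = ["lib", "qty1", "qty2"]

def fill_preallocated(n_rows):
    # Index supplied up front, so every row label already exists.
    df = pd.DataFrame(index=np.arange(n_rows), columns=COLUMNS)
    for x in range(n_rows):
        df.loc[x] = [x, 2 * x, 3 * x]
    return df

def fill_by_enlargement(n_rows):
    # No index given: each assignment adds a new row label.
    df = pd.DataFrame(columns=COLUMNS)
    for x in range(n_rows):
        df.loc[x] = [x, 2 * x, 3 * x]
    return df

# Time a few calls of each variant on a small frame.
for func in (fill_preallocated, fill_by_enlargement):
    seconds = timeit.timeit(lambda: func(100), number=3)
    print(f"{func.__name__}: {seconds / 3:.4f} s per call")
```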

@ely 2014-10-09 18:32:02

Great answer. This should be the norm so that row space doesn't have to allocated incrementally.

@Tickon 2015-04-02 10:55:51

Increasing the size of the array(12) and the number of rows(500) makes the speed difference more striking: 313ms vs 2.29s

@NPE 2012-05-23 08:14:43

You could use pandas.concat() or DataFrame.append(). For details and examples, see Merge, join, and concatenate.

@notilas 2014-08-20 22:52:43

Hi, so what is the answer for the methods using append() or concat()? I have the same problem and am still trying to figure it out.

@patapouf_ai 2016-05-12 22:17:10

append doesnt work for me in python3.4

@jwg 2016-05-18 14:34:20

This is the right answer, but it isn't a very good answer (almost link only).

@Ken Williams 2017-03-16 16:03:54

I think @fred's answer is more correct. IIUC the problem with this answer is that it needlessly copies the entire DataFrame every time a row is appended. Using the .loc mechanism that can be avoided, especially if you're careful.

@StayFoolish 2017-09-08 12:46:38

But if you want to use DataFrame.append(), you have to make sure your row data is also a DataFrame in the first place, not a list.
