By Rajkumar Kumawat


2014-09-21 17:46:43 8 Comments

I am trying to read a large CSV file (approx. 6 GB) in pandas and I am getting the following memory error:

MemoryError                               Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
    450                     infer_datetime_format=infer_datetime_format)
    451 
--> 452         return _read(filepath_or_buffer, kwds)
    453 
    454     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    242         return parser
    243 
--> 244     return parser.read()
    245 
    246 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    693                 raise ValueError('skip_footer not supported for iteration')
    694 
--> 695         ret = self._engine.read(nrows)
    696 
    697         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1137 
   1138         try:
-> 1139             data = self._reader.read(nrows)
   1140         except StopIteration:
   1141             if nrows is None:

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader.read (pandas\parser.c:7145)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7369)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._read_rows (pandas\parser.c:8194)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_column_data (pandas\parser.c:9402)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_tokens (pandas\parser.c:10057)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10361)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser._try_int64 (pandas\parser.c:17806)()

MemoryError: 

Any help on this?

13 comments

@Humpe 2018-10-24 08:40:25

You can read the data in chunks and save each chunk as a pickle file.

import pandas as pd 
import pickle

in_path = "" #Path where the large file is
out_path = "" #Path to save the pickle files to
chunk_size = 400000 #size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path,sep=separator,chunksize=chunk_size, 
                    low_memory=False)    


for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i+1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk,f,pickle.HIGHEST_PROTOCOL)

In the next step you read the pickles back in and combine them into your desired dataframe.

import glob
pickle_path = "" #Same Path as out_path i.e. where the pickle files are

data_p_files = []
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)

# build the DataFrame with a single concat instead of appending inside the loop
df = pd.concat([pd.read_pickle(name) for name in data_p_files], ignore_index=True)

@jpp 2019-02-19 09:15:51

If your final df fits entirely in memory (as implied) and contains the same amount of data as your input, surely you don't need to chunk at all?

@jonathask 2019-05-27 06:12:25

Here follows an example:

chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>, iterator=True, low_memory=False):

    #REPLACING BLANK SPACES IN COLUMN NAMES FOR SQL OPTIMIZATION
    chunk = chunk.rename(columns = {c: c.replace(' ', '') for c in chunk.columns})

    #YOU CAN EITHER: 
    #1)BUFFER THE CHUNKS IN ORDER TO LOAD YOUR WHOLE DATASET 
    chunkTemp.append(chunk)

    #2)DO YOUR PROCESSING OVER A CHUNK AND STORE THE RESULT OF IT
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]   
    #BUFFERING PROCESSED DATA
    queryTemp.append(query)

#!  NEVER DO pd.concat OR pd.DataFrame() INSIDE A LOOP
print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

#CONCATENATING PROCESSED DATA
query = pd.concat(queryTemp)
print(query)

@Jaskaran 2019-04-11 04:57:13

In case someone is still looking for something like this, I found that a new library called modin can help. It uses distributed computing to parallelize the read. Here's a nice article comparing its functionality with pandas. It essentially uses the same functions as pandas.

import modin.pandas as pd
pd.read_csv(CSV_FILE_NAME)

@jpp 2019-04-12 18:17:38

Can you comment on how this new module modin compares with the well-established dask.dataframe? For example, see move from pandas to dask to utilize all local cpu cores.

@unutbu 2014-09-21 17:54:03

The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at one time, one way to avoid the problem would be to process the CSV in chunks (by specifying the chunksize parameter):

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

The chunksize parameter specifies the number of rows per chunk. (The last chunk may contain fewer than chunksize rows, of course.)
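
For instance, a minimal sketch of what process(chunk) might look like if you only need a per-chunk aggregate (the 'rf' column name is borrowed from the question author's own follow-up answer; adapt it to your file):

import pandas as pd

chunksize = 10 ** 6
partial_sums = []  # keep only the reduced result of each chunk, not the chunk itself

for chunk in pd.read_csv('aphro.csv', sep=';', chunksize=chunksize):
    partial_sums.append(chunk['rf'].sum())

total = sum(partial_sums)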

@Jeff 2014-09-21 17:57:09

You generally need 2X the final memory to read something in (from CSV; other formats are better at having lower memory requirements). FYI, this is true for trying to do almost anything all at once. It is much better to chunk it (which has constant memory usage).

@altabq 2016-02-17 16:02:44

By process(chunk) you mean creating an empty DF before and then appending chunk to it in the for loop, like DF.append(chunk)?

@unutbu 2016-02-17 18:29:43

@altabq: The problem here is that we don't have enough memory to build a single DataFrame holding all the data. The solution above tries to cope with this situation by reducing the chunks (e.g. by aggregating or extracting just the desired information) one chunk at a time -- thus saving memory. Whatever you do, DO NOT call DF.append(chunk) inside the loop. That will use O(N^2) copying operations. It is better to append the aggregated data to a list, and then build the DataFrame from the list with one call to pd.DataFrame or pd.concat (depending on the type of aggregated data).

@unutbu 2016-02-17 18:33:18

@altabq: Calling DF.append(chunk) in a loop requires O(N^2) copying operations where N is the size of the chunks, because each call to DF.append returns a new DataFrame. Calling pd.DataFrame or pd.concat once outside the loop reduces the amount of copying to O(N).
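
A short, hedged sketch of the pattern described in these comments (collect per-chunk results in a list, concatenate once at the end):

import pandas as pd

pieces = []
for chunk in pd.read_csv('aphro.csv', sep=';', chunksize=10 ** 6):
    # ideally reduce the chunk here (filter rows, aggregate, drop columns) before keeping it
    pieces.append(chunk)

# a single concat outside the loop: O(N) copying instead of O(N^2)
df = pd.concat(pieces, ignore_index=True)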

@Pyderman 2016-05-11 17:34:52

Should chunk and row be seen as one and the same, in these reads?

@Pyderman 2016-05-11 18:04:27

@unutbu The pandas documentation gives the example of a small CSV, and with a chunk size of 4, 4 rows get read in per chunk. I'm wondering if this 1-to-1 correspondence holds true for a CSV with millions of rows. (I also have a 6GB CSV that I'm currently trying to read with a chunk size of 100000, and it's taking its time.)

@unutbu 2016-05-11 18:06:39

@Pyderman: Yes, the chunksize parameter refers to the number of rows per chunk. The last chunk may contain fewer than chunksize rows, of course.

@unutbu 2016-05-11 18:18:23

@Pyderman: You might want to test how long it takes to consume the iterator: for df in pd.read_csv(..., chunksize=10**5): pass without doing any processing. If that succeeds quickly, then you know it is something else inside the loop which is taking a lot of time. (Never call result_df = result_df.append(df) inside the loop, for example, since that requires quadratic copying.)
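
For example, a quick sketch of timing the bare iterator (no processing inside the loop):

import time
import pandas as pd

start = time.time()
for df in pd.read_csv('aphro.csv', sep=';', chunksize=10 ** 5):
    pass  # no processing: measures pure read/parse time
print("elapsed: {:.1f} s".format(time.time() - start))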

@Pyderman 2016-05-11 18:24:36

@unutbu Thanks, yes. Taking the advice from your earlier comment, I have created a list rather than a DataFrame, and I am appending each chunk to this list in each iteration (with a view to creating a DataFrame from the resulting list when all is done). Nothing else is done within the loop. This is what is running right now. That's the approach you were suggesting, correct?

@unutbu 2016-05-11 18:27:43

@Pyderman: Yes; calling pd.concat([list_of_dfs]) once after the loop is much faster than calling pd.concat or df.append many times within the loop. Of course, you'll need a considerable amount of memory to hold the entire 6GB csv as one DataFrame.

@CodeGeek123 2017-01-09 16:33:55

@unutbu: How would you handle this if you did not have enough memory to hold 6 GB of data? I have PyTables for storing to disk, but if I had to append it all to a DataFrame before that storage happens, it would blow memory. Any suggestions?

@unutbu 2017-01-09 20:28:05

@CodeGeek123: If the chunks can be processed independently, you could use HDFStore.append to build the pytables result accretively. You might find some useful example code here too.
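
A hedged sketch of that idea, assuming each chunk can be written independently (the file name and the 'data' key are made up; requires PyTables to be installed):

import pandas as pd

# append each chunk to an on-disk HDF5 store instead of an in-memory DataFrame
with pd.HDFStore('aphro.h5', mode='w') as store:
    for chunk in pd.read_csv('aphro.csv', sep=';', chunksize=10 ** 6):
        store.append('data', chunk, data_columns=True)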

@Carl 2017-09-05 08:17:39

Just following on from Jeff's comment: in my case, I need more than ~3x memory. I'm loading a 19 GB CSV on a 64-bit OS and have 64 GB of RAM, the vast majority of which is free. Using read_csv, I rapidly run out of memory and my M2 SSD swap file starts being used. After about 10-15 minutes, my screen goes black and I have to reboot. I'm definitely not overheating. I believe the memory overheads are due to pandas' column type detection, and my file has 100+ columns.

@Carl 2017-09-05 13:08:59

A follow-up point: there is an option to load only certain columns (usecols), which reduces both the CPU and memory overheads associated with dtype detection; I believe these are quite substantial. Therefore, if you don't need all the columns in the CSV, this could also solve your problem.

@blacksheep 2018-12-05 08:25:26

Solution 1:

Using pandas with large data

Solution 2:

TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList,sort=False)

@debaonline4u 2018-12-20 09:08:04

Here again we are loading the entire 6 GB file into memory. Are there any options to process the current chunk and then read the next chunk?

@gokul_uf 2018-12-20 14:13:03

Just don't do dfList.append; process each chunk (df) separately.

@citynorman 2018-10-14 22:44:23

In addition to the answers above, for those who want to process a CSV and then export to CSV, Parquet or SQL, d6tstack is another good option. You can load multiple files, and it deals with data schema changes (added/removed columns). Chunked out-of-core support is already built in.

import glob
import d6tstack.combine_csv

def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply, sep=',', chunksize=1e6)

# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:[email protected]/db', 'tablename') # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:[email protected]/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:[email protected]/db', 'tablename') # slow but flexible

@Simbarashe Timothy Motsi 2018-04-17 11:21:24

For large data I recommend you use the library "dask", e.g.:

# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')
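
Note that dask operations are lazy: nothing is read until you ask for a result. A rough sketch, using column names from the question author's own follow-up answer (adjust them to your file):

import dask.dataframe as dd

df = dd.read_csv('aphro.csv', sep=';')   # lazy: builds a task graph, reads nothing yet
result = df.groupby('lat')['rf'].sum()   # still lazy
print(result.compute())                  # triggers the chunked, parallel read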

@PirateApp 2018-04-21 11:38:25

Any benefits over pandas? Would appreciate a few more pointers.

@Simbarashe Timothy Motsi 2018-04-23 07:42:03

I haven't used Dask for very long, but the main advantages in my use cases were that Dask can run in parallel on multiple machines, and it can also fit data into memory as slices.

@PirateApp 2018-04-28 12:30:32

Thanks! Is dask a replacement for pandas, or does it work on top of pandas as a layer?

@Simbarashe Timothy Motsi 2018-04-28 12:35:53

You're welcome. It works as a wrapper for NumPy, pandas, and scikit-learn.

@jpp 2018-01-23 17:45:01

Chunking shouldn't always be the first port of call for this problem.

1. Is the file large due to repeated non-numeric data or unwanted columns?

If so, you can sometimes see massive memory savings by reading columns in as categories and selecting the required columns via the pd.read_csv usecols parameter (see the sketch after this list).

2. Does your workflow require slicing, manipulating, exporting?

If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.

3. If all else fails, read line by line via chunks.

Chunk via pandas, or via the csv library as a last resort.
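
As a rough sketch of point 1 (the file name, column names and dtypes here are hypothetical, just to illustrate the idea):

import pandas as pd

# read only the columns you need, and store repeated strings as categories
df = pd.read_csv(
    'large.csv',
    usecols=['station', 'value', 'date'],
    dtype={'station': 'category', 'value': 'float32'},
    parse_dates=['date'],
)
print(df.memory_usage(deep=True))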

@paulg 2017-11-13 05:34:54

If you use pandas to read a large file in chunks and then yield them one by one, here is what I have done:

import pandas as pd

def chunk_generator(filename, chunk_size=10 ** 5):
    # yield the file one chunk (DataFrame) at a time
    for chunk in pd.read_csv(filename, delimiter=',', iterator=True,
                             chunksize=chunk_size, parse_dates=[1]):
        yield chunk

def _generator(filename, chunk_size=10 ** 5):
    # thin wrapper that re-yields each chunk
    for chunk in chunk_generator(filename, chunk_size=chunk_size):
        yield chunk

if __name__ == "__main__":
    filename = r'file.csv'
    for chunk in _generator(filename):
        print(chunk)

@Rajkumar Kumawat 2014-09-24 12:46:23

I proceeded like this:

chunks = pd.read_table('aphro.csv', chunksize=1000000, sep=';',
                       names=['lat', 'long', 'rf', 'date', 'slno'],
                       index_col='slno', header=None, parse_dates=['date'])

%time df = pd.concat(chunk.groupby(['lat', 'long', chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)
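
Note that if the same (lat, long, year) group appears in more than one chunk, the concat above keeps the partial sums as separate rows; a final aggregation over the index levels merges them:

# merge partial per-chunk sums in case a group spans chunk boundaries
df = df.groupby(level=[0, 1, 2]).sum()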

@Pyderman 2016-05-09 22:06:52

Is there a reason you switched from read_csv to read_table?

@FaCoffee 2017-10-25 08:54:36

Got an error that says: "TypeError: ord() expected a character, but string of length 0 found". What does it mean?

@Tyrion W 2017-04-26 15:02:38

The functions read_csv and read_table are almost the same, but you must assign the delimiter "," when you use the function read_table in your program.

def get_from_action_data(fname, chunk_size=100000):
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")

    df_ac = pd.concat(chunks, ignore_index=True)
    return df_ac

@nate_weldon 2017-04-26 15:44:54

It would help if you stated what your question is in this post. Like "What is the difference between read_csv and read_table?" or "Why does read_table need a delimiter?"

@Naufal 2018-06-10 18:46:29

It depends on how your file looks. Some files have common delimiters such as "," or "|" or "\t", but you may see other files with delimiters such as 0x01, 0x02 (making this one up), etc. So read_table is more suited to uncommon delimiters, but read_csv can do the same job just as well.
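
For instance, a hedged sketch of reading a file with an unusual single-character delimiter (the file name and delimiter are made up):

import pandas as pd

# read_csv handles uncommon single-character delimiters via sep
df = pd.read_csv('exotic.dat', sep='\x01', header=None)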

@NunodeSousa 2017-01-07 13:22:50

You can try sframe, which has the same syntax as pandas but allows you to manipulate files that are bigger than your RAM.

@Danny Wang 2017-09-06 15:06:27

"The data in SFrame is stored column-wise on the GraphLab Server side" is it a service or a package?

@NunodeSousa 2017-10-01 18:55:32

It is a package. See the webpage: turi.com

@PlagTag 2014-09-23 08:44:49

The answer above already covers the topic. Anyway, if you need all the data in memory, have a look at bcolz. It compresses the data in memory. I have had really good experience with it, but it's missing a lot of pandas features.

Edit: I got compression rates of around 1/10 of the original size, I think, depending of course on the kind of data. Important missing features were aggregates.

@smci 2016-11-26 15:14:43

Please improve this answer by telling us a) what compression ratios you get and b) what main features of pandas it's missing? Can it handle NAs? strings? categoricals? dates?

@PlagTag 2016-11-29 15:08:35

@smci Done, have a look please.

@smci 2016-11-29 15:23:47

Huh? Can it handle NAs? strings? categoricals? dates? These are the things that make pandas csv reading slow and flabby. NAs and objects like strings (even short ones) are a killer. Btw the .ipynb referenced from your blog is down.

@PlagTag 2016-11-30 10:59:34

@smci I was reading your note, but I suggest you have a look at the docs. I would need to read them myself.

@smci 2016-11-30 15:47:40

I read the docs twice, they make no mention of any of this, which is precisely why I asked. I strongly suspect it can't handle NAs, and probably not strings, Date or other objects.

@PlagTag 2016-12-07 16:11:59

AFAIK it can't handle that. It's a compressed structure, which makes handling dates (objects) not very easy. But you could of course always apply an .eval() on the vector to convert, or use time formats which are numeric.

@smci 2016-12-10 20:28:55

Ok so it can't handle NAs, strings, or dates. I doubt it can handle floats either.

@RockJake28 2017-06-19 11:32:36

I suppose you could preprocess with pandas using the chunks method mentioned, then use bcolz if you need all the data in memory to do an analysis. Just a thought.
