By 384X21


2011-11-04 13:26:29 8 Comments

I want to iterate over each line of an entire file. One way to do this is to read the entire file, save it to a list, and then iterate over the lines of interest. This method uses a lot of memory, so I am looking for an alternative.

My code so far:

for each_line in fileinput.input(input_file):
    do_something(each_line)

    for each_line_again in fileinput.input(input_file):
        do_something(each_line_again)

Executing this code gives an error message: device active.

Any suggestions?

The purpose is to calculate pair-wise string similarity, meaning that for each line in the file, I want to calculate the Levenshtein distance with every other line.

10 comments

@Katriel 2011-11-04 13:46:44

The correct, fully Pythonic way to read a file is the following:

with open(...) as f:
    for line in f:
        # Do something with 'line'

The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered I/O and memory management so you don't have to worry about large files.

There should be one -- and preferably only one -- obvious way to do it.

@Simon Bergot 2011-11-04 13:54:38

yep, this is the best version with python 2.6 and above

@jldupont 2011-11-04 14:30:29

I personally prefer generators & coroutines for dealing with data pipelines.

@mfcabrera 2013-12-18 14:48:04

what would be the best strategy if the file is a huge text file with only one line, and the idea is to process words?

@Katriel 2013-12-20 16:25:07

@mfcabrera you will have to deal with blocks yourself, in that case: Python only buffers lines. See the other answers.

@Jorge Vidinha 2014-05-06 21:40:13

@mfcabrera any pointers to examples for parsing huge one-line text files? What does "deal with blocks" mean?

@mfcabrera 2014-05-07 15:51:59

@JorgeVidinha This is how I end up doing it: gist.github.com/mfcabrera/14015179cdfd2dd2a2fa
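For anyone who cannot follow the gist link: below is a minimal sketch of the chunked approach (Python 3; the function name and chunk size are my own, not from the gist). It reads fixed-size pieces and carries any partial trailing word over to the next read, so memory use stays bounded no matter how long the single line is:

```python
import io

def words_from_file(fileobj, chunk_size=4096):
    """Yield whitespace-separated words from a file of any size,
    reading at most chunk_size characters at a time."""
    leftover = ''
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parts = (leftover + chunk).split()
        if chunk[-1].isspace():
            # Chunk ended on whitespace: every part is a complete word.
            leftover = ''
        else:
            # The last part may be an incomplete word; save it for later.
            leftover = parts.pop() if parts else ''
        yield from parts
    if leftover:
        yield leftover

# Works the same on a real file object from open(path).
print(list(words_from_file(io.StringIO("the quick brown fox"), chunk_size=5)))
# → ['the', 'quick', 'brown', 'fox']
```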

@haccks 2015-02-09 13:56:27

Could someone explain how for line in f: works? I mean, how is iterating over a file object possible?

@Katriel 2015-02-10 12:14:52

If you iterate over an object, Python looks up in the list of object methods a special one called __iter__, which tells it what to do. File objects define this special method to return an iterator over the lines. (Roughly.)
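The protocol can be poked at directly; this small illustration uses an in-memory file (io.StringIO) as a stand-in for an open text file:

```python
import io

f = io.StringIO("first\nsecond\nthird\n")  # behaves like an open text file

it = iter(f)      # calls f.__iter__(); file objects return themselves
print(it is f)    # True: a file object is its own iterator
print(next(it))   # 'first\n' -- each next() reads one more line
print(next(it))   # 'second\n'
# A for loop is just iter() plus repeated next() until StopIteration.
```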

@David Tamrazov 2015-11-13 20:44:30

Just a question regarding this: does the above code return the "line" as a string? As in, could I then do something like myString = line and then do something to myString?

@Sardonic 2016-01-23 23:49:03

I have a quick question about the for loop: does it automatically optimize by reading data in chunks instead of one line at a time, so that it doesn't go out to the file on disk too often?

@Sardonic 2016-02-03 22:53:28

OK, in case anyone had the same question I did: it is not the for loop that takes care of the optimization. The basic unit of file reading is a page, whose size varies by architecture. So even a single readline() call will read more than one line, pulling a whole chunk (page) from disk into RAM.
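Indeed, CPython's buffered I/O layer reads from the OS in blocks; the default block size is exposed in the standard library (a quick check, not specific to any particular file):

```python
import io

# The binary buffered reader pulls data from the OS in blocks of roughly
# this many bytes, so a single readline() rarely touches the disk.
print(io.DEFAULT_BUFFER_SIZE)  # typically 8192 on CPython

# The buffer can also be tuned per file if needed, e.g.:
# open('big.txt', 'rb', buffering=1024 * 1024)  # hypothetical 1 MiB buffer
```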

@Antonio López Ruiz 2016-07-24 04:35:15

When you try to split it to get a part, it only takes the first split. Any suggestions??

@TheCrazyProfessor 2017-06-01 16:12:08

why not just f = open()

@akn 2017-10-21 18:14:41

Is there a way to know when I'm on the last line in that solution?

@Katriel 2017-10-25 20:11:12

No, you won't know if you're on the last line until you try to read a line and there isn't one (someone might even write a new line to the file while you are still reading it). You could add an else to the loop to do something after the last line though!
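For reference, the for ... else idiom mentioned here looks like this (a toy sketch with made-up data, not from the original answer):

```python
import io

f = io.StringIO("alpha\nbeta\ngamma\n")  # stands in for an open file

for line in f:
    print("processing:", line.rstrip('\n'))
else:
    # Runs once the iterator is exhausted, i.e. after the last line,
    # but NOT if the loop was left early with `break`.
    print("reached end of file")
```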

@Dave 2017-12-04 21:59:51

With Python 2.7, opening a file with newlines that end in \r, I had to open('file.txt','rU') to enable universal newline support.

@Melvin 2018-01-09 06:38:30

Is it just me or is this answer no longer helpful, considering the edit in the original question? Running a with loop for every line in file means opening and screening through a (large) file for every line in that file. This is hopelessly inefficient.

@hajef 2019-06-24 10:31:49

@Melvin I don't think, that's what the answer suggests. As I read it, the final code would look like this with open(args*) as f:\n for line in f:\n for line in f:\n do_stuff(). I can't format it in a comment but you get the idea.

@Bob Stein 2015-09-15 15:07:52

To strip newlines:

with open(file_path, 'rU') as f:
    for line_terminated in f:
        line = line_terminated.rstrip('\n')
        ...

With universal newline support all text file lines will seem to be terminated with '\n', whatever the terminators in the file, '\r', '\n', or '\r\n'.

EDIT - To specify universal newline support:

  • Python 2 on Unix - open(file_path, mode='rU') - required [thanks @Dave]
  • Python 2 on Windows - open(file_path, mode='rU') - optional
  • Python 3 - open(file_path, newline=None) - optional

The newline parameter is only supported in Python 3 and defaults to None. The mode parameter defaults to 'r' in all cases. The U is deprecated in Python 3. In Python 2 on Windows some other mechanism appears to translate \r\n to \n.
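This can be verified with a short experiment (Python 3; the temp-file plumbing is mine): write raw bytes containing all three terminators, then read the file back with newline=None.

```python
import os
import tempfile

# Write raw bytes containing all three line terminators.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b"unix\nwindows\r\nold-mac\rend")

# newline=None (the default) enables universal newline translation:
# the reader sees every terminator as '\n'.
with open(path, newline=None) as f:
    print(f.readlines())  # → ['unix\n', 'windows\n', 'old-mac\n', 'end']

os.remove(path)
```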


To preserve native line terminators:

with open(file_path, 'rb') as f:
    for line_native_terminated in f:
        ...

Binary mode still lets you iterate the file line by line with for ... in. Each line will keep whatever terminators it has in the file.

Thanks to @katrielalex's answer, Python's open() doc, and iPython experiments.

@Dave 2017-12-04 21:57:23

On Python 2.7 I had to open(file_path, 'rU') to enable universal newlines.

@Anurag Misra 2017-08-24 07:02:13

The best way to read a large file line by line is to use Python's built-in enumerate function:

with open(file_name, "rU") as read_file:
    for i, row in enumerate(read_file, 1):
        # do something
        # i is the line number (starting at 1)
        # row contains the full text of that line

@fuyas 2017-09-22 10:47:50

Why is using enumerate any better? The only benefit over the accepted answer is that you get an index, which OP doesn't need and you are making the code less readable.

@Srikar Appalaraju 2011-11-04 13:31:42

Two memory-efficient ways, in ranked order (first is best):

  1. use of with - supported from python 2.5 and above
  2. use of yield if you really want to have control over how much to read

1. use of with

with is the nice and efficient Pythonic way to read large files. Advantages: 1) the file object is automatically closed after exiting the with block; 2) exceptions raised inside the with block are handled; 3) the for loop iterates through the file object f line by line; internally it does buffered IO (to optimize costly IO operations) and memory management.

with open("x.txt") as f:
    for line in f:
        do_something(line)

2. use of yield

Sometimes one might want more fine-grained control over how much to read in each iteration. In that case use iter & yield. Note that with this method one explicitly needs to close the file at the end.

def readInChunks(fileObj, chunkSize=2048):
    """
    Lazy function to read a file piece by piece.
    Default chunk size: 2kB.
    """
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

f = open('bigFile')
for chunk in readInChunks(f):
    do_something(chunk)
f.close()
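As an aside on the iter mentioned above: its two-argument form, iter(callable, sentinel), can replace the whole generator for fixed-size chunking. A sketch (not from the original answer) using an in-memory file for illustration:

```python
import io
from functools import partial

f = io.StringIO("abcdefghij" * 3)  # stands in for open('bigFile')

# Call f.read(8) repeatedly until it returns '' (the end-of-file sentinel).
for chunk in iter(partial(f.read, 8), ''):
    print(repr(chunk))
f.close()
```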

Pitfalls, and for the sake of completeness: the methods below are not as good or as elegant for reading large files, but please read them to get a more rounded understanding.

In Python, the most common way to read lines from a file is to do the following:

for line in open('myfile','r').readlines():
    do_something(line)

When this is done, however, the readlines() function (the same applies to the read() function) loads the entire file into memory, then iterates over it. A slightly better approach for large files (the two methods mentioned first are still the best) is to use the fileinput module, as follows:

import fileinput

for line in fileinput.input(['myfile']):
    do_something(line)

The fileinput.input() call reads lines sequentially, but doesn't keep them in memory after they've been read; this works because file objects in Python are iterable.

References

  1. Python with statement

@Katriel 2011-11-04 13:44:33

-1 It's basically never a good idea to do for line in open(...).readlines(): <do stuff>. Why would you?! You've just lost all the benefit of Python's clever buffered iterator IO for no benefit.

@Katriel 2011-11-04 14:21:16

@Srikar: there is a time and a place for giving all the possible solutions to a problem; teaching a beginner how to do file input is not it. Having the correct answer buried at the bottom of a long post full of wrong answers does not good teaching make.

@Katriel 2011-11-04 14:23:03

@Srikar: You could make your post significantly better by putting the right way at the top, then mentioning readlines and explaining why it's not a good thing to do (because it reads the file into memory), then explaining what the fileinput module does and why you might want to use it over the other methods, then explaining how chunking the file makes the IO better and giving an example of the chunking function (but mentioning that Python does this already for you so you don't need to). But just giving five ways to solve a simple problem, four of which are wrong in this case, is not good.

@m000 2013-09-20 22:03:38

Whatever you add for the sake of completeness, add it last, not first. First show the proper way.

@Srikar Appalaraju 2013-09-21 08:03:47

@katrielalex revisited my answer & found that it warrants restructuring. I can see how the earlier answer could cause confusion. Hopefully this would make it clear for future users.

@Davide Brunato 2017-07-24 12:13:27

Maybe a "1 + 2" solution? I mean enclose the "with" statement into a generator function and then yield lines.

@LabGecko 2019-05-27 14:52:59

To the experienced, please note that for a beginner seeing a list of pitfalls can be extremely useful, especially when going over other people's code and running into those same pitfalls. Speaking as one of those beginners, this was useful, as was pandas.pydata.org/pandas-docs/version/0.19/gotchas.html

@Geoffrey Anderson 2017-02-02 16:48:32

Some context up front as to where I am coming from. Code snippets are at the end.

When I can, I prefer to use an open source tool like H2O to do super high performance parallel CSV file reads, but this tool is limited in feature set. I end up writing a lot of code to create data science pipelines before feeding to H2O cluster for the supervised learning proper.

I have been reading files like 8GB HIGGS dataset from UCI repo and even 40GB CSV files for data science purposes significantly faster by adding lots of parallelism with the multiprocessing library's pool object and map function. For example clustering with nearest neighbor searches and also DBSCAN and Markov clustering algorithms requires some parallel programming finesse to bypass some seriously challenging memory and wall clock time problems.

I usually like to break the file row-wise into parts using gnu tools first and then glob-filemask them all to find and read them in parallel in the python program. I use something like 1000+ partial files commonly. Doing these tricks helps immensely with processing speed and memory limits.

The pandas dataframe.read_csv is single threaded, so you can use these tricks to make pandas considerably faster by running a map() for parallel execution. You can use htop to see that with plain old sequential pandas dataframe.read_csv, 100% CPU on just one core is the actual bottleneck in pd.read_csv, not the disk.

I should add I'm using an SSD on fast video card bus, not a spinning HD on SATA6 bus, plus 16 CPU cores.

Also, another technique that I discovered works great in some applications is parallel CSV file reads all within one giant file, starting each worker at different offset into the file, rather than pre-splitting one big file into many part files. Use python's file seek() and tell() in each parallel worker to read the big text file in strips, at different byte offset start-byte and end-byte locations in the big file, all at the same time concurrently. You can do a regex findall on the bytes, and return the count of linefeeds. This is a partial sum. Finally sum up the partial sums to get the global sum when the map function returns after the workers finished.

Following is some example benchmarks using the parallel byte offset trick:

I use 2 files: HIGGS.csv is 8 GB; it is from the UCI machine learning repository. all_bin.csv is 40.4 GB and is from my current project. I use 2 programs: the GNU wc program, which comes with Linux, and the pure Python fastread.py program which I developed.

HP-Z820:/mnt/fastssd/fast_file_reader$ ls -l /mnt/fastssd/nzv/HIGGS.csv
-rw-rw-r-- 1 8035497980 Jan 24 16:00 /mnt/fastssd/nzv/HIGGS.csv

HP-Z820:/mnt/fastssd$ ls -l all_bin.csv
-rw-rw-r-- 1 40412077758 Feb  2 09:00 all_bin.csv

HP-Z820:/mnt/fastssd$ time python fastread.py --fileName="all_bin.csv" --numProcesses=32 --balanceFactor=2
2367496

real    0m8.920s
user    1m30.056s
sys 2m38.744s

In [1]: 40412077758. / 8.92
Out[1]: 4530501990.807175

That’s some 4.5 GB/s, or 36 Gb/s, of file-slurping speed. That ain’t no spinning hard disk, my friend. That’s actually a Samsung Pro 950 SSD.

Below is the speed benchmark for the same file being line-counted by gnu wc, a pure C compiled program.

What is cool is that you can see my pure Python program essentially matched the speed of the GNU wc compiled C program in this case. Python is interpreted but C is compiled, so this is a pretty interesting feat of speed, I think you would agree. Of course, wc really needs to be changed to a parallel program, and then it would really beat the socks off my Python program. But as it stands today, GNU wc is just a sequential program. You do what you can, and Python can do parallel today. Cython compiling might be able to help me (some other time). Also, memory-mapped files have not been explored yet.

HP-Z820:/mnt/fastssd$ time wc -l all_bin.csv
2367496 all_bin.csv

real    0m8.807s
user    0m1.168s
sys 0m7.636s


HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000

real    0m2.257s
user    0m12.088s
sys 0m20.512s

HP-Z820:/mnt/fastssd/fast_file_reader$ time wc -l HIGGS.csv
11000000 HIGGS.csv

real    0m1.820s
user    0m0.364s
sys 0m1.456s

Conclusion: The speed is good for a pure python program compared to a C program. However, it’s not good enough to use the pure python program over the C program, at least for linecounting purpose. Generally the technique can be used for other file processing, so this python code is still good.

Question: will compiling the regex just once and passing it to all workers improve speed? Answer: regex pre-compiling does NOT help in this application. I suppose the reason is that the overhead of process serialization and creation for all the workers dominates.

One more thing. Does parallel CSV file reading even help? Is the disk the bottleneck, or is it the CPU? Many so-called top-rated answers on stackoverflow contain the common dev wisdom that you only need one thread to read a file, best you can do, they say. Are they sure, though?

Let’s find out:

HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000

real    0m2.256s
user    0m10.696s
sys 0m19.952s

HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=1
11000000

real    0m17.380s
user    0m11.124s
sys 0m6.272s

Oh yes, yes it does. Parallel file reading works quite well. Well there you go!

Ps. In case some of you wanted to know, what if the balanceFactor was 2 when using a single worker process? Well, it’s horrible:

HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=2
11000000

real    1m37.077s
user    0m12.432s
sys 1m24.700s

Key parts of the fastread.py python program:

fileBytes = stat(fileName).st_size  # Read quickly from OS how many bytes are in a text file
startByte, endByte = PartitionDataToWorkers(workers=numProcesses, items=fileBytes, balanceFactor=balanceFactor)
p = Pool(numProcesses)
partialSum = p.starmap(ReadFileSegment, zip(startByte, endByte, repeat(fileName))) # startByte is already a list. fileName is made into a same-length list of duplicates values.
globalSum = sum(partialSum)
print(globalSum)


def ReadFileSegment(startByte, endByte, fileName, searchChar='\n'):  # counts number of searchChar appearing in the byte range
    with open(fileName, 'r') as f:
        f.seek(startByte-1)  # seek is initially at byte 0 and then moves forward the specified amount, so seek(5) points at the 6th byte.
        bytes = f.read(endByte - startByte + 1)
        cnt = len(re.findall(searchChar, bytes)) # findall with implicit compiling runs just as fast here as re.compile once + re.finditer many times.
    return cnt

The def for PartitionDataToWorkers is just ordinary sequential code. I left it out in case someone else wants to get some practice on what parallel programming is like. I gave away for free the harder parts: the tested and working parallel code, for your learning benefit.

Thanks to: The open-source H2O project, by Arno and Cliff and the H2O staff for their great software and instructional videos, which have provided me the inspiration for this pure python high performance parallel byte offset reader as shown above. H2O does parallel file reading using java, is callable by python and R programs, and is crazy fast, faster than anything on the planet at reading big CSV files.

@Geoffrey Anderson 2017-02-02 18:54:48

Parallel chunks is what this is, basically. Also, I expect SSD and Flash are the only compatible storage devices with this technique. Spinning HD is unlikely to be compatible.

@Simon Bergot 2011-11-04 13:33:37

this is a possible way of reading a file in python:

f = open(input_file)
for line in f:
    do_stuff(line)
f.close()

it does not allocate a full list. It iterates over the lines.

@Mast 2016-12-28 14:12:37

While this works, it's definitely not the canonical way. The canonical way is to use a context wrapper, like with open(input_file) as f:. This saves you the f.close() and makes sure you don't accidentally forget to close it. Prevents memory leaks and all, quite important when reading files.

@azuax 2017-01-03 02:17:05

As @Mast said, that is not the canonical way, so downvote for that.

@loxsat 2016-07-30 02:01:01

# Using a text file for the example
with open("yourFile.txt", "r") as f:
    text = f.readlines()
for line in text:
    print line

  • Open your file for reading (r)
  • Read the whole file and save each line into a list (text)
  • Loop through the list, printing each line.

If you want, for example, to check a specific line for a length greater than 10, work with what you already have available.

for line in text:
    if len(line) > 10:
        print line

@ntg 2016-11-24 10:17:07

Not the best for this question, but this code is mainly useful in case what you are looking for is "slurping" (reading the whole file at once). That was my case and Google got me here. +1. Also, for atomicity, or if you do time-consuming processing in the loop, it might end up faster to read the whole file.

@ntg 2016-11-24 10:25:11

Also, improved the code a bit: 1. close is not needed after with (docs.python.org/2/tutorial/inputoutput.html, search for "It is good practice to use the with keyword...") 2. text can be processed after the file is read (outside of the with block...)

@John Haberstroh 2014-10-17 19:39:11

I would strongly recommend not using the default file loading as it is horrendously slow. You should look into the numpy functions and the IOpro functions (e.g. numpy.loadtxt()).

http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html

https://store.continuum.io/cshop/iopro/

Then you can break your pairwise operation into chunks:

import numpy as np
import math

lines_total = n    
similarity = np.zeros((n, n))
lines_per_chunk = m
n_chunks = math.ceil(float(n)/m)
for i in xrange(n_chunks):
    for j in xrange(n_chunks):
        chunk_i = (function of your choice to read lines i*lines_per_chunk to (i+1)*lines_per_chunk)
        chunk_j = (function of your choice to read lines j*lines_per_chunk to (j+1)*lines_per_chunk)
        similarity[i*lines_per_chunk:(i+1)*lines_per_chunk,
                   j*lines_per_chunk:(j+1)*lines_per_chunk] = fast_operation(chunk_i, chunk_j) 

It's almost always much faster to load data in chunks and then do matrix operations on it than to do it element by element!!
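To make the pattern concrete, the runnable sketch below fills in the placeholders with toy data and a trivial stand-in for fast_operation: the absolute difference of line lengths, computed for a whole chunk-by-chunk block at once via NumPy broadcasting. Levenshtein distance itself does not vectorize this simply, but the chunking structure is identical.

```python
import numpy as np

lines = ["apple", "banana", "fig", "cherry", "kiwi", "date"]  # toy data
n = len(lines)
m = 2                               # lines per chunk
n_chunks = -(-n // m)               # ceiling division

lengths = np.array([len(s) for s in lines])
similarity = np.zeros((n, n))

for i in range(n_chunks):
    for j in range(n_chunks):
        chunk_i = lengths[i * m:(i + 1) * m]
        chunk_j = lengths[j * m:(j + 1) * m]
        # Broadcasting fills a whole (chunk x chunk) block in one shot.
        similarity[i * m:(i + 1) * m,
                   j * m:(j + 1) * m] = np.abs(chunk_i[:, None] - chunk_j[None, :])

print(similarity[0, 1])  # |len('apple') - len('banana')| → 1.0
```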

@KevinDTimm 2011-11-04 13:32:05

From the python documentation for fileinput.input():

This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty

further, the definition of the function is:

fileinput.FileInput([files[, inplace[, backup[, mode[, openhook]]]]])

reading between the lines, this tells me that files can be a list so you could have something like:

for each_line in fileinput.input([input_file, input_file]):
  do_something(each_line)

See here for more information

@cfi 2011-11-04 14:09:14

Katrielalex provided the way to open & read one file.

However, the way your algorithm works, it reads the whole file once for each line of the file. That means the overall amount of file reading - and of computing the Levenshtein distance - will be N*N, where N is the number of lines in the file. Since you're concerned about file size and don't want to keep it in memory, I am concerned about the resulting quadratic runtime. Your algorithm is in the O(n^2) class of algorithms, which can often be improved with specialization.

I suspect that you already know the tradeoff of memory versus runtime here, but maybe you would want to investigate if there's an efficient way to compute multiple Levenshtein distances in parallel. If so it would be interesting to share your solution here.

How many lines do your files have, and on what kind of machine (mem & cpu power) does your algorithm have to run, and what's the tolerated runtime?

Code would look like:

with open(input_file, 'r') as f_outer:
    for line_outer in f_outer:
        with open(input_file, 'r') as f_inner:
            for line_inner in f_inner:
                compute_distance(line_outer, line_inner)

But the questions are how do you store the distances (matrix?) and can you gain an advantage of preparing e.g. the outer_line for processing, or caching some intermediate results for reuse.

@cfi 2011-11-16 06:47:22

@katriealex: What is your point? Please be constructive.

@Katriel 2011-11-16 09:00:22

My point is that this post does not contain an answer to the question, just some more questions! IMO it would be better suited as a comment.

@cfi 2011-11-16 09:30:25

@katriealex: Err. Strange. You did see the nested loops, expanding your own answer to fit the actual question? I can remove my questions here from my answer, and there's yet enough content to warrant providing this as a - albeit partial - answer. I could also accept if you'd edit your own answer to include the nested loop example - which was explicitly asked by the question - and then I can remove my own answer happily. But a downvote is something I don't get at all.

@Katriel 2011-11-16 17:32:46

Fair enough; I don't really see demonstrating the nested for loops as an answer to the question but I guess it's pretty strongly targeted at beginners. Downvote removed.
