By Gerenuk


2012-01-24 17:44:02 8 Comments

Can you think of a nice way (maybe with itertools) to split an iterator into chunks of given size?

For example, l=[1,2,3,4,5,6,7] with chunks(l,3) becomes an iterator yielding [1,2,3], [4,5,6], [7]

I can think of a small program to do that, but not a nice way with itertools.

9 Answers

@Svein Lindal 2015-04-08 20:36:55

This will work on any iterator. It returns a generator of generators (for full flexibility). I now realize that it's basically the same as @reclosedev's solution, but without the fluff. No need for try...except, as the StopIteration propagates up, which is what we want.

The next(iterable) call is needed to raise the StopIteration when the iterable is empty, since islice will continue spawning empty generators forever if you let it.

It's better because it's only two lines long, yet easy to comprehend.

import itertools

def grouper(iterable, n):
    while True:
        yield itertools.chain((next(iterable),), itertools.islice(iterable, n-1))

Note that next(iterable) is put into a tuple. Otherwise, if next(iterable) itself were iterable, then itertools.chain would flatten it out. Thanks to Jeremy Brown for pointing out this issue.
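
A quick demonstration of the flattening issue (my own sketch, not part of the original answer):

import itertools

it = iter(["ab", "cd", "ef"])
chunk = itertools.chain(next(it), itertools.islice(it, 1))    # first item not wrapped
print(list(chunk))  # ['a', 'b', 'cd'] -- the string "ab" was flattened into the chunk

it = iter(["ab", "cd", "ef"])
chunk = itertools.chain((next(it),), itertools.islice(it, 1))  # wrapped in a 1-tuple
print(list(chunk))  # ['ab', 'cd']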

@deW1 2015-04-08 20:55:02

While this may answer the question, including some explanation and description would help us understand your approach and show why your answer stands out.

@Artjom B. 2015-04-08 21:12:57

Don't just copy your answer to another question. If you need to do that, then it suggests that one is a duplicate of the other, which they are and I voted to close.

@Svein Lindal 2015-04-09 13:03:59

It's a duplicate; I saw this thread afterwards, and it turns out it contains a variation of my answer.

@Jeremy Brown 2015-12-16 04:56:09

iterable.next() needs to be contained or yielded by an iterator for the chain to work properly - e.g. yield itertools.chain([iterable.next()], itertools.islice(iterable, n-1))

@Antti Haapala 2017-04-28 12:05:59

next(iterable), not iterable.next().

@Mateen Ulhaq 2018-11-24 04:53:35

It might make sense to prefix the while loop with the line iterable = iter(iterable) to turn your iterable into an iterator first. Iterables are not guaranteed to have a __next__ method.

@loutre 2019-03-01 14:04:21

Raising StopIteration in a generator function is deprecated since PEP 479. So I prefer an explicit return statement, as in @reclosedev's solution.

@drevicko 2019-08-28 09:11:07

@loutre indeed, in Python 3.7 it raises an exception...
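
For reference, here is a sketch that folds in both fixes suggested in these comments (iter() per @Mateen Ulhaq, and an explicit return per PEP 479); treat it as one possible variant rather than the original author's code:

import itertools

def grouper(iterable, n):
    iterable = iter(iterable)  # accept any iterable, not just iterators
    while True:
        try:
            first = next(iterable)
        except StopIteration:
            return  # PEP 479: don't let StopIteration escape a generator
        yield itertools.chain((first,), itertools.islice(iterable, n - 1))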

@Sven Marnach 2012-01-24 17:48:03

The grouper() recipe from the recipes section of the itertools documentation comes close to what you want:

from itertools import izip_longest

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

It will fill up the last chunk with a fill value, though.
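
For example (Python 2 spelling; as a comment below notes, izip_longest is zip_longest in Python 3):

list(grouper(3, 'ABCDEFG', 'x'))
[('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]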

A less general solution that only works on sequences but does handle the last chunk as desired is

[my_list[i:i + chunk_size] for i in range(0, len(my_list), chunk_size)]
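
For example:

my_list = [1, 2, 3, 4, 5, 6, 7]
chunk_size = 3
[my_list[i:i + chunk_size] for i in range(0, len(my_list), chunk_size)]
[[1, 2, 3], [4, 5, 6], [7]]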

Finally, a solution that works on general iterators and behaves as desired is

import itertools

def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk
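
For example, using the definition above:

list(grouper(3, [1, 2, 3, 4, 5, 6, 7]))
[(1, 2, 3), (4, 5, 6), (7,)]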

@Gerenuk 2012-01-25 09:52:22

Thanks for this and all the other ideas! Sorry that I missed the numerous threads already discussing this question. I had tried islice, but somehow I missed that it indeed consumes the iterator as desired. Now I'm thinking of defining a custom iterator class which provides all sorts of functionality :)

@Capi Etheriel 2014-10-31 16:52:39

Would if chunk: yield chunk be acceptable? It shaves off a line and is as semantic as a single return.

@Sven Marnach 2014-10-31 17:57:47

@barraponto: No, it wouldn't be acceptable, since you would be left with an infinite loop.

@Jonathan Eunice 2015-04-24 00:02:19

I am surprised that this is such a highly-voted answer. The recipe works great for small n, but for large groups, is very inefficient. My n, e.g., is 200,000. Creating a temporary list of 200K items is...not ideal.

@Sven Marnach 2015-04-26 15:56:03

@JonathanEunice: In almost all cases, this is what people want (which is the reason why it is included in the Python documentation). Optimising for a particular special case is out of scope for this question, and even with the information you included in your comment, I can't tell what the best approach would be for you. If you want to chunk a list of numbers that fits into memory, you are probably best off using NumPy's .resize() method. If you want to chunk a general iterator, the second approach is already quite good -- it creates temporary tuples of size 200K, but that's not a big deal.

@Jonathan Eunice 2015-04-27 02:24:05

@SvenMarnach We'll have to disagree. I believe people want convenience, not gratuitous overhead. They get the overhead because the docs provide a needlessly bloated answer. With large data, temporary tuples/lists/etc. of 200K or 1M items make the program consume gigabytes of excess memory and take much longer to run. Why do that if you don't have to? At 200K, extra temp storage makes the overall program take 3.5x longer to run than with it removed. Just that one change. So it is a pretty big deal. NumPy won't work because the iterator is a database cursor, not a list of numbers.

@Sven Marnach 2015-04-28 09:19:20

@JonathanEunice: Sorry, when I said "the second approach" I actually meant the third one in my answer. There will only be a single 200K chunk at any given time, unless you store all of them (in which case you can't blame the code in this answer, but should blame your own code instead), and I can't see how this would use gigabytes of memory. That said, you are currently optimising along a very particular dimension, and all these optimisations have to be tailored to special cases. If you have a solution that you think is better for the general case, please enter an answer of your own.

@Sven Marnach 2015-04-28 09:20:53

@JonathanEunice: I think I still haven't understood your use case.

@juanpa.arrivillaga 2019-08-16 19:45:39

@JonathanEunice also, you are incorrect about the scale of the overhead. If you chunk a list using these methods, you are creating new container objects for each chunk, but the items themselves already exist, so under the hood you only have to account for the new pointers: 200,000 * 8 * 1e-6 = 1.6 megabytes of overhead for a 200K-item list, and about 5 times that for a million.

@Jonathan Eunice 2019-08-16 21:08:11

@juanpa.arrivillaga Note that I said "200K items." 200K items does of course consume ≫ 200K bytes, especially given Python not being particularly space-efficient.

@juanpa.arrivillaga 2019-08-16 21:23:35

@JonathanEunice yes, that's what I accounted for, and the memory overhead is about 1.6 megabytes, which is several orders of magnitude less than gigabytes of excess memory.

@nbro 2019-10-03 00:53:22

@SvenMarnach I found out that my problem was due to the usage of zip in Python 2, which loads all data in memory, as opposed to itertools.izip. You can delete the previous comments and I will also delete this one.

@hojin 2019-10-30 12:47:32

izip_longest was renamed to zip_longest in Python 3

@jsbueno 2012-01-24 18:03:11

"Simpler is better than complex" - a straightforward generator a few lines long can do the job. Just place it in some utilities module or so:

def grouper(iterable, n):
    iterable = iter(iterable)
    count = 0
    group = []
    while True:
        try:
            group.append(next(iterable))
            count += 1
            if count % n == 0:
                yield group
                group = []
        except StopIteration:
            yield group
            break
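
Note that when the length of the iterable is an exact multiple of n, the final yield produces an empty group, a point the next answer picks up:

list(grouper([1, 2, 3, 4, 5, 6], 3))
[[1, 2, 3], [4, 5, 6], []]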

@eidorb 2012-10-09 09:48:03

I was working on something today and came up with what I think is a simple solution. It is similar to jsbueno's answer, but I believe his would yield empty groups when the length of the iterable is divisible by n. My answer does a simple check when the iterable is exhausted.

def chunk(iterable, chunk_size):
    """Generate sequences of `chunk_size` elements from `iterable`."""
    iterable = iter(iterable)
    while True:
        chunk = []
        try:
            for _ in range(chunk_size):
                chunk.append(next(iterable))
            yield chunk
        except StopIteration:
            if chunk:
                yield chunk
            break
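
For example:

list(chunk(range(7), 3))
[[0, 1, 2], [3, 4, 5], [6]]
list(chunk(range(6), 3))  # no trailing empty chunk
[[0, 1, 2], [3, 4, 5]]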

@Marcin 2012-01-24 19:31:46

A succinct implementation is:

from itertools import ifilterfalse, izip_longest  # filterfalse / zip_longest in Python 3

chunker = lambda iterable, n: (ifilterfalse(lambda x: x == (), chunk) for chunk in (izip_longest(*[iter(iterable)]*n, fillvalue=())))

This works because [iter(iterable)]*n is a list containing the same iterator n times; zipping over that takes one item from each iterator in the list, which is the same iterator, with the result that each zip-element contains a group of n items.
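
The effect is easy to see in isolation (a small sketch; Python 3's zip is used here for brevity):

it = iter(range(9))
args = [it] * 3          # three references to one and the same iterator
print(list(zip(*args)))  # [(0, 1, 2), (3, 4, 5), (6, 7, 8)]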

izip_longest is needed to fully consume the underlying iterable, rather than iteration stopping when the first exhausted iterator is reached, which chops off any remainder from iterable. This results in the need to filter out the fill-value. A slightly more robust implementation would therefore be:

def chunker(iterable, n):
    class Filler(object): pass
    return (ifilterfalse(lambda x: x is Filler, chunk) for chunk in (izip_longest(*[iter(iterable)]*n, fillvalue=Filler)))

This guarantees that the fill value is never an item in the underlying iterable. Using the definition above:

iterable = range(1,11)

map(tuple,chunker(iterable, 3))
[(1, 2, 3), (4, 5, 6), (7, 8, 9), (10,)]

map(tuple,chunker(iterable, 2))
[(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]

map(tuple,chunker(iterable, 4))
[(1, 2, 3, 4), (5, 6, 7, 8), (9, 10)]

This implementation almost does what you want, but it has issues:

def chunks(it, step):
  start = 0
  while True:
    end = start+step
    yield islice(it, start, end)
    start = end

(The difference is that, because islice does not raise StopIteration or anything else on calls that go beyond the end of it, this will yield forever; there is also the slightly tricky issue that the islice results must be consumed before this generator is iterated further.)
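
The first issue can be demonstrated directly (my sketch):

from itertools import islice

it = iter([1, 2, 3])
print(list(islice(it, 0, 3)))  # [1, 2, 3]
print(list(islice(it, 3, 6)))  # [] -- islice just returns an empty iterator
print(list(islice(it, 6, 9)))  # [] -- and will keep doing so forever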

To generate the moving window functionally:

izip(count(0, step), count(step, step))

So this becomes:

(it[start:end] for (start,end) in izip(count(0, step), count(step, step)))

But, that still creates an infinite iterator. So, you need takewhile (or perhaps something else might be better) to limit it:

chunk = lambda it, step: takewhile((lambda x: len(x) > 0), (it[start:end] for (start,end) in izip(count(0, step), count(step, step))))

g = chunk(range(1,11), 3)

tuple(g)
([1, 2, 3], [4, 5, 6], [7, 8, 9], [10])

@Sven Marnach 2012-01-25 14:11:11

1. The first code snippet contains the line start = end, which doesn't seem to be doing anything, since the next iteration of the loop will start with start = 0. Moreover, the loop is infinite -- it's while True without any break. 2. What is len in the second code snippet? 3. All other implementations only work for sequences, not for general iterators. 4. The check x is () relies on an implementation detail of CPython. As an optimisation, the empty tuple is only created once and reused later. This is not guaranteed by the language specification though, so you should use x == ().

@Sven Marnach 2012-01-25 14:11:21

5. The combination of count() and takewhile() is much more easily implemented using range().

@Marcin 2012-01-25 14:20:02

@SvenMarnach: I've edited the code and text in response to some of your points. Much-needed proofing.

@Sven Marnach 2012-01-25 14:30:53

That was fast. :) I still have an issue with the first code snippet: It only works if the yielded slices are consumed. If the user does not consume them immediately, strange things may happen. That's why Peter Otten used deque(chunk, 0) to consume them, but that solution has problems as well -- see my comment to his answer.

@Sven Marnach 2012-01-25 14:33:15

I like the last version of chunker(). As a side note, a nice way to create a unique sentinel is sentinel = object() -- it is guaranteed to be distinct from any other object.

@Marcin 2012-01-25 14:35:21

I have reversed the order of my answers, so read @SvenMarnach's comments with care.

@Marcin 2012-01-25 14:36:26

@SvenMarnach: Nice tip on sentinels - that didn't occur to me.
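
Combining that sentinel tip with the chunker above gives something like this (my sketch, in Python 3 spelling, where the itertools names are filterfalse and zip_longest):

from itertools import filterfalse, zip_longest

def chunker(iterable, n):
    sentinel = object()  # guaranteed distinct from any item of iterable
    return (filterfalse(lambda x: x is sentinel, chunk)
            for chunk in zip_longest(*[iter(iterable)] * n, fillvalue=sentinel))

[tuple(c) for c in chunker(range(1, 11), 3)]
[(1, 2, 3), (4, 5, 6), (7, 8, 9), (10,)]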

@reclosedev 2012-01-25 04:59:30

Although the OP asks for a function that returns chunks as lists or tuples, in case you need to return iterators, Sven Marnach's solution can be modified:

import itertools

def grouper_it(n, iterable):
    it = iter(iterable)
    while True:
        chunk_it = itertools.islice(it, n)
        try:
            first_el = next(chunk_it)
        except StopIteration:
            return
        yield itertools.chain((first_el,), chunk_it)

Some benchmarks: http://pastebin.com/YkKFvm8b

It will be slightly more efficient, but only if you iterate through the elements of every chunk.
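
For example:

for chunk in grouper_it(3, [1, 2, 3, 4, 5, 6, 7]):
    print(list(chunk))
[1, 2, 3]
[4, 5, 6]
[7]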

@Jonathan Eunice 2015-04-24 01:36:14

I arrived at almost exactly this design today, after finding the answer in the documentation (which is the accepted, most-highly-voted answer above) massively inefficient. When you're grouping hundreds of thousands or millions of objects at a time--which is when you need segmentation the most--it has to be pretty efficient. THIS is the right answer.

@Lawrence Hudson 2018-01-31 09:46:19

This is the best solution.

@Tavian Barnes 2018-12-18 19:01:10

Won't this behave wrongly if the caller doesn't exhaust chunk_it (by breaking the inner loop early for example)?

@loutre 2019-03-01 14:11:49

@TavianBarnes good point: if a first group is not exhausted, a second will start where the first left off. But it may be considered a feature if you want both to be looped over concurrently. Powerful, but handle with care.

@ShadowRanger 2020-01-15 03:08:11

@TavianBarnes: This can be made to behave correctly in that case by making a cheap iterator consumer (fastest in CPython if you create it outside the loop is consume = collections.deque(maxlen=0).extend), then add consume(chunk_it) after the yield line; if the caller consumed the yielded chain, it does nothing, if they didn't, it consumes it on their behalf as efficiently as possible. Put it in the finally of a try wrapping the yield if you need it to advance a caller provided iterator to the end of the chunk if the outer loop is broken early.
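
A sketch of that suggestion spelled out (my rendering of @ShadowRanger's comment, not code he posted):

import collections
import itertools

def grouper_it(n, iterable):
    consume = collections.deque(maxlen=0).extend  # discards everything fed to it
    it = iter(iterable)
    while True:
        chunk_it = itertools.islice(it, n)
        try:
            first_el = next(chunk_it)
        except StopIteration:
            return
        try:
            yield itertools.chain((first_el,), chunk_it)
        finally:
            consume(chunk_it)  # skip whatever the caller left unconsumed in this chunk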

@Peter Otten 2012-01-24 18:05:09

Here's one that returns lazy chunks; use map(list, chunks(...)) if you want lists.

from itertools import islice, chain
from collections import deque

def chunks(items, n):
    items = iter(items)
    for first in items:
        chunk = chain((first,), islice(items, n - 1))
        yield chunk
        deque(chunk, 0)  # drain anything the caller left unconsumed

if __name__ == "__main__":
    for chunk in map(list, chunks(range(10), 3)):
        print(chunk)

    for i, chunk in enumerate(chunks(range(10), 3)):
        if i % 2 == 1:
            print("chunk #%d: %s" % (i, list(chunk)))
        else:
            print("skipping #%d" % i)

@Marcin 2012-01-24 19:44:56

Care to comment on how this works?

@Sven Marnach 2012-01-25 14:19:06

A caveat: This generator yields iterables that remain valid only until the next iterable is requested. When using e.g. list(chunks(range(10), 3)), all iterables will already have been consumed.
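
To see the caveat concretely (using the chunks() definition above):

chunked = list(chunks(range(10), 3))  # produces all chunks before consuming any
[list(c) for c in chunked]
[[], [], [], []]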

@Carlos Quintanilla 2012-01-24 19:10:58

Here you go.

def chunksiter(l, chunks):
    i, j, n = 0, 0, 0
    rl = []
    while n < len(l) / chunks:
        rl.append(l[i:j + chunks])
        i += chunks
        j += chunks
        n += 1
    return iter(rl)


def chunksiter2(l, chunks):
    i, j, n = 0, 0, 0
    while n < len(l) / chunks:
        yield l[i:j + chunks]
        i += chunks
        j += chunks
        n += 1

Examples:

for l in chunksiter([1,2,3,4,5,6,7,8],3):
    print(l)

[1, 2, 3]
[4, 5, 6]
[7, 8]

for l in chunksiter2([1,2,3,4,5,6,7,8],3):
    print(l)

[1, 2, 3]
[4, 5, 6]
[7, 8]


for l in chunksiter2([1,2,3,4,5,6,7,8],5):
    print(l)

[1, 2, 3, 4, 5]
[6, 7, 8]

@Sven Marnach 2012-01-25 14:01:04

This only works for sequences, not for general iterables.

@Zach Young 2012-01-24 18:09:08

I forget where I found the inspiration for this. I've modified it a little to work with MSI GUIDs in the Windows Registry:

def nslice(s, n, truncate=False, reverse=False):
    """Splits s into n-sized chunks, optionally reversing the chunks."""
    assert n > 0
    while len(s) >= n:
        if reverse: yield s[:n][::-1]
        else: yield s[:n]
        s = s[n:]
    if len(s) and not truncate:
        yield s

reverse doesn't apply to your question, but it's something I use extensively with this function.

>>> [i for i in nslice([1,2,3,4,5,6,7], 3)]
[[1, 2, 3], [4, 5, 6], [7]]
>>> [i for i in nslice([1,2,3,4,5,6,7], 3, truncate=True)]
[[1, 2, 3], [4, 5, 6]]
>>> [i for i in nslice([1,2,3,4,5,6,7], 3, truncate=True, reverse=True)]
[[3, 2, 1], [6, 5, 4]]

@Zach Young 2012-01-24 18:17:14

This answer is close to the one I started with, but not quite: stackoverflow.com/a/434349/246801

@Sven Marnach 2012-01-25 14:15:38

This only works for sequences, not for general iterables.

@Zach Young 2012-01-25 16:02:01

@SvenMarnach: Hi Sven, yes, thank you, you are absolutely correct. I saw the OP's example which used a list (sequence) and glossed over the wording of the question, assuming they meant sequence. Thanks for pointing that out, though. I didn't immediately understand the difference when I saw your comment, but have since looked it up. :)
