By Raptrex


2010-12-31 06:39:15 8 Comments

Is there a better way to use glob.glob in Python to get a list of multiple file types such as .txt, .mdown, and .markdown? Right now I have something like this:

projectFiles1 = glob.glob( os.path.join(projectDir, '*.txt') )
projectFiles2 = glob.glob( os.path.join(projectDir, '*.mdown') )
projectFiles3 = glob.glob( os.path.join(projectDir, '*.markdown') )

30 comments

@facelessuser 2020-02-11 13:58:51

While Python's default glob doesn't really follow Bash's glob, you can do this with other libraries. For example, you can enable braces in wcmatch's glob.

>>> from wcmatch import glob
>>> glob.glob('*.{md,ini}', flags=glob.BRACE)
['LICENSE.md', 'README.md', 'tox.ini']

You can even use extended glob patterns if that is your preference:

>>> from wcmatch import glob
>>> glob.glob('*.@(md|ini)', flags=glob.EXTGLOB)
['LICENSE.md', 'README.md', 'tox.ini']

@Shamoon 2020-03-24 16:09:55

This doesn't take the recursive flag

@facelessuser 2020-03-24 17:31:56

@Shamoon No, it takes the glob.GLOBSTAR flag

@Sway Wu 2020-01-20 17:47:43

import glob
import pandas as pd

# Raw strings keep the backslashes in the Windows path literal.
patterns = [r'C:\dir\path\*.txt', r'C:\dir\path\*.mdown', r'C:\dir\path\*.markdown']

# Collect every match, then build the one-column DataFrame once
# (DataFrame.append was removed in pandas 2.0).
paths = [path for pattern in patterns for path in glob.glob(pattern)]
df1 = pd.DataFrame({'A': paths})

@Tiago Martins Peres 2020-01-20 18:09:20

Hi Sway Wu, welcome. Please consider adding an explanation.

@BPL 2019-07-16 09:24:40

Many answers suggest globbing once per extension; I'd prefer to glob just once instead:

from pathlib import Path

files = {p.resolve() for p in Path(path).glob("**/*") if p.suffix in [".c", ".cc", ".cpp", ".hxx", ".h"]}
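A minimal sketch of the same single-traversal idea, wrapped in a function so it can be reused. The name find_by_suffix and its parameters are illustrative and not part of pathlib. Two things are additions here rather than part of the answer above: suffixes are compared case-insensitively, and is_file() skips directories whose names happen to end in a listed suffix.

```python
from pathlib import Path


def find_by_suffix(root, suffixes):
    """Return a sorted list of files under root whose suffix is in suffixes.

    Hypothetical helper: walks the tree once with rglob("*") and filters
    on Path.suffix, i.e. the last dot-extension (".gz" for "a.tar.gz").
    """
    # A set gives constant-time membership tests.
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )
```

Usage would look like find_by_suffix(path, [".c", ".cc", ".cpp", ".hxx", ".h"]).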

@Giova 2019-06-16 12:51:32

According to my empirical tests, glob.glob isn't the best way to filter files by extension. Some of the reasons are:

  • The globbing "language" does not allow an exact specification of multiple extensions.
  • As a consequence, the results can be incorrect, depending on the file extensions.
  • The globbing method proved empirically slower than most other methods.
  • Strange as it may seem, other filesystem objects, such as folders, can have "extensions" too.

I tested (for correctness and time efficiency) the following 4 methods, which filter files by extension and put them in a list:

from glob import glob, iglob
from re import compile, findall
from os import walk
from os.path import join as path_join  # used by the two walk_* functions


def glob_with_storage(args):

    elements = ''.join([f'[{i}]' for i in args.extensions])
    globs = f'{args.target}/**/*{elements}'
    results = glob(globs, recursive=True)

    return results


def glob_with_iteration(args):

    elements = ''.join([f'[{i}]' for i in args.extensions])
    globs = f'{args.target}/**/*{elements}'
    results = [i for i in iglob(globs, recursive=True)]

    return results


def walk_with_suffixes(args):

    results = []
    for r, d, f in walk(args.target):
        for ff in f:
            for e in args.extensions:
                if ff.endswith(e):
                    results.append(path_join(r,ff))
                    break
    return results


def walk_with_regs(args):

    reg = compile('|'.join([f'{i}$' for i in args.extensions]))

    results = []
    for r, d, f in walk(args.target):
        for ff in f:
            if len(findall(reg,ff)):
                results.append(path_join(r, ff))

    return results

Running the code above on my laptop, I obtained the following self-explanatory results.

Elapsed time for '7 times glob_with_storage()':  0.365023 seconds.
mean   : 0.05214614
median : 0.051861
stdev  : 0.001492152
min    : 0.050864
max    : 0.054853

Elapsed time for '7 times glob_with_iteration()':  0.360037 seconds.
mean   : 0.05143386
median : 0.050864
stdev  : 0.0007847381
min    : 0.050864
max    : 0.052859

Elapsed time for '7 times walk_with_suffixes()':  0.26529 seconds.
mean   : 0.03789857
median : 0.037899
stdev  : 0.0005759071
min    : 0.036901
max    : 0.038896

Elapsed time for '7 times walk_with_regs()':  0.290223 seconds.
mean   : 0.04146043
median : 0.040891
stdev  : 0.0007846776
min    : 0.04089
max    : 0.042885

Results sizes:
0 2451
1 2451
2 2446
3 2446

Differences between glob() and walk():
0 E:\x\y\z\venv\lib\python3.7\site-packages\Cython\Includes\numpy
1 E:\x\y\z\venv\lib\python3.7\site-packages\Cython\Utility\CppSupport.cpp
2 E:\x\y\z\venv\lib\python3.7\site-packages\future\moves\xmlrpc
3 E:\x\y\z\venv\lib\python3.7\site-packages\Cython\Includes\libcpp
4 E:\x\y\z\venv\lib\python3.7\site-packages\future\backports\xmlrpc

Elapsed time for 'main':  1.317424 seconds.
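The answer does not show how these figures were collected. One way to gather comparable statistics, sketched here under that caveat, is timeit.repeat combined with the statistics module; time_callable and its repeats parameter are hypothetical names, not the author's harness.

```python
import statistics
import timeit


def time_callable(func, repeats=7):
    """Call func() `repeats` times and summarise the elapsed times in seconds.

    Hypothetical harness: the benchmark code behind the numbers above
    is not shown, so this is only one plausible way to produce them.
    """
    # number=1 runs func once per sample; repeat controls the sample count.
    samples = timeit.repeat(func, number=1, repeat=repeats)
    return {
        "total": sum(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),  # needs at least two samples
        "min": min(samples),
        "max": max(samples),
    }
```

For example, time_callable(lambda: walk_with_suffixes(args)) would time one of the four functions, given an args object like the one they expect.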

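The fastest way to filter files by extension also happens to be the ugliest one: nested for loops and string comparison using the endswith() method.

Moreover, as you can see, the globbing algorithms (with the pattern E:\x\y\z\**/*[py][pyc]) return incorrect results even with only 2 extensions given (py and pyc). Each bracket matches a single character, so the pattern accepts any name whose last two characters are one of p, y followed by one of p, y, c, which is why numpy, xmlrpc, libcpp and CppSupport.cpp appear in the differences above.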
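The mismatch can be reproduced without touching the filesystem. The sketch below uses fnmatch.fnmatchcase, which applies the same shell-style wildcard rules that glob relies on, to a handful of names. The names are taken from the differences listed above, plus a few made up for contrast.

```python
from fnmatch import fnmatchcase

# The tail of the pattern the glob_* functions build for the extensions py and pyc.
PATTERN = "*[py][pyc]"

names = ["module.py", "module.pyc", "numpy", "xmlrpc", "CppSupport.cpp", "setup.cfg"]

# Each bracket is a one-character set, so only the last two characters matter.
matches = {name: fnmatchcase(name, PATTERN) for name in names}

for name, matched in matches.items():
    print(f"{name:15} {matched}")
```

Every name except setup.cfg matches, including the directory names numpy and xmlrpc, because the brackets describe sets of characters rather than whole extensions.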

@feqwix 2016-03-22 23:11:25

For example, for *.mp3 and *.flac on multiple folders, you can do:

mask = r'music/*/*.[mf][pl][3a]*'
glob.glob(mask)

The idea can be extended to more file extensions, but you have to check that the combinations won't match any other unwanted file extension you may have on those folders. So, be careful with this.

To automatically combine an arbitrary list of extensions into a single glob pattern, you can do the following:

mask_base = r'music/*/*.'
exts = ['mp3', 'flac', 'wma']
chars = ''.join('[{}]'.format(''.join(set(c))) for c in zip(*exts))
mask = mask_base + chars + ('*' if len(set(len(e) for e in exts)) > 1 else '')
print(mask)  # e.g. music/*/*.[fmw][plm][3a]* (the order of characters inside each bracket may vary)
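Since a combined mask like this can match unwanted extensions, one option is to glob once with the broad mask and then keep only paths whose extension is exactly one of those requested. This is a sketch of that post-filtering step; glob_then_filter is a hypothetical helper name, and the case-insensitive comparison is an assumption added here.

```python
import glob
import os


def glob_then_filter(mask, exts):
    """Glob once with a broad mask, then keep only exact extensions.

    Hypothetical helper; exts are given without the dot, e.g. ['mp3', 'flac'].
    """
    wanted = {"." + e.lower() for e in exts}
    return [
        path for path in glob.glob(mask)
        # splitext returns (root, ext), with ext including the leading dot.
        if os.path.splitext(path)[1].lower() in wanted
    ]
```

With the mask music/*/*.[mf][pl][3a]* a file such as song.mla is returned by glob.glob but dropped by the filter.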

@qik 2019-05-02 10:04:12

If you use pathlib try this:

import pathlib

extensions = ['.py', '.txt']
root_dir = './test/'

files = filter(lambda p: p.suffix in extensions, pathlib.Path(root_dir).glob('**/*'))

print(list(files))

@Petr Vepřek 2018-11-10 16:07:35

Yet another solution (use glob to get paths using multiple match patterns and combine all paths into a single list using reduce and add):

import functools, glob, operator
paths = functools.reduce(operator.add, [glob.glob(pattern) for pattern in [
    "path1/*.ext1",
    "path2/*.ext2"]])

@Projesh Bhoumik 2018-07-26 11:46:15

Use a list of extensions and iterate through it:

from os.path import join
from glob import glob

patterns = ['*.gif', '*.png', '*.jpg']
files = []  # keep the results separate from the patterns being iterated
for ext in patterns:
   files.extend(glob(join("path/to/dir", ext)))

print(files)

@Jayhello 2018-07-26 11:23:03

For example:

import glob
lst_img = []
base_dir = '/home/xy/img/'

# get all the jpg files in base_dir
lst_img += glob.glob(base_dir + '*.jpg')
print(lst_img)
# ['/home/xy/img/2.jpg', '/home/xy/img/1.jpg']

# append all the png files in base_dir to lst_img
lst_img += glob.glob(base_dir + '*.png')
print(lst_img)
# ['/home/xy/img/2.jpg', '/home/xy/img/1.jpg', '/home/xy/img/3.png']

A function:

import glob
def get_files(base_dir='/home/xy/img/', lst_extension=('*.jpg', '*.png')):
    """
    :param base_dir: base directory
    :param lst_extension: list of patterns like ['*.jpg', '*.png', ...]
    :return: list of files like ['/home/xy/img/2.jpg', '/home/xy/img/3.png']
    """
    lst_files = []
    for ext in lst_extension:
        lst_files += glob.glob(base_dir+ext)
    return lst_files

@Derek White 2018-06-12 17:33:30

files = glob.glob('*.txt')
files.extend(glob.glob('*.dat'))

@SunSparc 2018-06-12 17:43:18

Good answers also provide some explanation of code and perhaps even some of your reasoning behind the code.

@scholer 2018-05-09 17:13:11

Here is a one-line list-comprehension variant of Pat's answer (which also takes into account that you wanted to glob in a specific project directory):

import os, glob
exts = ['*.txt', '*.mdown', '*.markdown']
files = [f for ext in exts for f in glob.glob(os.path.join(project_dir, ext))]

You loop over the extensions (for ext in exts), and then for each extension you take each file matching the glob pattern (for f in glob.glob(os.path.join(project_dir, ext))).

This solution is short, and without any unnecessary for-loops, nested list-comprehensions, or functions to clutter the code. Just pure, expressive, pythonic Zen.

This solution allows you to have a custom list of exts that can be changed without having to update your code. (This is always a good practice!)

The list-comprehension is the same used in Laurent's solution (which I've voted for). But I would argue that it is usually unnecessary to factor out a single line to a separate function, which is why I'm providing this as an alternative solution.

Bonus:

If you need to search not just a single directory, but also all sub-directories, you can pass recursive=True and use the multi-directory glob symbol ** [1]:

files = [f for ext in exts
         for f in glob.glob(os.path.join(project_dir, '**', ext), recursive=True)]

This will invoke glob.glob('<project_dir>/**/*.txt', recursive=True) and so on for each extension.

[1] With recursive=True, the ** glob symbol matches any files and zero or more directories and subdirectories (unlike the singular * glob symbol, which does not cross a path separator). In practice, you just need to remember that as long as you surround ** with forward slashes (path separators), it matches zero or more directories.
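The "zero or more directories" behaviour can be checked with a throwaway directory tree. This sketch builds one with tempfile (the file and folder names are made up for illustration) and globs it recursively for a single extension.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as project_dir:
    # One .txt file at the top level, one a level down, one two levels down.
    os.makedirs(os.path.join(project_dir, "docs", "deep"))
    for rel in ("top.txt",
                os.path.join("docs", "mid.txt"),
                os.path.join("docs", "deep", "low.txt")):
        open(os.path.join(project_dir, rel), "w").close()

    pattern = os.path.join(project_dir, "**", "*.txt")
    found = sorted(os.path.relpath(p, project_dir)
                   for p in glob.glob(pattern, recursive=True))

print(found)
```

All three files are found, including top.txt, which sits directly in the project directory with no intermediate folder.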

@Justin 2018-05-08 13:17:13

I had the same issue and this is what I came up with

import os, re

#without glob

src_dir = '/mnt/mypics/'
src_pics = []
# match file names ending in .png, .jpeg or .jpg
ext = re.compile(r'.*\.({})$'.format('|'.join(['png', 'jpeg', 'jpg'])))
for root, dirnames, filenames in os.walk(src_dir):
  for filename in filter(lambda name:ext.search(name),filenames):
    src_pics.append(os.path.join(root, filename))

@Sarvagya Gupta 2018-04-20 15:44:50

this worked for me:

import glob
images = glob.glob('*.JPG' or '*.jpg' or '*.png')

@Ciprian Tomoiagă 2018-08-17 08:33:51

This cannot possibly work as you intend it to. The or operator returns the first "non-falsy" value, so in your case: *.JPG. This turns your call into glob.glob('*.JPG'), meaning it will only return *.JPG files, completely forgetting about the other extensions.
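The point is easy to verify: the or expression is evaluated before glob.glob is ever called. The snippet below shows the value it produces and, as one possible alternative, a comprehension that globs each pattern in turn.

```python
import glob

# `or` returns its first truthy operand, and any non-empty string is truthy.
pattern = '*.JPG' or '*.jpg' or '*.png'
print(pattern)

# To cover all three patterns, glob each one and concatenate the results.
images = [path
          for p in ('*.JPG', '*.jpg', '*.png')
          for path in glob.glob(p)]
```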

@colllin 2017-10-11 21:03:06

One glob, many extensions... but imperfect solution (might match other files).

filetypes = ['tif', 'jpg']

filetypes = zip(*[list(ft) for ft in filetypes])
filetypes = ["".join(ch) for ch in filetypes]
filetypes = ["[%s]" % ch for ch in filetypes]
filetypes = "".join(filetypes) + "*"
print(filetypes)
# => [tj][ip][fg]*

glob.glob("/path/to/*.%s" % filetypes)

@Laurent LAPORTE 2017-09-13 13:08:17

To glob multiple file types, you need to call glob() function several times in a loop. Since this function returns a list, you need to concatenate the lists.

For instance, this function does the job:

import glob
import os


def glob_filetypes(root_dir, *patterns):
    return [path
            for pattern in patterns
            for path in glob.glob(os.path.join(root_dir, pattern))]

Simple usage:

project_dir = "path/to/project/dir"
for path in sorted(glob_filetypes(project_dir, '*.txt', '*.mdown', '*.markdown')):
    print(path)

You can also use glob.iglob() to have an iterator:

Return an iterator which yields the same values as glob() without actually storing them all simultaneously.

def iglob_filetypes(root_dir, *patterns):
    return (path
            for pattern in patterns
            for path in glob.iglob(os.path.join(root_dir, pattern)))

@Gil-Mor 2017-07-22 15:07:50

A one-liner, Just for the hell of it..

folder = "C:\\multi_pattern_glob_one_liner"
files = [item for sublist in [glob.glob(folder + ext) for ext in ["/*.txt", "/*.bat"]] for item in sublist]

output:

['C:\\multi_pattern_glob_one_liner\\dummy_txt.txt', 'C:\\multi_pattern_glob_one_liner\\dummy_bat.bat']

@Hans Goldman 2015-02-10 03:13:28

After coming here for help, I made my own solution and wanted to share it. It's based on user2363986's answer, but I think this is more scalable. Meaning, that if you have 1000 extensions, the code will still look somewhat elegant.

from glob import glob

directoryPath  = "C:\\temp\\*." 
fileExtensions = [ "jpg", "jpeg", "png", "bmp", "gif" ]
listOfFiles    = []

for extension in fileExtensions:
    listOfFiles.extend( glob( directoryPath + extension ))

for file in listOfFiles:
    print(file)   # Or do other stuff

@NeStack 2018-12-13 14:12:44

Doesn't work for me. I use directoryPath = "/Users/bla/bla/images_dir*."

@Hans Goldman 2019-01-09 00:39:36

I would need more info to debug this for you... Are you getting an exception? Also, if you're on Windows, that path doesn't look like it would work (missing drive letter).
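Judging only from the path quoted in the comment above, one possible cause is the missing path separator between images_dir and the *. part: the resulting pattern looks for siblings whose names start with images_dir rather than for files inside that directory. A sketch of building the pattern with os.path.join instead (extension_pattern is a hypothetical helper):

```python
import os


def extension_pattern(directory, extension):
    """Return a glob pattern for files with `extension` inside `directory`.

    Hypothetical helper: os.path.join supplies the separator that a
    hand-written prefix such as "images_dir*." leaves out.
    """
    return os.path.join(directory, "*." + extension)


# Pattern with a separator, followed by the pattern from the comment's prefix.
print(extension_pattern("images_dir", "jpg"))
print("images_dir*." + "jpg")
```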

@unpangloss 2017-04-24 04:27:40

import os    
import glob
import operator
from functools import reduce

types = ('*.jpg', '*.png', '*.jpeg')
lazy_paths = (glob.glob(os.path.join('my_path', t)) for t in types)
paths = reduce(operator.add, lazy_paths, [])

https://docs.python.org/3.5/library/functools.html#functools.reduce
https://docs.python.org/3.5/library/operator.html#operator.add

@tzot 2011-01-28 14:12:14

Chain the results:

import itertools as it, glob

def multiple_file_types(*patterns):
    return it.chain.from_iterable(glob.iglob(pattern) for pattern in patterns)

Then:

for filename in multiple_file_types("*.txt", "*.sql", "*.log"):
    pass  # do stuff

@rodrigob 2013-08-06 13:57:30

glob.glob -> glob.iglob so that the iterators chain is fully lazy evaluated

@florisla 2018-04-20 11:35:14

I found the same solution but didn't know about chain.from_iterable. So this is similar, but less readable: it.chain(*(glob.iglob(pattern) for pattern in patterns)).

@cyht 2016-11-07 19:35:28

You could also use reduce() like so:

import glob
from functools import reduce  # reduce is not a builtin in Python 3
file_types = ['*.txt', '*.mdown', '*.markdown']
project_files = reduce(lambda list1, list2: list1 + list2, (glob.glob(t) for t in file_types))

This creates a list from glob.glob() for each pattern and reduces them to a single list.

@patrick-mooney 2015-12-29 08:31:03

glob returns a list: why not just run it multiple times and concatenate the results?

from glob import glob
ProjectFiles = glob('*.txt') + glob('*.mdown') + glob('*.markdown')

@Hans Goldman 2017-06-02 22:38:59

This is possibly the most readable solution given. I would change the case of ProjectFiles to projectFiles, but great solution.

@x squared 2020-01-03 12:13:30

I agree, this should be the accepted answer

@user2363986 2014-10-16 11:23:02

from glob import glob

files = glob('*.gif')
files.extend(glob('*.png'))
files.extend(glob('*.jpg'))

print(files)

If you need to specify a path, loop over match patterns and keep the join inside the loop for simplicity:

from os.path import join
from glob import glob

files = []
for ext in ('*.gif', '*.png', '*.jpg'):
   files.extend(glob(join("path/to/dir", ext)))

print(files)

@Winand 2015-10-09 07:42:54

This is a Python 3.4+ pathlib solution:

import os
import pathlib

exts = ".pdf", ".doc", ".xls", ".csv", ".ppt"
filelist = (str(i) for i in map(pathlib.Path, os.listdir(src)) if i.suffix.lower() in exts and not i.stem.startswith("~"))

Also it ignores all file names starting with ~.

@LK__ 2015-05-28 21:12:03

You could use filter:

import os
import glob

projectFiles = filter(
    lambda x: os.path.splitext(x)[1] in [".txt", ".mdown", ".markdown"],
    glob.glob(os.path.join(projectDir, "*"))
)

@jdnoon 2014-11-05 12:45:14

This should work:

import glob
extensions = ('*.txt', '*.mdown', '*.markdown')
for i in extensions:
    for files in glob.glob(i):
        print (files)

@Tim Fuller 2013-01-15 15:18:24

The following function _glob globs for multiple file extensions.

import glob
import os
def _glob(path, *exts):
    """Glob for multiple file extensions

    Parameters
    ----------
    path : str
        A file name without extension, or directory name
    exts : tuple
        File extensions to glob for

    Returns
    -------
    files : list
        list of files matching extensions in exts in path

    """
    path = os.path.join(path, "*") if os.path.isdir(path) else path + "*"
    return [f for files in [glob.glob(path + ext) for ext in exts] for f in files]

files = _glob(projectDir, ".txt", ".mdown", ".markdown")

@joemaller 2012-12-06 03:36:52

Not glob, but here's another way using a list comprehension:

extensions = 'txt mdown markdown'.split()
projectFiles = [f for f in os.listdir(projectDir) 
                  if os.path.splitext(f)[1][1:] in extensions]

@thegauraw 2012-01-19 05:46:52

You can try to make a manual list comparing the extension of existing with those you require.

ext_list = ['gif','jpg','jpeg','png']
file_list = []
for file in glob.glob('*.*'):
  if file.rsplit('.',1)[1] in ext_list :
    file_list.append(file)

@user225312 2010-12-31 06:53:41

Maybe there is a better way, but how about:

>>> import glob
>>> types = ('*.pdf', '*.cpp') # the tuple of file types
>>> files_grabbed = []
>>> for files in types:
...     files_grabbed.extend(glob.glob(files))
... 
>>> files_grabbed   # the list of pdf and cpp files

Perhaps there is another way, so wait in case someone else comes up with a better answer.

@Novitoll 2016-11-10 06:22:11

files_grabbed = [glob.glob(e) for e in ['*.pdf', '*.cpp']]

@robroc 2017-01-29 20:04:31

Novitoll's solution is short, but it ends up creating nested lists.

@AlexG 2017-06-10 04:50:04

you could always do this ;) [f for f_ in [glob.glob(e) for e in ('*.jpg', '*.mp4')] for f in f_]

@florisla 2018-04-20 11:27:44

files_grabbed = [glob.glob(e) for e in ['.pdf', '*.cpp']]

@Ridhuvarshan 2018-11-09 12:31:00

This loops twice through the list of files. In the first iteration it checks for *.pdf and in the second it checks for *.cpp. Is there a way to get it done in one iteration? Check the combined condition each time?
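One way to check the combined condition in a single pass, sketched here for a flat (non-recursive) directory, is to list the directory once and test each name with str.endswith, which accepts a tuple of suffixes. The function name files_with_extensions is hypothetical.

```python
import os


def files_with_extensions(directory, extensions):
    """List files in directory whose names end with any of the extensions.

    Hypothetical helper: a single os.scandir() pass instead of one glob
    per pattern. Extensions include the dot, e.g. ('.pdf', '.cpp').
    """
    suffixes = tuple(extensions)  # str.endswith needs a tuple, not a list
    with os.scandir(directory) as entries:
        return sorted(
            entry.path for entry in entries
            if entry.is_file() and entry.name.endswith(suffixes)
        )
```

For example, files_with_extensions('.', ('.pdf', '.cpp')) reads the current directory once and returns both file types.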

@niid 2019-09-28 16:06:57

How does it play out in either of the above solutions if 2 or more extensions match the same file. In that case we would have duplicates that need to be accounted for... I think the task implies that we want every unique file so the solution should account for that.
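If overlapping patterns are a concern, the combined results can be de-duplicated while keeping the order in which paths were first seen. A minimal sketch using dict.fromkeys (glob_unique is a hypothetical name); note that it compares the path strings as returned, so two different spellings of the same file would still count as distinct.

```python
import glob


def glob_unique(patterns):
    """Glob every pattern and drop repeated paths, keeping first-seen order."""
    matches = (path for pattern in patterns for path in glob.glob(pattern))
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(matches))
```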

@Andrew Alcock 2012-05-15 09:30:12

I have released Formic which implements multiple includes in a similar way to Apache Ant's FileSet and Globs.

The search can be implemented:

import formic
patterns = ["*.txt", "*.markdown", "*.mdown"]
fileset = formic.FileSet(directory=projectDir, include=patterns)
for file_name in fileset.qualified_files():
    pass  # Do something with file_name

Because the full Ant glob is implemented, you can include different directories with each pattern, so you could choose only those .txt files in one subdirectory, and the .markdown in another, for example:

patterns = [ "/unformatted/**/*.txt", "/formatted/**/*.mdown" ]

I hope this helps.
