Fast folder iteration in Python
Post created at 2022-01-19 07:53
I have a problem to deal with here.
I need to iterate over a big tree of folders and files and run a process on each file.
In Python, we have several options for working with folders and files:
- glob
- iglob
- os.walk
- os.scandir
- pathlib.Path
After running a benchmark over these options, here are some implementation details for each one:
glob
Python
import glob

# '**' with recursive=True matches entries at any depth
folder = 'some_folder/another_folder/**/*'
for file in glob.glob(folder, recursive=True):
    print(file)
Text Only
some_folder/another_folder/f1/file1
some_folder/another_folder/f1/file1
some_folder/another_folder/f1/file2
Pros
- Easy to use, less code to write
- Easy to apply filters using masks (fnmatch based)
- Returns data as a list
Cons
- Slowest full scan of all (11.6x the quickest method)
- No data is available until every file and folder has been scanned
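The mask-based filtering mentioned in the pros can be sketched like this. The helper name `find_by_mask` and the `*.txt` mask are only illustrative assumptions, they are not part of the benchmark.

```python
import glob
import os


def find_by_mask(folder: str, mask: str) -> list[str]:
    """Return every path under `folder` whose name matches `mask`, at any depth."""
    # '**' matches zero or more directory levels when recursive=True.
    pattern = os.path.join(folder, "**", mask)
    return glob.glob(pattern, recursive=True)


# Hypothetical usage: list only the .txt entries below the example folder.
for file in find_by_mask("some_folder/another_folder", "*.txt"):
    print(file)
```

Keep in mind that the mask is applied to folder names too, so a folder called `notes.txt` would also be returned.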
iglob
Python
import glob

# Same pattern as glob, but the results come from an iterator
folder = 'some_folder/another_folder/**/*'
for file in glob.iglob(folder, recursive=True):
    print(file)
Text Only
some_folder/another_folder/f1/file1
some_folder/another_folder/f1/file1
some_folder/another_folder/f1/file2
Pros
- Easy to use, like glob
- Same filtering
- Returns an iterator, so you can start processing files without waiting for the whole scan to finish
Cons
- Full scan is only slightly faster than glob (10.9x the quickest method)
- Time to first file could be much better (11.6x the quickest method)
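To show what the iterator buys you, here is a small sketch that takes only the first few matches and leaves the rest of the iterator unconsumed. The helper name `first_matches` and the limit of 3 are my own assumptions for illustration.

```python
import glob
from itertools import islice


def first_matches(pattern: str, limit: int) -> list[str]:
    """Return at most `limit` paths matching `pattern`."""
    # iglob returns an iterator; islice stops pulling from it after `limit` items.
    return list(islice(glob.iglob(pattern, recursive=True), limit))


# Hypothetical usage: peek at the first three entries only.
for file in first_matches("some_folder/another_folder/**/*", 3):
    print(file)
```

Doing the same with `glob.glob` would build the complete list first and only then slice it.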
os.walk
Python
import os

# os.walk takes a plain folder path, not a glob pattern
folder = 'some_folder/another_folder'
for root, _, files in os.walk(folder):
    for file in files:
        print(os.path.join(root, file))
Pros
- Good performance: second quickest method (1.9x the time of the quickest)
- The first file is available almost immediately (second best result, 2.5x the quickest)
- Explicit loops give more visibility and control if you need validations or other messy processing
Cons
- More code needed, with nested loops
- If you need to nest os.walk inside another os.walk loop, some strange things can occur. But in that case your code probably needs some refactoring.
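As an example of the control the explicit loops give you, os.walk lets you prune folders while it runs: in the default top-down mode, removing names from the `dirs` list in place stops it from descending into them. A minimal sketch, where the helper name `walk_files` and the default skip list are assumptions of mine:

```python
import os
from typing import Iterable, Iterator


def walk_files(folder: str, skip_dirs: Iterable[str] = (".git", "__pycache__")) -> Iterator[str]:
    """Yield file paths under `folder`, skipping any folder named in `skip_dirs`."""
    skip = set(skip_dirs)
    for root, dirs, files in os.walk(folder):
        # Editing `dirs` in place (top-down mode) keeps os.walk out of those folders.
        dirs[:] = [d for d in dirs if d not in skip]
        for file in files:
            yield os.path.join(root, file)


# Hypothetical usage with the example folder from this post.
for file in walk_files("some_folder/another_folder"):
    print(file)
```

Skipped folders are never scanned at all, which can save a lot of time on big trees.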
os.scandir
Python
import os
from typing import Iterator

# os.scandir takes a plain folder path, not a glob pattern
folder = 'some_folder/another_folder'

def get_files(folder: str) -> Iterator[str]:
    with os.scandir(folder) as scan:
        for item in scan:
            if item.is_file():
                yield item.path
            elif item.is_dir():
                # Recurse into subfolders
                yield from get_files(item.path)

for file in get_files(folder):
    print(file)
Pros
- Best performance of all
- Context manager based, which ensures resources are released after processing
- Less verbose than os.walk
- Easy to implement your custom business rules
Cons
- A little more complex. No big deal.
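A custom business rule can be plugged straight into the scan. The sketch below is only an assumption of what such a rule could look like: it keeps files of at least `min_size` bytes, and it uses an explicit stack instead of recursion. The name `get_large_files` is hypothetical.

```python
import os
from typing import Iterator


def get_large_files(folder: str, min_size: int) -> Iterator[str]:
    """Yield paths of regular files under `folder` that have at least `min_size` bytes."""
    pending = [folder]  # folders still to be scanned (explicit stack, no recursion)
    while pending:
        with os.scandir(pending.pop()) as scan:
            for item in scan:
                # follow_symlinks=False: symlinks are neither followed nor reported.
                if item.is_dir(follow_symlinks=False):
                    pending.append(item.path)
                elif item.is_file(follow_symlinks=False):
                    if item.stat(follow_symlinks=False).st_size >= min_size:
                        yield item.path
```

It is used like the function above, for example `for path in get_large_files(folder, 1024): ...`. Like any os.scandir code, it raises FileNotFoundError if the starting folder does not exist.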
pathlib.Path.rglob
Python
import pathlib

folder = 'some_folder/another_folder'
for path in pathlib.Path(folder).rglob('*'):
    # rglob also returns folders, so keep files only
    if path.is_file():
        print(path)
Pros
- Easy to implement your custom business rules
- Returns an iterator
- Files become available early, so you don't need to wait for the whole scan (although the time to first file is 7x that of os.scandir)
Cons
- Average time to scan (7.9x that of os.scandir)
- Memory consumption is about 4 times that of the other methods
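Since rglob hands you Path objects, rules based on name parts are short to write. As an illustration only (the helper `count_by_suffix` is my own assumption, not part of the benchmark), this counts files per extension:

```python
import pathlib
from collections import Counter


def count_by_suffix(folder: str) -> Counter:
    """Count the files under `folder` per extension (for example '.txt'), at any depth."""
    counts: Counter = Counter()
    for path in pathlib.Path(folder).rglob("*"):
        if path.is_file():
            # Path.suffix is the last extension, or '' when the name has none.
            counts[path.suffix] += 1
    return counts
```

Calling `count_by_suffix('some_folder/another_folder')` returns a Counter that maps each extension to the number of files found with it.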
Some data from tests
System Information
System | Release | Version | Machine | Processor |
---|---|---|---|---|
Linux | 5.11.0-46-generic | #51~20.04.1-Ubuntu SMP Fri Jan 7 06:51:40 UTC 2022 | x86_64 | x86_64 |
CPU Info
Physical cores | Total cores | Max frequency | Min frequency | Current frequency |
---|---|---|---|---|
4 | 8 | 3400 | 400 | 1.892 |
CPU Usage Per Core
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
---|---|---|---|---|---|---|---|---|
15.6 | 16.5 | 10.4 | 14.1 | 15.2 | 14.4 | 15.2 | 12.4 | 14 |
Memory Information
Total | Available | Used | Percentage |
---|---|---|---|
15.51GB | 4.77GB | 9.56GB | 69.2% |
SWAP
Total | Free | Used | Percentage |
---|---|---|---|
15.26GB | 3.28GB | 11.98GB | 78.5% |
Text Only
Creating sample files
+ Creating folder ./test_files
+ Creating files LEVELS=6 FOLDER_COUNT=6 FILE_COUNT_BY_FOLDER=20
+ Created 933120 files in 0:02:08.928680
* Running iterators: GlobFolderIterator IGlobFolderIterator OSWalkFolderIterator ScanDirIterator PathLibFolderIterator
MEMORY USAGE
Iterator | RSS (bytes) | VMS (bytes) | DATA (bytes) |
---|---|---|---|
IGlobFolderIterator | 111280128 | 96468992 | 96468992 |
OSWalkFolderIterator | 111280128 | 96468992 | 96468992 |
ScanDirIterator | 111280128 | 96468992 | 96468992 |
GlobFolderIterator | 117309440 | 102498304 | 120254464 |
PathLibFolderIterator | 469721088 | 455081984 | 455081984 |
ELAPSED TIME
Iterator | Elapsed time | X (times the quickest) |
---|---|---|
ScanDirIterator | 0:00:01.317526 | 1 |
OSWalkFolderIterator | 0:00:02.496639 | 1.9 |
PathLibFolderIterator | 0:00:10.464794 | 7.9 |
IGlobFolderIterator | 0:00:14.358165 | 10.9 |
GlobFolderIterator | 0:00:15.308936 | 11.6 |
TIME FOR FIRST FILE
Iterator | Elapsed time | X (times the quickest) |
---|---|---|
ScanDirIterator | 0:00:00.000098 | 1 |
OSWalkFolderIterator | 0:00:00.000247 | 2.5 |
PathLibFolderIterator | 0:00:00.000690 | 7 |
IGlobFolderIterator | 0:00:00.001135 | 11.6 |
GlobFolderIterator | 0:00:07.831005 | 79908.2 |
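For reference, numbers like "elapsed time" and "time for first file" can be measured with a helper along these lines. This is a rough sketch under my own assumptions, not the benchmark's actual code: `time_iterator` and `walk_paths` are hypothetical names.

```python
import os
import time
from typing import Callable, Iterable, Iterator


def time_iterator(make_iterator: Callable[[], Iterable[str]]) -> tuple[float, float, int]:
    """Return (seconds until first item, total seconds, item count) for one full pass."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in make_iterator():
        if first is None:
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    # If nothing was yielded, report the total time as the time to first item.
    return (total if first is None else first), total, count


def walk_paths(folder: str) -> Iterator[str]:
    """Yield every file path under `folder` using os.walk."""
    for root, _, files in os.walk(folder):
        for file in files:
            yield os.path.join(root, file)


first, total, count = time_iterator(lambda: walk_paths("some_folder/another_folder"))
print(f"{count} files, first after {first:.6f}s, all after {total:.6f}s")
```

The clock starts before the iterator is created, so a function that returns a list, such as `glob.glob`, pays for its whole scan before the first item shows up. That matches the large gap between glob and the other methods in the table above.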
You can check the source code for this benchmark here.
Image from this Wikipedia article.
Last update:
September 18, 2024
Created: September 18, 2024