Python recursive folder read
Make sure you understand the three return values of os.walk:

for root, subdirs, files in os.walk(rootdir):

has the following meaning:
root: the current path which is "walked through"
subdirs: files in root of type directory
files: files in root (not in subdirs) of type other than directory
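A minimal sketch of those three values, using a throwaway directory tree (the names 'child' and 'note.txt' are made up for the demo):

```python
import os
import tempfile

# Build a tiny throwaway tree: one subdirectory and one file.
with tempfile.TemporaryDirectory() as d:
    os.mkdir(os.path.join(d, 'child'))
    open(os.path.join(d, 'note.txt'), 'w').close()

    # The first tuple yielded by os.walk describes the top directory itself.
    root, subdirs, files = next(os.walk(d))
    print(subdirs)  # ['child']
    print(files)    # ['note.txt']
```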
And please use os.path.join instead of concatenating with a slash! Your problem is filePath = rootdir + '/' + file - you must concatenate the currently "walked" folder instead of the topmost folder, so it must be filePath = os.path.join(root, file). BTW, "file" is a built-in (in Python 2), so you don't normally use it as a variable name.
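A quick illustration of the difference (the directory names are made up):

```python
import os

# os.path.join uses the platform's own separator,
# unlike manual '/' concatenation.
joined = os.path.join('top', 'sub', 'notes.txt')
print(joined)  # 'top/sub/notes.txt' on POSIX, 'top\\sub\\notes.txt' on Windows
```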
Another problem are your loops, which should be like this, for example:
import os
import sys
walk_dir = sys.argv[1]
print('walk_dir = ' + walk_dir)
# If your current working directory may change during script execution, it's recommended to
# immediately convert program arguments to an absolute path. Then the variable root below will
# be an absolute path as well. Example:
# walk_dir = os.path.abspath(walk_dir)
print('walk_dir (absolute) = ' + os.path.abspath(walk_dir))
for root, subdirs, files in os.walk(walk_dir):
    print('--\nroot = ' + root)
    list_file_path = os.path.join(root, 'my-directory-list.txt')
    print('list_file_path = ' + list_file_path)
    with open(list_file_path, 'wb') as list_file:
        for subdir in subdirs:
            print('\t- subdirectory ' + subdir)
        for filename in files:
            file_path = os.path.join(root, filename)
            print('\t- file %s (full path: %s)' % (filename, file_path))
            with open(file_path, 'rb') as f:
                f_content = f.read()
                list_file.write(('The file %s contains:\n' % filename).encode('utf-8'))
                list_file.write(f_content)
                list_file.write(b'\n')
If you didn't know, the with
statement for files is a shorthand:
with open('filename', 'rb') as f:
    dosomething()

# is effectively the same as

f = open('filename', 'rb')
try:
    dosomething()
finally:
    f.close()
Scanning and listing files recursively without os.walk()
About your code:
Your recursion doesn't take level in as an argument, so incrementing it has no effect across calls, as seen in your output (the '---', '-----' prefixes never grow with depth).
Here I refactored your code a little:
import os

def tree(path='.', level=1):
    separator = "----" * level
    for each_item in os.listdir(path):
        item_path = os.path.join(path, each_item)
        if os.path.isfile(item_path):
            print(separator + each_item)
        elif os.path.isdir(item_path):
            print(separator + each_item)
            # pass level + 1 down instead of mutating level, so later
            # siblings of this directory keep the current depth
            tree(item_path, level + 1)
        else:
            print(f"{each_item} is corrupted or of an unknown format.")

tree()
With most of the debugging prints removed from within the function, the remaining output is:
----Folder
--------example.py
--------NewDirectory
------------samplefile.txt
------------folder
----------------root.txt
--------NewDirectory - Shortcut.lnk
----initial.py
One note on recursion with default argument values (default meaning def function(argument=10)): you should check this article out.
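That note concerns the classic default-argument pitfall: default values are evaluated once, at function definition time, so a mutable default is shared across calls. A minimal illustration (the function name is made up):

```python
def append_to(item, bucket=[]):  # the default list is created once, at def time
    bucket.append(item)
    return bucket

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2] -- the same list is reused across calls
```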
On a second note, regarding fnmatch and glob (and not os.walk from the title): for a defined structure like yours, you could use fnmatch to find files that have py as their extension (Python files), or txt, and so on:
import fnmatch
import os

for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*.py'):
        print(file)
On a third note, check the Path class from the pathlib library:
from pathlib import Path

for path in Path('.').rglob('*'):
    print(path.name)
Using os.walk() to recursively traverse directories in Python
This will give you the desired result
#!/usr/bin/python
import os

# traverse root directory, and list directories as dirs and files as files
for root, dirs, files in os.walk("."):
    path = root.split(os.sep)
    print((len(path) - 1) * '---', os.path.basename(root))
    for file in files:
        print(len(path) * '---', file)
Python recursive directory reading without os.walk
os.chdir()
returns None
, not the new directory name. You pass that result to the recursive walkfn()
function, and then to os.listdir()
.
There is no need to assign, just pass path
to walkfn()
:
for name in os.listdir(dirname):
    path = os.path.join(dirname, name)
    if os.path.isdir(path):
        print("'", name, "'")
        os.chdir(path)
        walkfn(path)
You usually want to avoid changing directories; there is no need to if your code uses absolute paths:
def walkfn(dirname):
    output = os.path.join(dirname, 'output')
    if os.path.exists(output):
        with open(output) as file1:
            for line in file1:
                if line.startswith('Final value:'):
                    print(line)
    else:
        for name in os.listdir(dirname):
            path = os.path.join(dirname, name)
            if os.path.isdir(path):
                print("'", name, "'")
                walkfn(path)
Getting all files under a folder and subfolders in a recursive manner, without the path of the subfolders itself
os.walk()
recurses into all subdirectories. The first element returned in each iteration is the path to the directory, you join that with the filename to get the full path of the file.
def get_all_filePaths(folderPath):
    result = []
    for dirpath, dirnames, filenames in os.walk(folderPath):
        result.extend([os.path.join(dirpath, filename) for filename in filenames])
    return result
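A usage sketch on a throwaway tree (the file names a.txt, sub/b.txt are made up; the function is restated here so the sketch runs standalone):

```python
import os
import tempfile

def get_all_filePaths(folderPath):
    result = []
    for dirpath, dirnames, filenames in os.walk(folderPath):
        result.extend([os.path.join(dirpath, filename) for filename in filenames])
    return result

# Demo tree: root/a.txt and root/sub/b.txt
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, 'a.txt'), 'w').close()
    os.mkdir(os.path.join(root, 'sub'))
    open(os.path.join(root, 'sub', 'b.txt'), 'w').close()
    rel = sorted(os.path.relpath(p, root) for p in get_all_filePaths(root))
    print(rel)  # the subfolder prefix is kept in each returned path
```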
How to use glob() to find files recursively?
pathlib.Path.rglob
Use pathlib.Path.rglob from the pathlib module, which was introduced in Python 3.4.
from pathlib import Path

for path in Path('src').rglob('*.c'):
    print(path.name)
If you don't want to use pathlib, you can use glob.glob('**/*.c'), but don't forget to pass the recursive keyword argument, and note that it can spend an inordinate amount of time on large directories.
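A small sketch of the recursive form, run against a throwaway tree (the src/deep/x.c layout is made up for the demo):

```python
import glob
import os
import tempfile

# Throwaway tree: src/deep/x.c
with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, 'src', 'deep'))
    open(os.path.join(d, 'src', 'deep', 'x.c'), 'w').close()

    # '**' only matches across directory levels when recursive=True is passed.
    hits = glob.glob(os.path.join(d, 'src', '**', '*.c'), recursive=True)
    print(len(hits))  # 1
```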
For cases where you need to match files beginning with a dot (.), like files in the current directory or hidden files on Unix-based systems, use the os.walk solution below.
os.walk
For older Python versions, use os.walk
to recursively walk a directory and fnmatch.filter
to match against a simple expression:
import fnmatch
import os

matches = []
for root, dirnames, filenames in os.walk('src'):
    for filename in fnmatch.filter(filenames, '*.c'):
        matches.append(os.path.join(root, filename))
python recursive directory reading
This seems to work for me
import os

op = os.path

def fileRead(mydir):
    data = {}
    root = set()
    for i in os.listdir(mydir):
        path = op.join(mydir, i)
        print(path)
        if op.isfile(path):
            data.setdefault(i, set())
            root.add(op.relpath(mydir).replace("\\", "/"))
            data[i] = root
        else:
            data.update(fileRead(path))
    return data

d = fileRead(r"c:\python32\programas")
print(d)
Still, I am not sure why you use the set root. I think the purpose is to keep all the directories when you have the same file in two directories, but it doesn't work: each update deletes the stored values for repeated keys (file names).
Here you have working code using a defaultdict. You can do the same with an ordinary dictionary (as in your code), but with defaultdict you don't need to check whether a key has been initialized first:
import os
from collections import defaultdict

op = os.path

def fileRead(mydir):
    data = defaultdict(list)
    for i in os.listdir(mydir):
        path = op.join(mydir, i)
        print(path)
        if op.isfile(path):
            root = op.relpath(mydir).replace("\\", "/")
            data[i].append(root)
        else:
            for k, v in fileRead(path).items():
                data[k].extend(v)
    return data

d = fileRead(r"c:\python32\programas")
print(d)
Edit: Relative to the comment from @hughdbrown: if you update data with data.update(fileRead(path).items()), you get this when calling fileRead("c:/python26/programas/pack") on my computer (now in py26):
c:/python26/programas/pack\copia.py
c:/python26/programas/pack\in pack.py
c:/python26/programas/pack\pack2
c:/python26/programas/pack\pack2\copia.py
c:/python26/programas/pack\pack2\in_pack2.py
c:/python26/programas/pack\pack2\pack3
c:/python26/programas/pack\pack2\pack3\copia.py
c:/python26/programas/pack\pack2\pack3\in3.py
defaultdict(<type 'list'>, {'in3.py': ['pack/pack2/pack3'], 'copia.py': ['pack/pack2/pack3'],
'in pack.py': ['pack'], 'in_pack2.py': ['pack/pack2']})
Note that files that are repeated in several directories (copia.py) only show one of those directories, the deepest one. However, all the directories are listed when using:
for k, v in fileRead(path).items(): data[k].extend(v)
c:/python26/programas/pack\copia.py
c:/python26/programas/pack\in pack.py
c:/python26/programas/pack\pack2
c:/python26/programas/pack\pack2\copia.py
c:/python26/programas/pack\pack2\in_pack2.py
c:/python26/programas/pack\pack2\pack3
c:/python26/programas/pack\pack2\pack3\copia.py
c:/python26/programas/pack\pack2\pack3\in3.py
defaultdict(<type 'list'>, {'in3.py': ['pack/pack2/pack3'], 'copia.py': ['pack', 'pack/pack2', 'pack/pack2/pack3'],
'in pack.py': ['pack'], 'in_pack2.py': ['pack/pack2']})
Recursively read files from sub-folders into a list and merge each sub-folder's files into one csv per sub-folder
The issue is most probably that in the main directory - Folder (or /dir according to your code) - you do not have any files, so file_list is empty and hence df_list is also empty. So when you pass an empty list into pd.concat(), you get that error. Example -
In [5]: pd.concat([])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython> in <module>()
----> 1 pd.concat([])
/path/to/merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
752 keys=keys, levels=levels, names=names,
753 verify_integrity=verify_integrity,
--> 754 copy=copy)
755 return op.get_result()
756
/path/to/merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
797
798 if len(objs) == 0:
--> 799 raise ValueError('All objects passed were None')
800
801 # consolidate data & figure out what our result ndim is going to be
ValueError: All objects passed were None
I would suggest you check that the files you are reading really are files, that they end with .csv, and that df_list is not empty when you pass it into pd.concat(). Also, I would suggest that you use os.path.join(), rather than concatenating strings, to create paths. Example -
import pandas as pd
import os.path
import os

working_dir = "/dir/"
for root, dirs, files in os.walk(working_dir):
    file_list = []
    for filename in files:
        if filename.endswith('.csv'):
            file_list.append(os.path.join(root, filename))
    df_list = [pd.read_table(file) for file in file_list]
    if df_list:
        final_df = pd.concat(df_list)
        final_df.to_csv(os.path.join(root, "Final.csv"))
EDIT:
As you say -
Also the output is adding another column that looks to be an id column.
The new column that comes in is most probably the index of the DataFrames.
When doing DataFrame.to_csv(), if you do not want the index of the DataFrame to be written to the csv, specify the index keyword argument as False. Example -
final_df.to_csv(os.path.join(root, "Final.csv"), index=False)