Python Recursive Folder Read

Python recursive folder read

Make sure you understand the three return values of os.walk:

for root, subdirs, files in os.walk(rootdir):

has the following meaning:

  • root: the path of the directory currently being walked
  • subdirs: the names of the subdirectories in root
  • files: the names of the entries in root (not in subdirs) that are not directories
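A quick way to see the three values is to walk a small throwaway tree (the names below are made up for the example):

```python
import os
import tempfile

# Build a tiny throwaway tree: base/top.txt and base/sub/nested.txt
base = tempfile.mkdtemp()
os.mkdir(os.path.join(base, 'sub'))
open(os.path.join(base, 'top.txt'), 'w').close()
open(os.path.join(base, 'sub', 'nested.txt'), 'w').close()

for root, subdirs, files in os.walk(base):
    # root is the directory being visited; subdirs and files are its direct children
    print(root, subdirs, files)
```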

And please use os.path.join instead of concatenating with a slash! Your problem is filePath = rootdir + '/' + file: you must join with the currently "walked" folder, not the topmost folder, so it must be filePath = os.path.join(root, file). By the way, file is a built-in name (in Python 2), so you don't normally use it as a variable name.
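A minimal illustration of the difference ('some_root' and 'example.txt' are placeholder names):

```python
import os

# os.path.join inserts the correct separator for the platform instead of a
# hard-coded '/'; the names here are placeholders, not real paths.
file_path = os.path.join('some_root', 'example.txt')
print(file_path)
```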

Another problem are your loops, which should be like this, for example:

import os
import sys

walk_dir = sys.argv[1]

print('walk_dir = ' + walk_dir)

# If your current working directory may change during script execution, it's recommended to
# immediately convert program arguments to an absolute path. Then the variable root below will
# be an absolute path as well. Example:
# walk_dir = os.path.abspath(walk_dir)
print('walk_dir (absolute) = ' + os.path.abspath(walk_dir))

for root, subdirs, files in os.walk(walk_dir):
    print('--\nroot = ' + root)
    list_file_path = os.path.join(root, 'my-directory-list.txt')
    print('list_file_path = ' + list_file_path)

    with open(list_file_path, 'wb') as list_file:
        for subdir in subdirs:
            print('\t- subdirectory ' + subdir)

        for filename in files:
            file_path = os.path.join(root, filename)

            print('\t- file %s (full path: %s)' % (filename, file_path))

            with open(file_path, 'rb') as f:
                f_content = f.read()
                list_file.write(('The file %s contains:\n' % filename).encode('utf-8'))
                list_file.write(f_content)
                list_file.write(b'\n')

If you didn't know, the with statement for files is a shorthand:

with open('filename', 'rb') as f:
    dosomething()

# is effectively the same as

f = open('filename', 'rb')
try:
    dosomething()
finally:
    f.close()

Scanning and listing files recursively without os.walk()

About your code:

Your recursion doesn't take level in as an argument correctly, so incrementing it inside the loop doesn't do what you expect, as seen in the growing '---'/'-----' separators in your output.
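One way to fix this is to pass level + 1 into the recursive call instead of mutating a shared counter; tree_lines below is a hypothetical variant that returns the lines instead of printing them, so the result is easy to inspect:

```python
import os

def tree_lines(path='.', level=1):
    # Collect indented entries; each recursive call receives its own depth value,
    # so siblings that follow a directory keep the correct indentation.
    lines = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        lines.append("----" * level + name)
        if os.path.isdir(full):
            lines.extend(tree_lines(full, level + 1))
    return lines
```

Because level + 1 is an expression, the caller's own level is never modified.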

Here I refactored a little your code:

import os

def tree(path='.', level=1):
    #print(path)
    #level=1
    separator = "----" * level
    #print("Scanning files in " + path)
    #print(f"content: {os.listdir(path)}")
    for each_item in os.listdir(path):
        #print('Checking item ', each_item)
        if os.path.isfile(path + f'/{each_item}'):
            #print("file found! It's called " + each_item)
            print(separator + each_item)
        elif os.path.isdir(path + f'/{each_item}'):
            #print("directory found! It's called " + each_item)
            print(separator + each_item)
            level += 1
            #print(f"Entering directory...{each_item}")
            tree(path + f'/{each_item}', level)
        else:
            print(f"{each_item} is corrupted or of an unknown format.")

tree()

I commented out most of the prints within the function; the remaining output:

----Folder
------------example.py
------------NewDirectory
----------------samplefile.txt
----------------folder
--------------------root.txt
------------NewDirectory - Shortcut.lnk
----initial.py

Check online compiler.

One note on recursion with default argument values (i.e. def function(argument=10)): you should check this article out.
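One classic gotcha such articles usually cover: a mutable default value is created once, at function definition time, and is then shared across calls:

```python
def collect(item, bucket=[]):  # the default list is built once, when def runs
    bucket.append(item)
    return bucket

first = collect('a')
second = collect('b')
print(second)  # both calls appended to the very same list: ['a', 'b']
```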

On a second note, regarding fnmatch and glob (and not os.walk from the title):

For a defined structure like yours, you could use fnmatch to find files that have py (Python) as extension. Or txt (text), and so on:

import fnmatch
import os

for filename in os.listdir('.'):
    if fnmatch.fnmatch(filename, '*.py'):
        print(filename)

I created a basic example of this approach inside the compiler link (check the main2.py file).

On a third note, check the Path class from the pathlib library:

from pathlib import Path

for path in Path('.').rglob('*'):
    print(path.name)

Using os.walk() to recursively traverse directories in Python

This will give you the desired result:

#!/usr/bin/python

import os

# traverse root directory, and list directories as dirs and files as files
for root, dirs, files in os.walk("."):
    path = root.split(os.sep)
    print((len(path) - 1) * '---', os.path.basename(root))
    for file in files:
        print(len(path) * '---', file)

Python recursive directory reading without os.walk

os.chdir() returns None, not the new directory name. You pass that result to the recursive walkfn() function, and then to os.listdir().
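A two-line demonstration of the problem (the temporary directory stands in for your real one):

```python
import os
import tempfile

new_dir = tempfile.mkdtemp()
result = os.chdir(new_dir)  # changes the working directory as a side effect...
print(result)               # ...but the return value is None, not the path
```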

There is no need to assign, just pass path to walkfn():

for name in os.listdir(dirname):
    path = os.path.join(dirname, name)
    if os.path.isdir(path):
        print("'", name, "'")
        os.chdir(path)
        walkfn(path)

You usually want to avoid changing directories; there is no need to if your code uses absolute paths:

def walkfn(dirname):
    output = os.path.join(dirname, 'output')
    if os.path.exists(output):
        with open(output) as file1:
            for line in file1:
                if line.startswith('Final value:'):
                    print(line)
    else:
        for name in os.listdir(dirname):
            path = os.path.join(dirname, name)
            if os.path.isdir(path):
                print("'", name, "'")
                walkfn(path)

Getting all files under a folder and subfolders in a recursive manner, without the path of the subfolders itself

os.walk() recurses into all subdirectories. The first element returned in each iteration is the path to the directory, you join that with the filename to get the full path of the file.

import os

def get_all_filePaths(folderPath):
    result = []
    for dirpath, dirnames, filenames in os.walk(folderPath):
        result.extend([os.path.join(dirpath, filename) for filename in filenames])
    return result

How to use glob() to find files recursively?

pathlib.Path.rglob

Use pathlib.Path.rglob from the pathlib module, which was introduced in Python 3.5.

from pathlib import Path

for path in Path('src').rglob('*.c'):
    print(path.name)

If you don't want to use pathlib, you can use glob.glob('**/*.c', recursive=True), but don't forget to pass the recursive keyword argument; also note that it can take an inordinate amount of time on large directories.
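A sketch of the glob.glob variant against a throwaway tree (the file names are made up for the example):

```python
import glob
import os
import tempfile

# Build base/src/a.c, base/src/deep/b.c and a non-matching text file.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, 'src', 'deep'))
for rel in ('src/a.c', 'src/deep/b.c', 'src/notes.txt'):
    open(os.path.join(base, *rel.split('/')), 'w').close()

# recursive=True is what makes '**' cross directory boundaries;
# '**' also matches "zero directories", so src/a.c is found too.
c_files = sorted(glob.glob(os.path.join(base, 'src', '**', '*.c'), recursive=True))
print(c_files)
```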

For cases where you need to match files beginning with a dot (.), such as hidden files on Unix-based systems, use the os.walk solution below.
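To see why: glob skips names starting with a dot unless the pattern itself starts with one, while os.walk lists them like any other file (the file names below are made up):

```python
import glob
import os
import tempfile

base = tempfile.mkdtemp()
open(os.path.join(base, '.hidden.c'), 'w').close()
open(os.path.join(base, 'plain.c'), 'w').close()

# glob silently skips the dot-file...
globbed = [os.path.basename(p) for p in glob.glob(os.path.join(base, '*.c'))]
# ...while os.walk reports every file name, hidden or not.
walked = sorted(next(os.walk(base))[2])
print(globbed, walked)
```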

os.walk

For older Python versions, use os.walk to recursively walk a directory and fnmatch.filter to match against a simple expression:

import fnmatch
import os

matches = []
for root, dirnames, filenames in os.walk('src'):
    for filename in fnmatch.filter(filenames, '*.c'):
        matches.append(os.path.join(root, filename))

python recursive directory reading

This seems to work for me:

import os

op = os.path

def fileRead(mydir):
    data = {}
    root = set()
    for i in os.listdir(mydir):
        path = op.join(mydir, i)
        print(path)
        if op.isfile(path):
            data.setdefault(i, set())
            root.add(op.relpath(mydir).replace("\\", "/"))
            data[i] = root
        else:
            data.update(fileRead(path))
    return data

d = fileRead(r"c:\python32\programas")
print(d)

Still, I am not sure why you use the set root. I think the purpose is to keep all the directories when you have the same file in two directories. But it doesn't work: each update deletes the stored values for repeated keys (file names).
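The overwriting is plain dict behavior: update replaces the value for an existing key, it does not merge:

```python
# Two recursive results that both contain 'copia.py' (paths are illustrative):
outer = {'copia.py': ['pack']}
inner = {'copia.py': ['pack/pack2/pack3']}

outer.update(inner)
print(outer)  # the earlier directory list for 'copia.py' is gone
```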

Here you have working code using a defaultdict (you can do the same with an ordinary dictionary, as in your code, but with defaultdict you don't need to check whether a key has been initialized before):

import os
from collections import defaultdict
op = os.path

def fileRead(mydir):
    data = defaultdict(list)
    for i in os.listdir(mydir):
        path = op.join(mydir, i)
        print(path)
        if op.isfile(path):
            root = op.relpath(mydir).replace("\\", "/")
            data[i].append(root)
        else:
            for k, v in fileRead(path).items():
                data[k].extend(v)
    return data

d = fileRead(r"c:\python32\programas")
print(d)

Edit: relative to the comment from @hughdbrown:

If you update data with data.update(fileRead(path).items()), you get this when calling fileRead("c:/python26/programas/pack") on my computer (now in py26):

c:/python26/programas/pack\copia.py
c:/python26/programas/pack\in pack.py
c:/python26/programas/pack\pack2
c:/python26/programas/pack\pack2\copia.py
c:/python26/programas/pack\pack2\in_pack2.py
c:/python26/programas/pack\pack2\pack3
c:/python26/programas/pack\pack2\pack3\copia.py
c:/python26/programas/pack\pack2\pack3\in3.py
defaultdict(<type 'list'>, {'in3.py': ['pack/pack2/pack3'], 'copia.py': ['pack/pack2/pack3'],
'in pack.py': ['pack'], 'in_pack2.py': ['pack/pack2']})

Note that files that are repeated in several directories (copia.py) only show one of those directories, the deepest one. However, all the directories are listed when using:

for k, v in fileRead(path).items():  data[k].extend(v)

c:/python26/programas/pack\copia.py
c:/python26/programas/pack\in pack.py
c:/python26/programas/pack\pack2
c:/python26/programas/pack\pack2\copia.py
c:/python26/programas/pack\pack2\in_pack2.py
c:/python26/programas/pack\pack2\pack3
c:/python26/programas/pack\pack2\pack3\copia.py
c:/python26/programas/pack\pack2\pack3\in3.py
defaultdict(<type 'list'>, {'in3.py': ['pack/pack2/pack3'], 'copia.py': ['pack', 'pack/pack2', 'pack/pack2/pack3'],
'in pack.py': ['pack'], 'in_pack2.py': ['pack/pack2']})

Recursively read files from sub-folders into a list and merge each sub-folder's files into one csv per sub-folder

The issue is most probably that in the main directory, Folder (or /dir according to your code), you do not have any files, so file_list is empty and hence df_list is also empty. So when you pass an empty list into pd.concat(), you get that error. Example -

In [5]: pd.concat([])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython> in <module>()
----> 1 pd.concat([])

/path/to/merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
752 keys=keys, levels=levels, names=names,
753 verify_integrity=verify_integrity,
--> 754 copy=copy)
755 return op.get_result()
756

/path/to/merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
797
798 if len(objs) == 0:
--> 799 raise ValueError('All objects passed were None')
800
801 # consolidate data & figure out what our result ndim is going to be

ValueError: All objects passed were None

I would suggest you check that the entries you are reading really are files and that they end with .csv, and that df_list is not empty when you pass it into pd.concat(). Also, I would suggest using os.path.join() rather than concatenating strings to create paths. Example -

import pandas as pd
import os.path
import os

working_dir = "/dir/"

for root, dirs, files in os.walk(working_dir):
    file_list = []
    for filename in files:
        if filename.endswith('.csv'):
            file_list.append(os.path.join(root, filename))
    df_list = [pd.read_csv(filepath) for filepath in file_list]
    if df_list:
        final_df = pd.concat(df_list)
        final_df.to_csv(os.path.join(root, "Final.csv"))

EDIT:

As you say -

Also the output is adding another column that looks to be an id column.

The new column that comes in is most probably the index of the DataFrames.

When doing DataFrame.to_csv(), if you do not want the index of the DataFrame to be written to the csv, you should pass the index keyword argument as False. Example -

final_df.to_csv(os.path.join(root, "Final.csv"), index=False)

