How to Search Directories and Find Files That Match Regex

How do I search directories and find files that match a regex?

import os
import re

rootdir = "/mnt/externa/Torrents/completed"
regex = re.compile('(.*zip$)|(.*rar$)|(.*r01$)')

for root, dirs, files in os.walk(rootdir):
    for file in files:
        if regex.match(file):
            print(file)

The code below answers the question in the following comment:

That worked really well. Is there a way to do one thing if the match is on regex group 1, another if it is on group 2, and so on? – nillenilsson

import os
import re

# Each alternative is its own capturing group, so the group that
# matched tells us which extension was found.
regex = re.compile('(.*zip$)|(.*rar$)|(.*r01$)')

for root, dirs, files in os.walk("../Documents"):
    for file in files:
        res = regex.match(file)
        if res:
            if res.group(1):
                print("ZIP", file)
            if res.group(2):
                print("RAR", file)
            if res.group(3):
                print("R01", file)

It might be possible to do this in a nicer way, but this works.
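One possibly nicer variant, sketched here under the assumption that only one alternative can match a given filename: give each group a name and read the matching alternative from `lastgroup` instead of testing every group in turn. The group names and the helper `classify` are made up for illustration and are not part of the original answer.

```python
import re

# Named groups; the names are illustrative, not from the original answer.
ARCHIVE_RX = re.compile(r'(?P<zip>.*zip$)|(?P<rar>.*rar$)|(?P<r01>.*r01$)')

def classify(filename):
    """Return the upper-cased name of the group that matched, or None."""
    m = ARCHIVE_RX.match(filename)
    return m.lastgroup.upper() if m else None

for name in ["a.zip", "b.rar", "c.r01", "notes.txt"]:
    print(classify(name), name)
```

Adding another extension then only means adding another named alternative to the pattern.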

How do I search through a folder for the filename that matches a regular expression using Python?

This finds all files whose names start with two digits and end in gif. You can collect the matches in a global list if you wish:

import re
import os

r = re.compile(r'\d{2}.+gif$')
for root, dirs, files in os.walk('/home/vinko'):
    # Full paths of the files in this directory whose names match.
    matches = [os.path.join(root, x) for x in files if r.match(x)]
    if matches:
        print(matches)  # or append to a global list, whatever

How can I recursively find all files in current and subfolders based on regular expressions

To match whole paths that end in a filename matching a given regular expression, you could prepend .*/ to it, for example .*/f.+1$. The .*/ should match the path preceding the filename.
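A minimal Python sketch of the same idea, assuming the pattern has to match the entire path rather than just part of it. The sample paths and the helper name `matches_path` are hypothetical.

```python
import re

def matches_path(filename_regex, path):
    """Full-match `path` against `.*/` followed by the filename regex."""
    return re.fullmatch(".*/" + filename_regex, path) is not None

# Without the prefix, the filename regex alone cannot match a whole path.
print(re.fullmatch(r"f.+1$", "./a/b/foo1") is not None)  # False
print(matches_path(r"f.+1$", "./a/b/foo1"))              # True
print(matches_path(r"f.+1$", "./a/b/xfoo1"))             # False
```

Note that `.+` can itself match a `/`, so a stricter filename pattern would use `[^/]+` in its place.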

Regular expression matching of the contents of text files in a directory

You need to read the files; you're only checking the patterns against the filenames.

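import os
import re

# pattern1 and pattern2 are assumed to be defined earlier (they come from the question).
for file in os.listdir('/home/ea/medical'):
    # Read the file's contents; the with block closes the file afterwards.
    with open(os.path.join('/home/ea/medical', file)) as f:
        contents = f.read()
    status = 1
    if re.search(pattern1, contents):
        status += 1
    if re.search(pattern2, contents):
        status += 1
    print(f"{file} Status: {status}")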

How to search for files using regex in a Linux shell script

Find all .py files.

find / -name '*.py'

Find files with the word "python" in the name.

find / -name '*python*'

Same as above but case-insensitive.

find / -iname '*python*'

Regex match, more flexible. Find both .py files and files with the word "python" in the name.

find / -regex '.*python.*\|.*\.py'
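For comparison, here is a rough Python sketch of how the three predicates above differ: -name and -iname are globs tested against the file name only, while -regex must match the whole path. The sample paths and the `basename` helper are made up for illustration, and the sketch works on plain strings instead of walking a real filesystem.

```python
import fnmatch
import re

# Made-up sample paths, used instead of a real filesystem walk.
paths = ["/usr/bin/python3", "/home/u/run.py", "/home/u/Python.txt", "/etc/hosts"]

def basename(path):
    """Last component of a /-separated path."""
    return path.rsplit("/", 1)[-1]

# Like -name '*python*': case-sensitive glob on the file name only.
by_name = [p for p in paths if fnmatch.fnmatchcase(basename(p), "*python*")]

# Like -iname '*python*': same glob, compared in lower case.
by_iname = [p for p in paths if fnmatch.fnmatchcase(basename(p).lower(), "*python*")]

# Like -regex '.*python.*\|.*\.py': the regex must match the whole path.
rx = re.compile(r".*python.*|.*\.py")
by_regex = [p for p in paths if rx.fullmatch(p)]

print(by_name)   # ['/usr/bin/python3']
print(by_iname)  # ['/usr/bin/python3', '/home/u/Python.txt']
print(by_regex)  # ['/usr/bin/python3', '/home/u/run.py']
```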

List (find) files with repeated pattern in their name

You can use

find . -type f -regextype posix-extended -regex '.*/(20190[0-9]{3})_fl_\1\.nc$'

The regex matches

  • .*/ - any chars up to the rightmost / (necessary because the pattern used with find requires a full string match)
  • (20190[0-9]{3}) - Group 1: 20190 followed by any three digits
  • _fl_ - a fixed substring
  • \1 - backreference to Group 1 value
  • \.nc - .nc string
  • $ - end of input.

The -regextype posix-extended option is necessary because the pattern above uses POSIX ERE syntax: unescaped parentheses for grouping and {3} as a quantifier.
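The backreference behaves the same way in Python's re module, so the pattern can be tried out on a few strings before running find. This is only a sketch; the sample paths are hypothetical, and fullmatch is used to mimic find's requirement that the whole path match.

```python
import re

# Same pattern as the find command; fullmatch anchors both ends,
# so the trailing $ is not needed here.
REPEAT_RX = re.compile(r".*/(20190[0-9]{3})_fl_\1\.nc")

samples = [
    "./data/20190101_fl_20190101.nc",  # date repeated: matches
    "./data/20190101_fl_20190102.nc",  # dates differ: no match
]
for path in samples:
    print(path, REPEAT_RX.fullmatch(path) is not None)
```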

Trying to use GNU find to search recursively for filenames only (not directories) containing a string in any portion of the file name

specification:

  1. match "rain"
  2. in filename
  3. only at start of a word
  4. case-insensitive

assumptions:

  5. define "word" to be sequence of letters (no punctuation, digits, etc)
  6. paths have form prefix/name where prefix can have one or more levels delimited by / and name does not contain /

constraints:

  7. find -iregex matches against entire path (-name only matches filename)
  8. find -iregex must match entirety of path (eg. "c" is only a partial match and does not match path "a/b/c")


method:

find can return matches against non-files (eg. directories). Given definition 6, we would be unable to tell if name is a directory or an ordinary file. To satisfy 2, we can exclude non-files using find's -type f predicate.

We can compare paths found by find against our specification by using find's case-insensitive regex matching predicate (-iregex). The "grep" flavour (-regextype grep) is sufficiently expressive.

Just using 1, a suitable regex is: rain

2+6+7 says we must forbid / after "rain": rain[^/]*$

  • [/] matches character in set (ie. /)
  • [^/]: ^ inverts match: ie. character that is not /
  • * matches preceding match zero or more times
  • $ constrains preceding match to occur at end of input

3+5 says there must be no immediately preceding word characters: [^a-z]rain[^/]*$

  • a-z is a shortcut for the range a to z

8 requires matching the prefix explicitly: ^.*[^a-z]rain[^/]*$

  • ^ outside of [...] constrains subsequent match to occur at beginning of input
  • . matches anything
  • [^a-z] matches a non-alphabetic

Final command-line:

find . -type f -regextype grep -iregex '^.*[^a-z]rain[^/]*$'

Note: The leading ^ and trailing $ are not actually required, given 8, and could be elided.
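The same regex can be checked in Python before it is handed to find. The sketch below assumes POSIX-style paths joined with /, and the names `RAIN_RX`, `is_rain_path` and `find_rain_files` are invented for illustration. os.walk's `files` list plays roughly the role of -type f, and re.IGNORECASE that of -iregex.

```python
import os
import re

# Same regex as the find command; fullmatch already anchors both ends,
# so the leading ^ and trailing $ are left out.
RAIN_RX = re.compile(r".*[^a-z]rain[^/]*", re.IGNORECASE)

def is_rain_path(path):
    """True if the whole path matches the regex."""
    return RAIN_RX.fullmatch(path) is not None

def find_rain_files(top="."):
    """Yield paths of non-directory entries under `top` that match."""
    for root, _dirs, files in os.walk(top):
        for name in files:  # `files` excludes directories, roughly like -type f
            path = root + "/" + name
            if is_rain_path(path):
                yield path

for p in ["./music/Purple Rain.mp3", "./docs/training.pdf", "./rain/notes.txt"]:
    print(p, is_rain_path(p))
```

Only the first sample matches: "training" has a letter directly before "rain", and in the third path "rain" is a directory name, so a / follows it.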



exercise for the reader:


  1. extend "word" to non-ASCII characters (eg. UTF-8)

