Retrieving Files from a Directory That Contains a Large Number of Files

Retrieving files from a directory that contains a large number of files

Have you tried the EnumerateFiles method of the DirectoryInfo class?

As MSDN says:

The EnumerateFiles and GetFiles methods differ as follows: When you
use EnumerateFiles, you can start enumerating the collection of
FileInfo objects before the whole collection is returned; when you
use GetFiles, you must wait for the whole array of FileInfo objects to
be returned before you can access the array. Therefore, when you are
working with many files and directories, EnumerateFiles can be more
efficient.
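
For example, a minimal sketch in C# (the directory path here is a placeholder):

using System;
using System.IO;

class Program
{
    static void Main()
    {
        // Placeholder path; point this at your large directory.
        var dir = new DirectoryInfo(@"C:\HugeFolder");

        // EnumerateFiles yields FileInfo objects one at a time,
        // so processing can begin before the directory scan finishes.
        foreach (FileInfo file in dir.EnumerateFiles())
        {
            Console.WriteLine(file.FullName);
        }
    }
}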

Listing a very large number of files in a directory in C#

You are creating a list of 20 million objects in memory. I don't think you will ever be able to use that, even if it were possible to build it.

Instead, use Directory.EnumerateFiles(searchDir) and iterate over the items one by one, like this:

foreach (var file in Directory.EnumerateFiles(searchDir))
{
    // Copy to another location, or do other work with the file
}

With your current code, your program first loads all 20 million objects into memory, and only then can you iterate over them or perform operations on them.

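To make the contrast concrete, here is a minimal sketch (searchDir stands in for your 20-million-file directory):

// GetFiles materializes the whole array before returning:
// string[] all = Directory.GetFiles(searchDir); // ~20 million names in memory at once

// EnumerateFiles streams the names one at a time:
foreach (var file in Directory.EnumerateFiles(searchDir))
{
    // only the current name needs to be held here
}
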
See: Directory.EnumerateFiles Method (String)

The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned; when you use GetFiles, you must wait for the whole array of names to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.

Getting files in a directory that has over 7 million items using PowerShell

When you do not need any information except the file name, you should use [System.IO.Directory]::EnumerateFiles($folderPath, '*').

EnumerateFiles returns IEnumerable[String].
IEnumerable is a type that can be used in foreach statements. It does not load everything into memory; instead, it fetches the next item only when requested, so it starts producing results almost immediately.

So, your code will be

$filesIEnumerable = [System.IO.Directory]::EnumerateFiles($folderPath, '*')
foreach ($fullName in $filesIEnumerable) {
    # code here
    $fileName = [System.IO.Path]::GetFileName($fullName)
    # more code here
}

In case you want to keep the whole list of files in memory instead of iterating once (for example, if you need to iterate several times), EnumerateFiles is still faster and requires less memory than Get-ChildItem, because it does not fetch any extended file attributes:

$files = @([System.IO.Directory]::EnumerateFiles($folderPath,'*'))

Read more about EnumerateFiles at learn.microsoft.com.

PHP - read a huge number of files from a directory

You can use RecursiveDirectoryIterator with a Generator if memory is a huge issue.

function recursiveDirectoryIterator($path) {
    foreach (new RecursiveIteratorIterator(new RecursiveDirectoryIterator($path)) as $file) {
        if (!$file->isDir()) {
            // getFilename() already includes the extension
            yield $file->getFilename();
        }
    }
}

$start = microtime(true);
$instance = recursiveDirectoryIterator('../vendor');
$total_files = 0;
foreach ($instance as $value) {
    // echo $value;
    $total_files++;
}
echo "Mem peak usage: " . (memory_get_peak_usage(true) / 1024 / 1024) . " MiB\n";
echo "Total number of files: " . $total_files . "\n";
echo "Completed in: ", microtime(true) - $start, " seconds\n";

Here's what I got on my not-so-great laptop.

[Screenshot of the script's output: peak memory usage, total file count, and elapsed time]

How do I get all files from a directory with a variable extension of specified length?

I don't believe there's a way you can do this without looping through the files in the directory and its subfolders. The search pattern for GetFiles doesn't support regular expressions, so we can't use something like [\d]{7} as a filter. I would suggest using Directory.EnumerateFiles and then returning the files that match your criteria.

You can use this to enumerate the files:

// Requires: using System; using System.Collections.Generic;
//           using System.IO; using System.Linq;
private static IEnumerable<string> GetProprietaryFiles(string topDirectory)
{
    Func<string, bool> filter = f =>
    {
        string extension = Path.GetExtension(f);
        // the extension is 8 characters long including the '.'
        // and all remaining characters are digits
        return extension.Length == 8 && extension.Skip(1).All(char.IsDigit);
    };

    // EnumerateFiles allows us to step through the files without
    // loading all of the filenames into memory at once.
    IEnumerable<string> matchingFiles =
        Directory.EnumerateFiles(topDirectory, "*", SearchOption.AllDirectories)
                 .Where(filter);

    // Return each file as the enumerable is iterated
    foreach (var file in matchingFiles)
    {
        yield return file;
    }
}

Path.GetExtension includes the leading dot, so we check that the extension is 8 characters long including the dot, and that the remaining 7 characters are all digits (so a file named report.1234567 would match).

Usage:

List<string> fileList = GetProprietaryFiles(someDir).ToList();
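
If you only need a single pass over the matches, you can skip ToList and stream the results instead, which keeps memory usage flat:

foreach (var file in GetProprietaryFiles(someDir))
{
    // process each matching file as it is found
}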

Limiting the number of files grabbed from System.IO.Directory.GetFiles

The Directory.GetFiles method always retrieves the full list of matching files before returning. There is no way to limit it (outside of specifying a narrower search pattern, that is). There is, however, the Directory.EnumerateFiles method, which does what you need. From the MSDN article:

The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned; when you use GetFiles, you must wait for the whole array of names to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.

So, for instance, you could do something like this:

dirs = Directory.
    EnumerateFiles(
        JigFolderBrowse.SelectedPath.ToString(),
        "*",
        SearchOption.AllDirectories).
    Take(50).
    ToArray()

Take is a LINQ extension method which returns only the first x items from any IEnumerable(Of T) list. So, in order for that line to work, you'll need to import the System.Linq namespace. If you can't, or don't want to, use LINQ, you can implement your own method that does the same sort of thing: iterate the IEnumerable list in a loop and return after reading only the first 50 items, as sketched below.
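
For illustration, a hand-rolled equivalent might look like this (a C# sketch of the idea; TakeFirst is a hypothetical helper, not part of the framework, and the VB.NET translation is mechanical):

using System.Collections.Generic;

// A minimal stand-in for LINQ's Take: stop pulling items from
// the source once 'count' of them have been yielded.
static IEnumerable<string> TakeFirst(IEnumerable<string> source, int count)
{
    int taken = 0;
    foreach (var item in source)
    {
        if (taken >= count)
            yield break;
        yield return item;
        taken++;
    }
}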

Side Note 1: Unused Array

Also, it's worth mentioning that, in your code, you initialize your dirs variable to point to a 50-element string array. You then, in the very next line, set it to point to a whole new array (the one returned by the Directory.GetFiles method). While it doesn't break functionality, it is unnecessarily inefficient: you're creating that extra array, and giving the garbage collector extra work to do, for no reason. The first array is never used; it just gets dereferenced and discarded in the very next line. It would be better to declare the array variable without initializing it:

Dim dirs() As String

Or

Dim dirs() As String = Nothing

Or, better yet:

Dim dirs() As String = Directory.
    EnumerateFiles(
        JigFolderBrowse.SelectedPath.ToString(),
        "*",
        SearchOption.AllDirectories).
    Take(50).
    ToArray()

Side Note 2: File Extension Comparisons

Also, it looks like you are trying to compare the file extensions in a case-insensitive way. There are two problems with the way you are doing it. First, you are only comparing against two values: all lowercase (e.g. ".pdf") and all uppercase (e.g. ".PDF"). That won't work with mixed case (e.g. ".Pdf").

It is admittedly annoying that the String.Contains method does not have a case-sensitivity option. So, while it's a little hokey, the best option would be to make use of the String.IndexOf method, which does have a case-insensitive option:

If dirs(i).IndexOf(".pdf", StringComparison.CurrentCultureIgnoreCase) <> -1 Then

However, the second problem, which invalidates my last point of advice, is that you are checking whether the string contains the particular file extension rather than whether it ends with it. So, for instance, a file name like "My.pdf.zip" will still match, even though its extension is ".zip" rather than ".pdf". Perhaps this was your intent, but, if not, I would recommend using the Path.GetExtension method to get the actual extension of the file name and then compare that. For instance:

Dim ext As String = Path.GetExtension(dirs(i))
' Path.GetExtension returns the extension including the leading dot
If ext.Equals(".pdf", StringComparison.CurrentCultureIgnoreCase) Then
    ' ...

How to iterate over a folder with a large number of files in PowerShell?

If you do

$files = Get-ChildItem $dirWithMillionsOfFiles
# Now, process with $files

you WILL face memory issues.

Use PowerShell piping to process the files:

Get-ChildItem $dirWithMillionsOfFiles | % {
    # process here
}

The second way consumes less memory because files are streamed through the pipeline one at a time, so memory usage should stay roughly flat no matter how many files there are.


