File Search Algorithms Using Indexing in Linux

Have you seen the locate command that comes with findutils? Like beagled, it's free software, so you can study the code.

The findutils package is always looking for contributors.

Information on the database format is at http://www.gnu.org/software/findutils/manual/html_node/find_html/Database-Formats.html

What is the best way to search millions of file names with wildcard (glob) support?

As far as I know there is no way to do better than O(n) for generalized glob searching.

However for the special cases of prefix and suffix search you can make yourself sorted indexes to do a binary search on, resulting in O(log n) for prefix and suffix search. The prefix index would be sorted based on the first character, then the second, and so on. The suffix index would be sorted based on the last character, then the second last, and so on (the reverse strings).
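
A minimal sketch of that index lookup in C++, assuming the names have already been collected into sorted in-memory vectors (a real tool would keep the indexes on disk; the function names are my own):

#include <algorithm>
#include <string>
#include <vector>

// Look up every name in a sorted index that starts with `prefix`.
// Build the sorted index once; each query is then O(log n) plus the
// number of hits.
std::vector<std::string> prefix_matches(const std::vector<std::string>& index,
                                        const std::string& prefix)
{
    std::vector<std::string> out;
    // lower_bound finds the first name >= prefix; matches are contiguous
    // from there until a name no longer starts with the prefix.
    auto it = std::lower_bound(index.begin(), index.end(), prefix);
    for (; it != index.end() && it->compare(0, prefix.size(), prefix) == 0; ++it)
        out.push_back(*it);
    return out;
}

// For suffix queries, keep a second index of the reversed names; a
// suffix search is then just a prefix search on the reversed string.
std::vector<std::string> suffix_matches(const std::vector<std::string>& reversedIndex,
                                        const std::string& suffix)
{
    std::string rev(suffix.rbegin(), suffix.rend());
    std::vector<std::string> hits = prefix_matches(reversedIndex, rev);
    for (std::string& name : hits)
        std::reverse(name.begin(), name.end());   // recover the real names
    return hits;
}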

I would do as you suggest in your post and run the search in two phases: search the prefix or suffix indexes first, then do a brute-force search through the reduced list from the first phase using a regex built from the glob.

Since string length comparisons are faster than regexes, I would also pre-filter on the minimum matching string length (or the exact matching length, in the ???.doc example).
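
A sketch of that second phase in C++, assuming the glob only uses * and ? (the helper names here are made up for illustration):

#include <regex>
#include <string>
#include <vector>

// Turn a glob such as "???.doc" or "*.mp3" into an equivalent regex.
// Only * and ? are handled; regex metacharacters are escaped literally.
std::regex glob_to_regex(const std::string& glob)
{
    std::string re;
    for (char c : glob) {
        if (c == '*')      re += ".*";
        else if (c == '?') re += '.';
        else if (std::string("\\^$.|+()[]{}").find(c) != std::string::npos) { re += '\\'; re += c; }
        else re += c;
    }
    return std::regex(re);
}

// Phase two: run the regex only over the candidates that survived the
// index lookup, skipping anything shorter than the glob's minimum length.
std::vector<std::string> filter_candidates(const std::vector<std::string>& candidates,
                                           const std::string& glob)
{
    // Every character except '*' must match exactly one character.
    std::size_t minLen = 0;
    for (char c : glob)
        if (c != '*') ++minLen;

    std::regex re = glob_to_regex(glob);
    std::vector<std::string> out;
    for (const std::string& name : candidates)
        if (name.size() >= minLen && std::regex_match(name, re))
            out.push_back(name);
    return out;
}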

From the sound of your original post the indexes would need to refer to the full path of each entry as well, so that you can display it after the final results are found.

How to search a file system efficiently (algorithm-wise)?

If you really have no guidance whatsoever about where the file might be, then I don't think there's a lot you can do. You should try to minimize seeks and seek time with some tricks, but filesystems get fragmented and there's no way for you to know about that, so it's hard to do much there. Searching the files in a directory before searching a subdirectory should be faster on a lot of filesystems, especially if you're looking for a small file that may have been inlined. Not exhausting kernel resources with a full BFS is also a good thing.

Even if you only have a hint of where the file may be, that can help a lot. For example, if it's a file that the user put somewhere and then forgot the location of, start with the home directory, the temp directory, and the root of the drive, and do a DFS up to a reasonable recursion limit (e.g. 6-8 would find any manually placed or automatically downloaded file on my Windows or OS X machines), on the theory that users typically don't end up with deep trees accidentally, but automatically generated hierarchies can get deep. If that search fails, go back and search the deep directories you skipped earlier. If the file is that lost, the search is going to be slow anyway, so fall back on DFS to be safe and not cause too many problems as the user continues to use the machine.
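
A sketch of that strategy with C++17 std::filesystem; the starting points, depth limit, and target file name are placeholders:

#include <cstdlib>
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Depth-limited DFS: scan the files in a directory first, then descend
// into its subdirectories, giving up below maxDepth.
bool find_file(const fs::path& root, const std::string& target, int maxDepth)
{
    if (maxDepth < 0)
        return false;

    std::error_code ec;                          // don't throw on unreadable directories
    std::vector<fs::path> subdirs;
    for (const auto& entry : fs::directory_iterator(root, ec)) {
        if (entry.is_directory(ec))
            subdirs.push_back(entry.path());     // visit files first, directories after
        else if (entry.path().filename() == target) {
            std::cout << entry.path() << '\n';
            return true;
        }
    }
    for (const fs::path& dir : subdirs)
        if (find_file(dir, target, maxDepth - 1))
            return true;
    return false;
}

int main()
{
    // Likely starting points, checked before a full scan of the drive.
    const char* home = std::getenv("HOME");
    std::vector<fs::path> roots = { fs::path(home ? home : "/"), "/tmp", "/" };
    for (const fs::path& root : roots)
        if (find_file(root, "lost-file.txt", 6))
            return 0;
    return 1;
}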

And most importantly, if the system has any kind of search index, check that first, even if it means writing a lot more code to support it.

Writing a C++ Program to Search an Index File From the Linux Command Line

Use map or multimap and find from the STL. Make a data type out of the remaining data, with the index as the key. The program will have to read the entire file first and then look up the searched index.
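
A minimal sketch of that approach, assuming each line of the index file starts with the key followed by the rest of the record (the file format here is only an assumption):

#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main(int argc, char* argv[])
{
    if (argc != 3) {
        std::cerr << "usage: " << argv[0] << " <index-file> <key>\n";
        return 1;
    }

    // Read the whole file: first column is the index, the rest is the data.
    std::multimap<std::string, std::string> index;
    std::ifstream in(argv[1]);
    std::string key, rest;
    while (in >> key && std::getline(in, rest))
        index.emplace(key, rest);

    // equal_range gives every record stored under the searched index.
    auto range = index.equal_range(argv[2]);
    if (range.first == range.second) {
        std::cout << "not found\n";
        return 1;
    }
    for (auto it = range.first; it != range.second; ++it)
        std::cout << it->first << it->second << '\n';
    return 0;
}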

Search for a string in a normal file

You could use mmap to map the entire file into memory, and then search it with memmem (strnstr is a BSD function that glibc does not provide):

#define _GNU_SOURCE          // for memmem() in glibc
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    const char *fileName = "~/test.txt"; // note: fopen() does not expand "~"; use a full path
    long fileLength;

    // open file and get its length
    FILE *fptr = fopen(fileName, "r");

    if (fptr == NULL)
    {
        perror("could not open file");
        return 1;
    }

    fseek(fptr, 0, SEEK_END);
    fileLength = ftell(fptr);
    // return to the start of the file
    fseek(fptr, 0, SEEK_SET);

    // map the file
    char *fileData = mmap(NULL, fileLength, PROT_READ, MAP_FILE | MAP_SHARED, fileno(fptr), 0);

    if (fileData == MAP_FAILED)
    {
        perror("could not map file");
        fclose(fptr);
        return 1;
    }

    // scan the file; memmem takes explicit lengths, so the mapping
    // doesn't need to be null-terminated
    char stringToSearchFor[] = "this is my string!";
    if (memmem(fileData, fileLength, stringToSearchFor, strlen(stringToSearchFor)) != NULL)
    {
        printf("file contains the string!\n");
    }
    else
    {
        printf("file doesn't contain the string\n");
    }

    // clean up our code
    munmap(fileData, fileLength);
    fclose(fptr);
    return 0;
}

I want a clever algorithm for indexing a file directory...pointers?

The hard part of this is the scanning of the directory, just because it can be expensive.

But that's a cruel reality since you can't use inotify et al.

In your database, simply create a node type record:

create table node (
    nodeKey integer not null primary key,
    parentNode integer references node(nodeKey), -- allow null for the root, or have root point to itself, whatever
    fullPathName varchar(2048),
    nodeName varchar(2048),
    nodeType varchar(1) -- d = directory, f = file, or whatever else you want
)

That's your node structure.

You can use the full path column to quickly find anything by the absolute path.

When a file moves, simply recalculate the path.

Finally, scan your music files. On Unix, you can do something like:

find . -type f | sort > sortedListOfFiles

Next, simply suck all of the path names out of the database.

select fullPathName from node where nodeType != 'd' order by fullPathName

Now you have two sorted lists of files.

Run them through diff (or comm), and you'll have a list of deleted and new files. You won't have a list of "moved" files. If you want to apply a heuristic that compares new and old entries with the same endings (e.g. ..../album/song) to try to detect "moves" versus new and deleted files, fine, no big deal. It's worth a shot.

But diff will give you your differential in a heartbeat.

If you have zillions of files, then, sorry, this is going to take some time -- but you already knew that when you lost the inotify capability. If you had it, this would just be incremental maintenance.

When a file moves, it's trivial to find its new absolute path, because you can ask its parent for its path and simply append the file's name to it. After that, you're not crawling a tree or anything, unless you want to. Works both ways.
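
A small sketch of that parent walk in C++, using an in-memory map as a stand-in for the node table:

#include <string>
#include <unordered_map>

// In-memory mirror of the node table: nodeKey -> (parent key, name).
struct Node {
    int parent;          // parent nodeKey; the root points to itself
    std::string name;
};

// Rebuild fullPathName by asking each parent for its name in turn.
std::string full_path(const std::unordered_map<int, Node>& nodes, int key)
{
    const Node& n = nodes.at(key);
    if (n.parent == key)                 // reached the root
        return n.name;
    return full_path(nodes, n.parent) + "/" + n.name;
}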

Addenda:

If you want to track actual name changes, you can get a little more information.

You can do this:

find . -type f -print0 | xargs -0 ls -i | sort -n > sortedListOfFileWithInode

The -print0 and -0 are used to handle file names with spaces in them. Quotes in the file names will still wreck this, however. You might be better off running the raw list through Python and fstat to get the inode. There are different things you can do here.

What this does is that, rather than just having names, you also get the inode of each file. The inode is the "real" file; a directory links names to inodes. This is how you can have multiple names (hard links) in a Unix file system for a single file: all of the names point to the same inode.

When a file is renamed, the inode will remain the same. In Unix, there's a single command used for renaming and moving files: mv. When mv renames or moves the file, the inode stays the same AS LONG AS THE FILE IS ON THE SAME FILE SYSTEM.

So, using the inode as well as the file name will let you capture some more interesting information, like file moves.
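
The answer above suggests Python and fstat; the same st_ino field can also be read directly with stat(2) -- a minimal sketch:

#include <cstdio>
#include <iostream>
#include <sys/stat.h>

// Print "<inode> <path>" for each argument, the same pairing the
// find | xargs ls -i pipeline above produces.
int main(int argc, char* argv[])
{
    for (int i = 1; i < argc; ++i) {
        struct stat sb;
        if (stat(argv[i], &sb) != 0) {   // stat(2) fills in st_ino among other fields
            std::perror(argv[i]);
            continue;
        }
        std::cout << sb.st_ino << ' ' << argv[i] << '\n';
    }
    return 0;
}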

It won't help if they delete the file and add a new one. But you WILL (likely) be able to tell that it happened, since it is unlikely that the old inode will be reused for the new file.

So if you have a list of files (sorted by file name):

1234 song1.mp3
1235 song2.mp3
1236 song3.mp3

and someone removes and adds back song 2, you'll have something like

1234 song1.mp3
1237 song2.mp3
1236 song3.mp3

But if you do this:

mv song1.mp3 song4.mp3

You'll get:

1237 song2.mp3
1236 song3.mp3
1234 song4.mp3

The other caveat is that if you lose the drive and restore it from backup, all of the inodes will likely change, effectively forcing a rebuild of your index.

If you're really adventurous, you can try playing with extended file system attributes and assign other interesting metadata to files. I haven't done much with that, but it has possibilities as well, though there are likely unseen dangers.

Fast search for text in files in a directory in unix?

This might be overkill for your purposes, but Beagle allows you to perform very fast searches of local files. It's usually marketed as a desktop application, but in fact it is just a daemon that can respond to requests from the command-line using beagle-query.


