Fast Algorithm for Searching for Substrings in a String

Fast algorithm for searching for substrings in a string

Read up on the Aho-Corasick algorithm and the Rabin-Karp algorithm.

If the input is not too large, you don't want to repeat the search many times and you do not have many patterns, it might be a good idea to use a single pattern algorithm several times. The Wikipedia article on search algorithms gives many algorithms with running and preprocessing times.

Implementations:

  • https://hkn.eecs.berkeley.edu/~dyoo/java/index.html
  • http://algs4.cs.princeton.edu/53substring/RabinKarp.java.html
  • https://github.com/hankcs/AhoCorasickDoubleArrayTrie
  • https://github.com/RokLenarcic/AhoCorasick
  • https://github.com/robert-bor/aho-corasick

Presentations:

  • http://www.slideshare.net/taka111/ahocorasick-string-matching-algorithm-15078438

Can you find all substrings of a string in faster than O(N^2) time if your constrained about what sub strings your looking for?

Yes, that is what string searching algorithms are for.

Sample Image

What is the fastest method to find all substrings that are contained in a string?

Your problem is the origin of Aho-Corasick Algorithm, a string-searching algorithm invented by Alfred V. Aho and Margaret J. Corasick, that matches simultaneously the occurrences of all patterns within the input text.

Its running time is O(n + m + M) where n is the sum of lengths of the patterns, m is the length of input text and M is the amount of matches.

If you find suitable to use a database as input data is huge take into consideration this linear time algorithm.

what is the fastest substring search method in Java

String[] names = new String[]{"jack", "jackson", "jason", "dijafu"};
long start = 0;
long stop = 0;

//Contains
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
names[i].contains("ja");
}
stop = System.nanoTime();
System.out.println("Contains: " + (stop-start));

//IndexOf
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
names[i].indexOf("ja");
}
stop = System.nanoTime();
System.out.println("IndexOf: " + (stop-start));

//Matches
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
names[i].matches("ja");
}
stop = System.nanoTime();
System.out.println("Matches: " + (stop-start));

Output:

Contains: 16677
IndexOf: 4491
Matches: 864018

Fast String Searching Algorithm for large strings

Your algorithm is doing lot of repeated searches purely a brute force. Some of the problems features can be considered to optimize it.
How do you define 'plagiarism'? content-1 and content-2 are nearly similar. Let us say >80% are same. i.e content-1 is taken 20% is changed to produce content-2.

Now, Let us try to solve: what will be cost (no.of changes) to convert content-1 to content-2?
This is a well know problem in DP(dynamic programming world) as EDIT Distance problem. The standard problem talks about strings distance, but you can easily modify it for words instead of chars.

Now, the above problem will give you least no.of changes for conversion of content-1 to content-2.
With the total length of content-1, we can easily calculate the % of changes to go to content-2 from content-1. If it below a fixed threshold (say 20%) then declare the plagiarism.

Fastest string searching algorithm for fixed text and a lot of substrings

Assuming your string are 8-bit aligned, from 100MB buffer you'll get 100 millions different strings, which can be put into the hash table approximately 800MB in size with constant (O(1)) access time.

This will allow you to make the search as fast as possible, because once you have your 8 byte string, you immediately know where this string was seen in your buffer.

Searching a substring in C

Advanced search algorithms like Boyer-Moore and Aho-Corasick work by precomputing lookup tables from the string(s) to be searched for, which incurs a large start-up time. It's very unlikely that searching something as small as a pathname would be able to make up for that high overhead. You really have to be searching something like multi-page documents before those algorithms show their value.



Related Topics



Leave a reply



Submit