What Is an Algorithm to Diff the Two Strings in the Same Way That SO Does on the Version Page?

The algorithm you are looking for is Longest Common Subsequence (LCS); it does most of the work for you.

The outline is something along these lines.

  1. Split by word (input, output)
  2. Calculate LCS on input / output array.
  3. Walk through the array and join up areas intelligently.

So for example say you have:

"hello world this is a test"

compared with:

"mister hello world"

The result from the LCS is

  • "mister" +
  • "hello" =
  • "world" =
  • "this" -
  • "is" -
  • "a" -
  • "test" -

Now you sprinkle the special sauce when building up the output: join the words together while staying mindful of the previous action. The naive approach is simply to join adjacent words that share the same action.

  • "mister" +
  • "hello world" =
  • "this is a test" -

Finally you transform it to html:

<ins>mister</ins> hello world <del>this is a test</del>  

Of course the devil is in the detail:

  • You need to consider how you handle tags.
  • Do you compare Markdown or HTML?
  • Are there any edge cases where the UI stops making sense?
  • Do you need special handling for punctuation?
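The outline above can be sketched in Python roughly as follows. This is a minimal illustration, not the code SO actually runs: the function names (`lcs_table`, `word_diff`, `to_html`) are made up for this example, words are split on whitespace only, and none of the tag or punctuation concerns listed above are handled.

```python
from html import escape

def lcs_table(a, b):
    """t[i][j] holds the LCS length of a[i:] and b[j:]."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                t[i][j] = t[i + 1][j + 1] + 1
            else:
                t[i][j] = max(t[i + 1][j], t[i][j + 1])
    return t

def word_diff(old, new):
    """Return merged (action, text) pairs; action is '=', '-' or '+'."""
    a, b = old.split(), new.split()          # step 1: split by word
    t = lcs_table(a, b)                      # step 2: LCS on the word arrays
    ops, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if i < len(a) and j < len(b) and a[i] == b[j]:
            ops.append(("=", a[i]))
            i, j = i + 1, j + 1
        elif j < len(b) and (i == len(a) or t[i][j + 1] > t[i + 1][j]):
            ops.append(("+", b[j]))          # word only in the new text
            j += 1
        else:
            ops.append(("-", a[i]))          # word only in the old text
            i += 1
    merged = []                              # step 3: join runs of the same action
    for action, word in ops:
        if merged and merged[-1][0] == action:
            merged[-1] = (action, merged[-1][1] + " " + word)
        else:
            merged.append((action, word))
    return merged

def to_html(ops):
    """Wrap inserted runs in <ins> and deleted runs in <del>."""
    tags = {"+": "ins", "-": "del"}
    parts = []
    for action, text in ops:
        text = escape(text)
        if action in tags:
            text = f"<{tags[action]}>{text}</{tags[action]}>"
        parts.append(text)
    return " ".join(parts)

print(to_html(word_diff("hello world this is a test", "mister hello world")))
# <ins>mister</ins> hello world <del>this is a test</del>
```

The quadratic table is fine for short posts; for long texts a library diff is likely the better choice.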

What are some algorithms for comparing how similar two strings are?

What you're looking for are called string metric algorithms. There are a significant number of them, many with similar characteristics. Among the more popular:

  • Levenshtein Distance: The minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. The strings do not have to be the same length.
  • Hamming Distance: The number of positions at which two equal-length strings differ.
  • Smith–Waterman: An algorithm for local sequence alignment; it finds the most similar sub-sequences of two strings.
  • Sørensen–Dice Coefficient: A similarity coefficient computed from the adjacent character pairs (bigrams) that two strings share.

Have a look at these as well as others on the wiki page on the topic.
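To make three of these metrics concrete, here is a rough Python sketch. The function names are chosen for this example, and the Dice version uses sets of bigrams (so repeated pairs count once), which is a simplification of my own rather than the only way to define it.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute, or match for free
        prev = cur
    return prev[-1]

def hamming(a, b):
    """Number of differing positions; only defined for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def dice(a, b):
    """Sørensen–Dice coefficient over sets of adjacent character pairs."""
    x = {a[i:i + 2] for i in range(len(a) - 1)}
    y = {b[i:i + 2] for i in range(len(b) - 1)}
    if not x and not y:
        return 1.0  # assumption: two strings with no bigrams count as identical
    return 2 * len(x & y) / (len(x) + len(y))

print(levenshtein("kitten", "sitting"))  # 3
print(hamming("karolin", "kathrin"))     # 3
print(dice("night", "nacht"))            # 0.25
```

Note that the first two are distances (0 means identical) while Dice is a similarity (1.0 means identical), so they are not directly interchangeable.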

Text comparison algorithm

Typically this is accomplished by finding the Longest Common Subsequence (commonly called the LCS problem). This is how tools like diff work. Of course, diff is a line-oriented tool, and it sounds like your needs are somewhat different. However, I'm assuming that you've already constructed some way to compare words and sentences.

Create a wiki-like diff between two strings

This one has worked pretty well for me in my projects.

How to get a % difference of two NSStrings

Another off-the-wall suggestion:

The source, and hence the algorithm, for diff and similar programs is easily available. These compare input on a line-by-line basis and detect insertions, deletions and changes.

When comparing text strings for "closeness", the number of inserted, deleted or changed words seems as good a measure as any.

So:

  1. Break each string into "words" (white space separated should be sufficient).
  2. Compare the two lists using the diff algorithm, treating each "word" as a "line". Use a re-sync length of 1 (the number of "lines" that must match before the two inputs are treated as back in sync).
  3. Calculate the "closeness" as the number of insertions/deletions/changes compared to the total word count.

For the two example strings this would give 1:4 changes or 75% similar.

If you want greater granularity, split each pair of changed words into characters and repeat the algorithm, which gives the fraction by which the two words are similar (rather than counting the whole word as changed).

For the two example strings this would give 3 6/7 words out of 4, or 96% similar.
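The word-level version of this idea can be sketched in Python. It is an approximation under stated assumptions: it uses `difflib.SequenceMatcher` in place of the Unix diff algorithm, takes the longer string's word count as the "total word count", and clamps the result at zero. The function name and the sample strings are invented for illustration; they are not the strings from the question.

```python
import difflib

def word_closeness(a, b):
    """Fraction of words left unchanged, based on a word-level diff."""
    wa, wb = a.split(), b.split()
    total = max(len(wa), len(wb))  # assumption: longer string sets the word count
    if total == 0:
        return 1.0
    matcher = difflib.SequenceMatcher(None, wa, wb, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # an insertion, deletion or replacement of a run of words
            changed += max(i2 - i1, j2 - j1)
    return max(0.0, 1 - changed / total)

print(word_closeness("the quick brown fox", "the quick red fox"))  # 0.75
```

One changed word out of four gives 75%, the same shape of result as the 1:4 figure above. The character-level refinement would replace the whole-word penalty for each changed pair with a fractional one.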

Text difference algorithm

In Python, there is difflib, as others have also suggested.

difflib offers the SequenceMatcher class, which can be used to give you a similarity ratio. Example function:

import difflib
def text_compare(text1, text2, isjunk=None):
    # ratio() returns a float in [0, 1]; 1.0 means the sequences are identical
    return difflib.SequenceMatcher(isjunk, text1, text2).ratio()

Any existing C# code (OSS) that will calculate diff between two strings and output html?

There is a C# class available from here (under a BSD licence) that will diff two textual inputs. If you download the source code, there is also some code that will turn this output into HTML. An example of its output can be found here.

Comparing two strings or objects and getting the difference back

You want a diffing algorithm (I've tagged the question as such), which I highly recommend you not try to write yourself. I've tried, and failed: although diffing two strings is solvable in polynomial time (it is not NP-complete), the efficient algorithms are not easy to wrap your mind around. Instead, check out diff-match-patch, which has JavaScript and Java implementations for client-side (demo) or server-side processing. If you need to do HTML differencing, look at daisydiff instead, but be forewarned that HTML/XML diffing is truly a painful experience (see this page for some reasons why).

Probably the grand-daddy of diffing is GNU diff, which also has a Java implementation (find "GNU Diff for Java"). This algorithm is more optimized than diff-match-patch (dmp), albeit dmp seems to be improving all the time, so if you need to compare very large strings (e.g. megabytes) the GNU algorithm is probably a better bet.

Is there a function to compare two strings using a custom homoglyphs list

The key to your problem can be thought of like an IQ word association question.

  Sound         Glyph
---------  =  ---------
Homophone     Homoglyph

If you know a way to find similar-sounding words (homophones), the same approach can be applied to similar-looking glyphs (homoglyphs).

The way to find similar-sounding words is via Soundex (Sound Index).

So just do what Soundex does, but instead of mapping similar sounds to a common code, map similar glyphs to a common code.

Once you have converted each input word (a sequence of glyphs) into a Glyphdex (Glyph Index), you can compute the Levenshtein distance between the two Glyphdex values.

Make sense?
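As a rough Python sketch of the idea: the `HOMOGLYPHS` table below is a tiny hypothetical mapping made up for this example (a real custom list would be much larger), and `glyphdex` and `glyph_distance` are illustrative names, not functions from any library.

```python
# Hypothetical homoglyph list: each look-alike glyph maps to one canonical glyph.
HOMOGLYPHS = {"0": "o", "1": "l", "I": "l", "5": "s", "$": "s", "@": "a"}

def glyphdex(word):
    """Replace every glyph with its canonical look-alike (the 'Glyph Index')."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in word)

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def glyph_distance(a, b):
    """Levenshtein distance between the Glyphdex forms of two words."""
    return levenshtein(glyphdex(a), glyphdex(b))

print(glyphdex("he1l0"))                   # hello
print(glyph_distance("paypa1", "paypal"))  # 0
print(glyph_distance("he1p", "hello"))     # 2
```

A distance of 0 means the two words are indistinguishable under the chosen homoglyph list; small non-zero distances flag near matches.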


If you are into cellular biology, then codon translation into amino acids might make more sense: many amino acids are coded by more than one three-letter codon.


Note: Since the word Glyphdex was used before I wrote this, I cannot say I coined it. However, the uses I currently find via Google are not in the same context as described here. So, in the context of converting a sequence of glyphs into an index of similar sequences of glyphs, I will take credit.

How do you compare two version Strings in Java?

Tokenize the strings with the dot as delimiter, then compare the integer values of the corresponding components pairwise, beginning from the left.
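The steps are the same in any language. To stay consistent with the Python used elsewhere on this page, here is a minimal sketch in Python; in Java the equivalent building blocks would be `String.split("\\.")` and `Integer.parseInt`. It assumes purely numeric components and treats missing trailing components as zero, both of which are choices made for this example.

```python
def compare_versions(a, b):
    """Return -1, 0 or 1 as version a is lower than, equal to or higher than b."""
    pa = [int(part) for part in a.split(".")]
    pb = [int(part) for part in b.split(".")]
    # Assumption: "1.0" and "1" are equal, so pad the shorter list with zeros.
    width = max(len(pa), len(pb))
    pa += [0] * (width - len(pa))
    pb += [0] * (width - len(pb))
    # Lists compare element by element from the left, which is exactly the rule.
    return (pa > pb) - (pa < pb)

print(compare_versions("1.10", "1.9"))   # 1
print(compare_versions("1.0", "1"))      # 0
print(compare_versions("2.0.1", "2.1"))  # -1
```

Comparing integers rather than strings is what makes "1.10" sort after "1.9".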


