Text Difference Algorithm

Text comparison algorithm

Typically this is accomplished by finding the Longest Common Subsequence (commonly called the LCS problem). This is how tools like diff work. Of course, diff is a line-oriented tool, and it sounds like your needs are somewhat different. However, I'm assuming that you've already constructed some way to compare words and sentences.
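As a rough illustration of the LCS idea, here is a minimal dynamic-programming sketch. The function name `lcs` is my own and not from any library; it works on any pair of sequences, so you can pass lists of lines, words, or plain strings. Real diff tools use more refined algorithms, so treat this as a teaching example only.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b as a list."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to recover the subsequence.
    result, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


if __name__ == "__main__":
    old = "the quick brown fox".split()
    new = "the slow brown dog".split()
    print(lcs(old, new))
```

Everything outside the returned subsequence is what a diff would report as deleted (from the first input) or inserted (from the second).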

Text difference algorithm

In Python, there is difflib, as others have also suggested.

difflib offers the SequenceMatcher class, which can give you a similarity ratio between 0 (nothing in common) and 1 (identical). Example function:

import difflib

def text_compare(text1, text2, isjunk=None):
    return difflib.SequenceMatcher(isjunk, text1, text2).ratio()
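SequenceMatcher accepts any sequences of hashable items, not just strings, so the same idea can be applied per word. The helper name `word_similarity` below is my own; the ratio is computed by difflib as 2*M/T, where M is the number of matching elements and T is the total number of elements in both sequences.

```python
import difflib


def word_similarity(text1, text2):
    """Similarity ratio computed over whitespace-separated words."""
    return difflib.SequenceMatcher(None, text1.split(), text2.split()).ratio()


if __name__ == "__main__":
    # Character level: "bcd" matches, so the ratio is 2 * 3 / 8.
    print(difflib.SequenceMatcher(None, "abcd", "bcde").ratio())
    # Word level: three of four words match on each side.
    print(word_similarity("the quick brown fox", "the slow brown fox"))
```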

Text comparison algorithm or program?

You could, for example, look at https://github.com/wumpz/java-diff-utils and its examples at https://github.com/wumpz/java-diff-utils/wiki/Examples. Modifying it to emit your own tags instead of markup characters is easy, e.g.:

DiffRowGenerator generator = DiffRowGenerator.create()
        .showInlineDiffs(true)
        .mergeOriginalRevised(true)
        .inlineDiffByWord(true)
        .newTag(f -> f ? "<span style=\"background-color:#ffc6c6\">" : "</span>")
        .oldTag(f -> f ? "<span style=\"background-color:#c4ffc3\">" : "</span>")
        .columnWidth(10000000)
        .build();

List<DiffRow> rows = generator.generateDiffRows(
        Arrays.asList(lines.get(0)),
        Arrays.asList(lines.get(1)));

System.out.println(rows.get(0).getOldLine());

What are some algorithms for comparing how similar two strings are?

What you're looking for are called string metrics. There are a significant number of them, many with similar characteristics. Among the more popular:

  • Levenshtein Distance : The minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. Strings do not have to be the same length.
  • Hamming Distance : The number of positions at which two equal-length strings differ.
  • Smith–Waterman : A family of algorithms for computing variable sub-sequence similarities.
  • Sørensen–Dice Coefficient : A similarity algorithm that computes difference coefficients of adjacent character pairs.

Have a look at these as well as others on the wiki page on the topic.
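To make three of the metrics above concrete, here is a small pure-Python sketch. The function names are my own, the Dice variant shown uses sets of bigrams (other formulations count repeated pairs), and Smith–Waterman is left out to keep it short. For real workloads a dedicated library would likely be faster.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # delete cs
                               current[j - 1] + 1,       # insert ct
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(cs != ct for cs, ct in zip(s, t))


def dice_coefficient(s, t):
    """Sørensen–Dice similarity over the sets of adjacent character pairs."""
    pairs_s = {s[i:i + 2] for i in range(len(s) - 1)}
    pairs_t = {t[i:i + 2] for i in range(len(t) - 1)}
    if not pairs_s and not pairs_t:
        return 1.0  # assumption: two inputs with no pairs count as identical
    return 2 * len(pairs_s & pairs_t) / (len(pairs_s) + len(pairs_t))


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(hamming("karolin", "kathrin"))
    print(dice_coefficient("night", "nacht"))
```

Note that the two distances grow as strings diverge, while the Dice coefficient is a similarity that shrinks toward 0.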

Stackoverflow's text diff

From the images that you posted, and my own (albeit limited) experience, it seems that the website uses a modification of the longest common subsequence algorithm. This explains why it never shows rearrangement or shuffling of words.

The first modification is that instead of treating characters as atomic units, it treats words (and punctuation) as atomic units.

Secondly, the algorithm is relatively naive: it reports that you crossed out "work" when you actually just inserted a "to" there. It seems to just mark discontinuities of any kind (insertions, deletions, modifications) and cross out either one word or the whole discontinuous portion.

Thirdly, everything in the second list not a part of the first list is marked in green.

Seems relatively easy to implement. Check out some tutorial on dynamic programming.
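A minimal sketch of such a word-level LCS diff might look like the following. This is my guess at the general approach, not the site's actual code; `tokenize` and `word_diff` are names I made up, and a real renderer would turn "delete" into strikethrough and "insert" into green highlighting.

```python
import re


def tokenize(text):
    # Words and individual punctuation marks are the atomic units.
    return re.findall(r"\w+|[^\w\s]", text)


def word_diff(old, new):
    """Return a list of (operation, token) pairs: keep, delete or insert."""
    a, b = tokenize(old), tokenize(new)
    m, n = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    ops, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            ops.append(("keep", a[i]))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            ops.append(("delete", a[i]))
            i += 1
        else:
            ops.append(("insert", b[j]))
            j += 1
    # Whatever is left over on either side is a pure deletion or insertion.
    ops.extend(("delete", token) for token in a[i:])
    ops.extend(("insert", token) for token in b[j:])
    return ops


if __name__ == "__main__":
    for op in word_diff("I want work", "I want to work"):
        print(op)
```

Because an LCS preserves order, a moved word can only ever show up as a deletion in one place plus an insertion in another, which matches the observation that rearrangements are never shown.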

What algorithm does Copyscape use for text comparison?

I am not sure how Copyscape's plagiarism detection works, but here is how I would implement one if asked.

I would start by defining 'plagiarism': content-1 and content-2 are nearly the same. Let us say more than 80% is the same, i.e. content-1 is taken and 20% of it is changed to produce content-2.

Now let us try to solve: what is the cost (number of changes) to convert content-1 into content-2? This is a well-known problem in the dynamic programming (DP) world, the Levenshtein distance or edit distance problem. The standard problem deals with the distance between strings of characters, but you can easily modify it to work on words instead. Additionally, you may need to track each change by line number and word position in both contents.

Solving the above problem gives you the least number of changes needed to convert content-1 into content-2. Dividing by the total length of content-1, we can easily calculate the percentage of changes needed to get from content-1 to content-2. If it is below a fixed threshold (say 20%), declare plagiarism. Also, with the auxiliary information on line number and word position in both contents, you can show the changes that were made.
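Here is a rough sketch of that scheme, under my own assumptions: words are split on whitespace, the function names are invented for illustration, and the bookkeeping of line numbers and word positions is left out. It is certainly not how Copyscape itself works.

```python
def word_edit_distance(words1, words2):
    """Levenshtein distance where the unit of change is a whole word."""
    previous = list(range(len(words2) + 1))
    for i, w1 in enumerate(words1, start=1):
        current = [i]
        for j, w2 in enumerate(words2, start=1):
            cost = 0 if w1 == w2 else 1
            current.append(min(previous[j] + 1,          # drop w1
                               current[j - 1] + 1,       # add w2
                               previous[j - 1] + cost))  # replace or keep
        previous = current
    return previous[-1]


def changed_fraction(content1, content2):
    """Fraction of content-1 (in words) that must change to get content-2."""
    words1, words2 = content1.split(), content2.split()
    if not words1:
        return 0.0 if not words2 else 1.0
    return word_edit_distance(words1, words2) / len(words1)


def looks_plagiarised(content1, content2, threshold=0.20):
    # The 20% threshold is the example value from the text, not a tuned one.
    return changed_fraction(content1, content2) < threshold


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog today"
    suspect = "the quick brown fox leaps over the lazy dog today"
    print(changed_fraction(original, suspect), looks_plagiarised(original, suspect))
```

The full distance table is quadratic in the number of words, so for long documents you would probably want to compare chunk by chunk.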

Seeking algo for text diff that detects and can group similar lines

With an algo such as Levenshtein, I could find that of all right lines in the set of 3 to 5, line 5 matches left line 3 best, thus I could deduce that lines 3 and 4 on the right were added, and perform the inter-line comparison on left line 3 and right line 5.

After you have determined it, use the same algorithm to determine which lines in these two chunks match each other. But you need to make a slight modification. When you used the algorithm to match equal lines, the lines could either match or not match, so that added either 0 or 1 to the cell of the table you used.

When comparing strings in one chunk, some of them are "more equal" than others (with acknowledgment to Orwell). So they can add a real number from 0 to 1 to the cell when considering which sequence matches best so far.

To compute this metric (from 0 to 1), you can apply to each pair of strings you encounter... right, the same algorithm again (actually, you already did this during the first pass of the Levenshtein algorithm). This computes the length of the LCS, whose ratio to the average length of the two strings is the metric value.
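The weighted variant can be sketched as below. This is only my reading of the idea: `similarity` is the LCS length divided by the average length of the two lines, and `align_lines` runs the same kind of table over lines, adding that score (0 to 1) instead of a hard 0 or 1. All names are made up for the example.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of two strings."""
    previous = [0] * (len(t) + 1)
    for cs in s:
        current = [0]
        for j, ct in enumerate(t, start=1):
            if cs == ct:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def similarity(s, t):
    """LCS length relative to the average length of s and t (0 to 1)."""
    if not s and not t:
        return 1.0
    return 2 * lcs_length(s, t) / (len(s) + len(t))


def align_lines(left, right):
    """Pair left and right line indices so the summed similarity is maximal."""
    m, n = len(left), len(right)
    # score[i][j] = best total similarity for left[i:] against right[j:]
    score = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            score[i][j] = max(score[i + 1][j],
                              score[i][j + 1],
                              score[i + 1][j + 1] + similarity(left[i], right[j]))

    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if score[i][j] == score[i + 1][j]:
            i += 1  # left line i is unmatched (deleted)
        elif score[i][j] == score[i][j + 1]:
            j += 1  # right line j is unmatched (added)
        else:
            pairs.append((i, j))
            i += 1
            j += 1
    return pairs


if __name__ == "__main__":
    left = ["return total + tax"]
    right = ["log(total)", "check(tax)", "return total + tax_amount"]
    print(align_lines(left, right))
```

Each returned pair is a candidate for the inter-line (word or character) comparison; unpaired right lines are treated as added and unpaired left lines as removed. In practice you would probably also want a minimum similarity below which two lines are not paired at all.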

Or, you can borrow the algorithm from one of diff tools. For instance, vimdiff can highlight the matches you require.


