Similarity String Comparison in Java

Yes, there are many well-documented algorithms, such as:

  • Cosine similarity
  • Jaccard similarity
  • Dice's coefficient
  • Matching similarity
  • Overlap similarity
  • and others
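As a rough illustration of one metric from this list, here is a minimal sketch of Jaccard similarity. It assumes the strings are compared as sets of character bigrams (word tokens or other n-grams are equally common choices), and the class and method names are made up for this example rather than taken from any of the libraries mentioned here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, written only for illustration.
final class JaccardSimilarity {

    // Collects the distinct two-character substrings (bigrams) of s.
    static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    // Jaccard similarity: |A intersect B| / |A union B| over the bigram sets.
    // Assumption: two strings without any bigrams count as identical (1.0).
    static double similarity(String a, String b) {
        Set<String> setA = bigrams(a);
        Set<String> setB = bigrams(b);
        if (setA.isEmpty() && setB.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<String> union = new HashSet<>(setA);
        union.addAll(setB);
        return (double) intersection.size() / union.size();
    }
}
```

With this bigram choice, "night" and "nacht" share only the bigram "ht" out of seven distinct bigrams, so the result is 1/7.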

A good summary ("Sam's String Metrics") can be found here (the original link is dead, so it points to the Internet Archive copy).

Also check these projects:

  • Simmetrics
  • jtmt

How to compare almost similar Strings in Java? (String distance measure)

The Levenshtein distance is a measure of how similar two strings are or, more precisely, how many single-character edits have to be made to turn one into the other.

The algorithm is available in pseudo-code on Wikipedia. Converting that to Java shouldn't be much of a problem, but it is not built into the base class library.

Wikipedia has some more algorithms that measure similarity of strings.
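For reference, a Java conversion could look roughly like the sketch below. It uses the common two-row dynamic-programming formulation and compares UTF-16 `char` values (not code points); the class name `Levenshtein` is an assumption made for this example.

```java
// Hypothetical helper class, written only for illustration.
final class Levenshtein {

    // Returns the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous row
        int[] curr = new int[b.length() + 1]; // distances for the current row

        // Turning the empty prefix of a into b[0..j) takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // The current row becomes the previous row for the next iteration.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, `Levenshtein.distance("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion). If a similarity score is preferred, one option is to divide the distance by the length of the longer string and subtract the result from 1.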

How to check for String similarity

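Let's assume your Strings have the same length. You then need a function that iterates through both of them, compares the characters at each position, and returns the fraction of positions that match:

// Assumes a and b have the same length; a shorter b would make
// b.charAt(i) throw StringIndexOutOfBoundsException.
double similarity(String a, String b) {
    // Two empty strings are considered identical.
    if (a.length() == 0) return 1;
    int numberOfSimilarities = 0;
    for (int i = 0; i < a.length(); ++i) {
        if (a.charAt(i) == b.charAt(i)) {
            ++numberOfSimilarities;
        }
    }
    // Fraction of matching positions, between 0.0 and 1.0.
    return (double) numberOfSimilarities / a.length();
}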

How to compare strings by similarity without ignoring typos?

I was going to use pure regex at first, and there is probably a way, but this code will produce the results you are looking for, using first and last, or first and middle, and ignoring de and da.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

private void checkName(String target, String source) {
    // The first token is the first name; an optional "de"/"da" particle is skipped.
    Pattern pattern = Pattern.compile("^(?<firstName>[^\\s]+)\\s((de|da)(\\s|$))?(?<otherName>.*)$");
    Matcher targetMatcher = pattern.matcher(target.trim().toLowerCase());
    Matcher sourceMatcher = pattern.matcher(source.trim().toLowerCase());
    if (!targetMatcher.matches() || !sourceMatcher.matches()) {
        System.out.println("Nok");
        return; // group() would throw IllegalStateException without a match
    }

    boolean ok = true;
    if (!sourceMatcher.group("firstName").equals(targetMatcher.group("firstName"))) {
        ok = false;
    } else {
        String[] otherSourceName = sourceMatcher.group("otherName").split("\\s");
        String[] otherTargetName = targetMatcher.group("otherName").split("\\s");

        // Every remaining source name must appear in the target, in the same order.
        int targetIndex = 0;
        for (String s : otherSourceName) {
            boolean hit = false;
            for (; targetIndex < otherTargetName.length; targetIndex++) {
                if (s.equals(otherTargetName[targetIndex])) {
                    hit = true;
                    break;
                }
            }
            if (!hit) {
                ok = false;
                break;
            }
        }
    }
    System.out.println(ok ? "ok" : "Nok");
}

For your examples, the output is:

ok
ok
Nok
Nok
Nok
ok

What are some algorithms for comparing how similar two strings are?

What you're looking for are called String Metric algorithms. There are a significant number of them, many with similar characteristics. Among the more popular:

  • Levenshtein Distance: The minimum number of single-character edits required to change one word into the other. Strings do not have to be the same length.
  • Hamming Distance: The number of positions at which two equal-length strings differ.
  • Smith–Waterman: A family of algorithms for computing variable sub-sequence similarities.
  • Sørensen–Dice Coefficient: A similarity algorithm that measures the overlap between the adjacent character pairs (bigrams) of the two strings.

Have a look at these as well as others on the wiki page on the topic.
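To make two of these concrete, here is a minimal sketch of the Hamming distance and the Sørensen–Dice coefficient. The Dice variant shown works on sets of distinct character bigrams, which is one common formulation (others count repeated bigrams). The class and method names are assumptions made for this example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, written only for illustration.
final class StringMetricsSketch {

    // Hamming distance: number of positions at which the characters differ.
    // Only defined for strings of equal length.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Strings must have equal length");
        }
        int differences = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                differences++;
            }
        }
        return differences;
    }

    // Collects the distinct adjacent character pairs (bigrams) of s.
    static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    // Sørensen–Dice coefficient: 2 * |X intersect Y| / (|X| + |Y|).
    // Assumption: two strings without any bigrams count as identical (1.0).
    static double dice(String a, String b) {
        Set<String> x = bigrams(a);
        Set<String> y = bigrams(b);
        if (x.isEmpty() && y.isEmpty()) {
            return 1.0;
        }
        Set<String> shared = new HashSet<>(x);
        shared.retainAll(y);
        return 2.0 * shared.size() / (x.size() + y.size());
    }
}
```

For example, `hamming("karolin", "kathrin")` is 3, and `dice("night", "nacht")` is 0.25 because the two strings share one bigram ("ht") out of four each.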

How do I compare strings in Java?

== tests for reference equality (whether they are the same object).

.equals() tests for value equality (whether they are logically "equal").

Objects.equals() checks for null before calling .equals() so you don't have to (available as of JDK7, also available in Guava).

Consequently, if you want to test whether two strings have the same value you will probably want to use Objects.equals().

// These two have the same value
new String("test").equals("test") // --> true

// ... but they are not the same object
new String("test") == "test" // --> false

// ... neither are these
new String("test") == new String("test") // --> false

// ... but these are because literals are interned by
// the compiler and thus refer to the same object
"test" == "test" // --> true

// ... string literals are concatenated by the compiler
// and the results are interned.
"test" == "te" + "st" // --> true

// ... but you should really just call Objects.equals()
Objects.equals("test", new String("test")) // --> true
Objects.equals(null, "test") // --> false
Objects.equals(null, null) // --> true

You almost always want to use Objects.equals(). In the rare situation where you know you're dealing with interned strings, you can use ==.

From JLS 3.10.5. String Literals:

Moreover, a string literal always refers to the same instance of class String. This is because string literals - or, more generally, strings that are the values of constant expressions (§15.28) - are "interned" so as to share unique instances, using the method String.intern.

Similar examples can also be found in JLS 3.10.5-1.

Other Methods To Consider

String.equalsIgnoreCase() tests value equality while ignoring case. Beware, however, that this method can have unexpected results in various locale-related cases, see this question.

String.contentEquals() compares the content of the String with the content of any CharSequence (available since Java 1.5). It saves you from having to turn your StringBuffer, etc. into a String before doing the equality comparison, but leaves the null checking to you.
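A short sketch of how these two methods might be wrapped with explicit null handling follows. The wrapper class and method names are hypothetical; only `String.equalsIgnoreCase(String)` and `String.contentEquals(CharSequence)` are actual JDK methods.

```java
// Hypothetical wrappers, written only for illustration.
final class OtherComparisons {

    // Case-insensitive value equality. equalsIgnoreCase(null) already
    // returns false, so only the receiver needs a null check.
    static boolean sameIgnoringCase(String a, String b) {
        return a != null && a.equalsIgnoreCase(b);
    }

    // Compares a String with any CharSequence (StringBuilder, StringBuffer, ...)
    // without converting it to a String first. contentEquals(null) would throw
    // NullPointerException, so both arguments are checked here.
    static boolean sameContent(String s, CharSequence cs) {
        return s != null && cs != null && s.contentEquals(cs);
    }
}
```

With these wrappers, `sameIgnoringCase("Test", "tEST")` and `sameContent("test", new StringBuilder("test"))` both return true, while a null argument yields false instead of an exception.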


