Comparing Strings Lexicographically

String Comparison in Java

Building on the answers from @Bozho and @aioobe: lexicographic comparison is similar to the ordering one finds in a dictionary.

The Java String class provides the compareTo() method to compare Strings lexicographically. It is used like this: "apple".compareTo("banana").

This method returns an int, which can be interpreted as follows:

  • if it returns < 0, the String calling the method is lexicographically first (comes first in a dictionary)
  • if it returns 0, the two strings are lexicographically equivalent
  • if it returns > 0, the parameter passed to the compareTo method is lexicographically first.

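More specifically, the method returns the difference between the char values (UTF-16 code units, which match ASCII for ASCII characters) at the first position where the strings differ. If one string is a prefix of the other, it returns the difference of their lengths instead.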

Thus "computer".compareTo("comparison") returns (int) 'u' - (int) 'a', which is 20. Since this is a positive result, the parameter ("comparison") is lexicographically first.

There is also a variant, compareToIgnoreCase(), which ignores case differences. For example, "a".compareToIgnoreCase("A") returns 0.
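The cases above can be checked with a short sketch. The class name CompareToDemo and the sign helper are only illustrative, not part of any answer.

```java
class CompareToDemo {
    // Collapses a compareTo result to -1, 0 or 1, since only the sign is
    // part of the contract.
    static int sign(int comparison) {
        return Integer.signum(comparison);
    }

    public static void main(String[] args) {
        System.out.println("apple".compareTo("banana"));        // negative: "apple" comes first
        System.out.println("computer".compareTo("comparison")); // 'u' - 'a' = 20
        System.out.println("a".compareTo("A"));                 // non-zero: case matters
        System.out.println("a".compareToIgnoreCase("A"));       // 0: case is ignored
    }
}
```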

Comparing n strings lexicographically

There is no need to implement your own lexicographic comparison algorithm. You can simply compare two strings with compareTo to find out which of them is lexicographically first.

According to documentation,

compareTo(String anotherString) compares two strings lexicographically.

if (testArray[0].compareTo(testArray[1]) >= 0)
{
    // testArray[0] is lexicographically "bigger" or equal
} else {
    // testArray[1] is lexicographically "bigger"
}

Now, you only need to apply any sorting algorithm.

For example, Bubble sort:

boolean swapped = true;
int j = 0;
String tmp;
while (swapped) {
    swapped = false;
    j++;
    for (int i = 0; i < testArray.length - j; i++) {
        // Strings cannot be compared with >, so use compareTo
        if (testArray[i].compareTo(testArray[i + 1]) > 0) {
            tmp = testArray[i];
            testArray[i] = testArray[i + 1];
            testArray[i + 1] = tmp;
            swapped = true;
        }
    }
}

You may want to use an O(N * log(N)) sorting algorithm such as QuickSort or MergeSort, but that is another question. You can find a lot of information on the topic on the Internet.
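If library calls are allowed, one way to get an O(N * log(N)) sort without writing it yourself is Arrays.sort from the JDK, which orders a String[] by String.compareTo. A minimal sketch follows; the class and method names are only illustrative.

```java
import java.util.Arrays;

class SortDemo {
    // Returns a sorted copy so the caller's array is left untouched.
    static String[] sortedCopy(String[] input) {
        String[] copy = Arrays.copyOf(input, input.length);
        Arrays.sort(copy); // natural ordering, i.e. String.compareTo
        return copy;
    }

    public static void main(String[] args) {
        String[] testArray = { "pear", "apple", "Banana" };
        // Upper-case letters have lower char values, so "Banana" sorts first.
        System.out.println(Arrays.toString(sortedCopy(testArray)));
    }
}
```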

Update: Since you are not allowed to even use compareTo, you can implement a custom function, which will compare strings char by char. If one string is a full prefix of another one, then the shortest one should be first:

int myCompareTo(String a, String b)
{
int aLength = a.length(), bLength = b.length();
int minLength = Math.min(aLength, bLength);

for (int i = 0; i < minLength; i++)
{
if (a.charAt(i) > b.charAt(i)) return 1;
if (a.charAt(i) < b.charAt(i)) return -1;
}

if (aLength > bLength) return 1;
if (aLength < bLength) return -1;
return 0;
}
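A quick way to gain confidence in a hand-written comparison like this is to check that its sign agrees with String.compareTo on a few pairs. The sketch below repeats the same character-by-character logic so that it is self-contained; the class name and the agreesWithBuiltIn helper are hypothetical.

```java
class MyCompareToCheck {
    // Same idea as above: the first differing character decides,
    // otherwise the shorter string comes first.
    static int myCompareTo(String a, String b) {
        int minLength = Math.min(a.length(), b.length());
        for (int i = 0; i < minLength; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return a.charAt(i) > b.charAt(i) ? 1 : -1;
            }
        }
        return Integer.signum(a.length() - b.length());
    }

    // True when the custom function and String.compareTo agree on the order.
    static boolean agreesWithBuiltIn(String a, String b) {
        return myCompareTo(a, b) == Integer.signum(a.compareTo(b));
    }

    public static void main(String[] args) {
        System.out.println(agreesWithBuiltIn("car", "cart"));
        System.out.println(agreesWithBuiltIn("Dog", "cat"));
        System.out.println(agreesWithBuiltIn("same", "same"));
    }
}
```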

Comparing Strings lexicographically new approach fails for one test case

Your logic is not correct. Comparing the sums of the characters is wrong, since "bab", "abb" and "bba" all have the same sum, and that tells you nothing about which of them comes first lexicographically.

You should compare each pair of characters separately. The first time you encounter a pair of characters not equal to each other, the one with the lower value belongs to the String that should come first.

for(int i=0; i<asciLength; i++) {
if (arr1[i] > arr2[i]) {
System.out.println("Not In Lexicographic Order");
return;
} else if (arr1[i] < arr2[i]) {
System.out.println("In Lexicographic Order");
return;
}
}
// at this point we know that the Strings are either equal or one
// is fully contained in the other. The shorter String must come first
if (arr1.length <= arr2.length) {
System.out.println("In Lexicographic Order");
} else {
System.out.println("Not In Lexicographic Order");
}
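To see concretely why character sums cannot decide the order, here is a small sketch. The charSum helper is a hypothetical stand-in for the summing approach described in the question.

```java
class SumVersusOrder {
    // Adds up the char values of a string, as the sum-based approach would.
    static int charSum(String s) {
        int sum = 0;
        for (char c : s.toCharArray()) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        // All three sums are identical...
        System.out.println(charSum("abb") + " " + charSum("bab") + " " + charSum("bba"));
        // ...yet the strings have a definite order: abb < bab < bba.
        System.out.println("abb".compareTo("bab") < 0);
        System.out.println("bab".compareTo("bba") < 0);
    }
}
```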

Compare two possibly null Java strings lexicographically

Guava provides the very useful Ordering class. It has a fluent interface, and it implements Comparator, so you can use it anywhere you'd use a Comparator.

Comparator<String> nullSafeComparator = Ordering.<String>natural().nullsLast();
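If adding Guava is not an option, a similar comparator can be built with the JDK alone (Java 8 and later) using Comparator.nullsLast. This is an alternative to the Guava approach above, not part of it, and the class and constant names are illustrative.

```java
import java.util.Arrays;
import java.util.Comparator;

class NullSafeCompareDemo {
    // Orders non-null strings by compareTo and places nulls after them.
    static final Comparator<String> NULLS_LAST =
            Comparator.nullsLast(Comparator.<String>naturalOrder());

    public static void main(String[] args) {
        String[] words = { "pear", null, "apple" };
        Arrays.sort(words, NULLS_LAST);
        System.out.println(Arrays.toString(words)); // [apple, pear, null]
    }
}
```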

What is string lexicographically? Java

The exact value returned does not really matter, as the compareTo contract is only to return a negative number, a positive number or 0 (as you already know).

However, if you really want to understand why -31 is returned when comparing "Dog" with "cat" (or any other string), you can simply look at the method directly in the String class:

public int compareTo(String anotherString) {
int len1 = value.length;
int len2 = anotherString.value.length;
int lim = Math.min(len1, len2);
char v1[] = value;
char v2[] = anotherString.value;

int k = 0;
while (k < lim) {
char c1 = v1[k];
char c2 = v2[k];
if (c1 != c2) {
return c1 - c2;
}
k++;
}
return len1 - len2;
}

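Keep in mind that value is the char array backing the string (this is the Java 8 source; newer JDKs store the characters in a byte array, but the comparison logic is equivalent).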

private final char value[];

So how does this method proceed?

  • You store the minimum of the two string lengths in a variable lim.
  • You take a reference to each string's backing char array (v1 and v2); nothing is copied.
  • You loop over the characters, checking whether they are equal, until you reach lim.
  • If two characters at the same index are not equal, you return the result of subtracting the second one from the first. A char can be treated as an int (its UTF-16 code unit, which matches ASCII for ASCII characters), and these values are already ordered. The result is therefore negative if the first char is "lower" than the second and positive if it is "higher". It is never 0 here, because this branch is reached only when the characters differ.
  • If all characters were equal up to the length of the shorter string, you return the difference of the two lengths.

In your example, the first letters of the two words are not equal, so you compare D with c, which are represented as 68 and 99 respectively. Subtract 99 from 68 and you get -31.

So to answer this question:

Does this mean that the int returned is the number of places away the
strings are from one another if they were to be sorted alphabetically
like a dictionary?

No, it is either the difference between the values of the first two non-matching chars or, if one string is a prefix of the other, the difference of the two lengths.
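Both kinds of return value can be seen in a short sketch. The class name and the position helper are illustrative only.

```java
class ReturnValueDemo {
    // Describes where a would appear relative to b, using only the sign
    // of compareTo.
    static String position(String a, String b) {
        int result = a.compareTo(b);
        if (result < 0) return "before";
        if (result > 0) return "after";
        return "same";
    }

    public static void main(String[] args) {
        System.out.println("Dog".compareTo("cat"));           // 'D' (68) - 'c' (99) = -31
        System.out.println("car".compareTo("cart"));          // same prefix, so 3 - 4 = -1
        System.out.println("Dog".compareToIgnoreCase("cat")); // positive: 'd' comes after 'c'
        System.out.println(position("Dog", "cat"));           // before
    }
}
```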

Also, how does the method deal with case sensitivity? Are lower case
letters first in line before uppercase? Is there a chart for this?

If you want to ignore the case when comparing, you can use String#compareToIgnoreCase.

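Otherwise, upper-case letters ('A' to 'Z', 65 to 90) have lower values than lower-case letters ('a' to 'z', 97 to 122), so they sort first. You can check an ASCII chart for the values of both.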

What does < or > mean in Python while comparing strings

The < or > would result in lexicographic comparison of the two strings:

>>> x="absx"
>>> o="abcdef"
>>> x > o
True

Lexicographic ordering is the same as dictionary ordering: the operators check which string would come earlier (or later) in dictionary order. The behavior is the same in Python 2 and Python 3.

The strings are not compared by length. For example:

>>> "a" < "aaaaa" 
True

In the example above, "a" comes before "aaaaa" in dictionary order. To compare strings by length instead, call len() on each of them and compare the results.

Comparing strings lexicographically

Comparing std::string objects like that will work. However, you are comparing string literals, so the operator compares their addresses, not their characters. To get the comparison you want, either initialize a std::string from them or use strcmp:

if(std::string("aa") > std::string("bz")) cout<<"Yes";

This is the C++-style solution.

Or alternatively:

if(strcmp("aa", "bz") > 0) cout<<"Yes";

EDIT (thanks to Konrad Rudolph's comment): in fact, in the first version only one of the operands needs to be converted explicitly, so:

if(std::string("aa") > "bz") cout<<"Yes";

will again work as expected.

EDIT (thanks to churill's comment): since C++14 you can use std::string literals with the s suffix (this requires using namespace std::string_literals):

if("aa"s > "bz") cout<<"Yes";

Lexicographically comparing two strings

The problem is in the declaration of the variables mini and max.

You initialize both of them as empty strings, so the conditions that compare them to x do not give the expected result, and their values never change.

Instead, you can initialize them with the first substring:

mini = s.substring(0, z);
max = s.substring(0, z);

Edit:

If you test the condition x.compareTo(""), it returns x.length(), which is greater than 0 for any non-empty x, so mini will never change.
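The question's code is not shown here, so the following is only a sketch of how the fix might look, assuming the goal is to find the lexicographically smallest and largest substrings of length z in s. The names mini, max, s, z and x follow the answer; the class and method names are hypothetical.

```java
class MinMaxSubstring {
    // Returns { smallest, largest } among all substrings of s with length z.
    // Assumes 1 <= z <= s.length().
    static String[] smallestAndLargest(String s, int z) {
        // Start from a real candidate instead of the empty string.
        String mini = s.substring(0, z);
        String max = mini;
        for (int i = 1; i + z <= s.length(); i++) {
            String x = s.substring(i, i + z);
            if (x.compareTo(mini) < 0) mini = x;
            if (x.compareTo(max) > 0) max = x;
        }
        return new String[] { mini, max };
    }

    public static void main(String[] args) {
        String[] result = smallestAndLargest("welcometojava", 3);
        System.out.println(result[0]); // ava
        System.out.println(result[1]); // wel
    }
}
```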

Comparing two strings containing '_' lexicographically in bash

As pointed out in the comments, the [[ < ]] operator depends on your current locale. This is also documented in bash's manual:

When used with [[, the ‘<’ and ‘>’ operators sort lexicographically using the current locale.

You can check your current locale using the command locale. When you run this command on macOS and on Debian, you should get different results.

You can override your system's locale for your script using export LC_ALL=.... The locale that sorts by ASCII codes is C.

$ export LC_ALL=en_US.UTF-8; [[ ab < a_c ]]; echo $?
0
$ export LC_ALL=C; [[ ab < a_c ]]; echo $?
1

