Levenshtein Type Algorithm with Numeric Vectors


An integer vector can be seen as a single string encoded in UTF-32, in which each Unicode code point is represented by one 32-bit integer. You can obtain an "ordinary" string from such a vector simply by converting it to UTF-8 with intToUtf8:

intToUtf8(c(65, 97))
## [1] "Aa"

By the way, adist applies utf8ToInt (the reverse operation) to its inputs by default, so internally it computes the result on integer vectors anyway. No big hack.
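
For example, utf8ToInt simply recovers the original code points:

utf8ToInt("Aa")
## [1] 65 97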

This is the solution.

adist(intToUtf8(c(1, 3, 4, 5, 6, 7, 8)), intToUtf8(c(54, 23, 12, 53, 7, 8)), counts = TRUE)
##      [,1]
## [1,]    5
## attr(,"counts")
## , , ins
##
##      [,1]
## [1,]    0
##
## , , del
##
##      [,1]
## [1,]    1
##
## , , sub
##
##      [,1]
## [1,]    4
##
## attr(,"trafos")
##      [,1]
## [1,] "SSSSDMM"

The above code works provided that all the numbers are strictly greater than 0. R treats Unicode code points quite liberally (in fact, too liberally, but in this case that works in your favor): even the largest possible integer is accepted:

utf8ToInt(intToUtf8(2147483647))
## [1] 2147483647

If you have a vector with negative values (or zeros), you may transform it first, e.g. with x <- x - min(x) + 1.
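
For instance, a minimal sketch with made-up vectors; note that both inputs should be shifted by the same amount, otherwise elements that used to be equal may stop matching:

x <- c(-5, 0, 3)
y <- c(-5, 2, 3)
shift <- 1 - min(c(x, y))  # one common shift for both vectors
adist(intToUtf8(x + shift), intToUtf8(y + shift))
##      [,1]
## [1,]    1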

If you need different costs for insertion, deletion, and substitution, check out adist's costs argument. There is also a package called stringdist, which includes many other string metrics; the above scheme should work there as well.
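
A minimal sketch of both options (the cost values below are made up):

## weighted operations via adist's costs argument
adist("abc", "abd", costs = list(insertions = 1, deletions = 2, substitutions = 3))
##      [,1]
## [1,]    3

## the same integer-to-string trick with stringdist
# install.packages("stringdist")
stringdist::stringdist(intToUtf8(c(1, 2, 3)), intToUtf8(c(1, 2, 4)), method = "lv")
## [1] 1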

Levenshtein / edit distance for arbitrary sequences

You can use intToUtf8 to map your integers to Unicode characters:

a <- c(1, 2, 3)  # example vectors standing in for your data
b <- c(1, 2, 4)

a2 <- intToUtf8(a)
b2 <- intToUtf8(b)

adist(a2, b2)
#      [,1]
# [1,]    1

How can I adapt the Levenshtein Distance algorithm to limit matches to a single word?

I can get pretty close to what you want by making levenshtein_distance a generic algorithm on a sequence container and including a cost function that calculates the distance between two elements:

#include <algorithm>
#include <cstddef>
#include <vector>

template<typename T, typename C>
size_t
seq_distance(const T& seq1, const T& seq2, const C& cost,
             const typename T::value_type& empty = typename T::value_type()) {
    const size_t size1 = seq1.size();
    const size_t size2 = seq2.size();

    std::vector<size_t> curr_col(size2 + 1);
    std::vector<size_t> prev_col(size2 + 1);

    // Prime the previous column for use in the following loop:
    prev_col[0] = 0;
    for (size_t idx2 = 0; idx2 < size2; ++idx2) {
        prev_col[idx2 + 1] = prev_col[idx2] + cost(empty, seq2[idx2]);
    }

    for (size_t idx1 = 0; idx1 < size1; ++idx1) {
        // Top of the current column: cost of deleting every element of seq1 so far.
        curr_col[0] = curr_col[0] + cost(seq1[idx1], empty);

        for (size_t idx2 = 0; idx2 < size2; ++idx2) {
            curr_col[idx2 + 1] = std::min(std::min(
                curr_col[idx2] + cost(empty, seq2[idx2]),       // insertion
                prev_col[idx2 + 1] + cost(seq1[idx1], empty)),  // deletion
                prev_col[idx2] + cost(seq1[idx1], seq2[idx2])); // substitution/match
        }

        curr_col.swap(prev_col);
        curr_col[0] = prev_col[0];
    }

    return prev_col[size2];
}

Given the above seq_distance, an edit distance between two sentences in which edits cannot cross word boundaries can be defined as follows:

#include <iterator>
#include <sstream>
#include <string>

size_t
letter_distance(char letter1, char letter2) {
    return letter1 != letter2 ? 1 : 0;
}

size_t
word_distance(const std::string& word1, const std::string& word2) {
    return seq_distance(word1, word2, &letter_distance);
}

size_t
sentence_distance(const std::string& sentence1, const std::string& sentence2) {
    std::vector<std::string> words1;
    std::vector<std::string> words2;
    std::istringstream iss1(sentence1);
    std::istringstream iss2(sentence2);
    // Split each sentence into whitespace-delimited words:
    std::copy(std::istream_iterator<std::string>(iss1),
              std::istream_iterator<std::string>(),
              std::back_inserter(words1));
    std::copy(std::istream_iterator<std::string>(iss2),
              std::istream_iterator<std::string>(),
              std::back_inserter(words2));
    return seq_distance(words1, words2, &word_distance);
}

Here's the code working on ideone. I've tested a few cases and I'm pretty sure it does the right thing, but you should try it out more to make sure the results are reasonable.

Note that this isn't exactly what you asked for, since it ignores all spaces in the edit distance measurement. I think it shouldn't be too hard to modify it not to do that, but I haven't thought it through completely. In any case, this might be just as good (or even better), depending on your needs, so I'll let you decide whether to tweak it.

Just a minor note, your original code was slightly buggy in that the following two lines:

curr_col.reserve(length2 + 1);
prev_col.reserve(length2 + 1);

reserve capacity in the vectors but do not actually change their sizes, so accessing elements after that was undefined behavior. You should resize a vector if you're going to access elements in a range; reserve is meant for situations where you are about to push_back a known number of elements one by one (which grows the size as you go, not all at once) and want to avoid the cost of repeated internal reallocations (the capacity only grows by a certain factor each time it is exceeded).

EDIT:

This version takes the spaces between words into account as part of the edit distance, but the results still do not exactly match your examples, because multiple spaces have to be inserted in some cases.

How to pass an array of vectors to a function in R

In R, there is no index 0: indexing starts at 1. You can store the vectors in a list and apply the function to each element:

function_b <- function(x) {
    z <- (x[1]^2 + x[2]^2) / (2 * x[1] * x[2])
    return(z)
}

v_2 <- list(c(1.1, 2.3), c(1.7, 5.2), c(6.23, 7.41))
sapply(v_2, FUN = function_b)
[1] 1.284585 1.692873 1.015081
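
If you want the result type checked, vapply is a slightly stricter alternative (same result, assuming every element of the list is numeric):

vapply(v_2, function_b, numeric(1))
[1] 1.284585 1.692873 1.015081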

How to apply the Levenshtein distance to a set of target strings?

You need to estimate the relevant probabilities first: the probability of insertion, of deletion, and of substitution. Then use the logs of these probabilities as the penalties for each operation.

In a "context-independent" setting, if pi is the probability of insertion, pd the probability of deletion, and ps the probability of substitution, then the probability of observing the same symbol is pp = 1 - ps - pd.

In this case, use log(pi/pp/k), log(pd/pp), and log(ps/pp/(k-1)) as the penalties for insertion, deletion, and substitution respectively, where k is the number of symbols in the alphabet.
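
A minimal sketch with illustrative numbers (none of these values come from real data; estimate them from your own corpus):

p_ins <- 0.05; p_del <- 0.05; p_sub <- 0.10  # made-up probabilities
k <- 26                                       # e.g. a 26-symbol alphabet
p_keep <- 1 - p_sub - p_del                   # probability of observing the same symbol

ins_penalty <- log(p_ins / p_keep / k)
del_penalty <- log(p_del / p_keep)
sub_penalty <- log(p_sub / p_keep / (k - 1))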

Essentially, if you use this distance measure between a source and a target, you get the log probability of observing that target given the source. If you have a bunch of training data (i.e. source-target pairs), choose some initial estimates for these probabilities, align the source-target pairs, and re-estimate the probabilities (an EM-style strategy).

You can start with one set of probabilities and assume context independence. Later you can assume some kind of clustering among the contexts (e.g. assume there are k different sets of letters whose substitution rates differ...).


