Find Common Substrings Between Two Character Variables

Find common substrings between two character variables

Here's a CRAN package for that:

library(qualV)

sapply(seq_along(a), function(i)
paste(LCS(strsplit(a[i], '')[[1]], strsplit(b[i], '')[[1]])$LCS,
collapse = ""))

Find common substring between two strings

Its called Longest Common Substring problem. Here I present a simple, easy to understand but inefficient solution. It will take a long time to produce correct output for large strings, as the complexity of this algorithm is O(N^2).

def longestSubstringFinder(string1, string2):
answer = ""
len1, len2 = len(string1), len(string2)
for i in range(len1):
match = ""
for j in range(len2):
if (i + j < len1 and string1[i + j] == string2[j]):
match += string2[j]
else:
if (len(match) > len(answer)): answer = match
match = ""
return answer

print(longestSubstringFinder("apple pie available", "apple pies"))
print(longestSubstringFinder("apples", "appleses"))
print(longestSubstringFinder("bapples", "cappleses"))

Output

apple pie
apples
apples

Identify a common pattern

Function LCS from qualV package (in Find common substrings between two character variables; not a possible duplicate) does something else than what you need. It solves the longest common subsequence problem, where subsequences are not required to occupy consecutive positions within the original sequences.

What you have is the longest common substring problem, for which you could use this algorithm, and here is the code assuming that there is a unique (in terms of length) longest common substring:

a <- "WWDUISBURG-HAMBORNS"
b <- "QQQQQQDUISBURG (-31.7.29)S"

A <- strsplit(a, "")[[1]]
B <- strsplit(b, "")[[1]]

L <- matrix(0, length(A), length(B))
ones <- which(outer(A, B, "=="), arr.ind = TRUE)
ones <- ones[order(ones[, 1]), ]
for(i in 1:nrow(ones)) {
v <- ones[i, , drop = FALSE]
L[v] <- ifelse(any(v == 1), 1, L[v - 1] + 1)
}
paste0(A[(-max(L) + 1):0 + which(L == max(L), arr.ind = TRUE)[1]], collapse = "")
# [1] "DUISBURG"

Find common elements of two strings including characters that occur many times

Well there are multiple ways in which you can get the desired result. For me the simplest algorithm to get the answer would be:

  1. Define an empty dict. Like d = {}
  2. Iterate through each character of the first string:

    if the character is not present in the dictionary, add the character to the dictionary.

    else increment the count of character in the dictionary.
  3. Create a variable as common = ""
  4. Iterate through the second string characters, if the count of that character in the dictionary above is greater than 0: decrement its value and add this character to common
  5. Do whatever you want to do with the common

The complete code for this problem:

s1 = 'aebcdee'
s2 = 'aaeedfskm'

d = {}

for c in s1:
if c in d:
d[c] += 1
else:
d[c] = 1

common = ""

for c in s2:
if c in d and d[c] > 0:
common += c
d[c] -= 1

print(common)

Find all common contiguous substrings of two strings in python

import string
s1 = 'Today is a good day, it is a good idea to have a walk.'
s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'
z=[]
s1=s1.translate(None, string.punctuation) #remove punctuation
s2=s2.translate(None, string.punctuation)
print s1
print s2
sw1=s1.lower().split() #split it into words
sw2=s2.lower().split()
print sw1,sw2
i=0
while i<len(sw1): #two loops to detect common strings. used while so as to change value of i in the loop itself
x=0
r=""
d=i
#print r
for j in range(len(sw2)):
#print r
if sw1[i]==sw2[j]:
r=r+' '+sw2[j] #if string same keep adding to a variable
x+=1
i+=1
else:
if x>0: # if not same check if there is already one in buffer and add it to result (here z)
z.append(r)
i=d
r=""
x=0
if x>0: #end case of above loop
z.append(r)
r=""
i=d
x=0
i+=1
#print i
print list(set(z))

#O(n^3)

how to compare two strings to find common substring

The reason for the timeout is probably: to compare two strings that each are 1.000.000 characters long, your code needs 1.000.000 * 1.000.000 comparisons, always.

There is a faster algorithm that only needs 2 * 1.000.000 comparisons. You should use the faster algorithm instead. Its basic idea is:

  • for each character in s1: add the character to a set (this is the first million)
  • for each character in s2: test whether the set from step 1 contains the character, and if so, return "yes" immediately (this is the second million)

Java already provides a BitSet data type that does all you need. It is used like this:

BitSet seenInS1 = new BitSet();
seenInS1.set('x');
seenInS1.get('x');

How to get the common first substrings In two Strings

It maybe isn't the best solution but my attempt would be to find the first matching characters in each string and then continue to check the following characters if they are still the same:

private static String extractFirstEqual(String a, String b) {

//Split your string into an array of characters
String[] arr = a.split("");
String[] brr = b.split("");
StringBuilder result = new StringBuilder();

//Iterate over both arrays
for (int i = 0; i < arr.length; i++) {
for (int j = 0; j < brr.length; j++) {
//Find first matching character
if (arr[i].equals( brr[j])) {
//While there are more characters in both arrays and the characters keep matching, append them
// to the result
while (arr[i].equals(brr[j]) && i < arr.length && j < brr.length) {
result.append(arr[i]);
i++;
j++;
}

return result.toString();
}
}
}

return result.toString();
}


Related Topics



Leave a reply



Submit