Find common substrings between two character variables
Here's a CRAN package for that:
library(qualV)
sapply(seq_along(a), function(i)
paste(LCS(strsplit(a[i], '')[[1]], strsplit(b[i], '')[[1]])$LCS,
collapse = ""))
Find common substring between two strings
Its called Longest Common Substring problem. Here I present a simple, easy to understand but inefficient solution. It will take a long time to produce correct output for large strings, as the complexity of this algorithm is O(N^2).
def longestSubstringFinder(string1, string2):
answer = ""
len1, len2 = len(string1), len(string2)
for i in range(len1):
match = ""
for j in range(len2):
if (i + j < len1 and string1[i + j] == string2[j]):
match += string2[j]
else:
if (len(match) > len(answer)): answer = match
match = ""
return answer
print(longestSubstringFinder("apple pie available", "apple pies"))
print(longestSubstringFinder("apples", "appleses"))
print(longestSubstringFinder("bapples", "cappleses"))
Output
apple pie
apples
apples
Identify a common pattern
Function LCS
from qualV
package (in Find common substrings between two character variables; not a possible duplicate) does something else than what you need. It solves the longest common subsequence problem, where subsequences are not required to occupy consecutive positions within the original sequences.
What you have is the longest common substring problem, for which you could use this algorithm, and here is the code assuming that there is a unique (in terms of length) longest common substring:
a <- "WWDUISBURG-HAMBORNS"
b <- "QQQQQQDUISBURG (-31.7.29)S"
A <- strsplit(a, "")[[1]]
B <- strsplit(b, "")[[1]]
L <- matrix(0, length(A), length(B))
ones <- which(outer(A, B, "=="), arr.ind = TRUE)
ones <- ones[order(ones[, 1]), ]
for(i in 1:nrow(ones)) {
v <- ones[i, , drop = FALSE]
L[v] <- ifelse(any(v == 1), 1, L[v - 1] + 1)
}
paste0(A[(-max(L) + 1):0 + which(L == max(L), arr.ind = TRUE)[1]], collapse = "")
# [1] "DUISBURG"
Find common elements of two strings including characters that occur many times
Well there are multiple ways in which you can get the desired result. For me the simplest algorithm to get the answer would be:
- Define an empty
dict
. Liked = {}
- Iterate through each character of the first string:
if the character is not present in the dictionary, add the character to the dictionary.
else increment the count of character in the dictionary. - Create a variable as
common = ""
- Iterate through the second string characters, if the count of that character in the dictionary above is greater than 0: decrement its value and add this character to
common
- Do whatever you want to do with the
common
The complete code for this problem:
s1 = 'aebcdee'
s2 = 'aaeedfskm'
d = {}
for c in s1:
if c in d:
d[c] += 1
else:
d[c] = 1
common = ""
for c in s2:
if c in d and d[c] > 0:
common += c
d[c] -= 1
print(common)
Find all common contiguous substrings of two strings in python
import string
s1 = 'Today is a good day, it is a good idea to have a walk.'
s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'
z=[]
s1=s1.translate(None, string.punctuation) #remove punctuation
s2=s2.translate(None, string.punctuation)
print s1
print s2
sw1=s1.lower().split() #split it into words
sw2=s2.lower().split()
print sw1,sw2
i=0
while i<len(sw1): #two loops to detect common strings. used while so as to change value of i in the loop itself
x=0
r=""
d=i
#print r
for j in range(len(sw2)):
#print r
if sw1[i]==sw2[j]:
r=r+' '+sw2[j] #if string same keep adding to a variable
x+=1
i+=1
else:
if x>0: # if not same check if there is already one in buffer and add it to result (here z)
z.append(r)
i=d
r=""
x=0
if x>0: #end case of above loop
z.append(r)
r=""
i=d
x=0
i+=1
#print i
print list(set(z))
#O(n^3)
how to compare two strings to find common substring
The reason for the timeout is probably: to compare two strings that each are 1.000.000 characters long, your code needs 1.000.000 * 1.000.000 comparisons, always.
There is a faster algorithm that only needs 2 * 1.000.000 comparisons. You should use the faster algorithm instead. Its basic idea is:
- for each character in s1: add the character to a set (this is the first million)
- for each character in s2: test whether the set from step 1 contains the character, and if so, return "yes" immediately (this is the second million)
Java already provides a BitSet data type that does all you need. It is used like this:
BitSet seenInS1 = new BitSet();
seenInS1.set('x');
seenInS1.get('x');
How to get the common first substrings In two Strings
It maybe isn't the best solution but my attempt would be to find the first matching characters in each string and then continue to check the following characters if they are still the same:
private static String extractFirstEqual(String a, String b) {
//Split your string into an array of characters
String[] arr = a.split("");
String[] brr = b.split("");
StringBuilder result = new StringBuilder();
//Iterate over both arrays
for (int i = 0; i < arr.length; i++) {
for (int j = 0; j < brr.length; j++) {
//Find first matching character
if (arr[i].equals( brr[j])) {
//While there are more characters in both arrays and the characters keep matching, append them
// to the result
while (arr[i].equals(brr[j]) && i < arr.length && j < brr.length) {
result.append(arr[i]);
i++;
j++;
}
return result.toString();
}
}
}
return result.toString();
}
Related Topics
Remove Multiple Objects with Rm()
In 'Knitr' How to Test for If the Output Will Be PDF or Word
Generate an Incrementally Increasing Sequence Like 112123123412345
Drop-Down Checkbox Input in Shiny
Merge and Perfectly Align Histogram and Boxplot Using Ggplot2
Passing Command Line Arguments to R Cmd Batch
How to Change Library Location in R
How to Draw Stacked Bars in Ggplot2 That Show Percentages Based on Group
Cannot Install an R Package from Github
Ggplot2 Pie and Donut Chart on Same Plot
Split Up a Dataframe by Number of Rows
Automatically Delete Files/Folders
Meaning of Ddply Error: 'Names' Attribute [9] Must Be the Same Length as the Vector [1]
How to Obtain an 'Unbalanced' Grid of Ggplots
Explain Ggplot2 Warning: "Removed K Rows Containing Missing Values"