Making Pairs of Words Based on One Column

In two steps:

$ sort -k2 file > file.s
$ join -j2 file.s{,} | awk '!(a[$2,$3]++ + a[$3,$2]++){print $2,$3,$1}'

A C ID.1
A D ID.1
C D ID.1
B E ID.2
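
For readers less familiar with join and awk, the same logic can be sketched in Python. This is only an illustration: the input rows are an assumption (a two-column "word ID" layout inferred from the output above), and word_pairs is a hypothetical helper name. It mirrors the self-join (every ordered pair within an ID) followed by the awk filter that drops self-pairs and mirrored duplicates.

```python
from itertools import product

# Assumed input: (word, ID) rows, inferred from the output shown above.
rows = [("A", "ID.1"), ("C", "ID.1"), ("D", "ID.1"), ("B", "ID.2"), ("E", "ID.2")]

def word_pairs(rows):
    # Group words by ID, keeping the order in which they appear.
    by_id = {}
    for word, id_ in rows:
        by_id.setdefault(id_, []).append(word)

    seen = set()  # unordered pairs already visited, like the awk array
    out = []
    for id_ in sorted(by_id):  # mirrors sorting the file on the ID column
        # A self-join yields every ordered pair, including (x, x) and both (x, y) and (y, x).
        for x, y in product(by_id[id_], repeat=2):
            key = frozenset((x, y))
            if x != y and key not in seen:
                out.append((x, y, id_))
            seen.add(key)
    return out

for x, y, id_ in word_pairs(rows):
    print(x, y, id_)
```

As in the awk version, the "seen" check is not scoped per ID, so a pair that occurs under two IDs is reported only once.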

How to select sub-strings based on the presence of word pairs? Python

I believe there is a bug in your code.

else:
    return ''

Because of this branch, 'func' returns as soon as the first comparison fails, without checking the remaining candidates. That is probably why the code does not return any matches.

A sample working code is below:

# The function seems to loop over all r's but only over the first b:
def func(sentence, first_twos=b):
    for first_two in first_twos:
        if first_two in sentence:
            s = sentence[sentence.index(first_two):]
            return s
    # Reached only when none of the candidates matched
    return ''

df['Segments'] = a.apply(func)

And the output:

df:   
{
'First2': ['can I', 'should it', 'what does'],
'Segments': ['what does it say? ', 'should it say more?', ''],
'Sentence': ['If this is a string what does it say? ', 'And this is a string, should it say more?', 'This is yet another string. ' ]
}
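
To make the effect of the early return concrete, here is a small pandas-free comparison. The function names first_match_buggy and first_match are hypothetical; the sentence and candidate list are taken from the output above.

```python
def first_match_buggy(sentence, first_twos):
    for first_two in first_twos:
        if first_two in sentence:
            return sentence[sentence.index(first_two):]
        else:
            return ''  # exits after the very first candidate
    return ''

def first_match(sentence, first_twos):
    for first_two in first_twos:
        if first_two in sentence:
            return sentence[sentence.index(first_two):]
    return ''  # only after every candidate has been tried

first_twos = ['can I', 'should it', 'what does']
sentence = 'And this is a string, should it say more?'

print(repr(first_match_buggy(sentence, first_twos)))  # ''
print(repr(first_match(sentence, first_twos)))        # 'should it say more?'
```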

Combining columns and count combinations (pairs)

If we always put the alphabetically earlier letter first when building the pair column with :=, then A_C and C_A collapse into the same key and the code produces the desired result. We'll use ifelse() to decide whether V1 goes before V2, as follows.

library(data.table)
set.seed(126)
dt <- data.table(V1 = sample(LETTERS[1:4], 30, replace = TRUE),
                 V2 = sample(LETTERS[1:4], 30, replace = TRUE))

# adjusted version where first letter always < second letter

#Exclude rows with the same name
dt <- dt[V1 != V2]

#Create pairs by combining V1 and V2
dt[, pair := ifelse(V1 < V2,paste(V1, V2, sep="_"), paste(V2, V1, sep = "_"))]

#Count the pairs
dt[, .N, by=.(pair)]

...and the output:

> #Count the pairs 
> dt[, .N, by=.(pair)]
pair N
1: A_C 3
2: B_C 9
3: C_D 5
4: A_B 4
5: B_D 3
6: A_D 1
>
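
The same idea (normalise each pair so the smaller element comes first, then count) can be sketched in Python. This is an illustration under assumptions: count_unordered_pairs is a hypothetical helper and the sample data is made up, not the random sample drawn above.

```python
from collections import Counter

def count_unordered_pairs(pairs):
    counts = Counter()
    for v1, v2 in pairs:
        if v1 == v2:
            continue  # exclude rows where both values are the same
        a, b = sorted((v1, v2))  # alphabetically earlier value first
        counts[f"{a}_{b}"] += 1
    return counts

# Made-up sample: ("C", "A") and ("A", "C") count as the same pair.
sample = [("C", "A"), ("A", "C"), ("B", "B"), ("D", "B"), ("B", "D"), ("A", "C")]
print(count_unordered_pairs(sample))
```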

Trying to match strings from multiple columns and create pair list where matches are found

Based on the update, we may filter after splitting the column in 'df1', then create a sequence index and reshape to 'long' format

library(dplyr)
library(tidyr)
df1 %>%
separate(values, into = c('values1', 'values2')) %>%
filter(if_all(everything(), ~ .x %in% df2$values)) %>%
mutate(paired = row_number()) %>%
pivot_longer(cols = -paired, values_to = 'value', names_to = NULL) %>%
select(value, paired)

-output

# A tibble: 6 × 2
value paired
<chr> <int>
1 apples 1
2 x 1
3 oranges 2
4 z 2
5 bananas 3
6 y 3

How to generate all pairs of values, from the result of a groupby, in a pandas dataframe

It's simple: use itertools.combinations inside apply, then stack, i.e.

import pandas as pd
from itertools import combinations
ndf = df.groupby('ID')['words'].apply(lambda x : list(combinations(x.values,2)))\
        .apply(pd.Series).stack().reset_index(level=0,name='words')

ID words
0 1 (word1, word2)
1 1 (word1, word3)
2 1 (word2, word3)
0 2 (word4, word5)
0 3 (word6, word7)
1 3 (word6, word8)
2 3 (word6, word9)
3 3 (word7, word8)
4 3 (word7, word9)
5 3 (word8, word9)

To match your exact output, we further have to do:

sdf = pd.concat([ndf['ID'],ndf['words'].apply(pd.Series)],axis=1).set_axis(['ID','WordsA','WordsB'],axis=1)

ID WordsA WordsB
0 1 word1 word2
1 1 word1 word3
2 1 word2 word3
0 2 word4 word5
0 3 word6 word7
1 3 word6 word8
2 3 word6 word9
3 3 word7 word8
4 3 word7 word9
5 3 word8 word9

To do it all in one chained expression:

combo = df.groupby('ID')['words'].apply(combinations,2)\
        .apply(list).apply(pd.Series)\
        .stack().apply(pd.Series)\
        .set_axis(['WordsA','WordsB'],axis=1)\
        .reset_index(level=0)
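
If pandas is not available, the grouping-plus-combinations step can also be sketched with the standard library alone. The records list below is reconstructed from the output shown above, and pairs_by_id is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations

# (ID, word) records, reconstructed from the output above.
records = [
    (1, "word1"), (1, "word2"), (1, "word3"),
    (2, "word4"), (2, "word5"),
    (3, "word6"), (3, "word7"), (3, "word8"), (3, "word9"),
]

def pairs_by_id(records):
    # Collect the words belonging to each ID, in input order.
    groups = defaultdict(list)
    for id_, word in records:
        groups[id_].append(word)
    # One (ID, WordsA, WordsB) row per 2-combination within each group.
    return [(id_, a, b)
            for id_, words in groups.items()
            for a, b in combinations(words, 2)]

for row in pairs_by_id(records):
    print(*row)
```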

Creating a 'Rough Match' Function

You don't need VBA for this. Enter this in D1 as an array formula with Ctrl+Shift+Enter:

=SUM(COUNTIF(A1,"*"&B1:C1&"*"))>0

The asterisks are wildcards, and the array formula, in effect, loops through each cell in B1:C1. So the formula says to count the instances of B1 or C1, preceded and followed by any text, found in A1.
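
The logic of the formula, expressed as a short Python sketch for readers who want to check it outside Excel. The function name rough_match is hypothetical. COUNTIF matching is case-insensitive, so the sketch lower-cases both sides; it does not reproduce Excel's handling of wildcard characters inside the search terms themselves.

```python
def rough_match(text, terms):
    # True if any term occurs anywhere inside text, ignoring case.
    text = text.lower()
    return any(term.lower() in text for term in terms)

print(rough_match("Green Apple Pie", ["apple", "kiwi"]))  # True
print(rough_match("Green Apple Pie", ["kiwi", "plum"]))   # False
```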

I need to create unique word pairs either in R or excel

(You don't really need a second list to do that; one is enough.)

cities  <- list("London", "Paris", "Kyiv", "Geneva", "Tokyo")

combn(cities, 2, paste, collapse = "-")

# [1] "London-Paris" "London-Kyiv" "London-Geneva" "London-Tokyo" "Paris-Kyiv"
# [6] "Paris-Geneva" "Paris-Tokyo" "Kyiv-Geneva" "Kyiv-Tokyo" "Geneva-Tokyo"
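
For reference, a Python counterpart of the combn() call, using itertools.combinations from the standard library. It is offered as an equivalent sketch, not as part of the original answer; the city list is the one used above.

```python
from itertools import combinations

cities = ["London", "Paris", "Kyiv", "Geneva", "Tokyo"]

# Every unordered pair of distinct cities, joined with a hyphen.
pairs = ["-".join(pair) for pair in combinations(cities, 2)]
print(pairs)
```

With 5 cities this yields 10 pairs, in the same order as the R output.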

