Extracting noun+noun or (adj|noun)+noun from Text
It is possible.
EDIT:
You got it. Run the POS tagger, then split its output on spaces: ll <- strsplit(acqTag, ' '). From there, iterate over the tokens (the length of ll[[1]], since ll itself is a one-element list) like:
for (i in seq_along(ll[[1]])) { qq <- strsplit(ll[[1]][i], '/') } and check each word/tag pair for the part-of-speech sequence you're looking for.
After splitting on spaces, it is just list processing in R.
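As a minimal sketch of that list processing, the loop can be replaced with vectorized calls. The sample string below is hypothetical, and the "word/TAG" layout is an assumption about what acqTag looks like; adjust the patterns if your tagger emits something else.

```r
# Hypothetical tagged string in "word/TAG" form (an assumption about acqTag's format)
acqTag <- "the/DT american/JJ people/NNS are/VBP catching/VBG on/RB ./."

# Split on spaces, then separate each token at its last "/" into word and tag
tokens <- strsplit(acqTag, " ", fixed = TRUE)[[1]]
words  <- sub("/[^/]*$", "", tokens)
tags   <- sub("^.*/", "", tokens)

# Position i starts a match when tag i is JJ* or NN* and tag i + 1 is NN*
n      <- length(tags)
first  <- grepl("^(JJ|NN)", tags[-n])
second <- grepl("^NN", tags[-1])
hits   <- which(first & second)

pairs <- paste(words[hits], words[hits + 1])
print(pairs)
```

Comparing tags[-n] against tags[-1] lines each tag up with its successor, so no explicit index arithmetic or hard-coded length is needed.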
Getting verbal noun from noun
There is no solution that works in all cases, since you cannot enumerate all cases. In English, effectively any noun can be "verbed", resulting in a practically unbounded set.
What you can do is lemmatize your tokens and then call derivationally_related_forms() on the WordNet lemma objects in nltk to get the nouns that are derived from a verb. Searching the resulting data structure will give you the right results. To reduce the number of verbs you have to search for each noun, you could use something like the largest common prefix, e.g. .
Look at this:
https://www.howtobuildsoftware.com/index.php/how-do/4EO/python-nlp-wordnet-get-noun-from-verb-wordnet
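The prefix filter mentioned above could look roughly like this in R. Both function names, the min_len threshold, and the example words are assumptions made for illustration; this only narrows the candidate list and does not replace the WordNet lookup.

```r
# Length of the shared leading substring of two words
common_prefix_len <- function(a, b) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n <- min(length(ca), length(cb))
  if (n == 0) return(0L)
  mismatch <- which(ca[seq_len(n)] != cb[seq_len(n)])
  if (length(mismatch) == 0) n else mismatch[1] - 1L
}

# Keep only candidate verbs that share at least min_len leading
# characters with the noun (min_len = 3 is an arbitrary choice)
filter_candidates <- function(noun, verbs, min_len = 3) {
  keep <- vapply(verbs,
                 function(v) common_prefix_len(noun, v) >= min_len,
                 logical(1))
  verbs[keep]
}

# Illustrative words only
candidates <- filter_candidates("acquisition", c("acquire", "buy", "accept", "act"))
print(candidates)
```

A prefix filter will miss pairs whose stems diverge early, so treat it as a speed-up heuristic rather than a correctness guarantee.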
R function for pattern matching
Okay, here we go. Using this data (shared nicely with dput()):
df = structure(list(V1 = structure(c(15L, 3L, 11L, 4L, 5L, 9L, 2L,
16L, 18L, 14L, 13L, 8L, 12L, 20L, 19L, 1L, 7L, 10L, 6L, 17L), .Label = c(",",
".", "american", "are", "catching", "country", "in", "is", "on",
"our", "people", "profoundly", "something", "that", "the", "they",
"today", "understand", "when", "wrong"), class = "factor"), V2 = structure(c(3L,
5L, 7L, 12L, 11L, 10L, 2L, 8L, 12L, 4L, 6L, 13L, 10L, 5L, 14L,
1L, 4L, 9L, 6L, 6L), .Label = c(",", ".", "DT", "IN", "JJ", "NN",
"NNS", "PRP", "PRP$", "RB", "VBG", "VBP", "VBZ", "WRB"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"))
I'll use the stringr package because of its consistent syntax, so I don't have to look up the argument order for grep. We'll first detect the adjectives, then the nouns, and figure out where they line up (offsetting by 1). Then we paste together the words that correspond to the matches.
library(stringr)
adj = str_detect(df$V2, "JJ")
noun = str_detect(df$V2, "NN")
pairs = which(c(FALSE, adj) & c(noun, FALSE))
ngram = paste(df$V1[pairs - 1], df$V1[pairs])
# [1] "american people"
Now we can put it in a function. I left the patterns as arguments (with adjective, noun as the defaults) for flexibility.
bigram = function(word, type, patt1 = "JJ", patt2 = "N[A-Z]") {
    pairs = which(c(FALSE, str_detect(type, pattern = patt1)) &
                  c(str_detect(type, patt2), FALSE))
    return(paste(word[pairs - 1], word[pairs]))
}
Demonstrating use on the original data
with(df, bigram(word = V1, type = V2))
# [1] "american people"
Let's cook up some data with more than one match to make sure it works:
df2 = data.frame(w = c("american", "people", "hate", "a", "big", "bad", "bank"),
t = c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN"))
df2
#          w   t
# 1 american  JJ
# 2   people NNS
# 3     hate VBP
# 4        a  DT
# 5      big  JJ
# 6      bad  JJ
# 7     bank  NN
with(df2, bigram(word = w, type = t))
# [1] "american people" "bad bank"
And back to the original to test out a different pattern:
with(df, bigram(word = V1, type = V2, patt1 = "N[A-Z]", patt2 = "V[A-Z]"))
# [1] "people are" "something is"
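If you would rather not depend on stringr, the same offset trick should work with base R's grepl(), which takes the pattern first and the vector second. This is a sketch under that assumption; bigram_base is a hypothetical name, and the input vectors simply repeat the made-up data from above.

```r
# Base R variant of the same idea: grepl(pattern, x) instead of str_detect(x, pattern)
bigram_base <- function(word, type, patt1 = "JJ", patt2 = "N[A-Z]") {
  first  <- grepl(patt1, type)
  second <- grepl(patt2, type)
  # Shift `first` right by one so position p is TRUE when
  # type[p - 1] matches patt1 and type[p] matches patt2
  pairs <- which(c(FALSE, first) & c(second, FALSE))
  paste(word[pairs - 1], word[pairs])
}

w    <- c("american", "people", "hate", "a", "big", "bad", "bank")
tags <- c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN")
result <- bigram_base(w, tags)
print(result)
```

Note that "big bad" is not returned, because the second word of a pair must match the noun pattern.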