Extracting "((Adj|Noun)+|((Adj|Noun)*(Noun-Prep)?)(Adj|Noun)*)Noun" from Text (Justeson & Katz, 1995)

Extracting noun+noun or (adj|noun)+noun from Text

It is possible.

EDIT:

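You got it. Run the POS tagger and split its output on spaces: ll <- strsplit(acqTag, ' '). Each element of ll[[1]] is then one word/TAG token. From there, iterate over the tokens (there are length(ll[[1]]) of them), splitting each one on the slash and keeping the results, like:
qq <- list(); for (i in seq_along(ll[[1]])) {qq[[i]] <- strsplit(ll[[1]][i], '/')[[1]]}
and pick out the part-of-speech sequence you're looking for.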

After splitting on spaces it is just list processing in R.
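The steps above can be sketched roughly as follows. This is only an illustration, not the original poster's code: the sample acqTag string is made up, the tags are assumed to be Penn Treebank tags (JJ* for adjectives, NN* for nouns, IN for prepositions), and the one-letter recoding is just one convenient way to run a Justeson & Katz style pattern as a regular expression. Tokens whose word part contains a slash would need extra handling.

```r
# Hypothetical tagger output: one string of space-separated word/TAG tokens.
acqTag <- "the/DT american/JJ people/NNS hate/VBP a/DT big/JJ bad/JJ bank/NN"

# Split on spaces, then split each token into word and tag on the slash.
tokens <- strsplit(acqTag, " ")[[1]]
parts  <- strsplit(tokens, "/")
words  <- vapply(parts, function(p) p[1], character(1))
tags   <- vapply(parts, function(p) p[2], character(1))

# Collapse each tag to one letter so that token i is character i of the string:
# A = adjective, N = noun, P = preposition, x = anything else.
code <- rep("x", length(tags))
code[grepl("^JJ", tags)] <- "A"
code[grepl("^NN", tags)] <- "N"
code[tags == "IN"]       <- "P"
tag_string <- paste(code, collapse = "")

# Justeson & Katz style pattern, written over the one-letter codes.
jk  <- "((A|N)+|((A|N)*(NP)?)(A|N)*)N"
m   <- gregexpr(jk, tag_string, perl = TRUE)[[1]]
len <- attr(m, "match.length")

# Keep multi-word matches and map character positions back to words.
keep <- which(m > 0 & len >= 2)
phrases <- vapply(keep, function(k) {
  paste(words[m[k]:(m[k] + len[k] - 1)], collapse = " ")
}, character(1))
phrases
```

On this sample, phrases contains "american people" and "big bad bank". The len >= 2 filter is there because the pattern on its own also matches a lone noun.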

Getting verbal noun from noun

There is no solution that works in all cases, because the set of cases is open-ended: in English, effectively any noun can be "verbed", so the set of noun/verb pairs is practically infinite.
What you can do is lemmatize your tokens and then call derivationally_related_forms() on the corresponding NLTK WordNet lemmas to get all nouns that are derived from the verb. Searching the resulting data structure will give you the right results. To reduce the number of verbs you have to check for each noun, you could filter candidates with something like the largest common prefix, e.g. .

Look at this:

https://www.howtobuildsoftware.com/index.php/how-do/4EO/python-nlp-wordnet-get-noun-from-verb-wordnet

R function for pattern matching

Okay, here we go. Using this data (shared nicely with dput()):

df = structure(list(V1 = structure(c(15L, 3L, 11L, 4L, 5L, 9L, 2L, 
16L, 18L, 14L, 13L, 8L, 12L, 20L, 19L, 1L, 7L, 10L, 6L, 17L), .Label = c(",",
".", "american", "are", "catching", "country", "in", "is", "on",
"our", "people", "profoundly", "something", "that", "the", "they",
"today", "understand", "when", "wrong"), class = "factor"), V2 = structure(c(3L,
5L, 7L, 12L, 11L, 10L, 2L, 8L, 12L, 4L, 6L, 13L, 10L, 5L, 14L,
1L, 4L, 9L, 6L, 6L), .Label = c(",", ".", "DT", "IN", "JJ", "NN",
"NNS", "PRP", "PRP$", "RB", "VBG", "VBP", "VBZ", "WRB"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"))

I'll use the stringr package because of its consistent syntax, so I don't have to look up the argument order for grep. We'll first detect the adjectives, then the nouns, and figure out where they line up (offsetting by 1). Then we paste together the words that correspond to the matches.

library(stringr)
adj = str_detect(df$V2, "JJ")
noun = str_detect(df$V2, "NN")

pairs = which(c(FALSE, adj) & c(noun, FALSE))

ngram = paste(df$V1[pairs - 1], df$V1[pairs])
# [1] "american people"
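To make the offset-by-1 step easier to follow, here is a small sketch of the same trick on made-up flags. The vectors and the names prev_is_adj and this_is_noun are hypothetical and exist only for this illustration.

```r
# Toy flags for six tokens (made-up data, not taken from df).
adj  <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
noun <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)

# Prepending FALSE moves every adjective flag one slot to the right, so slot i
# now answers "was token i - 1 an adjective?". Appending FALSE to noun keeps
# both vectors the same length.
prev_is_adj  <- c(FALSE, adj)
this_is_noun <- c(noun, FALSE)

# Positions of nouns that directly follow an adjective.
pairs <- which(prev_is_adj & this_is_noun)
pairs
```

Here pairs is 2 and 6: each value is the index of the noun, and pairs - 1 is the index of the adjective before it. Token 4 is an adjective too, but it is followed by another adjective, so it produces no match.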

Now we can put it in a function. I left the patterns as arguments (with adjective, noun as the defaults) for flexibility.

bigram = function(word, type, patt1 = "JJ", patt2 = "N[A-Z]") {
    pairs = which(c(FALSE, str_detect(type, pattern = patt1)) &
                  c(str_detect(type, patt2), FALSE))
    return(paste(word[pairs - 1], word[pairs]))
}

Demonstrating use on the original data:

with(df, bigram(word = V1, type = V2))
# [1] "american people"

Let's cook up some data with more than one match to make sure it works:

df2 = data.frame(w = c("american", "people", "hate", "a", "big", "bad", "bank"),
                 t = c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN"))
df2
#          w   t
# 1 american  JJ
# 2   people NNS
# 3     hate VBP
# 4        a  DT
# 5      big  JJ
# 6      bad  JJ
# 7     bank  NN

with(df2, bigram(word = w, type = t))
# [1] "american people" "bad bank"

And back to the original to test out a different pattern:

with(df, bigram(word = V1, type = V2, patt1 = "N[A-Z]", patt2 = "V[A-Z]"))
# [1] "people are" "something is"

