Extracting Noun+Noun or (Adj|Noun)+Noun from Text

How to extract noun and adjective pairs including conjunctions

You may wish to try noun_chunks:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('I got a red candy and an interesting and big book.')

noun_adj_pairs = {}
for chunk in doc.noun_chunks:
    adj = []
    noun = ""
    for tok in chunk:
        if tok.pos_ == "NOUN":
            noun = tok.text
        if tok.pos_ == "ADJ":
            adj.append(tok.text)
    if noun:
        noun_adj_pairs.update({noun: adj})

# expected output
noun_adj_pairs
{'candy': ['red'], 'book': ['interesting', 'big']}

Should you wish to include conjunctions:

noun_adj_pairs = {}
for chunk in doc.noun_chunks:
    adj = []
    noun = ""
    for tok in chunk:
        if tok.pos_ == "NOUN":
            noun = tok.text
        # Keep coordinating conjunctions so "interesting and big" stays intact
        if tok.pos_ in ("ADJ", "CCONJ"):
            adj.append(tok.text)
    if noun:
        noun_adj_pairs.update({noun: " ".join(adj)})

noun_adj_pairs
{'candy': 'red', 'book': 'interesting and big'}
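
One caveat with the loops above: the dictionary is keyed by the noun's text, so a noun that occurs in two chunks keeps only the adjectives from its last chunk. Below is a minimal sketch of accumulating them instead. It works on hand-tagged (text, POS) tuples so that it runs without a model; the collect_pairs helper and the sample chunks are hypothetical stand-ins for doc.noun_chunks, not part of spaCy.

```python
from collections import defaultdict


def collect_pairs(chunks):
    """Map each noun to every adjective seen with it across all chunks.

    `chunks` is a list of noun chunks, each a list of (text, pos) tuples.
    This input format is an assumption made for illustration.
    """
    pairs = defaultdict(list)
    for chunk in chunks:
        adjs = [text for text, pos in chunk if pos == "ADJ"]
        nouns = [text for text, pos in chunk if pos == "NOUN"]
        if nouns:
            # As in the loops above, the last noun in the chunk is the key.
            pairs[nouns[-1]].extend(adjs)
    return dict(pairs)


# Hand-tagged chunks; real tags would come from the tagger.
chunks = [
    [("a", "DET"), ("red", "ADJ"), ("book", "NOUN")],
    [("an", "DET"), ("interesting", "ADJ"), ("and", "CCONJ"),
     ("big", "ADJ"), ("book", "NOUN")],
]
print(collect_pairs(chunks))
# {'book': ['red', 'interesting', 'big']}
```

With the original update() call, the same input would leave only ['interesting', 'big'] under 'book'.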

How to extract all possible noun phrases from text

You may wish to make use of the noun_chunks attribute:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.')

phrases = set()
for nc in doc.noun_chunks:
    # The base noun chunk itself
    phrases.add(nc.text)
    # The full span covered by the chunk root's subtree
    phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i+1].text)
print(phrases)
{'junctions of columns and beams', 'junctions', 'the items', 'a shoulder', 'columns', 'richer mix', 'beams', 'columns and beams', 'a shoulder of richer mix', 'these junctions'}

Extracting noun+noun or (adj|noun)+noun from Text

It is possible.

EDIT:

You got it. Run the POS tagger, then split the tagged string on spaces: ll <- strsplit(acqTag, ' '). From there, iterate over the tokens in ll[[1]] (37 of them in this example), splitting each one on '/' to separate the word from its tag:
for (i in seq_along(ll[[1]])) { qq <- strsplit(ll[[1]][i], '/') }
and pick out the part-of-speech sequence you're looking for.

After splitting on spaces, it is just list processing in R.
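
For comparison, here is the same split-and-scan idea sketched in Python. It assumes the tagger emits space-separated word/TAG tokens with Penn Treebank tags (JJ... for adjectives, NN... for nouns); the sample string and the extract_pairs helper are made up for illustration.

```python
def extract_pairs(tagged):
    """Return (modifier, noun) bigrams where the modifier is an adjective or a noun."""
    # Split on spaces, then split each token on its last '/' into [word, tag].
    # Assumes every token contains a '/'.
    tokens = [tok.rsplit("/", 1) for tok in tagged.split()]
    pairs = []
    # Walk over adjacent token pairs and keep (JJ|NN) followed by NN.
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        if t1.startswith(("JJ", "NN")) and t2.startswith("NN"):
            pairs.append((w1, w2))
    return pairs


tagged = "The/DT new/JJ credit/NN card/NN arrived/VBD"
print(extract_pairs(tagged))
# [('new', 'credit'), ('credit', 'card')]
```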

Is there a method to extract noun-adjective pairs from a sentence in French?

I wrote something using stanza for its high-quality dependency parsing. It should not be a lot of work to convert this to spaCy if you need that specifically. Recursion is needed to find embedded structures. Note that this is aimed at constructions where an adjective is the parent of the subject you are interested in, and it does not cover every adjectival position. For example, it will not find adjectives like the one in "La belle voiture".

import stanza

nlp = stanza.Pipeline("fr")

doc = nlp("La voiture est belle et jolie, et grand. Le tableau qui est juste en dessous est grand. La femme intelligente et belle est grande. Le service est rapide et les plats sont délicieux.")

def recursive_find_adjs(root, sent):
    children = [w for w in sent.words if w.head == root.id]

    if not children:
        return []

    filtered_c = [w for w in children if w.deprel == "conj" and w.upos == "ADJ"]
    # Do not include an adjective if it is the parent of a noun, to prevent
    # picking up the predicate of another clause (e.g. "délicieux" in
    # "Le service est rapide et les plats sont délicieux")
    results = [w for w in filtered_c if not any(sub.head == w.id and sub.upos == "NOUN" for sub in sent.words)]
    for w in children:
        results += recursive_find_adjs(w, sent)

    return results

for sent in doc.sentences:
    nouns = [w for w in sent.words if w.upos == "NOUN"]
    noun_adj_pairs = {}
    for noun in nouns:
        # Find constructions in the form of "La voiture est belle"
        # In this scenario, the adjective is the parent of the noun
        cop_root = sent.words[noun.head-1]
        adjs = ([cop_root] + recursive_find_adjs(cop_root, sent)) if cop_root.upos == "ADJ" else []

        # Find constructions in the form of "La femme intelligente et belle"
        # Here, the adjectives are descendants of the noun
        mod_adjs = [w for w in sent.words if w.head == noun.id and w.upos == "ADJ"]
        # This should only be one element because conjunctions are hierarchical
        if mod_adjs:
            mod_adj = mod_adjs[0]
            adjs.extend([mod_adj] + recursive_find_adjs(mod_adj, sent))

        if adjs:
            # De-duplicate by word id while keeping the original order
            unique_adjs = []
            unique_ids = set()
            for adj in adjs:
                if adj.id not in unique_ids:
                    unique_adjs.append(adj)
                    unique_ids.add(adj.id)

            noun_adj_pairs[noun.text] = " ".join([adj.text for adj in unique_adjs])

    print(noun_adj_pairs)

This will output:

{'voiture': 'belle jolie grand'}
{'tableau': 'grand'}
{'femme': 'grande belle intelligente'}
{'service': 'rapide', 'plats': 'délicieux'}

How to extract noun adjective pairs from a sentence

spaCy's POS tagger would be a better choice than NLTK's: it is faster and, in my experience, more accurate. Here is an example of what you want to do:

import spacy
# The 'en' shortcut no longer works in spaCy 3; load the model by its full name
nlp = spacy.load('en_core_web_sm')
doc = nlp('Mark and John are sincere employees at Google.')
noun_adj_pairs = []
for i, token in enumerate(doc):
    if token.pos_ not in ('NOUN', 'PROPN'):
        continue
    # Pair the noun with the first adjective that follows it
    for j in range(i+1, len(doc)):
        if doc[j].pos_ == 'ADJ':
            noun_adj_pairs.append((token, doc[j]))
            break
noun_adj_pairs

Output:

[(Mark, sincere), (John, sincere)]
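
Note that this forward scan pairs each noun with the first adjective that comes after it, so "employees", the noun that "sincere" directly precedes, ends up unpaired. The sketch below replays the same scan on hand-tagged tokens to make that visible. The tags are assumed for illustration rather than produced by a model, and forward_pairs is a hypothetical helper.

```python
def forward_pairs(tagged):
    """Pair each noun or proper noun with the first adjective after it."""
    pairs, unpaired = [], []
    for i, (text, pos) in enumerate(tagged):
        if pos not in ("NOUN", "PROPN"):
            continue
        # First adjective to the right of this token, or None if there is none.
        adj = next((t for t, p in tagged[i + 1:] if p == "ADJ"), None)
        if adj is None:
            unpaired.append(text)
        else:
            pairs.append((text, adj))
    return pairs, unpaired


# Hand-assigned tags for the example sentence (an assumption, not model output).
tagged = [
    ("Mark", "PROPN"), ("and", "CCONJ"), ("John", "PROPN"), ("are", "AUX"),
    ("sincere", "ADJ"), ("employees", "NOUN"), ("at", "ADP"),
    ("Google", "PROPN"), (".", "PUNCT"),
]
pairs, unpaired = forward_pairs(tagged)
print(pairs)     # [('Mark', 'sincere'), ('John', 'sincere')]
print(unpaired)  # ['employees', 'Google']
```

If adjectives that precede their noun matter for your data, the noun_chunks approach is likely the better fit.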


