Yesterday I wrote about how you could use the spaCy Python library to find proper nouns in a document. Now suppose you want to refine this and find proper nouns that are the subjects of sentences or proper nouns that are direct objects.
This post was motivated by a project in which I needed to pull out company names from a large amount of text, and it was important to know how the company name was being used.
Dependency labels
Tokens in spaCy have a dependency label attribute dep (or dep_ for its string representation). Dependency labels tell you how a word is being used. For example, dobj tells you the word is being used as a direct object, and nsubj tells you it's being used as a nominal subject.
In yesterday’s post the line
if tok.pos_ == "PROPN": print(tok)
filtered tokens to look for proper nouns. We could modify the script to also tell us how the proper nouns are being used by printing tok.dep_.
There are three proper nouns in the opening paragraph of Moby Dick: Ishmael, November, and Cato.
Call me Ishmael. … whenever it is a damp, drizzly November in my soul … With a philosophical flourish Cato throws himself upon his sword …
If we run
if tok.pos_ == "PROPN": print(tok, tok.dep_)
on the first paragraph we get
Ishmael oprd
November attr
Cato nsubj
but it’s not obvious what the output means. If we wrap tok.dep_ with spacy.explain we get a more verbose explanation.
Ishmael object predicate
November attribute
Cato nominal subject
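Here is a minimal sketch of what the whole script might look like. The function name proper_noun_roles and the model name en_core_web_sm are my assumptions for illustration, not something fixed by spaCy or by yesterday's post. The helper takes the explain function as an argument so it can be tried without loading a model. The labels you see may differ from the ones above depending on the model version and on how much of the paragraph you feed in.

```python
def proper_noun_roles(doc, explain):
    """Return (text, label, explanation) for each proper noun in doc.

    `doc` is any iterable of tokens with .text, .pos_ and .dep_ attributes,
    such as a spaCy Doc. `explain` maps a label to a description,
    for example spacy.explain.
    """
    return [
        (tok.text, tok.dep_, explain(tok.dep_))
        for tok in doc
        if tok.pos_ == "PROPN"
    ]


def main():
    # Imported here so the helper above can be used without spaCy installed.
    import spacy

    # Assumption: the small English model has been downloaded.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(
        "Call me Ishmael. With a philosophical flourish "
        "Cato throws himself upon his sword."
    )
    for text, label, meaning in proper_noun_roles(doc, spacy.explain):
        print(text, label, meaning)


# main()  # uncomment to run against a real spaCy model
```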
Pulling out subjects
Now suppose we wanted to pull out words that are subjects. We could filter on tok.dep_ == "nsubj", but there are more kinds of subjects than just nominal subjects. There are six kinds of subjects:
- nsubj: nominal subject
- nsubjpass: nominal passive subject
- csubj: clausal subject
- csubjpass: clausal passive subject
- agent: agent
- expl: expletive
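The filter over all six labels could be sketched as follows. The names SUBJECT_DEPS and proper_noun_subjects are hypothetical, chosen only for this example, and the model name is again an assumption.

```python
# The six subject labels listed above.
SUBJECT_DEPS = {"nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"}


def proper_noun_subjects(doc):
    """Return the proper-noun tokens whose dependency label is a subject label."""
    return [
        tok for tok in doc
        if tok.pos_ == "PROPN" and tok.dep_ in SUBJECT_DEPS
    ]


def main():
    # Imported here so the helper above can be used without spaCy installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model name
    doc = nlp("With a philosophical flourish Cato throws himself upon his sword.")
    for tok in proper_noun_subjects(doc):
        print(tok.text, tok.dep_)


# main()  # uncomment to run against a real spaCy model
```

Testing membership in a set keeps the condition to one line, and swapping in a different set of labels, say {"dobj"} for direct objects, changes what gets pulled out.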
Finding the range of possible values for dependency labels takes some digging. I don’t believe it’s in the spaCy documentation per se, but if you’re persistent you’ll find a link to this list or the paper it came from.
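Another way to see the labels, at least for one particular model, is to ask the loaded parser itself. The sketch below assumes a spaCy version in which the parser component exposes a labels attribute and assumes the en_core_web_sm model is installed. The name label_glossary is hypothetical. Note that spacy.explain returns None for labels it doesn't recognize.

```python
def label_glossary(labels, explain):
    """Pair each dependency label with its description, sorted by label.

    `explain` may return None for a label it does not know.
    """
    return [(label, explain(label)) for label in sorted(labels)]


def main():
    # Imported here so the helper above can be used without spaCy installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model name
    # Labels known to this model's dependency parser.
    parser_labels = nlp.get_pipe("parser").labels
    for label, meaning in label_glossary(parser_labels, spacy.explain):
        print(label, meaning)


# main()  # uncomment to run against a real spaCy model
```

This only shows the labels a given model's parser knows, so it complements the list rather than replacing it.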