Using WordNet to create a PAO system

NLP software infers parts of speech from context. For example, the spaCy NLP software can determine the parts of speech in the poem Jabberwocky even though the words are nonsense. More on this here.

If you want to determine the parts of speech of isolated words, software like spaCy may not be the right tool. You might instead use lists of nouns, verbs, etc. That’s what I’ll do in this post, using the WordNet corpus. I’d like to show how you could use WordNet to create a mnemonic system.

PAO (person-action-object)

One way that people memorize six-digit numbers, or memorize longer numbers six digits at a time, is the PAO (person-action-object) system. They memorize a list of 100 people, 100 actions, and 100 direct objects. The first two digits of a six-digit number are encoded as a person, the next two digits as an action, and the last two digits as an object. For example, “Einstein dances with a broom” might encode 201294 if 20 is associated with Einstein, 12 with dance, and 94 with broom.
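To make the arithmetic concrete, here is a toy sketch of the decoding step. The lookup tables are hypothetical, populated only with the pegs from the example above.

    # Toy PAO decoding sketch; the tables below are hypothetical examples.
    person = {20: "Einstein"}
    action = {12: "dances with"}
    obj    = {94: "a broom"}

    n = 201294
    p, a, o = n // 10000, (n // 100) % 100, n % 100
    print(person[p], action[a], obj[o])  # Einstein dances with a broom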

These mappings could be completely arbitrary, but you could memorize them faster if there were some patterns to the mappings, such as using the Major mnemonic system.

I’ve written before about how to use the CMU Pronouncing Dictionary to create a list of words along with the numbers they correspond to in the Major system. This post will show how to pull out the nouns and verbs from this list. The nouns are potential objects and the verbs are potential actions. I may deal with persons in another post.

Noun and verb lists in WordNet

The WordNet data contains a file index.noun with nouns and other information. We want to discard the first 29 lines of preamble and extract the nouns in the first column of the file. We can do this with the following one-liner.

   tail -n +30 index.noun | cut -d' ' -f1 > nouns.txt

Likewise we can extract a list of verbs with the following

   tail -n +30 index.verb | cut -d' ' -f1 > verbs.txt

There is some overlap between the two lists since some words can be nouns or verbs depending on context. For example, running

   grep '^read$' nouns.txt

and

   grep '^read$' verbs.txt

shows that “read” is in both lists. (The regex above anchors the beginning of the match with ^ and the end with $ so we don’t get unwanted matches like “treadmill” and “readjust.”)

Sorting CMU dictionary by part of speech

The following Python code will parse the file cmu_major.txt from here to pull out a list of nouns and a list of verbs, along with their Major system encodings.

    with open("nouns.txt") as f:
        nouns = set()
        for line in f.readlines():
            nouns.add(line.strip())
    
    with open("verbs.txt") as f:
        verbs = set()
        for line in f.readlines():
            verbs.add(line.strip())
    
    cmunouns = open("cmu_nouns.txt", "w")
    cmuverbs = open("cmu_verbs.txt", "w")
    
    with open("cmu_major.txt") as f:
        for line in f.readlines():
            w = line.split()[0].lower()
            if w in nouns:
                cmunouns.write(line)
            if w in verbs:
                cmuverbs.write(line)

You can download the output if you’d like: cmu_nouns.txt, cmu_verbs.txt.

Going back to the example of 201294, the file cmu_verbs.txt contains 82 words that correspond to numbers starting with 12. And the file cmu_nouns.txt contains 1,057 words that correspond to numbers starting with 94.
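If you want to browse those candidates, a short helper like the following will do. This is a sketch that assumes the two-column word-and-encoding format of cmu_major.txt, which carries over to these files since the script above copies lines verbatim.

    def candidates(filename, prefix):
        """Return words whose Major encoding begins with the given digits."""
        words = []
        with open(filename) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2 and parts[1].startswith(prefix):
                    words.append(parts[0])
        return words

    print(len(candidates("cmu_verbs.txt", "12")))  # candidate actions for 12
    print(len(candidates("cmu_nouns.txt", "94")))  # candidate objects for 94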

Choosing verbs is the hard part. Although there are verbs for every number 00 through 99, many of these would not be good choices. You want active verbs that can be combined with any subject and object.

My impression is that most people who use the PAO system—I do not—pick their people, actions, and objects without regard to the Major system, and I can understand why: your choices are sometimes limited if you want to be compatible with the Major system. You might compromise and use Major-compatible pegs when possible and make exceptions as needed.

Cosine similarity does not satisfy the triangle inequality

The previous post looked at cosine similarity for embeddings of words in vector spaces. Word embeddings like word2vec map words into high-dimensional vector spaces in such a way that related words correspond to vectors that are roughly parallel. Ideally the more similar the words, the smaller the angle between their corresponding vectors.

The cosine similarity of vectors x and y is given by

\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{ ||\mathbf{x} || \,||\mathbf{y} || }

If the vectors are normalized to have length 1 (which word embeddings are typically not) then cosine similarity is just dot product. Vectors are similar when their cosine similarity is near 1 and dissimilar when their cosine similarity is near 0.

If x is similar to y and y is similar to z, is x similar to z? We could quantify this as follows.

Let cossim(x, y) be the cosine similarity between x and y.

If cossim(x, y) = 1 − ε1 and cossim(y, z) = 1 − ε2, is cossim(x, z) at least 1 − ε1 − ε2? In other words, does the complement of cosine similarity satisfy the triangle inequality?

Counterexample

The answer is no. I wrote a script to search for a counterexample by generating random points on a sphere. Here’s one of the examples it came up with.

    x = [−0.9289  −0.0013   0.3704]
    y = [−0.8257   0.3963   0.4015]
    z = [−0.6977   0.7163  −0.0091]

Let d1 = 1 − cossim(x, y), d2 = 1 − cossim(y, z), and d3 be 1 − cossim(x, z).

Then d1 = 0.0849, d2 = 0.1437, and d3 = 0.3562 and so d3 > d1 + d2.
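Here is a minimal sketch of the kind of search script described above. It draws random points on the unit sphere until it finds a triple where the complement of cosine similarity violates the triangle inequality.

    import numpy as np

    rng = np.random.default_rng()

    def random_unit_vector(dim=3):
        v = rng.normal(size=dim)
        return v / np.linalg.norm(v)

    def cossim(x, y):
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    # Try random triples until a violation of the triangle inequality appears.
    for _ in range(100_000):
        x, y, z = (random_unit_vector() for _ in range(3))
        d1, d2, d3 = 1 - cossim(x, y), 1 - cossim(y, z), 1 - cossim(x, z)
        if d3 > d1 + d2:
            print(x, y, z)
            print(d1, d2, d3)
            break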

Triangle inequality

The triangle inequality does hold if we work with θs rather than their cosines. The angle θ between two vectors is the distance between these two vectors interpreted as points on a sphere and the triangle inequality does hold on the sphere.

Approximate triangle inequality

If the cosine similarity between x and y is close to 1, and the cosine similarity between y and z is close to 1, then the cosine similarity between x and z is close to one, though the triangle inequality may not hold. I wrote about this before in the post Nearly parallel is nearly transitive.

I wrote in that post that the law of cosines for spherical trigonometry says

cos c = cos a cos b + sin a sin b cos γ

where γ is the angle between the arcs a and b. If cos a = 1 − ε1, and cos b = 1 − ε2, then

cos a cos b = (1 − ε1)(1 − ε2) = 1 − ε1 − ε2 + ε1 ε2

If ε1 and ε2 are small, then their product is an order of magnitude smaller. Also, the term

sin a sin b cos γ

is of the same order of magnitude. So if 1 − cossim(x, y) = ε1 and 1 − cossim(y, z) = ε2 then

1 − cossim(x, z) = ε1 + ε2 + O(ε1 ε2)

Is the triangle inequality desirable?

Cosine similarity does not satisfy the triangle inequality, but do we want a measure of similarity between words to satisfy the triangle inequality? You could argue that semantic similarity isn’t transitive. For example, lion is similar to king in that a lion is a symbol for a king. And lion is similar to house cat in that both are cats. But king and house cat are not similar. So the fact that cosine similarity does not satisfy the triangle inequality might be a feature rather than a bug.

Angles between words

Natural language processing represents words as high-dimensional vectors, on the order of 100 dimensions. For example, the glove-wiki-gigaword-50 set of word vectors contains 50-dimensional vectors, and the glove-wiki-gigaword-200 set of word vectors contains 200-dimensional vectors.

The intent is to represent words in such a way that the angle between vectors is related to similarity between words. Closely related words would be represented by vectors that are close to parallel. On the other hand, words that are unrelated should have large angles between them. The metaphor of two independent things being orthogonal holds almost literally as we’ll illustrate below.

Cosine similarity

For vectors x and y in two dimensions,

\mathbf{x} \cdot \mathbf{y} = ||\mathbf{x} || \,||\mathbf{y} || \, \cos(\theta)

where θ is the angle between the vectors. In higher dimensions, this relation defines the angle θ in terms of the dot product and norms:

\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{ ||\mathbf{x} || \,||\mathbf{y} || }

The right-hand side of this equation is the cosine similarity of x and y. NLP usually speaks of cosine similarity rather than θ, but you could always take the inverse cosine of cosine similarity to compute θ. Note that cos(0) = 1, so small angles correspond to large cosines.

Examples

For our examples we’ll use gensim with word vectors from the glove-twitter-200 model. As the name implies, this data set maps words to 200-dimensional vectors.

Note that word embeddings differ in the data they were trained on and the algorithm used to produce the vectors. The examples below could be very different using a different source of word vectors.

First some setup code.

    
    import numpy as np
    import gensim.downloader as api

    # Download (on first use) and load the 200-dimensional Twitter GloVe vectors.
    word_vectors = api.load("glove-twitter-200")

    def norm(word):
        v = word_vectors[word]
        return np.dot(v, v)**0.5

    def cosinesim(word0, word1):
        v = word_vectors[word0]
        w = word_vectors[word1]
        return np.dot(v, w)/(norm(word0)*norm(word1))

Using this model, the cosine similarity between “dog” and “cat” is 0.832, which corresponds to about a 34° angle. The cosine similarity between “dog” and “wrench” is 0.145, which corresponds to an angle of 82°. A dog is more like a cat than like a wrench.
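If you’d rather see the angle itself, a small helper on top of the setup code above converts cosine similarity to degrees. The printed values below are approximate.

    def angle(word0, word1):
        """Angle in degrees between the vectors for two words."""
        return np.degrees(np.arccos(cosinesim(word0, word1)))

    print(angle("dog", "cat"))     # roughly 34
    print(angle("dog", "wrench"))  # roughly 82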

The similarity between “dog” and “leash” is 0.487, not because a dog is like a leash, but because the word “leash” is often used in the same context as the word “dog.” The similarity between “cat” and “leash” is only 0.328 because people speaking of leashes are more likely to also be speaking about a dog than a cat.

The cosine similarity between “uranium” and “walnut” is only 0.0054, corresponding to an angle of 89.7°. The vectors associated with the two words are very nearly orthogonal because the words are orthogonal in the metaphorical sense.

Note that opposites are somewhat similar. Uranium is not the opposite of walnut because things have to have something in common to be opposites. The cosine similarity of “expensive” and “cheap” is 0.706. Both words are adjectives describing prices and so in some sense they’re similar, though they have opposite valence. “Expensive” has more in common with “cheap” than with “pumpkin” (similarity 0.192).

The similarity between “admiral” and “general” is 0.305, maybe less than you’d expect. But the word “general” is kinda general: it is used in many contexts other than the military rank. If you add the vectors for “army” and “general”, you get a vector that has cosine similarity 0.410 with “admiral.”
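Continuing with the setup code above, here is a sketch of that last calculation, working with the raw vectors rather than the cosinesim helper.

    # Compare the sum of the "army" and "general" vectors to "admiral".
    v = word_vectors["army"] + word_vectors["general"]
    w = word_vectors["admiral"]
    print(np.dot(v, w) / (np.dot(v, v)**0.5 * np.dot(w, w)**0.5))  # about 0.41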

Sort and remove duplicates

A common idiom in command line processing of text files is

    ... | sort | uniq | ...

Some process produces lines of text. You want to pipe that text through sort to sort the lines in alphabetical order, then pass it to uniq to filter out all but the unique lines. The uniq utility only removes adjacent duplicates, and so it will not remove all duplicates unless the input is sorted. (At least the duplicate lines need to be grouped together; the groups need not be in any particular order.)

When given the -u flag, sort will sort and remove duplicates. This means the idiom above can be shortened to

    ... | sort -u | ...

Aside from saving a few keystrokes, is there any advantage to the latter? There could be, depending on how sort -u is implemented. If internally it simply sorts its input and then removes duplicates, then there is no advantage. But if the code simultaneously sorts and removes duplicates, it could save memory and time, depending on the input. If the code discarded duplicates as they were encountered, the code would need working memory proportional to the amount of unique input rather than the total amount of input.

I had a project recently that makes a good test case for this. The Gutenberg text corpus contains a list of words for 55,000 different sources, each in a separate text file. There are a lot of files, and there is a lot of redundancy between files. The combined file is 3.4 GB.

Running sort | uniq on the combined file took 610 seconds.

Running sort -u on the file took 394 seconds.

So in this example, sort -u not only saved a few keystrokes, it took about 35% off the time.

I expected it would save even more time. Maybe a custom program optimized for large files with a lot of redundancy could be even faster.

Update: awk

Bob Lyon’s comment made me think about using awk without concatenating the files together. Each line of the 55,000 text files contains a word and a count. I didn’t concatenate the files per se but piped each through cut -f 1 to select the first column.

Using awk I created an associative array (a hash table, but awk calls them associative arrays) with an entry for each word, then printed out the keys of the array.

    awk '{count[$1]++}; 
        END {for (key in count) {print key}}' | 
        sort > out

This ran in 254 seconds, 58% faster than sort | uniq and 36% faster than sort -u.

There is a lot of redundancy between the files—the list of unique words is 320 times smaller than the union of all the input files—and the awk script takes advantage of this by maintaining an array roughly the size of the output rather than the size of the input.

Tim Chase suggested a more elegant solution:

    awk '!count[$1]++ {print $1}' *.txt | sort > out

It seems this should be more efficient as well since awk only goes through the data once, immediately printing new values as they are encountered. However, it actually took slightly longer, 263 seconds.

ARPAbet and the Major mnemonic system

ARPAbet is a phonetic spelling system developed by—you guessed it—ARPA, before it became DARPA.

The ARPAbet system is less expressive than IPA, but much easier for English speakers to understand. Every sound is encoded as one or two English letters. So, for example, the sound denoted ʒ in IPA is ZH in ARPAbet.

In ARPAbet notation, the Major mnemonic system can be summarized as follows:

0: S or Z
1: D, DH, T, or TH
2: N or NG
3: M
4: R
5: L
6: CH, JH, SH, or ZH
7: G or K
8: F or V
9: P or B

Numbers are encoded using the consonant sounds above; the system is based on sounds and not on spelling. You can insert any vowels or semivowels (e.g. w or y) you like. For example, you could encode 648 as “giraffe” or 85 as “waffle.”

The CMU Pronouncing Dictionary lists 134,373 words along with their ARPAbet pronunciation. The Python code below will read in the pronouncing dictionary and produce a Major mnemonic dictionary. The resulting file is available here as a zip compressed text file.

To find a word that encodes a number, search the code output for that number. For example,

    grep ' 648' cmu_major.txt

will find words whose Major encoding begins with 648, and

    grep ' 648$' cmu_major.txt

will find words whose Major encoding is exactly 648.

From this we learn that “sheriff” is another possible encoding for 648.

Filling in the gaps

Suppose you’re looking for encodings for all three-digit numbers, 000 through 999. This can be hard to do. A common compromise is to consider only the first three consonant sounds in a word. For example, you might use “ladybug” to encode 519, ignoring the final G sound.

The tradeoff is that if you adopt this rule then you can’t use “ladybug” to encode 5197. But finding single words that encode 4-digit numbers can be challenging if not impossible, so you may just forego the possibility. (I quantify this here.) This is why in the example above I show both searching for numbers that begin with 648 and numbers that are exactly 648.

Despite the large size of the CMU dictionary, there are 42 three-digit numbers for which it contains no words whose encoding begins with those digits. I can offer suggestions for these numbers, but it’s hard to use anyone else’s mnemonics. You may have to make up your own, using, for example, names of people you know personally or brand names you’re familiar with.

Python code

# NB: File encoding is Latin-1, not UTF-8.
with open("cmudict-0.7b", "r", encoding="latin-1") as f:
    lines = f.readlines()

for line in lines:
    # Strings are immutable, so reassign to remove stress notation.
    line = line.replace('0', '')
    line = line.replace('1', '')
    line = line.replace('2', '')
    
    pieces = line.split()
    numstr = ""
    for p in pieces[1:]:
        match p:
            case "S" | "Z":
                numstr += "0"
            case "D" | "DH" | "T" | "DH":
                numstr += "1"
            case "N" | "NG":
                numstr += "2"
            case "M":
                numstr += "3"
            case "R":
                numstr += "4"
            case "L":
                numstr += "5"
            case "CH" | "JH" | "SH" | "ZH":
                numstr += "6"
            case "G" | "K":
                numstr += "7"
            case "F" | "V":
                numstr += "8"
            case "P" | "B":
                numstr += "9"
    print(pieces[0], numstr)

Named entity recognition

Named entity recognition (NER) is a task of natural language processing: pull out named things from text. It sounds trivial at first. Just create a giant list of named things and compare against that.

But suppose, for example, University of Texas is on your list. If Texas is also on your list, do you report that you have a named entity inside a named entity? And how do you handle The University of Texas? Do you put it on your list as well? What about UT? Can you tell from context whether UT stands for University of Texas, University of Toronto, or the state of Utah?

Searching for Rice University would be even more fun. The original name of the school was The William Marsh Rice Institute for the Advancement of Letters, Science, and Art. I don’t know whether the name was ever officially changed. A friend who went to Rice told me they had a ridiculous cheer that spelled out every letter in the full name. And of course rice could refer to a grain.

Let’s see what happens when we run the following sentence through spaCy looking for named entities.

Researchers from the University of Texas at Austin organized a pickleball game with their colleagues from Rice University on Tuesday.
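Here is a minimal sketch of that experiment, assuming one of the spaCy English models mentioned below has been downloaded.

    import spacy

    nlp = spacy.load("en_core_web_sm")  # or en_core_web_trf
    text = ("Researchers from the University of Texas at Austin organized a "
            "pickleball game with their colleagues from Rice University on Tuesday.")

    # Print each named entity along with its label (ORG, GPE, DATE, ...).
    for ent in nlp(text).ents:
        print(ent.text, ent.label_)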

I deliberately did not capitalize the definite article in front of University of Texas because I suspected spaCy might include the article if it were capitalized but not otherwise. It included the article in either case.

The results depend on the language model used. When I used en_core_web_trf it included at Austin as part of the university name.

When I used the smaller en_core_web_sm model it pulled out Austin as a separate entity.

The tag ORG stands for organization and DATE obviously stands for date. GPE is a little less obvious, standing for geopolitical entity.

When I changed Rice University to simply Rice, spaCy still recognized Rice as an organization. When I changed it to rice with no capitalization, it did not recognize it as an organization.

The other day I stress tested spaCy by giving it some text from Chaucer’s Canterbury Tales. Even though spaCy is trained on Modern English, it did better than I would have expected on Middle English.

Using the en_core_web_trf model it recognizes Engelond and Caunterbury as cities.

When I switched to en_core_web_sm it still recognized Caunterbury as a city, but tagged Engelond as a person.

 

Trying NLP on Middle English

It’s not fair to evaluate NLP software on a language it wasn’t designed to process, but I wanted to try it anyway.

The models in the spaCy software library were trained on modern English text and not on Middle English. Nevertheless, spaCy does a pretty good job of parsing Chaucer’s Canterbury Tales, written over 600 years ago. I used the model en_core_web_lg in my little experiment below.

The text I used comes from the prologue:

From every shires ende
of Engelond to Caunterbury they wende
the hooly blisful martir for to seke
that hem hath holpen
whan that they were seeke.

The software correctly identifies, for example, wende (went) and seke (seek) as verbs, and seeke (sick) as an adjective. Overall it does a pretty good job. I imagine it would do worse on Middle English text that differed more from Modern English usage, but so would a contemporary human who doesn’t know Middle English.
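Here is a sketch of the experiment, assuming the en_core_web_lg model has been downloaded.

    import spacy

    nlp = spacy.load("en_core_web_lg")
    text = ("From every shires ende of Engelond to Caunterbury they wende "
            "the hooly blisful martir for to seke that hem hath holpen "
            "whan that they were seeke.")

    # Print each token with its inferred part of speech.
    for tok in nlp(text):
        print(tok.text, tok.pos_)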

Natural language processing and unnatural text

I recently evaluated two software applications designed to find PII (personally identifiable information) in free text using natural language processing. Both failed badly, passing over obvious examples of PII. By contrast, I also tried natural language processing software on a nonsensical poem, and the software did quite well.

Doctor’s notes

It occurred to me later that the software packages to search for PII probably assume “natural language” has the form of fluent prose, not choppy notes by physicians. The notes that I tested did not consist of complete sentences marked up with grammatically correct punctuation. The text may have been transcribed from audio.

Some software packages deidentify medical notes better than others. I’ve seen some work well and some work poorly. I suspect the former were written specifically for their purpose and the latter were more generic.

Jabberwocky

I also tried NLP software on Lewis Carroll’s poem Jabberwocky. It too is unnatural language, but in a different sense.

Jabberwocky uses nonsense words that Carroll invented for the poem, but otherwise it is grammatically correct. The poem is standard English at the level of structure, though not at the level of words. It is the opposite of medical notes that are standard English at the word level (albeit with a high density of technical terms), but not at a structural level.

I used the spaCy natural language processing library on a couple of stanzas from Carroll’s poem.

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

He took his vorpal sword in hand;
Long time the manxome foe he sought—
So rested he by the Tumtum tree
And stood awhile in thought.

I fed the lines into spaCy and asked it to diagram the lines, indicating parts of speech and dependencies. The software did a good job of inferring the use of even the nonsense words. I gave the software one line at a time rather than a stanza at a time because the latter results in diagrams that are awkwardly wide, too wide to display here. (The spaCy visualization software has a “compact” option, but this option does not make the visualizations much more compact.)

Here are the visualizations of the lines.

[Dependency diagrams for each line omitted.]
And here is the Python code I used to create the diagrams above.

    import spacy
    from spacy import displacy
    from pathlib import Path
    
    nlp = spacy.load("en_core_web_sm")
        
    lines = [
        "Beware the Jabberwock, my son!",
        "The jaws that bite, the claws that catch!",
        "Beware the Jubjub bird",
        "Shun the frumious Bandersnatch!",
        "He took his vorpal sword in hand.",
        "Long time the manxome foe he sought",
        "So rested he by the Tumtum tree",
        "And stood awhile in thought."
    ]
    
    for line in lines:
        doc = nlp(line)
        svg = displacy.render(doc, style="dep", jupyter=False)    
        file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg"
        output_path = Path(file_name)
        output_path.open("w", encoding="utf-8").write(svg)

Filtering on how words are being used

Yesterday I wrote about how you could use the spaCy Python library to find proper nouns in a document. Now suppose you want to refine this and find proper nouns that are the subjects of sentences or proper nouns that are direct objects.

This post was motivated by a project in which I needed to pull out company names from a large amount of text, and it was important to know how the company name was being used.

Dependency labels

Tokens in spaCy have a dependency label attribute dep (or dep_ for its string representation). Dependency labels tell you how a word is being used. For example, dobj tells you the word is being used as a direct object, and nsubj tells you it’s being used as a nominal subject.

In yesterday’s post the line

    if tok.pos_ == "PROPN":
        print(tok)

filtered tokens to look for proper nouns. We could modify the script to also tell us how the proper nouns are being used by printing tok.dep_.

There are three proper nouns in the opening paragraph of Moby Dick: Ishmael, November, and Cato.

Call me Ishmael. … whenever it is a damp, drizzly November in my soul … With a philosophical flourish Cato throws himself upon his sword …

If we run

    if tok.pos_ == "PROPN":
        print(tok, tok.dep_)

on the first paragraph we get

    Ishmael oprd
    November attr
    Cato nsubj

but it’s not obvious what the output means. If we wrap tok.dep_ with spacy.explain we get a more verbose explanation.
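That is, change the print statement to something like this, where spacy.explain maps a label to a human-readable description:

    if tok.pos_ == "PROPN":
        print(tok, spacy.explain(tok.dep_))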

    Ishmael object predicate
    November attribute
    Cato nominal subject

Pulling out subjects

Now suppose we wanted to pull out words that are subjects. We could filter on tok.dep_ == "nsubj" but there are more kinds of subjects than just nominal subjects. There are six kinds of subjects:

  1. nsubj: nominal subject
  2. nsubjpass: nominal passive subject
  3. csubj: clausal subject
  4. csubjpass: clausal passive subject
  5. agent: agent
  6. expl: expletive
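Putting this together, here is a sketch that keeps only proper nouns used as some kind of subject, assuming doc is the parsed document from the script in the earlier post.

    # Dependency labels covering the six kinds of subjects listed above.
    SUBJECT_DEPS = {"nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"}

    for tok in doc:
        if tok.pos_ == "PROPN" and tok.dep_ in SUBJECT_DEPS:
            print(tok, spacy.explain(tok.dep_))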

Finding the range of possible values for dependency labels takes some digging. I don’t believe it’s in the spaCy documentation per se, but if you’re persistent you’ll find a link to this list or the paper it came from.

Searching for proper nouns

Suppose you want to find all the proper nouns in a document. You could grep for every word that starts with a capital letter with something like

    grep '\b[A-Z]\w+'

but this would return the first word of each sentence in addition to the words you’re after.

You could grep for capitalized words that are not preceded by a period or question mark followed by a space.

    grep -P '(?<![.?] )\b[A-Z]\w+'

That’s possibly better, but it misses proper nouns at the beginning of a sentence.

You might be able to accomplish what you’re after by tinkering with regular expressions, but it would be better to use a library that has some idea of what a proper noun is.

NLP with spaCy

The Python natural language processing library spaCy classifies words by part of speech, and so in particular can be used to search for proper nouns.

Here’s an example using the opening lines of Moby Dick.

    import spacy
    nlp = spacy.load("en_core_web_lg")

    doc = nlp("Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul ... I account it high time to get to sea as soon as I can.")

    for tok in doc:
        if tok.pos_ == "PROPN":
            print(tok)

This will print Ishmael and November only. It does not print words at the beginning of a sentence such as Call or Some even though they are capitalized. When spaCy got to the line

Queequeg was George Washington cannibalistically developed.

it detected that Queequeg is a proper noun. Presumably the model can tell this from context, because the word precedes the verb was, and not because it knows Queequeg is a proper name.

When I changed November to november spaCy was still able to detect that november was a proper noun. When I downcased Ishmael it did not detect that ishmael was a proper noun, presumably because Ishmael is an uncommon name. When I changed the text to “Call me tim” the library did recognize tim as a proper noun.

When I fed spaCy the sentence

I never go as a passenger; nor, though I am something of a salt, do I ever go to sea as a Commodore, or a Captain, or a Cook.

the library thought that Commodore, Captain, and Cook were proper nouns. If I downcase these words, spaCy does not flag them as proper nouns.

When processing the line

For as in this world, head winds are far more prevalent than winds from astern (that is, if you never violate the Pythagorean maxim), so for the most part the Commodore on the quarter-deck gets his atmosphere at second hand from the sailors on the forecastle

spaCy correctly flagged Commodore as a proper noun in this instance. Also, it did not classify Pythagorean as a proper noun; the word is proper but not a noun, i.e. it’s a proper adjective.

TANSTAAFL

My script above has only six lines of code. But it depends on a library that uses a 588 MB language model. [1]

[1] “TANSTAAFL” stands for “There ain’t no such thing as a free lunch.” It comes from The Moon is a Harsh Mistress by Heinlein.

Incidentally, when I fed “The term TANSTAAFL comes from The Moon is a Harsh Mistress by Heinlein.” to spaCy, it flagged Harsh and Mistress as proper nouns.

When I fed it “The term TANSTAAFL comes from ‘The moon is a harsh mistress’ by Heinlein.” the library correctly tagged harsh as an adjective and mistress as a (non-proper) noun.