Searching for proper nouns

Suppose you want to find all the proper nouns in a document. You could grep for every word that starts with a capital letter with something like

    grep -E '\b[A-Z]\w+'

but this would return the first word of each sentence in addition to the words you’re after.

You could grep for capitalized words that are not preceded by a period or question mark followed by a space.

    grep -P '(?<![.?] )\b[A-Z]\w+'

That’s possibly better, but it misses proper nouns at the beginning of a sentence.
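To see concretely what each pattern keeps and drops, here is a small sketch that uses Python's re module in place of grep. The sample text is made up for illustration, and the sketch assumes Python's regex dialect treats these two patterns the same way grep -P does.

```python
import re

# Sample text assumed for illustration; it is not taken from the post.
text = "Call me Ishmael. Some years ago I sailed from Nantucket. Queequeg came along."

# Every capitalized word of two or more characters.
capitalized = re.findall(r"\b[A-Z]\w+", text)

# Capitalized words not preceded by a period or question mark plus a space.
not_after_stop = re.findall(r"(?<![.?] )\b[A-Z]\w+", text)

print(capitalized)
print(not_after_stop)
```

The first pattern returns Call, Ishmael, Some, Nantucket, and Queequeg. The second drops Some but also drops Queequeg, a proper noun that happens to start a sentence, and it still keeps Call because nothing precedes it.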

You might be able to accomplish what you’re after by tinkering with regular expressions, but it would be better to use a library that has some idea of what a proper noun is.

NLP with spaCy

The Python natural language processing library spaCy classifies words by part of speech, and so could in particular search for proper nouns.

Here’s an example using the opening lines of Moby Dick.

    import spacy
    nlp = spacy.load("en_core_web_lg")

    doc = nlp("Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul ... I account it high time to get to sea as soon as I can.")

    for tok in doc:
        if tok.pos_ == "PROPN":
            print(tok)

This will print Ishmael and November only. It does not print words at the beginning of a sentence such as Call or Some even though they are capitalized. When spaCy got to the line

Queequeg was George Washington cannibalistically developed.

it detected that Queequeg is a proper noun. Presumably the model can tell this from context, because the word precedes the verb was, and not because it knows Queequeg is a proper name.

When I changed November to november spaCy was still able to detect that november was a proper noun. When I downcased Ishmael it did not detect that ishmael was a proper noun, presumably because Ishmael is an uncommon name. When I changed the text to “Call me tim” the library did recognize tim as a proper noun.

When I fed spaCy the sentence

I never go as a passenger; nor, though I am something of a salt, do I ever go to sea as a Commodore, or a Captain, or a Cook.

the library thought that Commodore, Captain, and Cook were proper nouns. If I downcase these words, spaCy does not flag them as proper nouns.

When processing the line

For as in this world, head winds are far more prevalent than winds from astern (that is, if you never violate the Pythagorean maxim), so for the most part the Commodore on the quarter-deck gets his atmosphere at second hand from the sailors on the forecastle

spaCy correctly flagged Commodore as a proper noun in this instance. Also, it did not classify Pythagorean as a proper noun; the word is proper but not a noun, i.e. it’s a proper adjective.

TANSTAAFL

My script above has only six lines of code. But it depends on a library that uses a 588 MB language model. [1]

[1] “TANSTAAFL” stands for “There ain’t no such thing as a free lunch.” It comes from The Moon is a Harsh Mistress by Heinlein.

Incidentally, when I fed “The term TANSTAAFL comes from The Moon is a Harsh Mistress by Heinlein.” to spaCy, it flagged Harsh and Mistress as proper nouns.

When I fed it “The term TANSTAAFL comes from ‘The moon is a harsh mistress’ by Heinlein.” the library correctly tagged harsh as an adjective and mistress as a (non-proper) noun.

Moments of Tukey’s g-and-h distribution

John Tukey developed his so-called g-and-h distribution to be very flexible, having a wide variety of possible values of skewness and kurtosis. Although the reason for the distribution’s existence is its range of possible skewness and kurtosis values, calculating the skewness and kurtosis of the distribution is not simple.

Definition

Let φ be the function of one variable and four parameters defined by

\varphi(x; a, b, g, h) = a + b\left( \frac{\exp(gx) - 1}{g} \right) \exp(hx^2/2)

A random variable Y has a g-and-h distribution if it has the same distribution as φ(Z; a, b, g, h) where Z is a standard normal random variable. Said another way, if Y has a g-and-h distribution then the transformation φ⁻¹ makes the data normal.

The a and b parameters are for location and scale. The name of the distribution comes from the parameters g and h that control skewness and kurtosis respectively.

The transformation φ is invertible but φ⁻¹ does not have a closed form; φ⁻¹ must be computed numerically. It follows that the density function for Y does not have a closed form either.
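As a rough illustration of computing the inverse numerically, here is a minimal bisection sketch. It assumes b > 0 and h ≥ 0, in which case φ is increasing in x, and it assumes the solution lies between −10 and 10. The function names, the bracket, and the parameter values are all choices made for this example.

```python
import math

def phi(x, a, b, g, h):
    """Tukey's g-and-h transformation; for g = 0 the factor (exp(gx) - 1)/g becomes x."""
    core = x if g == 0 else math.expm1(g * x) / g
    return a + b * core * math.exp(h * x * x / 2)

def phi_inverse(y, a, b, g, h, lo=-10.0, hi=10.0, tol=1e-12):
    """Solve phi(x) = y by bisection.

    Assumes b > 0 and h >= 0, so that phi is increasing in x,
    and that the solution lies in the bracket [lo, hi].
    """
    if not phi(lo, a, b, g, h) <= y <= phi(hi, a, b, g, h):
        raise ValueError("y is outside the bracket [phi(lo), phi(hi)]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if phi(mid, a, b, g, h) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Round trip with arbitrarily chosen parameter values.
params = dict(a=1.0, b=2.0, g=0.5, h=0.1)
x = 1.3
y = phi(x, **params)
x_recovered = phi_inverse(y, **params)
print(x, x_recovered)
```

Bisection is slow but hard to break. A root finder such as Newton's method would converge faster at the cost of needing the derivative of φ.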

Special cases

The g distribution is the g-and-h distribution with h = 0. It generalizes the log normal distribution.

The limit of the g-and-h distribution as g goes to 0 is the h distribution.

If g and h are both zero we get the normal distribution.

Calculating skewness and kurtosis

The following method of computing the moments of Y comes from [1].

Define f by

f(g, h, i) = \frac{1}{g^i\sqrt{1 - ih}} \sum_{r=0}^i (-1)^r \binom{i}{r} \exp\left(\frac{((i-r)g)^2}{2(1-ih)}\right)

Then the raw moments of Y are given by

\text{E} \, Y^m = \sum_{i=0}^m \binom{m}{i} a^{m-i}b^i f(g,h,i)
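One way to gain confidence in the moment formula is to compare it with direct numerical integration. The sketch below does that for the first four raw moments. It assumes g ≠ 0 and mh < 1, and it includes the alternating sign (−1)^r that comes from expanding (exp(gZ) − 1)^i with the binomial theorem. The function names, the integration grid, and the parameter values are choices made for this example.

```python
import math
import numpy as np

def f(g, h, i):
    """E[X^i] for X = ((exp(gZ) - 1)/g) exp(hZ^2/2); requires g != 0 and i*h < 1."""
    scale = 1 - i * h
    total = sum(
        (-1) ** r * math.comb(i, r) * math.exp(((i - r) * g) ** 2 / (2 * scale))
        for r in range(i + 1)
    )
    return total / (g ** i * math.sqrt(scale))

def raw_moment(m, a, b, g, h):
    """E[Y^m] from the binomial sum over f."""
    return sum(
        math.comb(m, i) * a ** (m - i) * b ** i * f(g, h, i)
        for i in range(m + 1)
    )

def raw_moment_numeric(m, a, b, g, h):
    """E[phi(Z)^m] by a Riemann sum against the standard normal density."""
    z, dz = np.linspace(-15, 15, 30001, retstep=True)
    y = a + b * np.expm1(g * z) / g * np.exp(h * z ** 2 / 2)
    density = np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)
    return float(np.sum(y ** m * density) * dz)

# Parameter values chosen arbitrarily for the comparison.
a, b, g, h = 1.0, 2.0, 0.3, 0.05
for m in range(1, 5):
    print(m, raw_moment(m, a, b, g, h), raw_moment_numeric(m, a, b, g, h))
```

The two columns should agree to many digits. The mth moment exists only when mh < 1, so for larger h the higher moments, and hence skewness or kurtosis, may not exist.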

Skewness is the 3rd standardized moment and kurtosis is the 4th standardized moment, i.e. the 3rd and 4th central moments divided by the 3rd and 4th powers of the standard deviation. Equations for finding central moments from raw moments are given here.
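For reference, here is a short sketch of the conversion from raw moments to skewness and kurtosis, using the standard expansions of the central moments. It divides by the appropriate power of the variance, and the kurtosis it returns is not excess kurtosis. The function name is made up for this example.

```python
def skewness_and_kurtosis(m1, m2, m3, m4):
    """Skewness and (non-excess) kurtosis from the first four raw moments."""
    var = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
    return mu3 / var ** 1.5, mu4 / var ** 2

print(skewness_and_kurtosis(0, 1, 0, 3))   # standard normal: (0.0, 3.0)
print(skewness_and_kurtosis(1, 2, 6, 24))  # exponential with mean 1: (2.0, 9.0)
```

The two test cases use the known raw moments of the standard normal and of the exponential distribution with mean 1.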

[1] James B. McDonald and Patrick Turley. Distributional Characteristics: Just a Few More Moments. The American Statistician, Vol. 65, No. 2 (May 2011), pp. 96–103

Symmetric functions and U-statistics

A symmetric function is a function whose value is unchanged under every permutation of its arguments. The previous post showed how three symmetric functions of the sides of a triangle

  • a + b + c
  • ab + bc + ac
  • abc

are related to the perimeter, inner radius, and outer radius. It also mentioned that the coefficients of a cubic equation are symmetric functions of its roots.

This post looks briefly at symmetric functions in the context of statistics.

Let h be a symmetric function of r variables and suppose we have a set S of n numbers where n ≥ r. If we average h over all subsets of size r drawn from S then the result is another symmetric function, called a U-statistic. The “U” stands for unbiased.

If h(x) = x then the corresponding U-statistic is the sample mean.

If h(x, y) = (x − y)²/2 then the corresponding U-statistic is the sample variance. Note that this is the sample variance, not the population variance. You could see this as a justification for why sample variance has an n − 1 in the denominator while the corresponding term for population variance has an n.

Here is some Python code that demonstrates that the average of (x − y)²/2 over all pairs in a sample is indeed the sample variance.

    import numpy as np
    from itertools import combinations

    def var(xs):
        n = len(xs)
        num_pairs = n*(n-1)/2  # number of unordered pairs, n choose 2
        h = lambda x, y: (x - y)**2/2
        return sum(h(*c) for c in combinations(xs, 2)) / num_pairs

    xs = np.array([2, 3, 5, 7, 11])
    print(np.var(xs, ddof=1))
    print(var(xs))

Note the ddof term that causes NumPy to compute the sample variance rather than the population variance.
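The same averaging works for any symmetric kernel, not just the variance kernel. Here is a minimal generic sketch that recovers both the sample mean and the sample variance. The function name u_statistic is made up for this example, and the comparison values come from Python's statistics module.

```python
from itertools import combinations
from math import comb
from statistics import mean, variance

def u_statistic(h, xs, r):
    """Average the symmetric kernel h over all size-r subsets of xs."""
    return sum(h(*c) for c in combinations(xs, r)) / comb(len(xs), r)

xs = [2, 3, 5, 7, 11]
print(u_statistic(lambda x: x, xs, 1), mean(xs))
print(u_statistic(lambda x, y: (x - y) ** 2 / 2, xs, 2), variance(xs))
```

Enumerating all subsets costs on the order of n^r kernel evaluations, so this is for illustration rather than for large samples.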

Many statistics can be formulated as U-statistics, and so numerous properties of such statistics are corollaries of general results about U-statistics. For example, U-statistics are asymptotically normal, and so sample variance is asymptotically normal.

Relating perimeter, inner radius, outer radius, and sides of a triangle

Suppose a triangle T has sides a, b, and c.

Let s be the semi-perimeter, i.e. half the perimeter.

Let r be the inner radius, the radius of the largest circle that can fit inside T.

Let R be the outer radius, the radius of the circle that passes through the three vertices of T.

Then three simple equations relate a, b, c, s, r, and R.

\begin{align*} a + b + c &= 2s \\ ab + bc + ac &= s^2 + r^2 +4rR \\ abc &= 4Rrs \end{align*}

Given a, b, and c, use the first equation to solve for s, then the third equation for Rr, then the second for r, then go back to the last equation to find R.
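The procedure just described can be sketched in a few lines. The function name is made up for this example, and the code assumes a, b, and c really are the sides of a triangle.

```python
import math

def triangle_radii(a, b, c):
    """Return (s, r, R) for a triangle with sides a, b, c."""
    s = (a + b + c) / 2                                 # first equation
    Rr = a * b * c / (4 * s)                            # third equation gives the product Rr
    r = math.sqrt(a*b + b*c + a*c - s**2 - 4 * Rr)      # second equation gives r
    R = Rr / r                                          # back to the third equation
    return s, r, R

print(triangle_radii(3, 4, 5))  # (6.0, 1.0, 2.5)
```

For the 3-4-5 right triangle this gives inner radius 1 and outer radius 2.5, half the hypotenuse, as expected.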

Given s, r, and R, you can calculate the right hand sides of the three equations above, which are the coefficients in a cubic equation for the sides a, b, and c.

x^3 - (2s)x^2 + (s^2 + r^2 + 4Rr)x -(4Rrs)= 0

Note that this last statement is not about triangles per se. It’s a consequence of

(x-a)(x-b)(x-c) = x^3 - (a+b+c)x^2 + (ab + bc + ac)x - abc

which would be true even if a, b, and c were not the sides of a triangle. But since they are sides of a triangle here, the coefficients can be interpreted in terms of geometry, namely in terms of perimeter, inner radius, and outer radius.
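Going the other direction, the cubic can be solved numerically to recover the sides from s, r, and R. Here is a minimal sketch using NumPy's polynomial root finder. The function name is made up for this example, and the code assumes the inputs come from an actual triangle, so that all three roots are real.

```python
import numpy as np

def sides_from_radii(s, r, R):
    """Roots of x^3 - 2s x^2 + (s^2 + r^2 + 4Rr) x - 4Rrs = 0, sorted."""
    coefficients = [1, -2 * s, s**2 + r**2 + 4 * R * r, -4 * R * r * s]
    roots = np.roots(coefficients)
    # Discard tiny imaginary parts that can arise from rounding error.
    return np.sort(roots.real)

print(sides_from_radii(6, 1, 2.5))
```

With s = 6, r = 1, and R = 2.5 the roots are approximately 3, 4, and 5.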

Experiments with Bing chat

My two previous posts looked at experiments with ChatGPT and Google Bard. This post will look at redoing the same experiments with Microsoft’s Bing Chat: looking for mnemonic encodings and simplifying Boolean expressions.

When you open up Bing chat you can select a conversational style:

  • More creative
  • More balanced
  • More precise

I chose “more precise” for my experiments.

Mnemonic encodings

Bing seems to “know” more about the Major mnemonic system. In response to my first question about the system it gave a link to the Wikipedia article on the Major system and a couple other relevant links. It gives a more complete table of encodings than ChatGPT did:

  • 0: s, z
  • 1: t, d, th
  • 2: n
  • 3: m
  • 4: r
  • 5: l
  • 6: j, ch, sh
  • 7: c, k, g, q, ck
  • 8: v, f, ph
  • 9: p, b

Directly asking for encodings

It seems to do pretty well when I ask for an encoding of a two-digit number, and when I ask for an encoding of a three-digit number it encodes the first two digits correctly but ignores the third digit. However, when I asked it to encode 345 it did very well:

Using the Major System, the number 345 can be encoded as “molar” or “mailer” or “miller” or “mural” or “marl”, among others.

ChatGPT and Google Bard said that no English word can encode 345, though Google Bard came up with Merlin anyway. (Strictly speaking “Merlin” encodes 3452, but it’s a common practice to just pay attention to the first three consonants in words with more than three consonants.)
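Candidate words like these are easy to check mechanically. The sketch below turns the table Bing returned into a rough encoder. It works from spelling rather than pronunciation, which is only an approximation since the Major system is based on sounds, and the function name and digraph handling are choices made for this example.

```python
# Digit for each consonant spelling, following the table Bing returned.
DIGRAPHS = {"th": "1", "ch": "6", "sh": "6", "ck": "7", "ph": "8"}
LETTERS = {
    "s": "0", "z": "0",
    "t": "1", "d": "1",
    "n": "2", "m": "3", "r": "4", "l": "5",
    "j": "6",
    "c": "7", "k": "7", "g": "7", "q": "7",
    "v": "8", "f": "8",
    "p": "9", "b": "9",
}

def major_digits(word):
    """Approximate Major-system digits of a word from its spelling."""
    word = word.lower()
    digits = []
    previous = ""
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            digits.append(DIGRAPHS[pair])
            previous = ""
            i += 2
            continue
        letter = word[i]
        # A doubled consonant is one sound, so it contributes one digit.
        if letter in LETTERS and letter != previous:
            digits.append(LETTERS[letter])
        previous = letter
        i += 1
    return "".join(digits)

for word in ["molar", "mailer", "miller", "mural", "marl", "Merlin", "cabin"]:
    print(word, major_digits(word))
```

By this approximation mural and marl come out as 345 and Merlin as 3452, while molar, mailer, and miller come out as 354 because the l sound precedes the r. A spelling-based check will misjudge words with silent or soft consonants, such as the k in knob or the g in giraffe.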

Asking for sounds

When I asked “Find a word with the sounds g, r, and f” Bing came back with

 One word that contains the sounds g, r, and f is “giraffe” 🦒

When I asked for the sounds l, f, and b, Bing came up with “belief.” When I added “in that order” to my prompt it came back with “bluff” and “flyby,” ignoring the order requirement.

In general Bing AI seems to do better than ChatGPT and Bard at word game questions.

Boolean minimization

Next I tried the Boolean minimization problem that I’d tried with ChatGPT and Bard using the prompt

Minimize the boolean expression ¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD

ChatGPT tried a Karnaugh map approach, but botched it. Bard tried an algebraic approach, but also botched it.

Bing tried something different:

The given boolean expression can be simplified using the consensus theorem. The consensus theorem states that for any three variables A, B, and C: AB + AC + BC = AB + AC. Applying this theorem to the given expression, we get

¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD
= ¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC
= ¬AB¬C¬D + A( ¬B¬C¬D + ¬B¬CD + ¬BC¬D + ¬BCD) + AB( ¬C¬D) + ABC
= ¬AB( ¬C) ( ¬D) + A( ¬B) ( ¬C( ¬D) + ¬CD) + AB( ¬C( ¬D)) + ABC
= ¬AB( ¬C) ( ¬D) + A( ¬B) ( ¬CD) + AB( ¬C( ¬D)) + ABC
= (A+B)(A+C)(A+D)(B+C)(B+D)(C+D)

This derivation starts out correctly but goes astray about halfway through.

Boolean function minimization with AI

Screenshot of Karnaugh map produced by ChatGPT

I was curious how well LLMs would do at minimizing a Boolean expression, that is, taking a Boolean expression and producing a smaller equivalent expression.

I didn’t expect good performance because this problem is more about logic than recall, but sometimes LLMs surprise you, so I wanted to give it a chance. I thought it might accomplish some reduction in size though not as much as possible. Let’s see what it did.

Prompt

My prompt for the three platforms I tried—GPT4All, ChatGPT, and Google Bard—was

Minimize the boolean expression ¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD

I knew the expression could be minimized to

B¬C¬D + A¬B + AC.
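With only four variables, any proposed simplification can be checked by brute force over all 16 assignments. The sketch below compares the original eight-term expression with the three-term form B¬C¬D + A¬B + AC. The function names are made up for this example.

```python
from itertools import product

def original(A, B, C, D):
    """The eight-term expression from the prompt."""
    return (
        (not A and B and not C and not D)
        or (A and not B and not C and not D)
        or (A and not B and not C and D)
        or (A and not B and C and not D)
        or (A and not B and C and D)
        or (A and B and not C and not D)
        or (A and B and C and not D)
        or (A and B and C and D)
    )

def candidate(A, B, C, D):
    """B¬C¬D + A¬B + AC"""
    return (B and not C and not D) or (A and not B) or (A and C)

def disagreements(f, g, n=4):
    """All assignments of n Boolean variables on which f and g differ."""
    return [v for v in product([False, True], repeat=n) if bool(f(*v)) != bool(g(*v))]

print(disagreements(original, candidate))  # []
```

An empty list means the two expressions agree on every input. Swapping a single literal is enough to break the equivalence: replacing A¬B with A¬D, for example, leaves the input A¬B¬CD uncovered. The same check can be pointed at any of the answers the chatbots produced.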

GPT4All

I first tried GPT4All using the Groovy data set. It crashed immediately. I tried a few variations on the same prompt and it crashed right away every time.

This was on a MacBook Pro. I didn’t try it on Windows or Linux.

ChatGPT

I’ll go through ChatGPT’s response a little at a time.

To minimize the given boolean expression ¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD, we can use Karnaugh maps or Boolean algebra simplification techniques. I’ll demonstrate both methods.

This is a sensible start. Karnaugh maps are a common classroom technique for solving small Boolean minimization problems, so there were lots of examples of Karnaugh maps in the data scraped to train ChatGPT.

Next ChatGPT draws a Karnaugh map. The screenshot at the top of the post was from a different experiment, a version of the experiment described here but containing an error on my part. This time ChatGPT produced a black image containing only “CD”, but its response also contained a Karnaugh map in text. Cleaning up the formatting we get the following.

       | 00 | 01 | 11 | 10 |
    00 |  1 |  0 |  1 |  0 |
    01 |  1 |  1 |  1 |  1 |
    11 |  1 |  1 |  1 |  1 |
    10 |  0 |  1 |  0 |  0 | 

This is nonsense. Nine out of sixteen entries are wrong.

ChatGPT concludes

Grouping the adjacent 1’s in the Karnaugh map, we can simplify the expression to: ¬A¬C + ¬A¬D + A¬B + ACD

This is also nonsense. It does not correctly solve the original problem, nor does it solve the problem corresponding to the incorrect Karnaugh map above.

Next let’s see how it does with Boolean algebra. It starts out OK.

Boolean Algebra Method: We can simplify the expression using Boolean algebra identities and rules.

¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD

Rearranging the terms: A¬B¬C¬D + A¬B¬CD + AB¬C¬D + A¬BC¬D + ABC¬D + A¬BCD + ABCD + A¬B¬C¬D

Grouping the terms: A¬B¬C¬D + A¬B¬CD + AB¬C¬D + A¬BC¬D + ABC¬D + A¬BCD + ABCD

Combining like terms: (A + A)¬B¬C¬D + (A + A)¬B¬CD + AB¬C¬D + (A + A)¬BC¬D + ABC¬D + A¬BCD + ABCD

These steps haven’t accomplished anything, but they’re not wrong. Then it goes off the rails.

Using the complement law A + ¬A = 1: ¬B¬C¬D + ¬B¬CD + AB¬C¬D + ¬BC¬D + ABC¬D + A¬BCD + ABCD

The explanation is wrong and the solution is wrong.

The solution is completely wrong, but it looks plausible. At first glance I thought it might be right, or more right than it was. If you were making a movie and you needed to generate electrical engineering homework, this would be adequate for a realistic-looking homework solution.

Google Bard

Google Bard gives a more succinct solution with five steps:

  1. Identify the common terms in the expression. The common terms are ¬AB¬C¬D, A¬B¬CD, and ABC¬D.
  2. Combine the common terms. This gives us the expression (¬AB¬C¬D + A¬B¬CD + ABC¬D) + (A¬B¬C¬D + AB¬C¬D).
  3. Apply the distribution property. This gives us the expression ¬AB¬C¬D + A¬B¬CD + ABC¬D + A¬B¬C¬D + AB¬C¬D.
  4. Identify the redundant terms. The redundant terms are ¬AB¬C¬D and A¬B¬C¬D.
  5. Remove the redundant terms. This gives us the expression A¬B¬CD + ABC¬D.

This too is nonsense.

Large language models and mnemonics

The Major mnemonic system encodes numbers as words in order to make them easier to remember. Digits correspond to consonant sounds (not spellings) as explained here. You can use the system ad hoc, improvising an encoding of a word as needed, or you can memorize canonical encodings of numbers, also known as pegs.

Pegs have a couple of advantages. For one, they are eventually faster. Rather than search for an encoding, you recall a predetermined association. For another, pegs are better for memorizing numbered lists. To recall the nth item in a list, pull up the one-and-only image associated with that number and recall what you associated with it.

Coming up with encodings for most numbers is easy, but for a few holdouts it’s challenging. I created a personal list of pegs for numbers up to 1000. A few of these are idiosyncratic, such as names of people who are not well known, and some bend the rules of the Major system.

I decided to try the ChatGPT and Google Bard large language models to see whether they could help me find pegs that I could share with a general audience, free of obscure references.

Asking directly about the Major system

ChatGPT and Google Bard both seem to have some idea of the Major mnemonic system. As part of its reply to one of my prompts, Google Bard returned this:

    0: S, Z, soft C
    1: D, T, TH
    2: N, L, M
    3: M
    4: R
    5: L
    6: J, SH, soft CH, DG
    7: K
    8: P
    9: H

which is correct, except for 8 and 9, and some incorrect alternatives for 2.

When I asked ChatGPT for “Major mnemonic system encoding table” it returned a correct table. But when I asked it for encodings of numbers, it used different rules. For example, it seemed to systematically use M for 2.

I asked both systems

Find a word that encodes the number 792 in the Major memory system

ChatGPT came back with KNP and GNP. These aren’t words, and it seems ChatGPT might have dyslexia.

Bard said

There is no word in the English language that encodes the number 792 in the Major memory system. The closest word is “Sink”, but it only encodes the digits 7 and 9.

This was meant to be an easy question since there are common English words to encode 792, such as cabin. Neither system came up with good responses to the numbers I found challenging, and it may be that there simply are no English words with the desired sequence of sounds.

Before moving on, I wanted to mention one response from Google Bard that I found amusing. When I asked

Find a word that encodes 345 using the Major mnemonic system

Bard responded with

… The encoded number 345 is therefore MERLE.

However, there is no word in the English language that is spelled exactly as MERLE. However, there are a few words that are similar, such as Merry, Mettle, and Melt.

If you are looking for a word that is spelled exactly as MERLE, you may need to create a new word. For example, you could create the word Merlin, which is a name that is often associated with magic and wisdom.

Asking for sounds

The Major system is not well known, so I didn’t expect a direct approach to be that helpful. Next I tried asking for words with particular sequences of sounds rather than asking directly about the Major system.

When I asked both systems

Find a word containing the consonant sounds k, b, and n, in that order

ChatGPT came back with cabin and Bard said

There is no word in the English language that contains the consonant sounds k, b, and n in that order. The closest word is “knob”, but the n sound is pronounced as “ng” in this word.

So the point goes to GPT on that one.

Both systems did poorly on the numbers I had a hard time finding pegs for, possibly because there aren’t any (or many) English words with the specified sounds. With easier questions, the results were mixed.

When I asked

Find a word with the sounds g, r, and f

I was thinking giraffe. ChatGPT returned graf and Bard returned graft.

When I asked

Find a word with sounds l, f, and b

I was thinking, for example, lifeboat. But ChatGPT came back with

A word that contains the consonant sounds “l,” “f,” and “b” is “elfishly.”

and Bard replied

The word “blurb” contains the sounds l, f, and b in that order.

Phonetics

Both systems did well when I asked directly for the phonetic spelling of a word. So I tried to get both systems to come up with lifeboat by asking

Words whose IPA phonetic spellings contain l, f, and b

Bard found words that had some of these letters but not all. When I was more specific, asking

Words whose IPA phonetic spellings contain each of the letters l, f, and b

the results weren’t better.

ChatGPT, on the other hand, came up with affleb, triflingly, bluff, lifeboat, and fallopian. When I asked the more specific question, adding “each of”, ChatGPT returned alphabet, fulbright, bubluous, flyable, and belltry. When I added “in that order” to the prompt, ChatGPT said there are no such words, despite having come up with several such words a moment earlier.

When does a function have an addition theorem?

Motivating examples

The addition theorem for cosine says that

\cos(x + y) = \cos x \cos y - \sin x \sin y

and the addition theorem for hyperbolic cosine is analogous, though with a sign change.

\cosh(x + y) = \cosh x \cosh y + \sinh x \sinh y

An addition theorem is a theorem that relates a function’s value at x + y to its values at x and at y. The squaring function satisfies a very simple addition theorem

(x + y)^2 = x^2 + 2xy + y^2

and the Jacobi function sn satisfies a more complicated addition theorem.

\text{sn}(x + y) = \frac{ \text{sn}(x)\, \text{cn}(y)\, \text{dn}(y) + \text{sn}(y) \,\text{cn}(x) \,\text{dn}(x) }{ 1 - m \, \text{sn}^2(x) \,\text{sn}^2(y) }

Defining an algebraic addition theorem

Which functions have addition theorems? Before we can answer this question we need to be more precise about what an addition theorem is. We’ve said that an addition theorem for φ relates φ(x + y) to φ(x) and φ(y). But what exactly do we mean by “relate”? What counts as a relation?

Also, the examples above don’t exactly satisfy this definition. The addition law for cosines, for example, relates cos(x + y) to the values of cos(x) and cos(y) but also to sin(x) and sin(y). Somehow that feels OK because sine and cosine are related. But here again we’re talking about things being related without saying exactly what we mean.

Weierstrass (1815–1897) made the idea of an addition theorem precise and classified functions having addition theorems. A function satisfies an algebraic addition theorem if there is a polynomial F in three variables such that

F(\varphi(x + y), \varphi(x), \varphi(y)) = 0

For example, if φ(x) = x² then φ(x + y) − φ(x) − φ(y) = 2xy, and squaring both sides gives

\left(\varphi(x+y) - \varphi(x) - \varphi(y)\right)^2 - 4\varphi(x)\varphi(y) = 0

and so we could take F to be

F(a, b, c) = (a - b - c)^2 - 4bc

Similarly, if φ(x) = cos x then

\left(\varphi(x+ y) - \varphi(x) \varphi(y)\right)^2 - (1 - \varphi(x)^2) (1 - \varphi(y)^2) = 0

and so we could take F to be

F(a, b, c) = (a - bc)^2 - (1 - b^2)(1 - c^2)
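A candidate polynomial F is easy to test numerically: evaluate F(φ(x + y), φ(x), φ(y)) at random points and see whether the result is zero up to rounding. The sketch below does this for two candidates, (a − b − c)² − 4bc for the squaring function and (a − bc)² − (1 − b²)(1 − c²) for cosine. The function names, the sampling range, and the number of trials are choices made for this example.

```python
import math
import random

def F_square(a, b, c):
    """Candidate F for phi(x) = x^2: (a - b - c)^2 - 4bc."""
    return (a - b - c) ** 2 - 4 * b * c

def F_cos(a, b, c):
    """Candidate F for phi(x) = cos x: (a - bc)^2 - (1 - b^2)(1 - c^2)."""
    return (a - b * c) ** 2 - (1 - b ** 2) * (1 - c ** 2)

def max_residual(F, phi, trials=1000, seed=0):
    """Largest |F(phi(x + y), phi(x), phi(y))| over random x and y in [-3, 3]."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x, y = rng.uniform(-3, 3), rng.uniform(-3, 3)
        worst = max(worst, abs(F(phi(x + y), phi(x), phi(y))))
    return worst

print(max_residual(F_square, lambda x: x * x))
print(max_residual(F_cos, math.cos))
```

Both residuals should be at the level of floating point rounding error. A random check like this cannot prove an identity, but a polynomial that is not an addition theorem for φ will generally show a large residual immediately.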

Classifying functions with algebraic addition theorems

Now for Weierstrass’ theorem. A meromorphic function φ(z) has an algebraic addition theorem if and only if it is an elliptic function of z, a rational function of z, or a rational function of exp(λz).

A meromorphic function is one that is analytic everywhere except at isolated singularities. To put it another way, we assume φ has a convergent power series everywhere in the complex plane except at isolated points.

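The examples above illustrate all three cases of Weierstrass’ theorem. The function sn(z) is elliptic, the function z² is rational, and the functions cos(z) and cosh(z) are rational functions of exp(iz) and exp(z) respectively: cos(z) = (w + 1/w)/2 with w = exp(iz), and cosh(z) = (w + 1/w)/2 with w = exp(z).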

Other kinds of addition theorems

Algebraic addition theorems are not the only kind of addition theorems. For example, Bessel functions satisfy a different kind of addition theorem:

J_n(x + y) = \sum_{k=-\infty}^\infty J_{n-k}(x) J_k(y)

This theorem relates the value of a Bessel function at x + y to the values of other Bessel functions at x and at y, but it is not an algebraic addition theorem because the right hand side is an infinite sum and because the Bessel functions are not algebraically related to each other.

How to mark a language in HTML

In HTML you can mark the language of a piece of text by putting it inside span tags and setting the lang attribute to a two-letter abbreviation. For example,

    <span lang="fr">Allons enfants de la Patrie,
    Le jour de gloire est arrivé !</span>

indicates that the first two lines of the French national anthem are in French.

What are the two-letter codes for languages? I’ve had to look this up several times, and I’m writing the answer here for my future reference and for the benefit of anyone else with the same question.

Finding these abbreviations is a bit of a goose chase. Search for the Microdata standard and that takes you to a W3C document. Search that document for “language” and you don’t find what you’re looking for. But if you’re persistent you’ll find that you’re supposed to use BCP 47 abbreviations. Go there and you see a link for RFC 5646: Tags for Identifying Languages. Click on that and you get a link for the RFC in various formats. Click on that and you think you must have finally found it, a table of languages and abbreviations. Au contraire! This takes you to an 84-page document on how to format language abbreviations. Eventually you see something about ISO 639, and searching on that may take you to the Wikipedia page on ISO 639-1 and that has the table you’re looking for.

Short answer: Look up ISO 639.

The longer answer is that language classification has a surprising amount of detail. There are two-letter and three-letter abbreviations, four-to-eight letter abbreviations, variations, private use variations, … But as a first pass, simply use the two-letter abbreviation. There are ways to be more specific if you need to.

And if you have to guess, use the first two letters of the English name, such as ar for Arabic and ru for Russian. Two major exceptions are Chinese (zh) and Spanish (es). There are many other exceptions as well, but if I wanted to remember a list of abbreviations, I’d narrow the list by first scratching off the ones that are abbreviated by their first two letters.

Russian transliteration hack

I mentioned in the previous post that I had been poking around in HTML entities and noticed symbols for Fourier transforms and such. I also noticed HTML entities for Cyrillic letters. These entities have the form

& + transliteration + cy;.

For example, the Cyrillic letter П is based on the Greek letter Π and its closest English counterpart is P, and its HTML entity is &Pcy;.

The Cyrillic letter Р has HTML entity &Rcy; and not &Pcy; because although it looks like an English P, it sounds more like an English R.

Just as a hack, I decided to write code to transliterate Russian text by converting letters to their HTML entities, then chopping off the initial & and the final cy;.

I don’t speak Russian, but according to Google Translate, the Russian translation of “Hello world” is “Привет, мир.”

Here’s my hello-world program for transliterating Russian.

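    from bs4.dammit import EntitySubstitution

    escaper = EntitySubstitution()

    def transliterate(ch):
        # "&Pcy;" -> "Pcy;" -> "P": drop the leading "&" and the trailing "cy;"
        entity = escaper.substitute_html(ch)[1:]
        return entity[:-3]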
    
    a = [transliterate(c) for c in "Привет, мир."]
    print(" ".join(a))

This prints

P r i v ie t m i r

Here’s what I get trying to transliterate Chebyshev’s native name Пафну́тий Льво́вич Чебышёв.

P a f n u t i j L soft v o v i ch CH ie b y sh io v

I put a space between letters because of possible outputs like “soft v” above.

This was just a fun hack. Here’s what I’d get if I used software intended to be used for transliteration.

    import unidecode

    for x in ["Привет, мир", "Пафну́тий Льво́вич Чебышёв"]:
        print(unidecode.unidecode(x))

This produces

Privet, mir
Pafnutii L’vovich Chebyshiov

The results are similar.
