Analyzing the Federalist Papers

The Federalist Papers, a collection of 85 essays published anonymously between 1787 and 1788, were one of the first subjects for natural language processing aided by a computer. Because the papers were anonymous, people were naturally curious who wrote each of the essays. Early on it was determined that the authors were Alexander Hamilton, James Madison, and John Jay, but the authorship of individual essays wasn’t known.

In 1944, Douglass Adair conjectured the authorship of each essay, and twenty years later Frederick Mosteller and David Wallace confirmed Adair’s conclusions by Bayesian analysis. Mosteller and Wallace used a computer to carry out their statistical calculations, but they did not have an electronic version of the text.

They physically chopped a printed copy of the text into individual words and counted them. Mosteller recounted in his autobiography that until working on The Federalist Papers, he had underestimated how hard it was to count a large number of things, especially little pieces of paper that could be scattered by a draft.

I’m not familiar with how Mosteller and Wallace did their analysis, but I presume they formed a prior distribution on the frequency of various words in writings known to be by Hamilton, Madison, and Jay, then computed the posterior probability of authorship by each author for each essay.

The authorship of the papers was summarized in the song “Non-Stop” from the musical Hamilton:

The plan was to write a total of twenty-five essays, the work divided evenly among the three men. In the end, they wrote eighty-five essays in the span of six months. John Jay got sick after writing five. James Madison wrote twenty-nine. Hamilton wrote the other fifty-one!

Yesterday I wrote about the TF-IDF statistic for the importance of words in a corpus of documents. In that post I used the books of the Bible as my corpus. Today I wanted to reuse the code I wrote for that post by applying it to The Federalist Papers.

Federalist No. 10 is the best known essay in the collection. Here are the words with the highest TF-IDF scores from that essay.

faction: 0.0084
majority: 0.0047
democracy: 0.0044
controlling: 0.0044
parties: 0.0039
republic: 0.0036
cure: 0.0035
factious: 0.0035
property: 0.0033
faculties: 0.0033

I skimmed a list of the most important words in the essays by Madison and Hamilton and noticed that Madison’s list had several words from classical literature: Achaeans, Athens, Draco, Lycurgus, Sparta, etc. There were only a couple of classical references in Hamilton’s top words: Lysander and Pericles. I noticed “debt” was important to Hamilton.

You can find the list of top 10 words in each essay here.

Counting points on an elliptic curve

Suppose you have an elliptic curve

y² = x³ + ax + b

over a finite field Fp for prime p. How many points are on the curve?

Brute force

You can count the number of points on the curve by brute force, as I did here. Loop through each of the p possibilities for x and for y and count how many satisfy the curve’s equation, then add one for the point at infinity. This is the most obvious but slowest approach, taking O(p²) time.

Here’s a slight variation on the code posted before. This time, instead of passing in the function defining the equation, we’ll assume the curve is in the form above (short Weierstrass form) and pass in the parameters a and b. This will work better when we refine the code below.

def order(a, b, p):
    # Count the points on y² = x³ + ax + b over F_p by brute force.
    c = 1 # The point at infinity
    for x in range(p):
        for y in range(p):
            if (y**2 - x**3 - a*x - b) % p == 0:
                c += 1
    return c

Better algorithm

A better approach would be to loop over the x values but not the y’s. For each x, determine whether

x³ + ax + b

is a square mod p by computing the Legendre symbol. This takes O(log³ p) time [1], and we have to do it for p different values of x, so the run time is O(p log³ p).

from sympy import legendre_symbol

def order2(a, b, p):
    # Count the points on y² = x³ + ax + b over F_p by testing, for each x,
    # whether x³ + ax + b is a square mod p via the Legendre symbol.
    c = 1 # The point at infinity
    for x in range(p):
        r = x**3 + a*x + b
        if r % p == 0:
            c += 1 # y == 0 gives one point
        elif legendre_symbol(r, p) == 1:
            c += 2 # two square roots give two points
    return c
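
As a quick sanity check, both functions should agree on a small example: the curve y² = x³ + 2 over F13 has 19 points.

assert order(0, 2, 13) == 19   # brute force count
assert order2(0, 2, 13) == 19  # Legendre symbol count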

Schoof’s algorithm

There’s a more efficient algorithm, Schoof’s algorithm. It has run time O(logᵏ p) but I’m not clear on the value of k. I’ve seen k = 8 and k = 5. I’ve also seen k left unspecified. In any case, for very large p Schoof’s algorithm will be faster than the one above. However, Schoof’s algorithm is much more complicated, and the algorithm above is fast enough if p isn’t too large.

Comparing times

Let’s take our log to be log base 2; all logs are proportional, so this doesn’t change the big-O analysis.

If p is on the order of a million, i.e. around 2²⁰, then the brute force algorithm will have run time on the order of 2⁴⁰ and the improved algorithm will have run time on the order of 2²⁰ × 20³ ≈ 2³³. If k = 8 in Schoof’s algorithm, its runtime will be on the order of 20⁸ ≈ 2³⁴, so roughly the same as the previous algorithm.

But if p is on the order of 2²⁵⁶, as it often is in cryptography, then the three algorithms have runtimes on the order of 2⁵¹², 2²⁸⁰, and 2⁶⁴. In this case Schoof’s algorithm is expensive to run, but the others are completely infeasible.
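
If you want to check this arithmetic, here’s a short script that computes the exponents in the estimates above, taking the exponent in Schoof’s algorithm to be k = 8.

from math import log2

for bits in (20, 256):                  # log2 of p
    brute    = 2 * bits                 # O(p²) brute force
    legendre = bits + 3 * log2(bits)    # O(p log³ p) Legendre symbol loop
    schoof   = 8 * log2(bits)           # O(log⁸ p) Schoof with k = 8
    print(f"p ~ 2^{bits}: 2^{brute}, 2^{legendre:.1f}, 2^{schoof:.1f}")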

[1] Note that logᵏ p means (log p)ᵏ, not log applied k times. It’s similar to the convention for sine and cosine.

Using TF-IDF to pick out important words

TF-IDF (Term Frequency-Inverse Document Frequency) is commonly used in natural language processing to extract important words. The idea behind the statistic is that a word is important if it occurs frequently in a particular document but not frequently in the corpus of documents the document came from.

The term-frequency (TF) of a word in a document is the probability of selecting that word at random from the document, i.e. the number of times the word appears in the document divided by the total number of words in the document.

Inverse document frequency (IDF) is not quite what the name implies. You might reasonably assume that inverse document frequency is the inverse (i.e. reciprocal) of document frequency, where document frequency is the proportion of documents containing the word. Or in other words, the reciprocal of the probability of selecting a document at random containing the word. That’s almost right, except you take the logarithm.

TF-IDF for a word and a document is the product of TF and IDF for that word and document. You could say

TF-IDF = TF * IDF

where the “-” on the left side is a hyphen, not a minus sign.
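
Here’s a minimal sketch of the computation as defined above, assuming the corpus is represented as a list of documents, each document a list of lowercase word tokens.

import math
from collections import Counter

def tf_idf(corpus):
    n = len(corpus)
    # Document frequency: the number of documents containing each word
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    # TF-IDF score for each word in each document
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        scores.append({w: (c / total) * math.log(n / df[w])
                       for w, c in counts.items()})
    return scores

Sorting each document’s dictionary by score and keeping the top entries produces lists like the ones below.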

To try this out, let’s look at the King James Bible. The text is readily available, for example from Project Gutenberg, and it divides into 66 documents (books).

Note that if a word appears in every document, in our case every book of the Bible, then IDF = log(1) = 0. This means that common words like “the” and “and” that appear in every book get a zero score.

Here are the most important words in Genesis, as measured by TF-IDF.

laban: 0.0044
abram: 0.0040
joseph: 0.0037
jacob: 0.0034
esau: 0.0032
rachel: 0.0031
said: 0.0031
pharaoh: 0.0030
rebekah: 0.0029
duke: 0.0028

It’s surprising that Laban comes out on top. Surely Joseph is more important than Laban, for example. Joseph appears more often in Genesis than does Laban, and so has a higher TF score. But Laban only appears in two books, whereas Joseph appears in 23 books, and so Laban has a higher IDF score.

Note that TF-IDF only looks at sequences of letters. It cannot distinguish, for example, the person named Laban in Genesis from the location named Laban in Deuteronomy.

Another oddity above is the frequency of “duke.” In the language of the KJV, a duke was the head of a clan. It wasn’t a title of nobility as it is in contemporary English.

The most important words in Revelation are what you might expect.

angel: 0.0043
lamb: 0.0034
beast: 0.0033
throne: 0.0028
seven: 0.0028
dragon: 0.0025
angels: 0.0025
bottomless: 0.0024
overcometh: 0.0023
churches: 0.0022

You can find the top 10 words in each book here.


Genesis Block Easter Egg

The White House put out a position paper Strengthening American Leadership in Digital Financial Technology a few days ago. The last page of the paper contains a hex dump.

Kinda surprising to see something like that coming out of the White House, but it makes sense in the context of cryptocurrency. Presumably Donald Trump has no idea what a hex dump is, but someone around him does.

My first thought was that something was wrong because the hex codes don’t correspond to the text on the side as they would if you were opening a text file in a hex editor. But it’s not a mistake; it’s an Easter Egg.

Extracting text from image

I tried to convert the image to text using tesseract but it fell down. I’ve had good experience with tesseract in the past, but this time was disappointing.

I was skeptical that an LLM would do a better job, because the LLMs use tesseract internally. Or at least at one time OpenAI did. Grok 4 initially did a poor job, but it worked after I gave it more help using the following prompt.

Convert the attached image to text. It is a hex dump: all characters are hexadecimal symbols: digits and the capital letters A, B, C, D, E, or F.

Here’s the result.

01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 3B A3 ED FD 7A 7B 12 B2 7A C7 2C 3E
67 76 8F 61 7F C8 1B C3 88 8A 51 32 3A 9F B8 AA
4B 1E 5E 4A 29 AB 5F 49 FF FF 00 1D 1D AC 2B 7C
01 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
01 04 45 54 68 65 20 54 69 6D 65 73 20 30 33 2F
4A 61 6E 2F 32 30 30 39 20 43 68 61 6E 63 65 6C
6C 6F 72 20 6F 6E 20 62 72 69 6E 6B 20 6F 66 20
73 65 63 6F 6E 64 20 62 61 69 6C 6F 75 74 20 66
6F 72 20 62 61 6E 6B 73 FF FF FF FF 01 00 F2 05
2A 01 00 00 00 43 41 04 67 8A FD B0 FE 55 48 27
19 67 F1 A6 71 30 B7 10 5C D6 A8 28 E0 39 09 A6
79 62 E0 EA 1F 61 DE B6 49 F6 BC 3F 4C EF 38 C4
F3 55 04 E5 1E C1 12 DE 5C 38 4D F7 BA 0B 8D 57
8A 4C 70 2B 6B F1 1D 5F AC 00 00 00 00

The Genesis Block

The hex content is the header of the Bitcoin “Genesis Block,” the first block in the Bitcoin blockchain. You can find a full breakdown of the bytes here.

The defining characteristic of a blockchain is that it is a chain of blocks. The blocks are connected by each block containing the cryptographic hash of the previous block’s header. For Bitcoin, the hash starts in the 5th byte and runs for the next 32 bytes. You see a lot of zeros at the top of the hex dump above because the Genesis Block had no predecessor on the chain.
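
For example, here are the first three rows of the dump read back in Python: bytes 4 through 35 hold the previous block’s hash, and for the Genesis Block they are all zero.

# First 48 bytes of the header, copied from the dump above
dump = (
    "01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"
    " 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"
    " 00 00 00 00 3B A3 ED FD 7A 7B 12 B2 7A C7 2C 3E"
)
header = bytes.fromhex(dump)
prev_hash = header[4:36]       # 32-byte previous block hash
print(prev_hash == bytes(32))  # True: all zeros, no predecessor block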

Easter Egg Within an Easter Egg

Quoting the hex dump of the Genesis Block in the position paper was an Easter Egg for Bitcoin enthusiasts. The Genesis Block contains a sort of Easter Egg of its own.

The section of the header

    54 68 65 ... 6E 6B 73

is the ASCII text

The Times 03/Jan/2009 Chancellor on brink of second bailout for banks

Satoshi Nakamoto quoted the headline from The Times from January 3, 2009 to prove that the genesis block was created on or after that date. The headline seems to also be a sort of Easter Egg, an implicit commentary on the instability of fractional-reserve banking.
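
You can check the decoding with a couple lines of Python by pasting in the relevant bytes from the dump above.

headline_hex = (
    "54 68 65 20 54 69 6D 65 73 20 30 33 2F"
    " 4A 61 6E 2F 32 30 30 39 20 43 68 61 6E 63 65 6C"
    " 6C 6F 72 20 6F 6E 20 62 72 69 6E 6B 20 6F 66 20"
    " 73 65 63 6F 6E 64 20 62 61 69 6C 6F 75 74 20 66"
    " 6F 72 20 62 61 6E 6B 73"
)
print(bytes.fromhex(headline_hex).decode("ascii"))
# The Times 03/Jan/2009 Chancellor on brink of second bailout for banks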


Making the two-dimensional one-dimensional

We often want to reduce something that’s inherently two-dimensional into something one-dimensional. We want to turn a graph into a list.

And we’d like to do this with some kind of faithfulness. We’d like things that are close together in 2D space to be close together in their 1D representation, and vice versa, to the extent possible.

For example, postal codes are a way of imposing a linear order on geographic regions. You would like (or maybe naively assume) that regions whose zip codes are numerically close together are geographically close together. This is roughly true. See this post to explore that further.

Tours are another way to turn a graph into a list. A Traveling Salesman tour is a path of shortest length through a set of points. For example, here is a Traveling Salesman tour of Texas counties. Counties that are visited consecutively are close together, though it may take a long time to come back to a county close to the one you’re in at a given time.

Sometimes there are purely mathematical reasons to flatten a 2D structure into a linear tour, such as Hilbert curves or Cantor’s diagonal trick.

All this came to mind because I saw a post on Hacker News this morning about a way to enumerate a zigzag spiral.

The remarkable thing about this article is that the author gives a sequence of closed-form expressions for the number at position (m, n) in the grid.


Looking back at Martin Gardner’s RSA article

Public key cryptography came to the world’s attention via Martin Gardner’s Scientific American article from August 1977 on RSA encryption.

The article’s opening paragraph illustrates what a different world 1977 was in regard to computation and communication.

… in a few decades … the transfer of information will probably be much faster and much cheaper by “electronic mail” than by conventional postal systems.

Gardner quotes Ron Rivest [1] saying that breaking RSA encryption by factoring the product of two 63-digit primes would take about 40 quadrillion years. The article included a challenge, a message encrypted using a 129-digit key, the product of a 64-digit prime and a 65-digit prime. Rivest offered a $100 prize for decrypting the message.

Note the tension between Rivest’s estimate and his bet. It’s as if he were saying “Based on the factoring algorithms and computational hardware now available, it would take forever to decrypt this message. But I’m only willing to bet $100 on that estimate holding up.”

The message was decrypted 16 years later. Unbeknownst to Gardner’s readers in 1977, the challenge message was

THE MAGIC WORDS ARE SQUEAMISH OSSIFRAGE

encoded using 00 for space, 01 for A, 02 for B, etc.  It was decrypted in 1993 by a group of around 600 people using around 1600 computers. Here is a paper describing the effort. In 2015 Nat McHugh factored the key in 47 minutes using 8 CPUs on Google Cloud.
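
For example, here’s that encoding in a couple lines of Python. (The function name encode is just for illustration.)

def encode(msg):
    # 00 for space, 01 for A, 02 for B, and so on
    return "".join("00" if c == " " else f"{ord(c) - ord('A') + 1:02d}" for c in msg)

print(encode("THE MAGIC WORDS ARE SQUEAMISH OSSIFRAGE"))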

The RSA algorithm presented in Gardner’s article is much simpler than its current implementation, though the core idea remains unchanged. Now we use much larger public keys, the product of two 1024-bit (308-digit) primes or larger. Also, RSA isn’t used to encrypt messages per se; RSA is used to exchange symmetric encryption keys, such as AES keys, which are then used to encrypt messages.

RSA is still widely used, though elliptic curve cryptography (ECC) is taking its place, and eventually both RSA and ECC will presumably be replaced with post-quantum methods.


[1] I met Ron Rivest at the Heidelberg Laureate Forum in 2013. When he introduced himself I said something like “So you’re the ‘R’ in RSA?” He’s probably tired of hearing that, but if so he was too gracious to show it.

Factoring RSA100

Earlier today I wrote about factoring four 255-bit numbers that I needed for a post. Just out of curiosity, I wanted to see how long it would take to factor RSA 100, the smallest of the factoring challenges posed by RSA Laboratories in 1991. This is a 100-digit (330-bit) number that is the product of two primes.

I used the CADO-NFS software. The software was developed in France, and CADO is a French acronym for Crible Algébrique: Distribution, Optimisation. NFS stands for number field sieve, the fastest algorithm for factoring numbers with over 100 digits.

RSA 100 was first factored in 1991 using a few days of compute time on an MP1 MasPar computer, a machine that cost $500,000 at the time, equivalent to around $1,250,000 today.

My effort took about 23 minutes (1376 seconds) on a System 76 Meerkat mini that I paid $600 for in 2022.

The MP1 was about the size of a refrigerator. The Meerkat is about 3″ × 3″ × 1.5″.

Pairing-unfriendly curves

A couple days ago I wrote about two pairs of closely related elliptic curves: Tweedledum and Tweedledee, and Pallas and Vesta.

In each pair, the order of one curve is the order of the base field of the other curve. The curves in each pair are used together in cryptography, but they don’t form a “pairing” in the technical sense of a bilinear pairing, and in fact none of the curves are “pairing-friendly” as described below.

An elliptic curve E/Fq is said to be pairing-friendly if r divides qᵏ − 1 for some small k. Here r is the size of the largest prime-order subgroup of the curve, but since our curves have prime order p, r = p.

As for what constitutes a small value of k, something on the order of 10 would be considered small. The larger k is, the less pairing-friendly the curve is. We will show that our curves are extremely pairing-unfriendly.

Since q is not a multiple of p in our examples, there must be some power of q such that

qᵏ = 1 mod p.

The question is whether k is large, i.e. whether the order of q mod p is large. We could try successive values of k, but that won’t get us very far. To be more clever, we use Lagrange’s theorem that says the order of an element divides the order of the group. So k must be one of the factors of p − 1. (We subtract 1 because we’re looking at the multiplicative group mod p, which removes 0.)
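
Here’s a minimal sketch of that calculation in Python: factor p − 1 with SymPy, then strip prime factors out of p − 1 as long as the corresponding power of q is still 1 mod p.

from sympy import factorint

def mult_order(a, n):
    # Order of a in the multiplicative group mod a prime n.
    # By Lagrange's theorem the order divides n - 1, so start with n - 1
    # and divide out primes while the power is still congruent to 1.
    k = n - 1
    for ell in factorint(n - 1):
        while k % ell == 0 and pow(a, k // ell, n) == 1:
            k //= ell
    return k

With p and q as below, mult_order(q, p) should reproduce the value of k reported for Tweedledum.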

Finding the divisors of n − 1 requires factoring n − 1, which isn’t easy, but isn’t insurmountable either. The previous post reports the time required to do this in Python and in Mathematica for each of the following values of n.

p = 2²⁵⁴ + 4707489544292117082687961190295928833
q = 2²⁵⁴ + 4707489545178046908921067385359695873
r = 2²⁵⁴ + 45560315531419706090280762371685220353
s = 2²⁵⁴ + 45560315531506369815346746415080538113

Tweedledum has order p and its base field has order q.

k = 28948022309329048855892746252171976963322203655954433126947083963168578338816

Tweedledee has order q and its base field has order p.

k = 28948022309329048855892746252171976963322203655955319056773317069363642105856

Vesta has order r and its base field has order s.

k = 14474011154664524427946373126085988481681528240970780357977338382174983815168

Pallas has order s and its base field has order r.

k = 14474011154664524427946373126085988481681528240970823689839871374196681474048

It’s safe to say in each case k is not a small number.


Time to factor big integers in Python and Mathematica

This post will look at the time required to factor n − 1 for each of the following prime numbers n in Python (SymPy) and Mathematica. The next post will explain why I wanted to factor these numbers.

p = 2²⁵⁴ + 4707489544292117082687961190295928833
q = 2²⁵⁴ + 4707489545178046908921067385359695873
r = 2²⁵⁴ + 45560315531419706090280762371685220353
s = 2²⁵⁴ + 45560315531506369815346746415080538113
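
The Python timings amount to something like the following sketch, shown for p; the other values work the same way.

from time import perf_counter
from sympy import factorint

p = 2**254 + 4707489544292117082687961190295928833

t0 = perf_counter()
factors = factorint(p - 1)   # {prime: exponent} dictionary
print(perf_counter() - t0)   # elapsed time in seconds
print(factors)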

Here are the timing results.

    |   |   Python | Mathematica |
    |---+----------+-------------|
    | p |    0.913 |       0.616 |
    | q |    0.003 |       0.002 |
    | r |  582.107 |      14.915 |
    | s | 1065.925 |      20.763 |

This is hardly a carefully designed benchmark, but it’s enough to suggest Mathematica can be a couple orders of magnitude faster than Python.

Here are the factorizations.

p − 1 = 2³⁴ × 3 × 4322432633228119 × 129942003317277863333406104563609448670518081918257
q − 1 = 2³³ × 3 × 5179 × 216901160674121772178243990852639108850176422522235334586122689
r − 1 = 2³² × 3² × 463 × 539204044132271846773 × 8999194758858563409123804352480028797519453
s − 1 = 2³² × 3² × 1709 × 24859 × 1690502597179744445941507 × 10427374428728808478656897599072717

Cycles of Elliptic Curves

The previous post gave two examples of pairs of elliptic curves in which

#E(Fp) = q

and

#E(Fq) = p

That is, the curve E, when defined over the integers mod p, has q elements, and when defined over the integers mod q, has p elements.

Pairs

Silverman and Stange [1] call this arrangement an amicable pair. They found a small pair of amicable elliptic curves:

y² = x³ + 2

with p and q equal to 13 and 19. They give many other examples, but this one is nice because it’s small enough for hand calculations, unlike the curves mentioned in the previous post that had on the order of 2²⁵⁴ elements.

Cycles

More generally, amicable curve pairs are amicable curve cycles with cycle length 2. Silverman and Stange give this example of a cycle of length 3:

y² = x³ − 25x − 8

with (p, q, r) = (83, 79, 73). That is, the curve over F83 has 79 elements, the curve over F79 has 73 elements, and the curve over F73 has 83 elements.

The authors show that there exist cycles of every length m ≥ 2.

Application

Why does any of this matter? Cycles of elliptic curves are useful in cryptography, specifically in zero-knowledge proofs. I hope to go into this further in some future post.

Confirmation

The curves mentioned above are small enough that we can compute the orders of the curves quickly by brute force. The following Python code confirms the claimed orders above.

def order(eqn, p):
    c = 1 # The point at infinity
    for x in range(p):
        for y in range(p):
            if eqn(x, y) % p == 0:
                c += 1
    return c

eqn = lambda x, y: y**2 - x**3 - 2
assert( order( eqn, 13) == 19 )
assert( order( eqn, 19) == 13 )

eqn = lambda x, y: y**2 - x**3 + 25*x + 8
assert( order( eqn, 83) == 79 )
assert( order( eqn, 79) == 73 )
assert( order( eqn, 73) == 83 )


[1] Joseph H. Silverman and Katherine E. Stange. Amicable pairs and aliquot cycles for elliptic curves. Experimental Mathematics, 20(3):329–357, 2011.