Comparing Truncation to Differential Privacy

Traditional methods of data de-identification obscure data values. For example, you might truncate a date to just the year.

Differential privacy obscures query results by injecting enough noise to keep from revealing information about any individual.

Let’s compare two approaches for de-identifying a person’s age: truncation and differential privacy.

Truncation

First consider truncating birth date to year. For example, anyone born between January 1, 1955 and December 31, 1955 would be recorded as being born in 1955. This effectively produces a 100% confidence interval that is one year wide.

Next we’ll compare this to a 95% confidence interval using ε-differential privacy.

Differential privacy

Differential privacy adds noise in proportion to the sensitivity Δ of a query. Here sensitivity means the maximum impact that one record could have on the result. For example, a query that counts records has sensitivity 1.

Suppose people live to a maximum of 120 years, and consider a query for the average age. Then in a database with n records [1], one person’s presence in or absence from the database would change the result by no more than 120/n years, the worst case corresponding to the extremely unlikely event of a database of n − 1 newborns and one 120-year-old.

Laplace mechanism and CIs

The Laplace mechanism implements ε-differential privacy by adding noise with a Laplace(Δ/ε) distribution, which in our example means Laplace(120/nε).

A Laplace random variable X with scale b satisfies P(|X| > t) = exp(−t/b), so a 95% confidence interval for a Laplace distribution with scale b centered at 0 is

[b log 0.05, −b log 0.05]

which is very nearly

[−3b, 3b].

In our case b = 120/nε, and so a 95% confidence interval for the noise we add would be [−360/nε, 360/nε].

When n = 1000 and ε = 1, this means we’re adding noise that’s between −0.36 and 0.36 with 95% probability, i.e. we know the average age to within about 4 months. But if n = 1, our confidence interval is the true age ± 360. Since this is wider than the a priori bounds of [0, 120], we’d truncate our answer to be between 0 and 120. So we could query for the age of an individual, but we’d learn nothing.
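To make this concrete, here’s a minimal sketch of the mechanism in Python. The data and the query function are made up for illustration; only NumPy is assumed.

    import numpy as np

    rng = np.random.default_rng()

    def dp_average_age(ages, epsilon):
        # The average-age query has sensitivity 120/n: one person can
        # change the average by at most 120/n years.
        n = len(ages)
        scale = 120 / (n * epsilon)  # Laplace scale b = Δ/ε
        return np.mean(ages) + rng.laplace(scale=scale)

    ages = rng.integers(0, 120, size=1000)  # simulated ages
    print(dp_average_age(ages, epsilon=1))  # true mean plus noise, within ±0.36 of it 95% of the time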

Comparison with truncation

For an individual (n = 1), the width of our confidence interval is 720/ε, and so to get a confidence interval one year wide, as we get with truncation, we would set ε = 720. Ordinarily ε is much smaller than 720 in applications, say between 1 and 10, which means differential privacy reveals far less information than truncation does.

Even if you truncate age to decade rather than year, this still reveals more information than differential privacy provided ε < 72.


[1] Ordinarily even the number of records in the database is kept private, but we’ll assume here that for some reason we know the number of rows a priori.

Golden ratio primes

The golden ratio is the larger root of the equation

φ² − φ − 1 = 0.

By analogy, golden ratio primes are prime numbers of the form

p = φ² − φ − 1

where φ is an integer. To put it another way, instead of solving the equation

φ² − φ − 1 = 0

over the real numbers, we’re looking for prime numbers p where the equation can be solved in the integers mod p. [1]

Application

When φ is a large power of 2, these prime numbers are useful in cryptography because their special form makes modular multiplication more efficient. (See the previous post on Ed448.) We could look for such primes with the following Python code.

    from sympy import isprime

    for n in range(1000):
        phi = 2**n
        q = phi**2 - phi - 1
        if isprime(q):
            print(n)

This prints 19 results, including n = 224, corresponding to the golden ratio prime in the previous post. This is the only output where n is a multiple of 32, which was useful in the design of Ed448.

Golden ratio primes in general

Of course you could look for golden ratio primes where φ is not a power of 2. It’s just that powers of 2 are the application where I first encountered them.

A prime number p is a golden ratio prime if there exists an integer φ such that

p = φ² − φ − 1

which, by the quadratic formula, is equivalent to requiring that m = 4p + 5 be a square. In that case

φ = (1 + √m)/2.

Here’s some code for seeing which primes less than 1000 are golden ratio primes.

    from sympy import primerange

    def issquare(m):
        # Crude test via floating point square root; fine for small m
        return int(m**0.5)**2 == m

    for p in primerange(2, 1000):
        m = 4*p + 5
        if issquare(m):
            phi = (int(m**0.5) + 1) // 2
            assert(p == phi**2 - phi - 1)
            print(p)

By the way, there are faster ways to determine whether an integer is a square. See this post for algorithms.

(Update: Aaron Meurer pointed out in the comments that SymPy has an efficient function sympy.ntheory.primetest.is_square for testing whether a number is a square.)

Instead of looping over primes and testing whether it’s possible to solve for φ, we could loop over φ and test whether φ leads to a prime number.

    from sympy import isprime

    for phi in range(1000):
        p = phi**2 - phi - 1
        if isprime(p):
            print(phi, p)

Examples

The smallest golden ratio prime is p = 5, with φ = 3.

Here’s a cute one: the pi prime 314159 is a golden ratio prime, with φ = 561.

The golden ratio prime that started this rabbit trail was the one with φ = 2²²⁴, which Mike Hamburg calls the Goldilocks prime in his design of Ed448.


[1] If p = φ² − φ − 1 for some integer φ, then φ² − φ − 1 ≡ 0 (mod p). But the congruence can have a solution when p is not a golden ratio prime. The following code shows that the smallest example is p = 31 and φ = 13.

    from sympy import primerange
    from sympy.ntheory.primetest import is_square

    for p in primerange(2, 100):
        m = 4*p + 5
        if not is_square(m):
            # p is not a golden ratio prime, but the congruence
            # may still have a solution
            for x in range(p):
                if (x**2 - x - 1) % p == 0:
                    print(p, x)
                    exit()

Goldilocks and the three multiplications

Illustration by Arthur Rackham, 1918. Public domain.

Mike Hamburg designed an elliptic curve for use in cryptography that he calls Ed448-Goldilocks. The prefix Ed refers to the fact that it’s an Edwards curve. The number 448 refers to the fact that the curve is over a prime field where the prime p has size 448 bits. But why Goldilocks?

Golden primes and Goldilocks

The prime in this case is

p = 2⁴⁴⁸ − 2²²⁴ − 1,

which has the same form as the NIST primes. Hamburg says in his paper

I call this the “Goldilocks” prime because its form defines the golden ratio φ = 2²²⁴.

That sentence puzzled me. What does this have to do with the golden ratio? The connection is that Hamburg’s prime is of the form

φ² − φ − 1.

The roots of this polynomial are the golden ratio and its conjugate. But instead of looking for real numbers where the polynomial is zero, we’re looking for integers where the polynomial is prime. (See the followup post on golden ratio primes.)

The particular prime that Hamburg uses is the “Goldilocks” prime by analogy with the fairy tale: the middle term 2²²⁴ is just the right size. He explains

Because 224 = 32*7 = 28*8 = 56*4, this prime supports fast arithmetic in radix 2²⁸ or 2³² (on 32-bit machines) or 2⁵⁶ (on 64-bit machines). With 16, 28-bit limbs it works well on vector units such as NEON. Furthermore, radix-2⁶⁴ implementations are possible with greater efficiency than most of the NIST primes.

Karatsuba multiplication

The title of this post is “Goldilocks and the three multiplications.” Where do the three multiplications come in? It’s an allusion to an algorithm for multi-precision multiplication that lets you get by with three multiplications where the most direct approach would require four. The algorithm is called Karatsuba multiplication [1].

Hamburg says “The main advantage of a golden-ratio prime is fast Karatsuba multiplication” and that if we set φ = 2²²⁴ then

\begin{align*} (a + b\phi)(c + d\phi) &= ac + (ad+bc)\phi + bd\phi^2 \\ &\equiv (ac+bd) + (ad+bc+bd)\phi \pmod{p} \\ &= (ac + bd) +((a+b)(c+d) - ac)\phi \end{align*}

Since the variables represent 224-bit numbers, removing a multiplication at the expense of an extra addition and subtraction is a net savings [2].

The most important line of the calculation above, and the only one that isn’t routine, is the second. That’s where the special form of p comes in. When you eliminate common terms from both sides, the calculation boils down to showing that

bd(\phi^2 - \phi - 1) \equiv 0 \pmod{p}

which is obviously true since p = φ² − φ − 1.
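Here’s a minimal sketch of the trick in Python, checked against direct multiplication. The variable names follow the calculation above; a real implementation would work on 224-bit halves with machine-word limbs rather than Python’s big integers.

    phi = 2**224
    p = phi**2 - phi - 1  # the Goldilocks prime

    def mul_goldilocks(a, b, c, d):
        # (a + b*phi)(c + d*phi) mod p using three big multiplications.
        # Multiplying by phi is a 224-bit shift, not a big multiplication.
        ac = a*c
        bd = b*d
        s = (a + b)*(c + d)  # = ac + ad + bc + bd
        return (ac + bd + (s - ac)*phi) % p

    a, b, c, d = 3, 5, 7, 11
    assert mul_goldilocks(a, b, c, d) == ((a + b*phi)*(c + d*phi)) % p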

Curve Ed448-Goldilocks

Edwards curves only have one free parameter d (besides the choice of field) since they have the form

x² + y² = 1 + d x² y².

Hamburg chose d = −39081 for reasons explained in the paper.

Most elliptic curves used in ECC currently work over prime fields of order 256 bits, providing 128 bits of security. The motivation for developing Ed448 was much the same as for developing P-384. Both work over larger fields and so provide more bits of security: 224 and 192 bits respectively.

Unlike P-384, Ed448 is a safe curve, meaning that it lends itself to a secure practical implementation.


[1] Here we’ve just applied the Karatsuba algorithm one time. You could apply it recursively to multiply two n-bit numbers in O(nᵏ) time, where k = log₂ 3 ≈ 1.585. This algorithm, discovered in 1960, was the first multiplication algorithm faster than O(n²).

[2] Addition and subtraction are O(n) operations. And what about multiplication? That’s not an easy question. It’s no worse than O(nᵏ) with k = log₂ 3 ≈ 1.585, by virtue of the Karatsuba algorithm. In fact, it’s O(n log n), but only for impractically large numbers. See the discussion here. But in any case, multiplication is slower than addition for multi-precision numbers.

Tricks for arithmetic modulo NIST primes

The US National Institute of Standards and Technology (NIST) originally recommended 15 elliptic curves for use in elliptic curve cryptography [1]. Ten of these are over a field of size 2ⁿ. The other five are over prime fields. The sizes of these fields are known as the NIST primes.

The NIST curves over prime fields are named after the number of bits in the prime: the name is “P-” followed by the number of bits. The primes themselves are named p with a subscript for the number of bits.

The five NIST primes are

p₁₉₂ = 2¹⁹² − 2⁶⁴ − 1
p₂₂₄ = 2²²⁴ − 2⁹⁶ + 1
p₂₅₆ = 2²⁵⁶ − 2²²⁴ + 2¹⁹² + 2⁹⁶ − 1
p₃₈₄ = 2³⁸⁴ − 2¹²⁸ − 2⁹⁶ + 2³² − 1
p₅₂₁ = 2⁵²¹ − 1

The largest of these, p₅₂₁, is a Mersenne prime, and the rest are generalized Mersenne primes.

Except for p₅₂₁, the exponents of 2 in the definitions of the NIST primes are all multiples of 32 or 64. This leads to efficient tricks for arithmetic modulo these primes carried out with 32-bit or 64-bit integers. You can find pseudocode implementations for these tricks in Mathematical routines for the NIST prime elliptic curves.
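To illustrate the idea, here’s a sketch of reduction modulo the simplest of these primes, the Mersenne prime p₅₂₁. Since 2⁵²¹ ≡ 1 (mod p₅₂₁), the high bits fold into the low bits with shifts and masks, with no division. This is Python for readability; real implementations do the same thing on 32-bit or 64-bit words.

    p521 = 2**521 - 1

    def reduce_p521(x):
        # Write x = hi * 2**521 + lo. Since 2**521 = 1 mod p521,
        # x = hi + lo mod p521. Fold twice, then subtract if needed.
        x = (x & p521) + (x >> 521)
        x = (x & p521) + (x >> 521)
        if x >= p521:
            x -= p521
        return x

    a = (p521 - 12345)*(p521 - 67890)  # a product less than p521 squared
    assert reduce_p521(a) == a % p521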

The elliptic curve Ed448 “Goldilocks” was not part of the original set of recommended curves from NIST but has been added. It employs a multiplication trick in the same spirit as the routines referenced above, but simpler. Ed448 uses

p = 2⁴⁴⁸ − 2²²⁴ − 1

which has the special form φ² − φ − 1 where φ = 2²²⁴. This enables a trick known as Karatsuba multiplication. More on that here.


[1] FIPS PUB 186-4. This publication is dated 2013, but the curve definitions are older. I haven’t been able to find for certain when the curves were first defined. I’ve seen one source that says 1997 and another that says 1999.

Elliptic curve P-384

The various elliptic curves used in elliptic curve cryptography (ECC) have different properties, and we’ve looked at several of them before. For example, Curve25519 is implemented very efficiently, and the parameters were transparently chosen. Curve1174 is interesting because it’s an Edwards curve and has a special addition formula.

This post looks at curve P-384. What’s special about this curve? It’s the elliptic curve that the NSA recommends everyone use until post-quantum methods have been standardized. It provides 192 bits of security, whereas more commonly used curves provide 128 bits.

Does the NSA recommend this method because they know how to get around it? Possibly, but they also need to recommend methods that they believe foreign governments cannot break.

The equation of the P-384 curve is

y² = x³ + ax + b

working over the field of integers modulo a prime p. We will go into each of the specific parameters a, b, and p, and discuss how they were chosen.

Modulus p

Consistent with the naming conventions for elliptic curves used in cryptography, the name “P-384” tells you that the curve is over a prime field where the prime is a 384-bit integer. Specifically, the order of the field is

p = 2³⁸⁴ − 2¹²⁸ − 2⁹⁶ + 2³² − 1

For a given number of bits, in this case 384, you want to pick a prime that’s relatively near the maximum size for that number of bits. In our case, p is a prime near 2³⁸⁴ with a convenient bit pattern. (The special pattern allows implementation tricks that increase efficiency.)

Hasse’s theorem says that the number of points on a curve modulo a large prime p is within 2√p of p + 1, so P-384 contains approximately 2³⁸⁴ points. In fact, the number of points n on the curve is

39402006196394479212279040100143613805079739270465446667946905279627659399113263569398956308152294913554433653942643

or approximately 2³⁸⁴ − 2¹⁹⁰. The number n is a prime, and so it is the order of P-384 as a group.
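If you’d like to verify these claims, here’s a quick check with SymPy, assuming the digits above were transcribed correctly.

    from sympy import isprime

    n = 39402006196394479212279040100143613805079739270465446667946905279627659399113263569398956308152294913554433653942643

    print(isprime(n))                 # True: n is prime
    print((2**384 - n).bit_length())  # 190, so n is approximately 2**384 - 2**190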

Linear coefficient a

According to a footnote in the standard defining P-384, FIPS PUB 186-4,

The selection a ≡ −3 for the coefficient of x was made for reasons of efficiency; see IEEE Std 1363-2000.

All the NIST elliptic curves over prime fields use a = −3 because this makes it possible to use special algorithms for elliptic curve arithmetic.

Constant coefficient b

The curve P-384 has Weierstrass form

y² = x³ − 3x + b

where b is

27580193559959705877849011840389048093056905856361568521428707301988689241309860865136260764883745107765439761230575.

The parameter b is between 2³⁸³ and 2³⁸⁴ but doesn’t have any particular binary pattern:

101100110011000100101111101001111110001000111110111001111110010010011000100011100000010101101011111000111111100000101101000110010001100000011101100111000110111011111110100000010100000100010010000000110001010000001000100011110101000000010011100001110101101011000110010101100011100110001101100010100010111011010001100111010010101010000101110010001110110111010011111011000010101011101111
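You can reproduce the bit pattern above from the decimal value of b:

    b = 27580193559959705877849011840389048093056905856361568521428707301988689241309860865136260764883745107765439761230575
    print(b.bit_length())  # 384, consistent with 2**383 < b < 2**384
    print(format(b, "b"))  # the binary expansion shown above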

The specification says that b was chosen at random. How can you convince someone that you chose a parameter at random?

The standard gives a 160-bit seed s, and a hash-based algorithm that s was run through to create a 384-bit parameter c. Then b is the solution to

b²c ≡ −27 (mod p).

The algorithm going from the seed s to the parameter c is given in Appendix D.6 and is a sort of key-stretching algorithm. The standard cites ANS X9.62 and IEEE Standard 1363-2000 as the source of the algorithm.

If b was designed to have a back door, presumably a tremendous amount of computation had to go into reverse engineering the seed s.

Koblitz and Menezes wrote a paper in which they suggest a way that the NSA might have picked seeds that lead to weak elliptic curves, but then argue against it.

It is far-fetched to speculate that the NSA would have deliberately selected weak elliptic curves in 1997 for U.S. government usage … confident that no one else would be able to discover the weakness in these curves in the ensuing decades. Such a risky move by the NSA would have been inconsistent with the Agency’s mission.


Bessel function crossings

The previous post looked at the angles that graphs make when they cross. For example, sin(x) and cos(x) always cross with the same angle. The same holds for sin(kx) and cos(kx) since the k simply rescales the x-axis.

The post ended with wondering about functions analogous to sine and cosine, such as Bessel functions. This post will look at that question in more detail. Specifically we’ll look at the functions Jν and Yν.

Because these two Bessel functions satisfy the same second order linear homogeneous differential equation, the Sturm separation theorem says that their zeros are interlaced: between each pair of consecutive zeros of Jν is exactly one zero of Yν, and between each pair of consecutive zeros of Yν there is exactly one zero of Jν.

Plotting Bessel functions J_3 and Y_3

In the following Python code, we find zeros of Jν, then look in between for places where Jν and Yν cross. Next we find the angle the two curves make at each intersection and plot the angles.

    from scipy.special import jn_zeros, jv, yv
    from scipy.optimize import bisect
    from numpy import empty, linspace, arccos
    import matplotlib.pyplot as plt
    
    n = 3 # bessel function order
    N = 100 # number of zeros
    
    z = jn_zeros(n, N) # Zeros of J_n
    crossings = empty(N-1)
    
    f = lambda x: jv(n,x) - yv(n,x)    
    for i in range(N-1):
        crossings[i] = bisect(f, z[i], z[i+1])
    
    def angle(n, x):
        # Derivatives via the identity f_n' = (f_{n-1} - f_{n+1})/2
        dj = 0.5*(jv(n-1,x) - jv(n+1,x))
        dy = 0.5*(yv(n-1,x) - yv(n+1,x))

        # Angle between the tangent vectors (1, dj) and (1, dy)
        top = 1 + dj*dy
        bottom = ((1 + dj**2)*(1 + dy**2))**0.5
        return arccos(top/bottom)
        
    y = angle(n, crossings)
    plt.plot(y)
    plt.xlabel("Crossing number")
    plt.ylabel("Angle in radians")
    plt.show()

This shows that the angles steadily decrease, apparently quadratically.

Angles of crossing of J_3 and Y_3

This quadratic behavior is what we should expect from the asymptotics of Jν and Yν: for large arguments they act like shifted and rescaled versions of sin(x)/√x. So if we looked at √x Jν and √x Yν rather than Jν and Yν we’d expect the angles to approach some positive asymptote, and they do, as shown below.

Angles of crossing of √x J_3 and √xY_3
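Only the derivative calculation changes for the rescaled functions: by the product rule, the slope of √x Jν is Jν/(2√x) + √x Jν′. Here’s a sketch that reuses n, crossings, and the imports from the script above. The crossings themselves are unchanged, since √x Jν = √x Yν exactly where Jν = Yν.

    from numpy import sqrt

    def angle_rescaled(n, x):
        # Slopes of sqrt(x)*J_n and sqrt(x)*Y_n by the product rule
        dj = 0.5*(jv(n-1,x) - jv(n+1,x))
        dy = 0.5*(yv(n-1,x) - yv(n+1,x))
        mj = jv(n,x)/(2*sqrt(x)) + sqrt(x)*dj
        my = yv(n,x)/(2*sqrt(x)) + sqrt(x)*dy

        top = 1 + mj*my
        bottom = ((1 + mj**2)*(1 + my**2))**0.5
        return arccos(top/bottom)

    plt.plot(angle_rescaled(n, crossings))
    plt.xlabel("Crossing number")
    plt.ylabel("Angle in radians")
    plt.show()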


Orthogonal graphs

Colin Wright posted a tweet yesterday that said that the plots of cosine and tangent are orthogonal. Here’s a plot so you can see for yourself.

Plot of cosine and tangent

Jim Simons replied with a proof so short it fits in a tweet: The product of the derivatives is −sin(x)sec²(x) = −tan(x)/cos(x), which is −1 if cos(x)=tan(x).

This made me wonder whether sine and cosine are orthogonal in the sense of graphs intersecting at right angles. They are orthogonal in the sense that their product integrates to zero over the interval [0, 2π], a fact that’s important in Fourier analysis, but are they orthogonal in the sense of their graphs intersecting at right angles? A graph makes this look plausible:

Plot of sine and cosine

But the graph is misleading. I made the plot without specifying the aspect ratio, using the default in Mathematica. This makes the angle between the graphs look smaller. Setting the aspect ratio to 1 shows the true picture. The two curves intersect at π/4 and 5π/4, and they intersect at an angle of arccos(1/3), about 70.5°, not π/2.
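Here’s a quick numerical check of that angle, using the same tangent-vector formula as in the Bessel crossing post above.

    from math import pi, sin, cos, acos, sqrt

    x = pi/4                   # where sine and cosine cross
    m1, m2 = cos(x), -sin(x)   # slopes of sine and cosine there

    print(m1*m2)  # -0.5: negative and constant, but not -1

    top = 1 + m1*m2
    bottom = sqrt((1 + m1**2)*(1 + m2**2))
    print(acos(top/bottom))  # 1.2309... radians, about 70.5 degrees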

The product of the slopes is not −1, but it is negative and constant, so you could multiply each function by some constant to make the product of slopes −1. Said another way, you could make the curves perpendicular by adjusting the aspect ratio.

Can you do this for other functions that are orthogonal in the inner product sense? Not easily. For example, sin(2x) and sin(3x) are orthogonal in the inner product (integral) sense, but the angles of intersection are different at different places where the curves cross.

What about other functions that are like sine and cosine? I looked at the Bessel functions J1 and J2, but the angles at the intersections vary. Ditto for Chebyshev polynomials. I suppose the difference is that sine and cosine are translations of each other, whereas that’s not the case for Bessel or Chebyshev functions. But it is true for wavelets, so you could find wavelets that are orthogonal in the sense of inner products and in the sense of perpendicular graphs.

Update: See more on how Bessel functions cross in the next post.

Fascination burnout

Here’s a little dialog from Anathem by Neal Stephenson that I can relate to:

“… I don’t care …”

Asribalt was horrified. “But how can you not be fascinated by—”

“I am fascinated,” I insisted. “That’s the problem. I’m suffering from fascination burnout. Of all the things that are fascinating, I have to choose just one or two.”

Area and volume of Menger sponge

The Menger sponge is the fractal you get by starting with a cube, dividing each face into a 3 by 3 grid (like a Rubik’s cube) and removing the middle square of each face and everything behind it. That’s M₁, the Menger sponge at the first stage of its construction. The next stage repeats this process on all the little cubes that make up what’s left. That’s M₂. Repeat the process over and over, and in the limit you get Menger’s sponge, a fractal with zero volume and infinite area!

This business of zero volume and infinite area may sound unreal, but the steps along the way to the limit are very tangible. Here’s a video by Aaron Benzel that lets you fly through M₃, and watch what happens if you split M₃ apart.

Each stage divides every remaining cube into 27 sub-cubes and removes 7 of them (the 6 face centers and the central cube), keeping 20/27 of the volume. You can compute the volume and area at each stage to show that

\mathrm{volume}(M_n) = \left(\frac{20}{27} \right )^n

and

\mathrm{area}(M_n) = 2\left(\frac{20}{9} \right )^n + 4 \left(\frac{8}{9} \right )^n

From these equations you can see that you can make the volume as small as you’d like, and the area as large as you like, by taking n big enough. And in case that sounds a little hand-wavey, we can get more concrete. Here’s a little code to find exactly how big a value of n is big enough.

    from math import log, ceil
    
    def menger_volume(n):
        return (20/27.)**n
    
    def menger_area(n):
        return 2*(20/9.)**n + 4*(8/9.)**n
    
    def volume_below(v):
        if v >=1:
            return 1
        else:
            n = log(v)/log(20/27.)
            return int(ceil(n)) 
    
    def area_above(a):
        if a < 2:
            return 1
        else:
            n = (log(a) - log(2))/log(20/9.)
            return int(ceil(n))
    
    n = volume_below(0.001)
    print( n, menger_volume(n) )
    
    n =  area_above(1000)
    print( n, menger_area(n) )


Regular expression for ICD-9 and ICD-10 codes

Suppose you’re searching for medical diagnosis codes in the middle of free text. One way to go about this would be to search for each of the roughly 14,000 ICD-9 codes and each of the roughly 70,000 ICD-10 codes. A simpler approach would be to use regular expressions, though that may not be as precise.

In practice regular expressions may have some false positives or false negatives. The expressions given here have only false positives. That is, no valid ICD-9 or ICD-10 codes will go unmatched, but the regular expressions may match things that are not diagnosis codes. The latter is inevitable anyway since a string of characters could coincide with a diagnosis code but not be used as a diagnosis code. For example 1234 is a valid ICD-9 code, but 1234 in a document could refer to other things, such as a street address.

ICD-9 diagnosis code format

Most ICD-9 diagnosis codes are just numbers, but they may also start with E or V.

Numeric ICD-9 codes are at least three digits. Optionally there may be a decimal followed by one or two more digits.

An E code begins with E and three digits. These may be followed by a decimal and one more digit.

A V code begins with a V followed by two digits. These may be followed by a decimal and one or two more digits.

Sometimes the decimals are left out.

Here are regular expressions that summarize the discussion above.

    N = "\d{3}\.?\d{0,2}"
    E = "E\d{3}\.?\d?"
    V = "V\d{2}\.?\d{0,2}"
    icd9_regex = "|".join([N, E, V])

Usually E and V are capitalized, but they don’t have to be, so it would be best to do a case-insensitive match.
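For example, here’s one way the combined pattern might be used. The sample sentence is made up.

    import re

    N = r"\d{3}\.?\d{0,2}"
    E = r"E\d{3}\.?\d?"
    V = r"V\d{2}\.?\d{0,2}"
    icd9_regex = "|".join([N, E, V])

    text = "Diagnoses 250.00, V45.81, and e880.2 were recorded."
    print(re.findall(icd9_regex, text, re.IGNORECASE))
    # ['250.00', 'V45.81', 'e880.2']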

ICD-10 diagnosis code format

ICD-10 diagnosis codes always begin with a letter (except U) followed by a digit. The third character is usually a digit, but could be an A or B [1]. After the first three characters, there may be a decimal point, and up to four more alphanumeric characters. These alphanumeric characters are never U. Sometimes the decimal is left out.

So the following regular expression would match any ICD-10 diagnosis code.

    [A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}

As with ICD-9 codes, the letters are usually capitalized, but not always, so it’s best to do a case-insensitive search. In addition to the pattern above, “special codes” may begin with U.
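And similarly for the ICD-10 pattern, again with a made-up sample:

    import re

    icd10_regex = r"[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}"
    text = "Diagnoses E11.9, s72.001a, and Z3A.29 were recorded."
    print(re.findall(icd10_regex, text, re.IGNORECASE))
    # ['E11.9', 's72.001a', 'Z3A.29']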

Testing the regular expressions

As mentioned at the beginning, the regular expressions here may have false positives. However, they don’t let any valid codes slip by. I downloaded lists of ICD-9 and ICD-10 codes from the CDC and tested to make sure the regular expressions here matched every code.
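Here’s a sketch of that kind of test for the ICD-10 expression. The file name is hypothetical, standing in for a downloaded list with one code per line.

    import re

    icd10_regex = r"[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}"
    pattern = re.compile(icd10_regex, re.IGNORECASE)

    # icd10_codes.txt is a hypothetical file, one ICD-10 code per line
    with open("icd10_codes.txt") as f:
        for line in f:
            code = line.strip()
            assert pattern.fullmatch(code), f"unmatched code: {code}"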

Regular expression features used

Character ranges are supported everywhere, such as [A-TV-Z] for the letters A through T and V through Z.

Not every regular expression implementation supports \d to represent a digit. In Emacs, for example, you would have to use [0-9] instead since it doesn’t support \d.

I’ve used \.? for an optional decimal point. (The . is a special character in regular expressions, so it needs to be escaped to represent a literal period.) Some people would write [.]? instead on the grounds that it may be more readable. (Periods are not special characters in the context of a character class.)

I’ve used {m} for a pattern that is repeated exactly m times, and {m,n} for a pattern that is repeated between m and n times. This is supported in Perl and Python, for example, but not everywhere. You could write \d\d\d instead of \d{3} and \d?\d? instead of \d{0,2}.


[1] The only ICD-10 codes with a non-digit in the third position are those beginning with C4A, C7A, C7B, D3A, M1A, O9A, and Z3A.