DNA Sequence Alignment and Kings

This morning I wrote a post that included the central Delannoy numbers. The nth central Delannoy number Dn counts the number of ways a king can move from one corner of a chessboard to the diagonally opposite corner without backtracking.

The more general Delannoy numbers Dm,n are the analogy for an m × n rectangular board, not necessarily square.

Dm,n is also the number of possible sequence alignments for a strand of DNA with m base pairs and a strand with n base pairs [1]. At each step in the alignment process, you can introduce a gap in the first strand, the second strand or neither, which is analogous to the king who can move N, E, or NE at each step.

The Delannoy numbers can be computed recursively:

def D(m, n):
    if m == 0 or n == 0:
        return 1
    return D(m - 1, n) + D(m, n - 1) + D(m - 1, n - 1)

The code above can be sped up tremendously by adding the decorator

@lru_cache(maxsize=None)

above the function definition to turn on memoization. I did an experiment computing D12,15 with and without memoization and the times were 77.1805 seconds and 0.000062 seconds respectively, i.e. memoization made the code over a million times faster.

Incidentally, D12,15 = 2653649025 and so there are a lot of ways to align even short sequences unless you place some restriction on the permissible alignments.

Update: Here’s a heatmap plotting log10(Dm,n). Obviously the function increases with m and n: bigger chessboards have more possible paths. Moreover, it’s larger along the diagonal (i.e. the central Delannoy numbers). If you look along northeast to southwest diagonals, the function is largest in the middle where m = n.

[1] Torres, A., Cabada, A., & Nieto, J. J. (2003). An exact formula for the number of alignments between two DNA sequences. DNA Sequence, 14(6), 427–430. https://doi.org/10.1080/10425170310001617894

Distinguishing variables from parameters

Imagine the following dialog.

Professorf is a function of a real variable x that takes a real parameter k.

Student: What’s a parameter?

Professor: It’s a constant that can vary.

Student: Then if it can vary, isn’t it a variable?

Professor: Sorta, but no not really.

This conversation plays out over and over, and unfortunately it often ends as it does above, with the student confused. Here’s how I believe the conversation should continue.

Professor: You’re absolutely right that f is a function of two variables, x and k. But usually k is fixed in the context of a specific application and x is not. A different application might have a different, but also fixed, value of k. So it is helpful to think of f(xk), a function of x with a parameter k, rather than f(xk), a function of two variables. The former carries more information, giving a hint as to how the numbers are used.

Is there really a difference between a parameter and a variable? In a reductionistic sense, no. But in a practical sense, yes, absolutely.

It might sound pedantic to distinguish a variable from a parameter, and it is, in the best sense of the word. Pedant literally means teacher. Usually pedantic carries a negative connotation, such as making a distinction without a difference. But here the pedant would be making a helpful distinction.

For example, we might write a probability density function as f(x; μ, σ). The function gives the probability density at a point x. The density depends on parameters μ and σ, and these parameters change between applications, but for a given application they have fixed values.

You find the probability of a random variable taking on values in an interval [ab] by integrating f over that interval. When I say that, you know that I mean you’d integrate with respect to x, because f is a function of x. It is also, in an abstract sense, a function of μ and σ, but it’s typically not useful to think of it that way.

Hypergeometric functions have two sets of parameters, and so you may see two semicolons, such as f(xabc). This denotes a function of the variable x, with upper parameters a and b, and a lower parameter c. In some abstract sense this is a function of four variables, but it acts very differently with respect to x than with respect to ab, and c. There’s also a difference between a and b on the one hand and c on the other, one worth paying attention to, though it is less of a difference than between x and the parameters collectively.

Sometimes you’ll see a vertical bar rather than a semicolon to separate variables from parameters. This works out even better for probability densities because then f(x | μ, σ) suggests the probability density of x given μ and σ since the vertical bar is also used for conditional probability. You might also see f(xa, b; c) for hypergeometric functions, with the vertical bar separating variables from parameters and the semicolon separating two kinds of parameters.

When I first saw a semicolon separating variables from parameters, no explanation was given, and I figured I could mentally replace the semicolon with a comma. Then later I realized that the semicolon was an act of kindness by the author giving the reader additional information.

Silver Rectangles and the Ways of Kings

Golden rectangles

The defining property of golden rectangle is that if you stick a square on its longer side, you get another golden rectangle.

The smaller vertical rectangle is similar to the larger horizontal rectangle. This means

φ / 1 = (1 + φ) / φ

which tells us φ² = 1 + φ and so the golden ratio φ equals (1 + √5)/2.

Silver rectangles

A silver rectangle is one that if you stick two squares on its longer side you get another rectangle with the same aspect ratio.

This tells us

σ / 1 = (1 + 2σ) / σ

and so σ² = 1 + 2σ and the silver ratio is σ = 1 + √2.

Just as you can define a golden ratio and a silver ratio, there’s an analogous way to define a sequence of metallic ratios.

Kings and Delannoy numbers

The silver ratio has several connections to the ways of ways kings. By that I mean the number of ways a king can go from one corner of a chessboard to the diagonally opposite corner without backtracking.

A king can move one space in any direction. If we start with a king in the bottom left corner of the board, the no-backtracking requirement means the king can move up, right, or up and right.

The number of paths a king can take from one corner to the opposite corner of an n × n chessboard is the nth central Delannoy number Dn. more generally Delannoy numbers are defined for an m × n chessboard, but I’ll stick to the case mn called the central Delannoy number, or just Delannoy numbers for short.

The first Delannoy number is 1 because there’s only one way for a king to get from one corner to the other: do nothing, because the opposite corner is the same corner. The second Delannoy number is 3 because the king can move up then right, or right then up, or move diagonally up and right.

For a 3 × 3 grid things are significantly more complicated, and D3 = 13. For an 8 × 8 grid the number of paths is 48,639.

Generating function

How would you estimate the number of paths on an n × n board for large values of n without calculating it exactly? You might start by finding a generating function for the Delannoy numbers, which works out to be

(x² − 6x + 1)−1/2

The radius of convergence r for the generating function series is the distance from 0 to the closest singularity of the generating function, which is the smaller root of

x² − 6x + 1

which is

3 − √8 = (3 + √8)−1 = (1 + √2)−2 = 1/σ²

i.e. the radius of convergence is the reciprocal of the silver ratio squared.

Asymptotic estimate

The radius of convergence gives us a first approximation to the asymptotic size of the series coefficients. Since we’re working with the generating function of the Delannoy numbers, these coefficients are the Delannoy numbers. That is,

Dn ~ rn = (σ2)n = σ2n.

That’s as good as you can do just knowing the radius of convergence. A more careful analysis would refine this estimate by dividing by a factor proportional to √n.

Related posts

Derivative equals inverse

Here’s kind of a strange problem with an interesting solution: find a function f such that the derivative of f equals the inverse of f for all positive x.

f ′(x) = f−1(x)

This is a differential equation, but a very unusual one, one that cannot be solved using any of the techniques taught in a class on differential equations.

The unique solution is

f(x) = φ(x / φ)φ

where φ is the golden ratio. What an unexpected appearance of the golden ratio!

The problem was proposed by H. L. Nelson and solved by A. C. Hindmarsh. See The American Mathematical Monthly, Vol. 76, No. 6 p. 696.

Who you gonna believe: Grok or the docs?

The calculator utility bc has a minimal math library. For example, there’s no tangent function because you’re expected take the ratio of sine and cosine. (The Gnu version of bc does have a function for tangent, but the POSIX version does not.) And yet bc includes support for Bessel functions J(x).

The bc function j takes two arguments. Is the first argument n or x? Grok said the function arguments are j(n,x). I thought I should run man bc just to make sure, and it said

j(x, n) Returns the bessel integer order n (truncated) of x.

So Grok says j(n,x) and the documentation that ships with the software says j(x,n). Which one should you believe? Neither! You should run a little test.

~$ bc -l
>>> j(1, 0)
0
>>> j(0, 1)
.76519768655796655144

Now J1(0) = 0, so apparently the first argument is the order n. Grok was right and the man page was wrong.

Groucho Marx saysing

As further confirmation, let’s see which argument is truncated.

>>> j(1.2, 3.4)
.17922585168150711099
>>> j(1, 3.4)
.17922585168150711099
>>> j(1.2, 3)
.33905895852593645892

The first argument is truncated to an integer value, so that’s the order n.

Turns out there’s a bug in the man page. The man page text above comes from running man bc on my Macbook. On my Linux box, the documentation is correct. It says

j(n,x) The Bessel function of integer order n of x.

The software produces the same results on both computers. It’s just a documentation bug.

The version running on my Macbook is the version that ships with the OS. It’s not the Gnu version, though the documentation says “This bc is compatible with both the GNU bc and the POSIX bc spec.” It has a function t for tangent, for example, which a POSIX version does not. But if you run bc --standard -l attempting to call t produces an error.

Brace expansion tree

Here’s a crazy bash one-liner I found via an article by Peter Krumins:

echo {w,t,}h{e{n{,ce{,forth}},re{,in,fore,with{,al}}},ither,at}

This prints 30 English words:

when, whence, whenceforth, where, wherein, wherefore, wherewith, wherewithal, whither, what, then, thence, thenceforth, there, therein, therefore, therewith, therewithal, thither, that, hen, hence, henceforth, here, herein, herefore, herewith, herewithal, hither, hat

This post will explain how the one-liner works.

Bash brace expansion iterates through all possibilities listed within curly braces, with possibilities separated by a comma. Note that the comma is a separator and not a terminator. And so, for example, the expression {w,t,} is effectively {w,t,""}.

When bash sees two brace expressions, these expand to the cartesian product of the two expressions. For example,

echo {A,B}{1,2,3}

produces

A1 A2 A3 B1 B2 B3

In the expression above we have

{w,t,}h{e…,ither,at}

So the expansion will enumerate all possibilities of {w,h,} multiplied by all possibilities of {e…,ither,at} where e… is itself a brace expression.

A diagram will help a lot.

The brace expansion does a depth-first traversal of this tree.

When will the decimals in a/b repeat?

The previous post looked at how many digits are in the reduced fraction for the nth harmonic number. I was curious about how long the cycle of digits in a harmonic number might be.

I wrote about the period length for the digits of fractions almost a decade ago. This post includes code so I can apply it to harmonic denominators.

from sympy import lcm, factorint, n_order

def period(n):
    factors = factorint(n)
    exp2 = factors.get(2, 0)
    exp5 = factors.get(5, 0)
    r = max(exp2, exp5)

    d = n // (2**exp2 * 5**exp5)
    s = 1 if d == 1 else n_order(10, d)
    return (r, s)

This function returns two numbers: r is the number of non-repeating digits at the beginning and s is the length of the repeating part.

The following code

from functools import reduce

def lcm_range(n):
    return reduce(lcm, range(1, n + 1))

print( period( lcm_range(50) ) )

prints (5, 1275120) meaning that 1/lcm(1, 2, 3, …, 49, 50) has five non-repeating digits following by 1,275,120 digits that repeat ad infinitum. And so the decimals in the expansion of H50 have a cycle length of 1,275,120.

Height of harmonic numbers

The previous post looked at writing the harmonic numbers as reduced fractions and estimating the number of digits in the numerator and denominator based on asymptotics. This is a follow up post with plots.

We’ll choose our base b to be 2. And we’ll look at the total number of bits in both the numerator and denominator, which we will use as the height of the fractions.

First, let’s look at the actual and estimated heights, using the estimates from the previous post.

Next let’s look at the difference between the actual and estimated heights.

In the previous post I looked at n = 50, which was kind of a lucky choice, the error being smaller than usual. I had also looked at, but didn’t publish, n = 100, which would be an unlucky choice.

Finally, let’s look at the relative error in the estimates, and plot over a larger range of n.

The error goes to zero, as predicted by the asymptotic estimates. And it goes noisily, which you’d expect since the heights are related to the distribution of primes.

Writing down harmonic numbers

The nth harmonic number is the sum of the reciprocals of the first n positive integers.

Hn = 1 + 1/2 + 1/3 + 1/4 + … + 1/n

The product of all the denominators is n!, so you could write Hn as a fraction

Hn = p/q

where pn! Hn is an integer and qn!.

While p/q is a way to write Hn as a fraction, it’s not the most efficient because p and n! will have common factors.

If we write Hn as a reduced fraction, the denominator will be the least common multiple of the integers 1 through n. That number is asymptotically exp(n). That estimate follows from the prime number theorem.

So for large n the denominator will be roughly exp(n), and in base b it would have around

n/log(b)

digits.

The numerator will be exp(n) Hn, and since Hn is asymptotically log(n) + γ, the numerator for large n will be roughly

exp(n) (log(n) + γ)

and will have around

(n + log log(n) ) / log(b)

digits.

Let’s see how well our asymptotic estimates work for n = 50. The 50th harmonic number is

H50 = 13943237577224054960759 / 3099044504245996706400.

This fraction has 23 digits in the numerator and 22 in the denominator. We would have predicted around

(50 + log(log(50)))/log(10) = 22.3

digits in the numerator and

50/log(10) = 21.7

digits in the denominator.

Let’s try a larger example, looking at the 1000th harmonic number in binary. We’ll use the following Python code.

from fractions import Fraction

def bits(n):
    H = sum(Fraction(i, i+1) for i in range(1, n+1))
    p, q = H.numerator, H.denominator
    # subtract 2 because bin returns a string starting with 0b.
    return len(bin(p)) - 2, len(bin(q)) - 2

print(bits(1000))

This returns 1448 and 1438. We would have estimated

(1000 + log(log(1000)))/log(2) = 1445.4

bits in the numerator and

1000/log(2) = 1442.7

bits in the denominator.

Update: See the next post for plots as a function of n.

Related posts

Hart’s theorem

Hart’s theorem says

If a triangle be formed by the arcs of three circles, the inscribed and the three escribed circles are all tangent to a new circle or line.

Here “triangle” means a three-sided figure whose sides are portions of a circle. The inscribed circle is the largest circle that can fit inside the three-sided figure.

The “escribed” circles are analogous to the excircles in the previous post: you extend two sides and find a circle that is tangent to the triangle side and the extended side. The difference here being that the side extensions are now circles.