Dithered QR codes

I saw a post by Dave Richeson on Mastodon about making QR codes that look like images. Turns out you can shrink a black square in a QR code by up to a factor of three while keeping the code usable. This gives you the wiggle room to create dithered images.

I tried a few examples using this software. The code works with photos, but it works really well with simpler images like the ones below.

Here’s one for my data privacy company, Kingwood Data Privacy LLC.

kingwooddataprivacy.com

And here’s one for my account @CompSciFact on X.

https://x.com/CompSciFact

Probability of typing a wrong Bitcoin address

I heard someone say that Bitcoin is dangerous because you could easily make a typo when entering an address, sending money to the wrong person, and have no recourse. There are dangers associated with Bitcoin, such as losing a private key, but address typos are not a major concern.

Checksums

There are several kinds of Bitcoin addresses. Each is at least 20 bytes (160 bits) long, with at least 4 bytes (32 bits) of checksum. The chances of a typo resulting in a valid checksum are about 1 in 232.

Used addresses

Let’s ignore the checksum for this section.

Because addresses are formed by cryptographic hash functions, we can assume the values are essentially randomly distributed in the space of possible addresses. The addresses are deterministic, but for modeling purposes, random is as random does.

This means a typo of an actual address is no more or less likely to be another actual address than an address typed at random. This is unlike, say, English words: a mistyped English word is more likely to be another English word than random keystrokes would be.

There have been on the order of a billion Bitcoin addresses used, in a space of 2160 possibilities. (Actually more since some addresses have more than 160 bits.) There’s about a 1 in 1039 chance that a random 160-bit sequence corresponds to an address somewhere on the Bitcoin blockchain.

Addresses close in edit distance

Someone with the Caesarean handle Veni Vidi Vici on X asked

What about the odds that out of those 1B addresses, two of them are one character swap away from each other?

That’s an interesting question. Let’s assume the addresses are Base58-encoded strings of length 26. Addresses could be longer, but assuming the minimum length increases the probability of addresses being close.

How many addresses are within one or two character swaps of another? I addressed a similar question here a couple weeks ago. If all the characters were unique, the number of strings within k swaps of each other would be

|S1(26, 26 − k)|

where S1 denotes Stirling numbers of the first kind. For k = 1 this would be 325 and for k = 2 this would be 50,050. This assumes all the characters are unique; I haven’t thought through the case where characters are repeated.

For round numbers, let’s say there are a billion addresses, and for each address there are a million other addresses that are close in some sense, plausible typos of the address. That would be 1012 addresses and typos, spread out in a space of ≈1045 (i.e. 5826) possible addresses.

Now there’s an implicit Birthday Problem here. No particular address is likely to collide with another, even when you allow typos, but what about the likelihood that some address collides?

Say we partition our space of 1045 addresses into N = 1029 addresses with a million possible typos for each address. Then as a rule of thumb, you’d need around √N random draws before you have a 50-50 chance of seeing a collision. Since 109 is a lot less than 1014.5, it’s unlikely that any two addresses collide, even when you consider each address along with a million associated typos.

Why are CUDA kernels hard to optimize?

Explosive datacenter demand has caused developers to leave no stone unturned in search of higher efficiencies. The DeepSeek team, not satisfied with Nvidia’s CUDA libraries, used a virtualized form of assembly language (PTX) to write kernel codes to accelerate their AI computations. Others have attempted to generate optimized kernels using AI, though some results have been questioned (for various attempts, see also here, here, here, here and here).

Why is it hard to write peak-speed GPU code? Writing really fast code has always been arduous, but it seems especially so for modern GPUs.

To understand the issues, my colleagues and I performed a detailed study of GPU kernel performance, across eight different GPU models from three GPU vendors [1]. The test case we considered was low precision matrix multiply, a resource-intensive operation for LLM training. We ran many, many experiments to understand what causes performance variability and why kernels sometimes run slower than you’d think they should.

For the cases we studied, we found about half a dozen different factors, but the upshot is this: modern processors like GPUs have become so complex—notably their multi-layered hierarchical memory subsystems—that it is difficult to get consistently high performance across all problem sizes a user might want to run in practice. As a result, the performance for the target problem might be surprisingly and mysteriously less than the advertised peak performance for the operation in question. The reasons might be obvious—like cache line misalignment—or more opaque. For the matrix multiply case, various issues like the need for prefetching, caching, tiling and block size selection, make it difficult for the kernel developer to optimize for every input size a user might specify.

Below is an example graphic from our paper. The color indicates floating point operation rate (FLOPs) for a reduced precision matrix multiply on a representative GPU using a library call. The horizontal and vertical axes refer to the matrix dimensions for the problem (see paper for details). Though some regions show performance near the theoretical peak (red), other immediately adjacent regions show problem sizes that run dramatically less—in fact, only about half of peak performance, or less. Presumably this is because either individual kernel performance or the selection of kernels used by the library is suboptimal. The net outcome is, if your problem lands a “bad” region, you’re in for a big surprise, your performance will be much less than expected, and you may not understand why. All high-performing GPUs we tested showed irregular behaviors such as this [2] [3].

In the past this was not always a problem.  Older architectures like Sun Sparc or Cray vector processor, complex as they were, were simple enough that a reasonably well-tuned computational kernel might run well across most if not all inputs [4]. Today, performance is much harder to predict and can vary substantially based on the requested problem sizes.

This is a tough challenge for library developers. Whenever a new GPU model family comes out, new kernel optimization and tuning are required to give (hopefully) more consistently high performance, and some cases get more developer attention than others due to customer needs and limited developer resources. As a result, infrequently used operations do not get as much attention, but they may be the exact ones you need for your particular case [5].

Tools are available to help optimize for specific cases. The excellent Nvidia CUTLASS library exposes access to many more fine-grained options compared to the standard cuBLAS library. The not faint of heart can try programming Nvidia GPUs at the level of PTX, or (shudder) SASS. Superoptimization might help, but only for very small code fragments and even then there may be too many external factors influencing performance to make it effective.

Autotuning is a promising approach though it doesn’t seem to have reached its full potential in production. AI might really help here [6]; in our own paper we had some success using machine learning methods like decision trees and random forests to model performance as a function of problem size, though our work was exploratory and not production-ready. To make a well-crafted general solution it would seem would require a lot of effort to do right. Code sustainability and maintenance are also critical; a sustainable workflow would be needed to retrain on new GPUs, new CUDA releases and even site-specific and system-specific settings like GPU power and frequency cap policies.

Most recent AI-driven work focuses on optimizing performance for one or a few problem sizes only. A truly production-quality general purpose tool would give both 100% accurate results and also top achievable performance for any input problem size (even for corner cases) or data type. This would require both optimized GPU kernels and optimal kernel dispatcher for kernel selection. And the method would need to be robust to issues like power and frequency variabilities in production runs. This would seem to currently be an unsolved problem. Solving it would be of huge benefit to the hyperscaler community.

Notes

[1] For related work from a slightly different angle, see this excellent work from Matt Sinclair’s lab.

[2] It turned out this study was helpful to us for production runs, to help us to triage an odd performance conundrum we encountered when attempting an exascale run (see here, here).

[3] Incidentally this example shows the hazards of simplistic benchmark suites to measure GPU code performance. Unless the benchmark captures a truly large and varied set of input cases, any new optimization method proposed can artificially “overfit” performance on the tests and still underperform miserably on many user cases of interest.

[4] I once wrote a 1-D wavelet convolution kernel for a Sparc processor, using a circular register buffer and loop unrolling to minimize loads and stores, this achieving near-peak performance. The code was correctly compiled from C to assembly, and performance for a given problem was almost precisely predictable. That was before the days of complex memory hierarchies.

[5] One vendor I know of used to take customer requests for hand tuning expensive library calls and made them run fast at the specific customer problem sizes.

[6] LLM kernel generation seems like a natural fit, particularly since LLM-generated code quality has much improved in recent months. Kernel selection and parameter selection for block size, tiling etc. might be better solved by direct training of machine learning models, or methods like this. Comparative studies on this would be informative.

 

The biggest math symbol

The biggest math symbol that I can think of is the Riemann P-symbol

w(z)=P \left\{ \begin{matrix} a & b & c & \; \\ \alpha & \beta & \gamma & z \\ \alpha^\prime & \beta^\prime & \gamma^\prime & \; \end{matrix} \right\}

The symbol is also known as the Papperitz symbol because Erwin Papperitz invented the symbol for expressing solutions to Bernard Riemann’s differential equation.

Before writing out Riemann’s differential equation, we note that the equation has regular singular points at ab, and c. In fact, that is its defining feature: it is the most general linear second order differential equation with three regular singular points. The parameters ab, and c enter the equation in the as roots of an expression in denominators; that’s as it has to be if these are the singular points.

The way the Greek letter parameters enter Riemann’s equation is more complicated, but there is a good reason for the complication: the notation makes solutions transform as simply as possible under a bilinear transformation. This is important because Möbius transformations are the conformal automorphisms of the Riemann sphere.

To be specific, let

m(z) = \frac{Az + B}{Cz + D}

be a Möbius transformation. Then

P \left\{ \begin{matrix} a & b & c & \; \\ \alpha & \beta & \gamma & z \\ \alpha^\prime & \beta^\prime & \gamma^\prime & \; \end{matrix} \right\} = P \left\{ \begin{matrix} m(a) & m(b) & m(c) & \; \\ \alpha & \beta & \gamma & m(z) \\ \alpha^\prime & \beta^\prime & \gamma^\prime & \; \end{matrix} \right\}

Since the parameters on the top row of the P-symbol are the locations of singularities, when you transform a solution, moving the singularities, the new parameters have to be the new locations of the singularities. And importantly the rest of the parameters do not change.

Now with the motivation aside, we’ll write out Riemann’s differential equation in all its glory.

\frac{d^2w}{dz^2} + p(z) \frac{dw}{dz} + q(z) \frac{w}{(z-a)(z-b)(z-c)} = 0

where

p(z) = \frac{1-\alpha-\alpha^\prime}{z-a} + \frac{1-\beta-\beta^\prime}{z-b} + \frac{1-\gamma-\gamma^\prime}{z-c}

and

q(z) = \frac{\alpha\alpha^\prime (a-b)(a-c)} {z-a} +\frac{\beta\beta^\prime (b-c)(b-a)} {z-b} +\frac{\gamma\gamma^\prime (c-a)(c-b)} {z-c}

Related posts

You can’t have everything you want: beta edition

The beta distribution is a conjugate prior for a binomial likelihood function, so it makes posterior probability calculations trivial: you simply add your data to the distribution parameters. If you start with a beta(α, β) prior distribution on a proportion θ, then observe s successes and f failures, the posterior distribution on θ is

beta(α + s, β + f).

There’s an integration going on in the background, but you don’t have to think about it. If you used any other prior, you’d have to calculate an integral, and it wouldn’t have a simple closed form.

The sum α + β of the prior distribution parameters is interpreted as the number of observations contained in the prior. For example, a beta(0.5, 0.,5) prior is non-informative, having only as much information as one observation.

I was thinking about all this yesterday because I was working on a project where the client believes that a proportion θ is around 0.9. A reasonable choice of prior would be beta(0.9, 0.1). Such a distribution has mean 0.9 and is non-informative, and it would be fine for the use I had in mind.

Non-informative beta priors work well in practice, but they’re a little unsettling if you plot them. You’ll notice that the beta(0.9, 0.1) density has singularities at both ends of the interval [0, 1]. The singularity on the right end isn’t so bad, but it’s curious that a distribution designed to have mean 0.9 has infinite density at 0.

If you went further and used a completely non-informative, improper beta(0, 0) prior, you’d have strong singularities at both ends of the interval. And yet this isn’t a problem in practice. Statistical operations that are not expressed in Bayesian terms are often equivalent to a Bayesian analysis with an improper prior like this.

This raises two questions. Why must a non-informative prior put infinite density somewhere, and why is this not a problem?

Beta densities with small parameters have singularities because that’s just the limitation of the beta family. If you were using a strictly subjective Bayesian framework, you would have to say that such densities don’t match anyone’s subjective priors. But the beta distribution is very convenient to use, as explained at the top of this post, and so everyone makes a little ideological compromise.

As for why singular priors are not a problem, it boils down to the difference between density and mass. A beta(0.9, 0.1) distribution has infinite density at 0, but it doesn’t put much mass at 0. If you integrate the density over a small interval, you get a small probability. The singularity on the left is is hard to see in the plot above, almost a vertical line that overlaps the vertical axis. So despite the curve going vertical, there isn’t much area under that part of the curve.

You can’t always get what you want. But sometimes you get what you need.

Related posts

More on seed phrase words

Last week I wrote about how the English seed phrase words for crypto wallets, proposed in BIP39, are not ideal for memorization. This post gives a few more brief thoughts based on these words.

Prefix uniqueness

The BIP39 words have a nice property that I didn’t mention: the words are uniquely determined by their first four letters. This means, for example, that you could type in a seed phrase on a phone by typing the first four letters of each word and a wallet interface could fill in the rest of each word.

Incidentally, although the BIP39 words are unique in how they begin, they are not unique in how they end. For example, cross and across are both on the list.

Creating a list

I wondered how hard it would be to come up with a list of 2048 common words that are unique in their first four letters. So I started with Google’s list of 10,000 most common words.

I removed one- and two-letter words, and NSFW words, to try to create a list similar to the BIP39 words. That resulted in a list of 4228 words. You could delete over half of these words and have a list of 2048 words uniquely determined by their first four letters.

Comparing lists

I was curious how many of the BIP39 list of 2048 words were contained in Google’s list of 10,000 most common words. The answer is 1666, or about 81%. (Incidentally, I used comm to answer this.)

Venn diagram

Vocabulary estimation and overlap

Here’s something else I was curious about. The average adult has an active vocabulary of between 20,000 and 35,000 words. So it seems reasonable that a typical adult would know nearly all the words on Google’s top 10,000 list. (Not entirely all. For example, I noticed one word on Google’s list that I hadn’t seen before.)

Now suppose you had a list of the 20,000 most common words and a person with a vocabulary of 20,000 words. How many words on the list is this person likely to know? Surely not all. We don’t acquire our vocabulary by starting at the top of a list of words, ordered by frequency, and working our way down. We learn words somewhat at random according to our circumstances. We’re more likely to know the most common words, but it’s not certain. So how would you model that?

It’s interesting to think how you would estimate someone’s vocabulary in the first place. You can’t give someone a quiz with every English word and ask them to check off the words they know, and estimation complicated by the fact that word frequencies are far from uniform. Maybe the literature on vocabulary estimation would answer the questions in the previous paragraph.

Related posts

Intuition for Pick’s Theorem

Pick’s theorem is a surprising and useful to find the area of a region formed by connecting dots on a grid. The area is simply

A = ip/2 − 1

where i is the number of dots in the interior and p is the number of dots on the perimeter.

Example

For example, the in the figure below there are 11 black dots on the perimeter and 17 blue dots in the interior.

Therefore Pick’s theorem says the area of the figure is 17 + 11/2 − 1 =  21½.

Intuition

It makes sense that the area would be approximately i if the figure is very large. But why do you need to add half the dots on the perimeter and why do you subtract 1?

Pick’s theorem is simple, but it isn’t obvious. If it were obvious, someone closer to the time of Descartes (1596–1650) would have discovered it. Instead, the theorem was discovered by Georg Alexander Pick in 1899. However, the theorem is fairly obvious for a rectangle, and this post will illustrate that case. For a rectangle, you can easily see why you need to add half the perimeter dots and subtract 1.

Rectangular case

Imagine an m by n rectangular array of dots. If you were to draw a frame around those dots with the corners of the rectangle being exactly between dots, the area would be mn, the number of dots inside the frame.

Pick’s theorem does not apply here because the corners of our frame do not lie on the grid. So let’s shift our grid down and to the left so that its corners do lie on the grid.

Now some of our original dots, marked in blue, are on the perimeter of the frame. We could add the number of dots on the perimeter to correct for this, but that would be too much because the blue dots on the perimeter are only on the left and top sides. Each has a reflection on the right and bottom, marked with red dots. So we shouldn’t add all the dots on the perimenter, but only half the dots on the perimeter.

But that still overcounts slightly. There are two dots on the perimeter that do not correspond to a blue dot or to the reflection of a blue dot. These are the hollow circles in the top right and bottom left corners. So when we take half the perimeter, we need to subtract 1 to account for half of this pair of dots.

This doesn’t prove Pick’s theorem in general, but it does prove it for rectangular regions. Even if we can’t see why the formula generalizes to more complicated regions, we can be grateful that it does.

Related posts

Knuth’s Twindragon

A few days ago I wrote about a random process that creates a fractal known as the Twin Dragon. This post gives a deterministic approach to create the same figure.

As far as I can tell, the first reference to this fractal is in a paper by Davis and Knuth in the Journal of Recreational Mathematics from 1970. Unfortunately this journal is out of print and hard or impossible to find online [1]. Knuth presents the twindragon (one word, lowercase) fractal in TAOCP Vol 2, page 206.

Knuth defines the twindragon via numbers base b = 1 − i. Every complex number can be written in the form

z = \sum_{k=-\infty}^\infty a_k (1 - i)^k

where the “digits” ak are either 0 or 1.

The twindragon fractal is the set of numbers that only have non-zero digits to the right of the decimal point, i.e. numbers of the form

z = \sum_{k=1}^\infty a_k (1 - i)^{-k}

I implemented this in Python as follows.

import matplotlib.pyplot as plt
from itertools import product

for bits in product([0, 1], repeat=15):
    z = sum(a*(1-1j)**(-k) for k, a in enumerate(bits))
    plt.plot(z.real, z.imag, 'bo', markersize=1)
plt.show()

This produced the image below.

Related posts

[1] If you can find an archive of Journal of Recreational Mathematics, please let me know.

What’s hierarchical about a hierarchical wallet?

A few days ago I wrote about what’s in a crypto wallet. In that post I said that most crypto wallets now are hierarchical deterministic (HD) wallets.  And I said that HD wallets are deterministic in the sense that they derive all their keys from a seed phrase. But in what sense are HD wallets hierarchical? That’s the topic of this post.

A warm-up story

In the game of 20 questions, one person thinks of something and another tries to guess what it is by asking up to 20 yes-no questions. I once heard the physicist John Wheeler tell of a variation of this game in which the first person did not have a definite object in mind, but decided after each question what the answer should be. For example, if someone asks “Is this person a man?” the person would commit to the person being a man or woman, but would not decide on a particular man or woman yet.

Wheeler’s point was that quantum mechanics is like this variation on 20 questions in that the answers to questions don’t exist until the question is asked. What does this have to do with hierarchical deterministic wallets? Your private keys do not exist until you ask for them. But once you have created and used a key, a wallet will behave consistently with that creation.

The hierarchy

The hierarchy referred to in a hierarchical deterministic wallet is a set of five variables, as described in BIP-44:

  1. Purpose
  2. Coin type
  3. Account
  4. Change
  5. Address index

The meaning of the variables is explained in BIP-44. The lowest level, address index, is a sequential counter. So you can have separate sequential counters for each value of the four-tuple (purpose, coin type, account, change).

Your master key and the five variables above are inputs to a key derivation function used to create new keys as needed. Once you use a private key, a hash of its corresponding public key is memorialized on the blockchain. If it’s a Bitcoin transaction, it’s on the Bitcoin blockchain. If it’s an Ethereum transaction, it’s on the Ethereum blockchain, etc. (You can find a list of supported coin types here.)

You wallet does not (or at least logically need not) store all your keys. It can reason as follows. “If the master key and these hierarchical values were used, this would be the private key. And given this private key, this would be the public key, and this would be the corresponding address. Let me consult the blockchain to see whether in fact it was used.”

How would a wallet know how many transactions you’ve made under a particular branch of the hierarchy? It searches the corresponding blockchain. It first looks whether there is a ledger entry corresponding to address index 0. Then address index 1, etc. The algorithm allows for the possibility of gaps. If it cannot find a ledger entry corresponding to index 2, it looks for index 3, etc. up to a gap of 20. After looking ahead 20 index values and finding nothing, it concludes there is nothing else to be found.

Because everything is derived deterministically from the seed phrase and the hierarchical variables, you can back up a wallet by simply backing up the seed phrase.

In theory, you could carry out transactions using one brand of wallet, back it up by writing down the seed phrase, then restore the information to a different brand of wallet. In practice you may run into difficulty doing this.

Related posts