The previous post compared bits of information to answers in a game of Twenty Questions.
The optimal strategy for playing Twenty Questions is for each question to split the remaining possibilities in half. There are a couple of ways to justify this strategy: minmax and average.
The minmax approach is to minimize the worst thing that could happen. The worst thing that could happen after asking a question is for your subject to be in the larger half of possibilities. For example, asking whether someone is left-handed is not a good strategy: only about 10% of people are left-handed, so the worst case scenario is “no,” in which case you’ve only narrowed your possibilities by 10%. The best minmax approach is to split the population in half with each question.
The best average approach is also to split the possibilities in half each time. With the handedness example, you learn more if the answer is “yes,” but there’s only a 10% chance that the answer is yes. There’s a 10% chance of gaining 3.322 bits of information, but a 90% chance of only gaining 0.152 bits. So the expected number of bits, the entropy, is 0.469 bits. Entropy is maximized when all outcomes are equally likely. That means you can’t learn more than 1 bit of information on average from a yes/no question, and you learn the most when both possibilities are equally likely.
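Here’s a quick Python check of those numbers, taking 10% as the assumed rate of left-handedness:

```python
from math import log2

p = 0.1  # assumed fraction of left-handed people

# Information gained from each answer, in bits
info_yes = -log2(p)      # 3.322 bits
info_no = -log2(1 - p)   # 0.152 bits

# Expected information gain, i.e. the entropy of the answer
entropy = p * info_yes + (1 - p) * info_no
print(entropy)  # 0.469 bits, well under the 1 bit of a 50-50 question
```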
Now suppose you want to ask about height and sex. As in the previous post, we assume men’s and women’s heights are normally distributed with means 70 and 64 inches respectively, and both have standard deviation 3 inches.
If you ask whether a person is taller than 67 inches, you split a mixed population of men and women in half. You will learn 1 bit of information from this question, but you’ve put yourself in a suboptimal position for the next question. A height cutoff of 67 inches selects half of the adult population in general, but it selects a majority of men and a minority of women. And as we discussed above, uneven splits are suboptimal, in the worst case and on average.
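A quick check with scipy.stats confirms this, assuming a 50-50 mix of men and women:

```python
from scipy.stats import norm

men = norm(70, 3)
women = norm(64, 3)

# Probability of being taller than 67 inches, by sex
p_men = men.sf(67)      # ≈ 0.841
p_women = women.sf(67)  # ≈ 0.159

# Assuming a 50-50 mix of men and women
p_overall = 0.5 * p_men + 0.5 * p_women
print(p_men, p_women, p_overall)  # 0.841..., 0.158..., 0.5
```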
If you’re going to ask about height and sex, ask about sex first. If the person is female, ask next whether her height is above 64 inches. But if the person is male, ask whether his height is above 70 inches. That is, you want to split the population evenly at each step, conditioning on your previous answer. A cutoff of 67 inches is optimal unconditionally, but suboptimal if you condition on sex.
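Here’s a sketch, under the same assumptions, of how much the second question is worth under each ordering:

```python
from math import log2
from scipy.stats import norm

men, women = norm(70, 3), norm(64, 3)

def H(p):
    """Entropy in bits of a yes/no question answered yes with probability p."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Height first: given a "yes" to "taller than 67 inches?", the chance
# the person is male (with a 50-50 prior) is about 0.84, so the
# follow-up sex question is worth well under a bit.
p_male_given_tall = men.sf(67) / (men.sf(67) + women.sf(67))
print(H(p_male_given_tall))  # ≈ 0.63 bits

# Sex first: the conditional cutoffs (70 for men, 64 for women) each
# split their half of the population evenly, so the follow-up height
# question is worth a full bit either way.
print(H(men.sf(70)))  # H(0.5) = 1 bit
```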
The optimal strategy for Twenty Questions is to ask a question whose answer has probability 1/2 of being true, conditional on all previous answers. You might get lucky with uneven conditional splits, but on average, and in the worst case, you won’t do as well.
This is a great illustration of the principle at work, but it makes me hesitant to ever play twenty questions with you =)
So the “ask sex, then height” strategy raises the question of how to find the best strategy, if we know all the conditional probabilities beforehand. If we have two questions that are 50-50, we want to ask the one that, after conditioning on its answer, leaves more of the yet-unasked questions closer to 50-50. And how much do we value leaving more balanced questions after re-conditioning compared to the immediate value of being close to 50-50?
It’s starting to look like you want to cluster together questions that are more dependent on each other, and successively pick off the centers of widely separated clusters. This is interesting!
I’ve used a similar strategy when trying to interpret signals from “novel” (noisy) “new physics” sensors. In its most basic form, I do lots of PCA runs in the lab to attempt to discern what is influencing the data values, then use those results as categorization/classification steps in the field.
In this case, the goal isn’t to find “the answer” in a field of possibilities: It is to quickly determine if I have a “non-answer” and move on to the next data set. That is, the current data set may be too noisy to contain “the answer”, meaning the solution space is null, and the optimal strategy is to fail fast and early.
Many data classification and categorization tasks have similar behaviors, whenever “none of the above” is a possible answer.
Dividing the solution space by halves is an optimal search strategy when “the answer” is known to exist. But not when the answer may not exist in the current solution space.
When the branches are unbalanced, one technique is to make the questions more complicated. For example, “are you a female at least 64 inches tall, or a male at least 70 inches tall?” This lets you get closer to the 50% goal by shifting partial probabilities around. In fact, if you have an optimal decision tree using simple questions which depend on the previous answer, you can convert it to a list of static questions, where the questions get progressively more complicated because they include the previous conditionals.
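A one-line check that this compound question hits the 50% target, under the same population assumptions as the post:

```python
from scipy.stats import norm

# Assuming a 50-50 mix, the compound question "female at least 64 inches
# tall, or male at least 70 inches tall?" is a yes for half the women
# and half the men, so it splits the whole population in half.
p_yes = 0.5 * norm(64, 3).sf(64) + 0.5 * norm(70, 3).sf(70)
print(p_yes)  # 0.5
```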
There’s a related technique which was originally developed for punched cards in the 1940s. Instead of “questions” think “bits”. If you have N independent descriptors with probability p_i, then each will get a random bit code containing n_i = -log2(p_i) bits, rounded up. You’ll need sum(n_i * p_i) / ln(2) bits overall. For each descriptor, randomly select the n_i bit positions from the available bits. (See Feldman and Hodes, “An Efficient Design for Chemical Structure Searching. I. The Screens”, JCICS (1975), which is more concerned with how to handle hierarchical descriptors.)
The use of random codes can lead to false positives. This can be controlled by increasing the bit lengths, or by (according to Knuth, TAOCP, v3, under “superimposed coding”) a construction by Kautz and Singleton, IEEE Trans. IT-10 (1964) 363-377. I’ve not tried the latter construction.
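Here’s a minimal Python sketch of the basic random-code scheme (not the Kautz and Singleton construction), with hypothetical descriptor names and an arbitrary screen width:

```python
import random
from math import ceil, log2

B = 64  # total screen width in bits (illustrative choice)

def make_code(p, width=B, rng=random):
    """Assign a descriptor occurring with probability p a random set of
    n = ceil(-log2(p)) bit positions, as described above."""
    n = ceil(-log2(p))
    return frozenset(rng.sample(range(width), n))

# Hypothetical descriptors and their probabilities of occurring in a record
codes = {
    "ring": make_code(0.5),
    "amine": make_code(0.2),
    "bromine": make_code(0.05),
}

def screen(descriptors):
    """Superimpose (OR together) the codes of a record's descriptors."""
    bits = set()
    for d in descriptors:
        bits |= codes[d]
    return bits

def may_contain(bits, descriptor):
    """True if the screen could contain the descriptor.
    False positives are possible; false negatives are not."""
    return codes[descriptor] <= bits

record = screen(["ring", "bromine"])
print(may_contain(record, "ring"))   # True
print(may_contain(record, "amine"))  # usually False, occasionally a false positive
```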