Do incremental improvements add, multiply, or something else?

Suppose you make an x% improvement followed by a y% improvement. Together do they make an (x + y)% improvement? Maybe.

The business principle of kaizen, from the Japanese 改善 for improvement, is based on the assumption that incremental improvements accumulate. But quantifying how improvements accumulate takes some care.

Add or multiply?

Two successive 1% improvements amount to essentially a 2% improvement. But two successive 50% improvements amount to a 125% improvement, not a 100% improvement. So sometimes you can add, and sometimes you cannot. What’s going on?

An x% improvement multiplies something by 1 + x/100. For example, if you earn 5% interest on a principal of P dollars, you now have 1.05 P dollars.

So an x% improvement followed by a y% improvement multiplies by

(1 + x/100)(1 + y/100) = 1 + (x + y)/100 + xy/10000.

If x and y are small, then xy/10000 is negligible. But if x and y are large, the product term may not be negligible, depending on context. I go into this further in this post: Small probabilities add, big ones don’t.
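Here’s a small Python illustration of the difference between adding and multiplying improvements. The function is my own, just to make the arithmetic concrete.

    def combine(x, y):
        """Overall percent improvement from an x% improvement followed by a y% improvement."""
        return ((1 + x/100) * (1 + y/100) - 1) * 100

    print(combine(1, 1))    # approximately 2.01: close to simply adding
    print(combine(50, 50))  # 125.0: far from the 100 you would get by adding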

Interactions

Now let’s look at a variation. Suppose doing one thing by itself brings an x% improvement and doing another thing by itself makes a y% improvement. How much improvement could you expect from doing both?

For example, suppose you find through A/B testing that changing the font on a page increases conversions by 10%. And you find in a separate A/B test that changing an image on the page increases conversions by 15%. If you change the font and the image, would you expect a 25% increase in conversions?

The issue here is not so much whether it is appropriate to add percentages. Since

1.1 × 1.15 = 1.265

you don’t get a much different answer whether you multiply or add. But maybe you change the font and the image and conversions increase by only 12%. Maybe either change alone creates a better impression, but doing both together is no better than doing just one. Or maybe the new font and the new image clash somehow and making both changes together lowers conversions.

The statistical term for what’s going on is interaction effects. A sequence of small improvements creates an additive effect if the improvements are independent. But the effects could be dependent, in which case the whole is less than the sum of the parts. This is typical. Assuming that improvements are independent is often overly optimistic. But sometimes you run into a synergistic effect and the whole is greater than the sum of the parts.

Sequential testing

In the example above, we imagine testing the effect of a font change and an image change separately. What if we first changed the font, then with the new font tested the image? That’s better. If there were a clash between the new font and the new image we’d know it.

But we’re missing something here. If we had tested the image first and then tested the new font with the new image, we might have gotten different results. In general, the order of sequential testing matters.

Factorial testing

If you have a small number of things to test, you can discover interaction effects by doing a factorial design, either a full factorial design or a fractional factorial design.

If you have a large number of things to test, you’ll have to do some sort of sequential testing. Maybe you do some combination of sequential and factorial testing, guided by which effects you have reason to believe will be approximately independent.

In practice, a testing plan needs to balance simplicity and statistical power. Sequentially testing one option at a time is simple, and may be fine if interaction effects are small. But if interaction effects are large, sequential testing may be leaving money on the table.
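For a small number of factors, writing out a full factorial design is easy. Here’s a minimal Python sketch with hypothetical factors; a fractional factorial design would test only a carefully chosen subset of these rows.

    from itertools import product

    # Hypothetical factors for a web page test, each with two levels
    factors = {
        "font":     ["old", "new"],
        "image":    ["old", "new"],
        "headline": ["old", "new"],
    }

    # Full factorial design: every combination of factor levels
    design = list(product(*factors.values()))
    print(len(design))   # 2^3 = 8 combinations
    for combo in design:
        print(dict(zip(factors, combo)))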

Help with testing

If you’d like some help with testing, or with web analytics more generally, we can help.


Jigs

In his book The World Beyond Your Head Matthew Crawford talks about jigs literally and metaphorically.

A jig in carpentry is something to hold parts in place, such as aligning boards that need to be cut to the same length. Crawford uses the term more generally to describe labor-saving (or more importantly, thought-saving) techniques in other professions, such as a chef setting out ingredients in the order in which they need to be added. He then applies the idea of jigs even more generally to cultural institutions.

Jigs reduce options. A craftsman voluntarily restricts choices, not out of necessity, but in order to focus attention where it matters more. Novices may chafe at jigs because they believe they can work without them. Experts are even more capable of working without jigs than novices, yet they are more likely to appreciate their use.

Style guides, whether in journalism or in software development, are jigs. They limit freedom of expression in minor details, ideally directing creativity into more productive channels.

Automation is great, but there’s a limit to how much we can automate our work. People often seek out a consulting firm precisely because there’s something non-standard about their project [1]. There’s more opportunity for jigs than automation, especially when delegating work. If I could completely automate a task, there would be no need to delegate it. Giving someone a jig along with a task increases the chances of the delegation being successful.


[1] In my previous career, I sat through a presentation by a huge consulting company that promised to build software completely adapted to our unique needs, software which they had also built for numerous previous clients. This would be something they’ve never built before and something they have built many times before. I could imagine a more nuanced presentation that clarified what would be new and what would not be, but this presentation was blatantly contradictory and completely unaware of the contradiction.

Convert LaTeX to Microsoft Word

I create nearly all my documents in LaTeX, even documents that might be easier to create in Word. The reason is that even if a particular document would be easier to write in Word, my workflow is more efficient if everything is in LaTeX. LaTeX makes small, plain text files that work well with version control and searching, and I can edit them with the same editor I use for writing code and everything else I do.

Usually I send read-only documents to clients. They don’t know or care what program created the PDF I sent them. The fact that they cannot edit my reports is a feature, not a bug: if I’m going to sign off on something, I need to be sure that it doesn’t include any changes that someone else made that I’m unaware of.

But occasionally I do need to send clients a file they can edit, and this usually means Microsoft Word. Lawyers particularly want Word documents.

It’s possible to create a PDF using LaTeX and copy-and-paste the content into a Word document. This works, but you’ll have to redo all your formatting.

A better approach is to use Pandoc. The command

    pandoc foo.tex -s -o foo.docx

will convert the LaTeX file foo.tex directly to the Word document foo.docx. You may have to touch up the Word document a little, but it will retain more of the original formatting than if you went from LaTeX to Word via PDF.

You could wrap this in a script for convenience and so you don’t have to remember the pandoc syntax.

    #!/opt/local/bin/perl
    # Convert a LaTeX file to a Word document with pandoc.

    my $tex = $ARGV[0];                    # input file, e.g. foo.tex
    (my $doc = $tex) =~ s/\.tex$/.docx/;   # replace the .tex extension with .docx
    exec "pandoc $tex -s -o $doc";

You could save this to tex2doc and run

    tex2doc foo.tex

to produce foo.docx.

Update: The syntax I used when I wrote this post did not work when I revisited it today (2023-11-30); instead it gave several warnings. What worked today was

    pandoc foo.tex --from latex --to docx > foo.docx

Unfortunately I don’t have the version number that I used when I first wrote this post. Today I was using pandoc version 2.9.2.1.

Another problem with A/B testing: interaction effects

The previous post looked at a paradox with A/B testing: your final result may depend heavily on the order of your tests. This post looks at another problem with A/B testing: the inability to find interaction effects.

Suppose you’re debating between putting a photo of a car or a truck on your website, and you’re debating between whether the vehicle should be red or blue. You decide to use A/B testing, so you test whether customers prefer a red truck or a blue truck. They prefer the blue truck. Then you test whether customers prefer a blue truck or a blue car. They prefer the blue truck.

Maybe customers would prefer a red car best of all, but you didn’t test that option. By testing vehicle type and color separately, you didn’t learn about the interaction of vehicle type and color. As Andrew Gelman and Jennifer Hill put it [1],

Interactions can be important. In practice, inputs that have large main effects also tend to have large interactions with other inputs. (However, small main effects do not preclude the possibility of large interactions.)

Notice that sample size is not the issue. Suppose you tested the red truck against the blue truck with 1000 users and found that 88.2% preferred the blue truck. You can be quite confident that users prefer the blue truck to the red truck. Suppose you also used 1000 users to test the blue truck against the blue car and this time 73.5% preferred the blue truck. Again you can be confident in your results. But you failed to learn something that you might have learned if you’d split the same 2000 users among four options: red truck, blue truck, red car, blue car.

Experiment size

This is an example of a factorial design, testing all combinations of the factors involved. Factorial designs seem impractical because the number of combinations can grow very quickly as the number of factors increases. But if it’s not practical to test all combinations of 10 factors, for example, that doesn’t mean that it’s impractical to test all combinations of two factors, as in the example above. It is often practical to use a full factorial design for a moderate number of factors, and to use a fractional factorial design with more factors.
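Here’s a minimal sketch of what the 2 × 2 example above might look like, with made-up conversion rates chosen so that the one-at-a-time tests would pick the blue truck while the full factorial test reveals that the red car does best.

    import numpy as np

    # Made-up conversion rates from a 2 x 2 factorial test.
    # Rows: car, truck. Columns: red, blue.
    rates = np.array([[0.060, 0.030],    # car:   red, blue
                      [0.035, 0.050]])   # truck: red, blue

    vehicle_effect = rates[1].mean() - rates[0].mean()        # truck minus car
    color_effect   = rates[:, 1].mean() - rates[:, 0].mean()  # blue minus red
    interaction    = (rates[1, 1] - rates[1, 0]) - (rates[0, 1] - rates[0, 0])

    print(vehicle_effect, color_effect, interaction)
    # The main effects are small (about -0.0025 and -0.0075) but the
    # interaction (0.045) is large, and the best single cell is the red car.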

If you only test one factor at a time, you’re betting that interaction effects don’t matter. Maybe you’re right, and you can optimize your design by optimizing each variable separately. But if you’re wrong, you won’t know.

Agility

The advantage of A/B tests is that they can often be done rapidly. Blue or red? Blue. Car or truck? Truck. Done. Now let’s test something else.

If the only options were between a rapid succession of tests of one factor at a time or one big, complicated statistical test of everything, speed might win. But there’s another possibility: a rapid succession of slightly more sophisticated tests.

Suppose you have 9 factors that you’re interested in, and you understandably don’t want to test several replications of 2⁹ = 512 possibilities. You might start out with a (fractional) factorial design of 5 of the factors. Say that only one of these factors seems to make much difference, no matter what you pair it with. Next you do another experiment testing 5 factors at a time, the winner of the first experiment and the 4 factors you haven’t tested yet. This lets you do two small experiments rather than one big one.

Note that in this example you’re assuming that the factors that didn’t matter in the first experiment wouldn’t have important interactions with the factors in the second experiment. And your assumption might be wrong. But you’re making an educated guess, based on data from the first experiment. This is less than ideal, but it’s better than the alternative of testing every factor one at a time, assuming that no interactions matter. Assuming that some interactions don’t matter, based on data, is better than making a blanket assumption that no interactions matter, based on no data.

Testing more than one factor at a time can be efficient for screening as well as for finding interactions. It can help you narrow in on the variables you need to test more thoroughly.


[1] Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.

A/B testing and a voting paradox


One problem with A/B testing is that your results may depend on the order of your tests.

Suppose you’re testing three options: X, Y, and Z. Let’s say you have three market segments, equal in size, each with the following preferences.

Segment 1: X > Y > Z.

Segment 2: Y > Z > X.

Segment 3: Z > X > Y.

Now suppose you test X against Y in an A/B test, then test the winner against Z. Segments 1 and 3 prefer X to Y, so X wins the first round of testing. Now you compare X to Z. Segments 2 and 3 prefer Z to X, so Z wins round 2 and is the overall winner.

Now let’s run the tests again in a different order. First we test Y against Z. Segments 1 and 2 will go for Y. Then in the next round, Y against X, segments 1 and 3 prefer X, so X is the overall winner. So one way of running the tests results in Z winning, and another way results in X winning.

Can we arrange our tests so that Y wins? Yes, by testing X against Z first. Z wins the first round, and Y wins in the second round.
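Here’s a short Python sketch that runs all three test orders over the segments above and confirms that each order crowns a different winner.

    # Each segment's preferences, most preferred first (equal segment sizes)
    segments = [["X", "Y", "Z"],   # segment 1
                ["Y", "Z", "X"],   # segment 2
                ["Z", "X", "Y"]]   # segment 3

    def ab_test(a, b):
        """Return whichever option a majority of segments prefers."""
        votes_a = sum(seg.index(a) < seg.index(b) for seg in segments)
        return a if votes_a > len(segments) / 2 else b

    def sequential(order):
        """Test the first two options, then test the winner against the third."""
        winner = ab_test(order[0], order[1])
        return ab_test(winner, order[2])

    print(sequential(["X", "Y", "Z"]))  # Z
    print(sequential(["Y", "Z", "X"]))  # X
    print(sequential(["X", "Z", "Y"]))  # Y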

The root of the problem is that group preferences are not transitive. We say that preferences are transitive if when someone prefers a to b, and they prefer b to c, then they prefer a to c. We implicitly assumed that each segment has transitive preferences. For example, when we said that the first segment’s preferences are X > Y > Z, we meant that they would rank X > Y,  Y > Z, and X > Z.

Individuals (generally) have transitive preferences, but groups may not. In the example above, the market as a whole prefers X to Y, prefers Y to Z, but prefers Z to X. The segments have transitive preferences but the market does not. This is known as the Condorcet voting paradox.

Voting

This is not purely hypothetical. Our example is simplified, but it reflects a phenomenon that does happen in practice. It has been observed in voting. Constituencies in a legislature may have transitive preferences while the legislature as a whole does not. This opens the possibility of manipulating the final outcome by controlling the order in which items are voted on. In the example above, someone who knows the preferences of the groups could make any of the three outcomes the winner by picking the order of A/B comparisons.

Political scientists have looked back at congressional voting records and found instances of this happening, and can roughly determine when someone first discovered the technique of rigging sequential votes. They can also roughly point to when legislators became aware of the manipulation and learned that they sometimes need to vote against their actual preferences in one vote in order to get a better outcome at the end of the sequence of votes. (I think this was around 1940, but my memory could be wrong.) Political scientists call this sophisticated voting, as opposed to naive voting in which one always votes according to honest preferences.

Market research

The voting example is relevant to market research because it shows that intransitive group preferences really happen. But unlike in voting, customers respond honestly to A/B tests. They don’t even know that they’re part of an A/B test.

In the example above, we come away from our test believing that we have a clear winner. In both rounds of testing, the winner gets twice as many responses as the loser. The large margin in each test is misleading.

Any of the three options could be the winner, depending on the order of testing, but none of the options is any better than the others. So in the example we don’t so much make a bad choice as place too much confidence in our choice.

But now suppose the groups are not all the same size. Suppose the three segments represent 45%, 35%, and 20% of the market respectively. We can still have any option be the final winner, depending on the order of testing. But now some tests are better than others. If we tested all three options at once in an A/B/C test, we’d learn that a plurality of the market prefers X, and we’d learn that there is no option that the market as a whole prefers.


Illegible work

When James Scott uses the word legible, he doesn’t refer to handwriting that is clear enough to read. He uses the word more broadly to mean something that is easy to classify, something that is bureaucrat-friendly. A thing is illegible if it is hard to pigeonhole. I first heard the term from Venkatesh Rao’s essay A Big Little Idea Called Legibility.

Much of the work I do is illegible. If the work were legible, companies would have an employee who does it [1] and they wouldn’t call a consultant.

Here’s a template for a conversation I’ve had a few times:

“We’ve got kind of an unusual problem. It’s related to some things I’ve seen you write about. Have you heard of …?”

“No, what’s that?”

“Never mind. We’d like you to help us with a project. …”

Years ago, when people heard that I worked in statistics they’d ask what programming language I worked in. They expected me to say R or SAS or something like that, but I’d say C++. Not that I recommend doing statistics in C++ [2] in general, but people came to me with unusual projects that they couldn’t get done with standard tools. If an R module would have done what they wanted, they wouldn’t have knocked on my door.

Doing illegible work is a lot of fun, but it’s hard to market. Try typing “Someone who can help with a kinda off the wall math / computer science project” into Google. It’s not helpful. Search engines can only answer questions that are legible to search engines. Illegible work is more likely to come from word of mouth than from a search engine.

***

[1] Sometimes companies call a consultant because they have occasional need for some skill, something they do not need often enough to justify hiring a full-time employee to do. Or maybe they have the skills in house to do a project but don’t have anyone available. Or maybe they want an outside auditor. But in this post I’m focusing on weird projects.

[2] When I mention C++, I know some people are thinking “But isn’t C++ terribly complex?” Why yes, yes it is. But my colleagues and I already knew C++, and we stuck to a sane subset of the language. It was not unusual to rewrite R code in C++ and make it 100x faster.

“Why don’t you just use C?” These days I’m more likely to write C than C++.  Clients don’t want me to write enterprise applications, just small numerical libraries, and they usually ask for C.

What use is mental math today?

Now that most people are carrying around a powerful computer in their pocket, what use is it to be able to do math in your head?

Here’s something I’ve noticed lately: being able to do quick approximations in mid-conversation is a superpower.

Zoom call

When I’m on Zoom with a client, I can’t say “Excuse me a second. Something you said gave me an idea, and I’d like to pull out my calculator app.” Instead, I can say things like “That would require collecting four times as much data. Are you OK with that?”

There’s no advantage to being able to do calculations to six decimal places on the spot like Mr. Spock, and I can’t do that anyway. But being able to come up with one significant figure or even an order-of-magnitude approximation quickly keeps the conversation flowing.

I have never had a client say something like “Could you be more precise? You said between 10 and 15, and our project is only worth doing if the answer is more than 13.2.” If they did say something like that, I’d say “I will look at this more carefully offline and get back to you with a more precise answer.”

I’m combining two closely related but separate skills here. One is the ability to do simple calculations. The other is knowing what to calculate, i.e. how to do so-called Fermi problems. These problems are named after Enrico Fermi, who was known for being able to make rough estimates with little or no data.

A famous example of a Fermi problem is “How many piano tuners are there in New York?” I don’t know whether this goes back to Fermi himself, but it’s the kind of question he would ask. Of course nobody knows exactly how many piano tuners there are in New York, but you could guess about how many piano owners there are, how often a piano needs to be tuned, and how many tuners it would take to service this demand.
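Written out, a Fermi estimate is just a chain of rough, round numbers. Every figure below is made up purely for illustration.

    # Every number here is a rough, made-up figure for illustration.
    households       = 3_000_000   # households in New York
    piano_fraction   = 1 / 50      # fraction of households with a piano
    tunings_per_year = 1           # tunings per piano per year
    tunings_per_day  = 4           # pianos one tuner can service in a day
    work_days        = 250         # working days per year

    pianos   = households * piano_fraction
    demand   = pianos * tunings_per_year      # tunings needed per year
    capacity = tunings_per_day * work_days    # tunings one tuner can do per year
    print(round(demand / capacity))           # on the order of 60 tuners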

The piano tuner example is more complicated than the kinds of calculations I have to do on Zoom calls, but it may be the most well-known Fermi problem.

In my work with data privacy, for example, I often have to estimate how common some combination of personal characteristics is. Of course nobody should bet their company on guesses I pull out of the air, but it does help keep a conversation going if I can say on the spot whether something sounds like a privacy risk or not. If a project sounds feasible, then I go back and make things more precise.


A Bayesian approach to pricing

Suppose you want to determine how to price a product and you initially don’t know what the market is willing to pay. This post outlines some of the things you might think about, and how Bayesian modeling might help.

This post is not the final word on the subject, or even my final word on the subject. It is essentially a reply to a friend’s question turned into a blog post rather than an email.

Prior information

You must have some idea, however vague, what the market value of your product is. If you had absolutely no idea what a product is worth, you wouldn’t be considering it as a business opportunity.

There is always prior information. As a former colleague would say, when you want to measure the distance to the moon, you know not to pick up a yard stick. Whenever you do an experiment, something motivated you to do the experiment.

This is an ideal application of Bayesian statistics because you have valuable prior information before you have data. Until you have a moderate amount of data, your prior information may be more useful than your data.

Some people will say you should only act on data, not on subjective prior knowledge, but this is impossible. When you offer your product for the first time, you have no data. All you have to go on is prior information. You could hide your prior information in the design of an experiment rather than making it explicit in a prior distribution, but it’s still there.

Model

Assume the market price for some product can be modeled by a random variable X that depends on some parameter θ. I’m not saying that the price is random in any philosophical sense, only that it is useful to model it as random. More on this line of thinking here.

By modeling market price as a random variable rather than a single number, we’re acknowledging that it has some fuzziness to it. Different customers are willing to pay different amounts for the same product. Maybe the prices they’re willing to pay are tightly distributed around some center, or maybe there’s substantial variance.

When we make a sale, or fail to make a sale, we learn something about θ. But notice that we don’t observe X per se; we observe whether a particular sample from X was above or below the offer price. You’re not conducting a survey where you ask “What is the highest price p you’d be willing to pay?” and get a candid answer. You make an offer x, and it is either accepted or rejected. You observe whether x < p or x > p.

This means the likelihood function is similar to what you’d see in modeling survival data, but a little different. When someone dies, you fully observe their survival time. But if you follow up with someone and they’re still alive, you only know a lower bound on their survival time, not the survival time itself. We say the data is censored because we haven’t yet observed everything we want to know.

Survival data is usually asymmetric, censored on one side but not the other. You could have two-sided censoring, but that’s less common.

With pricing your data is always censored in both directions. You either get a lower bound or an upper bound on what someone would have been willing to pay.

After each offer and response, you can update your estimate of θ. Each interaction gives you a better idea of the distribution on θ.
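Here is a minimal sketch of that updating process. It assumes, purely for illustration, that willingness to pay follows a lognormal distribution with a known shape parameter and an unknown median θ, and it updates a discrete grid of candidate values of θ after each accepted or rejected offer.

    import numpy as np
    from scipy.stats import lognorm

    # Illustrative model: willingness to pay p ~ lognormal with known
    # log-scale sigma and unknown median theta.
    sigma  = 0.5
    thetas = np.linspace(50, 500, 200)   # grid of candidate medians
    prior  = np.ones_like(thetas)
    prior /= prior.sum()                 # flat prior over the grid

    def update(posterior, offer, accepted):
        """Update the posterior over theta after one offer.
        Acceptance means p >= offer; rejection means p < offer."""
        p_accept = lognorm.sf(offer, s=sigma, scale=thetas)  # P(p >= offer | theta)
        like = p_accept if accepted else 1 - p_accept
        posterior = posterior * like
        return posterior / posterior.sum()

    post = prior
    for offer, accepted in [(100, True), (200, True), (300, False), (250, False)]:
        post = update(post, offer, accepted)

    print(thetas[np.argmax(post)])   # posterior mode for the median price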

Pricing

Now suppose after numerous observations you’re moderately confident in your knowledge of θ. Now what?

One response is “Well then you charge what the market is most likely to bear.” That’s kind of a simplistic optimization. It implicitly assumes you’re OK with a 50% chance of a sale going through. Maybe your business is struggling and you don’t have many leads. Then you want a higher conversion rate. Or maybe you’re doing well, have plenty of leads, and are OK with a low conversion rate. This is especially the case if the distribution on market price has a lot of variance; if the variance is low it makes more sense to think of “the” price as if it were a single number.
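Continuing the sketch above, you can make the tradeoff explicit: expected revenue per lead is the offer price times the probability that a customer accepts it, and the price that maximizes expected revenue need not be the price that half the market would accept.

    # Continuing the sketch above: compare conversion probability and
    # expected revenue per lead at a few hypothetical prices.
    theta_hat = 225   # hypothetical point estimate of the median price
    for price in [150, 200, 250, 300, 350]:
        p_accept = lognorm.sf(price, s=sigma, scale=theta_hat)
        print(price, round(p_accept, 2), round(price * p_accept, 1))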

So far I have implicitly assumed that the only consequence of asking too much is a lost sale. But if you ask for too much, you might lose future sales, even if you get the current sale. I’ve also assumed that customers always prefer lower prices. That’s not the case. Asking too little for a product can hurt your credibility, for example.

Estimating willingness to pay is complicated, and determining what to do once you’ve made that estimate is complicated as well. This post is just a sketch of the thought process a company might go through.


Wire gauge and user perspective


Wire gauge is a perennial source of confusion: larger numbers denote smaller wires. The reason is that gauge numbers were assigned from the perspective of the manufacturing process. Thinner wires require more steps in production. This is a common error in user interface design and business more generally: describing things from your perspective rather than from the customer’s perspective.

Restaurants

When you order food at a restaurant, the person taking your order may rearrange your words before repeating them back to you. The reason may be that they’re restating it in manufacturing order, the order in which the person preparing the food needs the information.

Rheostats

A rheostat is a device for controlling resistance in an electrical circuit. It would seem natural for an engineer to give a user a control to vary resistance in Ohms. But Ohm’s law says

V = IR,

i.e. voltage equals current times resistance. Users expect that when they turn a knob clockwise they get more of something—brighter lights, louder music, etc.—and that means more voltage or more current, which means less resistance. Asking users to control resistance reverses expectations.

If I remember correctly, someone designed a defibrillator once where a knob controlled resistance rather than current. If that didn’t lead to someone dying, it easily could have.

Research

When I worked for MD Anderson Cancer Center, I managed the development of software for clinical trial design and conduct. Our software started out very statistician-centric and became more user-centric. This was a win, even for statisticians.

The general pattern was to move from eliciting technical parameters to eliciting desired behavior. Tell us how you want the design to behave and we’ll solve for the parameters to make that happen. Sometimes we weren’t able to completely automate parameter selection, but we were at least able to give the user a head start in knowing where to look.

Relax

Technical people don’t always want to have their technical hat on. Sometimes they want to relax and be consumers. When statisticians wanted to crank out a clinical trial design, they wanted software that was easy to use rather than technically transparent. That’s the backstory to this post.

It’s generally a good idea to conceal technical details, but provide a “service panel” to expose details when necessary.


Black Swan Gratification

Psychologists say that random rewards are more addictive than steady, predictable rewards. But I believe this only applies to relatively frequent feedback. If rewards are too infrequent, there’s no emotional connection between behavior and reward. The connection becomes more intellectual and less visceral as feedback becomes less frequent and less predictable.

Nassim Taleb distinguishes between delayed gratification and random gratification in his foreword to the book Safe Haven by Mark Spitznagel.

There are activities with remote payoff and no feedback that are ignored by the common crowd. … So what this idea is about isn’t delayed gratification but the ability to operate without gratification — or rather, with random gratification.

Choosing a course of action that is certain to pay off a year from now is opting for delayed gratification. Choosing something that is likely to pay off eventually, maybe two years from now, or maybe next week, is opting for random gratification.

Random rewards encourage an addictive response to frequent feedback, and discourage a rational response to infrequent feedback.

The solution is to act on principle, rather than respond like the rats in the psychological studies alluded to above.