Typesetting Sha and Bitcoin

I went down a rabbit hole this week using two symbols in LaTeX. The first was the Russian letter Sha (Ш, U+0248), and the second was the currency symbol for Bitcoin (₿, U+20BF).

Sha

I thought there would be a LaTeX package that would include Ш as a symbol rather than as a Russian letter, just as \pi produces π as a symbol rather than as a Greek letter per se, but apparently there isn’t. I was surprised, since Ш is used in math for at least three different things [1].

When I post on @TeXtip how to produce various symbols in LaTeX, I often get a reply telling me I should simply paste in the Unicode character and use XeTeX. That’s what I ended up doing, except one does not simply use XeTeX.

You have to set the font to one that contains a glyph for the character you want, and you have to use a font encoding package. I ended up adding these two lines to my file header:

    \usepackage[T2A]{fontenc}
    \usepackage{eulervm}

That worked, but only when I compiled with pdflatex, not xelatex.

Bitcoin

I ended up using a different but analogous tactic for the Bitcoin symbol. I used fontspec, Liberation Sans, and xelatex rather than fontenc, Euler, and pdflatex. These were the lines I added to the header:

    \usepackage{fontspec}
    \setmainfont{Liberation Sans}

Without these two lines I get the error message

    Missing character: There is no ₿ (U+20BF) in font ...

I didn’t need to use ₿ and Ш in the same document, but the approach in this section works for both. The approach in the previous section will not work for ₿ because the Euler font does not contain a gylph for ₿.

Related posts

[1] The three mathematical uses of Ш that I’m aware of are the shuffle product, the Dirac comb distribution, and Tate–Shafarevich group.

Writing math with Unicode

A LaTeX document looks better than an HTML document, but an HTML document looks better than an awkward hybrid of HTML and inline images created by LaTeX.

My rule is to only use LaTeX-generated images for displayed equations and not for math symbols in the middle of a sentence. This works pretty well, but it’s less than ideal. I use HTML for displayed equations too when I can. Over time I’ve learned how to do more in HTML, and browser support for special characters has improved [1].

My personal prohibition on inline images requires saying in words what I would say in symbols if I were writing LaTeX. For example, in the context of complex variables I have written “z conjugate” rather than put a conjugate bar over z.

There’s a way to fix the particular problem of typesetting conjugates: the Unicode character U+0305 will put a bar (i.e. conjugation symbol) over a character. For example:

If z = a + bi then bi.

Here’s the HTML code for the sentence above:

    If <em>z</em> = <em>a</em> + <em>b</em>&#x200a;<em>i</em> 
    then <em>z&#x0305;</em> &#x2212; <em>b</em>&#x200a;<em>i</em>.

Note that the character U+0305, written in HTML as &#x0305;, goes inside the em tags. A couple other things: I put a hair space (U+200A) between b and i, and I used a minus sign (U+2212) rather than a hyphen.

I normally just use a hyphen for a minus sign when I’m blogging about math, but sometimes this doesn’t look right. For example, yesterday’s post about fractional factorial designs had three tables filled with plus signs and minus signs. I first used a hyphen, but that didn’t look right because it was too narrow to visually pair with the plus signs.

Just as you can use U+0305 to put a bar on top of a character, you can use U+20D7 to put a vector on top. For example,

x⃗ = (x₁, x₂, x₃).

This was created with

    <em>x&#x20d7;</em> = (<em>x</em>&#x2081;, 
    <em>x</em>&#x2082;, <em>x</em>&#x2083;).

Here I used the Unicode characters for subscript 1, subscript 2, and subscript 3. Sometimes these look better than <sub>1</sub> etc, but not always. Here’s the equation for x⃗ using sub tags:

x⃗ = (x1, x2, x3).

Unicode typically has all the symbols you need to write mathematics. You can use this page to find the Unicode counterpart to most LaTeX symbols. But text is inherently linear, and you need more than text to lay out typesetting in two dimensions.

Update: Looks like I spoke too soon. The tricks presented here work well on Linux and Mac but not on Windows. Some readers are saying the vector symbol is missing on Windows. On my Windows laptop the bar and vector appear but are not centered over the intended character.

Related posts

[1] When I started blogging you couldn’t count on browsers having font support for all the mathematical symbols you might want to use. (This post summarizes my experience as of nine years ago.) Now I believe you can, especially for any symbol in the BMP (Basic Multilingual Plane, code points below FFFF). I haven’t gotten feedback from anyone saying they’re missing symbols that I use.

Including a little Hebrew in an English LaTeX document

I was looking up how to put a little Hebrew inside a LaTeX document and ran across a good answer on tex.stackexchange. Short answer: use the cjhebrew package.

In a nutshell, you put your Hebrew text between \< and > using the cjhebrew package’s transliteration. You write left-to-right, and the text will appear right-to-left. For example, \<'lp> produces

aleph in Hebrew

using ‘ for א, l for ל, and p for ף.

The code for each Hebrew letter is its English transliteration, with three footnotes.

First, when two Hebrew letters roughly correspond to the same English letter, one form may have a dot in front of it. For example, ט and ת both make a t sound; the former is encoded as .t and the latter as t.

Second, five Hebrew letters have a different form when used at the end of a word [1]. For such letters the final form is the capitalized value of the regular form. For example, פ and its final form ף are denoted by p and P respectively. The package will automatically choose between regular and final forms, but you can override this by using the capital letter in the middle of a word or by using a | after a regular form at the end of a word.

Finally, the letter ש is written with a /s The author already used s for ס and .s for צ, so he needed a new symbol to encode a third letter corresponding to s [2]. Also ש has a couple other forms. The letter can make either the sh or s sound, and you may see dots on top of the letter to distinguish these. The cjhebrew package uses +s for ש with a dot on the top right, the sh sound, and ,s for ש with a dot on the top left, the s sound.

Here is the complete consonant transliteration table from the cjhebrew documentation.

Note that the code for א is a single quote ' and the code for ע is a back tick (grave accent) `.

You can also add vowel points (niqqudim). These are also represented by their transliteration to English sounds, with one exception. The sh’va is either silent or represents a schwa sound, so there’s not a convenient transliterations. But the sh’va looks like a colon, so it is represented by a colon. See the package documentation for more details.

Related posts

[1] You may have seen something similar in Greek with sigma σ and final sigma ς. Even English had something like this. For example, people used to use a different form of t at the end of a word when writing cursive. My mother wrote this way.

[2] It would be more phonetically faithful to transliterate צ as ts, but that would make the LaTeX package harder to implement since it would have to disambiguate whether ts represents צ or תס.

Font Fingerprinting

Browser fingerprint

Websites may not be able to identify you, but they can probably identify your web browser. Your browser sends a lot of information back to web servers, and the combination of settings for a particular browser are usually unique. To get an idea what information we’re talking about, you could take a look at Device Info.

Installed fonts

One of the pieces of information that gets sent back to servers is the list of fonts installed on your device. Your font fingerprint is just one component of your browser fingerprint, but it’s an easy component to understand.

Application fonts

Various applications install their own fonts. If you’ve installed Microsoft Office, for example, that would be evident in your list of fonts. However, Office is ubiquitous, so that information doesn’t go very far to identifying you. Maybe the lack of fonts installed with Office would be more conspicuous.

Less common software goes further toward identifying you. For example, I have Mathematica on one of my computers, and along with it Mathematica fonts, something that’s not too common.

Personal fonts

Then there are the fonts you’ve installed deliberately, many of them free. Maybe you’ve installed fonts to support various languages, such as Hebrew and Greek fonts for Bible scholars. Maybe you have dyslexia and have installed fonts that are easier for you to read. Maybe you’ve installed a font because it contains technical symbols you need for your work. These increase the chances that your combination of fonts is unique.

Commercial fonts

Maybe you have purchased a few commercial fonts. One of the reasons to buy fonts is to have something that doesn’t look so common. This also makes the font fingerprint of your browser less common.

Moderate obscurity

Servers have to query whether particular fonts are installed. An obscure font would go a long way toward identifying you. But if a font is truly obscure, the server isn’t likely to ask whether it’s installed. So the greatest privacy risk comes from moderately uncommon fonts [1].

Advertising

Your browser fingerprint is probably unique, unless you have a brand new device, or you’ve made a deliberate effort to keep your fingerprint generic. So while a site may not know who you are, it can recognize whether you’ve been there before, and customize the content you receive accordingly. Maybe you’ve looked at the same product three times without buying, and so you get a nudge to encourage you to go ahead and buy.

(It’ll be interesting to see what effect the California Consumer Privacy Act has on this when it goes into effect the beginning of next year.)

What about changes?

Since there’s more than enough information to uniquely identify your browser, fingerprints are robust to changes. Installing a new font won’t throw advertisers off your trail. If you still have the same monitor size, same geographic location, etc. then advertisers can update your fingerprint information to include the new font. You might even get an advertisement for more fonts if they infer you’re a typography aficionado.

More privacy posts

[1] Except for a spearphishing attack. A server might check for the presence of fonts that, although uncommon in general, are likely to be on the target’s computer. For example, if someone wanted to detect my browser in particular, they know I have Mathematica fonts installed because I said so above. And they might guess that I have installed the Greek and Hebrew fonts I mentioned. They might also look for obscure fonts I’ve mentioned in the blog, such as Unifont, Andika, and Inconsolata.

Which Unicode characters can you depend on?

Unicode is supported everywhere, but font support for Unicode characters is sparse. When you use any slightly uncommon character, you have no guarantee someone else will be able to see it.

I’m starting a Twitter account @MusicTheoryTip and so I wanted to know whether I could count on followers seeing music symbols. I asked whether people could see ♭ (flat, U+266D), ♮ (natural, U+266E), and ♯ (sharp, U+266F). Most people could see all three symbols, from desktop or phone, browser or Twitter app. However, several were unable to see the natural sign from an Android phone, whether using a browser or a Twitter app. One person said none of the symbols show up on his Blackberry.

[Update: I gave @MusicTheoryTip over to someone else, and they didn’t keep it up for long.]

I also asked @diff_eq followers whether they could see the math symbols ∂ (partial, U+2202), Δ (Delta, U+0394), and ∇ (gradient, U+2207). One person said he couldn’t see the gradient symbol, but the rest of the feedback was positive.

So what characters can you count on nearly everyone being able to see? To answer this question, I looked at the characters in the intersection of several common fonts: Verdana, Georgia, Times New Roman, Arial, Courier New, and Droid Sans. My thought was that this would make a very conservative set of characters.

There are 585 characters supported by all the fonts listed above. Most of the characters with code points up to U+01FF are included. This range includes the code blocks for Basic Latin, Latin-1 Supplement, Latin Extended-A, and some of Latin Extended-B.

The rest of the characters in the intersection are Greek and Cyrillic letters and a few scattered symbols. Flat, natural, sharp, and gradient didn’t make the cut.

There are a dozen math symbols included:

0x2202 ∂
0x2206 ∆
0x220F ∏
0x2211 ∑
0x2212 −
0x221A √
0x221E ∞
0x222B ∫
0x2248 ≈
0x2260 ≠
0x2264 ≤
0x2265 ≥

Interestingly, even in such a conservative set of characters, there are a three characters included for semantic distinction: the minus sign (i.e. not a hyphen), the difference operator (i.e. not the Greek letter Delta), and the summation operator (i.e. not the Greek letter Sigma).

And in case you’re interested, here’s the complete list of the Unicode characters in the intersection of the fonts listed here. (Update: Added notes to indicate the start of a new code block and listed some of the isolated characters.)

0x0009           Basic Latin
0x000d
0x0020 - 0x007e 
0x00a0 - 0x017f  Latin-1 supplement
0x0192	         
0x01fa - 0x01ff
0x0218 - 0x0219  
0x02c6 - 0x02c7  
0x02c9
0x02d8 - 0x02dd 
0x0300 - 0x0301 
0x0384 - 0x038a  Greek and Coptic
0x038c
0x038e - 0x03a1
0x03a3 - 0x03ce
0x0401 - 0x040c 
0x040e - 0x044f  Cyrillic
0x0451 - 0x045c
0x045e - 0x045f
0x0490 - 0x0491
0x1e80 - 0x1e85  Latin extended additional
0x1ef2 - 0x1ef3
0x200c - 0x200f  General punctuation
0x2013 - 0x2015
0x2017 - 0x201e
0x2020 - 0x2022
0x2026
0x2028 - 0x202e
0x2030
0x2032 - 0x2033
0x2039 - 0x203a
0x203c
0x2044
0x206a - 0x206f  
0x207f           
0x20a3 - 0x20a4  Currency symbols ₣ ₤
0x20a7           ₧
0x20ac           €
0x2105           Letterlike symbols ℅
0x2116           №
0x2122           ™
0x2126           Ω
0x212e           ℮
0x215b - 0x215e  ⅛ ⅜ ⅝ ⅞
0x2202 	         Mathematical operators ∂
0x2206           ∆
0x220f           ∏
0x2211 - 0x2212  ∑ −
0x221a           √
0x221e           ∞
0x222b           ∫
0x2248           ≈
0x2260           ≠
0x2264 - 0x2265  ≤ ≥
0x25ca           Box drawing ◊
0xfb01 - 0xfb02  Alphabetic presentation forms fi fl

A web built on LaTeX

The other day on TeXtip, I threw this out:

Imagine if the web had been built on LaTeX instead of HTML …

Here are some of the responses I got:

  • It would have been more pretty looking.
  • Frightening.
  • Single tear down the cheek.
  • No crap amateurish content because of the steep learning curve, and beautiful rendering … What a dream!
  • Shiny math, crappy picture placement: glad it did not!
  • Overfull hboxes EVERYWHERE.
  • LaTeX would have become bloated, and people would be tweeting about HTML being so much better.
  • Noooo! LaTeX would have been “standardised”, “extended” and would by now be a useless pile of complexity.

Use typewriter font for code inside prose

There’s a useful tradition of using a typewriter font, or more generally some monospaced font, for bits of code sprinkled in prose. The practice is analogous to using italic to mark, for example, a French mot dropped into an English paragraph. In HTML, the code tag marks content as software code, which a browser typically will render in a typewriter font.

Here’s a sentence from a new article on Python at Netflix that could benefit a few code tags.

These features (and more) have led to increasingly pervasive use of Python in everything from small tools using boto to talk to AWS, to storing information with python-memcached and pycassa, managing processes with Envoy, polling restful APIs to large applications with requests, providing web interfaces with CherryPy and Bottle, and crunching data with scipy.

Here’s the same sentence with some code tags.

These features (and more) have led to increasingly pervasive use of Python in everything from small tools using boto to talk to AWS, to storing information with python-memcached and pycassa, managing processes with Envoy, polling restful APIs to large applications with requests, providing web interfaces with CherryPy and Bottle, and crunching data with scipy.

It’s especially helpful to let the reader know that packages like requests are indeed packages. It helps to clarify, for example, whether Wes McKinney has been stress testing pandas or pandas. That way we know whether to inform animal protection authorities or to download a new version of a library.

Unicode to LaTeX

I’ve run across a couple websites that let you enter a LaTeX symbol and get back its Unicode value. But I didn’t find a site that does the reverse, going from Unicode to LaTeX, so I wrote my own.

Unicode / LaTeX Conversion

If you enter Unicode, it will return LaTeX. If you enter LaTeX, it will return Unicode. It interprets a string starting with “U+” as a Unicode code point, and a string starting with a backslash as a LaTeX command.

screenshot of www.johndcook.com/unicode_latex.png

For example, the screenshot above shows what happens if you enter U+221E and click “convert.” You could also enter infty and get back U+221E.

However, if you go from Unicode to LaTeX to Unicode, you won’t always end up where you started. There may be multiple Unicode values that map to a single LaTeX symbol. This is because Unicode is semantic and LaTeX is not. For example, Unicode distinguishes between the Greek letter Ω and the symbol Ω for ohms, the unit of electrical resistance, but LaTeX does not.

Letters that fell out of the alphabet

Mental Floss had an interesting article called 12 letters that didn’t make the alphabet. A more accurate title might be 12 letters that fell out of the modern English alphabet.

I thought it would have been better if the article had included the Unicode values of the letters, so I did a little research and created the following table.

Name Capital Small
Thorn U+00DE U+00FE
Wynn U+01F7 U+01BF
Yogh U+021C U+021D
Ash U+00C6 U+00E6
Eth U+00D0 U+00F0
Ampersand U+0026
Insular g U+A77D U+1D79
Thorn with stroke U+A764 U+A765
Ethel U+0152 U+0153
Tironian ond U+204A
Long s U+017F
Eng U+014A U+014B

 

Once you know the Unicode code point for a symbol, you can find out more about it, for example, here.

The paper is too big

In response to the question “Why are default LaTeX margins so big?” Paul Stanley answers

It’s not that the margins are too wide. It’s that the paper is too big!

This sounds flippant, but he gives a compelling argument that paper really is too big for how it is now used.

As is surely by now well-known, the real question is the size of the text block. That is a really important factor in legibility. As others have noted, the optimum line length is broadly somewhere between 60 characters and 75 characters.

Given reasonable sizes of font which are comfortable for reading at the distance we want to read at (roughly 9 to 12 point), there are only so many line lengths that make sense. If you take a book off your shelf, especially a book that you would actually read for a prolonged period of time, and compare it to a LaTeX document in one of the standard classes, you’ll probably notice that the line length is pretty similar.

The real problem is with paper size. As it happens, we have ended up with paper sizes that were never designed or adapted for printing with 10-12 point proportionally spaced type. They were designed for handwriting (which is usually much bigger) or for typewriters. Typewriters produced 10 or 12 characters per inch: so on (say) 8.5 inch wide paper, with 1 inch margins, you had 6.5 inches of type, giving … around 65 to 78 characters: in other words something pretty close to ideal. But if you type in a standard proportionally spaced font (worse, in Times—which is rather condensed because it was designed to be used in narrow columns) at 12 point, you will get about 90 to 100 characters in the line.

He then gives six suggestions for what to do about this. You can see his answer for a full explanation. Here I’ll just summarize his points.

  1. Use smaller paper.
  2. Use long lines of text but extra space between lines.
  3. Use wide margins.
  4. Use margins for notes and illustrations.
  5. Use a two column format.
  6. Use large type.

Given these options, wide margins (as in #3 and #4) sound reasonable.