Yesterday I wrote about cjhebrew, a LaTeX package that lets you insert Hebrew text by using a sort of transliteration scheme. That reminded me of unidecode, a Python package for transliterating Unicode to ASCII, that I wrote about before. I wondered how the two compare, and so this post will answer that question.
Transliteration is a crude approximation. I started to say it’s no substitute for a proper translation, but in fact sometimes it is a substitute for a proper translation. It takes in the smallest context possible—one character—and is utterly devoid of nuance, but it still might be good enough for some purposes. It might, for example, help in searching some text for relevant content worth the effort of a proper translation.
Here’s a short bit of code to display unidecode’s transliterations of the Hebrew alphabet.
    import unidecode

    # 22 letters plus 5 final forms, starting at alef (U+05D0)
    for i in range(22 + 5):
        ch = chr(i + ord('א'))
        print(ch, unidecode.unidecode(ch))
I wrote 22 + 5 rather than 27 above to give a hint that the extra values are the final forms of five letters [1]. Also, if ord('א') doesn’t work for you, you can replace it with 0x05d0.
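As a quick check on the 22 + 5 count, here is a minimal sketch that uses only the standard library's unicodedata module to pick out the final forms by their Unicode names. The variable names are mine, and the sketch assumes the final forms are exactly the characters with FINAL in their names.

```python
import unicodedata

ALEF = 0x05D0  # code point of alef, the first letter of the block

finals = []
for cp in range(ALEF, ALEF + 22 + 5):
    name = unicodedata.name(chr(cp))
    if " FINAL " in name:
        # the ordinary form sits at the next code point
        ordinary = unicodedata.name(chr(cp + 1))
        finals.append((f"U+{cp:04x}", name, ordinary))

for row in finals:
    print(*row, sep=" | ")
```

Each row pairs a final form with the letter that immediately follows it, which is the ordering described in the footnote.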
Here’s a comparison of the transliterations used in cjhebrew and unidecode. I’ve abbreviated the column headings to make a narrower table.
|---------+---+----+----|
| Unicode |   | cj | ud |
|---------+---+----+----|
| U+05d0  | א | '  | A  |
| U+05d1  | ב | b  | b  |
| U+05d2  | ג | g  | g  |
| U+05d3  | ד | d  | d  |
| U+05d4  | ה | h  | h  |
| U+05d5  | ו | w  | v  |
| U+05d6  | ז | z  | z  |
| U+05d7  | ח | .h | KH |
| U+05d8  | ט | .t | t  |
| U+05d9  | י | y  | y  |
| U+05da  | ך | K  | k  |
| U+05db  | כ | k  | k  |
| U+05dc  | ל | l  | l  |
| U+05dd  | ם | M  | m  |
| U+05de  | מ | m  | m  |
| U+05df  | ן | N  | n  |
| U+05e0  | נ | n  | n  |
| U+05e1  | ס | s  | s  |
| U+05e2  | ע | `  | `  |
| U+05e3  | ף | P  | p  |
| U+05e4  | פ | p  | p  |
| U+05e5  | ץ | .S | TS |
| U+05e6  | צ | .s | TS |
| U+05e7  | ק | q  | q  |
| U+05e8  | ר | r  | r  |
| U+05e9  | ש | /s | SH |
| U+05ea  | ת | t  | t  |
|---------+---+----+----|
The transliterations are pretty similar, despite different design goals. The unidecode module is trying to pick the best mapping to ASCII characters. The cjhebrew package is trying to use mnemonic ASCII sequences to map into Hebrew. The former doesn’t need to be unique, but the latter does. The post on cjhebrew explains, for example, that it uses capital letters for final forms of Hebrew letters.
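The uniqueness point can be checked mechanically. Here is a rough sketch that copies the cj and ud columns from the table above into two lists and reports which codes are shared by more than one letter. The helper name collisions is my own, and the tsadi entry is taken to be .s, the code cjhebrew documents for that letter.

```python
from collections import defaultdict

FIRST = 0x05D0  # alef; the lists below run from U+05d0 through U+05ea

# Columns copied from the table, in code point order
CJ = "' b g d h w z .h .t y K k l M m N n s ` P p .S .s q r /s t".split()
UD = "A b g d h v z KH t y k k l m m n n s ` p p TS TS q r SH t".split()

def collisions(codes):
    """Return {code: [letters]} for codes used by more than one letter."""
    groups = defaultdict(list)
    for offset, code in enumerate(codes):
        groups[code].append(chr(FIRST + offset))
    return {code: letters for code, letters in groups.items() if len(letters) > 1}

cj_collisions = collisions(CJ)
ud_collisions = collisions(UD)

print(cj_collisions)          # no shared codes in the cj column
print(sorted(ud_collisions))  # codes that the ud column reuses
```

The ud column reuses a code for every letter that has a final form, and also maps both tet and tav to t. That is harmless when going from Hebrew to ASCII, but it means the ud column could not be inverted.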
Here’s the corresponding table for vowel points (niqqudim).
|---------+---+----+----|
| Unicode |   | cj | ud |
|---------+---+----+----|
| U+05b0  | ְ | :  | @  |
| U+05b1  | ֱ | E: | e  |
| U+05b2  | ֲ | a: | a  |
| U+05b3  | ֳ | A: | o  |
| U+05b4  | ִ | i  | i  |
| U+05b5  | ֵ | e  | e  |
| U+05b6  | ֶ | E  | e  |
| U+05b7  | ַ | a  | a  |
| U+05b8  | ָ | A  | a  |
| U+05b9  | ֹ | o  | o  |
| U+05ba  | ֺ | o  | o  |
| U+05bb  | ֻ | u  | u  |
|---------+---+----+----|
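Vowel points are combining marks that follow the consonant they modify, so a plain character-by-character lookup handles pointed text with no special cases. The sketch below builds one lookup from the ud columns of the two tables. It is a table-driven illustration, not a call into the unidecode package, and the function name transliterate is my own.

```python
import unicodedata

# ud columns copied from the two tables, in code point order
LETTERS = "A b g d h v z KH t y k k l m m n n s ` p p TS TS q r SH t".split()
POINTS = "@ e a o i e e a a o o u".split()

TABLE = {chr(0x05D0 + i): code for i, code in enumerate(LETTERS)}
TABLE.update({chr(0x05B0 + i): code for i, code in enumerate(POINTS)})

def transliterate(text):
    """Map each character on its own; unknown characters pass through."""
    return "".join(TABLE.get(ch, ch) for ch in text)

# Every vowel point in the table is a nonspacing combining mark
all_marks = all(unicodedata.category(chr(cp)) == "Mn"
                for cp in range(0x05B0, 0x05BC))

# dalet, qamats, bet, qamats, resh
word = "\u05d3\u05b8\u05d1\u05b8\u05e8"
print(transliterate(word), all_marks)
```

Because each character is looked up in isolation, the result is only as good as the table: there is no context and no attempt at pronunciation.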
[1] Unicode lists the final form of each letter before its ordinary form. For example, final kaf has Unicode value U+05da and kaf has value U+05db.
This post and the previous one on Hebrew in LaTeX appear right-justified in The Old Reader. Any idea why? I assume it’s not coincidental that the affected posts relate to Hebrew.