Unicode resources
Unicode is essentially a universal character set. It contains nearly every character in every human language. However, Unicode is subtle. As I point out in my blog article on Unicode, it's hard to say anything pithy about Unicode that is entirely correct. Every simple statement requires footnotes. Here are some resources I've found useful in understanding and using Unicode.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky
Great introduction to Unicode for developers, as the title suggests.
Unicode Standard by the Unicode Consortium
The 1472-page tome is the indispensable reference for Unicode. The
Unicode Consortium has made much of the information in this book available
online.
In general, Unicode characters can be inserted into HTML by putting their
hexadecimal code point between &#x and a semicolon. For example, the Greek
theta (θ), U+03B8, can be inserted into HTML by typing &#x3B8;. Some commonly
used characters have mnemonic counterparts, such as &theta; for θ. However,
HTML 4 defines only 252 such entities, while Unicode contains well over
100,000 characters. Also, in general HTML mnemonic entities cannot be used in
XML. There are four exceptions: &amp;, &gt;, &lt;, and &quot;. Note that just
because a character is legal HTML does not mean the client's browser will
display it or display it correctly. See also math symbols and Greek letters.
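The hexadecimal and mnemonic forms can be checked with a short Python sketch. It relies only on the standard-library html module; the helper name to_hex_reference is made up for this illustration and is not part of any library.

```python
import html


def to_hex_reference(ch: str) -> str:
    """Build the hexadecimal character reference for a single character.

    Hypothetical helper for illustration: it wraps the code point, written
    in hexadecimal, between "&#x" and ";".
    """
    return f"&#x{ord(ch):X};"


theta = "\u03b8"  # GREEK SMALL LETTER THETA, U+03B8

# Hexadecimal reference built from the code point.
print(to_hex_reference(theta))  # &#x3B8;

# Both the numeric and the mnemonic form decode to the same character.
print(html.unescape("&#x3B8;") == theta)  # True
print(html.unescape("&theta;") == theta)  # True

# The four mnemonic entities that are also usable in XML.
print(html.escape('<"&>'))  # &lt;&quot;&amp;&gt;
```

Note that html.unescape accepts the larger HTML5 set of named references, so it is more permissive than the 252 entities of HTML 4.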
Unicode in XML
Unicode characters can be inserted into XML by quoting their code point
numbers in hexadecimal, much like HTML. However, some characters are illegal
or at least discouraged because they could confuse XML processors.
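As a rough sketch of what "illegal" means here, the function below follows the Char production of XML 1.0, which excludes most control characters, the surrogate range, and U+FFFE and U+FFFF. The function names are invented for this example, and the sketch does not cover the additional characters that the specification merely discourages.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # excludes the other C0 controls
        or 0xE000 <= cp <= 0xFFFD      # skips surrogates, U+FFFE, U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def xml_char_ref(ch: str) -> str:
    """Return a hexadecimal character reference, refusing illegal characters."""
    cp = ord(ch)
    if not is_xml10_char(cp):
        raise ValueError(f"U+{cp:04X} is not allowed in XML 1.0")
    return f"&#x{cp:X};"


print(xml_char_ref("\u03b8"))   # &#x3B8;
print(is_xml10_char(0x0))       # False: NUL is never allowed
print(is_xml10_char(0xFFFE))    # False
```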
XeTeX
XeTeX is a version of TeX that works with Unicode. There is a XeLaTeX
version of LaTeX as well.
There Ain't No Such Thing as Plain Text by Jeff Atwood
Mostly about Unicode encodings such as UTF-8.
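A small sketch of why the encoding matters: the same character becomes different bytes under different encodings, and bytes read with the wrong encoding turn into different text. The variable names are chosen for this example only.

```python
theta = "\u03b8"  # GREEK SMALL LETTER THETA, U+03B8

# One character, two different byte sequences.
utf8 = theta.encode("utf-8")
utf16 = theta.encode("utf-16-be")
print(utf8)   # b'\xce\xb8'
print(utf16)  # b'\x03\xb8'

# Reading the UTF-8 bytes as Latin-1 yields two unrelated characters.
misread = utf8.decode("latin-1")
print(len(misread), ascii(misread))  # 2 '\xce\xb8'

# Decoding with the encoding that was actually used restores the text.
print(utf8.decode("utf-8") == theta)  # True
```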
Unicode and ISO 10646
Why these are not exactly the same thing and just what the relationship
between the two is.
Unicode Explained by Jukka Korpela
This book gets into many of the issues surrounding Unicode that are not part
of Unicode per se, such as internationalization and software compatibility.