What is Unicode?

We keep hearing the word 'Unicode' and people are saying that it is better if a font is 'Unicode'. What is this 'Unicode' thing, and why is it better?

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

If we want to understand Unicode, we need to have some idea of how a computer uses characters ("letters" or "fonts").

How does a computer recognize and display characters?

A computer does not know anything about "letters" or "symbols". All a computer understands is numbers. We come to believe that the computer knows about letters because it can write words on the screen, but the computer is really only translating a sequence of codes (numbers) into drawings on the screen (the letters that you see).

For example, if the computer wants to show the word "cat" on the screen, it finds the codes for the characters "c", "a", and "t" in its memory and replaces them on the screen with the drawings that look like "c", "a", and "t".

What are these codes? Originally it was decided to use the numbers from 0-127 (that's 7 bits) to represent a different symbol each. So for example, the lower-case letters "c", "a", and "t" were given the codes 99, 97, and 116, respectively. (The upper-case letters "C", "A", and "T" have their own codes: 67, 65, and 84.)
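As a small illustration (a sketch in Python, which is not part of the original post), the built-in `ord` and `chr` functions convert between a character and its code:

```python
# Look up the code of each letter in the word "cat".
word = "cat"
codes = [ord(letter) for letter in word]
print(codes)  # [99, 97, 116]

# Upper-case letters have their own, different codes.
upper_codes = [ord(letter) for letter in word.upper()]
print(upper_codes)  # [67, 65, 84]

# Going the other way: turn the codes back into letters.
rebuilt = "".join(chr(code) for code in codes)
print(rebuilt)  # cat
```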

That worked well in the beginning when the only language that computer scientists cared about was English. They didn't need more than 128 codes to show all the characters in the English language: 26 for lower-case letters, 26 for upper-case letters, 10 for digits, and plenty of extra for symbols and "control characters".

What happens when we go beyond English?

But English isn't the only language in the world! As time went on, it became evident that people wanted to use languages other than English on their computers. The range of codes was expanded to 0-255 (8 bits), and most of the European symbols were added. This wider range is what you will often see loosely referred to as ASCII today, although strictly speaking ASCII covers only 0-127 and the 8-bit sets are "extended ASCII" code pages such as Latin-1.

Now that worked fine for European languages. But there are many more languages besides the European ones, with many, many more characters. 256 codes are not enough to give each of the world's symbols (characters) a unique number.

This is where Unicode steps in ... It provides more than a MILLION (!) codes: 1,114,112 of them, numbered from 0 to 0x10FFFF. So every symbol in every script can be given its own, unique number, with plenty of room to spare. (Technically speaking: a Unicode number needs up to 21 bits, compared with 7 or 8 bits for the older codes.)

The Tibetan letters and symbols have been given the range 0x0F00-0x0FFF (hexadecimal representation) ... that is, there are 256 unique numbers set aside for representing Tibetan letters, digits, and other symbols.
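To make the range concrete, here is a minimal Python sketch (an illustration added here, not from the original post). It counts the numbers in the range, including both ends, and looks up one Tibetan letter by its code using the standard `unicodedata` module:

```python
import unicodedata

# The Tibetan range, written with Python's 0x prefix for hexadecimal.
TIBETAN_START = 0x0F00
TIBETAN_END = 0x0FFF

# Count every number in the range, including both ends.
block_size = TIBETAN_END - TIBETAN_START + 1
print(block_size)  # 256

# The Tibetan consonant "ka" has the code 0x0F40.
ka = chr(0x0F40)
ka_name = unicodedata.name(ka)
print(ka_name)  # TIBETAN LETTER KA

# Its code falls inside the Tibetan range.
in_range = TIBETAN_START <= ord(ka) <= TIBETAN_END
print(in_range)  # True
```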

Why is it good that Tibetan (and other scripts) have their own numbers?

Remember that the computer stores the word "cat" as the numbers 99, 97, and 116. If you use a non-Unicode Tibetan font, the numbers 99, 97, and 116 are drawn on the screen as some Tibetan characters and not as "c-a-t". Or, if I choose another language's non-Unicode font, the symbols drawn on the screen are totally different again. These nonsense letters are what you see in a browser or other program when it does not recognize the font.
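The same effect can be imitated in Python. This is only an analogy, and the choice of tables is an assumption made for illustration: instead of non-Unicode fonts it uses three older 8-bit code tables that Python ships with (`latin-1`, `cp1251`, and `iso-8859-7`). One stored number is read through each table in turn:

```python
import unicodedata

# One stored number (byte value 233). The number never changes;
# only the table used to interpret it does.
data = bytes([233])

as_western = data.decode("latin-1")      # Western European table
as_cyrillic = data.decode("cp1251")      # Cyrillic (Windows) table
as_greek = data.decode("iso-8859-7")     # Greek table

# Print the official Unicode name of each result.
for ch in (as_western, as_cyrillic, as_greek):
    print(unicodedata.name(ch))
# LATIN SMALL LETTER E WITH ACUTE
# CYRILLIC SMALL LETTER SHORT I
# GREEK SMALL LETTER IOTA
```

The same number comes out as three different letters, which is why the reader of a non-Unicode file has to know in advance which table (or font) the writer had in mind.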

This means that the user has to tell the computer what language the document is in, because the computer does not know whether the document is English, Tibetan, or Hindi ... it just draws the symbols from the selected font for the given numbers.

But if we use Unicode fonts exclusively, the letters "c-a-t" are always drawn on the screen, because Unicode gives the letters used for Tibetan, English, and Hindi separate codes ... so the computer can now tell that the document is written in English (Latin) letters, because it contains only the codes for those letters.

Why is all this important?

  • If you make your document in Unicode (that is, type it in some Unicode font), then even if another person doesn't have the font you used, as long as they have any Unicode font in the same language, they can still read and use your file.
  • What happens if the non-Unicode fonts disappear or stop being maintained? Nobody can read your files, your data!
    Using Unicode is future-proofing your data. If you use Unicode fonts for all your documents, you are encoding your data in a format where the user does not have to guess what language it may be in, and which does not depend on arbitrary mappings of codes in the range 0-127 to symbols.
  • An additional benefit: when the computer knows what language it is dealing with, it can automatically make other smart choices, such as which font to use, which spell-checker to use, what grammar-checker to use, whether to write from right-to-left (Hebrew) or from left-to-right (Tibetan), etc.
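As a rough sketch of how a program can make such choices, Python's standard `unicodedata` module exposes the name and writing direction that the Unicode standard records for each character. The three sample characters below are picked purely for illustration:

```python
import unicodedata

# One sample character from each of three scripts.
samples = {
    "Latin": "c",
    "Tibetan": chr(0x0F40),   # TIBETAN LETTER KA
    "Hebrew": chr(0x05D0),    # HEBREW LETTER ALEF
}

# "L" means left-to-right, "R" means right-to-left.
directions = {
    label: unicodedata.bidirectional(ch) for label, ch in samples.items()
}

for label, ch in samples.items():
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"direction={directions[label]}")
```

Strictly speaking, what the code tells the program is the script and direction of each character; a program still needs other hints to pick between languages that share a script.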

Some other questions

What is ASCII?

ASCII stands for American Standard Code for Information Interchange.
It is the set of 7-bit codes (0-127) for the standard Roman characters (A-Z, a-z, digits, punctuation) and control characters that most computers and peripherals use.

What is hexadecimal?

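It is a numeric notation (base 16) that uses 16 possible digits, 0-9 and A-F. That is, you start counting at 0, go up to 9, and then the next digit is A, and the last digit is F, which stands for fifteen. After F comes 10, which stands for sixteen. It is a very common notation in the computer world, and it is often marked with a "0x" prefix, as in 0xF00.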
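A short Python sketch (added for illustration) shows the counting and converts a few of the numbers mentioned on this page between decimal and hexadecimal:

```python
# Counting from 0 to 16 in hexadecimal: after 9 come A-F, then 10.
counting = [format(n, "X") for n in range(17)]
print(counting)
# ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
#  'A', 'B', 'C', 'D', 'E', 'F', '10']

# The ends of the Tibetan range, converted from hexadecimal to decimal.
tibetan_start = int("F00", 16)
tibetan_end = int("FFF", 16)
print(tibetan_start, tibetan_end)  # 3840 4095

# And the other way: the decimal code of "t" written in hexadecimal.
t_hex = hex(116)
print(t_hex)  # 0x74
```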

What is UTF-8?

Whereas Unicode defines which symbol goes with which number in the range 0 to 0x10FFFF, the computer is free to store that number in any way that it thinks is best. UTF-8 is a way of storing those numbers, using one to four bytes each, that keeps the old ASCII codes (0-127) exactly as they were: each is still stored as a single byte with the same value. You can Google for more technical detail, but briefly: it allows old, non-Unicode, English-language programs to keep working in a Unicode environment, largely without modification.
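The following Python sketch (an illustration, not part of the original post) stores the word "cat" and one Tibetan letter as UTF-8 and prints the bytes that result:

```python
# Plain English text: UTF-8 stores it as the same bytes ASCII would.
ascii_text = "cat"
utf8_bytes = ascii_text.encode("utf-8")
print(list(utf8_bytes))  # [99, 97, 116]
same_as_ascii = utf8_bytes == ascii_text.encode("ascii")
print(same_as_ascii)  # True

# A Tibetan letter (code 0x0F40) needs three bytes in UTF-8.
ka = chr(0x0F40)
ka_bytes = ka.encode("utf-8")
print([hex(b) for b in ka_bytes])  # ['0xe0', '0xbd', '0x80']

# Decoding the bytes gives back the original letter.
round_trip = ka_bytes.decode("utf-8") == ka
print(round_trip)  # True
```

Because every byte of the Tibetan letter is above 127, an old program that only understands ASCII will never mistake part of it for an English letter.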

Many thanks to Jonas Bonn, who kindly allowed adaptation of his 21 May 2005 post to the DITG mail list

Know more

http://unicode.org/standard/WhatIsUnicode.html
A simple explanation
http://unicode.org/
The official Unicode web site provides extensive information and resources for programmers, implementers and others involved in globalization work.
http://en.wikipedia.org/wiki/Unicode
Unicode - Wikipedia, the free encyclopedia
A description of the basic concept of Unicode plus links to related resources.