Sunday, January 11, 2009

What are the differences between ASCII and Unicode?

ASCII is a seven-bit encoding scheme that assigns a number to each of the 128 characters
used most frequently in American English. This allows most computers to record and display
basic text. ASCII does not include symbols frequently used in other countries, such as the British
pound symbol or the German umlaut. ASCII is understood by almost all email and
communications software.


Unicode is an attempt by ISO and the Unicode Consortium to develop a coding system for electronic text that includes every written alphabet in existence. Unicode uses 8-, 16-, or
32-bit characters depending on the specific representation, so Unicode documents often require
up to twice as much disk space as ASCII or Latin-1 documents. The first 256 characters of Unicode are identical to Latin-1. Not all email or communications software can understand the Unicode character set.
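These relationships are easy to see in Python, whose `str.encode` method exposes the standard Unicode encodings (the sample string is just an illustration):

```python
# The first 256 Unicode code points match Latin-1, so 'é' keeps its
# Latin-1 code (0xE9) as its Unicode code point.
assert ord("é") == 0xE9

text = "résumé"  # six characters, two of them outside ASCII

# The 8-, 16-, and 32-bit representations of the same text differ in size.
utf8_size = len(text.encode("utf-8"))       # 1 byte per ASCII char, 2 per 'é'
utf16_size = len(text.encode("utf-16-le"))  # 2 bytes per character here
utf32_size = len(text.encode("utf-32-le"))  # 4 bytes per character

print(utf8_size, utf16_size, utf32_size)  # 8 12 24
```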


What is ASCII?

ASCII is an acronym for American Standard Code for Information Interchange, a widely used standard for encoding text documents on computers. Usually, a file described as "ASCII" does not contain any special embedded control characters; you can view the contents of the file, change it with an editor, or print it with a printer.

In ASCII, every letter, number, and punctuation symbol has a corresponding number, or ASCII code. For example, the character for the number 1 has the code 49, capital letter A has the code 65, and a blank space has the code 32. This encoding system not only lets a computer store a document as a series of numbers, but also lets it share such documents with other computers that use the ASCII system. For a complete list of ASCII codes, see A table of ASCII character codes.
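In Python, for example, the built-in `ord` and `chr` functions convert between characters and their codes:

```python
# Each character maps to a numeric code...
assert ord("1") == 49   # the digit 1
assert ord("A") == 65   # capital letter A
assert ord(" ") == 32   # a blank space

# ...and the mapping works in both directions.
assert chr(65) == "A"

# A document is stored as a series of these numbers.
codes = [ord(c) for c in "Hi!"]
print(codes)  # [72, 105, 33]
```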

Documentation files or program source code files are usually stored as ASCII text. In contrast, binary files, such as executable programs, graphical images, or word processing documents, contain characters that cannot normally be displayed or printed and are usually illegible to human beings.

The format of a file, whether ASCII or binary, becomes important when you are transferring files between computers. For example, when using FTP, you can transfer ASCII text files without any special consideration. To exchange binary files, however, you may need to enter the command set binary or otherwise prepare the client to transfer binary files, so that the computer will correctly transmit the special characters in the file.

Note: Most current FTP software will automatically transfer ASCII and binary files correctly.
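One rough way to see why the distinction matters is a heuristic check, sketched here in Python, of whether a file's bytes are plain ASCII text. The function name and the set of "allowed" bytes are illustrative choices, not part of any FTP standard:

```python
def looks_like_ascii_text(data: bytes) -> bool:
    """Heuristic: True if every byte is a printable ASCII character,
    tab, newline, or carriage return."""
    allowed = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    return all(b in allowed for b in data)

print(looks_like_ascii_text(b"Hello, world!\n"))    # True: safe to send as text
print(looks_like_ascii_text(b"\x89PNG\r\n\x1a\n"))  # False: binary (a PNG header)
```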

What is Unicode?

Developed jointly by the Unicode Consortium and the International Organization for
Standardization (ISO), Unicode is an attempt to consolidate the alphabets and ideographs of
the world's languages into a single, international character set. It focuses on the characters
themselves rather than on the languages that use them. Thus, a letter shared between English
and French (or, for that matter, an ideograph shared between Japanese kanji and Chinese
hanzi, unified as Han characters) has a single Unicode code point. As a multilingual standard,
Unicode makes it possible for developers to create applications without having to resort to
the costly, time-consuming task of releasing localized versions for each language.
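Python's `unicodedata` module (part of the standard library) makes this single-character-set idea visible: each character carries one code point and one name, regardless of which languages use it:

```python
import unicodedata

# The same Latin letter serves English, French, German, and many others.
print(hex(ord("A")), unicodedata.name("A"))

# A Han ideograph unified across Chinese and Japanese usage.
print(hex(ord("海")), unicodedata.name("海"))
```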

Most Western character sets are 7-bit (e.g., US-ASCII) or 8-bit (e.g., Latin-1), limiting them,
respectively, to 128 or 256 characters. This limitation has resulted in a slew of sets customized
for each language. For languages like Chinese, Korean, and Japanese, whose heavily
ideographic writing systems (i.e., based on the meaning of a word rather than its component
sounds) consist of thousands of characters, traditional 7- and 8-bit character sets are not
adequate. Therefore, to include the character sets of the world's principal writing systems,
Unicode was originally designed around a 16-bit set, allowing up to 65,536 characters (later
versions of the standard extend beyond this limit with supplementary planes). This does have
the consequence that Unicode text stored in a 16-bit encoding takes up twice as much disk
space as text using an 8-bit character set.
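The disk-space cost is easy to confirm in Python by comparing a 16-bit encoding with an 8-bit one:

```python
text = "plain ASCII text"

eight_bit = text.encode("latin-1")      # 1 byte per character
sixteen_bit = text.encode("utf-16-le")  # 2 bytes per character

print(len(eight_bit), len(sixteen_bit))  # 16 32
```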

As a character set, Unicode does not concern itself with the specific appearance, or glyph, of a
character. Instead, it includes only a code and name for each character. Individual fonts are
assigned the tasks of rendering characters into glyphs, with the exact appearance of glyphs
varying between fonts. Similarly, Unicode does not, for the most part, distinguish between plain
and rich text, instead allowing applications to apply their own text processing and formatting.
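This code-plus-name design shows up in Python's `unicodedata` module, which reports a character's name and lets you look a character up by name, but says nothing about how it is drawn:

```python
import unicodedata

# Unicode assigns each character a code and a name...
print(unicodedata.name("ß"))  # LATIN SMALL LETTER SHARP S

# ...and the name can be used to look the character up again.
assert unicodedata.lookup("LATIN SMALL LETTER SHARP S") == "ß"

# How 'ß' actually appears on screen (its glyph) is left to the font.
```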