Data Representation: Lecture Notes

Character Representation/Encoding

As we've seen, the ASCII encoding defines only 128 characters (it is a 7-bit code; 8-bit "extended ASCII" variants reach 256), mostly English letters, digits, punctuation, and control codes. How can letters in other languages be represented?

Unicode, first published in 1991, was created to provide a single, consistent way to encode text from all languages, addressing the incompatibility and limited character repertoire of ASCII and the encodings built on it:

Before Unicode, there were many different character encodings used by different systems to represent non-English text, and these encodings could conflict. Text that is displayed correctly on one computer might appear garbled on another because they used different encoding standards.
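
The garbling described above can be reproduced in a few lines. This is a minimal sketch, not part of the original notes: the sample word and the choice of Latin-1 and Windows-1251 as the "conflicting" legacy encodings are assumptions made for illustration.

```python
# The same byte means different things under different legacy encodings.
single_byte = bytes([0xE9])
as_latin1 = single_byte.decode("latin-1")   # Western European: 'é'
as_cp1251 = single_byte.decode("cp1251")    # Cyrillic: 'й'
print(as_latin1, as_cp1251)

# Text written with one encoding and read with another comes out garbled.
text = "café"
raw = text.encode("utf-8")        # b'caf\xc3\xa9'
garbled = raw.decode("latin-1")   # each UTF-8 byte is misread as one character
print(garbled)                    # cafÃ©
```

Nothing is lost in the bytes themselves; the damage comes from the reader assuming the wrong encoding, which is why both sides need to agree on one.
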
Unicode was developed to be a universal character encoding standard that could represent characters from all languages, along with symbols, emoji, and more. It assigns a unique number, called a code point and written like U+0041, to each character it covers. As of version 13.0 (2020) it defined more than 143,000 characters across many scripts, and later versions have added more.
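
To make the idea of a code point concrete, here is a small sketch using Python's built-in ord and chr and the standard unicodedata module. The helper name code_point and the sample characters are illustrative choices, not something from the lecture.

```python
import unicodedata

def code_point(ch):
    """Return the code point of a single character in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "é", "中", "😀"]:
    # unicodedata.name gives the official character name from the Unicode database
    print(ch, code_point(ch), unicodedata.name(ch))

# chr goes the other way: from a code point number back to the character
print(chr(0x4E2D))  # 中
```

Note that a code point is just a number; how that number is stored as bytes is a separate question, decided by the encoding form.
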
Unicode defines several encoding forms (UTF-8, UTF-16, and UTF-32) that turn code points into bytes. UTF-8 uses 1 to 4 bytes per character and stores each ASCII character in a single byte identical to its ASCII value; UTF-16 uses 2 or 4 bytes; UTF-32 always uses 4 bytes. Which form is most efficient depends on the use case.
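
The trade-off between the encoding forms can be seen by counting bytes. The sketch below is only an illustration: the helper encoded_sizes is a hypothetical name, and the little-endian codec variants ("utf-16-le", "utf-32-le") are used so that Python does not prepend a byte order mark to the output.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character occupies in each encoding form."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}

for ch in ["A", "é", "中", "😀"]:
    print(ch, encoded_sizes(ch))

# For ASCII characters, the UTF-8 bytes are identical to the ASCII bytes.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```

For mostly-ASCII text UTF-8 is the most compact, while UTF-32 trades space for a fixed width of 4 bytes per code point.
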