The podcast about learning Japanese.

Seeing boxes and gibberish where you should be seeing Japanese?

Posted July 13th, 2012 by Enrico in Uncategorized

Before I took a somewhat long, unannounced hiatus from writing for the blog (申し訳ございません), I wrote about how you can type Japanese into your computer. Now I’m going to get a lot more technical (I am a software engineer by day, after all) and explain how Japanese is represented in a computer. There’s a bit of history, a bit of technology, and a bit of sociolinguistics, all in this single concept of rendering Japanese text on a computer screen.

Before I get into Japanese text, though, I’d like to talk about ASCII. To a computer, everything is bits: ones and zeroes. What you see, though, is a variety of file types, including text, sound, and video. So how does a computer know which is which? Largely by convention. For each kind of file, a particular set of conventions exists for interpreting the bytes (groups of 8 bits) and converting them into something humans can understand and interact with. That might be letters on a screen, or it might be sound sent to your computer’s speakers. ASCII is one such convention, first published in the 1960s and still surviving to this day. It is also the most logical place to start understanding the history of multilingual text in computers.

ASCII defines 128 characters using the first 7 bits of each byte: every number from 0 to 127 is mapped to a character or a control code (codes that don’t get printed on screen but instruct the computer about the text in other ways). That’s quite enough to represent all of the characters one would typically need to type English. But since only 7 bits are used, the 8th bit is left completely open, and with that single bit, another 128 characters and control codes can be represented. English doesn’t need them, but consider Japanese: 128 is only just enough to cover the syllabic alphabets, and there are still thousands upon thousands of kanji that also need to be represented.
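As a quick illustration (a Python sketch; any language with byte-level access would show the same thing), the ASCII mapping is directly visible through `ord()` and `chr()`:

```python
# ASCII maps each number from 0 to 127 to a character or control code.
print(ord("A"), chr(65))  # 65 A
print(ord("\n"))          # 10 - a control code: the line feed

# Every ASCII character fits in 7 bits, so the 8th (high) bit is always 0.
ascii_bytes = "Hello, world!".encode("ascii")
print(all(b < 128 for b in ascii_bytes))  # True
```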

Japanese approached this problem the way many other languages did: by cleverly using the 8th bit to greatly expand the number of characters that can be represented. This is sometimes referred to as Extended ASCII. But Enrico, you say, even if you added another 128 characters, that wouldn’t be nearly enough to represent all of the Japanese characters. You’re absolutely right. Here’s the clever part: by using the extra numbers made available by that 8th bit, one can create a standard that allows two bytes to represent a character instead of one!

One of the standards for extending ASCII to represent Japanese is Shift JIS. I won’t go into too much detail about how it works, but in short, it is a set of rules that allow the first byte to sometimes signal that it is not, by itself, a character, and must be combined with the next byte to determine which character to draw on the screen. As with all ASCII extensions, the basic ASCII characters are still included, which means English text can also be represented perfectly well. Shift JIS doesn’t actually use all of the numbers available in two bytes, which range from 0 to 65,535. And really, it doesn’t need to: even a fraction of that is enough to represent virtually all of written Japanese.
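To make this concrete, here’s a small Python sketch (Python’s codec machinery supports Shift JIS out of the box) showing ASCII characters staying at one byte each while kana and kanji take two:

```python
text = "ABC日本語"  # 3 ASCII characters + 3 Japanese characters
encoded = text.encode("shift_jis")
print(len(text), len(encoded))  # 6 characters, but 9 bytes (3*1 + 3*2)

# The lead byte of a two-byte sequence falls outside the ASCII range,
# which is the signal to combine it with the byte that follows.
print(hex(encoded[3]))  # first byte of 日 - well above 0x7F
```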

But here’s the problem: you need to tell the computer that the text it’s displaying uses Shift JIS, as opposed to standard ASCII or any of the myriad other extensions of ASCII. If you tell the computer to use the wrong standard, it misinterprets the bytes and you see gibberish. The Japanese have a word for this: 文字化け (もじばけ). Broken down, “mojibake” literally means something like “character transformation”: 文字 (characters) plus 化け (from 化ける, to change or transform). There are programs that can examine the bytes and take a very good guess at how they should be interpreted, but none of them are perfect. This puts a real damper on multilingual computing, and the problem becomes simply impossible when you want to combine multiple languages in a single piece of text, because each language extends ASCII in a different way, and none of the standards includes any signal for when to stop using one and start using another.
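You can manufacture mojibake deliberately. In this Python sketch, bytes written as Shift JIS are read back with the wrong convention (Latin-1, which happily accepts any byte value) and come out as gibberish:

```python
data = "文字化け".encode("shift_jis")

print(data.decode("latin-1"))    # gibberish: the wrong convention
print(data.decode("shift_jis"))  # 文字化け - the right one
```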

That’s where Unicode comes in. The goal of Unicode is to create one standard for the characters of all of the world’s written languages. Unicode does a few things quite differently from ASCII. The first is that each character is assigned a number called a code point, independent of how it is stored; the code points may then be encoded into bytes in different ways, using up to 4 bytes per character. With 4 bytes, you can count from 0 to 4,294,967,295. In practice, Unicode restricts itself to code points from 0 to 1,114,111 (hexadecimal 10FFFF), and only a fraction of those have been assigned so far. Either way, it’s far more than anyone should ever need to represent all of the world’s written characters.
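In Python, `ord()` returns the Unicode code point of a character, regardless of how the text will later be stored as bytes:

```python
# One code point per character, conventionally written U+XXXX in hex.
for ch in "Aあ日":
    print(ch, f"U+{ord(ch):04X}")
# A is U+0041, あ is U+3042, 日 is U+65E5
```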

But English text only needs 1 of those 4 bytes, so representing every character with a full 4 bytes would be very wasteful. Unicode therefore defines several encoding standards that specify how code points are represented as bytes. UTF-8 is easily the most common of these in the wild, because ASCII-only text encoded as UTF-8 is byte-for-byte identical to plain ASCII (very convenient for programmers who are used to ASCII and haven’t learned Unicode). But every character in the Basic Multilingual Plane (where most of the characters defined in Unicode so far live) can be represented in 1 to 3 bytes of UTF-8, so the encoding is definitely not lacking multilingual support. The details of how UTF-8 represents all of these code points are complicated, and I don’t fully understand them myself, but if you’re curious there are many books and web pages on the subject.
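A quick Python sketch shows the variable width in action, along with the ASCII-compatibility trick:

```python
# UTF-8 width depends on the code point: ASCII is 1 byte,
# and most other BMP characters take 2 or 3 bytes.
for ch in ("A", "é", "あ"):
    print(ch, len(ch.encode("utf-8")), "byte(s)")

# ASCII text encoded as UTF-8 is byte-for-byte identical to ASCII itself.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))  # True
```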

But not everybody is happy with Unicode. One of the most controversial parts of the standard is Han unification: the attempt to represent the Chinese characters shared by multiple languages, primarily Chinese, Japanese, and Korean, with a single set of code points. The intention of the unification is to help all of those characters fit happily into the Basic Multilingual Plane. Notably, the Japanese are among the biggest opponents of Han unification. Some Japanese scholars argue that historically and culturally significant variants of kanji are lost when all variants are collapsed into a single code point. It also causes oddities like Chinese variants of kanji appearing in Japanese text, because the Unicode font used to render the text contains the Chinese forms instead of the Japanese ones.
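You can see the unification directly: a kanji that is drawn differently in Chinese and Japanese typography is still a single code point, so the byte stream carries no hint of which variant to display (a Python sketch; 直 is one commonly cited example of a unified character):

```python
# 直 is one unified code point, U+76F4, whether the surrounding text is
# Chinese or Japanese; only the font decides which shape appears on screen.
print(f"U+{ord('直'):04X}")  # U+76F4
```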

So I hope this taught you everything you ever wanted to know and more about how Japanese text is represented in a computer.

If you have more questions, you can get in touch with Enrico by e-mail (enrico at or you can leave a comment on Facebook and Google+.
