As many of you know, languages such as Japanese, Chinese and Korean (CJK) need more than one byte (8 bits) to represent a single character. A single byte offers only 256 possible values, which is nowhere near enough code space to map the thousands of characters these languages use.
Here is a short treatise on how to deal with these issues!
The Simple Rules of Moji-Ba-Ke
Garbage In = Garbage Out ... if the data you send to the browser becomes mangled (moji-ba-ke), your user will see trash characters. If you specify the wrong character set in your META tags, the browser will decode the page incorrectly, again causing moji-ba-ke, sometimes in seemingly random places on the page. This can also affect your AJAX by destroying the integrity of DOM elements.
When handling CJK text, be sure to use UTF-8 character encoding throughout the life cycle of your program. This includes database storage and retrieval, string manipulation in your code and display in the browser.
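To see how little it takes, here is a minimal Python sketch (Python is my choice for illustration, not something the original setup depends on). It encodes a Japanese string as UTF-8 and then decodes the same bytes with the wrong character set, which is exactly what a browser does when the declared charset is wrong. The variable names are made up for the example.

```python
# The same bytes, decoded two ways: once correctly, once with the wrong charset.
text = "文字化け"  # "mojibake" written in Japanese, 4 characters

utf8_bytes = text.encode("utf-8")  # 12 bytes: each of these characters takes 3

# Correct: decode with the encoding that produced the bytes.
restored = utf8_bytes.decode("utf-8")

# Wrong: pretend the bytes are Latin-1. This never raises an error,
# it silently turns every byte into one junk character.
garbled = utf8_bytes.decode("latin-1")

print(restored == text)          # True
print(len(text), len(garbled))   # 4 12
print(repr(garbled))             # unreadable trash characters

# As long as nothing else touched the data, the damage can be undone
# by reversing the wrong step.
print(garbled.encode("latin-1").decode("utf-8") == text)  # True
```

Note that the wrong decode does not fail, it just produces garbage. That is why moji-ba-ke tends to slip through unnoticed until a user sees it.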
So What Is UTF-8?
UTF-8 is a way of encoding Unicode characters as a stream of bytes. As you know, characters are represented by combinations of 0s and 1s. ASCII characters need only 7 bits, so each one fits in a single byte (8 bits). A UTF-8 character, however, takes anywhere from one to four bytes: ASCII characters stay at one byte, while most CJK characters take three. If any step along the way reads those bytes with a different character set in mind, you get what the Japanese call “moji-ba-ke” …. roughly pronounced “moe gee bah keh”.
As a programmer, you should try to use UTF-8 everywhere, from database to codebase to browser. For email you can use UTF-8, but you will probably find that many mail servers and clients are still old and use a mishmash of different character sets (e.g. ISO-2022-JP for Japanese).
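A quick way to convince yourself that UTF-8 characters vary in width is to encode a few and count the bytes. This is only an illustrative Python sketch; the helper name utf8_width is my own invention.

```python
def utf8_width(ch: str) -> int:
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

# ASCII, accented Latin, Japanese kana, a kanji, Korean hangul, an emoji.
for ch in ["A", "é", "あ", "漢", "한", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch} -> {utf8_width(ch)} byte(s): {encoded.hex(' ')}")
```

The ASCII letter takes one byte, the accented letter two, the CJK characters three each, and the emoji four.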
Database Settings
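If you are a MySQL user, make sure that all connections to the DB use UTF-8, and that all tables and fields use UTF-8 too. Be aware that MySQL's own "utf8" character set stores at most three bytes per character; utf8mb4 is the one that covers all of UTF-8. By default, older versions of MySQL use the latin1 character set with a Swedish collation (latin1_swedish_ci). Those kooky Swedes love their jokes!!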
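To make that concrete, here is a rough Python sketch that only builds the SQL you would typically run; it does not connect to anything. The database and table names are hypothetical, and I am assuming utf8mb4 as the target because MySQL's plain "utf8" stores at most three bytes per character. Treat it as a starting point and back up your data first, since CONVERT TO rewrites the stored columns.

```python
def utf8_setup_statements(database, tables, charset="utf8mb4"):
    """Build MySQL statements that move a connection, a database and its tables to `charset`.

    Names are interpolated directly, so only pass identifiers you control.
    """
    statements = [
        # Tell the server which encoding this connection sends and expects.
        f"SET NAMES {charset}",
        # Change the default for tables created in this database from now on.
        f"ALTER DATABASE `{database}` CHARACTER SET {charset}",
    ]
    for table in tables:
        # Convert the existing text columns of each table.
        statements.append(
            f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset}"
        )
    return statements


# Hypothetical database "shop" with two tables.
for sql in utf8_setup_statements("shop", ["customers", "orders"]):
    print(sql + ";")
```

Running SHOW VARIABLES LIKE 'character_set%' on a live connection is a handy way to check what the server and client actually agreed on.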
Checking your Codebase
In my experience editors like Notepad++, Notepad2, UltraEdit, e, TextMate, etc… all have problems with UTF-8 support. They mostly work, but since their developers don’t use CJK languages themselves, the rough edges never get polished. A BOM (Byte Order Mark) that is hard to turn off, mangled tabs and poor character set conversion all cause trouble.
I highly recommend using a proven UTF-8 editor like Maruo. It is made by a Japanese company, but there is an English version (and a trial version) at http://hide.maruo.co.jp/software/bin/maruo614_signed.exe
Lastly, you may need to convert your source files to UTF-8, especially if the codebase itself contains CJK strings.
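If you suspect an editor has slipped a BOM into a file, you do not have to guess: a UTF-8 BOM is the three bytes EF BB BF at the very start. Here is a small Python sketch for spotting and removing it; the function name is my own.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 byte order mark, if it has one."""
    if data.startswith(codecs.BOM_UTF8):  # b"\xef\xbb\xbf"
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "<?php echo 'テスト';".encode("utf-8")
without_bom = strip_utf8_bom(with_bom)

print(with_bom[:3].hex(" "))                      # ef bb bf
print(without_bom.startswith(codecs.BOM_UTF8))    # False
```

A stray BOM matters most in files that are sent straight to the browser, because those three bytes become output before your code has a chance to set any headers.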
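Converting a file is simple once you know what it was saved as. The sketch below is a minimal Python version, assuming you already know the source encoding (Shift_JIS in the demo); the function name and the sample file are made up. It deliberately fails loudly if the guess is wrong rather than writing garbage.

```python
import tempfile
from pathlib import Path

def convert_to_utf8(path, source_encoding):
    """Rewrite the file at `path` as UTF-8, assuming it is currently `source_encoding`."""
    raw = Path(path).read_bytes()
    # Raises UnicodeDecodeError if the file is not valid in source_encoding.
    text = raw.decode(source_encoding)
    # Write bytes directly so line endings are left exactly as they were.
    Path(path).write_bytes(text.encode("utf-8"))


# Demo on a throwaway file saved as Shift_JIS.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "legacy.php"
    sample.write_bytes("<?php echo '日本語';".encode("shift_jis"))
    convert_to_utf8(sample, "shift_jis")
    print(sample.read_bytes().decode("utf-8"))
```

Keep a backup or work under version control, since the file is overwritten in place.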
Manipulating Strings
Every string function needs to be multibyte safe. Notice I didn’t say double-byte. UTF-8 is not a double-byte encoding but a multibyte one: a character takes anywhere from one to four bytes. In PHP you need to call the mbstring (mb_*) functions specifically. Ruby and other languages have more transparent support, but you need to check the docs for your flavour of application server!
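Here is what "multibyte safe" means in practice, sketched in Python, where the str type counts characters and the bytes type counts bytes. That is the same distinction PHP draws between mb_strlen/mb_substr and the plain byte-oriented functions. The helper truncate_utf8 is my own illustration, not a standard function.

```python
text = "日本語テキスト"        # 7 characters
data = text.encode("utf-8")   # the same text as raw UTF-8 bytes

char_count = len(text)   # counts characters: 7
byte_count = len(data)   # counts bytes: 21

# Cutting the raw bytes at an arbitrary offset can land inside a character.
try:
    data[:4].decode("utf-8")
    cut_is_safe = True
except UnicodeDecodeError:
    cut_is_safe = False

def truncate_utf8(data: bytes, max_bytes: int) -> str:
    """Cut UTF-8 bytes to at most max_bytes and drop any character left incomplete.

    errors="ignore" also hides other invalid bytes, so only use this on data
    you already know is valid UTF-8.
    """
    return data[:max_bytes].decode("utf-8", errors="ignore")

print(char_count, byte_count, cut_is_safe)  # 7 21 False
print(text[:3])                             # 日本語 (character-based, always safe)
print(truncate_utf8(data, 4))               # 日
```

Any function that measures, cuts or pads by bytes has the same problem, which is why length limits on form fields and database columns are a classic source of moji-ba-ke.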
META Tags
Check out google.co.jp or yahoo.co.jp for their META tags. These are sites that know how to do it properly. Basically, include the following META tag in the document <HEAD>:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
It is usually safe to mix English language attributes on the HTML element with the above character set too. So adding the META tag above seems to work in an HTML document that has:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
Email
This is a wholly different can of worms. UTF-8 works much of the time, but many older Japanese clients still use ISO-2022-JP. This is not worth covering here (well, not for me!)
Debugging UTF-8 Issues
Once you have a reliable UTF-8 editor like Maruo, you can create static pages to take the database and application code out of the picture, and resolve your issues from there.
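If you want a quick sanity check on such a static page, the rough Python sketch below reads the charset a page declares and confirms the bytes really are valid UTF-8. The sample page and the function names are made up, and the regular expression is a crude match for the charset attribute, not a real HTML parser.

```python
import re

PAGE = """<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>UTF-8 test</title>
</head>
<body><p>日本語</p></body>
</html>
"""

def declared_charset(raw: bytes):
    """Return the charset named in the page's META tag (lowercased), or None."""
    match = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', raw, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None

def is_valid_utf8(raw: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def page_is_consistent(raw: bytes) -> bool:
    """True only if the page says utf-8 and its bytes actually are UTF-8."""
    return declared_charset(raw) == "utf-8" and is_valid_utf8(raw)


good = PAGE.encode("utf-8")
bad = PAGE.encode("shift_jis")   # says utf-8, but was saved as Shift_JIS

print(page_is_consistent(good))  # True
print(page_is_consistent(bad))   # False
```

If the static page passes and the live page does not, the problem is somewhere in your database or string handling rather than in the editor or the META tag.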