I recently received some comments regarding the Sphinx Search Japanese character table. This prompted me to document it a little better. To be honest, I originally put the character table together during a late night coding binge in 2008. As such my memory was a little fuzzy.
After visiting the Oracle of Unicode (now resident at unicode.org), I was able to clear up some things for myself. Hopefully it helps you too, wherever you are!
Sphinx Character Mapping
Sphinx allows you to specify “allowed” characters as well as match one type of character to another. In Japanese this is useful for matching small characters, or voiced characters, to their unadorned cousins.
Example 1: か (unadorned, main form 'KA') が (voiced character, diacritically marked)
Example 2: あ (unadorned, main form 'A') ぁ (small form of 'A')
Here is an example of a mapping: U+3042->U+3041. This declares the left-hand character as “allowed”, and also matches it to the right-hand character when searching. However, this alone does not declare the right-hand character as “allowed”.
In the Sphinx Japanese Character Table, we transform one way only. So the simpler, main form matches to the rarer forms (e.g. か==が, or あ==ぁ), but not the other way round (e.g. が!=か, and ぁ!=あ).
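The mapping rule described above can be modelled with a small Python sketch. This is only an illustration, not Sphinx's own parser: `parse_charset_table` and `fold` are hypothetical helper names, ranges are deliberately not handled, and turning disallowed characters into spaces is a simplification of how they act as separators.

```python
def parse_charset_table(spec):
    """Parse a tiny subset of charset_table syntax into {source: target}.

    Only single code points ("U+3041") and single mappings
    ("U+3042->U+3041") are handled; ranges are deliberately left out.
    """
    table = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        source, arrow, target = entry.partition("->")
        source_cp = int(source.strip()[2:], 16)  # "U+3042" -> 0x3042
        # A bare code point is allowed and maps to itself.
        target_cp = int(target.strip()[2:], 16) if arrow else source_cp
        table[source_cp] = target_cp
    return table


def fold(text, table):
    """Fold each allowed character to its target; anything else becomes a space."""
    return "".join(
        chr(table[ord(ch)]) if ord(ch) in table else " " for ch in text
    )


# The mapping alone allows the left-hand character only.
mapping_only = parse_charset_table("U+3042->U+3041")
print(repr(fold("あ", mapping_only)))  # 'ぁ'
print(repr(fold("ぁ", mapping_only)))  # ' '

# Declaring U+3041 as well makes the right-hand character allowed too.
with_target = parse_charset_table("U+3041, U+3042->U+3041")
print(repr(fold("ぁ", with_target)))   # 'ぁ'
```

In the full table, the right-hand characters are declared by the hiragana and katakana ranges under the Japanese Character Inclusions.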
Here is the rest of the file explained.
### Standard ASCII Inclusions ###
0..9, A..Z->a..z, a..z

### Include/Transform full-width (zenkaku) to half-width (hankaku) roman forms ###
# See http://unicode.org/charts/PDF/UFF00.pdf
U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z, U+FF41..U+FF5A->a..z,

### Japanese Character Inclusions ###
# Unified Kanji U+4E00..U+9FCF
# Extended Kanji U+3400..U+4DBF
# Extension B U+20000..U+2A6DF
# Hiragana U+3040..U+309F
# Katakana U+30A0..U+30FF
# Punctuation U+3000..U+303F
U+4E00..U+9FCF, U+3400..U+4DBF, U+20000..U+2A6DF,
U+3040..U+309F, U+30A0..U+30FF, U+3000..U+303F,

### Character Transformations ###
# Allow large-hiragana to be matched against small-hiragana equivalents
# Allow non-voiced-hiragana to be matched against voiced-hiragana equivalents
# See http://unicode.org/charts/PDF/U3040.pdf

# Map hiragana small/voiced characters to main form
U+3042->U+3041, U+3044->U+3043, U+3046->U+3045, U+3048->U+3047, U+304A->U+3049,
U+304C->U+304B, U+304E->U+304D, U+3050->U+304F, U+3052->U+3051, U+3054->U+3053,
U+3056->U+3055, U+3058->U+3057, U+305A->U+3059, U+305C->U+305B, U+305E->U+305D,
U+3060->U+305F, U+3062->U+3061, U+3064->U+3063, U+3065->U+3063, U+3067->U+3066,
U+3069->U+3068, U+3070->U+306F, U+3071->U+306F, U+3073->U+3072, U+3074->U+3072,
U+3076->U+3075, U+3077->U+3075, U+3079->U+3078, U+307A->U+3078, U+307C->U+307B,
U+307D->U+307B, U+3084->U+3083, U+3086->U+3085, U+3088->U+3087, U+308F->U+308E,
U+3094->U+3046, U+3095->U+304B, U+3096->U+3051,

# Map katakana small/voiced characters to main form
U+30A2->U+30A1, U+30A4->U+30A3, U+30A6->U+30A5, U+30A8->U+30A7, U+30AA->U+30A9,
U+30AC->U+30AB, U+30AE->U+30AD, U+30B0->U+30AF, U+30B2->U+30B1, U+30B4->U+30B3,
U+30B6->U+30B5, U+30B8->U+30B7, U+30BA->U+30B9, U+30BC->U+30BB, U+30BE->U+30BD,
U+30C0->U+30BF, U+30C2->U+30C1, U+30C5->U+30C4, U+30C7->U+30C6, U+30C9->U+30C8,
U+30D0->U+30CF, U+30D1->U+30CF, U+30D3->U+30D2, U+30D4->U+30D2, U+30D6->U+30D5,
U+30D7->U+30D5, U+30D9->U+30D8, U+30DA->U+30D8, U+30DC->U+30DB, U+30DD->U+30DB,
U+30E4->U+30E3, U+30E6->U+30E5, U+30E8->U+30E7, U+30EF->U+30EE, U+30F4->U+30A6,
U+30AB->U+30F5, U+30B1->U+30F6, U+30F7->U+30EF, U+30F8->U+30F0, U+30F9->U+30F1,
U+30FA->U+30F2,

# Map “katakana phonetic extension” characters to main forms
# NB: Out of sequence b/c they were added late to the Unicode specification
# See http://unicode.org/charts/PDF/U31F0.pdf
U+30AF->U+31F0, U+30B7->U+31F1, U+30B9->U+31F2, U+30C8->U+31F3, U+30CC->U+31F4,
U+30CF->U+31F5, U+30D2->U+31F6, U+30D5->U+31F7, U+30D8->U+31F8, U+30DB->U+31F9,
U+30E0->U+31FA, U+30E9->U+31FB, U+30EA->U+31FC, U+30EB->U+31FD, U+30EC->U+31FE,
U+30ED->U+31FF,

# Map half-width katakana to full-width katakana
U+FF66->U+30F2, U+FF67->U+30A1, U+FF68->U+30A3, U+FF69->U+30A5, U+FF6A->U+30A7,
U+FF6B->U+30A9, U+FF6C->U+30E3, U+FF6D->U+30E5, U+FF6E->U+30E7, U+FF6F->U+30C3,
U+FF71->U+30A1, U+FF72->U+30A3, U+FF73->U+30A5, U+FF74->U+30A7, U+FF75->U+30A9,
U+FF76->U+30AB, U+FF77->U+30AD, U+FF78->U+30AF, U+FF79->U+30B1, U+FF7A->U+30B3,
U+FF7B->U+30B5, U+FF7C->U+30B7, U+FF7D->U+30B9, U+FF7E->U+30BB, U+FF7F->U+30BD,
U+FF80->U+30BF, U+FF81->U+30C1, U+FF82->U+30C3, U+FF83->U+30C6, U+FF84->U+30C8,
U+FF85->U+30CA, U+FF86->U+30CB, U+FF87->U+30CC, U+FF88->U+30CD, U+FF89->U+30CE,
U+FF8A->U+30CF, U+FF8B->U+30D2, U+FF8C->U+30D5, U+FF8D->U+30D8, U+FF8E->U+30DB,
U+FF8F->U+30DE, U+FF90->U+30DF, U+FF91->U+30E0, U+FF92->U+30E1, U+FF93->U+30E2,
U+FF94->U+30E3, U+FF95->U+30E5, U+FF96->U+30E7, U+FF97->U+30E9, U+FF98->U+30EA,
U+FF99->U+30EB, U+FF9A->U+30EC, U+FF9B->U+30ED, U+FF9C->U+30EF, U+FF9D->U+30F3
Note that, contrary to my previous comment (which is now amended), this file does not map equivalent hiragana to katakana (yet). I will get around to this … eventually.
From what I can tell, this website is the only source for a working Sphinx Japanese character table. If anyone actually reads this and makes a map for katakana->hiragana and hiragana->katakana, please send it through. Remember to check it thoroughly!
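As a possible starting point for such a map, here is a rough Python sketch. It relies on the fact that hiragana U+3041..U+3096 and katakana U+30A1..U+30F6 sit at a fixed offset of 0x60 from each other in Unicode. The function name `kana_entries` is made up for illustration, and the output has not been tested against Sphinx. Many of these code points are already mapped in the table above (U+30A2, for example), so the generated entries would still need to be reconciled with the small/voiced mappings by hand.

```python
HIRAGANA_FIRST = 0x3041   # ぁ
HIRAGANA_LAST = 0x3096    # ゖ
KANA_OFFSET = 0x60        # katakana code point minus hiragana code point


def kana_entries(katakana_to_hiragana=True):
    """Build charset_table style entries pairing each hiragana with its katakana."""
    entries = []
    for hira in range(HIRAGANA_FIRST, HIRAGANA_LAST + 1):
        kata = hira + KANA_OFFSET
        # Pick the direction: katakana->hiragana by default, otherwise the reverse.
        source, target = (kata, hira) if katakana_to_hiragana else (hira, kata)
        entries.append(f"U+{source:04X}->U+{target:04X}")
    return entries


# First three katakana->hiragana entries.
print(", ".join(kana_entries()[:3]))
# U+30A1->U+3041, U+30A2->U+3042, U+30A3->U+3043
```

Only one direction makes sense within a single table, since mapping both ways at once would be circular.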
Lastly, if you’re using this, please drop me a comment. Your silence is deafening.
admin
January 28th, 2010
I don’t have much background in any of this but could you explain to me how this is different from the CJK charsets given on the sphinx wiki?
abcdef123
February 26th, 2010
Oh and I do appreciate the work you’ve put into this, I think it is exactly what I was looking for but I am still curious about the differences.
abcdef123
February 26th, 2010
@abcdef123: The Sphinx wiki does have the entire Unicode mapping for CJK characters. This means it will map and index a very large tract of Chinese, Korean and Japanese characters.
See here: http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables#cjk_ngram_characters
The differences are as follows:
1. Half-width & full-width kana characters are mapped as the same character, reducing confusion when users search using full-width text against a document with half-width text. For the uninitiated, half-width katakana characters are an essential part of the Japanese banking system, and other legacy systems, but read exactly the same way as their full-width cousins. Read this for a more colorful explanation http://www.ops.ietf.org/lists/idn/idn.2001/msg02452.html.
2. Voiced (e.g. ka > ga, ha > ba) and semi-voiced (e.g. ha > pa) kana are treated the same as their unvoiced forms. This is important because people often treat these differently in speech, and therefore tend to get them wrong when searching. I believe it’s better to match more widely, and then let them choose. Admittedly this could be omitted, and you could expand the search keywords programmatically before sending them to Sphinx. The Unicode range is clearly marked for your/my convenience further down the track!
3. This character map only includes the J from CJK. So it should index source documents faster, but of course will exclude any Korean or Chinese characters. It also excludes the most rare Japanese characters because these tend to be represented as kana these days (as a part of the general dumbing down of the Japanese writing system … thank God!)
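For point 2, handling voiced kana in application code instead of in the character map could look roughly like the Python sketch below. It uses the standard `unicodedata` module to split off the combining voiced (U+3099) and semi-voiced (U+309A) marks. The helper names `strip_voicing` and `keyword_variants` are hypothetical, and this is a sketch rather than something tested against Sphinx. Normalizing this way can also alter a few non-kana characters (compatibility ideographs, for instance), so check it against your own data.

```python
import unicodedata

# Combining dakuten (voiced) and handakuten (semi-voiced) marks.
VOICING_MARKS = {"\u3099", "\u309A"}


def strip_voicing(text):
    """Fold voiced and semi-voiced kana to their unvoiced forms."""
    decomposed = unicodedata.normalize("NFD", text)  # が -> か + U+3099
    kept = "".join(ch for ch in decomposed if ch not in VOICING_MARKS)
    return unicodedata.normalize("NFC", kept)


def keyword_variants(keyword):
    """Return the keyword plus its unvoiced form when the two differ."""
    folded = strip_voicing(keyword)
    return [keyword] if folded == keyword else [keyword, folded]


print(strip_voicing("がぎぐ"))     # かきく
print(keyword_variants("パン"))    # ['パン', 'ハン']
```

Adding the folded form to the query only helps when the user types a voiced kana and the document has the unvoiced one. Catching the reverse case would mean applying the same folding to the indexed text as well.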
This map was what “I” really needed for a project and at the time I could only find a documented map for Chinese text. This map will of course match all ASCII characters, which are an essential part of writing Japanese anyway (believe it or not!) There is more information at http://www.unicode.org if you are interested in digging deeper.
Thanks also for your words of encouragement.
DISCLAIMER: Apologies for any inaccuracies in this comment, I typed this off the top of my head.
admin
February 26th, 2010
Hi, and thank you for the in depth description, it is very much appreciated. One more question though if you don’t mind:
What about the ngram characters? Are they not required for the jp charset? On http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables they advise the use of these for the CJK charsets.
abcdef123
February 26th, 2010
@abcdef123: Ngram characters tend to behave strangely! I prefer to use infix instead. Allow me to explain.
The Ngram feature treats each character as a white-space bounded word. But since there are no morphological analyzers that talk to Sphinx natively, you end up getting lots of characters that do not really match any sensible string of Japanese vocabulary.
The equivalent is searching for “dog” but Sphinx can only find “d” or “o” or “g”. This may have changed since early 2009, but I haven’t retested the Ngram feature.
On the other hand, the *infix* searching (basically a substring search) works very fast and matches pretty much anything as a string of characters.
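To make the difference concrete, here is a toy Python model of the two approaches. It is not Sphinx's actual matcher: it assumes 1-character n-grams, a query where every token must appear somewhere in the document (no phrase operator), and a plain substring test standing in for infix search. All function names are invented for the example.

```python
def unigram_tokens(text):
    """Split text into single-character tokens, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


def unigram_all_words_match(query, document):
    """True when every query character occurs somewhere in the document."""
    return set(unigram_tokens(query)) <= set(unigram_tokens(document))


def substring_match(query, document):
    """True only when the query occurs as a contiguous run of characters."""
    return query in document


document = "京都の東"   # "east of Kyoto"
query = "東京"          # "Tokyo"

print(unigram_all_words_match(query, document))  # True
print(substring_match(query, document))          # False
```

The single-character model reports a hit for Tokyo even though the document never mentions it, which is the kind of nonsensical match described above.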
Please let me know if you have a different experience? B)
admin
February 27th, 2010
Hi, thanks again. From my understanding an infix search would require the search string (if it was JP characters) to be surrounded by asterisks for this method to work correctly?
I guess it depends on the format you want your users to use when searching: asterisks for JP searches (the infix method), or quotes around multiple characters that they want searched as a single word (the ngram approach).
Thanks again for all your help.
abcdef123
February 27th, 2010