About 4 months ago I was wondering how hard it would be to get Sphinx to work in Japanese. This fantastic freetext search engine by Andrew Aksyonoff has literally changed my approach to web development.
At the time there were instructions for Chinese, but no Japanese Unicode character map. Basically, Sphinx needs a “guide” to read UTF-8 data, called a character map. It tells Sphinx which Unicode characters to index, and which to discard.
Here is the character map for Japanese:
[ruby]U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z,
U+FF41..U+FF5A->a..z, 0..9, A..Z->a..z, a..z, U+4E00..U+9FCF, U+3400..U+4DBF,
U+20000..U+2A6DF, U+3040..U+309F, U+30A0..U+30FF, U+3000..U+303F, U+3042->U+3041,
U+3044->U+3043, U+3046->U+3045, U+3048->U+3047, U+304A->U+3049,
U+304C->U+304B, U+304E->U+304D, U+3050->U+304F, U+3052->U+3051,
U+3054->U+3053, U+3056->U+3055, U+3058->U+3057, U+305A->U+3059,
U+305C->U+305B, U+305E->U+305D, U+3060->U+305F, U+3062->U+3061,
U+3064->U+3063, U+3065->U+3063, U+3067->U+3066, U+3069->U+3068,
U+3070->U+306F, U+3071->U+306F, U+3073->U+3072, U+3074->U+3072,
U+3076->U+3075, U+3077->U+3075, U+3079->U+3078, U+307A->U+3078,
U+307C->U+307B, U+307D->U+307B, U+3084->U+3083, U+3086->U+3085,
U+3088->U+3087, U+308F->U+308E, U+3094->U+3046, U+3095->U+304B,
U+3096->U+3051, U+30A2->U+30A1, U+30A4->U+30A3, U+30A6->U+30A5,
U+30A8->U+30A7, U+30AA->U+30A9, U+30AC->U+30AB, U+30AE->U+30AD,
U+30B0->U+30AF, U+30B2->U+30B1, U+30B4->U+30B3, U+30B6->U+30B5,
U+30B8->U+30B7, U+30BA->U+30B9, U+30BC->U+30BB, U+30BE->U+30BD,
U+30C0->U+30BF, U+30C2->U+30C1, U+30C5->U+30C4, U+30C7->U+30C6,
U+30C9->U+30C8, U+30D0->U+30CF, U+30D1->U+30CF, U+30D3->U+30D2,
U+30D4->U+30D2, U+30D6->U+30D5, U+30D7->U+30D5, U+30D9->U+30D8,
U+30DA->U+30D8, U+30DC->U+30DB, U+30DD->U+30DB, U+30E4->U+30E3,
U+30E6->U+30E5, U+30E8->U+30E7, U+30EF->U+30EE, U+30F4->U+30A6,
U+30AB->U+30F5, U+30B1->U+30F6, U+30F7->U+30EF, U+30F8->U+30F0,
U+30F9->U+30F1, U+30FA->U+30F2, U+30AF->U+31F0, U+30B7->U+31F1,
U+30B9->U+31F2, U+30C8->U+31F3, U+30CC->U+31F4, U+30CF->U+31F5,
U+30D2->U+31F6, U+30D5->U+31F7, U+30D8->U+31F8, U+30DB->U+31F9,
U+30E0->U+31FA, U+30E9->U+31FB, U+30EA->U+31FC, U+30EB->U+31FD,
U+30EC->U+31FE, U+30ED->U+31FF, U+FF66->U+30F2, U+FF67->U+30A1,
U+FF68->U+30A3, U+FF69->U+30A5, U+FF6A->U+30A7, U+FF6B->U+30A9,
U+FF6C->U+30E3, U+FF6D->U+30E5, U+FF6E->U+30E7, U+FF6F->U+30C3,
U+FF71->U+30A1, U+FF72->U+30A3, U+FF73->U+30A5, U+FF74->U+30A7,
U+FF75->U+30A9, U+FF76->U+30AB, U+FF77->U+30AD, U+FF78->U+30AF,
U+FF79->U+30B1, U+FF7A->U+30B3, U+FF7B->U+30B5, U+FF7C->U+30B7,
U+FF7D->U+30B9, U+FF7E->U+30BB, U+FF7F->U+30BD, U+FF80->U+30BF,
U+FF81->U+30C1, U+FF82->U+30C3, U+FF83->U+30C6, U+FF84->U+30C8,
U+FF85->U+30CA, U+FF86->U+30CB, U+FF87->U+30CC, U+FF88->U+30CD,
U+FF89->U+30CE, U+FF8A->U+30CF, U+FF8B->U+30D2, U+FF8C->U+30D5,
U+FF8D->U+30D8, U+FF8E->U+30DB, U+FF8F->U+30DE, U+FF90->U+30DF,
U+FF91->U+30E0, U+FF92->U+30E1, U+FF93->U+30E2, U+FF94->U+30E3,
U+FF95->U+30E5, U+FF96->U+30E7, U+FF97->U+30E9, U+FF98->U+30EA,
U+FF99->U+30EB, U+FF9A->U+30EC, U+FF9B->U+30ED, U+FF9C->U+30EF,
U+FF9D->U+30F3[/ruby]
Basically this character map covers the following ranges:
# Unified Kanji U+4E00..U+9FCF
# Extended Kanji U+3400..U+4DBF
# Extension B U+20000..U+2A6DF
# Hiragana U+3040..U+309F
# Katakana U+30A0..U+30FF
# Punctuation U+3000..U+303F
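To check which of these ranges a given character falls into, a small script along the following lines can help. It is only a sketch: the `RANGES` list and the `classify` helper are names made up for this example, and the list covers only the six ranges above, not the Latin letters and digits that the full character map also accepts.

```python
# Range names and bounds copied from the list above.
RANGES = [
    ("Unified Kanji", 0x4E00, 0x9FCF),
    ("Extended Kanji", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Hiragana", 0x3040, 0x309F),
    ("Katakana", 0x30A0, 0x30FF),
    ("Punctuation", 0x3000, 0x303F),
]


def classify(ch):
    """Return the name of the listed range containing ch, or None."""
    cp = ord(ch)
    for name, lo, hi in RANGES:
        if lo <= cp <= hi:
            return name
    return None


if __name__ == "__main__":
    # Kanji, hiragana, katakana, an ideographic comma, the long vowel mark, ASCII "!"
    for ch in "\u691C\u7D22\u304B\u30AB\u3001\u30FC!":
        print(f"U+{ord(ch):04X} -> {classify(ch)}")
```

Anything that comes back as `None` here (and is not a Latin letter or digit) is not covered by the map.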
Additionally, it automatically converts half-width (hankaku) katakana to full-width (zenkaku) katakana, and full-width Latin letters and digits to their plain ASCII equivalents, so your users only need to input data in full-width kana to match half-width data!
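To make the folding concrete, here is a rough Python sketch that parses a handful of entries in the same `charset_table` syntax and applies them to a string. It is a toy model, not Sphinx's own code: `parse_charset_table` and `fold` are hypothetical helper names, the `SUBSET` string is a hand-picked excerpt of the map above, and the only behaviour it imitates is that mapped characters are replaced and unlisted characters act as separators.

```python
def parse_codepoint(token):
    """Parse 'U+30A1' or a single literal character into a code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    if len(token) == 1:
        return ord(token)
    raise ValueError(f"cannot parse {token!r}")


def parse_charset_table(table):
    """Build {source code point: indexed code point} from charset_table text."""
    mapping = {}
    for entry in table.replace("\n", " ").split(","):
        entry = entry.strip()
        if not entry:
            continue
        src, _, dst = entry.partition("->")
        src_lo, _, src_hi = src.partition("..")
        lo = parse_codepoint(src_lo)
        hi = parse_codepoint(src_hi) if src_hi else lo
        # Without '->' the character (or range) is indexed as itself.
        dst_lo = parse_codepoint(dst.partition("..")[0]) if dst else lo
        for offset in range(hi - lo + 1):
            mapping[lo + offset] = dst_lo + offset
    return mapping


def fold(text, mapping):
    """Apply the map; characters not in the map become a separator (space)."""
    return "".join(
        chr(mapping[ord(c)]) if ord(c) in mapping else " " for c in text
    )


# Hand-picked excerpt of the character map from this post.
SUBSET = """U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z,
U+FF41..U+FF5A->a..z, 0..9, A..Z->a..z, a..z, U+30A0..U+30FF,
U+FF92->U+30E1, U+FF93->U+30E2"""
TABLE = parse_charset_table(SUBSET)

if __name__ == "__main__":
    half = "\uFF92\uFF93"  # half-width katakana "me", "mo"
    full = "\u30E1\u30E2"  # full-width katakana "me", "mo"
    print(fold(half, TABLE) == fold(full, TABLE))  # True
```

Because both spellings fold to the same indexed characters, a full-width query finds half-width data.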
Also, in order to get meaningful results, use the following settings too:
min_word_len = 1
min_infix_len = 1
morphology = none
ngram_len = 0
enable_star = 1
Whilst n-gram support is touted for Asian languages, in practice Sphinx is simply not set up to support languages without word boundaries. So the settings above give you substring matching at a minimum. It works brilliantly for most common uses.
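As a mental model of what substring matching buys you, the sketch below indexes every infix of each whitespace-separated token and answers star queries by lookup. This is an illustrative toy under simplifying assumptions, not how Sphinx stores its index; `infixes`, `build_index`, `search` and the sample `DOCS` are all invented for the example.

```python
from collections import defaultdict


def infixes(token, min_infix_len=1):
    """All substrings of token that are at least min_infix_len long."""
    n = len(token)
    return {
        token[i:j]
        for i in range(n)
        for j in range(i + min_infix_len, n + 1)
    }


def build_index(docs, min_infix_len=1):
    """Map every infix of every token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            for piece in infixes(token, min_infix_len):
                index[piece].add(doc_id)
    return index


def search(index, query):
    """Answer a '*term*' style query by stripping the stars and looking up."""
    return sorted(index.get(query.strip("*"), set()))


# Sample data: with no word boundaries, each run of text is one long "word".
DOCS = {1: "東京都庁", 2: "京都大学", 3: "大阪"}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(search(INDEX, "*京都*"))  # [1, 2]
    print(search(INDEX, "*東京*"))  # [1]
```

Note that the first query also hits document 1, because 京都 happens to occur inside 東京都. That is the trade-off of pure substring matching.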
Also, for those of you who like using the rather handy Thinking Sphinx plugin by Pat Allan, here is a setup block for your sphinx.yml file.
[ruby]mem_limit: 256M
listen: 127.0.0.1:3313
read_timeout: 5
max_children: 300
seamless_rotate: 1
pid_file: /sphinx-0.9.8/db/log/searchd.pid
searchd_file_path: /sphinx-0.9.8/db/sphinx_index_main
searchd_log_file: /sphinx-0.9.8/db/log/searchd.log
query_log_file: /sphinx-0.9.8/db/log/query.log
enable_star: 1
html_strip: 1
max_matches: 10000
min_prefix_len: 0
min_infix_len: 1
min_word_len: 1
morphology: none
ngram_len: 0
sql_ranged_throttle: 0
sql_range_step: 5000
dictionary_name: ap
charset_type: utf-8
charset_table: U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z,
U+FF41..U+FF5A->a..z, 0..9, A..Z->a..z, a..z, U+4E00..U+9FCF, U+3400..U+4DBF,
U+20000..U+2A6DF, U+3040..U+309F, U+30A0..U+30FF, U+3000..U+303F, U+3042->U+3041,
U+3044->U+3043, U+3046->U+3045, U+3048->U+3047, U+304A->U+3049,
U+304C->U+304B, U+304E->U+304D, U+3050->U+304F, U+3052->U+3051,
U+3054->U+3053, U+3056->U+3055, U+3058->U+3057, U+305A->U+3059,
U+305C->U+305B, U+305E->U+305D, U+3060->U+305F, U+3062->U+3061,
U+3064->U+3063, U+3065->U+3063, U+3067->U+3066, U+3069->U+3068,
U+3070->U+306F, U+3071->U+306F, U+3073->U+3072, U+3074->U+3072,
U+3076->U+3075, U+3077->U+3075, U+3079->U+3078, U+307A->U+3078,
U+307C->U+307B, U+307D->U+307B, U+3084->U+3083, U+3086->U+3085,
U+3088->U+3087, U+308F->U+308E, U+3094->U+3046, U+3095->U+304B,
U+3096->U+3051, U+30A2->U+30A1, U+30A4->U+30A3, U+30A6->U+30A5,
U+30A8->U+30A7, U+30AA->U+30A9, U+30AC->U+30AB, U+30AE->U+30AD,
U+30B0->U+30AF, U+30B2->U+30B1, U+30B4->U+30B3, U+30B6->U+30B5,
U+30B8->U+30B7, U+30BA->U+30B9, U+30BC->U+30BB, U+30BE->U+30BD,
U+30C0->U+30BF, U+30C2->U+30C1, U+30C5->U+30C4, U+30C7->U+30C6,
U+30C9->U+30C8, U+30D0->U+30CF, U+30D1->U+30CF, U+30D3->U+30D2,
U+30D4->U+30D2, U+30D6->U+30D5, U+30D7->U+30D5, U+30D9->U+30D8,
U+30DA->U+30D8, U+30DC->U+30DB, U+30DD->U+30DB, U+30E4->U+30E3,
U+30E6->U+30E5, U+30E8->U+30E7, U+30EF->U+30EE, U+30F4->U+30A6,
U+30AB->U+30F5, U+30B1->U+30F6, U+30F7->U+30EF, U+30F8->U+30F0,
U+30F9->U+30F1, U+30FA->U+30F2, U+30AF->U+31F0, U+30B7->U+31F1,
U+30B9->U+31F2, U+30C8->U+31F3, U+30CC->U+31F4, U+30CF->U+31F5,
U+30D2->U+31F6, U+30D5->U+31F7, U+30D8->U+31F8, U+30DB->U+31F9,
U+30E0->U+31FA, U+30E9->U+31FB, U+30EA->U+31FC, U+30EB->U+31FD,
U+30EC->U+31FE, U+30ED->U+31FF, U+FF66->U+30F2, U+FF67->U+30A1,
U+FF68->U+30A3, U+FF69->U+30A5, U+FF6A->U+30A7, U+FF6B->U+30A9,
U+FF6C->U+30E3, U+FF6D->U+30E5, U+FF6E->U+30E7, U+FF6F->U+30C3,
U+FF71->U+30A1, U+FF72->U+30A3, U+FF73->U+30A5, U+FF74->U+30A7,
U+FF75->U+30A9, U+FF76->U+30AB, U+FF77->U+30AD, U+FF78->U+30AF,
U+FF79->U+30B1, U+FF7A->U+30B3, U+FF7B->U+30B5, U+FF7C->U+30B7,
U+FF7D->U+30B9, U+FF7E->U+30BB, U+FF7F->U+30BD, U+FF80->U+30BF,
U+FF81->U+30C1, U+FF82->U+30C3, U+FF83->U+30C6, U+FF84->U+30C8,
U+FF85->U+30CA, U+FF86->U+30CB, U+FF87->U+30CC, U+FF88->U+30CD,
U+FF89->U+30CE, U+FF8A->U+30CF, U+FF8B->U+30D2, U+FF8C->U+30D5,
U+FF8D->U+30D8, U+FF8E->U+30DB, U+FF8F->U+30DE, U+FF90->U+30DF,
U+FF91->U+30E0, U+FF92->U+30E1, U+FF93->U+30E2, U+FF94->U+30E3,
U+FF95->U+30E5, U+FF96->U+30E7, U+FF97->U+30E9, U+FF98->U+30EA,
U+FF99->U+30EB, U+FF9A->U+30EC, U+FF9B->U+30ED, U+FF9C->U+30EF,
U+FF9D->U+30F3
mlock: 0[/ruby]
Thanks for your post. I am currently testing Sphinx for use in a large Japanese online shopping website. I followed your character map guide, added it to my sphinx.conf, and reindexed. All queries for kanji-based words are fine, but katakana words just return all of the results in the DB. Did you find this? I am using 9.8.1
Richard
May 1st, 2009
@richard – I’ll look into this and email you separately. Please note, these mappings treat equivalent Katakana and Hiragana characters as the same character. This means searching for かな will also match カナ and ｶﾅ. So you basically get a phonetic search for any type of kana characters.
admin
May 1st, 2009
@richard – please check out my latest post about the Sphinx Japanese Character Table. It explains what each section does, and it confirms that hiragana does not match katakana.
Also I have separately confirmed that straight katakana searching does work! Can you privately send me a sample of your test data, and your setup file. I’ll see what I can do.
admin
May 1st, 2009
Thanks for this!
Just a quick question – you’re including
# Katakana U+30A0..U+30FF
I don’t know much about the Japanese language, but should U+30FB..U+30FF be included? (I’m asking because U+30FB is fullwidth katakana middle dot, and I thought this was used as a word separator)?
Dave
September 21st, 2009
Hey there, great to see someone documenting how they use Sphinx and Thinking Sphinx with Japanese characters. One thing to note, though: my name’s Pat Allan, not Paul Smith.
pat
September 30th, 2009
@Dave – Thanks for bringing this to my attention. I’ve included the middle dot because I wanted it to be searchable. The middle dot is often used for separating foreign words to match their source language equivalent word separation (esp. for foreign names).
More importantly, the boubiki (long vowel sound marker) character U+30FC is also in this range. See http://www.unicode.org/charts/PDF/Unicode-3.2/U32-30A0.pdf
The middle dot may not be everyone’s cup of tea, but the boubiki (pronounced like “bore-bikkie”) is definitely needed.
I may change my mind about the middle separator dot, because it will probably hinder more searches than it helps, but the “boubiki” is essential.
admin
September 30th, 2009
@Pat – Sorry about that! It’s all fixed. It was just one of those things.
Would it be any better if I said I’ve been going around talking about Pat Allan the clothing designer??!
admin
October 2nd, 2009