crunchytoast.com

crunchyt sez:

This is my first website ... after 15 years of making them for everyone else! Hope you enjoy it too.

About 4 months ago I was wondering how hard it would be to get Sphinx to work in Japanese. This fantastic full-text search engine by Andrew Aksyonoff has literally changed my approach to web development.

At the time there were instructions for Chinese, but no Japanese Unicode character map. Basically, Sphinx needs a “guide” to read UTF-8 data, called a character map. It tells Sphinx which Unicode characters to index, and which to discard.
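To make the idea concrete, here is a minimal Python sketch of what a character map does: it reads entries such as `A..Z->a..z` or `U+3040..U+309F`, folds mapped characters to their targets, and treats anything not listed as a separator. The function names (`parse_charset_table`, `tokenize`) are my own for illustration and are not part of Sphinx. The sketch covers only the entry forms used in this post, and it assumes that a later entry overrides an earlier one.

```python
def parse_codepoint(token):
    """Turn 'U+3042' or a literal character such as 'a' into a code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    if len(token) != 1:
        raise ValueError(f"cannot parse {token!r}")
    return ord(token)


def parse_range(text):
    """Parse 'X' or 'X..Y' into an inclusive (start, end) pair."""
    if ".." in text:
        low, high = text.split("..")
        return parse_codepoint(low), parse_codepoint(high)
    point = parse_codepoint(text)
    return point, point


def parse_charset_table(table):
    """Build {source code point: target code point} from a table string."""
    mapping = {}
    for item in table.split(","):
        item = item.strip()
        if not item:
            continue
        source, _, target = item.partition("->")
        s_low, s_high = parse_range(source)
        # An entry without '->' maps every character to itself.
        t_low, t_high = parse_range(target) if target else (s_low, s_high)
        if s_high - s_low != t_high - t_low:
            raise ValueError(f"range sizes differ in {item!r}")
        for offset in range(s_high - s_low + 1):
            # Assumption: a later entry overwrites an earlier one.
            mapping[s_low + offset] = t_low + offset
    return mapping


def tokenize(text, mapping):
    """Fold mapped characters; split on anything the table does not list."""
    words, current = [], []
    for ch in text:
        target = mapping.get(ord(ch))
        if target is None:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(chr(target))
    if current:
        words.append("".join(current))
    return words


# A small table in the same style as the full Japanese one.
sample = "0..9, A..Z->a..z, a..z, U+FF21..U+FF3A->a..z, U+3040..U+309F"
mapping = parse_charset_table(sample)
# "\uff21" is full-width A; "\u304b\u306a" is hiragana "kana".
print(tokenize("Ab\uff21 \u304b\u306a!", mapping))  # ['aba', 'かな']
```

The space and the exclamation mark are not in the sample table, so they act as word boundaries, which is what "discard" means in practice.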

Here is the character map for Japanese:

U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z,
U+FF41..U+FF5A->a..z, 0..9, A..Z->a..z, a..z, U+4E00..U+9FCF, U+3400..U+4DBF,
U+20000..U+2A6DF, U+3040..U+309F, U+30A0..U+30FF, U+3000..U+303F, U+3042->U+3041,
U+3044->U+3043, U+3046->U+3045, U+3048->U+3047, U+304A->U+3049,
U+304C->U+304B, U+304E->U+304D, U+3050->U+304F, U+3052->U+3051,
U+3054->U+3053, U+3056->U+3055, U+3058->U+3057, U+305A->U+3059,
U+305C->U+305B, U+305E->U+305D, U+3060->U+305F, U+3062->U+3061,
U+3064->U+3063, U+3065->U+3063, U+3067->U+3066, U+3069->U+3068,
U+3070->U+306F, U+3071->U+306F, U+3073->U+3072, U+3074->U+3072,
U+3076->U+3075, U+3077->U+3075, U+3079->U+3078, U+307A->U+3078,
U+307C->U+307B, U+307D->U+307B, U+3084->U+3083, U+3086->U+3085,
U+3088->U+3087, U+308F->U+308E, U+3094->U+3046, U+3095->U+304B,
U+3096->U+3051, U+30A2->U+30A1, U+30A4->U+30A3, U+30A6->U+30A5,
U+30A8->U+30A7, U+30AA->U+30A9, U+30AC->U+30AB, U+30AE->U+30AD,
U+30B0->U+30AF, U+30B2->U+30B1, U+30B4->U+30B3, U+30B6->U+30B5,
U+30B8->U+30B7, U+30BA->U+30B9, U+30BC->U+30BB, U+30BE->U+30BD,
U+30C0->U+30BF, U+30C2->U+30C1, U+30C5->U+30C4, U+30C7->U+30C6,
U+30C9->U+30C8, U+30D0->U+30CF, U+30D1->U+30CF, U+30D3->U+30D2,
U+30D4->U+30D2, U+30D6->U+30D5, U+30D7->U+30D5, U+30D9->U+30D8,
U+30DA->U+30D8, U+30DC->U+30DB, U+30DD->U+30DB, U+30E4->U+30E3,
U+30E6->U+30E5, U+30E8->U+30E7, U+30EF->U+30EE, U+30F4->U+30A6,
U+30AB->U+30F5, U+30B1->U+30F6, U+30F7->U+30EF, U+30F8->U+30F0,
U+30F9->U+30F1, U+30FA->U+30F2, U+30AF->U+31F0, U+30B7->U+31F1,
U+30B9->U+31F2, U+30C8->U+31F3, U+30CC->U+31F4, U+30CF->U+31F5,
U+30D2->U+31F6, U+30D5->U+31F7, U+30D8->U+31F8, U+30DB->U+31F9,
U+30E0->U+31FA, U+30E9->U+31FB, U+30EA->U+31FC, U+30EB->U+31FD,
U+30EC->U+31FE, U+30ED->U+31FF, U+FF66->U+30F2, U+FF67->U+30A1,
U+FF68->U+30A3, U+FF69->U+30A5, U+FF6A->U+30A7, U+FF6B->U+30A9,
U+FF6C->U+30E3, U+FF6D->U+30E5, U+FF6E->U+30E7, U+FF6F->U+30C3,
U+FF71->U+30A1, U+FF72->U+30A3, U+FF73->U+30A5, U+FF74->U+30A7,
U+FF75->U+30A9, U+FF76->U+30AB, U+FF77->U+30AD, U+FF78->U+30AF,
U+FF79->U+30B1, U+FF7A->U+30B3, U+FF7B->U+30B5, U+FF7C->U+30B7,
U+FF7D->U+30B9, U+FF7E->U+30BB, U+FF7F->U+30BD, U+FF80->U+30BF,
U+FF81->U+30C1, U+FF82->U+30C3, U+FF83->U+30C6, U+FF84->U+30C8,
U+FF85->U+30CA, U+FF86->U+30CB, U+FF87->U+30CC, U+FF88->U+30CD,
U+FF89->U+30CE, U+FF8A->U+30CF, U+FF8B->U+30D2, U+FF8C->U+30D5,
U+FF8D->U+30D8, U+FF8E->U+30DB, U+FF8F->U+30DE, U+FF90->U+30DF,
U+FF91->U+30E0, U+FF92->U+30E1, U+FF93->U+30E2, U+FF94->U+30E3,
U+FF95->U+30E5, U+FF96->U+30E7, U+FF97->U+30E9, U+FF98->U+30EA,
U+FF99->U+30EB, U+FF9A->U+30EC, U+FF9B->U+30ED, U+FF9C->U+30EF,
U+FF9D->U+30F3

Basically this character map covers the following ranges:

# Unified Kanji    U+4E00..U+9FCF
# Extended Kanji   U+3400..U+4DBF
# Extension B      U+20000..U+2A6DF
# Hiragana         U+3040..U+309F
# Katakana         U+30A0..U+30FF
# Punctuation      U+3000..U+303F
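If you want to check which of these ranges a given character falls into, a quick lookup like the following can help. It is only a sketch: the `RANGES` list copies the six ranges above, and the `classify` helper is a name I made up for illustration.

```python
# The six ranges listed above, as (label, first, last) with inclusive bounds.
RANGES = [
    ("Unified Kanji", 0x4E00, 0x9FCF),
    ("Extended Kanji", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Hiragana", 0x3040, 0x309F),
    ("Katakana", 0x30A0, 0x30FF),
    ("Punctuation", 0x3000, 0x303F),
]


def classify(ch):
    """Return the label of the range containing ch, or None if not covered."""
    point = ord(ch)
    for label, first, last in RANGES:
        if first <= point <= last:
            return label
    return None


for ch in ["\u6f22", "\u3042", "\u30ab", "\u3001", "a"]:
    print(f"U+{ord(ch):04X}", classify(ch))
```

A result of `None` just means the character is outside these six ranges. It may still be indexed through another entry in the map, such as `a..z`.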

Additionally, it automatically folds half-width (hankaku) katakana to full-width (zenkaku) katakana, and full-width digits and Latin letters to their plain ASCII equivalents. So your users only need to input katakana in full-width characters to match half-width data!
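For anyone who wants to see the same kind of width folding outside Sphinx, Python's standard `unicodedata` module offers something comparable through NFKC normalization. To be clear, this is a different mechanism from the character map and not what Sphinx does internally. For example, NFKC keeps upper case, while the map lowercases Latin letters. The `fold_width` name is just for illustration.

```python
import unicodedata


def fold_width(text):
    """Fold half-width katakana and full-width ASCII with NFKC."""
    return unicodedata.normalize("NFKC", text)


# "\uff76\uff85" is half-width katakana "kana".
half_width = "\uff76\uff85"
# "\uff21\uff11" is full-width "A1".
full_width_ascii = "\uff21\uff11"

print(fold_width(half_width))        # カナ
print(fold_width(full_width_ascii))  # A1
```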

Also, in order to get meaningful results, use the following settings too:

min_word_len = 1
min_infix_len = 1
morphology = none
ngram_len = 0
enable_star = 1

Whilst N-gram support is touted for Asian languages, in practice Sphinx is simply not set up to support languages without word boundaries. So the settings above give you sub-string matching at minimum. It works brilliantly for most common uses.
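To illustrate why a minimum infix length of 1 gives sub-string matching, here is a toy Python model. It is a simplified sketch and not how Sphinx stores its index: it just records every substring of each token and looks queries up directly. The names `infixes`, `build_index` and `search` are hypothetical.

```python
def infixes(token, min_infix_len=1):
    """Return every substring of token at least min_infix_len long."""
    n = len(token)
    return {
        token[i:j]
        for i in range(n)
        for j in range(i + min_infix_len, n + 1)
    }


def build_index(docs, min_infix_len=1):
    """Map each infix to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            for part in infixes(token, min_infix_len):
                index.setdefault(part, set()).add(doc_id)
    return index


def search(index, query):
    """Look up a query as a substring, roughly like a *query* search."""
    return sorted(index.get(query, set()))


# Japanese text has no spaces, so each document is one long token.
docs = {1: "東京都庁", 2: "京都大学"}
index = build_index(docs)

print(search(index, "京都"))  # [1, 2]
print(search(index, "都庁"))  # [1]
print(search(index, "大阪"))  # []
```

Note that 京都 matches document 1 even though it straddles what a human would read as two words (東京 and 都庁). That is the trade-off of plain sub-string matching without real word segmentation.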

Also, for those of you who like using the rather handy Thinking Sphinx plugin by Pat Allan, here is a setup block for your sphinx.yml file.

mem_limit: 256M
listen: 127.0.0.1:3313
read_timeout: 5
max_children: 300
seamless_rotate: 1
pid_file: /sphinx-0.9.8/db/log/searchd.pid
searchd_file_path: /sphinx-0.9.8/db/sphinx_index_main
searchd_log_file: /sphinx-0.9.8/db/log/searchd.log
query_log_file: /sphinx-0.9.8/db/log/query.log
enable_star: 1
html_strip: 1
max_matches: 10000
min_prefix_len: 0
min_infix_len: 1
min_word_len: 1
morphology: none
ngram_len: 0
sql_ranged_throttle: 0
sql_range_step: 5000
dictionary_name: ap
charset_type: utf-8
charset_table: U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z,
U+FF41..U+FF5A->a..z, 0..9, A..Z->a..z, a..z, U+4E00..U+9FCF, U+3400..U+4DBF,
U+20000..U+2A6DF, U+3040..U+309F, U+30A0..U+30FF, U+3000..U+303F, U+3042->U+3041,
U+3044->U+3043, U+3046->U+3045, U+3048->U+3047, U+304A->U+3049,
U+304C->U+304B, U+304E->U+304D, U+3050->U+304F, U+3052->U+3051,
U+3054->U+3053, U+3056->U+3055, U+3058->U+3057, U+305A->U+3059,
U+305C->U+305B, U+305E->U+305D, U+3060->U+305F, U+3062->U+3061,
U+3064->U+3063, U+3065->U+3063, U+3067->U+3066, U+3069->U+3068,
U+3070->U+306F, U+3071->U+306F, U+3073->U+3072, U+3074->U+3072,
U+3076->U+3075, U+3077->U+3075, U+3079->U+3078, U+307A->U+3078,
U+307C->U+307B, U+307D->U+307B, U+3084->U+3083, U+3086->U+3085,
U+3088->U+3087, U+308F->U+308E, U+3094->U+3046, U+3095->U+304B,
U+3096->U+3051, U+30A2->U+30A1, U+30A4->U+30A3, U+30A6->U+30A5,
U+30A8->U+30A7, U+30AA->U+30A9, U+30AC->U+30AB, U+30AE->U+30AD,
U+30B0->U+30AF, U+30B2->U+30B1, U+30B4->U+30B3, U+30B6->U+30B5,
U+30B8->U+30B7, U+30BA->U+30B9, U+30BC->U+30BB, U+30BE->U+30BD,
U+30C0->U+30BF, U+30C2->U+30C1, U+30C5->U+30C4, U+30C7->U+30C6,
U+30C9->U+30C8, U+30D0->U+30CF, U+30D1->U+30CF, U+30D3->U+30D2,
U+30D4->U+30D2, U+30D6->U+30D5, U+30D7->U+30D5, U+30D9->U+30D8,
U+30DA->U+30D8, U+30DC->U+30DB, U+30DD->U+30DB, U+30E4->U+30E3,
U+30E6->U+30E5, U+30E8->U+30E7, U+30EF->U+30EE, U+30F4->U+30A6,
U+30AB->U+30F5, U+30B1->U+30F6, U+30F7->U+30EF, U+30F8->U+30F0,
U+30F9->U+30F1, U+30FA->U+30F2, U+30AF->U+31F0, U+30B7->U+31F1,
U+30B9->U+31F2, U+30C8->U+31F3, U+30CC->U+31F4, U+30CF->U+31F5,
U+30D2->U+31F6, U+30D5->U+31F7, U+30D8->U+31F8, U+30DB->U+31F9,
U+30E0->U+31FA, U+30E9->U+31FB, U+30EA->U+31FC, U+30EB->U+31FD,
U+30EC->U+31FE, U+30ED->U+31FF, U+FF66->U+30F2, U+FF67->U+30A1,
U+FF68->U+30A3, U+FF69->U+30A5, U+FF6A->U+30A7, U+FF6B->U+30A9,
U+FF6C->U+30E3, U+FF6D->U+30E5, U+FF6E->U+30E7, U+FF6F->U+30C3,
U+FF71->U+30A1, U+FF72->U+30A3, U+FF73->U+30A5, U+FF74->U+30A7,
U+FF75->U+30A9, U+FF76->U+30AB, U+FF77->U+30AD, U+FF78->U+30AF,
U+FF79->U+30B1, U+FF7A->U+30B3, U+FF7B->U+30B5, U+FF7C->U+30B7,
U+FF7D->U+30B9, U+FF7E->U+30BB, U+FF7F->U+30BD, U+FF80->U+30BF,
U+FF81->U+30C1, U+FF82->U+30C3, U+FF83->U+30C6, U+FF84->U+30C8,
U+FF85->U+30CA, U+FF86->U+30CB, U+FF87->U+30CC, U+FF88->U+30CD,
U+FF89->U+30CE, U+FF8A->U+30CF, U+FF8B->U+30D2, U+FF8C->U+30D5,
U+FF8D->U+30D8, U+FF8E->U+30DB, U+FF8F->U+30DE, U+FF90->U+30DF,
U+FF91->U+30E0, U+FF92->U+30E1, U+FF93->U+30E2, U+FF94->U+30E3,
U+FF95->U+30E5, U+FF96->U+30E7, U+FF97->U+30E9, U+FF98->U+30EA,
U+FF99->U+30EB, U+FF9A->U+30EC, U+FF9B->U+30ED, U+FF9C->U+30EF,
U+FF9D->U+30F3
mlock: 0

7 Responses to “Sphinx Search in Japanese”

  1. Thanks for your post, I am currently testing Sphinx to use in a large online shopping website in Japanese. I followed your character map guide, added it to my sphinx.conf, and reindexed. All queries of kanji based words are fine but katakana words are just returning all of the results in the DB? Did you find this? I am using 9.8.1

    Richard

  2. @richard - I’ll look into this and email you separately. Please note, these mappings treat equivalent Katakana and Hiragana characters as the same character. This means searching for かな will also match カナ and ｶﾅ. So you basically get a phonetic search for any type of kana characters.

    admin

  3. @richard - please check out my latest post about the Sphinx Japanese Character Table. It explains what each section does, and it confirms that hiragana does not match to katakana.

    Also I have separately confirmed that straight katakana searching does work! Can you privately send me a sample of your test data and your setup file? I’ll see what I can do.

    admin

  4. Thanks for this!

    Just a quick question - you’re including
    # Katakana U+30A0..U+30FF

    I don’t know much about the Japanese language, but should U+30FB..U+30FF be included? (I’m asking because U+30FB is fullwidth katakana middle dot, and I thought this was used as a word separator)?

    Dave

  5. Hey there, great to see someone documenting how they use Sphinx and Thinking Sphinx with Japanese characters. One thing to note, though: my name’s Pat Allan, not Paul Smith :)

    pat

  6. @Dave - Thanks for bringing this to my attention. I’ve included the middle dot because I wanted it to be searchable. The middle dot is often used for separating foreign words to match their source language equivalent word separation (esp. for foreign names).

    More importantly, the boubiki (long vowel sound marker) character U+30FC is also in this range. See http://www.unicode.org/charts/PDF/Unicode-3.2/U32-30A0.pdf

    The middle dot may not be everyone’s cup of tea, but the boubiki (pronounced like “bore-bikkie”) is definitely needed.

    I may change my mind about the middle separator dot, because it will probably hinder more searches than it helps, but the “boubiki” is essential.

    admin

  7. @Pat - Sorry about that! It’s all fixed. It was just one of those things.
    Would it be any better if I said I’ve been going around talking about Pat Allan the clothing designer??!

    admin
