crunchytoast.com

New shoes for crunchytoast!

crunchyt sez:

This is my first website ... after 15 years of making them for everyone else! Hope you enjoy it too.

I recently received some comments regarding the Sphinx Search Japanese character table. This prompted me to document it a little better. To be honest, I originally put the character table together during a late night coding binge in 2008. As such my memory was a little fuzzy.

After visiting the Oracle of Unicode (now resident at unicode.org), I was able to clear up somethings for myself. Hopefully it helps you too, where-ever you are!

Sphinx Character Mapping
Sphinx allows you to specify “allowed” characters as well as match one type of character to another. In Japanese this is useful for matching small characters, or voiced characters, to their unadorned cousins.

Example 1:
か (unadorned, main form 'KA')
が (voiced character, diacritically marked)
Example 2:
あ (unadorned, main form 'A')
ぁ (small form of  'A')

Here is an example of a mapping U+3042->U+3041. This declares the left-hand character as “allowed”, and also matches it to the right-hand character when searching. However, this alone does not declare the right-hand character to as “allowed”.

In the Sphinx Japanese Character Table, we transform one way only. So the simpler, main form matches to the rarer forms (e.g. か==が, or あ==ぁ), but not the other way round (e.g. が!=か, and ぁ!=あ).

Here is the rest of the file explained.

### Standard ASCII Inclusions ###
0..9, A..Z->a..z, a..z

### Include/Transform full-width (zenkaku) to half-width (hankaku) roman forms ###
# See http://unicode.org/charts/PDF/UFF00.pdf
U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z, U+FF41..U+FF5A->a..z,

### Japanese Character Inclusions ###
# Unified Kanji U+4E00..U+9FCF
# Extended Kanji U+3400..U+4DBF
# Extension B U+20000..U+2A6DF
# Hiragana U+3040..U+309F
# Katakana U+30A0..U+30FF
# Punctuation U+3000..U+303F
U+4E00..U+9FCF, U+3400..U+4DBF, U+20000..U+2A6DF, U+3040..U+309F, U+30A0..U+30FF, U+3000..U+303F,

### Character Transformations ###
# Allow large-hiragana to be matched against small-hiragana equivalents
# Allow non-voiced-hiragana to be matched against voiced-hiragana equivalents
# See http://unicode.org/charts/PDF/U3040.pdf

# Map hiragana small/voiced characters to main form
U+3042->U+3041, U+3044->U+3043, U+3046->U+3045, U+3048->U+3047, U+304A->U+3049, U+304C->U+304B, U+304E->U+304D, U+3050->U+304F, U+3052->U+3051, U+3054->U+3053, U+3056->U+3055, U+3058->U+3057, U+305A->U+3059, U+305C->U+305B, U+305E->U+305D, U+3060->U+305F, U+3062->U+3061, U+3064->U+3063, U+3065->U+3063, U+3067->U+3066, U+3069->U+3068, U+3070->U+306F, U+3071->U+306F, U+3073->U+3072, U+3074->U+3072, U+3076->U+3075, U+3077->U+3075, U+3079->U+3078, U+307A->U+3078, U+307C->U+307B, U+307D->U+307B, U+3084->U+3083, U+3086->U+3085, U+3088->U+3087, U+308F->U+308E, U+3094->U+3046, U+3095->U+304B, U+3096->U+3051,

# Map katakana small/voiced characters to main form
U+30A2->U+30A1, U+30A4->U+30A3, U+30A6->U+30A5, U+30A8->U+30A7, U+30AA->U+30A9, U+30AC->U+30AB, U+30AE->U+30AD, U+30B0->U+30AF, U+30B2->U+30B1, U+30B4->U+30B3, U+30B6->U+30B5, U+30B8->U+30B7, U+30BA->U+30B9, U+30BC->U+30BB, U+30BE->U+30BD, U+30C0->U+30BF, U+30C2->U+30C1, U+30C5->U+30C4, U+30C7->U+30C6, U+30C9->U+30C8, U+30D0->U+30CF, U+30D1->U+30CF, U+30D3->U+30D2, U+30D4->U+30D2, U+30D6->U+30D5, U+30D7->U+30D5, U+30D9->U+30D8, U+30DA->U+30D8, U+30DC->U+30DB, U+30DD->U+30DB,
U+30E4->U+30E3, U+30E6->U+30E5, U+30E8->U+30E7, U+30EF->U+30EE, U+30F4->U+30A6, U+30AB->U+30F5, U+30B1->U+30F6, U+30F7->U+30EF, U+30F8->U+30F0, U+30F9->U+30F1, U+30FA->U+30F2,

# Map “katakana phonetic extension” characters to main forms
# NB: Out of sequence b/c they were added late to the Unicode specification
# See http://unicode.org/charts/PDF/U31F0.pdf
U+30AF->U+31F0, U+30B7->U+31F1, U+30B9->U+31F2, U+30C8->U+31F3, U+30CC->U+31F4, U+30CF->U+31F5, U+30D2->U+31F6, U+30D5->U+31F7, U+30D8->U+31F8, U+30DB->U+31F9, U+30E0->U+31FA, U+30E9->U+31FB, U+30EA->U+31FC, U+30EB->U+31FD, U+30EC->U+31FE, U+30ED->U+31FF,

# Map half-width katakana to full-width katakana
U+FF66->U+30F2, U+FF67->U+30A1, U+FF68->U+30A3, U+FF69->U+30A5, U+FF6A->U+30A7, U+FF6B->U+30A9, U+FF6C->U+30E3, U+FF6D->U+30E5, U+FF6E->U+30E7, U+FF6F->U+30C3, U+FF71->U+30A1, U+FF72->U+30A3, U+FF73->U+30A5, U+FF74->U+30A7, U+FF75->U+30A9, U+FF76->U+30AB, U+FF77->U+30AD, U+FF78->U+30AF, U+FF79->U+30B1, U+FF7A->U+30B3, U+FF7B->U+30B5, U+FF7C->U+30B7, U+FF7D->U+30B9, U+FF7E->U+30BB, U+FF7F->U+30BD, U+FF80->U+30BF, U+FF81->U+30C1, U+FF82->U+30C3, U+FF83->U+30C6, U+FF84->U+30C8, U+FF85->U+30CA, U+FF86->U+30CB, U+FF87->U+30CC, U+FF88->U+30CD, U+FF89->U+30CE, U+FF8A->U+30CF, U+FF8B->U+30D2, U+FF8C->U+30D5, U+FF8D->U+30D8, U+FF8E->U+30DB, U+FF8F->U+30DE, U+FF90->U+30DF, U+FF91->U+30E0, U+FF92->U+30E1, U+FF93->U+30E2, U+FF94->U+30E3, U+FF95->U+30E5, U+FF96->U+30E7, U+FF97->U+30E9, U+FF98->U+30EA, U+FF99->U+30EB, U+FF9A->U+30EC, U+FF9B->U+30ED, U+FF9C->U+30EF, U+FF9D->U+30F3

Lastly, contrary to my previous comment (which is now amended). This file does not map equivalent hiragana to katakana (yet). I will get around to this … eventually.

From what I can tell, this website is the only source for a working Sphinx Japanese character table. If anyone actually reads this and makes a map for katakana->hiragana and hiragana->katakana, please send it through. Remember to check it thoroughly!

Lastly if you’re using this, please drop me a comment :D

One Response to “Japanese Sphinx Explained”

  1. your silence is deafening

    admin

Leave a Reply