<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Japanese Sphinx Explained</title>
	<atom:link href="http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/feed/" rel="self" type="application/rss+xml" />
	<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/</link>
	<description>What's better than toast? Crunchytoast!</description>
	<pubDate>Wed, 10 Mar 2010 04:41:59 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6</generator>
		<item>
		<title>By: abcdef123</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-1612</link>
		<dc:creator>abcdef123</dc:creator>
		<pubDate>Fri, 26 Feb 2010 16:43:58 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-1612</guid>
		<description>Hi, thanks again. From my understanding an infix search would require the search string (if it was JP characters) to be surrounded by asterisks for this method to work correctly?

I guess it depends on the format you want your users to use when searching: asterisks for JP searches (infix method), or quotes around multiple characters that they want to be searched as a single word (ngram approach).

Thanks again for all your help.</description>
		<content:encoded><![CDATA[<p>Hi, thanks again. From my understanding an infix search would require the search string (if it was JP characters) to be surrounded by asterisks for this method to work correctly?</p>
<p>I guess it depends on the format you want your users to use when searching: asterisks for JP searches (infix method), or quotes around multiple characters that they want to be searched as a single word (ngram approach).</p>
<p>Thanks again for all your help.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: admin</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-1611</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Fri, 26 Feb 2010 15:04:36 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-1611</guid>
		<description>@abcdef123: Ngram characters tend to behave strangely! I prefer to use infix instead. Allow me to explain.

The Ngram feature treats each character as a whitespace-bounded word. But since there are no morphological analyzers that talk to Sphinx natively, you end up getting lots of characters that do not really correspond to any sensible string of Japanese vocabulary.

The equivalent is searching for "dog" but Sphinx can only find "d" or "o" or "g". This may have changed since early 2009, but I haven't retested the Ngram feature.

On the other hand, the *infix* searching (basically a substring search) works very fast and matches pretty much anything as a string of characters. 

Please let me know if you have a different experience? B)</description>
		<content:encoded><![CDATA[<p>@abcdef123: Ngram characters tend to behave strangely! I prefer to use infix instead. Allow me to explain.</p>
<p>The Ngram feature treats each character as a whitespace-bounded word. But since there are no morphological analyzers that talk to Sphinx natively, you end up getting lots of characters that do not really correspond to any sensible string of Japanese vocabulary.</p>
<p>The equivalent is searching for &#8220;dog&#8221; but Sphinx can only find &#8220;d&#8221; or &#8220;o&#8221; or &#8220;g&#8221;. This may have changed since early 2009, but I haven&#8217;t retested the Ngram feature.</p>
<p>On the other hand, the *infix* searching (basically a substring search) works very fast and matches pretty much anything as a string of characters. </p>
<p>Please let me know if you have a different experience? B)</p>
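<p>To make the difference concrete, here is a toy sketch in plain Python. It is not Sphinx's tokenizer or matcher, and the function names are made up for illustration. It assumes "ngram" means one token per character, and it compares matching those tokens in any order, matching them as a quoted phrase, and a plain substring (infix) match.</p>

```python
# Toy model only: plain Python, not Sphinx's actual tokenizer or matcher.
# All function names here are hypothetical.

def unigram_tokens(text: str) -> list[str]:
    """One token per non-space character (a 1-gram split)."""
    return [ch for ch in text if not ch.isspace()]

def matches_all_tokens(query: str, document: str) -> bool:
    """True if every query character occurs somewhere in the document, in any order."""
    doc_tokens = set(unigram_tokens(document))
    return all(tok in doc_tokens for tok in unigram_tokens(query))

def matches_phrase(query: str, document: str) -> bool:
    """True if the query tokens occur consecutively, like a quoted phrase."""
    q = unigram_tokens(query)
    d = unigram_tokens(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

def matches_infix(query: str, document: str) -> bool:
    """True if the query is a contiguous substring of the document."""
    return query in document

doc = "good day"
print(matches_all_tokens("dog", doc))  # the letters d, o, g all occur in the document
print(matches_phrase("dog", doc))      # but never consecutively in that order
print(matches_infix("dog", doc))       # and "dog" is not a substring
```

<p>In this toy model the unordered per-character match is the one that produces the stray hits; the phrase and substring matches both reject them.</p>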
]]></content:encoded>
	</item>
	<item>
		<title>By: abcdef123</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-1610</link>
		<dc:creator>abcdef123</dc:creator>
		<pubDate>Fri, 26 Feb 2010 14:56:24 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-1610</guid>
		<description>Hi, and thank you for the in-depth description, it is very much appreciated. One more question though, if you don't mind:

What about the ngram characters? Are they not required for the JP charset? On http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables they advise the use of these for the CJK charsets.</description>
		<content:encoded><![CDATA[<p>Hi, and thank you for the in-depth description, it is very much appreciated. One more question though, if you don&#8217;t mind:</p>
<p>What about the ngram characters? Are they not required for the JP charset? On <a href="http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables" rel="nofollow">http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables</a> they advise the use of these for the CJK charsets.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: admin</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-1609</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Fri, 26 Feb 2010 14:38:38 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-1609</guid>
		<description>@abcdef123: The Sphinx wiki does have the entire Unicode mapping for CJK characters. This means it will map and index a very large range of Chinese, Korean and Japanese characters.

See here: http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables#cjk_ngram_characters

The differences are as follows:

1. Half-width &#38; full-width kana characters are mapped as the same character, reducing confusion when users search using full-width text against a document with half-width text. For the uninitiated, half-width katakana characters are an essential part of the Japanese banking system, and other legacy systems, but read exactly the same way as their full-width cousins. Read this for a more colorful explanation http://www.ops.ietf.org/lists/idn/idn.2001/msg02452.html.

2. Voiced (e.g. ka &#62; ga, ha &#62; ba) and semi-voiced (e.g. ha &#62; pa) kana are treated the same as their unvoiced counterparts. This is important because people often treat these differently in speech, and therefore tend to get them wrong when searching. I believe it's better to match more widely, and then let them choose. Admittedly this could be omitted, and you could instead expand the search keywords programmatically before sending them to Sphinx. The UTF range is clearly marked for your/my convenience further down the track! :D

3. This character map only includes the J from CJK. So it should index source documents faster, but of course will exclude any Korean or Chinese characters. It also excludes the most rare Japanese characters because these tend to be represented as kana these days (as a part of the general dumbing down of the Japanese writing system ... thank God!) 

This map was what "I" really needed for a project and at the time I could only find a documented map for Chinese text. This map will of course match all ASCII characters, which are an essential part of writing Japanese anyway (believe it or not!) There is more information at www.unicode.org if you are interested in digging deeper.

Thanks also for your words of encouragement. :D

DISCLAIMER: Apologies for any inaccuracies in this comment, I typed this off the top of my head.</description>
		<content:encoded><![CDATA[<p>@abcdef123: The Sphinx wiki does have the entire Unicode mapping for CJK characters. This means it will map and index a very large range of Chinese, Korean and Japanese characters.</p>
<p>See here: <a href="http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables#cjk_ngram_characters" rel="nofollow">http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables#cjk_ngram_characters</a></p>
<p>The differences are as follows:</p>
<p>1. Half-width &amp; full-width kana characters are mapped as the same character, reducing confusion when users search using full-width text against a document with half-width text. For the uninitiated, half-width katakana characters are an essential part of the Japanese banking system, and other legacy systems, but read exactly the same way as their full-width cousins. Read this for a more colorful explanation <a href="http://www.ops.ietf.org/lists/idn/idn.2001/msg02452.html" rel="nofollow">http://www.ops.ietf.org/lists/idn/idn.2001/msg02452.html</a>.</p>
<p>2. Voiced (e.g. ka &gt; ga, ha &gt; ba) and semi-voiced (e.g. ha &gt; pa) kana are treated the same as their unvoiced counterparts. This is important because people often treat these differently in speech, and therefore tend to get them wrong when searching. I believe it&#8217;s better to match more widely, and then let them choose. Admittedly this could be omitted, and you could instead expand the search keywords programmatically before sending them to Sphinx. The UTF range is clearly marked for your/my convenience further down the track! <img src='http://crunchytoast.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
<p>3. This character map only includes the J from CJK. So it should index source documents faster, but of course will exclude any Korean or Chinese characters. It also excludes the most rare Japanese characters because these tend to be represented as kana these days (as a part of the general dumbing down of the Japanese writing system &#8230; thank God!) </p>
<p>This map was what &#8220;I&#8221; really needed for a project and at the time I could only find a documented map for Chinese text. This map will of course match all ASCII characters, which are an essential part of writing Japanese anyway (believe it or not!) There is more information at <a href="http://www.unicode.org" rel="nofollow">http://www.unicode.org</a> if you are interested in digging deeper.</p>
<p>Thanks also for your words of encouragement. <img src='http://crunchytoast.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
<p>DISCLAIMER: Apologies for any inaccuracies in this comment, I typed this off the top of my head.</p>
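<p>For anyone who would rather do the folding from points 1 and 2 in application code before the query reaches Sphinx, here is a minimal sketch using Python's standard unicodedata module. It is an assumption about how one might approximate the behaviour, not the charset map from the post, and fold_kana is a made-up name. Note that NFKC also rewrites other compatibility characters (full-width ASCII, for example), which may or may not be what you want.</p>

```python
import unicodedata

# Combining voiced (dakuten, U+3099) and semi-voiced (handakuten, U+309A) marks.
_SOUND_MARKS = {"\u3099", "\u309a"}

def fold_kana(text: str) -> str:
    """Hypothetical helper: widen half-width kana, then drop voicing marks."""
    # NFKC maps half-width katakana to their full-width forms.
    widened = unicodedata.normalize("NFKC", text)
    # NFD splits a voiced kana into its base kana plus a combining mark.
    decomposed = unicodedata.normalize("NFD", widened)
    stripped = "".join(ch for ch in decomposed if ch not in _SOUND_MARKS)
    # Recompose anything else that NFD split apart (accented Latin, for example).
    return unicodedata.normalize("NFC", stripped)

# Half-width "ka" followed by the half-width voicing mark.
print(fold_kana("\uff76\uff9e"))
```

<p>The same function would have to be applied to the indexed text as well as the query, otherwise the folded query would no longer match the unfolded documents.</p>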
]]></content:encoded>
	</item>
	<item>
		<title>By: abcdef123</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-1608</link>
		<dc:creator>abcdef123</dc:creator>
		<pubDate>Fri, 26 Feb 2010 13:54:14 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-1608</guid>
		<description>Oh and I do appreciate the work you've put into this, I think it is exactly what I was looking for but I am still curious about the differences.</description>
		<content:encoded><![CDATA[<p>Oh and I do appreciate the work you&#8217;ve put into this, I think it is exactly what I was looking for but I am still curious about the differences.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: abcdef123</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-1607</link>
		<dc:creator>abcdef123</dc:creator>
		<pubDate>Fri, 26 Feb 2010 13:52:20 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-1607</guid>
		<description>I don't have much background in any of this but could you explain to me how this is different from the CJK charsets given on the sphinx wiki?</description>
		<content:encoded><![CDATA[<p>I don&#8217;t have much background in any of this but could you explain to me how this is different from the CJK charsets given on the sphinx wiki?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: admin</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-1539</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Thu, 28 Jan 2010 04:16:01 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-1539</guid>
		<description>your silence is deafening</description>
		<content:encoded><![CDATA[<p>your silence is deafening</p>
]]></content:encoded>
	</item>
</channel>
</rss>
