<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Japanese Sphinx Explained</title>
	<atom:link href="http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/feed/" rel="self" type="application/rss+xml" />
	<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/</link>
	<description>What&#039;s better than toast? Crunchytoast!</description>
	<lastBuildDate>Wed, 25 Jan 2012 02:55:23 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Gitorious入れたメモ &#171; blog.udzura.jp</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-12914</link>
		<dc:creator>Gitorious入れたメモ &#171; blog.udzura.jp</dc:creator>
		<pubDate>Tue, 17 Jan 2012 09:42:21 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-12914</guid>
		<description>[...] http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/ [...]</description>
		<content:encoded><![CDATA[<p>[...] <a href="http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/" rel="nofollow">http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/</a> [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: alex</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-7478</link>
		<dc:creator>alex</dc:creator>
		<pubDate>Mon, 10 Oct 2011 16:16:59 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-7478</guid>
		<description>Hello, and thanks for taking the time to document this. I don&#039;t understand a word of Japanese, but we have Japanese entries in our database and I need to search them with Sphinx.

I copied your character table as well as the settings but cannot get any matches whatsoever from the Japanese entries. The Latin (English, Spanish) entries are fine and return matches, even with accents in Spanish, but the Japanese entries return nothing.

I also tried the configuration in the Sphinx wiki with ngram settings and still nothing. Any help you could give me would be greatly appreciated. 

As an example, our database does find the following using a LIKE search:

...WHERE title LIKE &#039;%レプリカントワークス%&#039;;

But Sphinx will not, even from the CLI:

../bin/search -i bookIndex レプリカントワークス

That is a title copied from our database.

Thanks in advance.</description>
		<content:encoded><![CDATA[<p>Hello, and thanks for taking the time to document this. I don&#8217;t understand a word of Japanese, but we have Japanese entries in our database and I need to search them with Sphinx.</p>
<p>I copied your character table as well as the settings but cannot get any matches whatsoever from the Japanese entries. The Latin (English, Spanish) entries are fine and return matches, even with accents in Spanish, but the Japanese entries return nothing.</p>
<p>I also tried the configuration in the Sphinx wiki with ngram settings and still nothing. Any help you could give me would be greatly appreciated.</p>
<p>As an example, our database does find the following using a LIKE search:</p>
<p>&#8230;WHERE title LIKE &#8216;%レプリカントワークス%&#8217;;</p>
<p>But Sphinx will not, even from the CLI:</p>
<p>../bin/search -i bookIndex レプリカントワークス</p>
<p>That is a title copied from our database.</p>
<p>Thanks in advance.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: admin</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-28</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Mon, 31 May 2010 12:02:05 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-28</guid>
		<description>Do your index files have any size? Basically make sure you have data in your index to begin with. You can use indextool which comes with Sphinx to dump information from your index files.

http://sphinxsearch.com/docs/current.html#ref-indextool

Also, are you accessing the daemon through a library like Thinking Sphinx?
I use Thinking Sphinx in RoR, but I remember using the command line directly
once to debug things.

The CLI query tool included is called &#039;search&#039; - funnily enough. See if you can return
anything from your indexes at the command line. Try both standard ASCII and CJK characters too.

http://sphinxsearch.com/docs/current.html#ref-search

Alternatively, send me some debug output and I&#039;ll see what I can do.</description>
		<content:encoded><![CDATA[<p>Do your index files have any size? Basically make sure you have data in your index to begin with. You can use indextool which comes with Sphinx to dump information from your index files.</p>
<p><a href="http://sphinxsearch.com/docs/current.html#ref-indextool" rel="nofollow">http://sphinxsearch.com/docs/current.html#ref-indextool</a></p>
<p>Also, are you accessing the daemon through a library like Thinking Sphinx?<br />
I use Thinking Sphinx in RoR, but I remember using the command line directly<br />
once to debug things.</p>
<p>The CLI query tool included is called &#8216;search&#8217; &#8211; funnily enough. See if you can return<br />
anything from your indexes at the command line. Try both standard ASCII and CJK characters too.</p>
<p><a href="http://sphinxsearch.com/docs/current.html#ref-search" rel="nofollow">http://sphinxsearch.com/docs/current.html#ref-search</a></p>
<p>Alternatively, send me some debug output and I&#8217;ll see what I can do.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: MattM</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-27</link>
		<dc:creator>MattM</dc:creator>
		<pubDate>Fri, 28 May 2010 17:13:35 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-27</guid>
		<description>If I try your configuration and it doesn&#039;t return any results for queries, do you have any suggestions on how to debug what Sphinx is doing?</description>
		<content:encoded><![CDATA[<p>If I try your configuration and it doesn&#8217;t return any results for queries, do you have any suggestions on how to debug what Sphinx is doing?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sat</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-26</link>
		<dc:creator>Sat</dc:creator>
		<pubDate>Mon, 03 May 2010 11:53:56 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-26</guid>
		<description>Hi, thanks for the great post!

I have a question regarding sphinx index configuration (it&#039;s actually nothing to do with Japanese).

My index file is so big (more than 4 GB) that searchd throws a warning and won&#039;t start. I looked for a solution and found out that I can split one big index into multiple indexes, but I don&#039;t know how to do that...

If you happen to know how to write a conf file that builds multiple indexes from the same source, could you show me an example?

Thanks very much!</description>
		<content:encoded><![CDATA[<p>Hi, thanks for the great post!</p>
<p>I have a question regarding sphinx index configuration (it&#8217;s actually nothing to do with Japanese).</p>
<p>My index file is so big (more than 4 GB) that searchd throws a warning and won&#8217;t start. I looked for a solution and found out that I can split one big index into multiple indexes, but I don&#8217;t know how to do that&#8230;</p>
<p>If you happen to know how to write a conf file that builds multiple indexes from the same source, could you show me an example?</p>
<p>Thanks very much!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: abcdef123</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-25</link>
		<dc:creator>abcdef123</dc:creator>
		<pubDate>Fri, 26 Feb 2010 16:43:58 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-25</guid>
		<description>Hi, thanks again. From my understanding an infix search would require the search string (if it was JP characters) to be surrounded by asterisks for this method to work correctly?

I guess it depends on the format you want your users to use when searching: asterisks around JP searches (infix method), or quotes around multiple characters that they want searched as a single word (ngram approach).

Thanks again for all your help.</description>
		<content:encoded><![CDATA[<p>Hi, thanks again. From my understanding an infix search would require the search string (if it was JP characters) to be surrounded by asterisks for this method to work correctly?</p>
<p>I guess it depends on the format you want your users to use when searching: asterisks around JP searches (infix method), or quotes around multiple characters that they want searched as a single word (ngram approach).</p>
<p>Thanks again for all your help.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: admin</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-24</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Fri, 26 Feb 2010 15:04:36 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-24</guid>
		<description>@abcdef123: Ngram characters tend to behave strangely! I prefer to use infix instead. Allow me to explain.

The Ngram feature treats each character as a whitespace-bounded word. But since there are no morphological analyzers that talk to Sphinx natively, you end up with lots of single characters that do not really correspond to any sensible string of Japanese vocabulary.

The equivalent is searching for &quot;dog&quot; but Sphinx can only find &quot;d&quot; or &quot;o&quot; or &quot;g&quot;. This may have changed since early 2009, but I haven&#039;t retested the Ngram feature.

On the other hand, the *infix* searching (basically a substring search) works very fast and matches pretty much anything as a string of characters.

Please let me know if you have a different experience? B)</description>
		<content:encoded><![CDATA[<p>@abcdef123: Ngram characters tend to behave strangely! I prefer to use infix instead. Allow me to explain.</p>
<p>The Ngram feature treats each character as a whitespace-bounded word. But since there are no morphological analyzers that talk to Sphinx natively, you end up with lots of single characters that do not really correspond to any sensible string of Japanese vocabulary.</p>
<p>The equivalent is searching for &#8220;dog&#8221; but Sphinx can only find &#8220;d&#8221; or &#8220;o&#8221; or &#8220;g&#8221;. This may have changed since early 2009, but I haven&#8217;t retested the Ngram feature.</p>
<p>On the other hand, the *infix* searching (basically a substring search) works very fast and matches pretty much anything as a string of characters.</p>
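<p>A toy Python sketch of the difference, not Sphinx code: the function names are hypothetical, and it assumes a "match all keywords" query in which every character indexed as its own word may match anywhere in the document, in any order.</p>
<pre>
```python
def unigram_tokens(text):
    """Return one token per non-space character, roughly how 1-gram indexing splits CJK text."""
    return [ch for ch in text if not ch.isspace()]


def unigram_match(query, document):
    """Toy 'match all' search: every query character must occur somewhere in the document."""
    indexed = set(unigram_tokens(document))
    return all(ch in indexed for ch in unigram_tokens(query))


def infix_match(query, document):
    """Toy infix search: the query must occur as one contiguous substring."""
    return query in document


title = "レプリカントワークス"
print(unigram_match("クワ", title))   # both characters occur in the title, just not adjacent
print(infix_match("クワ", title))     # the title never contains this exact substring
print(infix_match("カント", title))   # a real substring of the title
```
</pre>
<p>Under these assumptions the per-character model accepts the scrambled query while the substring model rejects it, which is the kind of loose matching described above.</p>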
<p>Please let me know if you have a different experience? B)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: abcdef123</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-23</link>
		<dc:creator>abcdef123</dc:creator>
		<pubDate>Fri, 26 Feb 2010 14:56:24 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-23</guid>
		<description>Hi, and thank you for the in-depth description, it is very much appreciated. One more question though, if you don&#039;t mind:

What about the ngram characters? Are they not required for the JP charset? On http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables they advise using these for the CJK charsets.</description>
		<content:encoded><![CDATA[<p>Hi, and thank you for the in-depth description, it is very much appreciated. One more question though, if you don&#8217;t mind:</p>
<p>What about the ngram characters? Are they not required for the JP charset? On <a href="http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables" rel="nofollow">http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables</a> they advise using these for the CJK charsets.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: admin</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-22</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Fri, 26 Feb 2010 14:38:38 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-22</guid>
		<description>@abcdef123: The Sphinx wiki does have the entire Unicode mapping for CJK characters. This means it will map and index a very large range of Chinese, Korean and Japanese characters.

See here: http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables#cjk_ngram_characters

The differences are as follows:

1. Half-width &amp; full-width kana characters are mapped as the same character, reducing confusion when users search using full-width text against a document with half-width text. For the uninitiated, half-width katakana characters are an essential part of the Japanese banking system, and other legacy systems, but read exactly the same way as their full-width cousins. Read this for a more colorful explanation http://www.ops.ietf.org/lists/idn/idn.2001/msg02452.html.

2. Voiced (e.g. ka &gt; ga, ha &gt; ba) and semi-voiced (e.g. ha &gt; pa) kana are treated the same as their unvoiced counterparts. This is important because people often treat these differently in speech, and therefore tend to get them wrong when searching. I believe it&#039;s better to match more widely, and then let them choose. Admittedly this could be omitted, and you can expand the search keywords programmatically before sending them to Sphinx. The Unicode range is clearly marked for your/my convenience further down the track! :D

3. This character map only includes the J from CJK. So it should index source documents faster, but of course will exclude any Korean or Chinese characters. It also excludes the most rare Japanese characters because these tend to be represented as kana these days (as a part of the general dumbing down of the Japanese writing system ... thank God!)

This map was what &quot;I&quot; really needed for a project and at the time I could only find a documented map for Chinese text. This map will of course match all ASCII characters, which are an essential part of writing Japanese anyway (believe it or not!) There is more information at www.unicode.org if you are interested in digging deeper.

Thanks also for your words of encouragement. :D

DISCLAIMER: Apologies for any inaccuracies in this comment, I typed this off the top of my head.</description>
		<content:encoded><![CDATA[<p>@abcdef123: The Sphinx wiki does have the entire Unicode mapping for CJK characters. This means it will map and index a very large range of Chinese, Korean and Japanese characters.</p>
<p>See here: <a href="http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables#cjk_ngram_characters" rel="nofollow">http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables#cjk_ngram_characters</a></p>
<p>The differences are as follows:</p>
<p>1. Half-width &amp; full-width kana characters are mapped as the same character, reducing confusion when users search using full-width text against a document with half-width text. For the uninitiated, half-width katakana characters are an essential part of the Japanese banking system, and other legacy systems, but read exactly the same way as their full-width cousins. Read this for a more colorful explanation <a href="http://www.ops.ietf.org/lists/idn/idn.2001/msg02452.html" rel="nofollow">http://www.ops.ietf.org/lists/idn/idn.2001/msg02452.html</a>.</p>
<p>2. Voiced (e.g. ka &gt; ga, ha &gt; ba) and semi-voiced (e.g. ha &gt; pa) kana are treated the same as their unvoiced counterparts. This is important because people often treat these differently in speech, and therefore tend to get them wrong when searching. I believe it&#8217;s better to match more widely, and then let them choose. Admittedly this could be omitted, and you can expand the search keywords programmatically before sending them to Sphinx. The Unicode range is clearly marked for your/my convenience further down the track! <img src='http://crunchytoast.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
<p>3. This character map only includes the J from CJK. So it should index source documents faster, but of course will exclude any Korean or Chinese characters. It also excludes the most rare Japanese characters because these tend to be represented as kana these days (as a part of the general dumbing down of the Japanese writing system &#8230; thank God!)</p>
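<p>The folding in points 1 and 2 can also be approximated in application code. The Python sketch below is a minimal illustration that uses Unicode normalization rather than the character map itself; the name fold_kana is hypothetical. It assumes NFKC widening is acceptable for your data (it also rewrites full-width ASCII and other compatibility characters), and that the same folding is applied to both the indexed text and the query.</p>
<pre>
```python
import unicodedata

# Combining dakuten (U+3099) and handakuten (U+309A), exposed by NFD decomposition.
VOICING_MARKS = {"\u3099", "\u309a"}


def fold_kana(text):
    """Fold half-width kana to full-width and drop voicing marks (ガ, パ, バ become カ, ハ, ハ)."""
    # NFKC maps half-width katakana and their half-width voicing marks to full-width forms.
    widened = unicodedata.normalize("NFKC", text)
    # NFD splits a voiced kana into its base kana plus a combining voicing mark.
    decomposed = unicodedata.normalize("NFD", widened)
    stripped = "".join(ch for ch in decomposed if ch not in VOICING_MARKS)
    # Recompose everything else (accented Latin letters, for example) that NFD split apart.
    return unicodedata.normalize("NFC", stripped)


print(fold_kana("ﾚﾌﾟﾘｶﾝﾄ") == fold_kana("レプリカント"))
print(fold_kana("パ") == fold_kana("バ") == fold_kana("ハ"))
```
</pre>
<p>Both comparisons print True: the half-width and full-width spellings fold to the same string, as do the voiced, semi-voiced and plain kana.</p>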
<p>This map was what &#8220;I&#8221; really needed for a project and at the time I could only find a documented map for Chinese text. This map will of course match all ASCII characters, which are an essential part of writing Japanese anyway (believe it or not!) There is more information at <a href="http://www.unicode.org" rel="nofollow">http://www.unicode.org</a> if you are interested in digging deeper.</p>
<p>Thanks also for your words of encouragement. <img src='http://crunchytoast.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
<p>DISCLAIMER: Apologies for any inaccuracies in this comment, I typed this off the top of my head.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: abcdef123</title>
		<link>http://crunchytoast.com/2009/05/01/japanese-sphinx-explained/#comment-21</link>
		<dc:creator>abcdef123</dc:creator>
		<pubDate>Fri, 26 Feb 2010 13:54:14 +0000</pubDate>
		<guid isPermaLink="false">http://crunchytoast.com/?p=157#comment-21</guid>
		<description>Oh, and I do appreciate the work you&#039;ve put into this. I think it is exactly what I was looking for, but I am still curious about the differences.</description>
		<content:encoded><![CDATA[<p>Oh, and I do appreciate the work you&#8217;ve put into this. I think it is exactly what I was looking for, but I am still curious about the differences.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

