Re: [sup-devel] Cannot query Japanese characters
Reformatted excerpts from Horacio Sanson's message of 2011-05-03:
> index = Index.new "index" => #<Whistlepig::Index:0x00000002093f60>
> entry1 = Entry.new => #<Whistlepig::Entry:0x0000000207d328>
> entry1.add_string "body", "研究会" => #<Whistlepig::Entry:0x0000000207d328>
> docid1 = index.add_entry entry1 => 1
> q1 = Query.new "body", "研究" => body:"研究"
> results1 = index.search q1 => []
The problem here is tokenization. Whistlepig provides only a very simple
tokenizer: it splits on whitespace [1]. So you have to space-separate your
tokens at both the indexing and querying stages, e.g.:
entry1.add_string "body", "研 究 会" => #<Whistlepig::Entry:0x90b873c>
docid1 = index.add_entry entry1 => 1
q1 = Query.new "body", "研 究" => AND body:"研" body:"究"
q1 = Query.new "body", "\"研 究\"" => PHRASE body:"研" body:"究"
results1 = index.search q1 => [1]
For Japanese, proper tokenization is tricky. You could simply space-separate
every character and deal with the spurious matches across word boundaries.
Or you could do it right by plugging in a proper tokenizer, e.g. something
like http://www.chasen.org/~taku/software/TinySegmenter/.
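The space-separate-every-character approach can be done with a small
pre-tokenizer run over both the indexed text and the query. A minimal sketch
(the method name and the choice of Unicode script properties are my
assumptions, not part of the Whistlepig API):

```ruby
# Pre-tokenizer sketch: put a space around every CJK character so that
# Whistlepig's whitespace-based tokenizer sees one token per character.
# The Unicode script properties covered here (Han, Hiragana, Katakana)
# are an assumption; extend as needed.
def separate_cjk(text)
  text.gsub(/[\p{Han}\p{Hiragana}\p{Katakana}]/) { |ch| " #{ch} " }
      .split.join(" ")
end

separate_cjk("研究会")      # => "研 究 会"
separate_cjk("Ruby研究会")  # => "Ruby 研 究 会"
```

Note that the same function must be applied to both the string passed to
add_string and the query string, or the tokens won't line up at search time.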
[1] It also strips any prefix or suffix characters that match [:punct:]. This
is all pretty ad-hoc and undocumented. Providing a simpler whitespace-only
tokenizer as an alternative is in the works.
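To make the footnote concrete, the behavior described is roughly equivalent
to the following pure-Ruby sketch (an approximation for illustration, not
Whistlepig's actual C tokenizer):

```ruby
# Rough approximation of the ad-hoc tokenization described above:
# split on whitespace, then strip leading and trailing [:punct:]
# characters from each token.
def rough_tokenize(text)
  text.split
      .map { |t| t.gsub(/\A[[:punct:]]+|[[:punct:]]+\z/, "") }
      .reject(&:empty?)
end

rough_tokenize('Hello, "world"!')  # => ["Hello", "world"]
```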
--
William <wmorgan-sup@masanjin.net>
_______________________________________________
Sup-devel mailing list
Sup-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/sup-devel