
Re: [sup-devel] Cannot query Japanese characters



Hi Horacio,

Thanks for all your help so far.

Reformatted excerpts from Horacio Sanson's message of 2011-05-04:
> After some hacking I got a Heliotrope server that works perfectly with
> Japanese text. All I did was follow your comments
> and applied the MeCab tokenizer to the message body and query strings
> before passing them to Whistlepig or more specifically
> to Heliotrope::Index.

Great!

> There is one problem I don't see how to handle... I do receive email
> in Japanese but also Chinese and Korean. I need a different
> tokenizer for each one and I have no idea how to handle this. Do email
> messages contain a language header that would allow me
> to identify the language and pass it to the corresponding tokenizer??

There's not a great way to do this in email. You can look at the
Content-Type header, which is sometimes present, and that will
sometimes give you a clue. But it's usually useless.
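
To make the Content-Type idea concrete, here is a minimal sketch that maps the header's charset parameter to a likely language. The method name and the charset table are assumptions made up for illustration, not anything in Heliotrope, and the table is far from complete. Unicode charsets such as UTF-8 are left out on purpose because they say nothing about the language, which is a big part of why this approach is usually useless.

```ruby
# Hypothetical mapping from legacy MIME charset names to a likely language.
# Unicode charsets (UTF-8, UTF-16) are deliberately absent: they carry no
# language information.
CHARSET_HINTS = {
  "iso-2022-jp" => :japanese,
  "shift_jis"   => :japanese,
  "euc-jp"      => :japanese,
  "euc-kr"      => :korean,
  "iso-2022-kr" => :korean,
  "gb2312"      => :chinese,
  "gbk"         => :chinese,
  "gb18030"     => :chinese,
  "big5"        => :chinese,
}.freeze

# Returns a language symbol, or nil when the header is missing, has no
# charset parameter, or names a charset that is not in the table.
def language_hint_from_content_type(content_type)
  return nil if content_type.nil?
  match = content_type.match(/charset\s*=\s*"?([^";\s]+)"?/i)
  return nil unless match
  CHARSET_HINTS[match[1].downcase]
end
```

A nil result just means "no clue", so the caller still needs a fallback that looks at the text itself.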

You can write some heuristics by hand, of course. Or you can try naive
Bayes, which performs pretty well on this type of task. It looks like
someone just started a Ruby project here: https://github.com/fela/rlid.
It seems to only have European languages so far, but you can probably
just dump in some CJK text and retrain.
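
As a rough illustration of the hand-written heuristic option (not the naive Bayes one), the sketch below counts characters by Unicode script. It assumes the text has already been decoded to valid UTF-8, and the method name is made up for this example. The rule is crude: kana suggests Japanese, Hangul suggests Korean, and Han characters with neither are guessed to be Chinese.

```ruby
# Guess :japanese, :korean, :chinese, or nil from the Unicode scripts
# present in the text. Assumes text is a valid UTF-8 string.
def guess_cjk_language(text)
  kana   = text.scan(/[\p{Hiragana}\p{Katakana}]/).size
  hangul = text.scan(/\p{Hangul}/).size
  han    = text.scan(/\p{Han}/).size

  # No CJK characters at all: make no guess.
  return nil if kana + hangul + han == 0
  # Kana is in practice specific to Japanese, Hangul to Korean.
  return :japanese if kana > 0 && kana >= hangul
  return :korean if hangul > 0
  # Han characters with no kana or Hangul: most likely Chinese, although a
  # short Japanese string written entirely in kanji would be misclassified.
  :chinese
end
```

This cannot tell Simplified from Traditional Chinese, and it will get short kanji-only subject lines wrong, so a trained classifier is probably still the better choice for mixed mail.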

As for your patches: I've applied a related patch to fix the encoding
issue with Query#parsed_query_s. Can you let me know if that works?

Rather than sticking mecab directly in heliotrope, I am going to make a
hook for users to plug in their own custom tokenization code like you're
doing.
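
For the sake of discussion, a hook like that could look roughly like the sketch below. This is purely hypothetical: the class and method names are invented here and are not Heliotrope's actual API. It also assumes the indexer splits on whitespace, so the hook returns the tokens joined by spaces, and that the same hook would be applied to both message bodies and query strings.

```ruby
# Hypothetical sketch of a pluggable tokenizer hook. Not Heliotrope's API.
class TokenizerHook
  # Default behaviour: split on whitespace.
  DEFAULT = lambda { |text| text.split }

  def initialize
    @tokenizer = DEFAULT
  end

  # Register any callable (or block) that takes a String and returns an
  # Array of token Strings.
  def register(callable = nil, &block)
    tokenizer = callable || block
    unless tokenizer.respond_to?(:call)
      raise ArgumentError, "tokenizer must respond to #call"
    end
    @tokenizer = tokenizer
  end

  # Run the registered tokenizer and join the tokens with single spaces,
  # so a whitespace-based indexer sees one token per word.
  def apply(text)
    @tokenizer.call(text).join(" ")
  end
end

hook = TokenizerHook.new
# Stand-in for a real tokenizer such as MeCab: one token per character.
hook.register { |text| text.chars }
segmented = hook.apply("日本語")
```

A user would register a block that calls MeCab (or a Chinese or Korean segmenter) in place of the per-character stand-in.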
-- 
William <wmorgan-sup@masanjin.net>
_______________________________________________
Sup-devel mailing list
Sup-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/sup-devel