
Re: [sup-devel] Cannot query Japanese characters



Forgot to mention that you need the mecab Ruby gem. On Ubuntu 10.04 this
gem is part of the distribution and can be installed with the command:

sudo apt-get install libmecab-ruby1.8 libmecab-ruby1.9.1 mecab-ipadic-utf8
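
Once installed, getting space-separated tokens is a one-liner. A minimal
sketch, assuming the stock SWIG binding those packages ship (the require
name can vary with how the binding was built):

  require 'MeCab'

  # -Owakati asks MeCab for wakati-gaki: the input re-joined with
  # spaces at the word boundaries it finds
  tagger = MeCab::Tagger.new("-Owakati")
  tagger.parse("研究会")   # => "研究 会 \n" with the ipadic dictionary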

regards
Horacio

On Wed, May 4, 2011 at 10:42 AM, Horacio Sanson <hsanson@gmail.com> wrote:
> Chasen is the worst tokenizer; it is pretty old. The best one is MeCab,
> which is the fastest and comes from the same author as Chasen.
> You can see all the major Japanese tokenizers in action at this URL:
> http://nomadscafe.jp/test/keitaiso/index.cgi. Just put some
> text in the box and press the button.
>
> After some hacking I got a Heliotrope server that works perfectly with
> Japanese text. All I did was follow your comments and apply the MeCab
> tokenizer to the message body and query strings before passing them to
> Whistlepig, or more specifically to Heliotrope::Index.
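>
> The pre-tokenization boils down to something like this (a rough sketch,
> not the actual patch; segment is a hypothetical helper around the MeCab
> binding shown above):
>
>   def segment(text)
>     MeCab::Tagger.new("-Owakati").parse(text).strip
>   end
>
>   # run both sides through the same segmenter so indexed terms and
>   # query terms line up
>   entry.add_string "body", segment(message_body)
>   query = Whistlepig::Query.new "body", segment(query_string)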
>
> There is one problem I don't see how to handle: I receive email not only
> in Japanese but also in Chinese and Korean. I need a different tokenizer
> for each language and I have no idea how to choose between them. Do email
> messages contain a language header that would let me identify the
> language and pass the text to the corresponding tokenizer?
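>
> (RFC 3282 does define a Content-Language header, but in practice almost
> no mail sets it, so sniffing the script from the text itself may be the
> only workable option. A rough sketch of that idea, assuming Ruby 1.9
> and hypothetical per-language tokenizers; note it will misclassify
> Japanese written entirely in kanji:)
>
>   # Kana appears only in Japanese and hangul only in Korean, so test
>   # for those first; text containing nothing but CJK ideographs is
>   # assumed to be Chinese.
>   def tokenizer_for(text)
>     case text
>     when /[\u3040-\u30ff]/ then :japanese  # hiragana / katakana
>     when /[\uac00-\ud7af]/ then :korean    # hangul syllables
>     when /[\u4e00-\u9fff]/ then :chinese   # CJK ideographs only
>     else :default
>     end
>   end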
>
>
> regards,
> Horacio
>
> On Wed, May 4, 2011 at 7:26 AM, William Morgan <wmorgan-sup@masanjin.net> wrote:
>> Reformatted excerpts from Horacio Sanson's message of 2011-05-03:
>>> index = Index.new "index" => #<Whistlepig::Index:0x00000002093f60>
>>> entry1 = Entry.new => #<Whistlepig::Entry:0x0000000207d328>
>>> entry1.add_string "body", "研究会" => #<Whistlepig::Entry:0x0000000207d328>
>>> docid1 = index.add_entry entry1 => 1
>>> q1 = Query.new "body", "研究" => body:"研究"
>>> results1 = index.search q1 => []
>>
>> The problem here is tokenization. Whistlepig only provides a very simple
>> tokenizer: it looks for space-separated things [1]. So you have to
>> space-separate your tokens in both the indexing and querying stages, e.g.:
>>
>>  entry1.add_string "body", "研 究 会" => #<Whistlepig::Entry:0x90b873c>
>>  docid1 = index.add_entry entry1      => 1
>>  q1 = Query.new "body", "研 究"       => AND body:"研" body:"究"
>>  q1 = Query.new "body", "\"研 究\""   => PHRASE body:"研" body:"究"
>>  results1 = index.search q1           => [1]
>>
>> For Japanese, proper tokenization is tricky. You could simply space-separate
>> every character and deal with the spurious matches across word boundaries.
>> Or you could do it right by plugging in a proper tokenizer, e.g. something
>> like http://www.chasen.org/~taku/software/TinySegmenter/.
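>>
>> The per-character route is a one-liner in Ruby 1.9 (1.8 needs $KCODE
>> and a multibyte-aware split instead):
>>
>>  body.chars.to_a.join(" ")   # "研究会" => "研 究 会"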
>>
>> [1] It also strips any prefix or suffix characters that match [:punct:]. This
>> is all pretty ad hoc and undocumented. Providing a simpler whitespace-only
>> tokenizer as an alternative is in the works.
>> --
>> William <wmorgan-sup@masanjin.net>
>
_______________________________________________
Sup-devel mailing list
Sup-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/sup-devel