
Re: [sup-devel] Cannot query Japanese characters



Chasen is the worst of the tokenizers, and it is pretty old. The best
one is MeCab, which is the fastest and comes from the same author as
Chasen. You can see all the major Japanese tokenizers in action at
this URL: http://nomadscafe.jp/test/keitaiso/index.cgi. Just put some
text in the box and press the button.

After some hacking I got a Heliotrope server that works perfectly with
Japanese text. All I did was follow your comments and apply the MeCab
tokenizer to the message body and query strings before passing them to
Whistlepig, or more specifically to Heliotrope::Index.
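
The whole trick is MeCab's "-Owakati" output format, which re-emits
the input as space-separated tokens, which is exactly what the
Whistlepig tokenizer expects. A minimal sketch (this assumes the
mecab-ruby binding; the force_encoding call mirrors the patches below,
since the binding seems to return binary-tagged strings on Ruby 1.9):

  require "MeCab"

  # "-Owakati" makes MeCab emit the input as space-separated tokens
  # instead of its usual one-token-per-line analysis.
  wakati = MeCab::Tagger.new("-Owakati")
  tokens = wakati.parse("研究会")          # e.g. "研究 会 \n"
  tokens = tokens.force_encoding("UTF-8")  # retag the raw bytes as UTF-8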

There is one problem I don't see how to handle... I receive email in
Japanese, but also in Chinese and Korean. I need a different tokenizer
for each one and I have no idea how to handle this. Do email messages
contain a language header that would allow me to identify the language
and pass the text to the corresponding tokenizer?
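
As far as I can tell there is an optional Content-Language header (RFC
3282), but it seems to be rarely set in practice, so the best fallback
I can think of is guessing the script from the characters themselves.
A rough sketch of that idea (my own heuristic, not from any library:
kana can only be Japanese, hangul can only be Korean, and han
characters alone most likely mean Chinese):

  # Very rough CJK script guesser using Unicode script properties.
  def guess_cjk_language text
    case text
    when /[\p{Hiragana}\p{Katakana}]/ then :japanese
    when /\p{Hangul}/                 then :korean
    when /\p{Han}/                    then :chinese
    else :unknown
    end
  end

This obviously fails on mixed-script messages, so a proper language
detection library would be better if one exists for Ruby.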


regards,
Horacio

On Wed, May 4, 2011 at 7:26 AM, William Morgan <wmorgan-sup@masanjin.net> wrote:
> Reformatted excerpts from Horacio Sanson's message of 2011-05-03:
>> index = Index.new "index" => #<Whistlepig::Index:0x00000002093f60>
>> entry1 = Entry.new => #<Whistlepig::Entry:0x0000000207d328>
>> entry1.add_string "body", "研究会" => #<Whistlepig::Entry:0x0000000207d328>
>> docid1 = index.add_entry entry1 => 1
>> q1 = Query.new "body", "研究" => body:"研究"
>> results1 = index.search q1 => []
>
> The problem here is tokenization. Whistlepig only provides a very simple
> tokenizer, namely, it looks for space-separated things [1]. So you have to
> space-separate your tokens in both the indexing and querying stages, e.g.:
>
>  entry1.add_string "body", "研 究 会" => #<Whistlepig::Entry:0x90b873c>
>  docid1 = index.add_entry entry1      => 1
>  q1 = Query.new "body", "研 究"       => AND body:"研" body:"究"
>  q1 = Query.new "body", "\"研 究\""   => PHRASE body:"研" body:"究"
>  results1 = index.search q1           => [1]
>
> For Japanese, proper tokenization is tricky. You could simply space-separate
> every character and deal with the spurious matches across word boundaries.
> Or you could do it right by plugging in a proper tokenizer, e.g. something
> like http://www.chasen.org/~taku/software/TinySegmenter/.
>
> [1] It also strips any prefix or suffix characters that match [:punct:]. This
> is all pretty ad hoc and undocumented. Providing a simpler whitespace-only
> tokenizer as an alternative is in the works.
> --
> William <wmorgan-sup@masanjin.net>
>
From f484b09518db47a06690e09a710cf6e866c5561b Mon Sep 17 00:00:00 2001
From: Horacio Sanson <hsanson@gmail.com>
Date: Wed, 4 May 2011 10:31:12 +0900
Subject: [PATCH 1/2] Fix crash for non-ASCII chars

---
 bin/heliotrope-server |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/bin/heliotrope-server b/bin/heliotrope-server
index 4793ac2..ed9c3be 100644
--- a/bin/heliotrope-server
+++ b/bin/heliotrope-server
@@ -151,7 +151,7 @@ class HeliotropeServer < Sinatra::Base
       nav += "</div>"
 
       header("Search: #{query.original_query_s}", query.original_query_s) +
-        "<div>Parsed query: #{escape_html query.parsed_query_s}</div>" +
+        "<div>Parsed query: #{escape_html query.parsed_query_s.force_encoding('UTF-8')}</div>" +
         "<div>Search took #{sprintf '%.2f', info[:elapsed]}s and #{info[:continued] ? 'was' : 'was NOT'} continued</div>" +
         "#{nav}<table>" +
         results.map { |r| threadinfo_to_html r }.join +
-- 
1.7.4.1
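
For reference, the crash this first patch fixes is presumably the
usual Ruby 1.9 encoding mismatch: the parsed query string comes back
tagged as binary, and concatenating it into the UTF-8 HTML raises
Encoding::CompatibilityError. A tiny reproduction (my own
illustration, not Heliotrope code):

  s = "研究".force_encoding("ASCII-8BIT")  # UTF-8 bytes tagged as binary
  begin
    "<div>" + s                      # raises Encoding::CompatibilityError
  rescue Encoding::CompatibilityError => e
    puts e.message
  end
  "<div>" + s.force_encoding("UTF-8")  # relabeling the bytes fixes it

force_encoding only relabels the same bytes, which is safe here
because the query really is UTF-8.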

From 6595af0b55d52d1f68562fbdd0f1b23dfee34039 Mon Sep 17 00:00:00 2001
From: Horacio Sanson <hsanson@gmail.com>
Date: Wed, 4 May 2011 10:34:48 +0900
Subject: [PATCH 2/2] Add MeCab japanese text analyzer.

Japanese text has no whitespace separation, which causes the
Whistlepig tokenizer to fail. This patch processes the email's
indexable text and search queries with MeCab before passing them
to Whistlepig.
---
 bin/heliotrope-server     |    3 ++-
 lib/heliotrope/message.rb |    5 +++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/bin/heliotrope-server b/bin/heliotrope-server
index ed9c3be..f3bd5d4 100644
--- a/bin/heliotrope-server
+++ b/bin/heliotrope-server
@@ -67,6 +67,7 @@ class HeliotropeServer < Sinatra::Base
     end.to_json
   end
 
+  require "MeCab"
   def get_query_from_params
     ## work around a rack (?) bug where quotes are omitted in queries like "hello bob"
     query = if env["rack.request.query_string"] =~ /\bq=(.+?)(&|$)/
@@ -76,7 +77,7 @@ class HeliotropeServer < Sinatra::Base
     end
 
     raise RequestError, "need a query" unless query
-    query
+    MeCab::Tagger.new("-Owakati").parse(query).force_encoding("UTF-8")
   end
 
   def get_search_results
diff --git a/lib/heliotrope/message.rb b/lib/heliotrope/message.rb
index b48329b..e61d8bd 100644
--- a/lib/heliotrope/message.rb
+++ b/lib/heliotrope/message.rb
@@ -76,6 +76,7 @@ class Message
   def indirect_recipients; cc + bcc end
   def recipients; direct_recipients + indirect_recipients end
 
+  require "MeCab"
   def indexable_text
     @indexable_text ||= begin
       v = ([from.indexable_text] +
@@ -90,8 +91,8 @@ class Message
         end
       ).flatten.compact.join(" ")
 
-      v.gsub(/\s+[\W\d_]+(\s|$)/, " "). # drop funny tokens
-        gsub(/\s+/, " ")
+      MeCab::Tagger.new("-Owakati").parse(v)   # Tokenize Japanese Text
+        .gsub(/\s+/, " ")
     end
   end
 
-- 
1.7.4.1
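
With both patches applied, the failing example from my original
message works, because Whistlepig only ever sees pre-tokenized text.
Roughly like this (the exact token boundaries depend on the MeCab
dictionary; this assumes 研究会 splits into 研究 + 会):

  require "whistlepig"
  require "MeCab"

  wakati = MeCab::Tagger.new("-Owakati")
  index  = Whistlepig::Index.new "index"
  entry  = Whistlepig::Entry.new
  entry.add_string "body", wakati.parse("研究会").force_encoding("UTF-8")
  index.add_entry entry                               # => 1
  query = Whistlepig::Query.new "body",
            wakati.parse("研究").force_encoding("UTF-8")
  index.search query                                  # => [1]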

_______________________________________________
Sup-devel mailing list
Sup-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/sup-devel