Great to see Sup getting back on track again.
I submitted some patches for Heliotrope's Gmail dumper some time ago, but the lack of support for non-alphabetic languages (Japanese, Chinese) made it impossible for me to keep using heliotrope/turnsole.
The main obstacle to supporting Japanese/Chinese in heliotrope was that whistlepig (the indexer) could not tokenize these languages. The half-baked UTF-8 support also caused several issues with them.
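To illustrate the tokenization problem, here is a minimal sketch, assuming Ruby 1.9 or later. Japanese text has no spaces between words, so splitting on whitespace yields one giant token. The `char_bigrams` helper is hypothetical (my own, not something from whistlepig, sup, or Xapian), and overlapping character bigrams are only one common workaround for CJK text, not necessarily what either indexer does.

```ruby
# encoding: utf-8

# Hypothetical helper for illustration only: split text into
# overlapping two-character tokens (character bigrams).
def char_bigrams(text)
  chars = text.gsub(/\s+/, "").chars.to_a
  return chars if chars.size < 2
  chars.each_cons(2).map(&:join)
end

text = "日本語の検索"

# Whitespace splitting finds no boundaries, so the whole phrase is one token.
whitespace_tokens = text.split(/\s+/)

# Bigrams give the indexer smaller units that a query can match against.
bigram_tokens = char_bigrams(text)

puts whitespace_tokens.inspect
puts bigram_tokens.inspect
```

With bigrams, a search for a substring such as 検索 can match a term in the index, which the whitespace approach cannot do.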
I would like to help test and implement support for these languages, starting with Japanese, but I would need some guidance. First, I would like to know if there is a way to configure the Xapian tokenizer (segmenter) within sup. Please bear in mind that I am new to both sup and Xapian.
Also, considering that development of Ruby 1.8.x is going to be discontinued in June 2013, I don't think it is necessary to keep compatibility with it. Dropping it would improve UTF-8 support and eliminate the hacks required to support UTF-8 on Ruby 1.8.x installations.
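As a small sketch of what I mean, assuming Ruby 1.9 or later: strings carry their own encoding there, so character-level operations work on UTF-8 text without extra workarounds. The variable names below are just for illustration.

```ruby
# encoding: utf-8

word = "日本語"

# On Ruby 1.9+, String#length counts characters in the string's encoding.
char_count = word.length

# String#bytesize counts the raw bytes of the UTF-8 representation.
byte_count = word.bytesize

# Treating the same bytes as binary data gives the byte-oriented view,
# which is how Ruby 1.8 strings behave by default.
raw_count = word.dup.force_encoding("BINARY").length

puts "#{char_count} characters, #{byte_count} bytes"
```

On 1.8.x, the byte-oriented view is the only one available out of the box, which is where the hacks come from.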
regards,
Horacio