
Re: [sup-devel] sup 0.13



Hi,

An issue regarding this was recently opened:

https://github.com/sup-heliotrope/sup/issues/60

Regards, Gaute

On 09. mai 2013 03:39, Horacio Sanson wrote:
> UTF-8 handles most cases, but I still have to deal with emails in
> ISO-2022-JP, Shift-JIS and EUC-JP. After some research, it seems Xapian has
> no support for Asian languages. I will run some tests and open an
> issue if I cannot make it work.
> 
> I can see in the sup configuration file that the stemming language can be
> configured, but I cannot find any CJK stemmers for Xapian.
> 
> 
> On Thu, May 2, 2013 at 5:17 PM, Gaute Hope <eg@gaute.vetsj.com> wrote:
> 
>>
>>
>> On 30. april 2013 11:44, Horacio Sanson wrote:
>>> Great to see Sup getting back on track again.
>>>
>>> I submitted some patches for the Gmail dumper of Heliotrope some time ago,
>>> but the lack of support for non-alphabetic languages (Japanese, Chinese)
>>> made it impossible for me to keep using heliotrope/turnsole.
>>>
>>> The main obstacle to supporting Japanese/Chinese with heliotrope was that
>>> whistlepig (the indexer) lacked the ability to tokenize these languages.
>>> The half-baked UTF-8 support also caused several issues with these
>>> languages.
>>>
>>> I would like to help in testing/implementing support for these languages,
>>> starting with Japanese, but I would require some guidance. First, I would
>>> like to know if there is a way to configure the Xapian tokenizer
>>> (segmenter) within sup. Please consider that I am new to both sup and
>>> Xapian.
>>
>> Hi Horacio,
>>
>> consider opening an issue at
>> https://github.com/sup-heliotrope/sup/issues to make sure this doesn't
>> disappear. Some changes will probably be made to the indexer when going
>> to Mail (from RMail), but I hope to be able to migrate the existing
>> index. Perhaps it's time to get it right for arbitrary languages as well.
>> I am unfamiliar with Japanese/Chinese - does UTF-8 cover the needs?
>>
>> Mail is better at handling UTF-8 and I think there was some fork that
>> had some extra support for Japanese.
>>
>> Regards, Gaute
>>
>
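
The thread mentions mail arriving in ISO-2022-JP, Shift-JIS and EUC-JP
rather than UTF-8. As a rough illustration of that transcoding step only
(it does not address tokenizing Japanese text for the index), here is a
minimal Python sketch. The helper name to_utf8_text is hypothetical and is
not part of sup, Mail or Xapian; it assumes the message declares its
charset and that Python's standard codecs are acceptable for the
conversion.

```python
# Hypothetical helper for illustration; not taken from sup's code base.
# Decodes a raw message body using the charset declared in the message,
# so that later stages only ever see Unicode text (storable as UTF-8).

def to_utf8_text(raw, declared_charset=None):
    """Return `raw` bytes decoded to str, assuming UTF-8 when no charset is declared."""
    charset = declared_charset or "utf-8"
    try:
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Unknown charset name or bytes that do not match it:
        # fall back to UTF-8 and mark undecodable bytes instead of failing.
        return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = "日本語"
    for cs in ("iso-2022-jp", "shift_jis", "euc-jp"):
        raw = sample.encode(cs)
        # Each legacy encoding should round-trip back to the same text.
        print(cs, to_utf8_text(raw, cs) == sample)
```

Relying on the declared charset rather than guessing is a deliberate
simplification: ISO-2022-JP is a 7-bit encoding, so its bytes would also
decode "successfully" (but wrongly) as UTF-8 if the charset were guessed.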