[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[sup-talk] Indexing messages without ruby



I've gone forward with an experiment I had mentioned I might try:
trying to create a faster sup-sync by using C rather than ruby.

My approach was to use Xapian to create a sup-compatible index, and to
use libgmime for the mail parsing. That meant that I only had to write
a tiny bit of code to glue gmime together with xapian. The code is an
unholy mess of C and C++, (so there are as many as three different
string types, sometimes all in one function!), but it seems to be
working at least.

I wrote a little xapian-dump program to make a textual dump of a
database, (all data, terms, and values for each document), and I
verified that my program, notmuch, could create identical[*] terms and
values when indexing about 150 recent messages from the sup-talk
mailing list.

I've also verified that notmuch can index my own collection of nearly
600,000 email messages without any problems.

The big difference in notmuch that makes the resulting index
incompatible with sup is that it doesn't generate a serialized ruby
data structure for the document data like sup currently
expects. Instead it just shoves the mail message's filename into the
data field. So if anyone wanted to use notmuch with sup, this would
need to be resolved on one side or the other.[**]

As for performance, things look pretty good, but perhaps not as good
as I had hoped. From a test of importing about 400,000 messages from
my mail archive, notmuch starts out between 300 - 400 messages/sec.
but after about 40,000 messages slows down to about 100
messages/sec. and stabilizes there.

I haven't tested sup recently, but from my recollection indexing the
same archive on the same computer, sup started out at about 40
messages/sec. and slowed down to about 20 messages/sec. (for as long
as I watched it).

So this is preliminary, but it looks like notmuch gives a 5-10x
performance improvement over sup, (but likely closer to the 5x end of
that range unless you've got a very small index---at which point who
cares how fast/slow things are?).

A quick glance with a profiler shows xapian dominating the notmuch
profile at 62% and gmime hardly appearing at all (near 2%). As
contrast, ruby dominates the sup-sync profile with xapian down in the
8% range. (So there's the 10x target there.) One other advantage is
that with xapian dominating the profile, notmuch stands to benefit
from future performance improvements to xapian itself.

If that ruby time is dominated by mail parsing, it's possible that
much of the advantage of notmuch could be gained by simply switching
from rmail to a non-ruby-based parser like gmime. But that's just a
guess as I haven't tried to find where the ruby time is being spent.

If anyone is interested in playing along at home, my code is available
via git with:

	git clone git://git.cworth.org/git/notmuch

Have fun,

-Carl

[*] Some minor differences exist in the heuristics for reognizing
signatures, and sup sometimes splits numbers into multiple terms (such
as "1754" indexed as two terms "17" and "54") which I couldn't explain
nor replicate. Finally notmuch doesn't attempt to index encrypted
messages.

[**] Beyond this incompatibility, notmuch is not even close to being a
functional replacement for sup-sync. It currently only knows how to
shove new documents into the database and doesn't know how to do any
updating. Similarly it doesn't have any code to examine mtimes to
identify new or changed messages to be updated. It also doesn't look
at maildir filename flags to determine labels, nor does it provide any
means for the user to request custom labels to be attached to certain
messages.

Attachment: signature.asc
Description: PGP signature

_______________________________________________
sup-talk mailing list
sup-talk@rubyforge.org
http://rubyforge.org/mailman/listinfo/sup-talk