Excerpts from Carl Worth's message of Thu Oct 15 10:23:40 -0700 2009:
> As for performance, things look pretty good, but perhaps not as good
> as I had hoped.

I know William already said he's not all that concerned with the
performance of sup-sync since it's not a common operation, but I can't
stop working on the problem. And I think that's justified, really. For
one thing, the giant sup-sync is one of the first things a new user has
to do, and having to wait for an operation measured in hours before
being able to use the program at all can be very off-putting. I think
we could do better at giving a good first impression.

> So this is preliminary, but it looks like notmuch gives a 5-10x
> performance improvement over sup, (but likely closer to the 5x end of
> that range unless you've got a very small index---at which point who
> cares how fast/slow things are?).

Those numbers were off. I now believe that my original code gained only
a 3x improvement by switching from ruby/rmail to C/GMime for mail
parsing. But I've done a little more coding since. Here are the current
results for a benchmark of ~45000 messages (rate in messages/sec.):

 Rate  Commit ID  Significant change
-----  ---------  ------------------
   41             sup (with xapian, from next)
  120  5fbdbeb33  Switch from ruby to C (with GMime)
  538  9bc4253fa  Index headers only, not body
 1050  371091139  Use custom header parser, not GMime

(Before each run, the Linux disk cache was cleared with:

    sync; echo 3 > /proc/sys/vm/drop_caches
)

So beyond the original 3x improvement, I gained a further 4x
improvement by simply doing less work. I'm now starting off by indexing
only message-id and thread-id data. That's obviously "cheating" in
terms of comparing performance, but I think it really makes sense to do
this. The idea is that by just computing the thread-ids and indexing
those for a collection of email, that initial sup-sync could be
performed very quickly.
Then, later, (as a background thread while sup is running), the
full-text indexing could be performed.

Finally, I gained another 2x improvement by not using GMime at all,
(which constructs a data structure for the entire message, even if I
only want a few headers), and instead rolling a simple parser for email
headers. (Did you know you can hide nested parenthesized comments all
over the place in email headers? I didn't.)

I'm quite happy with the final result, which is 25x faster than sup. I
can build a cold-cache index from my half-million-message archive in
less than 10 minutes, (rather than 4 hours). And performance is fairly
IO-bound at this point: in the 10-minute run, less than 7 minutes of
CPU are used.

Anyway, there are some ideas to consider for sup. If anyone wants to
play with my code, it's here:

    git clone git://notmuch.org/notmuch

I won't bore the list with further developments in notmuch, if any,
unless they're on-topic, (such as someone trying to make sup work on
top of an index built by notmuch). And I'd be delighted to see that
kind of thing happen.

Happy hacking,

-Carl
_______________________________________________
sup-talk mailing list
sup-talk@rubyforge.org
http://rubyforge.org/mailman/listinfo/sup-talk