Searching mailman archives offline (python-dev, anyone?)

2010/05/07

Since I’m a newcomer to python-dev, I often need to search the python-dev mailman archives. While I did find this way to do it (using Google with site:), it’s no good for offline searches (and at best it’s a kludge for online searches, too, IMHO). I’m offline quite a lot these days, since cellular 3G isn’t what it never used to be, and as long as I’m travelling the world, pre-paid cellular 3G is even worse. So, I set out looking for a proper solution to search python-dev in an offline manner.

Initially I just downloaded the whole mailing list archive with the shell-concoction listed below, and used grep to fish out what I needed. Obviously, if you’re reading this to look into something which isn’t python-dev (but why would you?!), replace MAILMAN_URL with wherever the mailing list you care about is archived.

MAILMAN_URL=http://mail.python.org/pipermail/python-dev/
# scrape the archive index for the monthly .txt.gz files, then fetch and unpack each
for FILENAME in $(wget -O - -q "$MAILMAN_URL" |
                         grep -Eo 'href="[^"]+\.txt\.gz"' |
                         cut -f2 -d\" )
do
    wget "${MAILMAN_URL%/}/$FILENAME"
    gunzip "$FILENAME"
done
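
The grep stage can be sketched roughly like this. The helper name grep_archives is made up for illustration, and it assumes the gunzipped .txt archives all sit in one directory:

```shell
# Hypothetical helper (not part of any tool): list the monthly archive files
# that mention a pattern, with the number of matching lines in each.
# $1 is the pattern, $2 the directory holding the gunzipped *.txt archives
# (defaults to the current directory).
grep_archives() {
    pattern=$1
    dir=${2:-.}
    # -i: ignore case, -c: print a per-file count of matching lines.
    # /dev/null is added so grep always prefixes counts with the file name,
    # even when only one archive is present; zero-count files are dropped.
    grep -ic -- "$pattern" "$dir"/*.txt /dev/null | grep -v ':0$'
}
```

Something like grep_archives gil tells you which months are worth opening, which is about as far as grep alone gets you.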

Naturally, after a (short) while I realized I needed a proper mailbox search utility. I’ve been using Debian for ages, but the sheer richness of Ubuntu’s repositories has only recently made my brain rewire ‘task: find new software’ to apt-cache search. So I did, and indeed apt-cache search mbox search found mairix, “a program for indexing and searching email messages stored in maildir, MH or mbox folders”. Sweet.

mairix has a slightly odd usage pattern and is geared towards people fluent in mutt (which I’m not), so it, ugh, took me a while to realize it’s the tool I need and how to use it (gory details below). To sum things up, with mairix, you (a) index all the mail you’d like to search in one invocation and (b) run mairix with a search query, which creates a new mailbox (mbox/Maildir/MH) containing only the results. You can later view that mailbox with your favorite reader, but the only one that I know of that would make sense in this context is probably, indeed, mutt.

Initially I set mairix up to index the mboxes as they were, but then I realized that due to the limitations of the mbox format, mairix has to copy every matching message to the results mailbox. If I were to use Maildir, for example, where every message is a file, it would generate a search-result Maildir made of symlinks, which sounds better. So how do you convert all these mboxes to Maildir? apt-cache search convert mbox maildir found mb2md for me. I placed all the mboxes in a directory called mbox, created a directory called maildir, and ran: mb2md -s $(pwd)/mbox -d $(pwd)/maildir. It chugged along unhappily (spewing a ton of error messages), but seemed to have worked (it later occurred to me that some emails might have been lost, I’m not sure), and a few minutes later I had all of python-dev’s archives in Maildir format.
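
One rough way to check whether the conversion lost mail is to compare message counts before and after. This is only a sketch: both function names are made up, and the mbox count simply counts lines starting with "From ", so it will overcount if an archive contains unescaped body lines that begin that way.

```shell
# Hypothetical helpers for a before/after sanity check of an mbox -> Maildir
# conversion. Neither is part of mb2md or mairix.

# Approximate message count for a directory of mbox files: every message
# starts with a "From " separator line (body lines that happen to start
# with "From " are counted too, so treat this as an upper bound).
count_mbox_messages() {
    cat "$1"/* 2>/dev/null | grep -c '^From '
}

# Message count for a Maildir tree: one regular file per message,
# stored under a cur/ or new/ directory.
count_maildir_messages() {
    find "$1" -type f \( -path '*/cur/*' -o -path '*/new/*' \) | wc -l | tr -d ' '
}
```

Running count_mbox_messages mbox and count_maildir_messages maildir side by side gives two numbers that should be close; a large gap would suggest those error messages meant something.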

Now I can use mairix! I set up my .mairixrc like so:

base=/home/teolicy/Projects/python-internals/mail/maildir
maildir=...
database=/home/teolicy/.mairixdb
mfolder=/tmp/mairix-results

The maildir=... bit means “recurse under base and index the maildirs within”. The mfolder line says where to put the resulting mailbox from your last search. I reckon the other parameters are rather obvious, but see mairixrc(5) for details if you need something else. Warning! Obviously, if you’re going to index something that’s private, don’t place the results in /tmp!
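
For private mail, one option is to keep the results folder inside a directory only you can read. A minimal sketch, assuming a made-up helper name (private_mfolder) and a made-up default location (~/.mairix-private):

```shell
# Hypothetical helper: create a directory readable only by its owner and
# print a path inside it that can be used as the mfolder setting.
# $1 is the parent directory (the default below is an arbitrary choice).
private_mfolder() {
    parent=${1:-$HOME/.mairix-private}
    mkdir -p "$parent" &&
    chmod 700 "$parent" &&          # owner-only access
    printf '%s\n' "$parent/results"
}
```

The printed path would then go on the mfolder= line of .mairixrc instead of /tmp/mairix-results, and anywhere else the results folder is mentioned.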

Now I just had to run mairix with no arguments, and a few (rather short) moments later all the emails in the archive were indexed. To use mairix, you type something like: mairix d:3m- s:gil f:antoine which means, “find all messages in the last three months where the subject has ‘gil’ in it and the sender has ‘antoine’ in it”. The results will be stored in /tmp/mairix-results, which you can read using mutt -f /tmp/mairix-results. I encourage you to read mairix(1), but if you don’t, be aware that the useful -t switch will pull whole threads into the results, not just matched messages. I use it more often than not.
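
Since the date range and -t go together so often, they can be folded into a tiny wrapper. This is a sketch only; recent_threads is a made-up name, and the extra search terms in the usage comments are just illustrations of the syntax documented in mairix(1).

```shell
# Hypothetical wrapper: search whole threads (-t), restricted to a date range.
# $1 is a mairix date expression (e.g. 3m- for "the last three months");
# all remaining arguments are passed through as search terms.
recent_threads() {
    range=$1
    shift
    mairix -t "d:$range" "$@"
}

# Usage (not run here):
#   recent_threads 3m- s:gil f:antoine         # the query above, with threads
#   recent_threads 20100101-20100331 b:whoosh  # explicit dates, body search
```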

Two small things remained. First, for some reason which I didn’t care enough to research, mutt kept complaining on startup that I don’t have a ~/Mail folder. Placing set folder=/tmp/mairix-results in my .muttrc made it go away. <sheepish>I didn’t really read what that means</sheepish>, so if that setting eats your homework, well, you deserve it. Second, I wrote a simple function for my zshrc file that reads something like:

mairix() {
    # "$@" keeps each search term as its own argument
    /usr/bin/env mairix -o /tmp/mairix-results "$@" &&
    mutt -Rf /tmp/mairix-results
}

It makes the whole thing easier.

That’s it. I’d feel pretty happy with myself, having an itch scratched so nicely, if I hadn’t been so dumbheaded as to fail to see that mairix is essentially the tool I was looking for in the first place. After about three minutes in its manpage, I figured it’s “unwieldy crap”, and started writing my own mailbox search engine in Python, based on whoosh. Fortunately, after a couple of days of mellow hacking, and having learned of the horrors that are email algorithms (email just sucks, you know?), it dawned on me that I was slowly changing my design such that I was bloody rewriting mairix. So I ditched my effort, spent a few more minutes reading mairix’s manpage without unconsciously dismissing it as crap all the time, and realized it’s exactly what I needed. I learned some from the experience about free text searching in Python and RFC2822 and stuff, but honestly, I wish I weren’t such an arse in the first place. There, I confessed.

Below you can find all the stuff written here in easily copy-pastable form, you lazy bastard. Note this isn’t a script, as it doesn’t check for any kind of error, so it’s up to you to make sure this doesn’t botch your computer or whatever.

ARCHIVE_LOCATION=$HOME/python-dev
MAILMAN_URL=http://mail.python.org/pipermail/python-dev/

echo installing mutt, mairix, mb2md
sudo apt-get install mutt mairix mb2md

echo creating directories
mkdir -p $ARCHIVE_LOCATION/mbox
mkdir -p $ARCHIVE_LOCATION/maildir
cd $ARCHIVE_LOCATION/mbox

echo downloading $MAILMAN_URL
for FILENAME in $(wget -O - -q "$MAILMAN_URL" |
                         grep -Eo 'href="[^"]+\.txt\.gz"' |
                         cut -f2 -d\")
do
    echo downloading $FILENAME
    wget -q "${MAILMAN_URL%/}/$FILENAME"
    gunzip "$FILENAME"
done

echo converting to maildir
cd $ARCHIVE_LOCATION
mb2md -s "$(pwd)/mbox" -d "$(pwd)/maildir" >/dev/null 2>&1

echo removing converted mailboxes
rm -fr $ARCHIVE_LOCATION/mbox
mv $ARCHIVE_LOCATION/maildir/* $ARCHIVE_LOCATION/maildir/.??* $ARCHIVE_LOCATION
rmdir $ARCHIVE_LOCATION/maildir

echo setting up mairixrc and muttrc
cat << EOF > ~/.mairixrc
base=$ARCHIVE_LOCATION
maildir=...
database=$HOME/.mairixdb
mfolder=/tmp/mairix-results
EOF

# append, so an existing ~/.muttrc isn't clobbered
cat << EOF >> ~/.muttrc
set folder=/tmp/mairix-results
EOF

echo indexing archive
mairix

echo 'mairix is all set-up; maybe you want to use this function:'
echo 'mairix() {'
echo '  /usr/bin/env mairix -o /tmp/mairix-results "$@" &&'
echo '   mutt -Rf /tmp/mairix-results'
echo '}'

The question of updates remains; a simple script should be able to do the trick, and maybe I’ll write it sometime. Or not.
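
A minimal sketch of one piece such an update script would need: working out which monthly archives haven't been fetched yet. Everything here is an assumption rather than something that exists: the function name new_archives, and the idea of a state file listing the archive names already downloaded, one per line.

```shell
# Hypothetical helper: read archive file names on stdin (for instance the
# output of the wget | grep | cut pipeline used for downloading) and print
# only those not yet listed in the state file given as $1.
new_archives() {
    seen=$1
    touch "$seen"    # make sure the state file exists, even if empty
    # -v: invert the match, -x: whole lines only, -F: fixed strings,
    # -f: read the strings to exclude from the state file
    grep -vxF -f "$seen"
}
```

This deliberately ignores the hard part: the current month's archive keeps growing, so it would have to be fetched again on every run, and converting it again without producing duplicate messages needs more thought.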
