Re: mailing list archiver chewing patches - Mailing list pgsql-hackers
From | Magnus Hagander |
---|---|
Subject | Re: mailing list archiver chewing patches |
Date | |
Msg-id | 9837222c1001121204m52adbf21k6260609f1c8768b@mail.gmail.com Whole thread Raw |
In response to | Re: mailing list archiver chewing patches (Matteo Beccati <php@beccati.com>) |
Responses |
Re: mailing list archiver chewing patches
|
List | pgsql-hackers |
On Tue, Jan 12, 2010 at 20:56, Matteo Beccati <php@beccati.com> wrote: > Il 12/01/2010 10:30, Magnus Hagander ha scritto: >> >> The problem is usually with strange looking emails with 15 different >> MIME types. If we can figure out the proper way to render that, the >> rest really is just a SMOP. > > Yeah, I was expecting some, but all the message I've looked at seemed to be > working ok. Have you been looking at old or new messages? Try grabbing a couple of MBOX files off archives.postgresql.org from several years back, you're more likely to find weird MUAs then I think. >> (BTW, for something to actually be used In Production (TM), we want >> something that uses one of our existing frameworks. So don't go >> overboard in code-wise implementations on something else - proof of >> concept on something else is always ok, of course) > > OK, that's something I didn't know, even though I expected some kind of > limitations. Could you please elaborate a bit more (i.e. where to find > info)? Well, the framework we're moving towards is built on top of django, so that would be a good first start. There is also whever the commitfest thing is built on, but I'm told that's basically no framework. > Having played with it, here's my feedback about AOX: > > pros: > - seemed to be working reliably; > - does most of the dirty job of parsing emails, splitting parts, etc > - highly normalized schema > - thread support (partial?) A killer will be if that thread support is enough. If we have to build that completely ourselves, it'll take a lot more work. > cons: > - directly publishing the live email feed might not be desirable Why not? > - queries might end up being a bit complicate for simple tasks As long as we don't have to hit them too often, which is solve:able with caching. And we do have a pretty good RDBMS to run the queries on :) >> I don't think you can trust the NNTP gateway now or in the past, >> messages are sometimes lost there. The mbox files are as complete as >> anything we'll ever get. > > Importing the whole pgsql-www archive with a perl script that bounces > messages via SMTP took about 30m. Maybe there's even a way to skip SMTP, I > haven't looked into it that much. Um, yes. There is an MBOX import tool. >>>> - We need to generate thread indexes >>> >>> We have CTEs :) >> >> Right. We still need the threading information, so we have something >> to use our CTEs on :-) >> >> But I assume that AOX already does this? > > there are thread related tables and they seem to get filled when a SORT IMAP > command is issued, however I haven't found a way to get the hierarchy out of > them. > > What that means is that we'd need some kind of post processing to populate a > thread hierarchy. > > If there isn't a fully usable thread hierarchy I was more thinking to ltree, > mainly because I've successfully used it in past and I haven't had enough > time yet to look at CTEs. But if performance is comparable I don't see a > reason why we shouldn't use them. I'd favor CTEs if they are fast enough. Great flexibility. >>>> - We need to re-generate the original URLs for backwards compatibility >>> >>> I guess the message-id one ain't the tricky one... and it should be >>> possible to fill a relation table like >>> monharc_compat(message_id, list, year, month, message_number); >> >> Yeah. It's not so hard, you can just screen-scrape the current >> archives the same way the search server does. > > Definitely an easy enough task. > > With all that said, I can't promise anything as it all depends on how much > spare time I have, but I can proceed with the evaluation if you think it's > useful. I have a feeling that AOX is not truly the right tool for the job, > but we might be able to customise it to suit our needs. Are there any other > requirements that weren't specified? Well, I think we want to avoid customizing it. Using a custom frontend, sure. But we don't want to end up customizing the parser/backend. That's the road to unmaintainability. -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
pgsql-hackers by date: