Thread: Suggestion for improving Archives
Folks, In addition to the pending migration of the Archives (or half the archives, or whatever), I had another suggestion to make the archives less resource-intensive yet more user-friendly: Drop the search interface and replace it with links to pgsql.ru and Google Groups. Both of those resources are faster and better search engines than the Mhonarc search could ever be, and neither eats CPU time on hub.org. Yes? -- Josh Berkus Aglio Database Solutions San Francisco
-----Original Message----- From: pgsql-www-owner@postgresql.org on behalf of Josh Berkus Sent: Fri 9/3/2004 5:19 PM To: PostgreSQL WWW Mailing List Subject: [pgsql-www] Suggestion for improving Archives > Both of those resources are faster and better search engines than the Mhonarc > search could ever be, and neither eats CPU time on hub.org. Yes? Um, Mhonarc is the mail to html program which will still be required, and none of the searches run on hub.org currently anyway. Regards, Dave
On Fri, 3 Sep 2004, Josh Berkus wrote: > Folks, > > In addition to the pending migration of the Archives (or half the archives, or > whatever), I had another suggestion to make the archives less > resource-intensive yet more user-friendly: > > Drop the search interface and replace it with links to pgsql.ru and Google > Groups. > > Both of those resources are faster and better search engines than the > Mhonarc search could ever be, and neither eats CPU time on hub.org. > Yes? Note that the search functions haven't chewed up CPU on our servers in over a month now ... John Hansen has been running search.postgresql.org off of his server(s) for about that long now ... ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
I could configure the search daemon at pgsql.ru to allow search requests from archives.postgresql.org via a Perl interface, so results could be wrapped into any design. Oleg On Fri, 3 Sep 2004, Josh Berkus wrote: > Folks, > > In addition to the pending migration of the Archives (or half the archives, or > whatever), I had another suggestion to make the archives less > resource-intensive yet more user-friendly: > > Drop the search interface and replace it with links to pgsql.ru and Google > Groups. > > Both of those resources are faster and better search engines than the Mhonarc > search could ever be, and neither eats CPU time on hub.org. Yes? > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
> Note that the search functions haven't chewed up CPU on our > servers in over a month now ... John Hansen has been running > search.postgresql.org off of his server(s) for about that long now ... > ... It's been more than 2 months, but who's counting? :) ... John
Hi, > > I could configure search daemon at pgsql.ru to allow search > requests from archives.postgresql.org via perl interface, so > results could be wrapped into any design. > I currently run search.postgresql.org, and it can be integrated into any design as well, either using the (ASPseek) method of an HTML-like template, or using XSL (the daemon can be configured to output XML). > > > > Drop the search interface and replace it with links to pgsql.ru and > > Google Groups. > > > > Both of those resources are faster and better search > engines than the Mhonarc > > search could ever be, and neither eats CPU time on hub.org. Yes? However, only one of those resources, pgsql.ru, could be made to make available new threads on an hourly basis, or maybe even realtime. Either way, I'm not fussed, whichever works the best. ... John
Guys, > However, only one of those resources, pgsql.ru, could be made to make > available new threads on an hourly basis, or maybe even realtime. > > Either way, I'm not fussed, whichever works the best. Hmmmm ..... =========================== Click to Search the Archives: -- PGSQL.ru's Full Text Search of Archives using OpenFTS (fast, all PostgreSQL sites) -- Google Groups (fast, general search of Usenet and PostgreSQL mailing lists) -- Mhonarc Archive Search (slow but includes up-to-the-last-hour posts) =========================== Good? -- Josh Berkus Aglio Database Solutions San Francisco
> > -- Mhonarc Archive Search (slow but includes up-to-the-last-hour posts) > Maybe I'm missing something... Could you define 'slow' for me.... Any search I do on the archives comes back in less than a second. ... John
On Sat, 4 Sep 2004, Josh Berkus wrote: > Guys, > >> However, only one of those resources, pgsql.ru, could be made to make >> available new threads on an hourly basis, or maybe even realtime. >> >> Either way, I'm not fussed, whichever works the best. > > Hmmmm ..... > > =========================== > Click to Search the Archives: > > -- PGSQL.ru's Full Text Search of Archives using OpenFTS (fast, all PostgreSQL > sites) > > -- Google Groups (fast, general search of Usenet and PostgreSQL mailing lists) > > -- Mhonarc Archive Search (slow but includes up-to-the-last-hour posts) When was the last time you used the search on archives.postgresql.org? The following was searching mvcc: "Documents 1-10 of total 1576 found. Searching in 390035 documents took 0.037 seconds." The following was searching 'wal vadim': "Documents 1-10 of total 880 found. Searching in 390035 documents took 0.441 seconds." The following was searching "postgresql releases 8.0": "Documents 1-10 of total 2190 found. Searching in 390035 documents took 1.882 seconds." The following was searching "nested transaction support": "Documents 1-10 of total 383 found. Searching in 390035 documents took 4.714 seconds." Not what I'd consider "slow" ... granted, that last one on Google took .2 seconds, but when we can build a server farm like them, then I'll be worried about 4secs vs .2 :) ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
> the following was searching "nested transaction support": > > "Documents 1-10 of total 383 found. Searching in 390035 > documents took > 4.714 seconds." > > Not what I'd consider "slow" ... granted, that last one on > Google took .2 seconds, but when we can build a server farm > like them, then I'll be worried about 4secs vs .2 :) > And the archives are currently being crawled, and there was a vacuum running (heavy db load). nested transaction support: Documents 1-10 of total 383 found. Searching in 390035 documents took 0.104 seconds. ... John
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > the followign was searching "nested transaction support": > > "Documents 1-10 of total 383 found. Searching in 390035 documents took > 4.714 seconds." The self-reported time should really not be used. I just ran a query, for example, that took 8 seconds as measured by my local clock, but reported searching in under 2 seconds: so obviously there are some other factors here. (I'll give maybe half a second for network times on my end). I prefer pgsql.ru or google because it searches the docs and the mailing lists, and the quality of the results tend to be higher. While we are here, the "for files modified" bit of the search.postgresql.org box does not seem to work: searching for "nested transactions vadim" brings back 62 hits, regardless of whether I set it to within one day or within 2 years. The top hit is from June 2000. There is also no way to sort it by date, which can be extremely important. The ads on every page are annoying as well. My own personal summary of advantages: pgsql.ru: very fast, searches all sites at once, no advertisements, nice "group by site" feature, cool Mozilla plugin, BSD-licensed tech, written by PG developers google: extremely fast, searches many other sources, minimal ads, order by date, powerful "advanced search" available search.postgresql.org: linked from main site? - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200409041541 -----BEGIN PGP SIGNATURE----- iD8DBQFBOh6NvJuQZxSWSsgRArohAJ9qoVBbhrc/vPntojFTXDocX5EZegCfeHC4 T5VIlIxwEklT6EGquje6w3Y= =HclZ -----END PGP SIGNATURE-----
-----Original Message----- From: pgsql-www-owner@postgresql.org on behalf of Greg Sabino Mullane Sent: Sat 9/4/2004 8:57 PM To: pgsql-www@postgresql.org Subject: Re: [pgsql-www] Suggestion for improving Archives -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > the following was searching "nested transaction support": > > "Documents 1-10 of total 383 found. Searching in 390035 documents took > 4.714 seconds." > The self-reported time should really not be used. I just ran a query, for > example, that took 8 seconds as measured by my local clock, but reported > searching in under 2 seconds: so obviously there are some other factors > here. (I'll give maybe half a second for network times on my end). Factors that would apply equally if we used pgsql.ru - what you are most likely seeing is the ads and other graphics loading. If we used pgsql.ru, you would still have those coming from hub.org. > pgsql.ru: very fast, searches all sites at once, no advertisements, > nice "group by site" feature, cool Mozilla plugin, BSD-licensed tech, > written by PG developers Would be linked from main site with ads if we started using it 'officially'. > search.postgresql.org: linked from main site? Yes, should we leave it hidden? :-) Note that in comparison to pgsql.ru, search.pg.org also groups by site, searches all sites at once and is seriously hacked by PG developers (not written, I grant you). It is also open source (GPL, not BSD, not that I think that is particularly important to any of the users). Regards, Dave.
> While we are here, the "for files modified" bit of > the search.postgresql.org box does not seem to work: > searching for "nested transactions vadim" brings back 62 > hits, regardless of whether I set it to within one day or > within 2 years. The top hit is from June 2000. There is also > no way to sort it by date, which can be extremely important. > The ads on every page are annoying as well. > Seems to work fine for me: no results in the last 3 months, 5 in the last 6 and 12, 26 in the last 2 years. Sorting by date, rather than relevance, could be added.
On Sun, 5 Sep 2004, John Hansen wrote: > > While we are here, the "for files modified" bit of > > the search.postgresql.org box does not seem to work: > > searching for "nested transactions vadim" brings back 62 > > hits, regardless of whether I set it to within one day or > > within 2 years. The top hit is from June 2000. There is also > > no way to sort it by date, which can be extremely important. > > The ads on every page are annoying as well. > > > > Seems to work fine for me, no results in the last 3 months, 5 in the last > 6 and 12, 26 in the last 2 years. Sorting by date, rather than > relevance, could be added. > Marc again dropped the last-modification-time header, so it's impossible to sort results by date (in the general case) without a specific parser. Also, he changed the message template. These changes cause the whole archive to be recrawled each time, overloading archives.postgresql.org. A more specific search engine could use another source of information about which messages to crawl, but the one we use at pgsql.ru is a general search engine and it can't get the modification date without a proper header. I suggest: 1. Use a 3-server architecture (image server, frontend, backend), which could be reduced to 2 servers (image+frontend, backend) - the frontend could be plain apache+mod_accel and serve/cache all backend output; the backend is a mod_perl and/or PHP-enabled apache. 2. Return the last-modification header - be friendly to crawlers and browsers. 3. Stop changing the message template. Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
> Marc again dropped last time modification header, so it's > impossible to sort results by date (in general case ) without > specific parser. Yes, that is unfortunate, but the code required to make this happen puts stress on the archives to some degree. > Also, he changed template for message. These changes cause > recrawling the whole archive each time and overloading > archives.postgresql.org More specific search engine could use > another source of information which messages to crawl, but > one we use at pgsql.ru is a general search engine and it > can't get modification date without proper header. There should be no need to reindex the entire archive because of a template change, since if you honor the embedded <!--noindex-->..<!--/noindex--> tags, the body text never changes. Unless of course, you want to keep an up-to-date cached copy. > > I suggest: > > 1. Use 3-server architecture (image server, frontend, backend) which > could be reduced to 2 servers (image+frontend, backend) - > frontend could be plain apache+mod_accel and serve/cache > all backends > outputs, backend is a modperl or/and php enabled apache. > 2. return last modification header - be friendly to crawlers > and browsers Though an accelerator would only work if a Last-Modified header is returned by the backend, this might be worth looking into. > 3. stop changing message template > Template changes are inevitable, they're part of progress :) ... John
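A minimal sketch of the point about the <!--noindex-->..<!--/noindex--> tags: if a crawler hashes only the text outside those regions, a template change inside them leaves the hash untouched. The helper name body_fingerprint and the sample pages are hypothetical, and this is not how pgsql.ru or search.postgresql.org is known to work; it only illustrates the idea.

```python
import hashlib
import re

# Matches the <!--noindex--> ... <!--/noindex--> regions described in the thread.
NOINDEX_RE = re.compile(r"<!--noindex-->.*?<!--/noindex-->", re.DOTALL | re.IGNORECASE)


def body_fingerprint(html: str) -> str:
    """Hash only the indexable part of a page (hypothetical helper)."""
    indexable = NOINDEX_RE.sub("", html)
    # Collapse whitespace so purely cosmetic reflowing does not change the hash.
    normalized = " ".join(indexable.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Made-up pages: same message body, different template around it.
old_page = "<!--noindex--><div>old nav</div><!--/noindex--><pre>message body</pre>"
new_page = "<!--noindex--><div>new nav, new list</div><!--/noindex-->\n<pre>message body</pre>"
edited_page = "<!--noindex--><div>old nav</div><!--/noindex--><pre>edited body</pre>"

print(body_fingerprint(old_page) == body_fingerprint(new_page))     # template-only change
print(body_fingerprint(old_page) == body_fingerprint(edited_page))  # body actually changed
```

Note that this only avoids reindexing; the crawler still has to download the page to compute the hash, which is the load problem the Last-Modified header is meant to address.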
> > Oleg, is there anything that I can put into <HEAD></HEAD> for > this? To avoid having to use PHP to do it? > <meta http-equiv="Last-Modified" content="Tue, 01 Jun 2004 09:44:10"> ... John
On Sun, 5 Sep 2004, John Hansen wrote: >> Marc again dropped last time modification header, so it's >> impossible to sort results by date (in general case ) without >> specific parser. > > Yes, that is unfortunate, but the code required to make this happen puts > stress on the archives to some degree. > >> Also, he changed template for message. These changes cause >> recrawling the whole archive each time and overloading >> archives.postgresql.org More specific search engine could use >> another source of information which messages to crawl, but >> one we use at pgsql.ru is a general search engine and it >> can't get modification date without proper header. > > There should be no need to reindex the entire archive because of a > template change, since if you honor the embedded > <!--noindex-->..<!--/noindex--> tags, the body text never changes. > Unless of course, you want to keep an up-to-date cached copy. I think what Oleg is referring to is that search engines generally compare the Last-Modified header before pulling in the whole file, to see if they are the same or not ... php, unfortunately, sets that to now(), so as far as SE's are concerned, every time they index is a new file :( I'm going to play with mhonarc this week to see if I can get it to properly set Last-Modified to Date based on the message itself ... that will clean up that mess ... Oleg, is there anything that I can put into <HEAD></HEAD> for this? To avoid having to use PHP to do it? ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
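The Last-Modified comparison described here can be sketched roughly as follows. The function name needs_refetch and the sample dates are assumptions for illustration; a real crawler would more likely send an If-Modified-Since request and let the server answer 304, but the decision has the same shape.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_refetch(last_modified_header: str, last_crawl: datetime) -> bool:
    """Return True if the page claims to have changed since the last crawl."""
    modified = parsedate_to_datetime(last_modified_header)
    return modified > last_crawl


# Hypothetical time of the crawler's previous visit.
last_crawl = datetime(2004, 9, 1, tzinfo=timezone.utc)

# Header derived from the message date: stable, so the page can be skipped.
from_message_date = "Tue, 01 Jun 2004 09:44:10 GMT"

# Header set to "now" at request time: always newer than the last crawl.
set_to_now = "Sun, 05 Sep 2004 03:38:32 GMT"

print(needs_refetch(from_message_date, last_crawl))
print(needs_refetch(set_to_now, last_crawl))
```

With the header tied to the message date the first check comes back false and the page is skipped; with now() every page looks new on every crawl.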
> > What code ? I've seen that last modified header and now it's gone. > No stress on the archives, it's pure question of several lines of code > Yes, which is exactly what we wanted to avoid, more php code. > it's not a portal page, it's just a message, why should it > changed so often. I think I should teach our crawler to > recognize if changes were cosmetic using fuzzy checksum. > No, but even something as simple as adding a new mailing list would then cause you to recrawl the entire site. I agree that the last-modified header is the best solution. (the value of it being equal to the message date, that is) ... John
On Sun, 5 Sep 2004, John Hansen wrote: > > Marc again dropped last time modification header, so it's > > impossible to sort results by date (in general case ) without > > specific parser. > > Yes, that is unfortunate, but the code required to make this happen puts > stress on the archives to some degree. What code? I've seen that Last-Modified header and now it's gone. There's no stress on the archives; it's purely a question of several lines of code. > > > Also, he changed template for message. These changes cause > > recrawling the whole archive each time and overloading > > archives.postgresql.org More specific search engine could use > > another source of information which messages to crawl, but > > one we use at pgsql.ru is a general search engine and it > > can't get modification date without proper header. > > There should be no need to reindex the entire archive because of a > template change, since if you honor the embedded > <!--noindex-->..<!--/noindex--> tags, the body text never changes. > Unless of course, you want to keep an up-to-date cached copy. > Hmm, this is a rather non-standard feature of archives.postgresql.org. The problem is not with index/reindex! The problem is with the crawler, which doesn't have enough information to make the right decision. I don't like non-standard solutions/hacks when there are standard and reliable solutions. > > > > I suggest: > > > > 1. Use 3-server architecture (image server, frontend, backend) which > > could be reduced to 2 servers (image+frontend, backend) - > > frontend could be plain apache+mod_accel and serve/cache > > all backends > > outputs, backend is a modperl or/and php enabled apache. > > 2. return last modification header - be friendly to crawlers > > and browsers > > Though an accelerator would only work if last-modified header is returned > by the backend, this might be worth looking into. > I don't see a problem with returning that header. And we'd have a standard solution for a database-driven site with dynamic content. Note, one frontend could serve/hide many backends. > > 3. stop changing message template > > > > Template changes are inevitable, they're part of progress :) > It's not a portal page, it's just a message; why should it change so often? I think I should teach our crawler to recognize whether changes were cosmetic using a fuzzy checksum. > ... John > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
k, check out: http://archives.postgresql.org/sfpug/2004-09/msg00003.php I have the meta tag in place ... please confirm that the format is okay, as that is what mhonarc is getting from the message itself ... I can reformat it if I have to using PHP, but would like to avoid it if at all possible ... basically, if the default will work, I'd like to leave it as is ... On Mon, 6 Sep 2004, John Hansen wrote: >> >> Oleg, is there anything that I can put into <HEAD></HEAD> for >> this? To avoid having to use PHP to do it? >> > > <meta http-equiv="Last-Modified" content="Tue, 01 Jun 2004 09:44:10"> > > ... John > ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
> > k, check out: > > http://archives.postgresql.org/sfpug/2004-09/msg00003.php > > I have the meta tag in place ... please confirm that the > format is okay, as that is what mhonarc is getting from the > message itself ... I can reformat it if I have to using PHP, > but would like to avoid it if at all possible ... basically, > if the default will work, I'd like to leave it as is ... > Last-Modified: Sun, 5 Sep 2004 04:38:32 +0100 (BST) The date format for last-modified is not defined afaik, at least various web servers seem to have different formats, so I'm guessing this would be acceptable. ... John
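On the format question: HTTP/1.1 does define a date format for Last-Modified (the "HTTP-date", whose preferred form is the fixed-length RFC 1123 style, always in GMT). Whether a given crawler accepts the mail-style date with a numeric offset and a "(BST)" comment in a meta tag is up to that crawler. If reformatting turns out to be necessary, a minimal sketch of the conversion could look like this; the function name is hypothetical and the input is the date quoted above.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def mail_date_to_http_date(mail_date: str) -> str:
    """Convert an RFC 2822 mail Date value to the HTTP-date form (GMT)."""
    parsed = parsedate_to_datetime(mail_date)
    # Assumes the mail date carries a timezone offset; a date without one
    # would be interpreted as local time by astimezone().
    return format_datetime(parsed.astimezone(timezone.utc), usegmt=True)


print(mail_date_to_http_date("Sun, 5 Sep 2004 04:38:32 +0100 (BST)"))
# Sun, 05 Sep 2004 03:38:32 GMT
```

The trailing "(BST)" comment is ignored by the parser, and the +0100 offset is folded into the GMT time.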
John, > Maybe I'm missing something... > > Could you define 'slow' for me.... > Any search I do on the archives comes back in less than a second. My apologies; using the Archive search was so painfully slow that I'd stopped using it months ago. I didn't know that you'd speeded it up. -- Josh Berkus Aglio Database Solutions San Francisco
> > My apologies; using the Archive search was so painfully slow > that I'd stopped > using it months ago. I didn't know that you'd speeded it up. > :) ... John