Re: Bug: Buffer cache is not scan resistant - Mailing list pgsql-hackers
| From | Luke Lonergan |
|---|---|
| Subject | Re: Bug: Buffer cache is not scan resistant |
| Date | |
| Msg-id | C3E62232E3BCF24CBA20D72BFDCB6BF802AF2821@MI8NYCMAIL08.Mi8.com |
| In response to | Bug: Buffer cache is not scan resistant ("Luke Lonergan" <llonergan@greenplum.com>) |
| Responses | Re: Bug: Buffer cache is not scan resistant |
| List | pgsql-hackers |
Hi Tom,

Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote it).

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Monday, March 05, 2007 07:43 PM Eastern Standard Time
To: Mark Kirkwood
Cc: Pavan Deolasee; Gavin Sherry; Luke Lonergan; PGSQL Hackers; Doug Rady; Sherry Moore
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant

Mark Kirkwood <markir@paradise.net.nz> writes:
> Tom Lane wrote:
>> But what I wanted to see was the curve of
>> elapsed time vs shared_buffers?
> ...
> Looks *very* similar.

Yup, thanks for checking.

I've been poking into this myself.  I find that I can reproduce the
behavior to some extent even with a slow disk drive (this machine is a
dual 2.8GHz Xeon EM64T running Fedora Core 5; the dd-to-dev-null test
shows the disk read speed as 43MB/sec or so).  Test case is a
several-gig table, no indexes, fully vacuumed so that neither VACUUM nor
COUNT(*) have to do anything but seqscan as fast as they can.  Given a
*freshly started* postmaster, I see

regression=# show shared_buffers;
 shared_buffers
----------------
 128MB
(1 row)

regression=# \timing
Timing is on.
regression=# vacuum lineitem;
VACUUM
Time: 63652.333 ms
regression=# vacuum lineitem;
VACUUM
Time: 63562.303 ms
regression=# select count(*) from lineitem;
  count
----------
 10240000
(1 row)

Time: 63142.174 ms
regression=# vacuum lineitem;
VACUUM
Time: 61638.421 ms
regression=# vacuum lineitem;
VACUUM
Time: 61785.905 ms

I didn't show it here, but you can repeat the VACUUM all you want before
the SELECT, and its times are stable; and you can repeat all you want
after the SELECT, and the times are stable but a couple seconds lower.
Restart the postmaster and it goes back to the slower behavior.
(I'm keeping autovac off so it doesn't change the results.)
I decided to get down and dirty with oprofile, and soon found that the
user-space CPU consumption is indistinguishable in both states:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   141065 73.8193 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
    26368 13.7984 /home/tgl/testversion/bin/postgres
    12765  6.6799 /libata
     2238  1.1711 /lib64/libc-2.4.so
     1112  0.5819 /dm_mod

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   113177 70.2169 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
    26284 16.3070 /home/tgl/testversion/bin/postgres
    12004  7.4475 /libata
     2093  1.2985 /lib64/libc-2.4.so
      996  0.6179 /dm_mod

Inside the kernel, there's only one routine that's significantly different:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
57779    40.9591  copy_user_generic
18175    12.8841  __delay
 3994     2.8313  _raw_spin_lock
 2388     1.6928  put_page
 2184     1.5482  mwait_idle
 2083     1.4766  _raw_write_unlock
 1909     1.3533  _raw_write_lock

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
37437    33.0783  copy_user_generic
17891    15.8080  __delay
 3372     2.9794  _raw_spin_lock
 2218     1.9598  mwait_idle
 2067     1.8263  _raw_write_unlock
 1837     1.6231  _raw_write_lock
 1531     1.3527  put_page

So that's part of the mystery: apparently copy_user_generic is coded in
such a way that it's faster to copy into memory that's already in
processor cache.  This strikes me as something that probably
could/should be fixed in the kernel; I don't see any good reason why
overwriting a whole cache line oughtn't be the same speed either way.
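The cache effect described here can be seen in user space with an ordinary memcpy(). The following standalone C sketch is not from the thread; the sizes are illustrative assumptions (8KB blocks to match PostgreSQL's page size, a two-buffer destination working set versus roughly 128MB of destinations). It only shows that copying the same total volume of data is slower when the destination working set does not fit in the CPU cache:

```c
/*
 * Minimal sketch (not from the original thread) of the effect described
 * above: the same volume of memcpy() traffic is faster when the
 * destination buffers stay resident in the CPU cache than when the
 * destination footprint is much larger than the cache.
 *
 * Build: cc -O2 copybench.c -o copybench
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BLOCK   8192        /* one "buffer": PostgreSQL page size */
#define NSMALL  2           /* destination working set that fits in cache */
#define NLARGE  16384       /* ~128MB of destinations: far larger than cache */
#define LOOPS   200000      /* total block copies per run (~1.6GB moved) */

static double run(const char *src, char *dst, int nbufs)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < LOOPS; i++)
        memcpy(dst + (size_t) (i % nbufs) * BLOCK, src, BLOCK);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    char *src = malloc(BLOCK);
    char *dst = malloc((size_t) NLARGE * BLOCK);

    if (!src || !dst)
        return 1;
    memset(src, 'x', BLOCK);
    memset(dst, 0, (size_t) NLARGE * BLOCK);    /* fault all pages in up front */

    printf("cache-resident destinations: %.3f s\n", run(src, dst, NSMALL));
    printf("cache-busting destinations:  %.3f s\n", run(src, dst, NLARGE));

    /* touch the destination so the copies can't be optimized away */
    printf("(check byte: %d)\n", (int) dst[(size_t) (NLARGE - 1) * BLOCK]);

    free(src);
    free(dst);
    return 0;
}
```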
The other thing that was bothering me is why does the SELECT change
VACUUM's behavior?  A debugging printout added to ReadBuffer gave the
answer: after postmaster start, we see things like

read block 353094 into buffer 11386
read block 353095 into buffer 11387
read block 353096 into buffer 11388
read block 353097 into buffer 11389
read block 353098 into buffer 11390
read block 353099 into buffer 11391
read block 353100 into buffer 11392
read block 353101 into buffer 11393
read block 353102 into buffer 11394
read block 353103 into buffer 11395

and after the SELECT it behaves like

read block 336761 into buffer 9403
read block 336762 into buffer 9402
read block 336763 into buffer 9403
read block 336764 into buffer 9402
read block 336765 into buffer 9403
read block 336766 into buffer 9402
read block 336767 into buffer 9403
read block 336768 into buffer 9402
read block 336769 into buffer 9403
read block 336770 into buffer 9402

What's going on is that VACUUM puts each buffer it's finished with on
the tail of the freelist.  In the post-SELECT state, there are just two
buffers cycling through the freelist (not sure why not only one, but it
doesn't matter) and so the cache footprint is limited.  But immediately
after postmaster start, (nearly) all the buffers are in the freelist and
so they all cycle through VACUUM's usage.  In any real-world situation,
of course, the freelist is going to be nearly empty most of the time and
so I don't think this part is worth changing.

So I now concede Luke's argument that this behavior is related to L2
cache usage.  But the next question is whether we ought to change
regular seqscan's behavior to mimic VACUUM.  I'm very worried about
pessimizing other cases if we do.  ISTM there's a fairly clear case that
this might be fixable at the kernel level.  Moreover, the issue only
arises because of the way that the copy-from-kernel-buffer-to-userspace
routine behaves, which means that if we go to a regime where we stop
relying on OS caching and ask for direct DMA into our buffers, the
advantage would disappear anyway.  Lastly, I think the case where a win
is possible is fairly narrow --- as soon as you've got anything but the
one seqscan going on, it's not going to help.

			regards, tom lane
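A toy C model of the freelist mechanics described in the message above (not PostgreSQL source; the buffer and block numbers are copied from the debug output purely for illustration): a scan that pops a buffer from the head of the freelist and pushes it back on the tail walks through every buffer when the freelist starts out full, but ping-pongs between two buffers when only two are free.

```c
/*
 * Toy model -- not PostgreSQL source -- of the freelist behavior
 * described above.  The scan takes a buffer from the head of the
 * freelist and, once finished with it, puts it back on the tail.
 * With a full freelist (fresh postmaster) the scan cycles through
 * every buffer; with only two free buffers (post-SELECT) it
 * ping-pongs between them, so its CPU-cache footprint stays tiny.
 */
#include <stdio.h>

#define FREELIST_SIZE 16                /* illustrative only */

static int freelist[FREELIST_SIZE];
static int head, tail;

static void freelist_push(int buf)      /* append at the tail, as VACUUM does */
{
    freelist[tail] = buf;
    tail = (tail + 1) % FREELIST_SIZE;
}

static int freelist_pop(void)           /* take the buffer at the head */
{
    int buf = freelist[head];
    head = (head + 1) % FREELIST_SIZE;
    return buf;
}

static void scan(const char *label, int firstblock, int nblocks)
{
    printf("%s\n", label);
    for (int i = 0; i < nblocks; i++)
    {
        int buf = freelist_pop();
        printf("  read block %d into buffer %d\n", firstblock + i, buf);
        freelist_push(buf);             /* done with it: back on the tail */
    }
}

int main(void)
{
    /* Fresh postmaster: (nearly) all buffers are on the freelist. */
    head = tail = 0;
    for (int buf = 11386; buf <= 11393; buf++)
        freelist_push(buf);
    scan("freshly started postmaster:", 353094, 8);

    /* Post-SELECT: only two buffers are left cycling through the freelist. */
    head = tail = 0;
    freelist_push(9403);
    freelist_push(9402);
    scan("after the SELECT:", 336761, 8);

    return 0;
}
```

Run as-is, the two scans print the sequential and the two-buffer patterns shown in the debug output above.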