Re: [HACKERS] Proposal : For Auto-Prewarm. - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: [HACKERS] Proposal : For Auto-Prewarm. |
Date | |
Msg-id | CA+TgmoYNF_wfdwQ3z3713zKy2j0Z9C32WJdtKjvRWzeY7JOL4g@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Proposal : For Auto-Prewarm. (Mithun Cy <mithun.cy@enterprisedb.com>) |
Responses |
Re: [HACKERS] Proposal : For Auto-Prewarm.
Re: [HACKERS] Proposal : For Auto-Prewarm. Re: [HACKERS] Proposal : For Auto-Prewarm. |
List | pgsql-hackers |
On Wed, Feb 22, 2017 at 11:49 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > Hi all thanks, > I have tried to fix all of the comments given above with some more > code cleanups. While reading this patch tonight, I realized a serious problem with the entire approach, which is that this patch is supposing that we can read relation blocks for every database from a single worker that's not connected to any database. I realize that I suggested that approach, but now I think it's broken, because the patch isn't taking any locks on the relations whose pages it is reading, and that is definitely going to break things. While autoprewarm is busy sucking blocks into the shared buffer cache, somebody could be, for example, dropping one of those relations. DropRelFileNodesAllBuffers and friends expect that nobody is going to be concurrently reading blocks back into the buffer cache because they hold AccessExclusiveLock, and they assume that anybody else who is touching it will hold at least AccessShareLock. But this violates that assumption, and probably some others. This is not easy to fix. The lock has to be taken based on the relation OID, not the relfilenode, but we don't have the relation OID in the dump file, and recording it there won't help, because the relfilenode can change under us if the relation is rewritten with CLUSTER or VACUUM FULL or relevant forms of ALTER TABLE. I don't see a solution other than launching a separate worker for each database, which seems like it could be extremely expensive if there are many databases. Also, I am pretty sure it's no good to take locks before recovery reaches a consistent state. I'm not sure off-hand whether crash-recovery will notice conflicting locks, but even if it does, blocking crash recovery in order to prewarm stuff is bad news. Here are some other review comments (which may not matter unless we can think up a solution to the problems above). - I think auto_pg_prewarm.c is an unnecessarily long name. How about autoprewarm.c? - It looks like you ran pgindent over this without adding the new typedefs to your typedefs.list, so things like the last line of each typedef is formatted incorrectly. - ReadBufferForPrewarm isn't a good name for this interface. You need to give it a generic name (and header comment) that describes what it does, rather than why you added it. Others might want to also use this interface. Actually, an even better idea might be to adjust ReadBufferWithoutRelcache() to serve your need here. That function's header comment seems to contemplate that somebody might want to add a relpersistence argument someday; perhaps that day has arrived. - have_free_buffer's comment shouldn't mention autoprewarm, but it should mention that this is a lockless test, so the results might be slightly stale. See similar comments in various other backend functions for an example of how to write this. - next_task could be static, and with such a generic name, it really MUST be static to avoid namespace conflicts. - load_block() has a race condition. The relation could be dropped after you check smgrexists() and before you access the relation. Also, the fork number validity check looks useless; it should never fail. - I suggest renaming the file that stores the blocks to autoprewarm.blocks or something like that. Calling it just "autopgprewarm" doesn't seem very clear. - I don't see any reason for the dump file to contain a header record with an expected record count. When rereading the file, you can just read until EOF; there's no real need to know the record count before you start. - You should test for multiple flags like this: if ((buf_state & (BM_VALID|VM_TAG_VALID|BM_PERSISTENT)) != 0). However, I also think the test is wrong. Even if the buffer isn't BM_VALID, that's not really a reason not to include it in the dump file. Same with BM_PERSISTENT. I think the reason for the latter restriction is that you're always calling load_block() with RELPERSISTENCE_PERMANENT, but that's not a good idea either. If the relation were made unlogged after you wrote the dump file, then on reload it you'd incorrectly set BM_PERMANENT on the reloaded blocks. - elog() should not be used for user-facing messages, but rather only for internal messages that we don't expect to get generated. Also, the messages you've picked don't conform to the project's message style guidelines. - The error handling loop around load_block() suggests that you're expecting some reads to fail, which I guess is because you could be trying to read blocks from a relation that's been rewritten under a different relfilenode, or partially or entirely truncated. But I don't think it's very reasonable to just let ReadBufferWhatever() spew error messages into the log and hope users don't panic. People will expect an automatic prewarm solution to handle any such cases quietly, not bleat loudly in the log. I suspect that this error-trapping code isn't entirely correct, but there's not much point in fixing it; what we really need to do is get rid of it (somehow). - dump_block_info_periodically() will sleep for too long - perhaps forever - if WaitLatch() repeatedly returns due to WL_LATCH_SET, which can probably happen if for any reason the process receives SIGUSR1 repeatedly. Every time the latch gets set, the timeout is reset, so it may never expire. There are examples of how to write a loop like this correctly in various places in the server; please check one of those. - I don't think you should need an error-catching loop in auto_pgprewarm_main(), either. Just let the worker die if there's an ERROR, and set the restart interval to something other than BGW_NEVER_RESTART. - Setting bgw_main isn't correct for extension code. Please read the documentation on background workers, which explains how to do this correctly in an extension. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: