Thread: backend pid changing
I have a Windows application which connects to a Postgres (8.3) database residing on our company server. Most of the application's users work from their homes, so the application was developed with a lot of security checks. When a client connects to the database, a random hash is generated and sent to the client; this hash is also saved in a Postgres table along with the user id and the return value of pg_backend_pid(). When the client submits queries, it presents its hash value and the server cross-checks this, and the current value of pg_backend_pid(), against the values that were stored previously. If there is a mismatch, the client is instructed to obtain a new hash and begin again. The information about the mismatch is also recorded for future inspection. By examining the logs, I have observed that the backend pid for a particular client sometimes changes during a session. This seems to happen about a dozen times a day, total. Usually this is not a problem, as the client will get a new hash and keep going. Sometimes, however, this seems to happen in the middle of an operation. This happens when the client has sent a large chunk of data that is to be stored in the database. The client sends its authorization information immediately before sending the data, and also with the data chunk. On rare occasions, the backend pid somehow seems to change during the time it takes for the data to be sent. This causes errors and loss of time for the user. I'm sure there are more details that would be needed to give a complete picture of what is going on, yet this message is pretty long already. I am going to stop here and ask whether anyone can make sense of this. That is, make sense of what I have written, and also of why the backend pid would change during an operation as I have described. Thanks to any who can offer information on this. Lewis
On Wed, 4 Jun 2008, Lewis Kapell wrote: > The client sends its authorization information immediately before > sending the data, and also with the data chunk. Well, I have no idea why the backend pid is changing, but here it looks like you have a classic concurrency problem caused by checking a variable twice. It seems you have client-side error recovery on the initial check, but not on the second check. The solution is to eliminate the first check, and implement proper error recovery on the second check, so that the client can just get a new hash and try again. Matthew -- It's one of those irregular verbs - "I have an independent mind," "You are an eccentric," "He is round the twist." -- Bernard Woolly, Yes Prime Minister
Lewis Kapell <lkapell@setonhome.org> writes: > ... By examining the logs, I have observed that the > backend pid for a particular client sometimes changes during a session. That is just about impossible to believe, unless perhaps you have a connection pooler in the loop somewhere? regards, tom lane
We are not using connection pooling, however the clients are tunneling through SSH. Forgot to mention that in my first message. Does that make any difference to it? Thank you, Lewis Kapell Computer Operations Seton Home Study School Tom Lane wrote: > Lewis Kapell <lkapell@setonhome.org> writes: >> ... By examining the logs, I have observed that the >> backend pid for a particular client sometimes changes during a session. > > That is just about impossible to believe, unless perhaps you have a > connection pooler in the loop somewhere? > > regards, tom lane
Hi, I'm trying to make use of a cluster of 40 nodes that my group has, and I'm curious if anyone has experience with PgPool's parallel query mode. Under what circumstances could I expect the most benefit from query parallelization as implemented by PgPool?
Hi John, It has been a while since I played around with PgPool-II. In the tests that I did, it did help with load balancing. For parallel query, it helped for simple queries, such as when querying a single table. If that is your typical use case, you may benefit. For other queries, it was not as effective. For example: SELECT t1.col1, t1.col2 FROM t1 inner join t2 on t1.col1 = t2.col1 WHERE t2.col3 > 1000 ORDER BY t1.col1 Assume that the optimizer decided to process t2 first. It would apply the where predicate t2.col3 > 1000 in parallel across all the nodes, which is a good thing, and pull in those results. But, for t1, it will query all of the nodes, then pull in all of the rows (just t1.col1 and t1.col2 though) into a single node and perform the join and sort there as well. You are not getting much parallelism on that step, particularly noticeable if it is a large table. So, there is some benefit, but it is limited. Also, again, it has been a while since I ran this. It may have since improved (I apologize if this is inaccurate), and I do like the other features of PgPool and what SRA has done. In contrast, GridSQL would parallelize this better. (Full disclosure: I work on the free and open source GridSQL project.) It would likely process t2 first, like pgpool. However, it would send the intermediate results to the other nodes in the cluster. If it turns out that t1.col1 was also the column on which a distribution hash was based for t1, it would ship those intermediate rows to only those nodes that it needs to for joining. Then, on this second step, all of these joins would happen in parallel, with ORDER BY applied. Back at the coordinator, since an ORDER BY is present, GridSQL would do a merge-sort from the results of the other nodes and return them to the client. I hope that helps. On pgfoundry.org there are forums within the pgpool project where they can probably better answer your questions. If you have any questions about GridSQL, please feel free to post in the forums at enterprisedb.com or email me directly. Regards, Mason > -----Original Message----- > From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance- > owner@postgresql.org] On Behalf Of John Beaver > Sent: Wednesday, June 04, 2008 11:59 AM > To: Pgsql-Performance > Subject: [PERFORM] PgPool parallel query performance rules of thumb > > Hi, > I'm trying to make use of a cluster of 40 nodes that my group has, > and I'm curious if anyone has experience with PgPool's parallel query > mode. Under what circumstances could I expect the most benefit from > query parallelization as implemented by PgPool? > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance