Thread: backend pid changing

backend pid changing

From

Lewis Kapell

Date:

04 June 2008, 11:53:41

I have a Windows application which connects to a Postgres (8.3) database
residing on our company server.  Most of the application's users work
from their homes, so the application was developed with a lot of
security checks.

When a client connects to the database, a random hash is generated and
sent to the client; this hash is also saved in a Postgres table along
with the user id and the return value of pg_backend_pid().  When the
client submits queries, it presents its hash value and the server
cross-checks this, and the current value of pg_backend_pid(), against
the values that were stored previously.

If there is a mismatch, the client is instructed to obtain a new hash
and begin again.  The information about the mismatch is also recorded
for future inspection.  By examining the logs, I have observed that the
backend pid for a particular client sometimes changes during a session.
  This seems to happen about a dozen times a day, total.  Usually this
is not a problem, as the client will get a new hash and keep going.

Sometimes, however, this seems to happen in the middle of an operation.
  This happens when the client has sent a large chunk of data that is to
be stored in the database.  The client sends its authorization
information immediately before sending the data, and also with the data
chunk.  On rare occasions, the backend pid somehow seems to change
during the time it takes for the data to be sent.  This causes errors
and loss of time for the user.

I'm sure there are more details that would be needed to give a complete
picture of what is going on, yet this message is pretty long already.  I
am going to stop here and ask whether anyone can make sense of this.
That is, make sense of what I have written, and also of why the backend
pid would change during an operation as I have described.  Thanks to any
who can offer information on this.

Lewis

Re: backend pid changing

From

Matthew Wakeling

Date:

04 June 2008, 12:07:01

On Wed, 4 Jun 2008, Lewis Kapell wrote:
> The client sends its authorization information immediately before
> sending the data, and also with the data chunk.

Well, I have no idea why the backend pid is changing, but here it looks
like you have a classic concurrency problem caused by checking a variable
twice. It seems you have client-side error recovery on the initial check,
but not on the second check. The solution is to eliminate the first check,
and implement proper error recovery on the second check, so that the
client can just get a new hash and try again.

Matthew

--
It's one of those irregular verbs - "I have an independent mind," "You are
an eccentric," "He is round the twist."
                                      -- Bernard Woolly, Yes Prime Minister

Re: backend pid changing

From

Tom Lane

Date:

04 June 2008, 12:10:15

Lewis Kapell <lkapell@setonhome.org> writes:
> ... By examining the logs, I have observed that the
> backend pid for a particular client sometimes changes during a session.

That is just about impossible to believe, unless perhaps you have a
connection pooler in the loop somewhere?

            regards, tom lane

Re: backend pid changing

From

Lewis Kapell

Date:

04 June 2008, 12:18:45

We are not using connection pooling, however the clients are tunneling
through SSH.  Forgot to mention that in my first message.  Does that
make any difference to it?

Thank you,

Lewis Kapell
Computer Operations
Seton Home Study School

Tom Lane wrote:
> Lewis Kapell <lkapell@setonhome.org> writes:
>> ... By examining the logs, I have observed that the
>> backend pid for a particular client sometimes changes during a session.
>
> That is just about impossible to believe, unless perhaps you have a
> connection pooler in the loop somewhere?
>
>             regards, tom lane

PgPool parallel query performance rules of thumb

From

John Beaver

Date:

04 June 2008, 12:58:55

Hi,
    I'm trying to make use of a cluster of 40 nodes that my group has,
and I'm curious if anyone has experience with PgPool's parallel query
mode. Under what circumstances could I expect the most benefit from
query parallelization as implemented by PgPool?

Re: PgPool parallel query performance rules of thumb

From

"Mason Sharp"

Date:

06 June 2008, 12:01:26

Hi John,

It has been a while since I played around with PgPool-II. In the tests
that I did, it did help with load balancing. For parallel query, it
helped for simple queries, such as when querying a single table. If that
is your typical use case, you may benefit. For other queries, it was not
as effective. For example:

SELECT t1.col1, t1.col2
FROM t1 inner join t2 on t1.col1 = t2.col1
WHERE t2.col3 > 1000
ORDER BY t1.col1

Assume that the optimizer decided to process t2 first. It would apply
the where predicate t2.col3 > 1000 in parallel across all the nodes,
which is a good thing, and pull in those results. But, for t1, it will
query all of the nodes, then pull in all of the rows (just t1.col1 and
t1.col2 though) into a single node and perform the join and sort there
as well. You are not getting much parallelism on that step, particularly
noticeable if it is a large table.

So, there is some benefit, but it is limited. Also, again, it has been a
while since I ran this. It may have since improved (I apologize if this
is inaccurate), and I do like the other features of PgPool and what SRA
has done.

In contrast, GridSQL would parallelize this better. (Full disclosure: I
work on the free and open source GridSQL project.) It would likely
process t2 first, like pgpool. However, it would send the intermediate
results to the other nodes in the cluster. If it turns out that t1.col1
was also the column on which a distribution hash was based for t1, it
would ship those intermediate rows to only those nodes that it needs to
for joining. Then, on this second step, all of these joins would happen
in parallel, with ORDER BY applied. Back at the coordinator, since an
ORDER BY is present, GridSQL would do a merge-sort from the results of
the other nodes and return them to the client.

I hope that helps. On pgfoundry.org there are forums within the pgpool
project where they can probably better answer your questions. If you
have any questions about GridSQL, please feel free to post in the forums
at enterprisedb.com or email me directly.

Regards,

Mason


> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-
> owner@postgresql.org] On Behalf Of John Beaver
> Sent: Wednesday, June 04, 2008 11:59 AM
> To: Pgsql-Performance
> Subject: [PERFORM] PgPool parallel query performance rules of thumb
>
> Hi,
>     I'm trying to make use of a cluster of 40 nodes that my group has,
> and I'm curious if anyone has experience with PgPool's parallel query
> mode. Under what circumstances could I expect the most benefit from
> query parallelization as implemented by PgPool?
>
> --
> Sent via pgsql-performance mailing list
(pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance