Connection Pooling, a year later - Mailing list pgsql-hackers
From:      Michael Owens
Subject:   Connection Pooling, a year later
Date:
Msg-id:    200112180028.fBI0Sum06915@postgresql.org
Responses: Re: Connection Pooling, a year later
           Re: Connection Pooling, a year later
List:      pgsql-hackers
I was browsing through the TODO list and stumbled upon the connection pooling item discussed over a year ago. Is it safe to assume that, since it's still listed in the TODO, it's still a desirable feature? Although I don't claim to know jack about the PostgreSQL source (I am trying to learn my way around it), the initial idea proposed by Alfred Perlstein does seem like it could offer considerable improvements in PostgreSQL's ability to handle larger numbers of connections.

Passing file descriptors between processes aside, is it feasible to at least get the backend to where it could theoretically handle multiple, simultaneous clients? Even if descriptor passing were not the best (most portable) way to get information from the client and postmaster to the backend, there might still be another way to transmit data from the client socket to a backend, thereby allowing data from multiple clients to be sent to a single backend.

For instance, the postmaster could accept() and also multiplex (select()/poll(), or kqueue on BSD) across all file descriptors, sending and receiving data to and from the clients. After select() returns, it assigns/maps the client socket descriptor to one of the backends, to which it is connected via a domain socket. All subsequent data coming from that client descriptor is passed directly to that backend over the domain socket. Data sent from the backend to the postmaster through that socket includes the descriptor number of the client for whom it is intended. The postmaster multiplexes across descriptors to clients and backends alike, effectively connecting the M clients to the N backends without having to pass descriptors. (A minimal sketch of this loop follows below.)

I doubt that what I am going to say in this paragraph is new to any of you, so don't think I'm being preachy, or that I think this is a novel idea. But allowing multiple backends to process multiple clients, if only considered from a mathematical standpoint, seems like a way to increase (perhaps even dramatically) the maximum number of connections PostgreSQL can service while maintaining a fixed number of backends, and therefore a known performance level. The central issue is that the database server reaches marginal returns with the number of active backend processes, beyond which the overhead from shared-resource contention, context switching, etc. degrades performance. This ideal number varies from machine to machine and can of course be controlled by setting the maximum number of connections. However, once it is reached, other clients are locked out.

When that number is reached, the real question becomes whether or not the server is really being used to its full potential. Most likely it is not, because there is still a significant amount of idle time in each backend. So even while the maximum number of connections has been reached, overall utilization is less than what the machine could deliver, all the while locking other clients out. Then the pendulum can swing the other way: you set the maximum number of connections well beyond the ideal number (creating a buffer of sorts) so that those blocked clients are let in. The problem is that this opens another, worse scenario: what if all of those connections become active at once, pushing the number of active backends well beyond the ideal limit?
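To make the mapping concrete, here is a minimal sketch in C of the select() loop described above. It is illustrative only, not PostgreSQL code: postmaster_loop(), choose_backend(), and client_to_backend[] are hypothetical names, and error handling, message framing, and the backend-to-client reply path are elided.

/*
 * Sketch: the postmaster multiplexes all client sockets and forwards
 * each client's data to its assigned backend over a domain socket.
 */
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_FD 1024

int choose_backend(void);            /* hypothetical: pick an idle backend */

int client_to_backend[MAX_FD];       /* client fd -> backend socket fd */

void postmaster_loop(int listen_fd)
{
    char buf[8192];
    fd_set readfds;

    for (int fd = 0; fd < MAX_FD; fd++)
        client_to_backend[fd] = -1;  /* -1 means "no client on this fd" */

    for (;;)
    {
        FD_ZERO(&readfds);
        FD_SET(listen_fd, &readfds);
        for (int fd = 0; fd < MAX_FD; fd++)
            if (client_to_backend[fd] >= 0)
                FD_SET(fd, &readfds);

        if (select(MAX_FD, &readfds, NULL, NULL, NULL) < 0)
            continue;                /* EINTR and friends */

        /* New client: accept it and map it onto a backend. */
        if (FD_ISSET(listen_fd, &readfds))
        {
            int client = accept(listen_fd, NULL, NULL);
            if (client >= 0 && client < MAX_FD)
                client_to_backend[client] = choose_backend();
        }

        /* Forward whatever each ready client sent to its backend. */
        for (int fd = 0; fd < MAX_FD; fd++)
        {
            ssize_t n;

            if (client_to_backend[fd] < 0 || !FD_ISSET(fd, &readfds))
                continue;
            n = read(fd, buf, sizeof(buf));
            if (n <= 0)
            {
                close(fd);           /* client went away */
                client_to_backend[fd] = -1;
                continue;
            }
            /*
             * In the real scheme the payload would be framed with the
             * client's descriptor number so that the backend's reply can
             * be routed back to the right client; the backend sockets
             * would be select()ed on the same way for the reverse path.
             */
            write(client_to_backend[fd], buf, (size_t) n);
        }
    }
}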
By having the postmaster map multiple clients onto a fixed number of backends, you achieve the happy medium: you never exceed the ideal number of active backends, and at the same time you are not limited to accepting only a fixed number of connections. Accepting connections can now be based on load (however you wish to define it), not on a count; you make decisions based on utilization. If it were shown that even half of a backend's life consisted of idle time, leasing out that idle time to another active connection would potentially double the average number of simultaneous requests without (theoretically) incurring any significant degradation in performance.

I have worked on code that uses this model, and would be glad to adapt/contribute it to PostgreSQL. Currently it passes file descriptors from a listening process to a queue process using sendmsg()/recvmsg() (a sketch of the sending side appears at the end of this message). I have had no trouble with it on Linux and BSD, but I don't pretend to know anything about portability. However, even if that is an issue, I would be willing to adapt it to the model mentioned above.

Currently, the design I have passes in complete transactions, so that the N-to-M mapping can be achieved as an application server. The downside is that state cannot be maintained within the database (backend), since backends are shared between clients on the application server; every request the server makes to PostgreSQL must be a standalone transaction.

So I guess my question is whether or not there is still interest in this, and whether there are still great difficulties in making the necessary changes to the backend so that it could handle multiple clients. If it does seem possible, and there is interest, then I would be willing to take a stab at it with the code I have developed.
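For reference, here is a minimal sketch of the sendmsg()/SCM_RIGHTS descriptor passing mentioned above. It is illustrative only -- send_fd() and channel are hypothetical names, not taken from any existing code -- and the receiving side mirrors it with recvmsg() and CMSG_DATA().

/*
 * Sketch: pass an open file descriptor to another process over a
 * connected Unix domain socket.
 */
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

int send_fd(int channel, int fd_to_pass)
{
    struct msghdr msg;
    struct iovec iov;
    char dummy = 'F';                        /* must carry >= 1 data byte */
    char cmsgbuf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    iov.iov_base = &dummy;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsgbuf;
    msg.msg_controllen = sizeof(cmsgbuf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;            /* "this message carries fds" */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return (int) sendmsg(channel, &msg, 0); /* channel: Unix domain socket */
}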