Fwd: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack - Mailing list pgsql-hackers
From | Jeremy Schneider |
---|---|
Subject | Fwd: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack |
Date | |
Msg-id | 9145334b-9583-124c-012e-7d58039ad417@amazon.com |
Responses |
Re: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack
Re: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack |
List | pgsql-hackers |
I'd like to bump this old bug that Lloyd filed for more discussion. It seems serious enough to me that we should at least talk about it. Anyone with simply the login privilege and the ability to run SQL can instantly block all new incoming connections to a DB, including new superuser connections.

session 1: select pg_sleep(9999999999) from pg_stat_activity;
session 2: vacuum full pg_authid; -or- truncate table pg_authid;

(There are likely other SQL statements you could run in session 2 as well.)

-------- Forwarded Message --------
Subject: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack
Date: Mon, 30 Apr 2018 20:41:11 +0000
From: PG Bug reporting form <noreply@postgresql.org>
Reply-To: lalbin@scharp.org, pgsql-bugs@lists.postgresql.org
To: pgsql-bugs@lists.postgresql.org
CC: lalbin@scharp.org

The following bug has been logged on the website:

Bug reference: 15182
Logged by: Lloyd Albin
Email address: lalbin@scharp.org
PostgreSQL version: 10.3
Operating system: OpenSUSE
Description:

Over the last several weeks our developers caused a Denial of Service Attack against ourselves by accident. When looking at the log files, I noticed that we had authentication timeouts during these time periods. In researching the problem I found this is due to locks being held on shared system catalog items, i.e., system catalog items that are shared between all databases on the same cluster/server. This can be caused by beginning a long-running transaction that queries pg_stat_activity, pg_roles, pg_database, etc., and then another connection that runs either a REINDEX DATABASE, REINDEX SYSTEM, or VACUUM FULL. This issue is of particular importance to database resellers who use the same cluster/server for multiple clients, as two clients can trigger it inadvertently, or a single client can trigger it either maliciously or inadvertently.
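To make the mechanism concrete, here is a toy Python model of a single relation's lock queue. It is a sketch, not PostgreSQL's lock manager: it models only AccessShareLock and AccessExclusiveLock with strict first-in, first-out wakeup, and it assumes, as the symptoms in the report suggest, that a new connection needs an AccessShareLock on a shared catalog such as pg_authid while it authenticates. `RelationLock`, `conflicts`, and the session names are hypothetical. The key rule it shows is that a new request which conflicts with an earlier *waiting* request joins the queue, so one open reader plus one queued exclusive request is enough to stall every later login.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Simplified, hypothetical model: one relation (e.g. pg_authid) and only two lock modes.
ACCESS_SHARE = "AccessShareLock"
ACCESS_EXCLUSIVE = "AccessExclusiveLock"


def conflicts(a: str, b: str) -> bool:
    """AccessExclusiveLock conflicts with every mode; AccessShareLock only with AccessExclusiveLock."""
    return a == ACCESS_EXCLUSIVE or b == ACCESS_EXCLUSIVE


@dataclass
class RelationLock:
    granted: dict[str, str] = field(default_factory=dict)         # session -> granted mode
    waiters: list[tuple[str, str]] = field(default_factory=list)  # FIFO queue of (session, mode)

    def request(self, session: str, mode: str) -> bool:
        """Return True if the lock is granted at once, False if the session must wait."""
        # A request that conflicts with an earlier *waiter* joins the queue,
        # even if it is compatible with every lock currently granted.
        behind_waiter = any(conflicts(mode, m) for _, m in self.waiters)
        vs_granted = any(conflicts(mode, m) for m in self.granted.values())
        if behind_waiter or vs_granted:
            self.waiters.append((session, mode))
            return False
        self.granted[session] = mode
        return True

    def release(self, session: str) -> None:
        """Release a lock, then grant queued requests strictly in FIFO order."""
        self.granted.pop(session, None)
        while self.waiters:
            nxt, mode = self.waiters[0]
            if any(conflicts(mode, m) for m in self.granted.values()):
                break  # head of the queue is still blocked; everyone behind it keeps waiting
            self.waiters.pop(0)
            self.granted[nxt] = mode

    def waiting_sessions(self) -> list[str]:
        return [s for s, _ in self.waiters]


if __name__ == "__main__":
    pg_authid = RelationLock()
    print(pg_authid.request("conn1", ACCESS_SHARE))      # True: open transaction reading pg_stat_activity
    print(pg_authid.request("conn2", ACCESS_EXCLUSIVE))  # False: VACUUM FULL waits for conn1
    print(pg_authid.request("login1", ACCESS_SHARE))     # False: new login queues behind conn2
    pg_authid.release("conn1")                           # conn1 commits
    print(pg_authid.waiting_sessions())                  # ['login1']: conn2 now holds the lock
```

In this model the queued logins proceed only after the long-running transaction ends and the exclusive request has been granted and released. PostgreSQL's real lock manager has many more lock modes and more nuanced wakeup rules than strict FIFO, so treat this only as an illustration of the queueing effect.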
Note: The large cloud providers give each of their clients their own cluster/server, so this will not affect other cloud clients, but it can affect an individual client. The problem is that traditional hosting companies will have all clients from one or more web servers share the same PostgreSQL cluster/server. This means that one or two clients could inadvertently stop all the other clients from being able to connect to their databases until the first client does either a COMMIT or ROLLBACK of their transaction, which they could hold open for hours; this is what happened to us internally.

In Connection 1 we need to BEGIN a transaction and then query either a shared system item (pg_authid, pg_database, etc.) or a view that depends on a shared system item (pg_stat_activity, pg_roles, etc.). Our developers were accessing pg_roles.

Connection 1 (any database, any user):
BEGIN;
SELECT * FROM pg_stat_activity;

Connection 2 (any database will do, as long as you are the database owner):
REINDEX DATABASE postgres;

Connection 3 (any database, any user):
psql -h sqltest-alt -d sandbox

All future Connection 3's will hang for however long the transaction in Connection 1 runs. In our case this was hours, and it denied everybody else the ability to log into the server until Connection 1 was committed. psql will just hang for hours, even overnight in my testing, but our apps would get the "Canceling authentication due to timeout" error after 1 minute.

Connection 2 can also run any of these commands to cause the same issue:
REINDEX SYSTEM postgres;
VACUUM FULL pg_authid;
vacuumdb -f -h sqltest-alt -d lloyd -U lalbin

Even worse, VACUUM FULL pg_authid; can be started by an unprivileged user, and it will wait for the AccessShareLock held by Connection 1 to be released before returning the error that you don't have permission to perform this action, so even an unprivileged user can cause this to happen. The privilege check needs to happen before waiting for the AccessExclusiveLock.
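The fix proposed at the end of the report is an ordering change: check privileges before queueing for the AccessExclusiveLock. Below is a minimal Python sketch contrasting the two orderings. `SharedCatalog`, `PermissionDenied`, and both `vacuum_full_*` functions are hypothetical, and the `allowed` flag stands in for PostgreSQL's actual permission rules, which this sketch does not reproduce. It shows that only the lock-first order leaves an unprivileged caller sitting in the wait queue, where it blocks new logins.

```python
from __future__ import annotations


class PermissionDenied(Exception):
    """Stands in for the permission error the report describes (message text is hypothetical)."""


class SharedCatalog:
    """Hypothetical toy model of one shared catalog such as pg_authid."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.share_holders: set[str] = set()    # sessions holding AccessShareLock
        self.exclusive_waiters: list[str] = []  # sessions queued for AccessExclusiveLock

    def new_login_blocks(self) -> bool:
        # A new AccessShareLock request conflicts with a queued AccessExclusiveLock
        # request, so in this model it has to wait behind it.
        return bool(self.exclusive_waiters)


def vacuum_full_lock_first(cat: SharedCatalog, session: str, allowed: bool) -> str:
    """Problematic order: queue for the lock, check privileges only once it is granted."""
    if cat.share_holders:
        cat.exclusive_waiters.append(session)
        return "waiting"
    if not allowed:  # `allowed` is a placeholder for the real privilege check
        raise PermissionDenied(f"permission denied for {cat.name}")
    return "done"


def vacuum_full_check_first(cat: SharedCatalog, session: str, allowed: bool) -> str:
    """Order suggested in the report: reject unprivileged callers before queueing."""
    if not allowed:  # `allowed` is a placeholder for the real privilege check
        raise PermissionDenied(f"permission denied for {cat.name}")
    if cat.share_holders:
        cat.exclusive_waiters.append(session)
        return "waiting"
    return "done"


if __name__ == "__main__":
    # Connection 1 holds AccessShareLock via an open transaction on pg_stat_activity.
    before = SharedCatalog("pg_authid")
    before.share_holders.add("conn1")
    print(vacuum_full_lock_first(before, "unprivileged", allowed=False))  # waiting
    print(before.new_login_blocks())                                      # True

    after = SharedCatalog("pg_authid")
    after.share_holders.add("conn1")
    try:
        vacuum_full_check_first(after, "unprivileged", allowed=False)
    except PermissionDenied as exc:
        print(exc)                                                        # permission denied for pg_authid
    print(after.new_login_blocks())                                       # False
```

The check-first order does not help when the caller *is* privileged (the REINDEX DATABASE case above still queues), so it narrows the problem to privileged users rather than removing it.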
This bug report has been simplified and shortened drastically. To read the full information about this issue, please see my blog post: http://lloyd.thealbins.com/Canceling%20authentication%20due%20to%20timeout

Lloyd Albin
Database Administrator
Statistical Center for HIV/AIDS Research and Prevention (SCHARP)
Fred Hutchinson Cancer Research Center

--
Jeremy Schneider
Database Engineer
Amazon Web Services