BUG #13818: PostgreSQL crashes after cronjob runs as "postgres" - Mailing list pgsql-bugs
From | ryan@runscope.com |
---|---|
Subject | BUG #13818: PostgreSQL crashes after cronjob runs as "postgres" |
Date | |
Msg-id | 20151214224729.2624.99840@wrigleys.postgresql.org Whole thread Raw |
Responses |
Re: BUG #13818: PostgreSQL crashes after cronjob runs as "postgres"
|
List | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 13818 Logged by: Ryan Park Email address: ryan@runscope.com PostgreSQL version: 9.4.5 Operating system: CentOS 7.2 Description: We are running PostgreSQL 9.4.5 on CentOS 7, uising the RPM from postgresql.org. We perform automatic OS updates, so our server was just upgraded from CentOS 7.1 to CentOS 7.2. Since the upgrade we've been seeing PostgreSQL crash after a cronjob is run as the "postgres" user. More details: we have a cronjob that runs every 3 minutes. This cronjob runs as the OS user "postgres". It makes a very simple database update using PSQL, like this: */3 * * * * /usr/pgsql-9.4/bin/psql -U <username> <dbname> -c 'UPDATE patches SET name = name WHERE patch = 1' Here's the syslog output from /var/log/messages when the cronjob runs: Dec 14 18:24:01 prod118 systemd: Created slice user-1012.slice. Dec 14 18:24:01 prod118 systemd: Starting user-1012.slice. Dec 14 18:24:01 prod118 systemd: Started Session 15015 of user postgres. Dec 14 18:24:01 prod118 systemd: Starting Session 15015 of user postgres. Dec 14 18:24:01 prod118 systemd: Removed slice user-1012.slice. Dec 14 18:24:01 prod118 systemd: Stopping user-1012.slice. Dec 14 18:24:01 prod118 postgres[52611]: [2584-1] PANIC: semop(id=57311261) failed: Invalid argument Dec 14 18:24:01 prod118 postgres[52611]: [2584-2] STATEMENT: COMMIT Dec 14 18:24:01 prod118 postgres[52584]: [2584-1] PANIC: semop(id=57376799) failed: Invalid argument Dec 14 18:24:01 prod118 postgres[52584]: [2584-2] STATEMENT: COMMIT Dec 14 18:24:01 prod118 postgres[52630]: [2584-1] PANIC: semop(id=57311261) failed: Invalid argument Dec 14 18:24:01 prod118 postgres[52630]: [2584-2] STATEMENT: COMMIT Dec 14 18:24:01 prod118 postgres[72998]: [2600-1] LOG: server process (PID 52630) was terminated by signal 6: Aborted Dec 14 18:24:01 prod118 postgres[72998]: [2600-2] DETAIL: Failed process was running: COMMIT Dec 14 18:24:01 prod118 postgres[72998]: [2601-1] LOG: terminating any other active server processes Dec 14 18:24:01 prod118 postgres[52711]: [2585-1] WARNING: terminating connection because of crash of another server process Dec 14 18:24:01 prod118 postgres[52711]: [2585-2] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. Dec 14 18:24:01 prod118 postgres[52711]: [2585-3] HINT: In a moment you should be able to reconnect to the database and repeat your command. Dec 14 18:24:01 prod118 postgres[52715]: [2585-1] WARNING: terminating connection because of crash of another server process Dec 14 18:24:01 prod118 postgres[52714]: [2585-1] WARNING: terminating connection because of crash of another server process Dec 14 18:24:01 prod118 postgres[52715]: [2585-2] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. Dec 14 18:24:01 prod118 postgres[52714]: [2585-2] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. Dec 14 18:24:01 prod118 postgres[52715]: [2585-3] HINT: In a moment you should be able to reconnect to the database and repeat your command. Dec 14 18:24:01 prod118 postgres[52714]: [2585-3] HINT: In a moment you should be able to reconnect to the database and repeat your command. ... and so on. It looks like systemd created a "slice" for the cronjob. A "slice" is a partition of system resources using cgroups, kinda like a Docker container. When the cronjob finished, systemd tore down the slice. My hunch is that when the slice was stopped, it introduced some corruption in shared memory or another shared resource. After the slice was stopped, we immediately saw 3 semop failures in PostgreSQL, which caused the entire Postgres server to crash. The initial postgres process (/usr/pgsql-9.4/bin/postgres) kept running, but all the subprocesses were restarted -- including the checkpointer, writer, etc. This happened over and over, every time the cronjob ran. We tried restarting PostgreSQL and restarting the whole server, but neither of those helped. As I mentioned, our cronjob ran as the "postgres" user. When I switched the cronjob to run from a different user account, everything worked fine and Postgres stopped crashing. I'm guessing that there are fewer shared resources between 2 different users, so nothing gets corrupt when the other user's slice is stopped. Exact sequence of steps from startup: 1. Create a database called "<dbname>" with a "patches" table. This is the schema: CREATE TABLE IF NOT EXISTS patches( patch integer NOT NULL, name character varying(50), successful boolean NOT NULL, applied_by character varying(1000), applied_at timestamp without time zone NOT NULL, CONSTRAINT patches_pkey PRIMARY KEY (patch) ) WITH (OIDS=FALSE); 2. Make sure the user "<username>" has permission to write to that table. 3. Create a cronjob like the one described above. It should be in the "postgres" user's crontab. */3 * * * * /usr/pgsql-9.4/bin/psql -U <username> <dbname> -c 'UPDATE patches SET name = name WHERE patch = 1' What we got: PostgreSQL crashed with the logs listed above. What we expected: PostgreSQL would not crash. PostgreSQL version: Both server and client are version 9.4.5. We are using these RPMs: postgresql94-9.4.5-1PGDG.rhel7.x86_64 postgresql94-contrib-9.4.5-1PGDG.rhel7.x86_64 postgresql94-server-9.4.5-1PGDG.rhel7.x86_64 postgresql94-libs-9.4.5-1PGDG.rhel7.x86_64 Platform information: CentOS 7.2 on x86_64. The issue happened on both Linux kernel 3.10.0-229.20.1.el7.x86_64 and on kernel 3.10.0-327.3.1.el7.x86_64. The version of systemd is 219. This is running on Amazon EC2, on an i2.4xlarge instance. Thank you for your help. Ryan Park Runscope Operations ryan@runscope.com
pgsql-bugs by date: