Thread: BUG #17704: Monitoring Queries trigger 'Signal 7: Bus Error' After Creating Hash Partitioned Table
BUG #17704: Monitoring Queries trigger 'Signal 7: Bus Error' After Creating Hash Partitioned Table
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 17704 Logged by: Garrett Everding Email address: garrett@rave.io PostgreSQL version: 14.6 Operating system: Ubuntu 22.04 Description: Hi, I'm not sure if this is the right place to post this and if this _isn't_ a bug and is instead a misconfiguration I can repost somewhere else. Today I rolled out a new hash partitioned table with 1024 partitions to our production DB (PostgreSQL 14.6 (Ubuntu 14.6-1.pgdg22.04+1) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit). Previously we had been using pganalyze2 and pgmetrics with no issue. Starting with the deployment of the new partitioned table anytime pgmetrics or pganalyze2 would run we would get a SIGBUS and postgresql would crash. ``` 2022-12-01 17:48:29 UTC [915447]: user=,db=,app=,client= LOG: server process (PID 1341581) was terminated by signal 7: Bus error 2022-12-01 17:48:29 UTC [915447]: user=,db=,app=,client= DETAIL: Failed process was running: SET lock_timeout TO '100ms';SET statement_timeout TO '5s'; select (extract(epoch from now()) * 1e9)::int8 as epoch_ns, quote_ident(table_schema)||'.'||quote_ident(table_name) as tag_table, md5((array_agg((c.*)::text order by ordinal_position))::text) from ( SELECT current_database()::information_schema.sql_identifier AS table_catalog, nc.nspname::information_schema.sql_identifier AS table_schema, c.relname::information_schema.sql_identifier AS table_name, a.attname::information_schema.sql_identifier AS column_name, a.attnum::information_schema.cardinal_number AS ordinal_position, pg_get_expr(ad.adbin, ad.adrelid)::information_schema.character_data AS column_default, CASE WHEN a.attnotnull OR t.typtype = 'd'::"char" AND t.typnotnull THEN 'NO'::text ELSE 'YES'::text END::information_schema.yes_or_no AS is_nullable, CASE WHEN t.typtype = 'd'::"char" THEN CASE WHEN bt.typelem <> 0::oid AND bt.typlen 2022-12-01 17:48:29 UTC [915447]: user=,db=,app=,client= LOG: terminating any other active server processes ``` 2022-12-01 18:21:43 UTC [915447]: user=,db=,app=,client= LOG: server process (PID 1347091) was terminated by signal 7: Bus error 2022-12-01 18:21:43 UTC [915447]: user=,db=,app=,client= DETAIL: Failed process was running: SELECT current_database() AS db, schemaname, tablename, reltuples::bigint AS tups, relpages::bigint AS pages, otta, ROUND(CASE WHEN otta=0 OR sml.relpages=0 OR sml.relpages=otta THEN 0.0 ELSE sml.relpages/otta::numeric END,1) AS tbloat, CASE WHEN relpages < otta THEN 0 ELSE relpages::bigint - otta END AS wastedpages, CASE WHEN relpages < otta THEN 0 ELSE bs*(sml.relpages-otta)::bigint END AS wastedbytes, CASE WHEN relpages < otta THEN '0 bytes'::text ELSE (bs*(relpages-otta))::bigint::text || ' bytes' END AS wastedsize, iname, ituples::bigint AS itups, ipages::bigint AS ipages, iotta, ROUND(CASE WHEN iotta=0 OR ipages=0 OR ipages=iotta THEN 0.0 ELSE ipages/iotta::numeric END,1) AS ibloat, CASE WHEN ipages < iotta THEN 0 ELSE ipages::bigint - iotta END AS wastedipages, CASE WHEN ipages < iotta THEN 0 ELSE (bs*(ipages-iotta))::bigint END AS wastedibytes, CASE WHEN ipages < iotta THEN '0 bytes' ELSE (bs*(ipages-iotta))::bigint::text || ' bytes' END AS wastedisize, CASE WHEN relpages < ott 2022-12-01 18:21:43 UTC [915447]: user=,db=,app=,client= LOG: terminating any other active server processes ``` On another connection open connection during one of the crashes we saw this error: ``` insert into <table> select * from <other table>; WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. SSL SYSCALL error: EOF detected The connection to the server was lost. Attempting reset: Failed. ``` This is a very active db serving 30k-50k TPS with 7TB of data. I haven't been able to repro this on our staging environment.
Re: BUG #17704: Monitoring Queries trigger 'Signal 7: Bus Error' After Creating Hash Partitioned Table
From
Andres Freund
Date:
Hi, On December 1, 2022 1:23:06 PM PST, PG Bug reporting form <noreply@postgresql.org> wrote: >The following bug has been logged on the website: > >Bug reference: 17704 >Logged by: Garrett Everding >Email address: garrett@rave.io >PostgreSQL version: 14.6 >Operating system: Ubuntu 22.04 >Description: > >Hi, > >I'm not sure if this is the right place to post this and if this _isn't_ a >bug and is instead a misconfiguration I can repost somewhere else. Misconfiguration shouldn't trigger this. It's a bug somewhere, but not yet obvious where (could be one of your extensions). >Today I rolled out a new hash partitioned table with 1024 partitions to our >production DB (PostgreSQL 14.6 (Ubuntu 14.6-1.pgdg22.04+1) on >aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) >11.3.0, 64-bit). Previously we had been using pganalyze2 and pgmetrics with >no issue. Starting with the deployment of the new partitioned table anytime >pgmetrics or pganalyze2 would run we would get a SIGBUS and postgresql would >crash. > >``` >2022-12-01 17:48:29 UTC [915447]: user=,db=,app=,client= LOG: server >process (PID 1341581) was terminated by signal 7: Bus error >2022-12-01 17:48:29 UTC [915447]: user=,db=,app=,client= DETAIL: Failed >process was running: SET lock_timeout TO '100ms';SET statement_timeout TO >'5s'; > select > (extract(epoch from now()) * 1e9)::int8 as epoch_ns, > quote_ident(table_schema)||'.'||quote_ident(table_name) as >tag_table, > md5((array_agg((c.*)::text order by ordinal_position))::text) > from ( > SELECT current_database()::information_schema.sql_identifier AS >table_catalog, > nc.nspname::information_schema.sql_identifier AS table_schema, > c.relname::information_schema.sql_identifier AS table_name, > a.attname::information_schema.sql_identifier AS column_name, > a.attnum::information_schema.cardinal_number AS >ordinal_position, > pg_get_expr(ad.adbin, >ad.adrelid)::information_schema.character_data AS column_default, > CASE > WHEN a.attnotnull OR t.typtype = 'd'::"char" AND >t.typnotnull THEN 'NO'::text > ELSE 'YES'::text > END::information_schema.yes_or_no AS is_nullable, > CASE > WHEN t.typtype = 'd'::"char" THEN > CASE > WHEN bt.typelem <> 0::oid AND bt.typlen >2022-12-01 17:48:29 UTC [915447]: user=,db=,app=,client= LOG: terminating >any other active server processes >``` >2022-12-01 18:21:43 UTC [915447]: user=,db=,app=,client= LOG: server >process (PID 1347091) was terminated by signal 7: Bus error >2022-12-01 18:21:43 UTC [915447]: user=,db=,app=,client= DETAIL: Failed >process was running: > SELECT > current_database() AS db, schemaname, tablename, reltuples::bigint >AS tups, relpages::bigint AS pages, otta, > ROUND(CASE WHEN otta=0 OR sml.relpages=0 OR sml.relpages=otta THEN >0.0 ELSE sml.relpages/otta::numeric END,1) AS tbloat, > CASE WHEN relpages < otta THEN 0 ELSE relpages::bigint - otta END >AS wastedpages, > CASE WHEN relpages < otta THEN 0 ELSE >bs*(sml.relpages-otta)::bigint END AS wastedbytes, > CASE WHEN relpages < otta THEN '0 bytes'::text ELSE >(bs*(relpages-otta))::bigint::text || ' bytes' END AS wastedsize, > iname, ituples::bigint AS itups, ipages::bigint AS ipages, >iotta, > ROUND(CASE WHEN iotta=0 OR ipages=0 OR ipages=iotta THEN 0.0 ELSE >ipages/iotta::numeric END,1) AS ibloat, > CASE WHEN ipages < iotta THEN 0 ELSE ipages::bigint - iotta END AS >wastedipages, > CASE WHEN ipages < iotta THEN 0 ELSE (bs*(ipages-iotta))::bigint >END AS wastedibytes, > CASE WHEN ipages < iotta THEN '0 bytes' ELSE >(bs*(ipages-iotta))::bigint::text || ' bytes' END AS wastedisize, > CASE WHEN relpages < ott >2022-12-01 18:21:43 UTC [915447]: user=,db=,app=,client= LOG: terminating >any other active server processes >``` Unfortunately that's not pointing anywhere obvious. We'll need a backtrace for the crash at least. I'm on my phone rn, so no link. But there's an article on the wiki with information about collecting a backtrace. Regards, Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Re: BUG #17704: Monitoring Queries trigger 'Signal 7: Bus Error' After Creating Hash Partitioned Table
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes: > On December 1, 2022 1:23:06 PM PST, PG Bug reporting form <noreply@postgresql.org> wrote: >> Today I rolled out a new hash partitioned table with 1024 partitions to our >> production DB (PostgreSQL 14.6 (Ubuntu 14.6-1.pgdg22.04+1) on >> aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) >> 11.3.0, 64-bit). Previously we had been using pganalyze2 and pgmetrics with >> no issue. Starting with the deployment of the new partitioned table anytime >> pgmetrics or pganalyze2 would run we would get a SIGBUS and postgresql would >> crash. > Unfortunately that's not pointing anywhere obvious. We'll need a backtrace for the crash at least. Indeed. > I'm on my phone rn, so no link. But there's an article on the wiki with information about collecting a backtrace. https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend regards, tom lane