[HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1 - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1 |
Date | |
Msg-id | CAM3SWZRmGTKcO3yB1gwuGte9JLogxy53YQUjj5JegYe8=QXivQ@mail.gmail.com Whole thread Raw |
Responses |
Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1
Re: [HACKERS] Possible issue with expanded object infrastructure onPostgres 9.6.1 |
List | pgsql-hackers |
A customer is on 9.6.1, and complains of a segfault observed at least 3 times. In all cases, the logs look like this: Jan 11 16:11:07 ip-10-0-118-82 kernel: [41913.530453] traps: postgres[6561] general protection ip:55fcf08b0491 sp:7ffc17dfa650 error:0 in postgres[55fcf0557000+638000] Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [4-1] LOG: server process (PID 643) was terminated by signal 11: Segmentation fault Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [4-2] DETAIL: Failed process was running: SELECT COUNT(count_column) FROM (SELECT 1 AS count_column FROM "albums" INNER JOIN "album_photo_assignments" ON "albums"."id" = "album_photo_assignments"."album_id" WHERE "albums"."deleted_at" IS NULL AND "album_photo_assignments"."photo_id" = $1 LIMIT 1000 OFFSET 0) subquery_for_count Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [5-1] LOG: terminating any other active server processes Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-1] WARNING: terminating connection because of crash of another server process Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-2] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-3] HINT: In a moment you should be able to reconnect to the database and repeat your command. The "general protection ip:55fcf08b0491 sp:7ffc17dfa650 error:0 in postgres[55fcf0557000+638000]" details are identical in all 3 segfaults observed. Note that the kernel in question is "4.4.0-57-generic #78~14.04.1-Ubuntu". These crashes were days apart, but I have yet to isolate the problem. The customer tells me that this is a representative query plan: explain SELECT COUNT(count_column) FROM (SELECT 1 AS count_column FROM "albums" INNER JOIN "album_photo_assignments" ON "albums"."id" = "album_photo_assignments"."album_id" WHERE "albums"."deleted_at" IS NULL AND "album_photo_assignments"."photo_id" = 'fd0bcbb3-9c4c-4036-954a-cf01935d004c' LIMIT 1000 OFFSET 0) subquery_for_count ; QUERY PLAN ─────────────────────────────────────────────────────────────────────────────────────────────────────────────Aggregate (cost=16419.94..16419.94rows=1 width=8) -> Limit (cost=1000.06..16419.94 rows=1 width=4) -> Nested Loop (cost=1000.06..16419.94rows=1 width=4) -> Gather (cost=1000.00..16415.87 rows=1 width=16) Workers Planned: 1 -> Parallel Seq Scan on album_photo_assignments (cost=0.00..15415.77 rows=1 width=16) Filter: (photo_id = 'fd0bcbb3-9c4c-4036-954a-cf01935d004c'::uuid) -> Index Scan using albums_pkey on albums (cost=0.06..4.06 rows=1 width=16) Index Cond: (id = album_photo_assignments.album_id) Filter: (deleted_at IS NULL) (10 rows) Note that this isn't a prepared statement, unlike the problematic query we see from the logs. And, I've not yet managed to verify that the problem case involved execution of this particular "$1 substitute constant", since that isn't displayed in the logs. So, while this query plan might well be representative of the problem, it may also not be. I can use GDB to get details of the instruction pointer that appeared in the kernel trap error, which shows a function from the expanded value representation infrastructure: (gdb) info symbol 0x55fcf08b0491 EOH_get_flat_size + 1 in section .text of /usr/lib/postgresql/9.6/bin/postgres (gdb) info symbol 0x55fcf08b0490 EOH_get_flat_size in section .text of /usr/lib/postgresql/9.6/bin/postgres (gdb) disassemble 0x55fcf08b0490 Dump of assembler code for function EOH_get_flat_size: 0x000055fcf08b0490 <+0>: push %rbp 0x000055fcf08b0491 <+1>: mov 0x8(%rdi),%rax 0x000055fcf08b0495 <+5>: mov %rsp,%rbp 0x000055fcf08b0498 <+8>: pop %rbp 0x000055fcf08b0499<+9>: mov (%rax),%rax 0x000055fcf08b049c <+12>: jmpq *%rax End of assembler dump. (I believe that Linux outputs VMA-wise addresses in the kernel trap error quoted above, an address from the userland program (Postgres) address space, so GDB does in fact display interesting disassembly -- the documentation on this seems extremely limited.) Note that there is a text array column in the table "albums". I note that there have been fixes in this area already, that are not in 9.6.1 (commit 77cd0dc, for one). I know next to nothing about this code, and would appreciate some guidance. -- Peter Geoghegan
pgsql-hackers by date: