Thread: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?
Hello
I just migrated our databases from PostgreSQL version 9.6 to version 11.1. We got a segmentation fault while running this query:
SELECT f_2110 as x FROM baseline_denull
ORDER BY eid ASC
limit 500
OFFSET 131000;
It works in version 11.1 if OFFSET + LIMIT is less than approximately 131000 (the exact threshold is some number around that).
It also works if I disable JIT (it was enabled). So this works:
set jit = 0;
SELECT f_2110 as x FROM baseline_denull
ORDER BY eid ASC
limit 500
OFFSET 131000;
It always works in version 9.6.
The workaround seems to be to disable JIT. Is this a configuration problem or a bug?
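In case it helps, here is a minimal sketch of how I can keep the workaround applied beyond a single session until this is sorted out (standard PostgreSQL 11 statements; the database name xxx is just the one from the CREATE DATABASE below):

SET jit = off;                     -- per session, same effect as jit = 0 above
ALTER DATABASE xxx SET jit = off;  -- per database
ALTER SYSTEM SET jit = off;        -- cluster-wide, then reload the config:
SELECT pg_reload_conf();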
We are using a compiled version of Postgres because we have tables (like this one) with thousands of columns.
This server was compiled as follows:
In Ubuntu 16.04:
sudo apt update
sudo apt install --yes libcrypto++-utils libssl-dev libcrypto++-dev libsystemd-dev libpthread-stubs0-dev libpthread-workqueue-dev
sudo apt install --yes docbook-xml docbook-xsl fop libxml2-utils xsltproc
sudo apt install --yes gcc zlib1g-dev libreadline6-dev make
sudo apt install --yes llvm-6.0 clang-6.0
sudo apt install --yes build-essential
sudo apt install --yes opensp
sudo locale-gen en_US.UTF-8
Download the source code:
mkdir -p ~/soft
cd ~/soft
wget https://ftp.postgresql.org/pub/source/v11.1/postgresql-11.1.tar.gz
tar xvzf postgresql-11.1.tar.gz
cd postgresql-11.1/
./configure --prefix=$HOME/soft/postgresql/postgresql-11 --with-extra-version=ps.2.0 --with-llvm --with-openssl --with-systemd --with-blocksize=32 --with-wal-blocksize=32 --with-system-tzdata=/usr/share/zoneinfo
make world
make check # 11 tests fail. I assume this is because the planner behaves differently due to the changed block size.
make install-world
$HOME/soft/postgresql/postgresql-11/bin/initdb -D $HOME/soft/postgresql/postgresql-11/data/
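As a quick sanity check of the build (a sketch; pg_jit_available() and these settings are standard in version 11), I can run on the new server:

SHOW block_size;             -- 32768, reflecting --with-blocksize=32
SHOW jit;                    -- on, per the configuration below
SELECT pg_jit_available();   -- true if the LLVM JIT provider was built and can load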
Changes in ./data/postgresql.conf:
listen_addresses = '*'
max_connections = 300
work_mem = 32MB
maintenance_work_mem = 256MB
shared_buffers = 1024MB
log_timezone = 'US/Pacific'
log_destination = 'csvlog'
logging_collector = on
log_filename = 'postgresql-%Y-%m-%d.log'
log_rotation_size = 0
log_min_duration_statement = 1000
debug_print_parse = off
debug_print_rewritten = off
debug_print_plan = off
log_temp_files = 100000000
jit = on # As a workaround I have been turning it off, but I want to keep it on.
The database is created as:
CREATE DATABASE xxx
WITH
OWNER = user
ENCODING = 'UTF8'
LC_COLLATE = 'en_US.UTF-8'
LC_CTYPE = 'en_US.UTF-8'
TABLESPACE = pg_default
CONNECTION LIMIT = -1;
The table baseline_denull has 1765 columns, mainly numbers, like:
CREATE TABLE public.baseline_denull
(
eid integer,
f_19 integer,
f_21 integer,
f_23 integer,
f_31 integer,
f_34 integer,
f_35 integer,
f_42 text COLLATE pg_catalog."default",
f_43 text COLLATE pg_catalog."default",
f_45 text COLLATE pg_catalog."default",
f_46 integer,
f_47 integer,
f_48 double precision,
f_49 double precision,
f_50 double precision,
f_51 double precision,
f_52 integer,
f_53 date,
f_54 integer,
f_68 integer,
f_74 integer,
f_77 double precision,
f_78 double precision,
f_84 integer[],
f_87 integer[],
f_92 integer[],
f_93 integer[],
f_94 integer[],
f_95 integer[],
f_96 integer[],
f_102 integer[],
f_120 integer,
f_129 integer,
... etc.
and one index:
CREATE INDEX baseline_denull_eid_idx
ON public.baseline_denull USING btree
(eid)
TABLESPACE pg_default;
I have a core file saved. It says:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: user xxx 172.17.0.64(36654) SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f3c0c08c290 in ?? ()
(gdb) bt
#0 0x00007f3c0c08c290 in ?? ()
#1 0x0000000000000000 in ?? ()
(gdb) quit
How could I enable JIT again without getting a segmentation fault?
Regards,
Pablo
Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?
From: Tom Lane
pabloa98 <pabloa98@gmail.com> writes:
> I just migrated our databases from PostgreSQL version 9.6 to version 11.1.
> We got a segmentation fault while running this query:
> SELECT f_2110 as x FROM baseline_denull
> ORDER BY eid ASC
> limit 500
> OFFSET 131000;
> the table baseline_denull has 1765 columns, mainly numbers, like:

Hm, that sounds like it matches this recent bug fix:

Author: Andres Freund <andres@anarazel.de>
Branch: master [b23852766] 2018-11-27 10:07:03 -0800
Branch: REL_11_STABLE [aee085bc0] 2018-11-27 10:07:43 -0800

    Fix jit compilation bug on wide tables.

    The function generated to perform JIT compiled tuple deforming failed
    when HeapTupleHeader's t_hoff was bigger than a signed int8. I'd
    failed to realize that LLVM's getelementptr would treat an int8 index
    argument as signed, rather than unsigned. That means that a hoff
    larger than 127 would result in a negative offset being applied. Fix
    that by widening the index to 32bit.

    Add a testcase with a wide table. Don't drop it, as it seems useful to
    verify other tools deal properly with wide tables.

    Thanks to Justin Pryzby for both reporting a bug and then reducing it
    to a reproducible testcase!

    Reported-By: Justin Pryzby
    Author: Andres Freund
    Discussion: https://postgr.es/m/20181115223959.GB10913@telsasoft.com
    Backpatch: 11, just as jit compilation was

This would result in failures on wide rows that contain some null
entries. If your table is mostly-not-null, that would fit the
observation that it only crashes on a few rows.

Can you try REL_11_STABLE branch tip and see if it works for you?

            regards, tom lane
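(A toy C sketch of the signedness issue the commit message describes; this is not PostgreSQL or LLVM code, and 176 is just an arbitrary example of a t_hoff above 127:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t hoff = 176;                 /* a wide tuple header offset, fits in a uint8 */
    int8_t  as_signed = (int8_t) hoff;  /* the same bits read as a signed 8-bit index */
    int32_t widened = (int32_t) hoff;   /* what the fix does: widen before indexing */

    printf("unsigned: %d, signed: %d, widened: %d\n",
           (int) hoff, (int) as_signed, (int) widened);  /* prints 176, -80, 176 */
    return 0;
}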
Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?
From: Andrew Gierth
>>>>> "pabloa98" == pabloa98 <pabloa98@gmail.com> writes: pabloa98> the table baseline_denull has 1765 columns, Uhh... #define MaxHeapAttributeNumber 1600 /* 8 * 200 */ Did you modify that? (The back of my envelope says that on 64bit, the largest usable t_hoff would be 248, of which 23 is fixed overhead leaving 225 as the max null bitmap size, giving a hard limit of 1800 for MaxTupleAttributeNumber and 1799 for MaxHeapAttributeNumber. And the concerns expressed in the comments above those #defines would obviously apply.) -- Andrew (irc:RhodiumToad)
Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?
From: pabloa98
I did not modify it.
I guess I should make it bigger than 1765. Is 2400 or 3200 fine?
My apologies if my questions look silly. I do not know much about the internal format of the database.
On Mon, Jan 28, 2019 at 11:58 PM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>>>>> "pabloa98" == pabloa98 <pabloa98@gmail.com> writes:
pabloa98> the table baseline_denull has 1765 columns,
Uhh...
#define MaxHeapAttributeNumber 1600 /* 8 * 200 */
Did you modify that?
(The back of my envelope says that on 64bit, the largest usable t_hoff
would be 248, of which 23 is fixed overhead leaving 225 as the max null
bitmap size, giving a hard limit of 1800 for MaxTupleAttributeNumber and
1799 for MaxHeapAttributeNumber. And the concerns expressed in the
comments above those #defines would obviously apply.)
--
Andrew (irc:RhodiumToad)
Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?
From: pabloa98
I found this article: https://manual.limesurvey.org/Instructions_for_increasing_the_maximum_number_of_columns_in_PostgreSQL_on_Linux
It seems I should modify: uint8 t_hoff;
and replace it with something like: uint32 t_hoff; or uint64 t_hoff;
And perhaps I should modify this too?
> The fix is easy enough, just adding a
> v_hoff = LLVMBuildZExt(b, v_hoff, LLVMInt32Type(), "");
> fixes the issue for me.
If that is the case, I am not sure what kind of modification we should do.
I feel I need to explain why we create these huge tables. Basically we want to process big matrices for machine learning.
Using tables with classic columns lets us write very clear code. If we had to start using arrays as columns, things would become complicated and unintuitive (besides, some columns already store vectors as arrays...).
We could use JSONB (we do, but for JSON documents). The problem is that storing large amounts of data in JSONB columns creates performance issues (compared with normal tables).
Since almost everybody is applying ML to different products, perhaps there are other companies interested in a version of Postgres that can deal with tables with thousands of columns?
I did not find any ready-to-use Postgres package like that, though.
Pablo
On Tue, Jan 29, 2019 at 12:11 AM pabloa98 <pabloa98@gmail.com> wrote:
I did not modify it.
I guess I should make it bigger than 1765. Is 2400 or 3200 fine?
My apologies if my questions look silly. I do not know much about the internal format of the database.
Pablo

On Mon, Jan 28, 2019 at 11:58 PM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>>>>> "pabloa98" == pabloa98 <pabloa98@gmail.com> writes:
pabloa98> the table baseline_denull has 1765 columns,
Uhh...
#define MaxHeapAttributeNumber 1600 /* 8 * 200 */
Did you modify that?
(The back of my envelope says that on 64bit, the largest usable t_hoff
would be 248, of which 23 is fixed overhead leaving 225 as the max null
bitmap size, giving a hard limit of 1800 for MaxTupleAttributeNumber and
1799 for MaxHeapAttributeNumber. And the concerns expressed in the
comments above those #defines would obviously apply.)
--
Andrew (irc:RhodiumToad)
Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?
From: Andrew Gierth
>>>>> "pabloa98" == pabloa98 <pabloa98@gmail.com> writes: pabloa98> I did not modify it. Then how did you create a table with more than 1600 columns? If I try and create a table with 1765 columns, I get: ERROR: tables can have at most 1600 columns -- Andrew (irc:RhodiumToad)
Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?
From: Andrew Gierth
>>>>> "pabloa98" == pabloa98 <pabloa98@gmail.com> writes: pabloa98> I found this article: pabloa98> https://manual.limesurvey.org/Instructions_for_increasing_the_maximum_number_of_columns_in_PostgreSQL_on_Linux Those instructions contain obvious errors. pabloa98> It seems I should modify: uint8 t_hoff; pabloa98> and replace it with something like: uint32 t_hoff; or uint64 t_hoff; At the very least, that ought to be uint16 t_hoff; since there is never any possibility of hoff being larger than 32k since that's the largest allowed pagesize. However, if you modify that, it's then up to you to ensure that all the code that assumes it's a uint8 is found and fixed. I have no idea what else would break. -- Andrew (irc:RhodiumToad)
Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?
From: pabloa98
I appreciate your advice. I will check the number of columns in that table.
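(For reference, a sketch of the query I can use for that check; it counts only live user columns of the table:)

SELECT count(*)
FROM pg_attribute
WHERE attrelid = 'public.baseline_denull'::regclass
  AND attnum > 0
  AND NOT attisdropped;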
On Tue, Jan 29, 2019, 1:53 AM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>>>>> "pabloa98" == pabloa98 <pabloa98@gmail.com> writes:
pabloa98> I found this article:
pabloa98> https://manual.limesurvey.org/Instructions_for_increasing_the_maximum_number_of_columns_in_PostgreSQL_on_Linux
Those instructions contain obvious errors.
pabloa98> It seems I should modify: uint8 t_hoff;
pabloa98> and replace it with something like: uint32 t_hoff; or uint64 t_hoff;
At the very least, that ought to be uint16 t_hoff; since there is never
any possibility of hoff being larger than 32k since that's the largest
allowed pagesize. However, if you modify that, it's then up to you to
ensure that all the code that assumes it's a uint8 is found and fixed.
I have no idea what else would break.
--
Andrew (irc:RhodiumToad)
Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?
From: pabloa98
I checked the table. It has 1265 columns. Sorry about the typo.
Pablo
On Tue, Jan 29, 2019 at 1:10 AM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>>>>> "pabloa98" == pabloa98 <pabloa98@gmail.com> writes:
pabloa98> I did not modify it.
Then how did you create a table with more than 1600 columns? If I try
and create a table with 1765 columns, I get:
ERROR: tables can have at most 1600 columns
--
Andrew (irc:RhodiumToad)
Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?
From: Justin Pryzby
On Mon, Nov 26, 2018 at 07:00:35PM -0800, Andres Freund wrote:
> The fix is easy enough, just adding a
> v_hoff = LLVMBuildZExt(b, v_hoff, LLVMInt32Type(), "");
> fixes the issue for me.

On Tue, Jan 29, 2019 at 12:38:38AM -0800, pabloa98 wrote:
> And perhaps should I modify this too?
> If that is the case, I am not sure what kind of modification we should do.

Andres committed the fix in November, and it's included in postgres 11.2,
which is scheduled to be released Thursday. So we'll both be able to
re-enable JIT on our wide tables again.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b23852766

Justin
Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?
From: pabloa98
I tried it. It works.
Thanks for the information.
P
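(In case it is useful to somebody else before 11.2 ships, a sketch of one way to build the branch tip, reusing the configure flags from my first message; the git URL is the public postgresql.org mirror and the paths are only illustrative:)

cd ~/soft
git clone --branch REL_11_STABLE --single-branch https://git.postgresql.org/git/postgresql.git
cd postgresql
./configure --prefix=$HOME/soft/postgresql/postgresql-11 --with-extra-version=ps.2.0 --with-llvm --with-openssl --with-systemd --with-blocksize=32 --with-wal-blocksize=32 --with-system-tzdata=/usr/share/zoneinfo
make world
make install-world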
On Mon, Jan 28, 2019, 7:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
pabloa98 <pabloa98@gmail.com> writes:
> I just migrated our databases from PostgreSQL version 9.6 to version 11.1.
> We got a segmentation fault while running this query:
> SELECT f_2110 as x FROM baseline_denull
> ORDER BY eid ASC
> limit 500
> OFFSET 131000;
> the table baseline_denull has 1765 columns, mainly numbers, like:
Hm, that sounds like it matches this recent bug fix:
Author: Andres Freund <andres@anarazel.de>
Branch: master [b23852766] 2018-11-27 10:07:03 -0800
Branch: REL_11_STABLE [aee085bc0] 2018-11-27 10:07:43 -0800
Fix jit compilation bug on wide tables.
The function generated to perform JIT compiled tuple deforming failed
when HeapTupleHeader's t_hoff was bigger than a signed int8. I'd
failed to realize that LLVM's getelementptr would treat an int8 index
argument as signed, rather than unsigned. That means that a hoff
larger than 127 would result in a negative offset being applied. Fix
that by widening the index to 32bit.
Add a testcase with a wide table. Don't drop it, as it seems useful to
verify other tools deal properly with wide tables.
Thanks to Justin Pryzby for both reporting a bug and then reducing it
to a reproducible testcase!
Reported-By: Justin Pryzby
Author: Andres Freund
Discussion: https://postgr.es/m/20181115223959.GB10913@telsasoft.com
Backpatch: 11, just as jit compilation was
This would result in failures on wide rows that contain some null
entries. If your table is mostly-not-null, that would fit the
observation that it only crashes on a few rows.
Can you try REL_11_STABLE branch tip and see if it works for you?
regards, tom lane