An issue with missing rows - Mailing list pgsql-bugs
From | Viktor Semykin |
---|---|
Subject | An issue with missing rows |
Date | |
Msg-id | CACaAQPXmQJR-XtMGsuXRBAq21GYWxzPaWKHTDELmq5OWFYDiFA@mail.gmail.com Whole thread Raw |
Responses |
Re: An issue with missing rows
|
List | pgsql-bugs |
I have faced a weird issue with PostgreSQL database in conjunction with nodejs+Sequelize. I saw this issue happening in 13.1 and 10.5. nodejs is 12.16. Please let me know if I have to re-format this email somehow for comfortable reading. I'll try to explain step by step what's going on. Sequelize library test suite includes a test that have been passing successfully until I broke it. After my changes, the following queries fail about once in 20 times. In a nutshell, my queries look like this (here and below all createdAt and updatedAt fields may be inaccurate): ```sql CREATE SCHEMA IF NOT EXISTS archive; DROP TABLE IF EXISTS "archive"."Tasks" CASCADE; DROP TABLE IF EXISTS "archive"."Users" CASCADE; CREATE TABLE IF NOT EXISTS "archive"."Users" ("id" SERIAL , "createdAt" TIMESTAMP WITH TIME ZONE NOT NULL, "updatedAt" TIMESTAMP WITH TIME ZONE NOT NULL, PRIMARY KEY ("id")); CREATE TABLE IF NOT EXISTS "archive"."Tasks" ("id" INTEGER , "title" VARCHAR(255), "createdAt" TIMESTAMP WITH TIME ZONE NOT NULL, "updatedAt" TIMESTAMP WITH TIME ZONE NOT NULL, "UserId" INTEGER REFERENCES "archive"."Users" ("id") ON DELETE SET NULL ON UPDATE CASCADE, PRIMARY KEY ("id")); INSERT INTO "archive"."Users" ("id","createdAt","updatedAt") VALUES (1,'...','...') RETURNING "id","createdAt","updatedAt"; INSERT INTO "archive"."Users" ("id","createdAt","updatedAt") VALUES (2,'...','...') RETURNING "id","createdAt","updatedAt"; INSERT INTO "archive"."Tasks" ("id","title","createdAt","updatedAt","UserId") VALUES (1,'b','...','...',1) RETURNING "id","title","createdAt","updatedAt","UserId"; INSERT INTO "archive"."Tasks" ("id","title","createdAt","updatedAt","UserId") VALUES (2,'d','...','...',1) RETURNING "id","title","createdAt","updatedAt","UserId"; INSERT INTO "archive"."Tasks" ("id","title","createdAt","updatedAt","UserId") VALUES (3,'c','...','...',1) RETURNING "id","title","createdAt","updatedAt","UserId"; INSERT INTO "archive"."Tasks" ("id","title","createdAt","updatedAt","UserId") VALUES (4,'a','...','...',1) RETURNING "id","title","createdAt","updatedAt","UserId"; INSERT INTO "archive"."Tasks" ("id","title","createdAt","updatedAt","UserId") VALUES (5,'a','...','...',2) RETURNING "id","title","createdAt","updatedAt","UserId"; INSERT INTO "archive"."Tasks" ("id","title","createdAt","updatedAt","UserId") VALUES (6,'c','...','...',2) RETURNING "id","title","createdAt","updatedAt","UserId"; INSERT INTO "archive"."Tasks" ("id","title","createdAt","updatedAt","UserId") VALUES (7,'b','...','...',2) RETURNING "id","title","createdAt","updatedAt","UserId"; SELECT "User"."id", "User"."createdAt", "User"."updatedAt" FROM "archive"."Users" AS "User" ORDER BY "User"."id" ASC; SELECT "Task".* FROM (SELECT * FROM (SELECT "id", "title", "createdAt", "updatedAt", "UserId" FROM "archive"."Tasks" AS "Task" WHERE "Task"."UserId" = 1 ORDER BY "Task"."id" ASC LIMIT 2) AS sub UNION ALL SELECT * FROM (SELECT "id", "title", "createdAt", "updatedAt", "UserId" FROM "archive"."Tasks" AS "Task" WHERE "Task"."UserId" = 2 ORDER BY "Task"."id" ASC LIMIT 2) AS sub) AS "Task"; DROP SCHEMA IF EXISTS archive CASCADE; ``` in other words, in instantiates a test data like this: ``` sequelize_test=# select * from "archive"."Users"; id | createdAt | updatedAt ----+-------------------------------+------------------------------- 1 | 2021-02-13 16:59:04.190643-03 | 2021-02-13 16:59:04.190643-03 2 | 2021-02-13 16:59:04.217696-03 | 2021-02-13 16:59:04.217696-03 (2 rows) sequelize_test=# select * from "archive"."Tasks"; id | title | createdAt | updatedAt | UserId ----+-------+-------------------------------+-------------------------------+-------- 1 | b | 2021-02-13 16:59:04.222593-03 | 2021-02-13 16:59:04.222593-03 | 1 2 | d | 2021-02-13 16:59:04.238252-03 | 2021-02-13 16:59:04.238252-03 | 1 3 | c | 2021-02-13 16:59:04.24318-03 | 2021-02-13 16:59:04.24318-03 | 1 4 | a | 2021-02-13 16:59:04.256812-03 | 2021-02-13 16:59:04.256812-03 | 1 5 | a | 2021-02-13 16:59:04.262364-03 | 2021-02-13 16:59:04.262364-03 | 2 6 | c | 2021-02-13 16:59:04.277525-03 | 2021-02-13 16:59:04.277525-03 | 2 7 | b | 2021-02-13 16:59:04.281861-03 | 2021-02-13 16:59:04.281861-03 | 2 ``` Then I'm trying to fetch all Users, and for each one of them only two tasks ordered by "id". This is what my query (the Faulty Query) looks like: ```sql SELECT "User"."id", "User"."createdAt", "User"."updatedAt", "tasks"."id" AS "tasks.id", "tasks"."title" AS "tasks.title", "tasks"."createdAt" AS "tasks.createdAt", "tasks"."updatedAt" AS "tasks.updatedAt", "tasks"."UserId" AS "tasks.UserId" FROM "archive"."Users" AS "User" LEFT OUTER JOIN LATERAL (SELECT * FROM "archive"."Tasks" AS "tasks" WHERE "User"."id" = "tasks"."UserId" LIMIT 2) AS "tasks" ON TRUE ORDER BY "User"."id" ASC; ``` and most of the time it returns result as expected: ``` id | createdAt | updatedAt | tasks.id | tasks.title | tasks.createdAt | tasks.updatedAt | tasks.UserId ----+-------------------------------+-------------------------------+----------+-------------+-------------------------------+-------------------------------+-------------- 1 | 2021-02-13 16:59:04.190643-03 | 2021-02-13 16:59:04.190643-03 | 1 | b | 2021-02-13 16:59:04.222593-03 | 2021-02-13 16:59:04.222593-03 | 1 1 | 2021-02-13 16:59:04.190643-03 | 2021-02-13 16:59:04.190643-03 | 2 | d | 2021-02-13 16:59:04.238252-03 | 2021-02-13 16:59:04.238252-03 | 1 2 | 2021-02-13 16:59:04.217696-03 | 2021-02-13 16:59:04.217696-03 | 5 | a | 2021-02-13 16:59:04.262364-03 | 2021-02-13 16:59:04.262364-03 | 2 2 | 2021-02-13 16:59:04.217696-03 | 2021-02-13 16:59:04.217696-03 | 6 | c | 2021-02-13 16:59:04.277525-03 | 2021-02-13 16:59:04.277525-03 | 2 ``` note that Tasks ids are 1,2,5,6. But there are chances that it returns wrong tasks, for example 1,2,5,7. The problem is that whenever I add logging it works correctly 100% of the time. The good news I was lucky enough to catch the error with Wireshark. Sequelize uses connection pooling so the queries are spread across 5 separate tcp connections and Wireshark output is quite messy. I had to parse the output with an ad-hoc script that generates a visual graph of the underlying events. This graph is attached as graph.svg. It's handy to open it in the browser and zoom out. Graphviz is not my strongest point, so let me briefly explain what this graph means. Each node is a pgsql-relevant network frame. Red edges denote TCP connections. Black arrows denote sequence order. The red nodes are what matters. Each frame begins with its sequence number and a direction (in/out). Body of each node shows the list of commands sent/received. Frames 1062-1084 are boring preludes. Then at 1086 comes the CREATE SCHEMA query. Then come some Sequelize noise like DROP TABLE IF EXISTS of tables that do not exist. Then in 1098 and 1105 it actually queries table creation. At 1109 it starts INSERTing data into "Users". Then the pool comes into play. At 1122 it spawns one more connection and at 1130 it starts INSERTing the second row into "Users". At 1142 and 1146 the Postgres reports both connections are "Ready for query". Then at 1148 it starts INSERTing a row into "Tasks" in the first connection, and at 1165 a row in the second connection. At 1180 the fun begins, it spawns three more connections and keeps INSERTing rows into "Tasks" using all the connections. Around 1263-1270 all the INSERT queries are executed and the connections reported "Ready for query". At 1273 it starts SELECTing the Faulty Query and 1275 returns the seem-to-be incorrect result. The Tasks rows in the response are `(1,'b'),(2,'d'),(5,'a'),(7,'b')`. It looks like (6,'c') haven't landed yet. So I come to the main question: did I find a bug or my assumptions about row visibility are just wrong?
Attachment
pgsql-bugs by date: