Re: stress test for parallel workers - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: stress test for parallel workers
Date:
Msg-id: 17389.1563945314@sss.pgh.pa.us
In response to: Re: stress test for parallel workers (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: stress test for parallel workers
List: pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In any case, the evidence from the buildfarm is pretty clear that
>> there is *some* connection.  We've seen a lot of recent failures
>> involving "postmaster exited during a parallel transaction", while
>> the number of postmaster failures not involving that is epsilon.

> I don't have access to the build farm history in searchable format
> (I'll go and ask for that).

Yeah, it's definitely handy to be able to do SQL searches in the
history.  I forget whether Dunstan or Frost is the person to ask for
access, but there's no reason you shouldn't have it.

> Do you have an example to hand?  Is this failure always happening on
> Linux?

I dug around a bit further, and while my recollection of a lot of
"postmaster exited during a parallel transaction" failures is accurate,
there is a very strong correlation I'd not noticed: it's just a few
buildfarm critters that are producing those.
To wit, I find that string in these recent failures (checked all runs
in the past 3 months):

  sysname  |    branch     |      snapshot
-----------+---------------+---------------------
 lorikeet  | HEAD          | 2019-06-16 20:28:25
 lorikeet  | HEAD          | 2019-07-07 14:58:38
 lorikeet  | HEAD          | 2019-07-02 10:38:08
 lorikeet  | HEAD          | 2019-06-14 14:58:24
 lorikeet  | HEAD          | 2019-07-04 20:28:44
 lorikeet  | HEAD          | 2019-04-30 11:00:49
 lorikeet  | HEAD          | 2019-06-19 20:29:27
 lorikeet  | HEAD          | 2019-05-21 08:28:26
 lorikeet  | REL_11_STABLE | 2019-07-11 08:29:08
 lorikeet  | REL_11_STABLE | 2019-07-09 08:28:41
 lorikeet  | REL_12_STABLE | 2019-07-16 08:28:37
 lorikeet  | REL_12_STABLE | 2019-07-02 21:46:47
 lorikeet  | REL9_6_STABLE | 2019-07-02 20:28:14
 vulpes    | HEAD          | 2019-06-14 09:18:18
 vulpes    | HEAD          | 2019-06-27 09:17:19
 vulpes    | HEAD          | 2019-07-21 09:01:45
 vulpes    | HEAD          | 2019-06-12 09:11:02
 vulpes    | HEAD          | 2019-07-05 08:43:29
 vulpes    | HEAD          | 2019-07-15 08:43:28
 vulpes    | HEAD          | 2019-07-19 09:28:12
 wobbegong | HEAD          | 2019-06-09 20:43:22
 wobbegong | HEAD          | 2019-07-02 21:17:41
 wobbegong | HEAD          | 2019-06-04 21:06:07
 wobbegong | HEAD          | 2019-07-14 20:43:54
 wobbegong | HEAD          | 2019-06-19 21:05:04
 wobbegong | HEAD          | 2019-07-08 20:55:18
 wobbegong | HEAD          | 2019-06-28 21:18:46
 wobbegong | HEAD          | 2019-06-02 20:43:20
 wobbegong | HEAD          | 2019-07-04 21:01:37
 wobbegong | HEAD          | 2019-06-14 21:20:59
 wobbegong | HEAD          | 2019-06-23 21:36:51
 wobbegong | HEAD          | 2019-07-18 21:31:36
(32 rows)

We already knew that lorikeet has its own peculiar stability problems,
and these other two critters run different compilers on the same
Fedora 27 ppc64le platform.  So I think I've got to take back the
assertion that we've got some lurking generic problem.  This pattern
looks way more like a platform-specific issue.  Overaggressive OOM
killer would fit the facts on vulpes/wobbegong, perhaps, though it's
odd that it only happens on HEAD runs.

            regards, tom lane
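[Editor's aside: the per-animal skew in the 32 rows above can be confirmed with a trivial tally; this Python sketch just hard-codes the sysname counts read off the table, purely as an illustration.]

```python
from collections import Counter

# sysname values of the 32 failing runs listed in the table above
failures = (
    ["lorikeet"] * 13    # 8 HEAD + 2 REL_11 + 2 REL_12 + 1 REL9_6
    + ["vulpes"] * 7     # all HEAD
    + ["wobbegong"] * 12  # all HEAD
)

# Count failures per buildfarm animal, most frequent first
for animal, n in Counter(failures).most_common():
    print(f"{animal}: {n}")
# -> lorikeet: 13, wobbegong: 12, vulpes: 7
```

All 32 "postmaster exited" failures come from just three animals, which is the correlation the mail is pointing out.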