Re: Automatically sizing the IO worker pool - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Re: Automatically sizing the IO worker pool
Date:
Msg-id: CA+hUKGJULUTkT2LpeHTSt3KHbJrYNBT-kj1-OhMRV_PnUQ_57A@mail.gmail.com
In response to: Automatically sizing the IO worker pool (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: Automatically sizing the IO worker pool
List: pgsql-hackers
On Sun, May 25, 2025 at 7:20 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> On Sun, Apr 13, 2025 at 04:59:54AM GMT, Thomas Munro wrote:
> > It's hard to know how to set io_workers=3. If it's too small,
> > io_method=worker's small submission queue overflows and it silently
> > falls back to synchronous IO. If it's too high, it generates a lot of
> > pointless wakeups and scheduling overhead, which might be considered
> > an independent problem or not, but having the right size pool
> > certainly mitigates it. Here's a patch to replace that GUC with:
> >
> > io_min_workers=1
> > io_max_workers=8
> > io_worker_idle_timeout=60s
> > io_worker_launch_interval=500ms
> >
> > It grows the pool when a backlog is detected (better ideas for this
> > logic welcome), and lets idle workers time out.
>
> I like the idea. In fact, I've been pondering something like a "smart"
> configuration for quite some time, and am convinced that a similar
> approach needs to be applied to many performance-related GUCs.
>
> Idle timeout and launch interval serving as a measure of sensitivity
> makes sense to me, but growing the pool when a backlog (queue_depth >
> nworkers, so even the slightest backlog?) is detected seems somewhat
> arbitrary. From what I understand, the pool's growth velocity is
> constant and does not depend on worker demand (i.e. queue_depth)? It
> may sound fancy, but I have the impression it should be possible to
> apply what's called a "low-pass filter" in control theory (a sort of
> transfer function with exponential decay) to smooth out the demand and
> adjust the worker pool based on that.
> As a side note, it might be far-fetched, but there are instruments in
> queueing theory to figure out how many workers are needed to guarantee
> a certain low queueing probability, but for that one needs to have an
> average arrival rate (in our case, the average number of IO operations
> dispatched to workers) and an average service rate (the average number
> of IO operations performed by workers).

Hi Dmitry,

Thanks for looking, and yeah, these are definitely the right sort of questions. I will be both unsurprised and delighted if someone can bring some more science to this problem. I did read up on the Erlang C formula ("This formula is used to determine the number of agents or customer service representatives needed to staff a call centre, for a specified desired probability of queuing", according to Wikipedia) and a bunch of related textbook material. And yeah, I had a bunch of exponential moving averages of various values, using scaled fixed-point arithmetic (just a bunch of shifts and adds) to smooth inputs, in various attempts. But... I'm not even sure we can say that our I/O arrivals have a Poisson distribution, since they are not all independent.

I tried more things too, while I was still unsure what I should even be optimising for. My current answer to that is: low latency with low variance, as seen with io_uring. In this version I went back to basics and built something that looks more like the controls of a classic process/thread pool (think Apache) or connection pool (think JDBC), with a couple of additions based on intuition: (1) a launch interval, which acts as a bit of damping against overshooting on brief bursts that are too far apart, and (2) the queue length > workers * k test as a simple way to determine that latency is being introduced by not having enough workers.
Perhaps there is a good way to compute an adaptive value for k with some fancy theories, but k=1 seems to have *some* basis: it's the lowest value at which the pool is too small and *certainly* introducing latency, while any lower constant is harder to defend because we don't know how many workers are already awake and about to consume tasks. Something from queueing theory might provide an adaptive value, but in the end I figured we really just want to know whether the queue is growing, i.e. in danger of overflowing (note: the queue is small! 64 entries, and not currently changeable, more on that later, and the overflow behaviour is synchronous I/O as back-pressure). You seem to be suggesting that k=1 sounds too low, not too high, but there is that separate time-based defence against overshoot in response to rare bursts.

You could get more certainty about jobs already about to be consumed by a worker that is about to dequeue by doing a lot more bookkeeping: assigning them to workers on submission (separate states, separate queues, various other ideas, I guess). But everything I tried like that caused latency or latency variance to go up, because it missed out on the chance for another worker to pick a job up sooner opportunistically. This arrangement has the most stable and predictable pool size and the lowest avg latency and stddev(latency) I have managed to come up with so far. That said, we have plenty of time to experiment with better ideas if you want to give it a shot or propose concrete ideas, given that I missed v18 :-)

About control theory... yeah. That's an interesting bag of tricks. FWIW Melanie and (more recently) I have looked into textbook control algorithms at a higher level of the I/O stack (and Melanie gave a talk about other applications, e.g. in VACUUM, at pgconf.dev). In read_stream.c, where I/O demand is created, we've been trying to set the desired I/O concurrency level, and thus lookahead distance, with adaptive feedback. We've tried a lot of stuff.
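[To make the two intuition-based rules concrete, here is a minimal sketch of the growth decision as described in the thread: launch another worker only when the backlog exceeds k * nworkers (k = 1) AND at least io_worker_launch_interval has elapsed since the last launch. Names and structure are hypothetical, not the actual patch code.]

```c
/*
 * Sketch of the pool-growth rule described above (hypothetical names,
 * not the committed implementation): queue_depth > nworkers signals
 * that latency is certainly being added, and the launch interval damps
 * overshoot on brief, widely spaced bursts.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct WorkerPoolState
{
	int			nworkers;
	int64_t		last_launch_time;	/* milliseconds */
} WorkerPoolState;

static const int io_max_workers = 8;
static const int64_t io_worker_launch_interval = 500;	/* ms */

static bool
should_launch_worker(WorkerPoolState *pool, int queue_depth, int64_t now)
{
	if (pool->nworkers >= io_max_workers)
		return false;
	/* k = 1: a backlog deeper than the worker count certainly adds latency */
	if (queue_depth <= pool->nworkers)
		return false;
	/* damping: don't overshoot in response to rare bursts */
	if (now - pool->last_launch_time < io_worker_launch_interval)
		return false;
	pool->last_launch_time = now;
	pool->nworkers++;
	return true;
}
```

Idle-timeout shrinking (io_worker_idle_timeout) would be the symmetric rule on the worker side: a worker that has dequeued nothing for that long, while the pool is above io_min_workers, exits.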
I hope we can share some concept patches some time soon, well, maybe in this cycle. Some interesting recent experiments produced graphs that look a lot like the ones in the book "Feedback Control for Computer Systems" (an easy software-person book I found for people without an engineering/control theory background, where the problems match our world more closely, cf. typical texts that are about controlling motors and other mechanical stuff...). The experimental goals are: find the smallest concurrent I/O request level (and thus lookahead distance, and thus speculative work done and buffers pinned) that keeps the I/O stall probability near zero (and keep adapting, since other queries and applications are sharing the system's I/O queues), and if that's not even possible, find the highest concurrent I/O request level that doesn't cause extra latency due to queuing at lower levels (I/O workers, kernel, ..., disks). That second part is quite hard.

In other words, if the higher levels own that problem and bring the adaptivity, then perhaps io_method=worker can get away with being quite stupid. Just a thought...