RE: [Proposal] Add accumulated statistics - Mailing list pgsql-hackers
From: imai.yoshikazu@fujitsu.com
Subject: RE: [Proposal] Add accumulated statistics
Msg-id: OSAPR01MB46097457DD12DF85A3C42F4794600@OSAPR01MB4609.jpnprd01.prod.outlook.com
In response to: RE: [Proposal] Add accumulated statistics ("Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com>)
Responses: RE: [Proposal] Add accumulated statistics
List: pgsql-hackers
Hi,

On Tue, Jan 15, 2019 at 1:14 AM, Tsunakawa, Takayuki wrote:
[ ... absent for a long time ]

I read the discussions of this thread. If we want wait event info, we can currently sample pg_stat_activity and obtain a pseudo total duration for each wait event. (I understand that wait event sampling approximates wait event *duration*, not wait event *count*.) It is said that, in most cases, sampling is enough to find the cause of an issue. On the other hand, it is also said that sampling misses wait events that don't spend much time in the process. Can we really say we don't need such wait events just because they seem unimportant?

Also, while we can estimate the pseudo total duration from sampling, we can't know how many times each wait event occurred, or how long it took on average each time. I think it would be better to also know the average wait event time when investigating the cause of an issue.

For example, the counts, total durations, and average times of each wait event after running pgbench, obtained from a server with my patch applied, are as follows.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 8
number of threads: 8
duration: 60 s
number of transactions actually processed: 132259
latency average = 3.630 ms
tps = 2203.909727 (including connections establishing)
tps = 2204.248691 (excluding connections establishing)

 wait_event_type |      wait_event      | calls  |   times   | times/calls
-----------------+----------------------+--------+-----------+-------------
 Client          | ClientRead           | 925732 | 115270950 |      124.52
 IO              | DataFileRead         | 220357 |   3219893 |       14.61
 LWLock          | WALWriteLock         | 195560 | 192684987 |      985.30
 IO              | DataFileWrite        | 156208 |   4440091 |       28.42
 IO              | WALWrite             |  51035 |   2925808 |       57.33
 IO              | WALSync              |  50812 |  49626240 |      976.66
 Lock            | transactionid        |   6179 |   9905727 |     1603.13
 LWLock          | buffer_content       |   5453 |    295123 |       54.12
 IO              | DataFileExtend       |   2367 |     90430 |       38.20
 LWLock          | wal_insert           |   2053 |     33620 |       16.38
 LWLock          | WALBufMappingLock    |   1301 |     23761 |       18.26
 LWLock          | CLogControlLock      |    514 |      3959 |        7.70
 LWLock          | buffer_mapping       |    162 |      1313 |        8.10
 LWLock          | XidGenLock           |    153 |      2399 |       15.68
 LWLock          | lock_manager         |    122 |      5051 |       41.40
 Lock            | tuple                |    105 |    114632 |     1091.73
 LWLock          | ProcArrayLock        |     99 |      1828 |       18.46
 LWLock          | CheckpointerCommLock |     33 |       854 |       25.88
 Lock            | extend               |     24 |     20493 |      853.88
 IO              | RelationMapRead     |     22 |        71 |        3.23
 LWLock          | buffer_io            |     11 |      7398 |      672.55
 IO              | SLRUWrite            |      4 |       249 |       62.25
 IO              | SLRUSync             |      4 |         0 |        0.00
 IPC             | ClogGroupUpdate      |      3 |       223 |       74.33
 LWLock          | SyncRepLock          |      1 |         1 |        1.00

We can see that WALWriteLock has the longest total duration (times), with ClientRead next. We might get this much from sampling. But we can also see that WALWriteLock's long total duration comes from its high average time, while ClientRead's total duration comes from its many calls. We can further see that transactionid, tuple, WALWriteLock, WALSync, and extend have longer average times than the others.
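The times/calls column is simply the accumulated wait time divided by the call count. A minimal Python sketch, with a few rows from the table above hard-coded (the time unit is whatever the patched server accumulates), shows the derivation:

```python
# Recompute the times/calls column for a few rows of the pgbench
# wait-event table above. "times" is the accumulated wait time
# reported by the patched server; "calls" is the wait event count.
rows = [
    # (wait_event_type, wait_event, calls, times)
    ("Client", "ClientRead",    925732, 115270950),
    ("LWLock", "WALWriteLock",  195560, 192684987),
    ("IO",     "WALSync",        50812,  49626240),
    ("Lock",   "transactionid",   6179,   9905727),
]

for event_type, event, calls, times in rows:
    avg = times / calls  # average wait time per call
    print(f"{event_type:7s} {event:14s} {calls:7d} {times:10d} {avg:8.2f}")
```

Running this reproduces the averages in the table, e.g. 985.30 for WALWriteLock, which is the kind of per-call figure sampling alone cannot give.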
If we want to improve performance even just a bit, such info is useful, I think.

The patch proposed in this thread was not accepted because it puts overhead on the server/backend. The functions that measure the wait event time, gettimeofday and clock_gettime, take time to execute, and since they sit on a frequently called path (pgstat_report_wait_start/end), a lot of overhead is expected. Even if we only count the wait events, there would be overhead, as pointed out in another thread. [1]

The overhead induced by collecting wait event info has been discussed since long ago, but I couldn't find actual measurement results, so I want to measure the overhead myself.

[1] https://www.postgresql.org/message-id/CA%2BTgmobf1NJD%2B_DfQG5qccG5YFSnxk3CgC2mh0-UHabznCQtYA%40mail.gmail.com

--
Yoshikazu Imai
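The kind of overhead measurement discussed above can be illustrated with a rough Python sketch that times back-to-back clock reads. On the server the call would be clock_gettime()/gettimeofday() in C on the pgstat_report_wait_start/end path, so the absolute numbers differ; this only shows the method, not the real cost.

```python
import time

def clock_read_overhead(n=100_000):
    """Estimate the average cost of one clock read by timing a loop of
    back-to-back time.perf_counter_ns() calls. This mimics, at the
    Python level, the per-call cost that a clock_gettime() call would
    add on a hot path such as pgstat_report_wait_start/end."""
    start = time.perf_counter_ns()
    for _ in range(n):
        time.perf_counter_ns()
    elapsed = time.perf_counter_ns() - start
    return elapsed / n  # nanoseconds per clock read

print(f"~{clock_read_overhead():.0f} ns per clock read")
```

Multiplying the per-read cost by twice the call count (start and end of each wait) gives a first estimate of the timing overhead such a patch would add.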