pg_stat_statements with query tree based normalization - Mailing list pgsql-hackers
From: Greg Smith
Subject: pg_stat_statements with query tree based normalization
Msg-id: 4EC09C36.9000706@2ndQuadrant.com
Responses: Re: pg_stat_statements with query tree based normalization
List: pgsql-hackers
Attached is a patch and associated test program that adds query normalization to pg_stat_statements, based on transforming the query tree into a series of integers and using them to match against previous queries.

Currently pg_stat_statements only works usefully if all queries executed use prepared statements. This is a troublesome limitation for a couple of reasons. Prepared statements can easily have performance regressions compared to the unprepared version, since the optimizer won't necessarily be able to use as much data about MCVs etc. as in the regular case. People regularly avoid them for that performance reason, and there are ORM limitations here too; someone was telling me the other day that Ruby, as one example, only recently added prepared statement support. That means that instead of using the existing pg_stat_statements extension, most sites have to log statements to a text file and then parse those logs using programs like pgfouine. The whole process is inefficient, makes the database look clumsy, and is a major source of complaints about PostgreSQL. In many shared/managed hosting environments, looking at things exposed at the database level is also vastly easier than getting at the log files on the server. There are several legitimate complaints here that make this problem rise to an advocacy/adoption issue as far as I'm concerned.

Code and this whole idea are by Peter Geoghegan, with me as first round reviewer and diff minimizing nitpicker. Thanks to attendees and sponsors of the PgWest conference for helping to fund initial exploration of server side query normalization using regular expressions. The more complicated query tree approach used here was sponsored by Heroku. Hurray for parallel development tracks; I was happy to take my regex idea to the bike shed and put it out of its misery once this one worked.

Here's a quick demo based on the existing sample in the docs.
Build the server and the contrib modules for pg_stat_statements and pgbench, then edit postgresql.conf to add:

    shared_preload_libraries = 'pg_stat_statements'
    pg_stat_statements.max = 10000
    pg_stat_statements.track = all

Then test like this using pgbench:

    createdb bench
    psql -d bench -c "CREATE EXTENSION pg_stat_statements"
    pgbench -i bench
    psql -d bench -c "SELECT pg_stat_statements_reset()"
    pgbench -c10 -t30 bench
    psql -x -d bench -c "SELECT query, calls, total_time, rows
      FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;"

Unlike the example in the docs, there's no "-M prepared" here, but as hoped all the similar statements are grouped together anyway. No log file parsing or regexes are necessary; the output looks like this:

    -[ RECORD 1 ]------------------------------------------------------------------
    query      | UPDATE pgbench_branches SET bbalance = bbalance + ? WHERE bid = ?;
    calls      | 300
    total_time | 1.084803
    rows       | 300
    -[ RECORD 2 ]------------------------------------------------------------------
    query      | UPDATE pgbench_tellers SET tbalance = tbalance + ? WHERE tid = ?;
    calls      | 300
    total_time | 0.875213
    rows       | 300
    ...

Interestingly, it even showed vacuum running against pgbench_branches in my test case.

I've been focused on usability testing, and we could really use someone familiar with the query node internals to review those changes too. Probably the most controversial detail here is exactly how the "query" field is generated, and that's the newest code too. I feel strongly that we need a stable string there for common use cases, so that successive views of this data across time, or even across server restarts, will have the same key for a given query's name. That's what administrators really expect this sort of feature to do. I consider it a feature that you can distinguish this form of normalization from what you get with a proper prepared statement, where these parameters would be named "$1" etc.
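To make the matching idea from the start of this mail concrete, here is a rough Python sketch of fingerprinting a query tree: walk the tree, emit one integer per node type, deliberately ignore the values inside constants, and hash the resulting sequence. This is only an illustration of the concept; the patch does this in C against real PostgreSQL parse trees, and the toy tuple-based tree shape and node names below are invented for the example.

```python
import hashlib

def fingerprint(node, out=None):
    """Flatten a toy query tree into a list of integers, one per node
    tag, skipping the values stored inside Const nodes."""
    if out is None:
        out = []
    tag = node[0]
    out.append(hash(tag) & 0xFFFFFFFF)  # a stable-enough integer per node type
    if tag != "Const":                  # a Const's actual value is ignored
        for child in node[1:]:
            if isinstance(child, tuple):
                fingerprint(child, out)
    return out

def query_key(tree):
    """Hash the integer sequence into a single key used for matching."""
    data = b"".join(n.to_bytes(4, "little") for n in fingerprint(tree))
    return hashlib.md5(data).hexdigest()

# Toy trees: (node_tag, children...) tuples.  Two statements that differ
# only in their constants should produce the same key.
q1 = ("Update", ("Var", "bbalance"), ("OpExpr", ("Var", "bbalance"), ("Const", 10)))
q2 = ("Update", ("Var", "bbalance"), ("OpExpr", ("Var", "bbalance"), ("Const", 99)))
q3 = ("Select", ("Var", "bbalance"))

print(query_key(q1) == query_key(q2))  # same shape, different constants -> True
print(query_key(q1) == query_key(q3))  # different statement shape -> False
```

The point of hashing node tags rather than query text is exactly why this approach beats the regex one: whitespace, literal formatting, and comments never reach the fingerprint at all.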
There are surely some edge cases that will need working around here; I haven't really dug into looking for those yet. The approach Peter used adds a single integer to the Const structure, in order to have enough information to substitute "?" in place of those constants. Adding and maintaining that field is the only change made outside of the extension, and that overhead is paid by everyone, not just consumers of this new code. Neither of us likes that, but if we're going to grow something, Const is not a bad choice; it's not that slim of a structure already. There are seven int-sized fields, a Datum, and two booleans in there. It's going from about 44 bytes to 48, so maybe a 10% size increase in that one node type. I can't even measure a performance change on simple pgbench tests; run variation is way bigger. It might be easier to measure if your query were mostly Const nodes, now that I think about it. If someone can come up with another way to implement this that avoids that overhead, obviously we'd do whatever we could to eliminate it. I don't think it's unreasonable server bloat given how valuable this feature is.

As I recently mentioned in the other thread that whacks around this extension, it's possible to install and use most of this code as an extension; you just don't get this part, and accordingly the query text is less stable. That's not a core code consideration, but presuming it doesn't complicate the 9.2 implementation, I hear enough requests for this feature that I think it is a useful bonus for the community. If some form of this gets committed, I'd expect the updated pg_stat_statements to be fairly popular as an extension, packaged to add this to 8.4 through 9.1 servers.

Back to the code...an extensive regression testing program for this new feature is included, which has made it much easier to modify the underlying implementation without breaking it.
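As an aside on the substitution mechanism described above, here is a Python sketch of how recorded constant positions can drive rewriting the query text with "?" placeholders. The (offset, length) bookkeeping is an assumption made for this illustration; the real patch stores only a location on each Const node, in C, and the extension works out the extent of each literal itself.

```python
def canonicalize(query_text, const_spans):
    """Replace each recorded constant literal with '?' in the query text.

    const_spans: list of (offset, length) pairs marking where constant
    literals appeared in the original string.  This pair-based layout is
    an assumption for the sketch, not the patch's actual representation.
    """
    out = []
    last = 0
    for offset, length in sorted(const_spans):
        out.append(query_text[last:offset])  # text up to the literal
        out.append("?")                      # the placeholder
        last = offset + length               # skip over the literal itself
    out.append(query_text[last:])
    return "".join(out)

sql = "UPDATE pgbench_branches SET bbalance = bbalance + 42 WHERE bid = 7;"
# Spans of the two literals: "42" at offset 50, "7" at offset 65.
print(canonicalize(sql, [(50, 2), (65, 1)]))
# -> UPDATE pgbench_branches SET bbalance = bbalance + ? WHERE bid = ?;
```

Because the substitution happens in the original query text rather than by deparsing the tree, the stored string keeps the user's own spelling of the statement, which is part of why the result stays stable across time.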
(If David Wheeler has read this far down, he's now laughing at me.) It verifies that many different statements which should be considered the same actually are. Running the program requires Python, psycopg2, and the Dell Store 2 database: http://pgfoundry.org/forum/forum.php?forum_id=603 Exactly how much of this testing should instead be done using the PostgreSQL regression tests is one of the things I don't have a good recommendation on yet.

The latest version of this patch is at https://github.com/greg2ndQuadrant/postgres/tree/pg_stat_statements_norm with their neato diff view at https://github.com/greg2ndQuadrant/postgres/compare/master...pg_stat_statements_norm That currently has only minor cleanup drift from Peter's development tree at https://github.com/Peter2ndQuadrant/postgres/branches which is the place to watch for updates. The git repo includes both the changes and the regression test program, but I don't expect the test program to be committed in core.

There are some known bugs/limitations that Peter and I have found in recent testing, all of which are fixable given a bit more development time. Just to really invert the status quo, my testing showed this doesn't report correct numbers when using prepared statements right now. Peter tells me the code presently assumes that there is one call to the planner hook followed by calls to the normal executor hooks, with each set of calls corresponding to one statement; that assumption falls down when prepared statements are used. Tom mentioned something about synchronization issues in this area on the other pg_stat_statements thread too, so it's not a surprise so much as a known sticky point in the implementation. Peter already has a fix in mind for this; I wanted to get community feedback moving rather than block waiting for it. Timestamp literals aren't normalized properly yet either; that just slipped past the submission deadline.
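The equivalence-checking structure of that test program can be sketched like this. The `normalize` stand-in below masks literals with a regex purely so the harness logic can run without a server; it is exactly the crude approach the query-tree work replaces, and the statement texts are invented examples in the Dell Store schema's spirit. The real program instead runs each group against a live database via psycopg2 and compares what pg_stat_statements itself reports.

```python
import re

def normalize(sql):
    """Crude stand-in normalizer: mask quoted-string and numeric
    literals with '?'.  Only here so the harness is self-contained;
    not what the patch or the real test program does."""
    sql = re.sub(r"'[^']*'", "?", sql)
    sql = re.sub(r"\b\d+\b", "?", sql)
    return sql

# Groups of statements that should all collapse to one normalized entry.
equivalence_groups = [
    ["SELECT * FROM products WHERE price > 10",
     "SELECT * FROM products WHERE price > 250"],
    ["UPDATE customers SET city = 'Austin' WHERE customerid = 5",
     "UPDATE customers SET city = 'Boston' WHERE customerid = 77"],
]

for group in equivalence_groups:
    keys = {normalize(stmt) for stmt in group}
    assert len(keys) == 1, f"group did not collapse to one key: {keys}"
print("all groups collapsed to a single key")
```

Organizing the cases as equivalence groups is what made it easy to keep changing the underlying implementation without breaking the expected behavior.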
There are some DEBUG1 logging statements still around that Peter wanted to kill before submission; I thought they should stay for initial review at least.

As for documentation, I consider this so valuable to server operation that I want to mention it beyond just the notes in contrib on how to use it. I'm still muddling over what I want to say there. I also wonder if it makes sense to allow disabling this feature, for people who want the rest of pg_stat_statements but don't want to pay for this part. The main use case I could see for that is people who are using prepared statements and are happy with the existing implementation. Again, I didn't want to dive too deep into measuring overhead when it's quite possible we'll get feedback that requires rework, making that measurement redundant.

Suggestions on anything from the usage to the implementation approach are welcome. As I'm sure is clear by the end of this long commentary, this is considered an important feature here.

--
Greg Smith   2ndQuadrant US   greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.us