Re: Caching of Queries - Mailing list pgsql-performance
From | Josh Berkus
---|---
Subject | Re: Caching of Queries
Date |
Msg-id | 200410011010.40705.josh@agliodbs.com
In response to | Re: Caching of Queries (Matt Clark <matt@ymogen.net>)
Responses | Re: Caching of Queries; Re: Caching of Queries
List | pgsql-performance
People:

Transparent "query caching" is the "industry standard" for how these things are handled. However, Postgres' lack of this feature has made me consider other approaches, and I'm starting to wonder if the "standard" query caching -- where a materialized query result, or some reduction thereof, is cached in database memory -- isn't the best way to cache things. I'm going to abbreviate it "SQC" for the rest of this e-mail.

Obviously, the draw of SQC is its transparency to developers. With it, the Java/Perl/PHP programmers and the DBA don't have to communicate at all -- you set it up, give it some RAM, and it "just works". As someone who frequently has to consult based on limited knowledge, I can understand the appeal.

However, one of the problems with SQC, aside from the ones already mentioned of stale data and/or cache-clearing, is that (at least in implementations like MySQL's) it is indiscriminate and caches, at least briefly, unique queries as readily as common ones. Possibly Oracle's implementation is more sophisticated; I've not had an opportunity to check.

The other half of that problem is that an entire query is cached, rather than just the data needed to uniquely identify the request to the application. This is bad in two respects: first, the entire query must be parsed to see whether a new query is materially equivalent to a cached one; second, two materially different queries which could use overlapping ranges of the same underlying result set must cache their results separately, eating up yet more memory.

To explain what I'm talking about, let me give you a counter-example of another approach.

I have a data-warehousing application with a web front-end. The data in the application is quite extensive and complex, and only a summary is presented to the public users -- but that summary is a query involving about 30 lines and 16 joins. This summary information is available in 3 slightly different forms. Further, the client has indicated that up to a 1/2-hour delay in data "freshness" is acceptable.

The first step is forcing that "materialized" view of the data into memory. Right now I'm working on a reliable way to do that without using Memcached, which won't install on our Solaris servers. Temporary tables have the annoying property of being per-connection, which doesn't work in a pool of 60 connections.

The second step, which I completed first due to the lack of technical obstacles, was to replace all queries against this data with calls to a Set-Returning Function (SRF). This allowed me to redirect where the data was coming from -- presumably the same thing could be done through RULES, but it would have been considerably harder to implement.

The first thing the SRF does is check the criteria passed to it against a set of criteria cached in a table for that user's permission level and less than 1/2 hour old. If an identical set of criteria is found, the SRF fetches the cached set of row identifiers for the materialized view (MV), looks those rows up in the MV, and returns them to the web client. If no identical set of criteria is found, the query is run to get a set of identifiers, which are then cached, and the SRF returns the queried rows.
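To make that flow concrete, here is a minimal sketch of such an SRF in modern PL/pgSQL (features like RETURN QUERY postdate the 2004-era Postgres this was written against). Every name in it -- summary_mv standing in for the materialized view, criteria_cache, get_summary -- is hypothetical, and the one-line LIKE clause stands in for the real 30-line, 16-join query:

```sql
-- Hypothetical schema: summary_mv plays the materialized view,
-- criteria_cache holds cached criteria plus their identifier sets.
CREATE TABLE summary_mv (
    id      integer PRIMARY KEY,
    payload text                       -- stand-in for the ~30 summary columns
);

CREATE TABLE criteria_cache (
    criteria   text        NOT NULL,   -- normalized search criteria
    user_level integer     NOT NULL,   -- caller's permission level
    ids        integer[]   NOT NULL,   -- identifier set into summary_mv
    row_count  integer     NOT NULL,   -- cached count (advantage 3 below)
    cached_at  timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (criteria, user_level)
);

CREATE FUNCTION get_summary(p_criteria text, p_level integer)
RETURNS SETOF summary_mv
LANGUAGE plpgsql AS $$
DECLARE
    v_ids integer[];
BEGIN
    -- 1. Look for an identical set of criteria, at the same permission
    --    level, cached less than half an hour ago.
    SELECT ids INTO v_ids
      FROM criteria_cache
     WHERE criteria   = p_criteria
       AND user_level = p_level
       AND cached_at  > now() - interval '30 minutes';

    IF NOT FOUND THEN
        -- 2. Cache miss: run the expensive query once (the trivial LIKE
        --    stands in for the real 16-join query), then cache the
        --    resulting identifier set and its row count.
        SELECT coalesce(array_agg(id), '{}') INTO v_ids
          FROM summary_mv
         WHERE payload LIKE p_criteria;

        DELETE FROM criteria_cache
         WHERE criteria = p_criteria AND user_level = p_level;
        INSERT INTO criteria_cache (criteria, user_level, ids, row_count)
        VALUES (p_criteria, p_level, v_ids,
                coalesce(array_length(v_ids, 1), 0));
    END IF;

    -- 3. Hit or miss, the rows themselves come straight from the MV.
    RETURN QUERY
        SELECT m.* FROM summary_mv m WHERE m.id = ANY (v_ids);
END;
$$;
```

A production version would also need to handle two concurrent cache misses racing on the primary key (a lock or an upsert) and to expire stale rows, both of which the sketch omits.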
Once I surmount the problem of storing all the caching information in protected memory, the advantages of this approach over SQC are several:

1) The materialized data is available in 3 different forms: a list, a detail view, and a spreadsheet. Each form has somewhat different columns and different rules about ordering, which would likely confuse an SQC planner. In this implementation, all 3 forms are able to share the same cache.

2) The application is comparing only sets of unambiguous criteria, rather than long queries which would need to be compared in planner form in order to determine query equivalence.

3) With the identifier sets, we are able to cache other information as well, such as a count of rows, further limiting the number of queries we must run.

4) This approach is ideally suited to the pagination and re-sorting common to a web result set. As only the identifiers are cached, the results can be re-sorted and broken into pages after the cache read, a fast, all-in-memory operation (see the sketch in the P.S. below).

In conclusion, what I'm saying is that while forms of transparent query caching (plan, materialized or whatever) may be desirable for other reasons, it's quite possible to achieve a superior level of "query caching" through tight integration with the front-end application.

If people are interested in this, I'd love to see some suggestions on ways to force the materialized view into dedicated memory.

--
Josh Berkus
Aglio Database Solutions
San Francisco
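P.S. To make point 4 concrete, here is a sketch against the same hypothetical tables as above (criteria string and page size invented for illustration). Re-sorting and paging run against the cached identifier set and the MV, never against the 16-join query:

```sql
-- Page 3 (20 rows per page) of a cached result set, re-sorted by a
-- column other than the one the identifiers were gathered under.
SELECT m.*
  FROM summary_mv m
  JOIN criteria_cache c ON m.id = ANY (c.ids)
 WHERE c.criteria   = 'client=acme'   -- hypothetical criteria string
   AND c.user_level = 2
 ORDER BY m.payload                   -- any column; applied after the cache read
 LIMIT 20 OFFSET 40;

-- Advantage 3: the cached count answers "page X of Y" with no extra scan.
SELECT row_count
  FROM criteria_cache
 WHERE criteria = 'client=acme' AND user_level = 2;
```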