Thread: Guidelines on best indexing strategy for varying searches on 20+ columns
Guidelines on best indexing strategy for varying searches on 20+ columns
From
Niels Kristian Schjødt
Date:
Hi, I’m running a search engine for cars. It’s backed by a PostgreSQL 9.3 installation.

Now I’m unsure about the best approach/strategy for doing index optimization for the frontend search.

The problem:

The table containing the cars holds around 1.5 million rows. People who search for cars need different criteria to search by. Some search by brand/model, some by year, some by mileage, some by price and some by special equipment etc. etc. - and often they combine a whole bunch of criteria together. Of course some, like brand/model and price, are used more frequently than others. In total we offer: 9 category criteria like brand/model or body type, plus 5 numeric criteria like price or mileage, plus 12 boolean criteria like equipment. Lastly, people can order the results by different columns (year, price, mileage and a score we create about the cars). By default we order by our own generated score.

What I’ve done so far:

I have analyzed the usage of the criteria “lightly”, and created a few indexes (10). Among those are e.g. indexes on price, mileage and a combined index on brand/model. Since we are only interested in showing results for cars which are actually for sale, the indexes are made as partial indexes on a sales state column.

Questions:

1. How would you go about analyzing and determining what columns should be indexed, and how?
2. What is the best strategy when optimizing indexes for searches happening on 20+ columns, where the use and the combinations vary a lot? (To just index everything, to index some of the columns, to do combined indexes, to only do single column indexes etc. etc.)
3. I expect that it does not make sense to index all columns?
4. I expect it does not make sense to index boolean columns?
5. Is it better to do a combined index on 5 frequently used columns rather than having individual indexes on each of them?
6. Would it be a good idea to have all indexes sorted by my default sorting?
7. Do you have some experiences with other approaches that could greatly improve performance (e.g. forcing indexes to stay in memory etc.)?
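[Editor's note: for illustration, a minimal sketch of the kind of partial indexes described above. The table and column names (cars, price, brand, model, sales_state) and the 'for_sale' value are assumptions, not the poster's actual schema.]

    -- hypothetical names: cars, price, brand, model, sales_state
    CREATE INDEX cars_price_for_sale_idx
        ON cars (price)
        WHERE sales_state = 'for_sale';

    CREATE INDEX cars_brand_model_for_sale_idx
        ON cars (brand, model)
        WHERE sales_state = 'for_sale';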
Re: Guidelines on best indexing strategy for varying searches on 20+ columns
From
Merlin Moncure
Date:
On Wed, Jun 25, 2014 at 3:48 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
> Hi,
> I’m running a search engine for cars. It’s backed by a PostgreSQL 9.3 installation.
>
> Now I’m unsure about the best approach/strategy for doing index optimization for the frontend search.
>
> The problem:
>
> The table containing the cars holds around 1.5 million rows. People who search for cars need different criteria to search by. Some search by brand/model, some by year, some by mileage, some by price and some by special equipment etc. etc. - and often they combine a whole bunch of criteria together. Of course some, like brand/model and price, are used more frequently than others. In total we offer: 9 category criteria like brand/model or body type, plus 5 numeric criteria like price or mileage, plus 12 boolean criteria like equipment. Lastly, people can order the results by different columns (year, price, mileage and a score we create about the cars). By default we order by our own generated score.
>
> What I’ve done so far:
>
> I have analyzed the usage of the criteria “lightly”, and created a few indexes (10). Among those are e.g. indexes on price, mileage and a combined index on brand/model. Since we are only interested in showing results for cars which are actually for sale, the indexes are made as partial indexes on a sales state column.
>
> Questions:
>
> 1. How would you go about analyzing and determining what columns should be indexed, and how?

Mainly frequency of access.

> 2. What is the best strategy when optimizing indexes for searches happening on 20+ columns, where the use and the combinations vary a lot? (To just index everything, to index some of the columns, to do combined indexes, to only do single column indexes etc. etc.)

Don't make 20 indexes. Consider installing pg_trgm (for optimized LIKE searching) or hstore (for optimized key/value searching) and then using GiST/GIN for multiple-attribute search. With 9.4 we have another fancy technique to explore: jsonb searching via GiST/GIN.

> 3. I expect that it does not make sense to index all columns?

Well, maybe. If you only ever search one column at a time, then it might make sense. But if you need to search arbitrary criteria and frequently combine a large number, then no -- particularly if your dataset is very large and individual criteria are not very selective.

> 4. I expect it does not make sense to index boolean columns?

In general, no. An important exception is if you are only interested in true or false and the number of records that have that interesting value is tiny relative to the size of the table. In that case, a partial index can be used for massive optimization.

> 5. Is it better to do a combined index on 5 frequently used columns rather than having individual indexes on each of them?

Only if you search those 5 columns together a significant portion of the time.

> 6. Would it be a good idea to have all indexes sorted by my default sorting?

Index order rarely matters. If you always search values backwards and the table is very large, you may want to consider it. Unfortunately this often doesn't work for composite indexes, so sometimes we must explore the old-school technique of reversing the value.

> 7. Do you have some experiences with other approaches that could greatly improve performance (e.g. forcing indexes to stay in memory etc.)?

As noted above, fancy indexing is the first place to look. Start with pg_trgm (for LIKE optimization), hstore, and the new JSON stuff.

The big limitation you will hit is that most index strategies, at least for the prepackaged stuff, will support '=' or partial string matching (particularly with pg_trgm LIKE), but not > or <: for range operations you have to post-process the search or try to work the index from another angle.

merlin
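[Editor's note: a rough sketch of the pg_trgm and hstore ideas mentioned above. The column names ("model" as text, "equipment" as hstore) are assumptions made for illustration.]

    -- pg_trgm: a trigram GIN index so LIKE/ILIKE searches can use an index
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX cars_model_trgm_idx ON cars USING gin (model gin_trgm_ops);
    SELECT count(*) FROM cars WHERE model ILIKE '%golf%';

    -- hstore: one GIN index covering many key/value equipment attributes
    CREATE EXTENSION IF NOT EXISTS hstore;
    CREATE INDEX cars_equipment_idx ON cars USING gin (equipment);
    SELECT count(*) FROM cars WHERE equipment @> 'sunroof=>true, towbar=>true';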
Re: Guidelines on best indexing strategy for varying searches on 20+ columns
From
Niels Kristian Schjødt
Date:
Thanks for your suggestions, very useful. See comments inline.

On 25/06/2014 at 23.48, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Jun 25, 2014 at 3:48 AM, Niels Kristian Schjødt
> <nielskristian@autouncle.com> wrote:
>> Hi,
>> I’m running a search engine for cars. It’s backed by a PostgreSQL 9.3 installation.
>>
>> Now I’m unsure about the best approach/strategy for doing index optimization for the frontend search.
>>
>> The problem:
>>
>> The table containing the cars holds around 1.5 million rows. People who search for cars need different criteria to search by. Some search by brand/model, some by year, some by mileage, some by price and some by special equipment etc. etc. - and often they combine a whole bunch of criteria together. Of course some, like brand/model and price, are used more frequently than others. In total we offer: 9 category criteria like brand/model or body type, plus 5 numeric criteria like price or mileage, plus 12 boolean criteria like equipment. Lastly, people can order the results by different columns (year, price, mileage and a score we create about the cars). By default we order by our own generated score.
>>
>> What I’ve done so far:
>>
>> I have analyzed the usage of the criteria “lightly”, and created a few indexes (10). Among those are e.g. indexes on price, mileage and a combined index on brand/model. Since we are only interested in showing results for cars which are actually for sale, the indexes are made as partial indexes on a sales state column.
>>
>> Questions:
>>
>> 1. How would you go about analyzing and determining what columns should be indexed, and how?
>
> Mainly frequency of access.
>
>> 2. What is the best strategy when optimizing indexes for searches happening on 20+ columns, where the use and the combinations vary a lot? (To just index everything, to index some of the columns, to do combined indexes, to only do single column indexes etc. etc.)
>
> Don't make 20 indexes. Consider installing pg_trgm (for optimized
> LIKE searching) or hstore (for optimized key/value searching) and then
> using GiST/GIN for multiple-attribute search. With 9.4 we have
> another fancy technique to explore: jsonb searching via GiST/GIN.

Interesting, do you have any good resources on this approach?

>> 3. I expect that it does not make sense to index all columns?
>
> Well, maybe. If you only ever search one column at a time, then it
> might make sense. But if you need to search arbitrary criteria and
> frequently combine a large number, then no -- particularly if your
> dataset is very large and individual criteria are not very selective.

So, just to clarify: I’m often combining a large number of search criteria, and the individual criteria are often not very selective. In that case, are you arguing for or against indexing all columns? :-)

>> 4. I expect it does not make sense to index boolean columns?
>
> In general, no. An important exception is if you are only interested
> in true or false and the number of records that have that interesting
> value is tiny relative to the size of the table. In that case, a
> partial index can be used for massive optimization.

Thanks, I hadn’t been thinking about using partial indexes here as an option.

>> 5. Is it better to do a combined index on 5 frequently used columns rather than having individual indexes on each of them?
>
> Only if you search those 5 columns together a significant portion of the time.
>
>> 6. Would it be a good idea to have all indexes sorted by my default sorting?
>
> Index order rarely matters. If you always search values backwards and
> the table is very large, you may want to consider it. Unfortunately
> this often doesn't work for composite indexes, so sometimes we must
> explore the old-school technique of reversing the value.
>
>> 7. Do you have some experiences with other approaches that could greatly improve performance (e.g. forcing indexes to stay in memory etc.)?
>
> As noted above, fancy indexing is the first place to look. Start
> with pg_trgm (for LIKE optimization), hstore, and the new JSON stuff.
>
> The big limitation you will hit is that most index strategies, at
> least for the prepackaged stuff, will support '=' or partial string
> matching (particularly with pg_trgm LIKE), but not > or <: for range
> operations you have to post-process the search or try to work the
> index from another angle.
>
> merlin
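[Editor's note: a sketch of the partial-index-on-a-boolean idea quoted above, which only pays off when the interesting value is rare. The "has_towbar" flag is a made-up example column.]

    -- index only the small fraction of for-sale rows where the flag is true
    CREATE INDEX cars_with_towbar_idx
        ON cars (price)
        WHERE has_towbar AND sales_state = 'for_sale';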
Re: Guidelines on best indexing strategy for varying searches on 20+ columns
From
Jeff Janes
Date:
On Wed, Jun 25, 2014 at 1:48 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
> Hi,
> I’m running a search engine for cars. It’s backed by a PostgreSQL 9.3 installation.
>
> Now I’m unsure about the best approach/strategy for doing index optimization for the frontend search.
>
> The problem:
>
> The table containing the cars holds around 1.5 million rows. People who search for cars need different criteria to search by. Some search by brand/model, some by year, some by mileage, some by price and some by special equipment etc. etc. - and often they combine a whole bunch of criteria together. Of course some, like brand/model and price, are used more frequently than others. In total we offer: 9 category criteria like brand/model or body type, plus 5 numeric criteria like price or mileage, plus 12 boolean criteria like equipment. Lastly, people can order the results by different columns (year, price, mileage and a score we create about the cars). By default we order by our own generated score.
>
> What I’ve done so far:
>
> I have analyzed the usage of the criteria “lightly”, and created a few indexes (10). Among those are e.g. indexes on price, mileage and a combined index on brand/model. Since we are only interested in showing results for cars which are actually for sale, the indexes are made as partial indexes on a sales state column.

I'd probably partition the data on whether it is for sale, and then search only the for-sale partition.

> Questions:
>
> 1. How would you go about analyzing and determining what columns should be indexed, and how?

I'd start out with intuition about which columns are likely to be used most often, and in a selective way. And follow up by logging slow queries so they can be dissected at leisure.

> 2. What is the best strategy when optimizing indexes for searches happening on 20+ columns, where the use and the combinations vary a lot? (To just index everything, to index some of the columns, to do combined indexes, to only do single column indexes etc. etc.)

There is no magic index. Based on your description, you are going to be seq scanning your table a lot. Focus on making it as small as possible, by horizontally partitioning it so that the not-for-sale entries are hived off to a historical table, and vertically partitioning it so that large columns rarely used in the WHERE clause are in a separate table. (Ideally you would tell PostgreSQL to aggressively TOAST those columns, but there is no knob with which to do that.)

> 3. I expect that it does not make sense to index all columns?

You mean individually, or jointly? Either way, probably not.

> 4. I expect it does not make sense to index boolean columns?

In some cases it can, for example if the data distribution is very lopsided and the value on the smaller side is frequently specified.

> 5. Is it better to do a combined index on 5 frequently used columns rather than having individual indexes on each of them?

How often are the columns specified together? If they are completely independent, it probably makes little sense to index them together.

> 6. Would it be a good idea to have all indexes sorted by my default sorting?

You don't get to choose. A btree index is sorted by the columns specified in the index, according to the operators specified (or defaulted). Unless you mean that you want to add the default sort column as the lead column in each index; that actually might make sense.

> 7. Do you have some experiences with other approaches that could greatly improve performance (e.g. forcing indexes to stay in memory etc.)?

If your queries are as unstructured as you imply, I'd forget about indexes for the most part, as you will have a hard time finding ones that work. Concentrate on making seq scans as fast as possible. If most of your queries end in something like "ORDER BY price LIMIT 10" then concentrate on index scans over price. You will probably want to include heuristics in your UI such that if people configure queries to download half your database, you disallow that. You will probably find that 90% of the workload comes from people who are just playing around with your website and don't actually intend to do business with you.

Cheers,

Jeff
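[Editor's note: a rough sketch of the partitioning idea above using 9.3-era table inheritance, plus a partial index aimed at "ORDER BY price LIMIT 10" style queries. Table and column names are assumptions, and the insert-routing trigger that inheritance partitioning needs is omitted.]

    -- split for-sale rows from historical rows; with constraint_exclusion on,
    -- for-sale searches can skip the historical child table entirely
    CREATE TABLE cars_for_sale (CHECK (sales_state = 'for_sale'))  INHERITS (cars);
    CREATE TABLE cars_history  (CHECK (sales_state <> 'for_sale')) INHERITS (cars);

    -- support "ORDER BY price LIMIT n" with an index scan on the active partition
    CREATE INDEX cars_for_sale_price_idx ON cars_for_sale (price);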
Re: Guidelines on best indexing strategy for varying searches on 20+ columns
From
Niels Kristian Schjødt
Date:
Thanks for the answers.

On 30/06/2014 at 20.04, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Jun 25, 2014 at 1:48 AM, Niels Kristian Schjødt
> <nielskristian@autouncle.com> wrote:
>> Hi,
>> I’m running a search engine for cars. It’s backed by a PostgreSQL 9.3 installation.
>>
>> Now I’m unsure about the best approach/strategy for doing index optimization for the frontend search.
>>
>> The problem:
>>
>> The table containing the cars holds around 1.5 million rows. People who search for cars need different criteria to search by. Some search by brand/model, some by year, some by mileage, some by price and some by special equipment etc. etc. - and often they combine a whole bunch of criteria together. Of course some, like brand/model and price, are used more frequently than others. In total we offer: 9 category criteria like brand/model or body type, plus 5 numeric criteria like price or mileage, plus 12 boolean criteria like equipment. Lastly, people can order the results by different columns (year, price, mileage and a score we create about the cars). By default we order by our own generated score.
>>
>> What I’ve done so far:
>>
>> I have analyzed the usage of the criteria “lightly”, and created a few indexes (10). Among those are e.g. indexes on price, mileage and a combined index on brand/model. Since we are only interested in showing results for cars which are actually for sale, the indexes are made as partial indexes on a sales state column.
>
> I'd probably partition the data on whether it is for sale, and then
> search only the for-sale partition.

Hmm okay, I have already made all the indexes partial based on the for-sale state. If the queries always filter on for-sale, and all indexes are partial on that condition, will it then still help performance / make sense to partition the tables?

>> Questions:
>>
>> 1. How would you go about analyzing and determining what columns should be indexed, and how?
>
> I'd start out with intuition about which columns are likely to be used
> most often, and in a selective way. And follow up by logging slow
> queries so they can be dissected at leisure.
>
>> 2. What is the best strategy when optimizing indexes for searches happening on 20+ columns, where the use and the combinations vary a lot? (To just index everything, to index some of the columns, to do combined indexes, to only do single column indexes etc. etc.)
>
> There is no magic index. Based on your description, you are going to
> be seq scanning your table a lot. Focus on making it as small as
> possible, by horizontally partitioning it so that the not-for-sale
> entries are hived off to a historical table, and vertically
> partitioning it so that large columns rarely used in the WHERE clause
> are in a separate table. (Ideally you would tell PostgreSQL to
> aggressively TOAST those columns, but there is no knob with which to
> do that.)
>
>> 3. I expect that it does not make sense to index all columns?
>
> You mean individually, or jointly? Either way, probably not.
>
>> 4. I expect it does not make sense to index boolean columns?
>
> In some cases it can, for example if the data distribution is very
> lopsided and the value on the smaller side is frequently specified.
>
>> 5. Is it better to do a combined index on 5 frequently used columns rather than having individual indexes on each of them?
>
> How often are the columns specified together? If they are completely
> independent, it probably makes little sense to index them together.
>
>> 6. Would it be a good idea to have all indexes sorted by my default sorting?
>
> You don't get to choose. A btree index is sorted by the columns
> specified in the index, according to the operators specified (or
> defaulted). Unless you mean that you want to add the default sort
> column as the lead column in each index; that actually might make
> sense.
>
>> 7. Do you have some experiences with other approaches that could greatly improve performance (e.g. forcing indexes to stay in memory etc.)?
>
> If your queries are as unstructured as you imply, I'd forget about
> indexes for the most part, as you will have a hard time finding ones
> that work. Concentrate on making seq scans as fast as possible. If
> most of your queries end in something like "ORDER BY price LIMIT 10"
> then concentrate on index scans over price. You will probably want to
> include heuristics in your UI such that if people configure queries to
> download half your database, you disallow that. You will probably
> find that 90% of the workload comes from people who are just playing
> around with your website and don't actually intend to do business with
> you.
>
> Cheers,
>
> Jeff
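[Editor's note: for illustration, one reading of the quoted suggestion for question 6 -- leading an index with the default sort column (the generated score) so that "ORDER BY score LIMIT n" plus a filter can walk the index in sorted order. Column names are assumptions.]

    CREATE INDEX cars_score_price_idx
        ON cars (score, price)
        WHERE sales_state = 'for_sale';

    -- e.g.: SELECT * FROM cars
    --       WHERE sales_state = 'for_sale' AND price < 100000
    --       ORDER BY score LIMIT 10;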