Thread: best practise/pattern for large OR / LIKE searches
Hey All,
I am wondering if there is a common pattern for these sort of queries :
SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR LIKE '%8766%' OR LIKE '%009%', ..
The number of OR/LIKES are in the order of 50-100 items...
the table tbl is a couple of million rows.
The datanumber is a string that are maximum 10 characters long, no spaces and can contain numbers and letters.
Apart from creating a couple of index table to make the LIKE left anchored something like this :
tbl <----> tbl_4letters
tbl <----> tbl_5letters
tbl <----> tbl_3letters
or creating a functional index 'of some sort' are there any other brilliant ideas out there to solve such a problem (GIN/GIS???) ?
Searches are currently taking to long and we would like to optimize them, but before we dive into our own solution we
where wondering if there already common solutions for this...
Kind Regards,
Ries van Twisk
Hello one year ago there was proposal for index support for LIKE %some%. The problem was extreme size of index size. I thing so you can write own C function, that can check string faster than repeated LIKE some like SELECT * FROM tbl WHERE contains(datanumber, '12345','54321',....) regards Pavel Stehule 2009/8/26 Ries van Twisk <pg@rvt.dds.nl>: > Hey All, > I am wondering if there is a common pattern for these sort of queries : > SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR LIKE > '%8766%' OR LIKE '%009%', .. > The number of OR/LIKES are in the order of 50-100 items... > the table tbl is a couple of million rows. > The datanumber is a string that are maximum 10 characters long, no spaces > and can contain numbers and letters. > Apart from creating a couple of index table to make the LIKE left anchored > something like this : > tbl <----> tbl_4letters > tbl <----> tbl_5letters > tbl <----> tbl_3letters > or creating a functional index 'of some sort' are there any other brilliant > ideas out there to solve such a problem (GIN/GIS???) ? > Searches are currently taking to long and we would like to optimize them, > but before we dive into our own solution we > where wondering if there already common solutions for this... > Kind Regards, > Ries van Twisk > > > > >
Hi Pavel, can you provide some link or other directions to the proposal? I guess it was posted to this list or somewhere else? Tomas > Hello > > one year ago there was proposal for index support for LIKE %some%. The > problem was extreme size of index size. > > I thing so you can write own C function, that can check string faster > than repeated LIKE > > some like > > SELECT * FROM tbl WHERE contains(datanumber, '12345','54321',....) > > regards > Pavel Stehule > > 2009/8/26 Ries van Twisk <pg@rvt.dds.nl>: >> Hey All, >> I am wondering if there is a common pattern for these sort of queries : >> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR >> LIKE >> '%8766%' OR LIKE '%009%', .. >> The number of OR/LIKES are in the order of 50-100 items... >> the table tbl is a couple of million rows. >> The datanumber is a string that are maximum 10 characters long, no >> spaces >> and can contain numbers and letters. >> Apart from creating a couple of index table to make the LIKE left >> anchored >> something like this : >> tbl <----> tbl_4letters >> tbl <----> tbl_5letters >> tbl <----> tbl_3letters >> or creating a functional index 'of some sort' are there any other >> brilliant >> ideas out there to solve such a problem (GIN/GIS???) ? >> Searches are currently taking to long and we would like to optimize >> them, >> but before we dive into our own solution we >> where wondering if there already common solutions for this... >> Kind Regards, >> Ries van Twisk >> >> >> >> >> > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general >
2009/8/26 <tv@fuzzy.cz>: > Hi Pavel, > > can you provide some link or other directions to the proposal? I guess it > was posted to this list or somewhere else? Please, ask to Oleg Bartunov http://www.sai.msu.su/~megera/wiki/wildspeed regards Pavel Stehule > > Tomas > >> Hello >> >> one year ago there was proposal for index support for LIKE %some%. The >> problem was extreme size of index size. >> >> I thing so you can write own C function, that can check string faster >> than repeated LIKE >> >> some like >> >> SELECT * FROM tbl WHERE contains(datanumber, '12345','54321',....) >> >> regards >> Pavel Stehule >> >> 2009/8/26 Ries van Twisk <pg@rvt.dds.nl>: >>> Hey All, >>> I am wondering if there is a common pattern for these sort of queries : >>> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR >>> LIKE >>> '%8766%' OR LIKE '%009%', .. >>> The number of OR/LIKES are in the order of 50-100 items... >>> the table tbl is a couple of million rows. >>> The datanumber is a string that are maximum 10 characters long, no >>> spaces >>> and can contain numbers and letters. >>> Apart from creating a couple of index table to make the LIKE left >>> anchored >>> something like this : >>> tbl <----> tbl_4letters >>> tbl <----> tbl_5letters >>> tbl <----> tbl_3letters >>> or creating a functional index 'of some sort' are there any other >>> brilliant >>> ideas out there to solve such a problem (GIN/GIS???) ? >>> Searches are currently taking to long and we would like to optimize >>> them, >>> but before we dive into our own solution we >>> where wondering if there already common solutions for this... >>> Kind Regards, >>> Ries van Twisk >>> >>> >>> >>> >>> >> >> -- >> Sent via pgsql-general mailing list (pgsql-general@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-general >> > > >
The wildspeed function seems to be what I was looking for. an dI remember that I have seen it before on the list... just I couldn't remember names or anything... Ries On Aug 26, 2009, at 7:28 AM, Pavel Stehule wrote: > 2009/8/26 <tv@fuzzy.cz>: >> Hi Pavel, >> >> can you provide some link or other directions to the proposal? I >> guess it >> was posted to this list or somewhere else? > > Please, ask to Oleg Bartunov > > http://www.sai.msu.su/~megera/wiki/wildspeed > > regards > Pavel Stehule > >> >> Tomas >> >>> Hello >>> >>> one year ago there was proposal for index support for LIKE %some%. >>> The >>> problem was extreme size of index size. >>> >>> I thing so you can write own C function, that can check string >>> faster >>> than repeated LIKE >>> >>> some like >>> >>> SELECT * FROM tbl WHERE contains(datanumber, '12345','54321',....) >>> >>> regards >>> Pavel Stehule >>> >>> 2009/8/26 Ries van Twisk <pg@rvt.dds.nl>: >>>> Hey All, >>>> I am wondering if there is a common pattern for these sort of >>>> queries : >>>> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE >>>> '%54321%' OR >>>> LIKE >>>> '%8766%' OR LIKE '%009%', .. >>>> The number of OR/LIKES are in the order of 50-100 items... >>>> the table tbl is a couple of million rows. >>>> The datanumber is a string that are maximum 10 characters long, no >>>> spaces >>>> and can contain numbers and letters. >>>> Apart from creating a couple of index table to make the LIKE left >>>> anchored >>>> something like this : >>>> tbl <----> tbl_4letters >>>> tbl <----> tbl_5letters >>>> tbl <----> tbl_3letters >>>> or creating a functional index 'of some sort' are there any other >>>> brilliant >>>> ideas out there to solve such a problem (GIN/GIS???) ? >>>> Searches are currently taking to long and we would like to optimize >>>> them, >>>> but before we dive into our own solution we >>>> where wondering if there already common solutions for this... >>>> Kind Regards, >>>> Ries van Twisk >>>> >>>> >>>> >>>> >>>> >>> >>> -- >>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org) >>> To make changes to your subscription: >>> http://www.postgresql.org/mailpref/pgsql-general >>> >> >> >> regards, Ries van Twisk ------------------------------------------------------------------------------------------------- tags: Freelance TYPO3 Glassfish JasperReports JasperETL Flex Blaze-DS WebORB PostgreSQL DB-Architect email: ries@vantwisk.nl web: http://www.rvantwisk.nl/ skype: callto://r.vantwisk Phone: +1-810-476-4196 Cell: +593 9901 7694 SIP: +1-747-690-5133
> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' > OR LIKE '%8766%' OR LIKE '%009%', .. > > The number of OR/LIKES are in the order of 50-100 items... > the table tbl is a couple of million rows. Are the fixed strings in the wildcards "words" (i.e., are they completely arbitrarily embedded in the text, or are they delimited in some regular way)? If they are "words," you might consider using the full text functionality to create an index of them, and searching using that. -- -- Christophe Pettus xof@thebuild.com
On 2009-08-26, Ries van Twisk <pg@rvt.dds.nl> wrote: > > --Apple-Mail-1173-222712773 > Content-Type: text/plain; > charset=US-ASCII; > format=flowed; > delsp=yes > Content-Transfer-Encoding: 7bit > > Hey All, > > I am wondering if there is a common pattern for these sort of queries : > > SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR > LIKE '%8766%' OR LIKE '%009%', .. SELECT * FROM tbl WHERE datanumber LIKE ANY ARRAY('%12345%','%54321%','%8766%'...) > The number of OR/LIKES are in the order of 50-100 items... > the table tbl is a couple of million rows. regex might perfrom better than LIKE ANY SELECT * FROM tbl WHERE '12345|54321|8766|009' ~ datanumber; regex is compiled to a finite state machine and then the datanumber column is scanned in a single pass (for each row) > Searches are currently taking to long and we would like to optimize > them, but before we dive into our own solution we > where wondering if there already common solutions for this... try regex first if that's too slow you may need to write a dictionary function that splits datanuimber into it's components and use full text index/search. (this will slow down updates as they will do upto 20 inserts into the index) searches should then be optimally fast
Hello > regex is compiled to a finite state machine and then the datanumber > column is scanned in a single pass (for each row) > >> Searches are currently taking to long and we would like to optimize >> them, but before we dive into our own solution we >> where wondering if there already common solutions for this... > > try regex first if that's too slow you may need to write a > dictionary function that splits datanuimber into it's components > and use full text index/search. (this will slow down updates as they will do > upto 20 inserts into the index) > > searches should then be optimally fast > I did some tests: 1) I fill test table insert into test SELECT array_to_string(array_agg(array_to_string(ARRAY(select substring('01234567890' from (random()*10)::int + 1 for 1) from generate_series(1,(random()*10+5)::int + i - i)),'')),',') as b from generate_series(1,100000) g(i) group by (random()*1000)::int; 2. I tested searching of 5 or 13 values. I did tests on 8.4 and 8.1 8.1 using like 190ms(440ms*) using regexp 115ms(259ms*) * for 13 values - so there regexp is faster than like on 8.4 using like 80ms(151ms) using regexp 131ms(267ms) so like is faster then regexp on 8.4. fulltext test (8.4) 420ms(470ms) -- without index 14ms(26ms) -- with GiST index 1ms(2ms) -- with Gin index some samples of test queries: select * from test where to_tsvector('simple',a) @@ to_tsquery('simple','296426496|7707431116555|98173598191|302598|53174827|02292064629|188631468777|4756243248|920473|16602317|76613513|78640|9176323'); select * from test where a ~ '296426496|7707431116555|98173598191|302598|53174827|02292064629|188631468777|4756243248|920473|16602317|76613513|78640|9176323'; select * from test where a like '%296426496%' or a like '% 7707431116555%' or a like '%98173598191%' or a like '%302598%' or a like '%53174827%' or a like '%02292064629%' or a like '%188631468777%' or a like '%4756243248%' or a like '%920473%' or a like '%16602317%' or a like '%76613513%' or a like '%78640%' or a like '%9176323%'; regards Pavel Stehule