Full text indexing (and errors!) - Mailing list pgsql-sql
From | Mitch Vincent
---|---
Subject | Full text indexing (and errors!)
Msg-id | 00db01bfc345$e5400100$0300000a@doot.org
Responses | Re: Full text indexing (and errors!)
List | pgsql-sql
I posted this to -hackers but thought it might be of some use to people here as well. I look forward to any and all comments.. Thanks!

----------------------------------

I finally finished the fulltextindex (in contrib/fulltextindex) and am wondering if I might be able to do something I haven't already tried to make these queries a bit faster. There is a lot of data here, I know, but as it stands it is faster for me to use grep (or agrep) to search the files, put the matching files into a temporary table, and qualify results on that... I had hoped this would be the case :-)

Let's see.. (Most of this is in contrib/fulltextindex, but I'll repeat it here.) I have two tables:

          Table "applicants_resumes"
      Attribute  |  Type   | Modifier
    -------------+---------+----------
     app_id      | integer |
     resume_text | text    |
    Index: resumes_oid_index

and

        Table "resumes_fti"
     Attribute |     Type     | Modifier
    -----------+--------------+----------
     string    | varchar(255) |
     id        | oid          |
    Index: resume_fti_index

The resumes_fti table holds the broken-up chunks of text (duh). I also did:

    CLUSTER resumes_fti_index ON resumes_fti;

When I run a query I get:

    ipa=# explain select c.* from applicants c, resumes_fti f1
          where f1.string ~ '^engineer' and f1.id = c.oid;
    NOTICE:  QUERY PLAN:

    Hash Join  (cost=918.11..7969.43 rows=56443 width=607)
      ->  Seq Scan on applicants c  (cost=0.00..1331.56 rows=9856 width=603)
      ->  Hash  (cost=5.00..5.00 rows=168041 width=4)
            ->  Index Scan using resume_fti_index on resumes_fti f1  (cost=0.00..5.00 rows=168041 width=4)

    EXPLAIN

    query: select c.* from applicants c, resumes_fti f1 where f1.string ~ '^engineer' and f1.id = c.oid;
    ProcessQuery
    ! system usage stats:
    !       304.304711 elapsed 9.111341 user 7.992034 system sec
    !       [9.172874 user 8.012545 sys total]
    !       53309/1599 [53411/1599] filesystem blocks in/out
    !       0/295 [0/605] page faults/reclaims, 0 [0] swaps
    !       0 [0] signals rcvd, 0/0 [4/5] messages rcvd/sent
    !       53168/226 [53255/232] voluntary/involuntary context switches
    ! postgres usage stats:
    !       Shared blocks:      58171 read,          0 written, buffer hit rate = 14.19%
    !       Local  blocks:          0 read,          0 written, buffer hit rate = 0.00%
    !       Direct blocks:          0 read,          0 written
    CommitTransactionCommand
    proc_exit(0)
    shmem_exit(0)
    exit(0)

... As you can see, really, really slow. So I thought about creating an index on 'string' in resumes_fti (which makes sense, as this is supposed to be a full text index) -- note that this isn't listed in the README, which was very odd to me. I don't understand the point of breaking all the text out into chunks if you don't then index it. The way it is after you finish the README is just a broken-up version of what you already had (a bunch of text fields). It was my understanding that the fulltextindex was meant to get around the inability to index a full text field by breaking it up into smaller varchar() fields which could then be indexed... Anyway, I tried:

    ipa=# CREATE INDEX "rstring_fti_index" on resumes_fti(lower(string));
    FATAL 1:  Memory exhausted in AllocSetAlloc()
    pqReadData() -- backend closed the channel unexpectedly.
            This probably means the backend terminated abnormally
            before or while processing the request.
    The connection to the server was lost.  Attempting reset: Succeeded.

... and as you can see it messed up pretty badly after about 15 minutes. This machine is a PII 400 with 256 megs of RAM and 520 megs of swap, running PostgreSQL 7.0.

Sorry about the length of this post, and thanks for any pointers/comments/ideas..

-Mitch Vincent
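[Editor's note: one possible workaround, sketched here as a hedged example rather than a tested fix. It sidesteps the failing functional index on lower(string) by storing the fti chunks pre-lowercased, so a plain btree index on the varchar column suffices; the index name rstring_idx is illustrative, and this assumes you control how resumes_fti is populated.]

```sql
-- Store the chunks already lowercased, so no lower(string) functional
-- index (the one that exhausted memory above) is needed. If the fti
-- trigger/loader can be changed to lowercase on insert, do it there
-- instead of with this one-off UPDATE.
UPDATE resumes_fti SET string = lower(string);

-- A plain btree index on the varchar column.
CREATE INDEX rstring_idx ON resumes_fti (string);

-- Re-run the search with a lowercased, left-anchored pattern; an
-- anchored regex like '^engineer' can be turned into an index range
-- scan instead of scanning every chunk.
SELECT c.*
  FROM applicants c, resumes_fti f1
 WHERE f1.string ~ '^engineer'
   AND f1.id = c.oid;
```

Whether the anchored regex actually uses the index depends on the database's locale (as far as I know it does under the default C locale in 7.0, but not under most non-C locales).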