[Fwd: PostgreSQL to index MEDLINE] - Mailing list pgsql-general
From | Justin Clift |
---|---|
Subject | [Fwd: PostgreSQL to index MEDLINE] |
Date | |
Msg-id | 3ED2EB2E.4060509@postgresql.org Whole thread Raw |
Responses |
Re: [Fwd: PostgreSQL to index MEDLINE]
|
List | pgsql-general |
Hi everyone, Does anyone feel like assisting Olaf here? Regards and best wishes, Justin Clift -------- Original Message -------- Subject: PostgreSQL to index MEDLINE Date: Wed, 21 May 2003 11:47:39 +0200 From: Olaf Bininda-Emonds <Olaf.Bininda@tz.agrar.tu-muenchen.de> To: justin@postgresql.org Hi, I have a question concerning the suitability of PostgreSQL to yield a word index of MEDLINE, an XML-formatted database of over 12 000 000 articles from the biomedical literature. A computer scientist friend suggested that PostgreSQL might be the answer that we require. Briefly, our project is as follows: We would like to establish word co-occurences within each MEDLINE entry, with one word representing a gene and the other word representing one or more user-input keywords of functional traits. As the MEDLINE database is over 30 GB in size (and split into 396 separate files), we need an efficient, indexed way to search it. Our specific needs are as follows: 1) An index of all the words in the fields "title", "abstract" (not always present), and "medical subject headings" such that I can search the index to reveal which article (represented by the MedlineID field) contains a desired word. This will have two purposes: 1a) To build a subindex detailing which human genes are being referred to in an article. This is slightly complicated, in part because of the ever-changing nomenclature of genes. Officially, a gene is known by an alphanumeric symbol (of up to eight characters; e.g., BRCA1) and a name of one or more words ("breast cancer 1, early onset"). Searching for the symbol itself is insufficient, either because the symbol has not been used (e.g., older references) or can have other, common meanings (e.g., the symbol for "pyruvate carboxylase" is PC). Therefore, determining whether a gene is being mentioned in a given article requires a combination of the gene symbol being present and/or a sufficient number of the words in its name ("sufficient" being determined by the length of the gene symbol). I have a perl script that can determine this; however, it currently searches through all 396 files on a pattern matching basis. Using an index would obviously be that more efficient, allowing me to query for a lists of articles that contain for a given gene, the gene symbol and each of the words in the gene name. The intersection of these lists would then be the articles held to contain a reference to that gene. 1b) To enable dynamic searches of the index for user-input keywords of functional traits. Searching for phrases ("lipid metabolism") would be ideal, but not necessary. In both cases, the matches would be for exact words (i.e., no stem searches). 2) The program should run in a UNIX environment (as implemented in Mac OS X) as I will be writing my own (hideously-inefficient) perl script to access it for queries. Generally, the protocol will be as follows: a) input gene-article subindex into memory b) obtain list of user-input keywords (and phrases, which will be split into individual words if phrases cannot be searched for) c) query index for MEDLINE articles containing each keyword in turn; articles will be scored according to the number of keywords they contain (i.e., an article with two keywords is scored twice as heavily as an article with only one keyword) d) for each article in the list in (c), determine whether each of ca. 16000 genes is also mentioned in that article (via subindex); scores for genes will be positively or negatively incremented according to whether or not they are present e) for each gene, determine its association score (= number of articles in which it co-occurs with at least one keyword - number of articles containing a keyword but not that gene). Therefore, the index should be written to disk as it will be accessed repeatedly for each keyword. --- I would appreciate your advice as to whether PostgreSQL is suitable for this task and, if so, how best to implement it and set up the word index. I have attached a sample of the MEDLINE database for you. With thanks in advance for your help! Best, Olaf ------------------------------------------------------------------------ ----------------- Olaf Bininda-Emonds Lehrstuhl für Tierzucht Technical University of Munich Alte Akademie 12 85354 Freising-Weihenstephan Germany Phone: +49 (0)8161 713741 Fax: +49 (0)8161 713107 e-mail: Olaf.Bininda@tierzucht.tum.de WWW: http://www.tierzucht.tum.de/Bininda-Emonds/ -- "My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there." - Indira Gandhi
Attachment
pgsql-general by date: