Re: genomic locus - Mailing list pgsql-hackers
From | Teodor Sigaev |
---|---|
Subject | Re: genomic locus |
Date | |
Msg-id | 02df3c8c-1e20-c3e3-1956-f998bcfef21b@sigaev.ru Whole thread Raw |
In response to | Re: genomic locus (Gene Selkov <selkovjr@gmail.com>) |
Responses |
Re: genomic locus
|
List | pgsql-hackers |
Hmm, would you try to implement separate type for querying? Similar to tsquery, lquery (for ltree), jsquery etc. Gene Selkov wrote: > >> On Dec 17, 2017, at 7:57 PM, Robert Haas <robertmhaas@gmail.com >> <mailto:robertmhaas@gmail.com>> wrote: >> >> On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov <selkovjr@gmail.com >> <mailto:selkovjr@gmail.com>> wrote: >>> I need a data type to represent genomic positions, which will consist of a >>> string and a pair of integers with interval logic and access methods. Sort >>> of like my seg type, but more straightforward. >> >> Have you thought about just using a composite type? > > Yes, I have. That is sort of what I have been doing; a composite type certainly > gets the job done but I don’t feel it reduces query complexity, at least from > the user’s point of view. Maybe I don’t know enough. > > Here’s an example of how I imagine a composite genomic locus (conventionally > represented as text ‘:’ integer ‘-‘ integer): > > CREATE TYPE locus AS (contig text, coord int4range); > CREATE TABLE test_locus ( > pos locus, > ref text, > alt text, > id text > ); > CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord)); > \copy test_locus from test_locus.tab > > Where test_locus.tab has stuff like: > > (chr3,"[178916937,178916940]")GAACHP2_PIK3CA_2 > (chr3,"[178916939,178916948]")AGAAAAGATCHP2_PIK3CA_2 > (chr3,"[178916940,178916941]")GACHP2_PIK3CA_2 > (chr3,"[178916943,178916944]")AGCHP2_PIK3CA_2 > (chr3,"[178916943,178916946]")AAGCHP2_PIK3CA_2 > (chr3,"[178916943,178916952]")AAGATCCTCCHP2_PIK3CA_2 > (chr3,"[178916944,178916945]")AGCHP2_PIK3CA_2 > (chr3,"[178916945,178916946]")GCCHP2_PIK3CA_2 > (chr3,"[178916945,178916946]")GTCHP2_PIK3CA_2 > (chr3,"[178916945,178916948]")GATCHP2_PIK3CA_2 > > When the table is loaded, I can pull the subset shown above with this query: > > SELECT * FROM test_locus WHERE (pos).contig = 'chr3' AND (pos).coord && > '[178916937, 178916948]’; > pos | ref | alt | id > --------------------------------+-----------+-----+--------------- > (chr3,"[178916937,178916941)") | GAA | | CHP2_PIK3CA_2 > (chr3,"[178916939,178916949)") | AGAAAAGAT | | CHP2_PIK3CA_2 > . . . . > > So far so good. It gets the job done. However, it is only a small step towards a > fully encapsulated, monolithic type I want it to be. The above query It is > marginally better than its atomic-type equivalent: > > SELECT * FROM test WHERE contig = 'chr3' AND greatest(start, 178916937) <= > least(stop, 178916948); > contig | start | stop | ref | alt | id > --------+-----------+-----------+-----------+-----+--------------- > chr3 | 178916937 | 178916940 | GAA | | CHP2_PIK3CA_2 > chr3 | 178916939 | 178916948 | AGAAAAGAT | | CHP2_PIK3CA_2 > . . . . > > and it requires addition syntax transformations steps to go from conventional > locus representation 'chr3:178916937-178916940' to composite > '(chr3,"[178916937,178916940]”)’ and back. > > Of course, the relative benefits of partial encapsulation I achieve by bundling > text with int4range accumulate, compared to (text, int4, int4), as queries grow > more complex. But because the elements of a composite type still require a > separate query term for each of them (unless there is some magic I am not aware > of), the complexity of a typical query I need to run exceeds my feeble > sight-reading capacity. I want things that are conceptually simple to be > expressed in simple terms, if possible. > > Like so: > > CREATE EXTENSION locus; > CREATE TABLE test_locus ( > pos locus, > ref text, > alt text, > id text > ); > \copy test_locus from data/oncomine.hotspot.tab > > SELECT * FROM test_locus WHERE pos && 'chr3:178916937-178916948'; > pos | ref | alt | id > --------------------------+-----------+-----+--------------- > chr3:178916937-178916940 | GAA | | CHP2_PIK3CA_2 > chr3:178916939-178916948 | AGAAAAGAT | | CHP2_PIK3CA_2 > chr3:178916940-178916941 | G | A | CHP2_PIK3CA_2 > chr3:178916943-178916944 | A | G | CHP2_PIK3CA_2 > chr3:178916943-178916946 | AAG | | CHP2_PIK3CA_2 > chr3:178916943-178916952 | AAGATCCTC | | CHP2_PIK3CA_2 > chr3:178916944-178916945 | A | G | CHP2_PIK3CA_2 > chr3:178916945-178916946 | G | C | CHP2_PIK3CA_2 > chr3:178916945-178916946 | G | T | CHP2_PIK3CA_2 > chr3:178916945-178916948 | GAT | | CHP2_PIK3CA_2 > (10 rows) > > I have encountered some pesky geometry / indexing problems while building this > extension (https://github.com/selkovjr/locus), but I hope I can solve them at > least at the level afforded by the composite type, while keeping the clean > interface of a monolithic type. I understand I could probably achieve the same > cleanliness by defining functions and operators over the complex type, but by > the time I’m done with that, will I have coded about the same amount of stuff as > required to build an extended type? > > > Regards, > > —Gene > > > > -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
pgsql-hackers by date: