Re: Update with last known location? - Mailing list pgsql-novice

From:           James David Smith
Subject:        Re: Update with last known location?
Date:
Msg-id:         CAMu32ACWq4hJcfANx_+91SRQ6hxs-nmcOdHgzUji-cnQVquaMg@mail.gmail.com
In response to: Re: Update with last known location? (James David Smith <james.david.smith@gmail.com>)
Responses:      Re: Update with last known location?
List:           pgsql-novice
All,

Here's a SQL fiddle of my problem: http://sqlfiddle.com/#!15/77157

Thanks

James

On 30 January 2014 11:19, James David Smith <james.david.smith@gmail.com> wrote:
> Hi Erik / all,
>
> I don't think that will work, as what happens if one of the people has
> two missing periods of time within their day?
>
> I've made a self-contained example of my problem below. Would you mind
> trying to produce in the code below what you think I should do? Or if
> anyone else fancies having a go then please do.
>
> I very much appreciate your help by the way. Thank you. I'm really at
> a loss with this. :-(
>
> James
> --------------------------------------
>
> DROP TABLE test_data;
>
> CREATE TABLE test_data(
>     ppid integer,
>     point_time timestamp without time zone,
>     the_geom integer);
>
> INSERT INTO test_data VALUES
> ('1', '2012-01-01 07:00', '1'),
> ('1', '2012-01-01 07:01', '1'),
> ('1', '2012-01-01 07:02', '1'),
> ('1', '2012-01-01 07:03', NULL), -- null should be replaced with 1
> ('1', '2012-01-01 07:04', NULL), -- null should be replaced with 1
> ('1', '2012-01-01 07:05', '5'),
> ('1', '2012-01-01 07:06', '5'),
> ('1', '2012-01-01 07:07', '5'),
> ('1', '2012-01-01 07:08', NULL), -- null should be replaced with 5
> ('1', '2012-01-01 07:09', NULL), -- null should be replaced with 5
> ('1', '2012-01-01 07:10', NULL), -- null should be replaced with 5
> ('1', '2012-01-01 07:11', NULL), -- null should be replaced with 5
> ('2', '2013-05-02 07:12', '24'),
> ('2', '2013-05-02 07:13', '24'),
> ('2', '2013-05-02 07:14', '24'),
> ('2', '2013-05-02 07:15', NULL), -- null should be replaced with 24
> ('2', '2013-05-02 07:16', NULL), -- null should be replaced with 24
> ('2', '2013-05-02 07:17', '44'),
> ('2', '2013-05-02 07:18', '44'),
> ('2', '2013-05-02 07:19', NULL), -- null should be replaced with 44
> ('2', '2013-05-02 07:20', '4'),
> ('2', '2013-05-02 07:21', '4'),
> ('2', '2013-05-02 07:22', '4');
>
> WITH missing_geoms AS (
>     SELECT ppid,
>            point_time,
>            the_geom
>     FROM test_data
>     WHERE the_geom IS NULL)
> ---
> ,filled_geoms AS (
>     SELECT ppid,
>            point_time,
>            the_geom
>     FROM test_data
>     WHERE the_geom IS NOT NULL)
> ---
> ,partitioned AS (
>     SELECT missing_geoms.ppid       AS missing_geoms_ppid,
>            missing_geoms.point_time AS missing_geoms_point_time,
>            missing_geoms.the_geom   AS missing_geoms_the_geom,
>            filled_geoms.ppid        AS filled_geoms_ppid,
>            filled_geoms.point_time  AS filled_geoms_point_time,
>            filled_geoms.the_geom    AS filled_geoms_the_geom
>     FROM missing_geoms
>     LEFT JOIN filled_geoms
>         ON filled_geoms.point_time < missing_geoms.point_time
>         AND filled_geoms.ppid = missing_geoms.ppid
>     ORDER BY missing_geoms_ppid,
>              missing_geoms_point_time)
> ---
> SELECT *
> FROM partitioned;
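[Editor's note: a minimal sketch, not from the thread, of one standard way to
fill each NULL with the person's last known value without a self-join. It runs
against James's test_data table above. count(the_geom) ignores NULLs, so the
running count only increments on rows that have a geometry; that count labels
every gap row with the group of its most recent fix, and first_value() then
copies that fix across the group.]

WITH grouped AS (
    SELECT ppid,
           point_time,
           the_geom,
           -- running count of non-NULL geometries: it increments only on
           -- rows with a fix, so the NULL rows that follow share its value
           count(the_geom) OVER (PARTITION BY ppid
                                 ORDER BY point_time) AS grp
    FROM test_data)
SELECT ppid,
       point_time,
       -- the first row of each (ppid, grp) group is the non-NULL fix itself
       first_value(the_geom) OVER (PARTITION BY ppid, grp
                                   ORDER BY point_time) AS the_geom_filled
FROM grouped
ORDER BY ppid, point_time;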
On 29 January 2014 22:10, Erik Darling <edarling80@gmail.com> wrote:
>> Hi James,
>>
>> Yeah, that looks like the right place to me.
>>
>> On Wed, Jan 29, 2014 at 1:14 PM, James David Smith
>> <james.david.smith@gmail.com> wrote:
>>>
>>> Hi Erik,
>>>
>>> Do you mean in this section of the SQL?
>>> .....
>>> filled_geoms AS (
>>>     SELECT ppid,
>>>            point_time,
>>>            the_geom
>>>     FROM hybrid_location
>>>     WHERE the_geom IS NOT NULL)
>>> ...
>>>
>>> Thanks
>>>
>>> James
>>>
>>> On 29 January 2014 17:57, Erik Darling <edarling80@gmail.com> wrote:
>>> > Hi James,
>>> >
>>> > I think you're still stuck with sort of unnecessary ('too much') data
>>> > coming from the right side of your left join. If so, one option I
>>> > would consider is using DENSE_RANK() the way you use ROW_NUMBER() in
>>> > the filled_geoms table. If you partition by id and order by date
>>> > descending, you can add a d_rank = 1 filter to get only the most
>>> > recent activity. I believe this is what you want to set your NULL
>>> > values to, no?
>>> >
>>> > On Wed, Jan 29, 2014 at 12:41 PM, James David Smith
>>> > <james.david.smith@gmail.com> wrote:
>>> >>
>>> >> On 29 January 2014 16:02, Erik Darling <edarling80@gmail.com> wrote:
>>> >> > I would re-suggest using a CTE to contain each dataset to ensure
>>> >> > your selects are distilling them correctly, and then using a final
>>> >> > query to join them. You can then either update your data directly
>>> >> > through the CTE(s), or insert the results into another table to do
>>> >> > some further testing. I think you'll find this method presents the
>>> >> > data a bit more ergonomically for analysis.
>>> >> >
>>> >> > http://www.postgresql.org/docs/9.3/static/queries-with.html
>>> >> >
>>> >> > On Wed, Jan 29, 2014 at 10:45 AM, James David Smith
>>> >> > <james.david.smith@gmail.com> wrote:
>>> >> >>
>>> >> >> Hi Erik/all,
>>> >> >>
>>> >> >> I just tried that, but it's tricky. The 'extra' data is indeed
>>> >> >> coming from the right side of the join, but it's hard to select
>>> >> >> only the max from it. Maybe it's possible but I've not managed to
>>> >> >> do it. Here is where I am, which is so very close.
>>> >> >>
>>> >> >> SELECT DISTINCT(a.ppid, a.point_time, a.the_geom) AS
>>> >> >>            row_that_needs_geom_updating,
>>> >> >>        max(b.point_time) OVER (PARTITION BY a.ppid, a.point_time)
>>> >> >>            AS last_known_position_time
>>> >> >> FROM test a
>>> >> >> INNER JOIN
>>> >> >>     (SELECT ppid,
>>> >> >>             point_time,
>>> >> >>             the_geom
>>> >> >>      FROM test
>>> >> >>      WHERE the_geom IS NOT NULL) b
>>> >> >>     ON b.point_time < a.point_time
>>> >> >>     AND a.ppid = b.ppid
>>> >> >> WHERE a.the_geom IS NULL;
>>> >> >>
>>> >> >> If you see the attached screen-print, the output is the rows that
>>> >> >> I want. However I've had to use DISTINCT to stop the duplication.
>>> >> >> Also I've not managed to pull through 'the_geom' from the JOIN.
>>> >> >> I'm not sure how. Anyone?
>>> >> >>
>>> >> >> But it's kind of working. :-)
>>> >> >>
>>> >> >> Worst case, if I can't figure out how to solve this in one query
>>> >> >> I'll have to store the result of the above, and then use it as a
>>> >> >> basis for another query I think.
>>> >> >>
>>> >> >> Thanks
>>> >> >>
>>> >> >> James
>>> >> >>
>>> >> >> On 29 January 2014 12:56, Erik Darling <edarling80@gmail.com> wrote:
>>> >> >> > I would try partitioning the second time you call row_number,
>>> >> >> > perhaps by ID, and then selecting the MAX() from that, since I
>>> >> >> > think the 'too much' data you're referring to is coming from the
>>> >> >> > right side of your join.
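[Editor's note: Erik's "keep only the most recent row from the right side of
the join" idea can also be written with PostgreSQL's DISTINCT ON, which keeps
the first row per group under the ORDER BY, doing the rank-1 filtering in one
step. A hedged sketch against the test_data example rather than the real
hybrid_location table:]

SELECT DISTINCT ON (m.ppid, m.point_time)
       m.ppid,
       m.point_time,
       f.the_geom AS last_known_geom
FROM test_data m
JOIN test_data f
    ON f.ppid = m.ppid
    AND f.point_time < m.point_time
    AND f.the_geom IS NOT NULL
WHERE m.the_geom IS NULL
-- f.point_time DESC makes the kept row the latest earlier fix
ORDER BY m.ppid, m.point_time, f.point_time DESC;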
>>> >> >> >
>>> >> >> > On Jan 29, 2014 7:23 AM, "James David Smith"
>>> >> >> > <james.david.smith@gmail.com> wrote:
>>> >> >> >>
>>> >> >> >> On 28 January 2014 23:15, Gavin Flower
>>> >> >> >> <GavinFlower@archidevsys.co.nz> wrote:
>>> >> >> >> > On 29/01/14 11:00, Kevin Grittner wrote:
>>> >> >> >> >>
>>> >> >> >> >> James David Smith <james.david.smith@gmail.com> wrote:
>>> >> >> >> >>
>>> >> >> >> >>> Given the data is so large I don't want to be taking the
>>> >> >> >> >>> data out to a CSV or whatever and then loading it back in.
>>> >> >> >> >>> I'd like to do this within the database using SQL. I
>>> >> >> >> >>> thought I would be able to do this using a LOOP, to be
>>> >> >> >> >>> honest.
>>> >> >> >> >>
>>> >> >> >> >> I would be amazed if you couldn't do this with a single
>>> >> >> >> >> UPDATE statement. I've generally found declarative forms of
>>> >> >> >> >> such work to be at least one order of magnitude faster than
>>> >> >> >> >> going to either a PL or a script approach. I would start by
>>> >> >> >> >> putting together a SELECT query using window functions and
>>> >> >> >> >> maybe a CTE or two to list all the primary keys which need
>>> >> >> >> >> updating and the new values they should have. Once that
>>> >> >> >> >> SELECT was looking good, I would put it in the FROM clause
>>> >> >> >> >> of an UPDATE statement.
>>> >> >> >> >>
>>> >> >> >> >> That should work, but if you are updating a large percentage
>>> >> >> >> >> of the table, I would go one step further before running
>>> >> >> >> >> this against the production tables. I would put a LIMIT on
>>> >> >> >> >> the above-mentioned SELECT of something like 10000 rows, and
>>> >> >> >> >> script a loop that alternates between the UPDATE and a
>>> >> >> >> >> VACUUM ANALYZE on the table.
>>> >> >> >> >>
>>> >> >> >> >> --
>>> >> >> >> >> Kevin Grittner
>>> >> >> >> >> EDB: http://www.enterprisedb.com
>>> >> >> >> >> The Enterprise PostgreSQL Company
>>> >> >> >> >>
>>> >> >> >> > James, you might consider dropping as many indexes on the
>>> >> >> >> > table as you safely can, and rebuilding them after the mass
>>> >> >> >> > update. If you have lots of such indexes, you will find this
>>> >> >> >> > approach to be a lot faster.
>>> >> >> >> >
>>> >> >> >> > Cheers,
>>> >> >> >> > Gavin
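[Editor's note: a hedged sketch of Kevin's advice above, again using the
test_data names: once the SELECT returns the right (key, new value) pairs,
move it into a CTE and drive a single UPDATE from it (WITH before UPDATE
needs PostgreSQL 9.1+). The last_known CTE reuses the DISTINCT ON query from
the earlier note. On a big production table, Kevin's batching idea would mean
putting a LIMIT on the CTE and having a script alternate the UPDATE with
VACUUM ANALYZE until no rows remain.]

WITH last_known AS (
    SELECT DISTINCT ON (m.ppid, m.point_time)
           m.ppid,
           m.point_time,
           f.the_geom
    FROM test_data m
    JOIN test_data f
        ON f.ppid = m.ppid
        AND f.point_time < m.point_time
        AND f.the_geom IS NOT NULL
    WHERE m.the_geom IS NULL
    ORDER BY m.ppid, m.point_time, f.point_time DESC)
UPDATE test_data t
SET the_geom = lk.the_geom
FROM last_known lk
WHERE t.ppid = lk.ppid
  AND t.point_time = lk.point_time;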
>>> >> >>
>>> >> >> Hi all,
>>> >> >>
>>> >> >> Thanks for your help and assistance. I think that window
>>> >> >> functions, and in particular PARTITION BY, are 100% the way to
>>> >> >> go. I've been concentrating on a SELECT statement for now and am
>>> >> >> close, but not quite close enough. The below query gets all the
>>> >> >> data I want, but *too* much. What I've essentially done is:
>>> >> >>
>>> >> >> - Select all the rows that don't have any geom information
>>> >> >> - Join them with all rows before this point that *do* have geom
>>> >> >>   information.
>>> >> >> - Before doing this join, use partition to generate row numbers.
>>> >> >>
>>> >> >> The attached screen grab shows the result of my query below.
>>> >> >> Unfortunately this is generating a lot of joins that I don't
>>> >> >> want. This won't be practical when doing it with 75,000 people.
>>> >> >>
>>> >> >> Thoughts and code suggestions very much appreciated... if needed
>>> >> >> I could put together some SQL to create an example table?
>>> >> >>
>>> >> >> Thanks
>>> >> >>
>>> >> >> SELECT row_number() OVER (PARTITION BY test.point_time
>>> >> >>                           ORDER BY test.point_time) AS test_row,
>>> >> >>        test.ppid AS test_ppid,
>>> >> >>        test.point_time AS test_point_time,
>>> >> >>        test.the_geom AS test_the_geom,
>>> >> >>        a.ppid AS a_ppid,
>>> >> >>        a.point_time AS a_point_time,
>>> >> >>        a.the_geom AS a_the_geom,
>>> >> >>        a.a_row
>>> >> >> FROM test
>>> >> >> LEFT JOIN (
>>> >> >>     SELECT the_geom,
>>> >> >>            ppid,
>>> >> >>            point_time,
>>> >> >>            row_number() OVER (ORDER BY ppid, point_time) AS a_row
>>> >> >>     FROM test
>>> >> >>     WHERE the_geom IS NOT NULL) a
>>> >> >>     ON a.point_time < test.point_time
>>> >> >>     AND a.ppid = test.ppid
>>> >> >> WHERE test.the_geom IS NULL
>>> >> >> ORDER BY test.point_time;
>>> >>
>>> >> Hi Erik / all,
>>> >>
>>> >> So I think I've managed to re-write my queries using CTEs. The below
>>> >> code now does get me the data that I want. But to do so it is going
>>> >> to create a frankly huge table in the part of the SQL where it makes
>>> >> the table called 'partitioned'. My rough guess is that it'll have to
>>> >> make a table of about 100 billion rows in order to get the data I
>>> >> need (about 108 million rows).
>>> >>
>>> >> Could someone please glance through it for me and suggest how to
>>> >> write it more efficiently?
>>> >>
>>> >> Thanks
>>> >>
>>> >> James
>>> >>
>>> >> WITH missing_geoms AS (
>>> >>     SELECT ppid,
>>> >>            point_time,
>>> >>            the_geom
>>> >>     FROM hybrid_location
>>> >>     WHERE the_geom IS NULL)
>>> >> -----------------
>>> >> ,filled_geoms AS (
>>> >>     SELECT ppid,
>>> >>            point_time,
>>> >>            the_geom
>>> >>     FROM hybrid_location
>>> >>     WHERE the_geom IS NOT NULL)
>>> >> ----------------
>>> >> ,partitioned AS (
>>> >>     SELECT missing_geoms.ppid,
>>> >>            missing_geoms.point_time,
>>> >>            missing_geoms.the_geom,
>>> >>            filled_geoms.ppid,
>>> >>            filled_geoms.point_time,
>>> >>            filled_geoms.the_geom,
>>> >>            row_number() OVER (PARTITION BY missing_geoms.ppid,
>>> >>                                            missing_geoms.point_time
>>> >>                               ORDER BY missing_geoms.ppid,
>>> >>                                        missing_geoms.point_time,
>>> >>                                        filled_geoms.ppid,
>>> >>                                        filled_geoms.point_time DESC)
>>> >>     FROM missing_geoms
>>> >>     LEFT JOIN filled_geoms
>>> >>         ON filled_geoms.point_time < missing_geoms.point_time
>>> >>         AND filled_geoms.ppid = missing_geoms.ppid)
>>> >> --------------
>>> >> SELECT *
>>> >> FROM partitioned
>>> >> WHERE row_number = 1;
>>> >>
>>> >> James
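[Editor's note: on the efficiency worry above: the LEFT JOIN pairs every gap
row with every earlier fix for that person, and row_number() then discards
all but one pair per gap, which is where the rough 100-billion-row
intermediate comes from. On PostgreSQL 9.3+ a LATERAL subquery lets each gap
row fetch only its single latest fix, and a partial index can make each
lookup a single index probe. A hedged sketch using the thread's
hybrid_location names; the index name is invented for illustration:]

CREATE INDEX hybrid_location_filled_idx
    ON hybrid_location (ppid, point_time)
    WHERE the_geom IS NOT NULL;

SELECT m.ppid,
       m.point_time,
       f.the_geom AS last_known_geom
FROM hybrid_location m
LEFT JOIN LATERAL (
    SELECT the_geom
    FROM hybrid_location
    WHERE ppid = m.ppid
      AND point_time < m.point_time
      AND the_geom IS NOT NULL
    ORDER BY point_time DESC
    LIMIT 1) f ON true   -- LEFT ... ON true keeps gap rows with no earlier fix
WHERE m.the_geom IS NULL;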