Re: Update with last known location? - Mailing list pgsql-novice
From: James David Smith
Subject: Re: Update with last known location?
Msg-id: CAMu32AAoVFvoN88sGQV60GvLvN4PiRVBeCne9MuqhC4bqqsseQ@mail.gmail.com
In response to: Re: Update with last known location? (Erik Darling <edarling80@gmail.com>)
Responses: Re: Update with last known location?
List: pgsql-novice
Hi Erik,

Do you mean in this section of the SQL?

.....
filled_geoms AS (
    SELECT ppid,
           point_time,
           the_geom
    FROM hybrid_location
    WHERE the_geom IS NOT NULL)
...

Thanks

James

On 29 January 2014 17:57, Erik Darling <edarling80@gmail.com> wrote:
> Hi James,
>
> I think you're still stuck with sort of unnecessary ('too much') data
> coming from the right side of your left join. If so, one option I would
> consider is using DENSE_RANK() the way you use ROW_NUMBER(), in the
> filled_geoms table. If you partition by id and order by date descending,
> you can do an additional d_rank = 1 filter to only get the most recent
> activity. I believe this is what you want to set your NULL values to, no?
>
> On Wed, Jan 29, 2014 at 12:41 PM, James David Smith
> <james.david.smith@gmail.com> wrote:
>>
>> On 29 January 2014 16:02, Erik Darling <edarling80@gmail.com> wrote:
>> > I would re-suggest using a CTE to contain each dataset to ensure your
>> > selects are distilling them correctly, and then using a final query
>> > to join them. You can then either update your data directly through
>> > the CTE(s), or insert the results to another table to do some further
>> > testing. I think you'll find this method presents the data a bit more
>> > ergonomically for analysis.
>> >
>> > http://www.postgresql.org/docs/9.3/static/queries-with.html
>> >
>> > On Wed, Jan 29, 2014 at 10:45 AM, James David Smith
>> > <james.david.smith@gmail.com> wrote:
>> >>
>> >> Hi Erik/all,
>> >>
>> >> I just tried that, but it's tricky. The 'extra' data is indeed
>> >> coming from the right side of the join, but it's hard to select only
>> >> the max from it. Maybe it's possible but I've not managed to do it.
>> >> Here is where I am, which is so very close.
>> >>
>> >> SELECT
>> >>     DISTINCT(a.ppid, a.point_time, a.the_geom) as
>> >>         row_that_needs_geom_updating,
>> >>     max(b.point_time) OVER (PARTITION BY a.ppid, a.point_time) as
>> >>         last_known_position_time
>> >> FROM test a
>> >> INNER JOIN
>> >>     (SELECT ppid,
>> >>             point_time,
>> >>             the_geom
>> >>      FROM test
>> >>      WHERE the_geom IS NOT NULL) b
>> >> ON b.point_time < a.point_time
>> >> AND a.ppid = b.ppid
>> >> WHERE a.the_geom IS NULL;
>> >>
>> >> If you see attached screen-print, the output is the rows that I
>> >> want. However I've had to use DISTINCT to stop the duplication. Also
>> >> I've not managed to pull through 'the_geom' from the JOIN. I'm not
>> >> sure how. Anyone?
>> >>
>> >> But it's kind of working. :-)
>> >>
>> >> Worst case if I can't figure out how to solve this in one query I'll
>> >> have to store the result of the above, and then use it as a basis
>> >> for another query I think.
>> >>
>> >> Thanks
>> >>
>> >> James
>> >>
>> >> On 29 January 2014 12:56, Erik Darling <edarling80@gmail.com> wrote:
>> >> > I would try partitioning the second time you call row_number,
>> >> > perhaps by ID, and then selecting the MAX() from that, since I
>> >> > think the too much data you're referring to is coming from the
>> >> > right side of your join.
>> >> >
>> >> > On Jan 29, 2014 7:23 AM, "James David Smith"
>> >> > <james.david.smith@gmail.com> wrote:
>> >> >>
>> >> >> On 28 January 2014 23:15, Gavin Flower
>> >> >> <GavinFlower@archidevsys.co.nz> wrote:
>> >> >> > On 29/01/14 11:00, Kevin Grittner wrote:
>> >> >> >>
>> >> >> >> James David Smith <james.david.smith@gmail.com> wrote:
>> >> >> >>
>> >> >> >>> Given the data is so large I don't want to be taking the data
>> >> >> >>> out to a CSV or whatever and then loading it back in. I'd
>> >> >> >>> like to do this within the database using SQL. I thought I
>> >> >> >>> would be able to do this using a LOOP to be honest.
>> >> >> >>
>> >> >> >> I would be amazed if you couldn't do this with a single UPDATE
>> >> >> >> statement. I've generally found declarative forms of such work
>> >> >> >> to be at least one order of magnitude faster than going to
>> >> >> >> either a PL or a script approach. I would start by putting
>> >> >> >> together a SELECT query using window functions and maybe a CTE
>> >> >> >> or two to list all the primary keys which need updating and
>> >> >> >> the new values they should have. Once that SELECT was looking
>> >> >> >> good, I would put it in the FROM clause of an UPDATE statement.
>> >> >> >>
>> >> >> >> That should work, but if you are updating a large percentage
>> >> >> >> of the table, I would go one step further before running this
>> >> >> >> against the production tables. I would put a LIMIT on the
>> >> >> >> above-mentioned SELECT of something like 10000 rows, and
>> >> >> >> script a loop that alternates between the UPDATE and a VACUUM
>> >> >> >> ANALYZE on the table.
>> >> >> >>
>> >> >> >> --
>> >> >> >> Kevin Grittner
>> >> >> >> EDB: http://www.enterprisedb.com
>> >> >> >> The Enterprise PostgreSQL Company
>> >> >> >
>> >> >> > James, you might consider dropping as many indexes on the table
>> >> >> > as you safely can, and rebuilding them after the mass update.
>> >> >> > If you have lots of such indexes, you will find this approach
>> >> >> > to be a lot faster.
>> >> >> >
>> >> >> > Cheers,
>> >> >> > Gavin
>> >> >>
>> >> >> Hi all,
>> >> >>
>> >> >> Thanks for your help and assistance. I think that window
>> >> >> functions, and in particular PARTITION BY, are 100% the way to
>> >> >> go. I've been concentrating on a SELECT statement for now and am
>> >> >> close, but not quite close enough. The below query gets all the
>> >> >> data I want, but *too* much.
>> >> >> What I've essentially done is:
>> >> >>
>> >> >> - Select all the rows that don't have any geom information
>> >> >> - Join them with all rows before this point that *do* have geom
>> >> >>   information
>> >> >> - Before doing this join, use partition to generate row numbers
>> >> >>
>> >> >> The attached screen grab shows the result of my query below.
>> >> >> Unfortunately this is generating a lot of joins that I don't
>> >> >> want. This won't be practical when doing it with 75,000 people.
>> >> >>
>> >> >> Thoughts and code suggestions very much appreciated... if needed
>> >> >> I could put together some SQL to create an example table?
>> >> >>
>> >> >> Thanks
>> >> >>
>> >> >> SELECT row_number() OVER (PARTITION BY test.point_time ORDER BY
>> >> >>            test.point_time) as test_row,
>> >> >>        test.ppid as test_ppid,
>> >> >>        test.point_time as test_point_time,
>> >> >>        test.the_geom as test_the_geom,
>> >> >>        a.ppid as a_ppid,
>> >> >>        a.point_time as a_point_time,
>> >> >>        a.the_geom as a_the_geom,
>> >> >>        a.a_row
>> >> >> FROM test
>> >> >> LEFT JOIN (
>> >> >>     SELECT the_geom,
>> >> >>            ppid,
>> >> >>            point_time,
>> >> >>            row_number() OVER (ORDER BY ppid, point_time) as a_row
>> >> >>     FROM test
>> >> >>     WHERE the_geom IS NOT NULL) a
>> >> >> ON a.point_time < test.point_time
>> >> >> AND a.ppid = test.ppid
>> >> >> WHERE test.the_geom IS NULL
>> >> >> ORDER BY test.point_time;
>>
>> Hi Erik / all,
>>
>> So I think I've managed to re-write my queries using CTEs. The below
>> code now does get me the data that I want. But to do so it is going to
>> create a frankly huge table in the bit of the SQL where it makes the
>> table called 'partitioned'. My rough guess is that it'll have to make
>> a table of about 100 billion rows in order to get the data I need
>> (about 108 million rows).
>>
>> Could someone please glance through it for me and suggest how to write
>> it more efficiently?
>>
>> Thanks
>>
>> James
>>
>> WITH missing_geoms AS (
>>     SELECT ppid,
>>            point_time,
>>            the_geom
>>     FROM hybrid_location
>>     WHERE the_geom IS NULL)
>> -----------------
>> ,filled_geoms AS (
>>     SELECT ppid,
>>            point_time,
>>            the_geom
>>     FROM hybrid_location
>>     WHERE the_geom IS NOT NULL)
>> ----------------
>> ,partitioned AS (
>>     SELECT missing_geoms.ppid,
>>            missing_geoms.point_time,
>>            missing_geoms.the_geom,
>>            filled_geoms.ppid,
>>            filled_geoms.point_time,
>>            filled_geoms.the_geom,
>>            row_number() OVER ( PARTITION BY missing_geoms.ppid,
>>                                             missing_geoms.point_time
>>                                ORDER BY missing_geoms.ppid,
>>                                         missing_geoms.point_time,
>>                                         filled_geoms.ppid,
>>                                         filled_geoms.point_time DESC)
>>     FROM missing_geoms
>>     LEFT JOIN filled_geoms
>>     ON filled_geoms.point_time < missing_geoms.point_time
>>     AND filled_geoms.ppid = missing_geoms.ppid)
>> --------------
>> SELECT *
>> FROM partitioned
>> WHERE row_number = 1;
>>
>> James
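[Editor's sketch, not part of the original thread: the inequality self-join above pairs every NULL row with every earlier fix for that person, which is where the ~100-billion-row intermediate comes from. The standard gaps-and-islands rewrite avoids the join entirely: a running count of non-NULL geoms labels each NULL row with the group of its last known fix, and first_value() carries that fix's geometry forward. Table and column names (hybrid_location, ppid, point_time, the_geom) follow the thread; this assumes a PostgreSQL version with window functions (8.4+).]

```sql
-- count(the_geom) ignores NULLs, so the running counter only advances
-- on rows with a known position; each counter value groups one known
-- fix together with the NULL rows that follow it.
WITH grouped AS (
    SELECT ppid,
           point_time,
           the_geom,
           count(the_geom) OVER (PARTITION BY ppid
                                 ORDER BY point_time) AS fix_group
    FROM hybrid_location
)
-- Within each (ppid, fix_group) the first row is the last known fix,
-- so first_value() fills its geometry into the NULL rows after it.
SELECT ppid,
       point_time,
       first_value(the_geom) OVER (PARTITION BY ppid, fix_group
                                   ORDER BY point_time) AS filled_geom
FROM grouped;
```

Each row is touched once per window pass instead of once per earlier fix, and rows before a person's first fix correctly keep a NULL filled_geom. The result could then drive the single UPDATE ... FROM that Kevin describes above.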