Home > mailing lists
Re: Update with last known location? - Mailing list pgsql-novice

From	James David Smith
Subject	Re: Update with last known location?
Date	January 30, 2014 11:19:56
Msg-id	CAMu32AAD5rQQ35UWVeoKnaxhXW61-Zs8Da6F=jTxgqQpxEk++Q@mail.gmail.com Whole thread
In response to	Re: Update with last known location? (Erik Darling <edarling80@gmail.com>)
Responses	Re: Update with last known location? Re: Update with last known location? Re: Update with last known location?
List	pgsql-novice
Tree view
Hi Erik / all,

I don't think that will work, as what happens if one of the people has
two missing periods of time within their day?

I've made a self-contained example of my problem below. Would you mind
trying to produce in the code below what you think I should do?  Or if
anyone else fancies having a go then please do.

I very much appreciate your help by the way. Thank you. I'm really at
a loss with this. :-(

James
--------------------------------------

DROP TABLE test_data;

CREATE TABLE test_data(
ppid integer,
point_time timestamp without time zone,
the_geom integer);

INSERT INTO test_data VALUES
('1', '2012-01-01 07:00', '1'),
('1', '2012-01-01 07:01', '1'),
('1', '2012-01-01 07:02', '1'),
('1', '2012-01-01 07:03', NULL), -- null should be replaced with 1
('1', '2012-01-01 07:04', NULL), -- null should be replaced with 1
('1', '2012-01-01 07:05', '5'),
('1', '2012-01-01 07:06', '5'),
('1', '2012-01-01 07:07', '5'),
('1', '2012-01-01 07:08', NULL), -- null should be replaced with 5
('1', '2012-01-01 07:09', NULL), -- null should be replaced with 5
('1', '2012-01-01 07:10', NULL), -- null should be replaced with 5
('1', '2012-01-01 07:11', NULL), -- null should be replaced with 5
('2', '2013-05-02 07:12', '24'),
('2', '2013-05-02 07:13', '24'),
('2', '2013-05-02 07:14', '24'),
('2', '2013-05-02 07:15', NULL), -- null should be replaced with 24
('2', '2013-05-02 07:16', NULL), -- null should be replaced with 24
('2', '2013-05-02 07:17', '44'),
('2', '2013-05-02 07:18', '44'),
('2', '2013-05-02 07:19', NULL), -- null should be replaced with 44
('2', '2013-05-02 07:20', '4'),
('2', '2013-05-02 07:21', '4'),
('2', '2013-05-02 07:22', '4');

WITH         missing_geoms AS (
SELECT        ppid,
        point_time,
        the_geom
FROM        test_data
WHERE        the_geom IS NULL)
---
        ,filled_geoms AS (
SELECT        ppid,
        point_time,
        the_geom
FROM        test_data
WHERE        the_geom IS NOT NULL)
---
        ,partitioned AS (
SELECT        missing_geoms.ppid as missing_geoms_ppid,
        missing_geoms.point_time as missing_geoms_point_time,
        missing_geoms.the_geom as missing_geoms_the_geom,
        filled_geoms.ppid as filled_geoms_ppid,
        filled_geoms.point_time as filled_geoms_point_time,
        filled_geoms.the_geom as filled_geoms_the_geom
FROM        missing_geoms
LEFT JOIN    filled_geoms
ON        filled_geoms.point_time < missing_geoms.point_time
AND        filled_geoms.ppid = missing_geoms.ppid
ORDER BY    missing_geoms_ppid,
        missing_geoms_point_time)
---
SELECT        *
FROM        partitioned;





On 29 January 2014 22:10, Erik Darling <edarling80@gmail.com> wrote:
> Hi James,
>
> Yeah, that looks like the right place to me.
>
>
> On Wed, Jan 29, 2014 at 1:14 PM, James David Smith
> <james.david.smith@gmail.com> wrote:
>>
>> Hi Erik,
>>
>> Do you mean in this section of the SQL?
>> .....
>> filled_geoms AS (
>> SELECT
>> ppid,
>> point_time,
>> the_geom
>> FROM
>> hybrid_location
>> WHERE
>> the_geom IS NOT NULL)
>> ...
>>
>> Thanks
>>
>> James
>>
>> On 29 January 2014 17:57, Erik Darling <edarling80@gmail.com> wrote:
>> > Hi James,
>> >
>> > I think you're still stuck with sort of unnecessary ('too much' ) data
>> > coming from the right side of your left join. If so, one option I would
>> > consider is using DENSE_RANK() the way you use ROW_NUMBER(), in the
>> > filled_geoms table. If you partition by id and order by date descending,
>> > you
>> > can do an additional d_rank = 1 filter to only get the most recent
>> > activity.
>> > I believe this is what you want to set your NULL values to, no?
>> >
>> >
>> >
>> >
>> > On Wed, Jan 29, 2014 at 12:41 PM, James David Smith
>> > <james.david.smith@gmail.com> wrote:
>> >>
>> >> On 29 January 2014 16:02, Erik Darling <edarling80@gmail.com> wrote:
>> >> > I would re-suggest using a CTE to contain each dataset to ensure your
>> >> > selects are distilling them correctly, and then using a final query
>> >> > to
>> >> > join
>> >> > them. You can then either update your data directly through the
>> >> > CTE(s),
>> >> > or
>> >> > insert the results to another table to do some further testing. I
>> >> > think
>> >> > you'll find this method presents the data a bit more ergonomically
>> >> > for
>> >> > analysis.
>> >> >
>> >> > http://www.postgresql.org/docs/9.3/static/queries-with.html
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Jan 29, 2014 at 10:45 AM, James David Smith
>> >> > <james.david.smith@gmail.com> wrote:
>> >> >>
>> >> >> Hi Erik/all,
>> >> >>
>> >> >> I just tried that, but it's tricky. The 'extra' data is indeed
>> >> >> coming
>> >> >> from the right side of the join, but it's hard to select only the
>> >> >> max
>> >> >> from it. Maybe it's possible but I've not managed to do it. Here is
>> >> >> where I am, which is so very close.
>> >> >>
>> >> >> SELECT
>> >> >> DISTINCT(a.ppid, a.point_time, a.the_geom) as
>> >> >> row_that_needs_geom_updating,
>> >> >> max(b.point_time) OVER (PARTITION BY a.ppid, a.point_time) as
>> >> >> last_known_position_time
>> >> >> FROM
>> >> >> test a
>> >> >> INNER JOIN
>> >> >> (SELECT    ppid,
>> >> >> point_time,
>> >> >> the_geom
>> >> >> FROM    test
>> >> >> WHERE    the_geom IS NOT NULL) b
>> >> >> ON b.point_time < a.point_time
>> >> >> AND a.ppid = b.ppid
>> >> >> WHERE a.the_geom IS NULL;
>> >> >>
>> >> >> If you see attached screen-print, the output is the rows that I
>> >> >> want.
>> >> >> However I've had to use DISTINCT to stop the duplication. Also I've
>> >> >> not managed to pull through 'the_geom' from the JOIN. I'm not sure
>> >> >> how. Anyone?
>> >> >>
>> >> >> But it's kind of working. :-)
>> >> >>
>> >> >> Worst case if I can't figure out how to solve this in one query I'll
>> >> >> have to store the result of the above, and then use it as a basis
>> >> >> for
>> >> >> another query I think.
>> >> >>
>> >> >> Thanks
>> >> >>
>> >> >> James
>> >> >>
>> >> >>
>> >> >>
>> >> >> On 29 January 2014 12:56, Erik Darling <edarling80@gmail.com> wrote:
>> >> >> > I would try partitioning the second time you call row_number,
>> >> >> > perhaps
>> >> >> > by
>> >> >> > ID,
>> >> >> > and then selecting the MAX() from that, since I think the too much
>> >> >> > data
>> >> >> > you're referring to is coming from the right side of your join.
>> >> >> >
>> >> >> > On Jan 29, 2014 7:23 AM, "James David Smith"
>> >> >> > <james.david.smith@gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> On 28 January 2014 23:15, Gavin Flower
>> >> >> >> <GavinFlower@archidevsys.co.nz>
>> >> >> >> wrote:
>> >> >> >> > On 29/01/14 11:00, Kevin Grittner wrote:
>> >> >> >> >>
>> >> >> >> >> James David Smith <james.david.smith@gmail.com> wrote:
>> >> >> >> >>
>> >> >> >> >>> Given the data is so large I don't want to be taking the data
>> >> >> >> >>> out
>> >> >> >> >>> to a CSV or whatever and then loading it back in. I'd like to
>> >> >> >> >>> do
>> >> >> >> >>> this within the database using SQL. I thought I would be able
>> >> >> >> >>> to
>> >> >> >> >>> do this using a LOOP to be honest.
>> >> >> >> >>
>> >> >> >> >> I would be amazed if you couldn't do this with  a single
>> >> >> >> >> UPDATE
>> >> >> >> >> statement.  I've generally found declarative forms of such
>> >> >> >> >> work
>> >> >> >> >> to
>> >> >> >> >> be at least one order of magnitude faster than going to either
>> >> >> >> >> a
>> >> >> >> >> PL
>> >> >> >> >> or a script approach.  I would start by putting together a
>> >> >> >> >> SELECT
>> >> >> >> >> query using window functions and maybe a CTE or two to list
>> >> >> >> >> all
>> >> >> >> >> the
>> >> >> >> >> primary keys which need updating and the new values they
>> >> >> >> >> should
>> >> >> >> >> have.  Once that SELECT was looking good, I would put it in
>> >> >> >> >> the
>> >> >> >> >> FROM clause of an UPDATE statement.
>> >> >> >> >>
>> >> >> >> >> That should work, but if you are updating a large percentage
>> >> >> >> >> of
>> >> >> >> >> the
>> >> >> >> >> table, I would go one step further before running this against
>> >> >> >> >> the
>> >> >> >> >> production tables.  I would put a LIMIT on the above-mentioned
>> >> >> >> >> SELECT of something like 10000 rows, and script a loop that
>> >> >> >> >> alternates between the UPDATE and a VACUUM ANALYZE on the
>> >> >> >> >> table.
>> >> >> >> >>
>> >> >> >> >> --
>> >> >> >> >> Kevin Grittner
>> >> >> >> >> EDB: http://www.enterprisedb.com
>> >> >> >> >> The Enterprise PostgreSQL Company
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> > James, you might consider dropping as many indexes on the table
>> >> >> >> > as
>> >> >> >> > you
>> >> >> >> > safely can, and rebuilding them after the mass update.  If you
>> >> >> >> > have
>> >> >> >> > lots
>> >> >> >> > of
>> >> >> >> > such indexes, you will find this apprtoach to be a lot faster.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > Cheers,
>> >> >> >> > Gavin
>> >> >> >>
>> >> >> >> Hi all,
>> >> >> >>
>> >> >> >> Thanks for your help and assistance. I think that window
>> >> >> >> functions,
>> >> >> >> and inparticular the PARTITION function, is 100% the way to go.
>> >> >> >> I've
>> >> >> >> been concentrating on a SELECT statement for now and am close but
>> >> >> >> not
>> >> >> >> quite close enough. The below query gets all the data I want, but
>> >> >> >> *too* much. What I've essentially done is:
>> >> >> >>
>> >> >> >> - Select all the rows that don't have any geom information
>> >> >> >> - Join them with all rows before this point that *do* have geom
>> >> >> >> information.
>> >> >> >> - Before doing this join, use partition to generate row numbers.
>> >> >> >>
>> >> >> >> The attached screen grab shows the result of my query below.
>> >> >> >> Unfortunately this is generating alot of joins that I don't want.
>> >> >> >> This
>> >> >> >> won't be practical when doing it with 75,000 people.
>> >> >> >>
>> >> >> >> Thoughts and code suggestions very much appreciated... if needed
>> >> >> >> I
>> >> >> >> could put together some SQL to create an example table?
>> >> >> >>
>> >> >> >> Thanks
>> >> >> >>
>> >> >> >> SELECT     row_number() OVER (PARTITION BY test.point_time ORDER
>> >> >> >> BY
>> >> >> >> test.point_time) as test_row,
>> >> >> >>     test.ppid as test_ppid,
>> >> >> >>     test.point_time as test_point_time,
>> >> >> >>     test.the_geom as test_the_geom,
>> >> >> >>     a.ppid as a_ppid,
>> >> >> >>     a.point_time as a_point_time,
>> >> >> >>     a.the_geom as a_the_geom,
>> >> >> >>     a.a_row
>> >> >> >> FROM     test
>> >> >> >>     LEFT JOIN (
>> >> >> >>         SELECT    the_geom,
>> >> >> >>             ppid,
>> >> >> >>             point_time,
>> >> >> >>             row_number() OVER (ORDER BY ppid, point_time) as
>> >> >> >> a_row
>> >> >> >>         FROM     test
>> >> >> >>         WHERE     the_geom IS NOT NULL) a
>> >> >> >> ON    a.point_time < test.point_time
>> >> >> >> AND    a.ppid = test.ppid
>> >> >> >> WHERE    test.the_geom IS NULL
>> >> >> >> ORDER BY test.point_time)
>> >> >> >>
>> >>
>> >>
>> >> Hi Erik / all,
>> >>
>> >> So I think I've managed to re-write my queries using CTEs. The below
>> >> code now does get me the data that I want from this. But to do so it
>> >> is going to create a frankly huge table in the bit of the SQL where it
>> >> makes the table called 'partitioned'. My rough guess is that it'll
>> >> have to make a table of about 100 billion rows in order to get data I
>> >> need ( about 108 million rows).
>> >>
>> >> Could someone please glance through it for me and suggest how to write
>> >> it more efficiently?
>> >>
>> >> Thanks
>> >>
>> >> James
>> >>
>> >> WITH         missing_geoms AS (
>> >> SELECT        ppid,
>> >>         point_time,
>> >>         the_geom
>> >> FROM        hybrid_location
>> >> WHERE        the_geom IS NULL)
>> >> -----------------
>> >>         ,filled_geoms AS (
>> >> SELECT        ppid,
>> >>         point_time,
>> >>         the_geom
>> >> FROM        hybrid_location
>> >> WHERE        the_geom IS NOT NULL)
>> >> ----------------
>> >>         ,partitioned AS (
>> >> SELECT        missing_geoms.ppid,
>> >>         missing_geoms.point_time,
>> >>         missing_geoms.the_geom,
>> >>         filled_geoms.ppid,
>> >>         filled_geoms.point_time,
>> >>         filled_geoms.the_geom,
>> >>         row_number() OVER (    PARTITION BY     missing_geoms.ppid,
>> >>                             missing_geoms.point_time
>> >>                     ORDER BY    missing_geoms.ppid,
>> >>                             missing_geoms.point_time,
>> >>                             filled_geoms.ppid,
>> >>                             filled_geoms.point_time DESC)
>> >> FROM        missing_geoms
>> >> LEFT JOIN    filled_geoms
>> >> ON        filled_geoms.point_time < missing_geoms.point_time
>> >> AND        filled_geoms.ppid = missing_geoms.ppid)
>> >> --------------
>> >> SELECT         *
>> >> FROM        partitioned
>> >> WHERE        row_number = 1;
>> >>
>> >> James
>> >
>> >
>
>
pgsql-novice by date:
From: Erik Darling
Date: 29 January 2014, 22:10:21
Subject: Re: Update with last known location?
From: Carel Combrink
Date: 30 January 2014, 13:15:12
Subject: BYTEA: PostgreSQL 9.1 vs 9.3
Re: Update with last known location? - Mailing list pgsql-novice

Previous

Next