Thread: newsfeed type query

newsfeed type query

From

Jonathan Vanasco

Date:

28 April 2015, 22:26:41

I'm trying to upgrade some code that powers a newfeed type stream, and hoping someone can offer some insight on better
waysto structure some parts of the query 

The part that has me stumped right now...

There are several criteria for why something could appear in a stream.  for example, here are 2 handling a posting:

    * a posting by a friend
    * a posting in a group

the general way I've handled this so far has been simple:

    select * from posting where author_id in (select friend_id from friends where user_id = ?) or group_id in (select
group_idfrom memberships where user_id = ?); 

now i need to pull in the context of the match (friend, group, both), but I can't figure out how to do this cleanly.

1. if i just add 'case' statements to the select to note the origin, those subselects run again.   (ie, the same
subqueryis executed twice) 
2. if i structure this as a union (and note the origin with a string), it takes a lot more work to integrate and sort
the2 separate selects ( eg "select id, timestamp, 'by-friend'" unioned with "in-group") 

does anyone have ideas on other approaches to structuring this?

Re: newsfeed type query

From

Melvin Davidson

Date:

28 April 2015, 22:56:36

Since you very nicely DID NOT provide the pg version, O/S or table structure(s), which is what you should do REGARDLESS of the

type of question (it's just the smart and polite thing to do when asking for help) The best I can suggest is:
SELECT
CASE WHEN context = 'friend' THEN p.junka
            WHEN context = 'group' THEN p.junkb
            WHEN context = 'both'   THEN p.junka || ' ' || p.junkb
    END
   FROM posting p
where p.author_id in (SELECT f.friend_id
                                    FROM friends f
                                  WHERE f.user_id = ?)
       OR p.group_id in (SELECT m.group_id
                                    FROM memberships m
                                  WHERE m.user_id = ?);

On Tue, Apr 28, 2015 at 6:26 PM, Jonathan Vanasco <postgres@2xlp.com> wrote:

I'm trying to upgrade some code that powers a newfeed type stream, and hoping someone can offer some insight on better ways to structure some parts of the query

The part that has me stumped right now...

There are several criteria for why something could appear in a stream. for example, here are 2 handling a posting:

* a posting by a friend
* a posting in a group

the general way I've handled this so far has been simple:

select * from posting where author_id in (select friend_id from friends where user_id = ?) or group_id in (select group_id from memberships where user_id = ?);

now i need to pull in the context of the match (friend, group, both), but I can't figure out how to do this cleanly.

1. if i just add 'case' statements to the select to note the origin, those subselects run again. (ie, the same subquery is executed twice)
2. if i structure this as a union (and note the origin with a string), it takes a lot more work to integrate and sort the 2 separate selects ( eg "select id, timestamp, 'by-friend'" unioned with "in-group")

does anyone have ideas on other approaches to structuring this?

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

Re: newsfeed type query

From

Jonathan Vanasco

Date:

28 April 2015, 23:57:56

Sorry, I was trying to ask something very abstract as I have similar situations on multiple groups of queries/tables (and they're all much more complex).

I'm on pg 9.3

The relevant structure is:

posting:

timestamp_publish

group_id__in

user_id__author

friends:

user_id__a

user_id__b

memberships:

user_id

group_id

role_id

-- working sql

CREATE TABLE groups(

id SERIAL NOT NULL PRIMARY KEY

);

CREATE TABLE users(

id SERIAL NOT NULL PRIMARY KEY

);

CREATE TABLE friends (

user_id__a INT NOT NULL REFERENCES users( id ),

user_id__b INT NOT NULL REFERENCES users( id )

);

CREATE TABLE memberships (

user_id INT NOT NULL REFERENCES users( id ),

group_id INT NOT NULL REFERENCES groups( id ),

role_id INT NOT NULL

);

CREATE TABLE posting (

id SERIAL NOT NULL,

timestamp_publish timestamp not null,

group_id__in INT NOT NULL REFERENCES groups(id),

user_id__author INT NOT NULL REFERENCES users(id),

is_published BOOL

);

The output that I'm trying to get is:

posting.id

{the context of the select}

posting.timestamp_publish (this may need to get correlated into other queries)

These approaches had bad performance:

-- huge selects / memory

-- it needs to load everything from 2 tables before it limits

EXPLAIN ANALYZE

SELECT id, feed_context FROM (

SELECT id, timestamp_publish, 'in-group' AS feed_context FROM posting

WHERE (

group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3))

AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')

)

UNION

SELECT id, timestamp_publish, 'by-user' AS feed_context FROM posting

WHERE (

user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57)

AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')

)

) AS feed

ORDER BY timestamp_publish DESC

LIMIT 10

;

-- selects minimized, but repetitive subqueries

SELECT

id,

CASE

WHEN group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3)) THEN True

ELSE NULL

END AS feed_context_group,

CASE

WHEN user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57) THEN True

ELSE NULL

END AS feed_context_user

FROM posting

WHERE (

group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3))

user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57)

)

AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')

ORDER BY timestamp_publish DESC

LIMIT 10

;

On Apr 28, 2015, at 6:56 PM, Melvin Davidson wrote:

Since you very nicely DID NOT provide the pg version, O/S or table structure(s), which is what you should do REGARDLESS of the
type of question (it's just the smart and polite thing to do when asking for help) The best I can suggest is:
SELECT
CASE WHEN context = 'friend' THEN p.junka
            WHEN context = 'group' THEN p.junkb
            WHEN context = 'both'   THEN p.junka || ' ' || p.junkb
    END
   FROM posting p
where p.author_id in (SELECT f.friend_id
                                    FROM friends f
                                  WHERE f.user_id = ?)
       OR p.group_id in (SELECT m.group_id
                                    FROM memberships m
                                  WHERE m.user_id = ?);

Re: newsfeed type query

From

Jim Nasby

Date:

29 April 2015, 01:26:49

On 4/28/15 6:57 PM, Jonathan Vanasco wrote:
> The relevant structure is:
>
>      posting:
>          id
>          timestamp_publish
>          group_id__in
>          user_id__author
>      friends:
>          user_id__a
>          user_id__b
>
>      memberships:
>          user_id
>          group_id
>          role_id

Try this...

SELECT ...
     , f.user_id__b IS NOT NULL AS in_friends
     , m.user_id IS NOT NULL AS in_group
   FROM posting p
     LEFT JOIN friends f ON( f.user_id__b = p.user_id__author )
     LEFT JOIN memberships m ON( m.group_id = p.group_id__in )
   WHERE is_published AND timestamp_publish ...
     AND (
       f.user_id__a = 57
       OR ( m.user_id = 57 AND m.group_id IN (1,2,3) )
     )

I'm not sure how fast it'll be though. I suspect your best bet is to put
the UNION approach inside a set returning function; that way the ugly is
contained in one place.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: newsfeed type query

From

Ladislav Lenart

Date:

29 April 2015, 10:25:12

Hello.


On 29.4.2015 00:26, Jonathan Vanasco wrote:
>
> I'm trying to upgrade some code that powers a newfeed type stream, and hoping someone can offer some insight on
betterways to structure some parts of the query 
>
> The part that has me stumped right now...
>
> There are several criteria for why something could appear in a stream.  for example, here are 2 handling a posting:
>
>     * a posting by a friend
>     * a posting in a group
>
> the general way I've handled this so far has been simple:
>
>     select * from posting where author_id in (select friend_id from friends where user_id = ?) or group_id in (select
group_idfrom memberships where user_id = ?); 
>
> now i need to pull in the context of the match (friend, group, both), but I can't figure out how to do this cleanly.
>
> 1. if i just add 'case' statements to the select to note the origin, those subselects run again.   (ie, the same
subqueryis executed twice) 
> 2. if i structure this as a union (and note the origin with a string), it takes a lot more work to integrate and sort
the2 separate selects ( eg "select id, timestamp, 'by-friend'" unioned with "in-group") 
>
> does anyone have ideas on other approaches to structuring this?

Dunno if this is optimal (most probably not), but it might be of some help
(WARNING: not tested at all):

with
posting_ids as (
    select
        t.posting_id,
        bool_or(t.from_friend) as from_friend,
        bool_or(t.grom_group) as from_group
    from (
        select
            posting.id as posting_id,
            true as from_friend,
            false as from_group
        from posting
        where posting.author_id in (
            select friend_id from friends where user_id = ?
        )
        union all
        select
            posting.id as posting_id,
            false as from_friend,
            true as from_group
        from posting
        where group_id in (
            select group_id from memberships where user_id = ?
        )
    ) t
    group by t.posting_id
)
select
    posting.*,
    posting_ids.from_friend,
    posting_ids.from_group,
    posting_ids.from_friend or posting_ids.from_group as from_any,
    posting_ids.from_friend and posting_ids.from_group as from_both
from
    posting
    join posting_ids on posting.id = posting_ids.posting_id


Ladislav Lenart

Re: newsfeed type query

From

Ladislav Lenart

Date:

29 April 2015, 10:55:02

Hello.

On 29.4.2015 01:57, Jonathan Vanasco wrote:
> Sorry, I was trying to ask something very abstract as I have similar situations
> on multiple groups of queries/tables (and they're all much more complex).
>
> I'm on pg 9.3
>
> The relevant structure is:
>
>     posting:
>         id
>         timestamp_publish
>         group_id__in
>         user_id__author
>
>     friends:
>         user_id__a
>         user_id__b
>
>     memberships:
>         user_id
>         group_id
>         role_id
>
>
> -- working sql
>                 CREATE TABLE groups(
>                     id SERIAL NOT NULL PRIMARY KEY
>                 );
>                 CREATE TABLE users(
>                     id SERIAL NOT NULL PRIMARY KEY
>                 );
>                 CREATE TABLE friends (
>                     user_id__a INT NOT NULL REFERENCES users( id ),
>                     user_id__b INT NOT NULL REFERENCES users( id )
>                 );
>                 CREATE TABLE memberships (
>                     user_id INT NOT NULL REFERENCES users( id ),
>                     group_id INT NOT NULL REFERENCES groups( id ),
>                     role_id INT NOT NULL
>                 );
>                 CREATE TABLE posting (
>                     id SERIAL NOT NULL,
>                     timestamp_publish timestamp not null,
>                     group_id__in INT NOT NULL REFERENCES groups(id),
>                     user_id__author INT NOT NULL REFERENCES users(id),
>                     is_published BOOL
>                 );
>
> The output that I'm trying to get is:
>     posting.id
>     {the context of the select}
>     posting.timestamp_publish (this may need to get correlated into other queries)
>
>
> These approaches had bad performance:
>
>     -- huge selects / memory
>     -- it needs to load everything from 2 tables before it limits
>     EXPLAIN ANALYZE
>     SELECT id, feed_context FROM (
>         SELECT  id, timestamp_publish, 'in-group' AS feed_context FROM posting
>         WHERE (
>             group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57
> AND role_id IN (1,2,3))
>             AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP
> AT TIME ZONE 'UTC')
>             )
>         UNION
>         SELECT  id, timestamp_publish, 'by-user' AS feed_context FROM posting
>         WHERE (
>             user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a
> = 57)
>             AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP
> AT TIME ZONE 'UTC')
>             )
>     ) AS feed
>     ORDER BY  timestamp_publish DESC
>     LIMIT 10
>     ;


I think you can propagate ORDER BY and LIMIT also to the subqueries of the
UNION, i.e.:

select...
from (
    (
        select...
        from posting
        where... -- friends
        order by timestamp_publish desc
        limit 10
    ) union (
    (
        select...
        from posting
        where... -- groups
        order by timestamp_publish desc
        limit 10
    )
) as feed
order by timestamp_publish desc
limit 10

That might behave better.


Ladislav Lenart


>     -- selects minimized, but repetitive subqueries
>     SELECT
>         id,
>         CASE
>             WHEN group_id__in IN (SELECT group_id FROM memberships WHERE user_id
> = 57 AND role_id IN (1,2,3)) THEN True
>             ELSE NULL
>         END AS feed_context_group,
>         CASE
>             WHEN user_id__author IN (SELECT user_id__b FROM friends WHERE
> user_id__a = 57) THEN True
>             ELSE NULL
>         END AS feed_context_user
>         FROM posting
>     WHERE (
>             group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57
> AND role_id IN (1,2,3))
>             OR
>             user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a
> = 57)
>         )
>         AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT
> TIME ZONE 'UTC')
>     ORDER BY  timestamp_publish DESC
>     LIMIT 10
>     ;
>
>
>
> On Apr 28, 2015, at 6:56 PM, Melvin Davidson wrote:
>
>> Since you very nicely DID NOT provide the pg version, O/S or table
>> structure(s), which is what you should do REGARDLESS of the
>> type of question (it's just the smart and polite thing to do when asking for
>> help) The best  I can suggest is:
>> SELECT
>>   CASE WHEN context = 'friend' THEN p.junka
>>             WHEN context = 'group' THEN p.junkb
>>             WHEN context = 'both'   THEN p.junka || ' ' || p.junkb
>>     END
>>    FROM posting p
>>   where p.author_id in (SELECT f.friend_id
>>                                     FROM friends f
>>                                   WHERE f.user_id = ?)
>>        OR p.group_id in (SELECT m.group_id
>>                                     FROM memberships m
>>                                   WHERE m.user_id = ?);
>
>

Re: newsfeed type query

From

Melvin Davidson

Date:

29 April 2015, 14:16:10

I see others have responded with suggestions to improve query performance,
but one thing I noticed when you gave the data structure is there are no
no primary keys defined for friends or posting, neither are there any indexes.
Was that an omission? If not, then please note that PostgreSQL is a _relational_
database and it is critical to have primary keys and additional indexes for data integrity
and performance.

FYI, defining a foreign key in a table does not automatically generate an associated
index.

I therefore suggest you do the following to improve performance.

ALTER TABLE posting
ADD CONSTRAINT posting_pk PRIMARY KEY ( id );

ALTER TABLE friends
ADD CONSTRAINT friends_pk PRIMARY KEY ( user_id__a, user_id__b );

ALTER TABLE membership
ADD CONSTRAINT membership_pk PRIMARY KEY ( user_id, group_id );

CREATE INDEX posting_group_id__in_idx
on posting
USING BTREE ( group_id__in );

ANALYZE posting;
ANALYZE friends;
ANALYZE membership;

On Wed, Apr 29, 2015 at 6:54 AM, Ladislav Lenart <lenartlad@volny.cz> wrote:

Hello.

On 29.4.2015 01:57, Jonathan Vanasco wrote:
> Sorry, I was trying to ask something very abstract as I have similar situations
> on multiple groups of queries/tables (and they're all much more complex).
>
> I'm on pg 9.3
>
> The relevant structure is:
>
> posting:
> id
> timestamp_publish
> group_id__in
> user_id__author
>
> friends:
> user_id__a
> user_id__b
>
> memberships:
> user_id
> group_id
> role_id
>
>
> -- working sql
> CREATE TABLE groups(
> id SERIAL NOT NULL PRIMARY KEY
> );
> CREATE TABLE users(
> id SERIAL NOT NULL PRIMARY KEY
> );
> CREATE TABLE friends (
> user_id__a INT NOT NULL REFERENCES users( id ),
> user_id__b INT NOT NULL REFERENCES users( id )
> );
> CREATE TABLE memberships (
> user_id INT NOT NULL REFERENCES users( id ),
> group_id INT NOT NULL REFERENCES groups( id ),
> role_id INT NOT NULL
> );
> CREATE TABLE posting (
> id SERIAL NOT NULL,
> timestamp_publish timestamp not null,
> group_id__in INT NOT NULL REFERENCES groups(id),
> user_id__author INT NOT NULL REFERENCES users(id),
> is_published BOOL
> );
>
> The output that I'm trying to get is:
> posting.id
> {the context of the select}
> posting.timestamp_publish (this may need to get correlated into other queries)
>
>
> These approaches had bad performance:
>
> -- huge selects / memory
> -- it needs to load everything from 2 tables before it limits
> EXPLAIN ANALYZE
> SELECT id, feed_context FROM (
> SELECT id, timestamp_publish, 'in-group' AS feed_context FROM posting
> WHERE (
> group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57
> AND role_id IN (1,2,3))
> AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP
> AT TIME ZONE 'UTC')
> )
> UNION
> SELECT id, timestamp_publish, 'by-user' AS feed_context FROM posting
> WHERE (
> user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a
> = 57)
> AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP
> AT TIME ZONE 'UTC')
> )
> ) AS feed
> ORDER BY timestamp_publish DESC
> LIMIT 10
> ;

I think you can propagate ORDER BY and LIMIT also to the subqueries of the
UNION, i.e.:

select...
from (
(
select...
from posting
where... -- friends
order by timestamp_publish desc
limit 10
) union (
(
select...
from posting
where... -- groups
order by timestamp_publish desc
limit 10
)
) as feed
order by timestamp_publish desc
limit 10

That might behave better.

Ladislav Lenart

> -- selects minimized, but repetitive subqueries
> SELECT
> id,
> CASE
> WHEN group_id__in IN (SELECT group_id FROM memberships WHERE user_id
> = 57 AND role_id IN (1,2,3)) THEN True
> ELSE NULL
> END AS feed_context_group,
> CASE
> WHEN user_id__author IN (SELECT user_id__b FROM friends WHERE
> user_id__a = 57) THEN True
> ELSE NULL
> END AS feed_context_user
> FROM posting
> WHERE (
> group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57
> AND role_id IN (1,2,3))
> OR
> user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a
> = 57)
> )
> AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT
> TIME ZONE 'UTC')
> ORDER BY timestamp_publish DESC
> LIMIT 10
> ;
>
>
>
> On Apr 28, 2015, at 6:56 PM, Melvin Davidson wrote:
>
>> Since you very nicely DID NOT provide the pg version, O/S or table
>> structure(s), which is what you should do REGARDLESS of the
>> type of question (it's just the smart and polite thing to do when asking for
>> help) The best I can suggest is:
>> SELECT
>> CASE WHEN context = 'friend' THEN p.junka
>> WHEN context = 'group' THEN p.junkb
>> WHEN context = 'both' THEN p.junka || ' ' || p.junkb
>> END
>> FROM posting p
>> where p.author_id in (SELECT f.friend_id
>> FROM friends f
>> WHERE f.user_id = ?)
>> OR p.group_id in (SELECT m.group_id
>> FROM memberships m
>> WHERE m.user_id = ?);
>
>

Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

Re: newsfeed type query

From

Jonathan Vanasco

Date:

29 April 2015, 15:28:09

Thanks all!  These point me in much better directions!

Jim Nasby's approach to selecting an expression addressed some things (SELECT f.useraccount_id_b IS NOT NULL AS
in_friends)

Ladislav Lenart's usage of the CTE is also of a different format that I've used in the past.

I think i'll be able to patch together some performance improvements now, that will last until the database structure
changes.  

On Apr 29, 2015, at 6:54 AM, Ladislav Lenart wrote:

> I think you can propagate ORDER BY and LIMIT also to the subqueries of the
> UNION, i.e.:

It behaves a lot better, but doesn't give me the resultset I need.  Older data from one subquery is favored to newer
datafrom another 

I use a similar approach on another part of this application -- where the effect on the resultset isn't as pronounced.

On that query there are over 100 million total stream events.  Not using an inner limit runs the query in 7 minutes;
limitingthe inner subquery to 1MM runs in 70 seconds... and limiting to 10k is around 100ms.   

On Apr 29, 2015, at 10:16 AM, Melvin Davidson wrote:

> I see others have responded with suggestions to improve query performance,
> but one thing I noticed when you gave the data structure is there are no
> no primary keys defined for friends or posting,  neither are there any indexes.
> Was that an omission?

This was a quick functional example to illustrate.  The real tables are slightly different but do have pkeys ( 'id' is
abigserial, relationship tables (friends, memberships) use a composite key ).  They are aggressively indexed and
reindexedon various columns for query performance.  sometimes we create an extra index that has multiple columns or
partial-columnsto make make scans index-only.

Re: newsfeed type query

From

Ladislav Lenart

Date:

29 April 2015, 16:25:59

Hello.


On 29.4.2015 17:27, Jonathan Vanasco wrote:
>
> Thanks all!  These point me in much better directions!
>
> Jim Nasby's approach to selecting an expression addressed some things (SELECT f.useraccount_id_b IS NOT NULL AS
in_friends)
>
> Ladislav Lenart's usage of the CTE is also of a different format that I've used in the past.
>
> I think i'll be able to patch together some performance improvements now, that will last until the database structure
changes.  
>
>
> On Apr 29, 2015, at 6:54 AM, Ladislav Lenart wrote:
>
>> I think you can propagate ORDER BY and LIMIT also to the subqueries of the
>> UNION, i.e.:
>
>
> It behaves a lot better, but doesn't give me the resultset I need.  Older data from one subquery is favored to newer
datafrom another 

Hmm, I don't understand why it should behave like that. Imagine the following
postings (ts is a relative timestamp):

posting  ts  context
p0        0  friend
p10      10  group
p20      20  friend
p30      30  group
p40      40  friend
p50      50  group
p60      60  friend

and let's say the LIMIT is 2. Then:
* The first subquery (for friends) should return p60 and p40 (in DESC order).
* The second subquery (for groups) should return p50 and p30 (in DESC order).
* The UNION should return p60 and p50.

Could you please explain to me the error(s) in my reasoning?

Thank you,

Ladislav Lenart


> I use a similar approach on another part of this application -- where the effect on the resultset isn't as
pronounced.  
> On that query there are over 100 million total stream events.  Not using an inner limit runs the query in 7 minutes;
limitingthe inner subquery to 1MM runs in 70 seconds... and limiting to 10k is around 100ms.   
>
>
> On Apr 29, 2015, at 10:16 AM, Melvin Davidson wrote:
>
>> I see others have responded with suggestions to improve query performance,
>> but one thing I noticed when you gave the data structure is there are no
>> no primary keys defined for friends or posting,  neither are there any indexes.
>> Was that an omission?
>
> This was a quick functional example to illustrate.  The real tables are slightly different but do have pkeys ( 'id'
isa bigserial, relationship tables (friends, memberships) use a composite key ).  They are aggressively indexed and
reindexedon various columns for query performance.  sometimes we create an extra index that has multiple columns or
partial-columnsto make make scans index-only. 
>
>
>
>
>
>
>
>
>

Re: newsfeed type query

From

Jonathan Vanasco

Date:

29 April 2015, 17:01:01

On Apr 29, 2015, at 12:25 PM, Ladislav Lenart wrote:

> Could you please explain to me the error(s) in my reasoning?

Let me just flip your list in reverse... and add in some elements (marked with a *):

posting  ts  context
p60      60  friend
p55      55 friend*
p54      54 friend*
p50      50  group
p50      49  group*
p50      49  group*
p40      40  friend
p30      30  group
p20      20  friend
p10      10  group
p0        0  friend

With the 2 limited subqueries, the results would be:
    60F, 55F, 50G, 49G

But the "most recent" data is
    50F, 55F, 54F, 50G

So we end up showing 49 which is less relevant than 54.

In some situations this isn't much of an issue, but in others it is detrimental.
For example, one of my "feeds" contains a distribution of events according-to-type that is very uneven.  While "friend"
and"group" might be relatively close in time to one another, "system" or other events may be months old -- and that
oldercontent gets pulled in with this style of query.   

If you need to paginate the data and select the next 10 overall items, it gets even more complicated.

IIRC, the best mix of performance and "product" that I've found is do something like this:

    SELECT * FROM (
        SELECT a,b,c FROM table_a ORDER BY TIMESTAMP DESC LIMIT 10000;
        UNION
        SELECT a,b,c FROM table_b ORDER BY TIMESTAMP DESC LIMIT 10000;
    ) as unioned
     order by unioned TIMESTAMP DESC LIMIT 100 OFFSET 0;

by creating an artificial limit on the inner queries, you can save postgres from doing a lot of intensive i/o work and
memoryusage (like a lot) 
then, joining a few lists and sorting 20k (or even 100k) items is really cheap.
the downside is that you effectively limit the 'relevancy' of the query to whatever the inner limit is (ie, 10000 --
notthe combined total of 20000), but that number can be arbitrarily high enough that it is irrelevant while still
showingthe right amount of content for people.

Re: newsfeed type query

From

Ladislav Lenart

Date:

29 April 2015, 17:18:42

On 29.4.2015 18:54, Jonathan Vanasco wrote:
>
> On Apr 29, 2015, at 12:25 PM, Ladislav Lenart wrote:
>
>> Could you please explain to me the error(s) in my reasoning?
>
> Let me just flip your list in reverse... and add in some elements (marked with a *):
>
> posting  ts  context
> p60      60  friend
> p55      55 friend*
> p54      54 friend*
> p50      50  group
> p50      49  group*
> p50      49  group*
> p40      40  friend
> p30      30  group
> p20      20  friend
> p10      10  group
> p0        0  friend
>
> With the 2 limited subqueries, the results would be:
>     60F, 55F, 50G, 49G
>
> But the "most recent" data is
>     50F, 55F, 54F, 50G
>
> So we end up showing 49 which is less relevant than 54.

I would expect the overall query to return only 60F nad 55F as the most recent
data. No? You expect it to return 4 items when the LIMIT is only 2. Remember
that the overall query should be also ordered by ts and limited to 2.

I thought you want most recent items across all contexts and not 2 most recent
items from friends plus two most recent items from groups...


Ladislav Lenart


> In some situations this isn't much of an issue, but in others it is detrimental.
> For example, one of my "feeds" contains a distribution of events according-to-type that is very uneven.  While
"friend"and "group" might be relatively close in time to one another, "system" or other events may be months old -- and
thatolder content gets pulled in with this style of query.   
>
> If you need to paginate the data and select the next 10 overall items, it gets even more complicated.
>
> IIRC, the best mix of performance and "product" that I've found is do something like this:
>
>     SELECT * FROM (
>         SELECT a,b,c FROM table_a ORDER BY TIMESTAMP DESC LIMIT 10000;
>         UNION
>         SELECT a,b,c FROM table_b ORDER BY TIMESTAMP DESC LIMIT 10000;
>     ) as unioned
>      order by unioned TIMESTAMP DESC LIMIT 100 OFFSET 0;
>
> by creating an artificial limit on the inner queries, you can save postgres from doing a lot of intensive i/o work
andmemory usage (like a lot) 
> then, joining a few lists and sorting 20k (or even 100k) items is really cheap.
> the downside is that you effectively limit the 'relevancy' of the query to whatever the inner limit is (ie, 10000 --
notthe combined total of 20000), but that number can be arbitrarily high enough that it is irrelevant while still
showingthe right amount of content for people.

Re: newsfeed type query

From

Jim Nasby

Date:

29 April 2015, 22:51:01

On 4/29/15 11:54 AM, Jonathan Vanasco wrote:
> IIRC, the best mix of performance and "product" that I've found is do something like this:
>
>     SELECT * FROM (
>         SELECT a,b,c FROM table_a ORDER BY TIMESTAMP DESC LIMIT 10000;
>         UNION
>         SELECT a,b,c FROM table_b ORDER BY TIMESTAMP DESC LIMIT 10000;
>     ) as unioned
>       order by unioned TIMESTAMP DESC LIMIT 100 OFFSET 0;
>
> by creating an artificial limit on the inner queries, you can save postgres from doing a lot of intensive i/o work
andmemory usage (like a lot) 
> then, joining a few lists and sorting 20k (or even 100k) items is really cheap.

Only because you're using UNION. Use UNION ALL instead.

Also, you mentioned CTEs. Be aware that those are ALWAYS materialized.
Sometimes that helps performance... sometimes it hurts it horribly. I
stick with embedded subselects unless I need a specific CTE feature.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: newsfeed type query

From

Jonathan Vanasco

Date:

30 April 2015, 17:08:28

On Apr 29, 2015, at 6:50 PM, Jim Nasby wrote:

Only because you're using UNION. Use UNION ALL instead.

The difference between "union" and "union all" was negligible. the problem was in the subselect and the sheer size of the tables, even when we could handle it as an index-only scan.

On Apr 29, 2015, at 1:18 PM, Ladislav Lenart wrote:

I would expect the overall query to return only 60F nad 55F as the most recent
data. No? You expect it to return 4 items when the LIMIT is only 2. Remember
that the overall query should be also ordered by ts and limited to 2.

You're right. total mistake on my part and confusion with that. I got this query confused with the specifics of a similar one.

Re: newsfeed type query

From

Ladislav Lenart

Date:

30 April 2015, 18:05:41

On 30.4.2015 19:08, Jonathan Vanasco wrote:
>
> On Apr 29, 2015, at 6:50 PM, Jim Nasby wrote:
>
>> Only because you're using UNION. Use UNION ALL instead.
>
> The difference between "union" and "union all" was negligible.  the problem was
> in the subselect and the sheer size of the tables, even when we could handle it
> as an index-only scan.
>
>
> On Apr 29, 2015, at 1:18 PM, Ladislav Lenart wrote:
>
>> I would expect the overall query to return only 60F nad 55F as the most recent
>> data. No? You expect it to return 4 items when the LIMIT is only 2. Remember
>> that the overall query should be also ordered by ts and limited to 2.
>
> You're right. total mistake on my part and confusion with that.  I got this
> query confused with the specifics of a similar one.

OK :-) Have you managed to solve the problem then? I am interested in your final
solution.


Thank you,

Ladislav Lenart