Thread: newsfeed type query
I'm trying to upgrade some code that powers a newfeed type stream, and hoping someone can offer some insight on better waysto structure some parts of the query The part that has me stumped right now... There are several criteria for why something could appear in a stream. for example, here are 2 handling a posting: * a posting by a friend * a posting in a group the general way I've handled this so far has been simple: select * from posting where author_id in (select friend_id from friends where user_id = ?) or group_id in (select group_idfrom memberships where user_id = ?); now i need to pull in the context of the match (friend, group, both), but I can't figure out how to do this cleanly. 1. if i just add 'case' statements to the select to note the origin, those subselects run again. (ie, the same subqueryis executed twice) 2. if i structure this as a union (and note the origin with a string), it takes a lot more work to integrate and sort the2 separate selects ( eg "select id, timestamp, 'by-friend'" unioned with "in-group") does anyone have ideas on other approaches to structuring this?
Since you very nicely DID NOT provide the pg version, O/S or table structure(s), which is what you should do REGARDLESS of the
type of question (it's just the smart and polite thing to do when asking for help) The best I can suggest is:SELECT
CASE WHEN context = 'friend' THEN p.junka
WHEN context = 'group' THEN p.junkb
WHEN context = 'both' THEN p.junka || ' ' || p.junkb
END
FROM posting p
where p.author_id in (SELECT f.friend_id
FROM friends f
WHERE f.user_id = ?)
OR p.group_id in (SELECT m.group_id
FROM memberships m
WHERE m.user_id = ?);
On Tue, Apr 28, 2015 at 6:26 PM, Jonathan Vanasco <postgres@2xlp.com> wrote:
I'm trying to upgrade some code that powers a newfeed type stream, and hoping someone can offer some insight on better ways to structure some parts of the query
The part that has me stumped right now...
There are several criteria for why something could appear in a stream. for example, here are 2 handling a posting:
* a posting by a friend
* a posting in a group
the general way I've handled this so far has been simple:
select * from posting where author_id in (select friend_id from friends where user_id = ?) or group_id in (select group_id from memberships where user_id = ?);
now i need to pull in the context of the match (friend, group, both), but I can't figure out how to do this cleanly.
1. if i just add 'case' statements to the select to note the origin, those subselects run again. (ie, the same subquery is executed twice)
2. if i structure this as a union (and note the origin with a string), it takes a lot more work to integrate and sort the 2 separate selects ( eg "select id, timestamp, 'by-friend'" unioned with "in-group")
does anyone have ideas on other approaches to structuring this?
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
--
Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

Sorry, I was trying to ask something very abstract as I have similar situations on multiple groups of queries/tables (and they're all much more complex).
I'm on pg 9.3
The relevant structure is:
posting:
id
timestamp_publish
group_id__in
user_id__author
friends:
user_id__a
user_id__b
memberships:
user_id
group_id
role_id
-- working sql
CREATE TABLE groups(
id SERIAL NOT NULL PRIMARY KEY
);
CREATE TABLE users(
id SERIAL NOT NULL PRIMARY KEY
);
CREATE TABLE friends (
user_id__a INT NOT NULL REFERENCES users( id ),
user_id__b INT NOT NULL REFERENCES users( id )
);
CREATE TABLE memberships (
user_id INT NOT NULL REFERENCES users( id ),
group_id INT NOT NULL REFERENCES groups( id ),
role_id INT NOT NULL
);
CREATE TABLE posting (
id SERIAL NOT NULL,
timestamp_publish timestamp not null,
group_id__in INT NOT NULL REFERENCES groups(id),
user_id__author INT NOT NULL REFERENCES users(id),
is_published BOOL
);
The output that I'm trying to get is:
posting.id
{the context of the select}
posting.timestamp_publish (this may need to get correlated into other queries)
These approaches had bad performance:
-- huge selects / memory
-- it needs to load everything from 2 tables before it limits
EXPLAIN ANALYZE
SELECT id, feed_context FROM (
SELECT id, timestamp_publish, 'in-group' AS feed_context FROM posting
WHERE (
group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3))
AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')
)
UNION
SELECT id, timestamp_publish, 'by-user' AS feed_context FROM posting
WHERE (
user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57)
AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')
)
) AS feed
ORDER BY timestamp_publish DESC
LIMIT 10
;
-- selects minimized, but repetitive subqueries
SELECT
id,
CASE
WHEN group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3)) THEN True
ELSE NULL
END AS feed_context_group,
CASE
WHEN user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57) THEN True
ELSE NULL
END AS feed_context_user
FROM posting
WHERE (
group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3))
OR
user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57)
)
AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')
ORDER BY timestamp_publish DESC
LIMIT 10
;
On Apr 28, 2015, at 6:56 PM, Melvin Davidson wrote:
Since you very nicely DID NOT provide the pg version, O/S or table structure(s), which is what you should do REGARDLESS of thetype of question (it's just the smart and polite thing to do when asking for help) The best I can suggest is:
SELECT
CASE WHEN context = 'friend' THEN p.junka
WHEN context = 'group' THEN p.junkb
WHEN context = 'both' THEN p.junka || ' ' || p.junkb
END
FROM posting p
where p.author_id in (SELECT f.friend_id
FROM friends f
WHERE f.user_id = ?)
OR p.group_id in (SELECT m.group_id
FROM memberships m
WHERE m.user_id = ?);
On 4/28/15 6:57 PM, Jonathan Vanasco wrote: > The relevant structure is: > > posting: > id > timestamp_publish > group_id__in > user_id__author > friends: > user_id__a > user_id__b > > memberships: > user_id > group_id > role_id Try this... SELECT ... , f.user_id__b IS NOT NULL AS in_friends , m.user_id IS NOT NULL AS in_group FROM posting p LEFT JOIN friends f ON( f.user_id__b = p.user_id__author ) LEFT JOIN memberships m ON( m.group_id = p.group_id__in ) WHERE is_published AND timestamp_publish ... AND ( f.user_id__a = 57 OR ( m.user_id = 57 AND m.group_id IN (1,2,3) ) ) I'm not sure how fast it'll be though. I suspect your best bet is to put the UNION approach inside a set returning function; that way the ugly is contained in one place. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Hello. On 29.4.2015 00:26, Jonathan Vanasco wrote: > > I'm trying to upgrade some code that powers a newfeed type stream, and hoping someone can offer some insight on betterways to structure some parts of the query > > The part that has me stumped right now... > > There are several criteria for why something could appear in a stream. for example, here are 2 handling a posting: > > * a posting by a friend > * a posting in a group > > the general way I've handled this so far has been simple: > > select * from posting where author_id in (select friend_id from friends where user_id = ?) or group_id in (select group_idfrom memberships where user_id = ?); > > now i need to pull in the context of the match (friend, group, both), but I can't figure out how to do this cleanly. > > 1. if i just add 'case' statements to the select to note the origin, those subselects run again. (ie, the same subqueryis executed twice) > 2. if i structure this as a union (and note the origin with a string), it takes a lot more work to integrate and sort the2 separate selects ( eg "select id, timestamp, 'by-friend'" unioned with "in-group") > > does anyone have ideas on other approaches to structuring this? Dunno if this is optimal (most probably not), but it might be of some help (WARNING: not tested at all): with posting_ids as ( select t.posting_id, bool_or(t.from_friend) as from_friend, bool_or(t.grom_group) as from_group from ( select posting.id as posting_id, true as from_friend, false as from_group from posting where posting.author_id in ( select friend_id from friends where user_id = ? ) union all select posting.id as posting_id, false as from_friend, true as from_group from posting where group_id in ( select group_id from memberships where user_id = ? ) ) t group by t.posting_id ) select posting.*, posting_ids.from_friend, posting_ids.from_group, posting_ids.from_friend or posting_ids.from_group as from_any, posting_ids.from_friend and posting_ids.from_group as from_both from posting join posting_ids on posting.id = posting_ids.posting_id Ladislav Lenart
Hello. On 29.4.2015 01:57, Jonathan Vanasco wrote: > Sorry, I was trying to ask something very abstract as I have similar situations > on multiple groups of queries/tables (and they're all much more complex). > > I'm on pg 9.3 > > The relevant structure is: > > posting: > id > timestamp_publish > group_id__in > user_id__author > > friends: > user_id__a > user_id__b > > memberships: > user_id > group_id > role_id > > > -- working sql > CREATE TABLE groups( > id SERIAL NOT NULL PRIMARY KEY > ); > CREATE TABLE users( > id SERIAL NOT NULL PRIMARY KEY > ); > CREATE TABLE friends ( > user_id__a INT NOT NULL REFERENCES users( id ), > user_id__b INT NOT NULL REFERENCES users( id ) > ); > CREATE TABLE memberships ( > user_id INT NOT NULL REFERENCES users( id ), > group_id INT NOT NULL REFERENCES groups( id ), > role_id INT NOT NULL > ); > CREATE TABLE posting ( > id SERIAL NOT NULL, > timestamp_publish timestamp not null, > group_id__in INT NOT NULL REFERENCES groups(id), > user_id__author INT NOT NULL REFERENCES users(id), > is_published BOOL > ); > > The output that I'm trying to get is: > posting.id > {the context of the select} > posting.timestamp_publish (this may need to get correlated into other queries) > > > These approaches had bad performance: > > -- huge selects / memory > -- it needs to load everything from 2 tables before it limits > EXPLAIN ANALYZE > SELECT id, feed_context FROM ( > SELECT id, timestamp_publish, 'in-group' AS feed_context FROM posting > WHERE ( > group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 > AND role_id IN (1,2,3)) > AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP > AT TIME ZONE 'UTC') > ) > UNION > SELECT id, timestamp_publish, 'by-user' AS feed_context FROM posting > WHERE ( > user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a > = 57) > AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP > AT TIME ZONE 'UTC') > ) > ) AS feed > ORDER BY timestamp_publish DESC > LIMIT 10 > ; I think you can propagate ORDER BY and LIMIT also to the subqueries of the UNION, i.e.: select... from ( ( select... from posting where... -- friends order by timestamp_publish desc limit 10 ) union ( ( select... from posting where... -- groups order by timestamp_publish desc limit 10 ) ) as feed order by timestamp_publish desc limit 10 That might behave better. Ladislav Lenart > -- selects minimized, but repetitive subqueries > SELECT > id, > CASE > WHEN group_id__in IN (SELECT group_id FROM memberships WHERE user_id > = 57 AND role_id IN (1,2,3)) THEN True > ELSE NULL > END AS feed_context_group, > CASE > WHEN user_id__author IN (SELECT user_id__b FROM friends WHERE > user_id__a = 57) THEN True > ELSE NULL > END AS feed_context_user > FROM posting > WHERE ( > group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 > AND role_id IN (1,2,3)) > OR > user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a > = 57) > ) > AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT > TIME ZONE 'UTC') > ORDER BY timestamp_publish DESC > LIMIT 10 > ; > > > > On Apr 28, 2015, at 6:56 PM, Melvin Davidson wrote: > >> Since you very nicely DID NOT provide the pg version, O/S or table >> structure(s), which is what you should do REGARDLESS of the >> type of question (it's just the smart and polite thing to do when asking for >> help) The best I can suggest is: >> SELECT >> CASE WHEN context = 'friend' THEN p.junka >> WHEN context = 'group' THEN p.junkb >> WHEN context = 'both' THEN p.junka || ' ' || p.junkb >> END >> FROM posting p >> where p.author_id in (SELECT f.friend_id >> FROM friends f >> WHERE f.user_id = ?) >> OR p.group_id in (SELECT m.group_id >> FROM memberships m >> WHERE m.user_id = ?); > >
I see others have responded with suggestions to improve query performance,
but one thing I noticed when you gave the data structure is there are no
no primary keys defined for friends or posting, neither are there any indexes.
Was that an omission? If not, then please note that PostgreSQL is a _relational_
database and it is critical to have primary keys and additional indexes for data integrity
and performance.
FYI, defining a foreign key in a table does not automatically generate an associated
index.
I therefore suggest you do the following to improve performance.
ALTER TABLE posting
ADD CONSTRAINT posting_pk PRIMARY KEY ( id );
ALTER TABLE friends
ADD CONSTRAINT friends_pk PRIMARY KEY ( user_id__a, user_id__b );
ALTER TABLE membership
ADD CONSTRAINT membership_pk PRIMARY KEY ( user_id, group_id );
CREATE INDEX posting_group_id__in_idx
on posting
USING BTREE ( group_id__in );
ANALYZE posting;
ANALYZE friends;
ANALYZE membership;
but one thing I noticed when you gave the data structure is there are no
no primary keys defined for friends or posting, neither are there any indexes.
Was that an omission? If not, then please note that PostgreSQL is a _relational_
database and it is critical to have primary keys and additional indexes for data integrity
and performance.
FYI, defining a foreign key in a table does not automatically generate an associated
index.
I therefore suggest you do the following to improve performance.
ALTER TABLE posting
ADD CONSTRAINT posting_pk PRIMARY KEY ( id );
ALTER TABLE friends
ADD CONSTRAINT friends_pk PRIMARY KEY ( user_id__a, user_id__b );
ALTER TABLE membership
ADD CONSTRAINT membership_pk PRIMARY KEY ( user_id, group_id );
CREATE INDEX posting_group_id__in_idx
on posting
USING BTREE ( group_id__in );
ANALYZE posting;
ANALYZE friends;
ANALYZE membership;
On Wed, Apr 29, 2015 at 6:54 AM, Ladislav Lenart <lenartlad@volny.cz> wrote:
Hello.I think you can propagate ORDER BY and LIMIT also to the subqueries of the
On 29.4.2015 01:57, Jonathan Vanasco wrote:
> Sorry, I was trying to ask something very abstract as I have similar situations
> on multiple groups of queries/tables (and they're all much more complex).
>
> I'm on pg 9.3
>
> The relevant structure is:
>
> posting:
> id
> timestamp_publish
> group_id__in
> user_id__author
>
> friends:
> user_id__a
> user_id__b
>
> memberships:
> user_id
> group_id
> role_id
>
>
> -- working sql
> CREATE TABLE groups(
> id SERIAL NOT NULL PRIMARY KEY
> );
> CREATE TABLE users(
> id SERIAL NOT NULL PRIMARY KEY
> );
> CREATE TABLE friends (
> user_id__a INT NOT NULL REFERENCES users( id ),
> user_id__b INT NOT NULL REFERENCES users( id )
> );
> CREATE TABLE memberships (
> user_id INT NOT NULL REFERENCES users( id ),
> group_id INT NOT NULL REFERENCES groups( id ),
> role_id INT NOT NULL
> );
> CREATE TABLE posting (
> id SERIAL NOT NULL,
> timestamp_publish timestamp not null,
> group_id__in INT NOT NULL REFERENCES groups(id),
> user_id__author INT NOT NULL REFERENCES users(id),
> is_published BOOL
> );
>
> The output that I'm trying to get is:
> posting.id
> {the context of the select}
> posting.timestamp_publish (this may need to get correlated into other queries)
>
>
> These approaches had bad performance:
>
> -- huge selects / memory
> -- it needs to load everything from 2 tables before it limits
> EXPLAIN ANALYZE
> SELECT id, feed_context FROM (
> SELECT id, timestamp_publish, 'in-group' AS feed_context FROM posting
> WHERE (
> group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57
> AND role_id IN (1,2,3))
> AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP
> AT TIME ZONE 'UTC')
> )
> UNION
> SELECT id, timestamp_publish, 'by-user' AS feed_context FROM posting
> WHERE (
> user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a
> = 57)
> AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP
> AT TIME ZONE 'UTC')
> )
> ) AS feed
> ORDER BY timestamp_publish DESC
> LIMIT 10
> ;
UNION, i.e.:
select...
from (
(
select...
from posting
where... -- friends
order by timestamp_publish desc
limit 10
) union (
(
select...
from posting
where... -- groups
order by timestamp_publish desc
limit 10
)
) as feed
order by timestamp_publish desc
limit 10
That might behave better.
Ladislav Lenart
> -- selects minimized, but repetitive subqueries
> SELECT
> id,
> CASE
> WHEN group_id__in IN (SELECT group_id FROM memberships WHERE user_id
> = 57 AND role_id IN (1,2,3)) THEN True
> ELSE NULL
> END AS feed_context_group,
> CASE
> WHEN user_id__author IN (SELECT user_id__b FROM friends WHERE
> user_id__a = 57) THEN True
> ELSE NULL
> END AS feed_context_user
> FROM posting
> WHERE (
> group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57
> AND role_id IN (1,2,3))
> OR
> user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a
> = 57)
> )
> AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT
> TIME ZONE 'UTC')
> ORDER BY timestamp_publish DESC
> LIMIT 10
> ;
>
>
>
> On Apr 28, 2015, at 6:56 PM, Melvin Davidson wrote:
>
>> Since you very nicely DID NOT provide the pg version, O/S or table
>> structure(s), which is what you should do REGARDLESS of the
>> type of question (it's just the smart and polite thing to do when asking for
>> help) The best I can suggest is:
>> SELECT
>> CASE WHEN context = 'friend' THEN p.junka
>> WHEN context = 'group' THEN p.junkb
>> WHEN context = 'both' THEN p.junka || ' ' || p.junkb
>> END
>> FROM posting p
>> where p.author_id in (SELECT f.friend_id
>> FROM friends f
>> WHERE f.user_id = ?)
>> OR p.group_id in (SELECT m.group_id
>> FROM memberships m
>> WHERE m.user_id = ?);
>
>
--
Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

Thanks all! These point me in much better directions! Jim Nasby's approach to selecting an expression addressed some things (SELECT f.useraccount_id_b IS NOT NULL AS in_friends) Ladislav Lenart's usage of the CTE is also of a different format that I've used in the past. I think i'll be able to patch together some performance improvements now, that will last until the database structure changes. On Apr 29, 2015, at 6:54 AM, Ladislav Lenart wrote: > I think you can propagate ORDER BY and LIMIT also to the subqueries of the > UNION, i.e.: It behaves a lot better, but doesn't give me the resultset I need. Older data from one subquery is favored to newer datafrom another I use a similar approach on another part of this application -- where the effect on the resultset isn't as pronounced. On that query there are over 100 million total stream events. Not using an inner limit runs the query in 7 minutes; limitingthe inner subquery to 1MM runs in 70 seconds... and limiting to 10k is around 100ms. On Apr 29, 2015, at 10:16 AM, Melvin Davidson wrote: > I see others have responded with suggestions to improve query performance, > but one thing I noticed when you gave the data structure is there are no > no primary keys defined for friends or posting, neither are there any indexes. > Was that an omission? This was a quick functional example to illustrate. The real tables are slightly different but do have pkeys ( 'id' is abigserial, relationship tables (friends, memberships) use a composite key ). They are aggressively indexed and reindexedon various columns for query performance. sometimes we create an extra index that has multiple columns or partial-columnsto make make scans index-only.
Hello. On 29.4.2015 17:27, Jonathan Vanasco wrote: > > Thanks all! These point me in much better directions! > > Jim Nasby's approach to selecting an expression addressed some things (SELECT f.useraccount_id_b IS NOT NULL AS in_friends) > > Ladislav Lenart's usage of the CTE is also of a different format that I've used in the past. > > I think i'll be able to patch together some performance improvements now, that will last until the database structure changes. > > > On Apr 29, 2015, at 6:54 AM, Ladislav Lenart wrote: > >> I think you can propagate ORDER BY and LIMIT also to the subqueries of the >> UNION, i.e.: > > > It behaves a lot better, but doesn't give me the resultset I need. Older data from one subquery is favored to newer datafrom another Hmm, I don't understand why it should behave like that. Imagine the following postings (ts is a relative timestamp): posting ts context p0 0 friend p10 10 group p20 20 friend p30 30 group p40 40 friend p50 50 group p60 60 friend and let's say the LIMIT is 2. Then: * The first subquery (for friends) should return p60 and p40 (in DESC order). * The second subquery (for groups) should return p50 and p30 (in DESC order). * The UNION should return p60 and p50. Could you please explain to me the error(s) in my reasoning? Thank you, Ladislav Lenart > I use a similar approach on another part of this application -- where the effect on the resultset isn't as pronounced. > On that query there are over 100 million total stream events. Not using an inner limit runs the query in 7 minutes; limitingthe inner subquery to 1MM runs in 70 seconds... and limiting to 10k is around 100ms. > > > On Apr 29, 2015, at 10:16 AM, Melvin Davidson wrote: > >> I see others have responded with suggestions to improve query performance, >> but one thing I noticed when you gave the data structure is there are no >> no primary keys defined for friends or posting, neither are there any indexes. >> Was that an omission? > > This was a quick functional example to illustrate. The real tables are slightly different but do have pkeys ( 'id' isa bigserial, relationship tables (friends, memberships) use a composite key ). They are aggressively indexed and reindexedon various columns for query performance. sometimes we create an extra index that has multiple columns or partial-columnsto make make scans index-only. > > > > > > > > >
On Apr 29, 2015, at 12:25 PM, Ladislav Lenart wrote: > Could you please explain to me the error(s) in my reasoning? Let me just flip your list in reverse... and add in some elements (marked with a *): posting ts context p60 60 friend p55 55 friend* p54 54 friend* p50 50 group p50 49 group* p50 49 group* p40 40 friend p30 30 group p20 20 friend p10 10 group p0 0 friend With the 2 limited subqueries, the results would be: 60F, 55F, 50G, 49G But the "most recent" data is 50F, 55F, 54F, 50G So we end up showing 49 which is less relevant than 54. In some situations this isn't much of an issue, but in others it is detrimental. For example, one of my "feeds" contains a distribution of events according-to-type that is very uneven. While "friend" and"group" might be relatively close in time to one another, "system" or other events may be months old -- and that oldercontent gets pulled in with this style of query. If you need to paginate the data and select the next 10 overall items, it gets even more complicated. IIRC, the best mix of performance and "product" that I've found is do something like this: SELECT * FROM ( SELECT a,b,c FROM table_a ORDER BY TIMESTAMP DESC LIMIT 10000; UNION SELECT a,b,c FROM table_b ORDER BY TIMESTAMP DESC LIMIT 10000; ) as unioned order by unioned TIMESTAMP DESC LIMIT 100 OFFSET 0; by creating an artificial limit on the inner queries, you can save postgres from doing a lot of intensive i/o work and memoryusage (like a lot) then, joining a few lists and sorting 20k (or even 100k) items is really cheap. the downside is that you effectively limit the 'relevancy' of the query to whatever the inner limit is (ie, 10000 -- notthe combined total of 20000), but that number can be arbitrarily high enough that it is irrelevant while still showingthe right amount of content for people.
On 29.4.2015 18:54, Jonathan Vanasco wrote: > > On Apr 29, 2015, at 12:25 PM, Ladislav Lenart wrote: > >> Could you please explain to me the error(s) in my reasoning? > > Let me just flip your list in reverse... and add in some elements (marked with a *): > > posting ts context > p60 60 friend > p55 55 friend* > p54 54 friend* > p50 50 group > p50 49 group* > p50 49 group* > p40 40 friend > p30 30 group > p20 20 friend > p10 10 group > p0 0 friend > > With the 2 limited subqueries, the results would be: > 60F, 55F, 50G, 49G > > But the "most recent" data is > 50F, 55F, 54F, 50G > > So we end up showing 49 which is less relevant than 54. I would expect the overall query to return only 60F nad 55F as the most recent data. No? You expect it to return 4 items when the LIMIT is only 2. Remember that the overall query should be also ordered by ts and limited to 2. I thought you want most recent items across all contexts and not 2 most recent items from friends plus two most recent items from groups... Ladislav Lenart > In some situations this isn't much of an issue, but in others it is detrimental. > For example, one of my "feeds" contains a distribution of events according-to-type that is very uneven. While "friend"and "group" might be relatively close in time to one another, "system" or other events may be months old -- and thatolder content gets pulled in with this style of query. > > If you need to paginate the data and select the next 10 overall items, it gets even more complicated. > > IIRC, the best mix of performance and "product" that I've found is do something like this: > > SELECT * FROM ( > SELECT a,b,c FROM table_a ORDER BY TIMESTAMP DESC LIMIT 10000; > UNION > SELECT a,b,c FROM table_b ORDER BY TIMESTAMP DESC LIMIT 10000; > ) as unioned > order by unioned TIMESTAMP DESC LIMIT 100 OFFSET 0; > > by creating an artificial limit on the inner queries, you can save postgres from doing a lot of intensive i/o work andmemory usage (like a lot) > then, joining a few lists and sorting 20k (or even 100k) items is really cheap. > the downside is that you effectively limit the 'relevancy' of the query to whatever the inner limit is (ie, 10000 -- notthe combined total of 20000), but that number can be arbitrarily high enough that it is irrelevant while still showingthe right amount of content for people.
On 4/29/15 11:54 AM, Jonathan Vanasco wrote: > IIRC, the best mix of performance and "product" that I've found is do something like this: > > SELECT * FROM ( > SELECT a,b,c FROM table_a ORDER BY TIMESTAMP DESC LIMIT 10000; > UNION > SELECT a,b,c FROM table_b ORDER BY TIMESTAMP DESC LIMIT 10000; > ) as unioned > order by unioned TIMESTAMP DESC LIMIT 100 OFFSET 0; > > by creating an artificial limit on the inner queries, you can save postgres from doing a lot of intensive i/o work andmemory usage (like a lot) > then, joining a few lists and sorting 20k (or even 100k) items is really cheap. Only because you're using UNION. Use UNION ALL instead. Also, you mentioned CTEs. Be aware that those are ALWAYS materialized. Sometimes that helps performance... sometimes it hurts it horribly. I stick with embedded subselects unless I need a specific CTE feature. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Apr 29, 2015, at 6:50 PM, Jim Nasby wrote:
Only because you're using UNION. Use UNION ALL instead.
On Apr 29, 2015, at 1:18 PM, Ladislav Lenart wrote:
I would expect the overall query to return only 60F nad 55F as the most recent
data. No? You expect it to return 4 items when the LIMIT is only 2. Remember
that the overall query should be also ordered by ts and limited to 2.
You're right. total mistake on my part and confusion with that. I got this query confused with the specifics of a similar one.
On 30.4.2015 19:08, Jonathan Vanasco wrote: > > On Apr 29, 2015, at 6:50 PM, Jim Nasby wrote: > >> Only because you're using UNION. Use UNION ALL instead. > > The difference between "union" and "union all" was negligible. the problem was > in the subselect and the sheer size of the tables, even when we could handle it > as an index-only scan. > > > On Apr 29, 2015, at 1:18 PM, Ladislav Lenart wrote: > >> I would expect the overall query to return only 60F nad 55F as the most recent >> data. No? You expect it to return 4 items when the LIMIT is only 2. Remember >> that the overall query should be also ordered by ts and limited to 2. > > You're right. total mistake on my part and confusion with that. I got this > query confused with the specifics of a similar one. OK :-) Have you managed to solve the problem then? I am interested in your final solution. Thank you, Ladislav Lenart