Home > mailing lists

Re: [GENERAL] DISTINCT vs GROUP BY - was Re: is (not) distinct from - Mailing list pgsql-general

From	Sven R. Kunze
Subject	Re: [GENERAL] DISTINCT vs GROUP BY - was Re: is (not) distinct from
Date	March 3, 2017 12:51:16
Msg-id	b36b9115-8e46-16b7-4b4f-06186cda52a2@mail.de Whole thread Raw
In response to	[GENERAL] DISTINCT vs GROUP BY - was Re: is (not) distinct from (George Neuner <gneuner2@comcast.net>)
List	pgsql-general

Tree view

On 03.03.2017 06:26, George Neuner wrote:

I know most people here don't pay much - or any - attention to
SQLServer, however there was an interesting article recently regarding
significant performance differences between DISTINCT and GROUP BY as
used to remove duplicates.

https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct

On a similar note, this is also an interesting read about the topic of distinct vs group by: https://blogs.oracle.com/developer/entry/counting_with_oracle_is_faster Interesting is the performance difference between integers and strings for PostgreSQL which doesn't exist for Oracle.I also tried rewriting "select distinct" to "select group by" using PostgreSQL. It didn't help; it was even worse (see appendix).

I'll get around to doing some testing soon.  For now, I am just asking
if anyone has ever run into something like this?

Yes, my team did. We use Django on a daily basis to generate SQL queries. In case of model-spanning queries, a lot of joining and duplications are involved. Distinct is the "generally" accepted way to remedy the situation but it's actually more like Tom said: distinct is a band-aid here. UNIONS and SUBSELECTs would be better I guess.Sven** Appendix **>>>># \d docs Table "public.docs" Column | Type | Modifiers --------+---------+--------------------------------------------------- id | integer | not null default nextval('docs_id_seq'::regclass) meta | jsonb | Indexes: "docs_pkey" PRIMARY KEY, btree (id)>>>># explain analyze select count(distinct meta->>'blood_group') from docs; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=760497.00..760497.01 rows=1 width=449) (actual time=37631.727..37631.727 rows=1 loops=1) -> Seq Scan on docs (cost=0.00..710497.00 rows=10000000 width=449) (actual time=0.500..3999.417 rows=10000000 loops=1) Planning time: 0.211 ms Execution time: 37631.829 ms(4 rows)>>>># explain analyze select count(*) from (select meta->>'blood_group' from docs group by meta->>'blood_group') as x; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=4441923.83..4441923.84 rows=1 width=0) (actual time=41189.472..41189.472 rows=1 loops=1) -> Group (cost=4241923.83..4316923.83 rows=10000000 width=449) (actual time=31303.690..41189.455 rows=8 loops=1) Group Key: ((docs.meta ->> 'blood_group'::text)) -> Sort (cost=4241923.83..4266923.83 rows=10000000 width=449) (actual time=31303.686..40475.227 rows=10000000 loops=1) Sort Key: ((docs.meta ->> 'blood_group'::text)) Sort Method: external merge Disk: 129328kB -> Seq Scan on docs (cost=0.00..735497.00 rows=10000000 width=449) (actual time=0.349..6433.691 rows=10000000 loops=1) Planning time: 2.189 ms Execution time: 41203.669 ms(9 rows)

pgsql-general by date:

From: Achilleas Mantzios
Date: 03 March 2017, 12:24:11
Subject: Re: [GENERAL] PG on SSD

From: Geoff Winkless
Date: 03 March 2017, 13:43:32
Subject: Re: [GENERAL] ERROR: functions in index expression must be marked IMMUTABLE

Re: [GENERAL] DISTINCT vs GROUP BY - was Re: is (not) distinct from - Mailing list pgsql-general

Previous

Next