Re: I'd like to discuss scaleout at PGCon - Mailing list pgsql-hackers
| From | Pavel Stehule |
|---|---|
| Subject | Re: I'd like to discuss scaleout at PGCon |
| Date | |
| Msg-id | CAFj8pRDRdUn-PsD5A8aYHt4JbsmHqUnGHsx5KE_35GhHmbZh+g@mail.gmail.com |
| In response to | Re: I'd like to discuss scaleout at PGCon (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>) |
| List | pgsql-hackers |
2018-06-06 10:58 GMT+02:00 Konstantin Knizhnik <k.knizhnik@postgrespro.ru>:
On 05.06.2018 20:17, MauMau wrote:
> From: Merlin Moncure
>> FWIW, Distributed analytical queries is the right market to be in.
>> This is the field in which I work, and this is where the action is at.
>> I am very, very, sure about this. My view is that many of the
>> existing solutions to this problem (in particular hadoop class
>> solutions) have major architectural downsides that make them
>> inappropriate in use cases that postgres really shines at; direct
>> hookups to low latency applications for example. postgres is
>> fundamentally a more capable 'node' with its multiple man-millennia of
>> engineering behind it. Unlimited vertical scaling (RAC etc) is
>> interesting too, but this is not the way the market is moving as
>> hardware advancements have reduced or eliminated the need for that in
>> many spheres.
>
> I'm feeling the same. As Moore's Law ceases to hold, software
> needs to make the most of the processor power. Hadoop and Spark are
> written in Java and Scala. According to Google [1] (see Fig. 8), Java
> is slower than C++ by 3.7x - 12.6x, and Scala is slower than C++ by
> 2.5x - 3.6x.
>
> Won't PostgreSQL be able to cover the workloads of Hadoop and Spark
> someday, when PostgreSQL supports scaleout, in-memory database,
> multi-model capability, and in-database filesystem? That may be a
> pipedream, but why do people have to tolerate the separation of the
> relational-based data warehouse and the Hadoop-based data lake?
>
> [1] Robert Hundt. "Loop Recognition in C++/Java/Go/Scala".
> Proceedings of Scala Days 2011
>
> Regards
> MauMau

I cannot completely agree with it. I have done a lot of benchmarking of PostgreSQL, CitusDB, SparkSQL and native C/Scala code generated for TPC-H queries.
The picture is not so obvious... All these systems provide different scalability and so show their best performance on different hardware configurations.
Also, the Java JIT has made good progress since 2011: computation-intensive code (like matrix multiplication) implemented in Java is about 2 times slower than optimized C code.
But DBMSes are rarely CPU-bound. Even if the whole database fits in memory (which is not a common scenario for big data applications), modern CPUs are much faster than RAM access anyway... Java applications are slower than C/C++ mostly because of garbage collection. This is why SparkSQL is moving to an off-heap approach, where objects are allocated outside the Java heap and so do not affect the Java GC. New versions of SparkSQL with off-heap memory and native code generation show very good performance. And high scalability has always been one of the major features of SparkSQL.
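As a concrete illustration of the off-heap approach, here is a minimal sketch, not taken from the thread (the application name, master URL and the 2g size are arbitrary example values); Spark exposes off-heap memory through two configuration properties on the SparkSession:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: enable Spark's off-heap (Tungsten) memory so that row data
// is kept outside the Java heap and is therefore not scanned by the GC.
// appName, master and the 2g size are arbitrary example values.
val spark = SparkSession.builder()
  .appName("offheap-example")
  .master("local[*]")
  .config("spark.memory.offHeap.enabled", "true") // allocate execution/storage memory off-heap
  .config("spark.memory.offHeap.size", "2g")      // how much off-heap memory to reserve
  .getOrCreate()
```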
So it is naive to expect that Postgres will be 4 times faster than SparkSQL on analytic queries just because it is written in C and SparkSQL in Scala.
Postgres has made very good progress in OLAP support in recent releases: it now supports parallel query execution, JIT compilation, partitioning...
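For anyone who wants to try these features on their own installation, the following is a rough sketch only (the connection string, credentials and the table name big_table are made-up examples, the PostgreSQL JDBC driver is assumed to be on the classpath, and the jit setting needs PostgreSQL 11 or later): it enables parallel execution and JIT for one session and prints the resulting plan.

```scala
import java.sql.DriverManager

// Sketch: connection details and big_table are illustrative assumptions.
val conn = DriverManager.getConnection(
  "jdbc:postgresql://localhost:5432/postgres", "postgres", "")
val st = conn.createStatement()

st.execute("SET max_parallel_workers_per_gather = 4") // allow parallel workers for this session
st.execute("SET jit = on")                            // enable JIT compilation (PostgreSQL 11+)

// EXPLAIN ANALYZE returns one text row per plan line.
val rs = st.executeQuery("EXPLAIN (ANALYZE) SELECT count(*) FROM big_table")
while (rs.next()) println(rs.getString(1))

conn.close()
```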
But its scalability is still very limited compared with SparkSQL. I am not sure about Greenplum with its sophisticated distributed query optimizer, but
most other OLAP solutions for Postgres are not able to efficiently handle complex queries (with a lot of joins on non-partitioning keys).
I do not want to say that it is impossible to implement a good analytic platform for OLAP on top of Postgres. But it is a very challenging task.
And IMHO the choice of programming language is not so important. What is more important is the format in which the data is stored. The best systems for data analytics (Vertica, HyPer, KDB, ...)
use a vertical (columnar) data model. SparkSQL also uses the Parquet file format, which provides efficient extraction and processing of data.
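To make the columnar point concrete, here is a small sketch (reusing the SparkSession `spark` from the earlier example; the path and column names are made up) showing that a query over Parquet only has to read the columns it actually touches:

```scala
// Assumes the SparkSession `spark` from the previous sketch; path and
// column names are illustrative only.
import spark.implicits._

val orders = Seq((1, "2018-06-01", 10.5), (2, "2018-06-02", 20.0))
  .toDF("o_orderkey", "o_orderdate", "o_totalprice")

// Parquet stores each column contiguously, with per-column compression.
orders.write.mode("overwrite").parquet("/tmp/orders_parquet")

// Thanks to column pruning, only o_totalprice is read back from disk.
spark.read.parquet("/tmp/orders_parquet")
  .selectExpr("sum(o_totalprice)")
  .show()
```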
With an abstract storage API, Postgres also gets a chance to implement efficient storage for OLAP data processing. But a huge amount of work has to be done here.
Unfortunately, storage is only one factor. For good performance, columnar storage also needs a different executor. Still, smart columnar storage can achieve a very good compression ratio, so it can make sense on its own.
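As a toy illustration of that point (nothing here comes from the thread; the OrderRow class and the data are made up), an aggregate over a columnar layout runs over a flat primitive array, while a tuple-at-a-time executor has to touch one object per row:

```scala
// Toy comparison: the same SUM over a row layout vs. a column layout.
// The columnar loop iterates over a flat Array[Double], which is
// cache-friendly and easy to vectorize; the row loop dereferences one
// object per tuple.
case class OrderRow(orderKey: Int, totalPrice: Double)

val rows: Array[OrderRow] = Array.tabulate(1000000)(i => OrderRow(i, i * 0.5))
val priceColumn: Array[Double] = rows.map(_.totalPrice)

// Row-at-a-time style (tuple-oriented executor)
var rowSum = 0.0
for (r <- rows) rowSum += r.totalPrice

// Column-at-a-time style (columnar/vectorized executor)
var colSum = 0.0
var i = 0
while (i < priceColumn.length) { colSum += priceColumn(i); i += 1 }

println(s"row sum = $rowSum, column sum = $colSum")
```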
Regards
Pavel
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company