Thread: Advice on efficiently logging outputs to PostgreSQL

Advice on efficiently logging outputs to PostgreSQL

From: Dominique Devienne
Date: Oct 15, 2024
I have an existing heavy ETL that serially loads tons of data to PostgreSQL.
This is done using a CLI tool, processing one project after another.

I'd like to parallelize / distribute the work, which I could do from my CLI
tool, but 1) that would be confined to a single machine, and 2) we'd like to
provide a web UI to report progress and logs from the ETL workers / processes.

I mentioned in the past that I have a basic LISTEN/NOTIFY-based work queue
in PostgreSQL, which would do nicely for this, but I'm wondering how best
to deal with the logs.
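
For context, the queue side is roughly this kind of loop (a minimal sketch
in Python/psycopg2; the channel and table names are placeholders, not my
actual schema):

    import select
    import psycopg2

    conn = psycopg2.connect("dbname=etl")  # hypothetical DSN
    conn.autocommit = True                 # deliver notifications promptly
    cur = conn.cursor()
    cur.execute("LISTEN job_ready;")       # placeholder channel name

    while True:
        # Block until the server pushes a notification (or 60 s pass).
        if select.select([conn], [], [], 60) == ([], [], []):
            continue  # timeout; loop and wait again
        conn.poll()
        while conn.notifies:
            conn.notifies.pop(0)
            # Claim one queued job; SKIP LOCKED lets several workers share it.
            cur.execute("""
                UPDATE etl_jobs SET state = 'running'
                 WHERE id = (SELECT id FROM etl_jobs
                              WHERE state = 'queued'
                              ORDER BY id
                              LIMIT 1
                              FOR UPDATE SKIP LOCKED)
                 RETURNING id""")
            job = cur.fetchone()
            if job:
                print("claimed job", job[0])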

The ETL process is quite chatty and can generate a lot of log output,
which I'd want in PostgreSQL as well, so the Web UI (developed
independently, by a separate team) can see it. The ETL worker wants to
write the logs almost continuously. The Web UI wants to read them
concurrently, to show them, ideally in a timely fashion.
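
Concretely, for the "timely" part, I was picturing the Web UI doing
incremental reads keyed on a growing id, something like this (a sketch in
Python/psycopg2; the etl_log table and its columns are hypothetical, per
the schema question below):

    # Fetch only the log lines the UI has not seen yet for one job.
    def fetch_new_lines(cur, job_id, last_seen_id):
        cur.execute(
            """SELECT id, logged_at, line
                 FROM etl_log
                WHERE job_id = %s AND id > %s
                ORDER BY id""",
            (job_id, last_seen_id))
        return cur.fetchall()  # UI remembers max(id) as the new last_seen_id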

My experience with CI tools like Jenkins is that logs are "laggy", and
arrive in big chunks, dozens of seconds apart, which is a subpar experience.
That's not PostgreSQL-related, but just to illustrate what I'd like to avoid.

Has anyone done anything along these lines?

Given the way MVCC works in PostgreSQL, updating (by appending to) a
"file-like" bytea or text[] seems like a bad idea. So each log line should be
its own row? Or should lines with close timestamps be aggregated together,
to limit the number of rows generated?
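
To make the question concrete, here is the kind of one-row-per-line table
I have in mind (a sketch only; names and types are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=etl")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE etl_log (
                id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
                job_id    bigint NOT NULL,
                logged_at timestamptz NOT NULL DEFAULT clock_timestamp(),
                line      text NOT NULL
            )""")
        # The UI reads by (job_id, id > last_seen), so index accordingly.
        cur.execute("CREATE INDEX ON etl_log (job_id, id)")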

If granular at the line level, will tons of small transactions be a problem?
And blow up our "OID budget" too rapidly?
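
One idea would be to buffer lines in the worker and flush them in batches,
e.g. every 100 lines or 0.5 s, so each flush is a single INSERT and a
single transaction (a sketch; the thresholds are arbitrary and etl_log is
the hypothetical table from above):

    import time
    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect("dbname=etl")  # hypothetical DSN

    buffer = []                    # (job_id, line) tuples waiting to be flushed
    last_flush = time.monotonic()

    def log_line(job_id, line, max_lines=100, max_delay=0.5):
        """Buffer a line; flush the batch as a single INSERT."""
        global last_flush
        buffer.append((job_id, line))
        if len(buffer) >= max_lines or time.monotonic() - last_flush >= max_delay:
            with conn, conn.cursor() as cur:
                execute_values(cur,
                    "INSERT INTO etl_log (job_id, line) VALUES %s",
                    buffer)
            buffer.clear()
            last_flush = time.monotonic()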

Am I worrying too much? :)

I'd appreciate any advice or experience-based stories around this
use case, please.

Thanks, --DD



Re: Advice on efficiently logging outputs to PostgreSQL

From: Christophe Pettus
Date:

> On Oct 15, 2024, at 07:17, Dominique Devienne <ddevienne@gmail.com> wrote:
> Am I worrying too much? :)

Probably. :-)  The main things I'd worry about are:

1. What's the ratio of log lines to database updates?  You want this to be as
high as usefully possible, since in effect you are doing write amplification
by writing to the logs as well as to the "real" database.

2. One of the things to watch out for if you are writing the log lines to the
database is that if you write a log line in a transaction, and that
transaction rolls back, you lose the log line.  That may or may not be what
you want: if it's reporting an error (such as the reason that the transaction
rolled back), you want to preserve that data.  One way of handling this is to
have the application use a separate session for logging, although you are now
doing 2x the number of connections.
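
Roughly, what I mean by a separate session for logging is something like
this (a minimal sketch in Python/psycopg2; the etl_log table and the values
are illustrative):

    import psycopg2

    work_conn = psycopg2.connect("dbname=etl")  # the "real" ETL transaction
    log_conn = psycopg2.connect("dbname=etl")   # separate session just for logs
    log_conn.autocommit = True                  # each log line commits on its own

    def log(job_id, line):
        with log_conn.cursor() as cur:
            cur.execute("INSERT INTO etl_log (job_id, line) VALUES (%s, %s)",
                        (job_id, line))

    try:
        with work_conn, work_conn.cursor() as cur:
            # ... the real ETL statements would run here on work_conn ...
            log(42, "loaded project X")   # written via log_conn, already durable
            raise RuntimeError("boom")    # simulate a failure -> work rolls back
    except RuntimeError as e:
        log(42, f"transaction rolled back: {e}")  # the error is preserved too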

If the write volume is very high, you might consider using a dedicated
log-ingestion service (there are tons) rather than PostgreSQL, so that you
aren't overburdening the database with log activity.