Storing thousands of csv files in postgresql - Mailing list pgsql-sql
From: Ion Alberdi
Subject: Storing thousands of csv files in postgresql
Msg-id: CANbgw4BfmAUsMTsh0z2Y-97jusZObaC+Ez7Mm=papd-WUDd89A@mail.gmail.com
List: pgsql-sql
Hello to all,
One of the use cases we need to implement requires
storing and querying thousands (and more as the product grows) of CSV files
that have different schemas (by schema we mean the column names and their types).
These CSVs would then need to be maintained with operations like:
- add column,
- add row,
- delete row,
- read: filter/sort/paginate,
- write: edit column values.
Let's assume that we store the definition of each schema in a dedicated table,
with the schema defined in a json column. With this schema we'll be able to translate
the read/write/update queries on these imported CSV files into the related SQL queries.
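For concreteness, here is a minimal sketch of such a registry, assuming a psycopg2
connection and illustrative names (csv_schemas, schema_def) that are not part of the
proposal itself:

import json

# conn is assumed to be an open psycopg2 connection.
REGISTRY_DDL = """
CREATE TABLE IF NOT EXISTS csv_schemas (
    id         serial PRIMARY KEY,
    name       text UNIQUE NOT NULL,  -- logical name of the imported csv
    schema_def jsonb NOT NULL         -- column name -> type, e.g. {"amount": "numeric"}
);
"""

def register_schema(conn, name, columns):
    """Store the column-name -> type mapping of one csv file."""
    with conn.cursor() as cur:
        cur.execute(REGISTRY_DDL)
        cur.execute(
            "INSERT INTO csv_schemas (name, schema_def) VALUES (%s, %s)",
            (name, json.dumps(columns)),
        )
    conn.commit()

# e.g. register_schema(conn, "sales_2022", {"day": "date", "amount": "numeric"})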
The remaining question is how to store the data of each file in the DB.
As suggested by https://www.postgresql.org/docs/10/sql-copy.html, one option is to import each CSV into its own table. Using this approach, we see the following (a sketch follows the list):
Pros:
- All postgresql types available: https://www.postgresql.org/docs/9.5/datatype.html,
- Constraints on columns, among others uniqueness constraints,
which let the DB guarantee that rows will not be duplicated (relevant to the add-row use case),
- Debuggability: enables using standard SQL to browse csv data,
- Can reuse existing libraries to generate dynamic SQL queries [1]
Cons:
- Need to have as many tables as different schemas.
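To illustrate, here is a rough sketch of this approach, combining a dynamically
generated CREATE TABLE built with psycopg2's sql module [1] and a COPY ... FROM STDIN
load; the type whitelist and the function name are assumptions of the sketch, not part
of the proposal:

from psycopg2 import sql

# Column types cannot be passed as query parameters, so restrict them to a known set.
ALLOWED_TYPES = {"text", "integer", "numeric", "boolean", "date", "timestamptz"}

def import_csv(conn, table_name, columns, csv_file):
    """Create one table per csv schema, then bulk-load the file with COPY."""
    col_defs = []
    for col, typ in columns.items():
        if typ not in ALLOWED_TYPES:
            raise ValueError("unsupported column type: %s" % typ)
        col_defs.append(sql.SQL("{} {}").format(sql.Identifier(col), sql.SQL(typ)))
    with conn.cursor() as cur:
        cur.execute(
            sql.SQL("CREATE TABLE {} ({})").format(
                sql.Identifier(table_name), sql.SQL(", ").join(col_defs)
            )
        )
        copy_stmt = sql.SQL("COPY {} FROM STDIN WITH (FORMAT csv, HEADER true)").format(
            sql.Identifier(table_name)
        )
        cur.copy_expert(copy_stmt.as_string(cur), csv_file)
    conn.commit()

The add column / edit value operations would then translate to ALTER TABLE / UPDATE
statements composed the same way.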
Another solution could consist of implementing a document store in postgresql,
by storing all columns of a row in a single jsonb column.
Pros:
- Single table to store all the different imported CSVs.
Cons:
- Fewer types available: https://www.postgresql.org/docs/9.4/datatype-json.html,
- No constraints on columns (no uniqueness or data-validation constraints;
these would have to be delegated to the application),
- Ramp-up on the json* functions (and I wonder whether there are libraries
to safely generate dynamic SQL queries on json columns),
(- Debuggability: this is not such a big con, as json_to_record enables
going back to a standard SQL experience; see the sketch after this list)
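For comparison, a sketch of the single-table jsonb variant (again with illustrative
names, csv_rows and payload, and a psycopg2 connection); note that ->> yields text,
which is one place where the weaker typing shows up:

import json

DOC_DDL = """
CREATE TABLE IF NOT EXISTS csv_rows (
    id       bigserial PRIMARY KEY,
    csv_name text NOT NULL,   -- which imported csv the row belongs to
    payload  jsonb NOT NULL   -- all csv columns of the row
);
"""

def add_row(conn, csv_name, row):
    """Append one csv row; duplicate detection has to happen in the application."""
    with conn.cursor() as cur:
        cur.execute(DOC_DDL)
        cur.execute(
            "INSERT INTO csv_rows (csv_name, payload) VALUES (%s, %s)",
            (csv_name, json.dumps(row)),
        )
    conn.commit()

def filter_rows(conn, csv_name, column, value, limit=50, offset=0):
    """Filter/paginate on one json key; the comparison is textual unless cast."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT payload FROM csv_rows"
            " WHERE csv_name = %s AND payload ->> %s = %s"
            " ORDER BY id LIMIT %s OFFSET %s",
            (csv_name, column, str(value), limit, offset),
        )
        return [r[0] for r in cur.fetchall()]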
Based on this first pro/con list, we're wondering about the scalability
limits faced by postgresql instances as the number of tables in a given DB grows.
Browsing the web, we saw two main issues:
- One related to the OS: "you may see some performance degradation associated
with databases containing many tables. PostgreSQL may use a large number of
files for storing the table data, and performance may suffer if the operating
system does not cope well with many files in a single directory." [2]
- Related to that, the fact that some operations, like autovacuum, are O(N) in the number of tables [3]
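As a side note, here is a small helper we could use to keep an eye on the table count
driving these two concerns (pg_class and pg_namespace are standard catalogs; where the
count starts to hurt is exactly the open question below):

def count_user_tables(conn):
    """Count ordinary tables outside the system schemas (psycopg2 connection assumed)."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT count(*) FROM pg_class c"
            " JOIN pg_namespace n ON n.oid = c.relnamespace"
            " WHERE c.relkind = 'r'"
            " AND n.nspname NOT IN ('pg_catalog', 'information_schema')"
        )
        return cur.fetchone()[0]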
On the other hand, reading about timescaledb's architecture, https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/#partitioning-in-hypertables-with-chunks,
"Each chunk is implemented using a standard database table.",
it seems that their platform took this direction, which suggests that such an approach can scale.
My question is thus the following:
how many such tables can a single postgresql instance handle without trouble [4]?
Any challenge/addition to the pro/cons list described above would be very welcome too.
Best regards,
Ion
[1]: Like https://www.psycopg.org/docs/sql.html
[2]: https://link.springer.com/content/pdf/bbm%3A978-1-4302-0018-5%2F1.pdf
[3]: https://stackoverflow.com/questions/22395883/postgresql-what-is-the-maximum-number-of-tables-can-store-in-postgresql-databas
[4]: We use RDS instances in AWS