Re: Transparent column encryption - Mailing list pgsql-hackers
From | Peter Eisentraut |
---|---|
Subject | Re: Transparent column encryption |
Date | |
Msg-id | 48a9f2c2-4a57-27d8-7c53-16a23a01014e@enterprisedb.com Whole thread Raw |
In response to | Transparent column encryption (Peter Eisentraut <peter.eisentraut@enterprisedb.com>) |
Responses |
Re: Transparent column encryption
|
List | pgsql-hackers |
Here is a new version of this patch. See also the original description quoted below. I have done a significant amount of work on this over the last few months. Some important news include: - The cryptography has been improved. It now uses an AEAD scheme, and for deterministic encryption a proper SIV construction. - The OpenSSL-specific parts have been moved to a separate file in libpq. Non-OpenSSL builds compile and work (without functionality, of course). - libpq handles multiple CEKs and CMKs, including changing keys on the fly. - libpq supports a mode to force encryption of certain values. - libpq supports a flexible configuration system for looking up CMKs, including support for external key management systems. - psql has a new \gencr command that allows passing in bind parameters for (potential) encryption. - There is some more pg_dump and psql support. - The new data types for storing encrypted data have been renamed for clarity. - Various changes to the protocol compared to the previous patch. - The patch contains full documentation of the protocol changes, glossary entries, and more new documentation. The major pieces that are still missing are: - DDL support for registering keys - Protocol versioning or feature flags Other than that it's pretty complete in my mind. For interested reviewers, I have organized the patch so that you can start reading it top to bottom: The documentation comes first, then the tests, then the code changes. Even some feedback on the first or first two aspects would be valuable to me. Old news follows: On 03.12.21 22:32, Peter Eisentraut wrote: > I want to present my proof-of-concept patch for the transparent column > encryption feature. (Some might also think of it as automatic > client-side encryption or similar, but I like my name.) This feature > enables the {automatic,transparent} encryption and decryption of > particular columns in the client. The data for those columns then > only ever appears in ciphertext on the server, so it is protected from > the "prying eyes" of DBAs, sysadmins, cloud operators, etc. The > canonical use case for this feature is storing credit card numbers > encrypted, in accordance with PCI DSS, as well as similar situations > involving social security numbers etc. Of course, you can't do any > computations with encrypted values on the server, but for these use > cases, that is not necessary. This feature does support deterministic > encryption as an alternative to the default randomized encryption, so > in that mode you can do equality lookups, at the cost of some > security. > > This functionality also exists in other SQL database products, so the > overall concepts weren't invented by me by any means. > > Also, this feature has nothing to do with the on-disk encryption > feature being contemplated in parallel. Both can exist independently. > > The attached patch has all the necessary pieces in place to make this > work, so you can have an idea how the overall system works. It > contains some documentation and tests to help illustrate the > functionality. But it's missing the remaining 90% of the work, > including additional DDL support, error handling, robust memory > management, protocol versioning, forward and backward compatibility, > pg_dump support, psql \d support, refinement of the cryptography, and > so on. But I think obvious solutions exist to all of those things, so > it isn't that interesting to focus on them for now. > > ------ > > Now to the explanation of how it works. > > You declare a column as encrypted in a CREATE TABLE statement. The > column value is encrypted by a symmetric key called the column > encryption key (CEK). The CEK is a catalog object. The CEK key > material is in turn encrypted by an assymmetric key called the column > master key (CMK). The CMK is not stored in the database but somewhere > where the client can get to it, for example in a file or in a key > management system. When a server sends rows containing encrypted > column values to the client, it first sends the required CMK and CEK > information (new protocol messages), which the client needs to record. > Then, the client can use this information to automatically decrypt the > incoming row data and forward it in plaintext to the application. > > For the CMKs, the catalog object specifies a "provider" and generic > options. Right now, libpq has a "file" provider hardcoded, and it > takes a "filename" option. Via some mechanism to be determined, > additional providers could be loaded and then talk to key management > systems via http or whatever. I have left some comments in the libpq > code where the hook points for this could be. > > The general idea would be for an application to have one CMK per area > of secret stuff, for example, for credit card data. The CMK can be > rotated: each CEK can be represented multiple times in the database, > encrypted by a different CMK. (The CEK can't be rotated easily, since > that would require reading out all the data from a table/column and > reencrypting it. We could/should add some custom tooling for that, > but it wouldn't be a routine operation.) > > The encryption algorithms are mostly hardcoded right now, but there > are facilities for picking algorithms and adding new ones that will be > expanded. The CMK process uses RSA-OAEP. The CEK process uses > AES-128-CBC right now; a more complete solution should probably > involve some HMAC thrown in. > > In the server, the encrypted datums are stored in types called > encryptedr and encryptedd (for randomized and deterministic > encryption). These are essentially cousins of bytea. For the rest of > the database system below the protocol handling, there is nothing > special about those. For example, encryptedr has no operators at all, > encryptedd has only an equality operator. pg_attribute has a new > column attrealtypid that stores the original type of the data in the > column. This is only used for providing it to clients, so that > higher-level clients can convert the decrypted value to their > appropriate data types in their environments. > > Some protocol extensions are required. These should be guarded by > some _pq_... setting, but this is not done in this patch yet. As > mentioned above, extra messages are added for sending the CMKs and > CEKs. In the RowDescription message, I have commandeered the format > field to add a bit that indicates that the field is encrypted. This > could be made a separate field, and there should probably be > additional fields to indicate the algorithm and CEK name, but this was > easiest for now. The ParameterDescription message is extended to > contain format fields for each parameter, for the same purpose. > Again, this could be done differently. > > Speaking of parameter descriptions, the trickiest part of this whole > thing appears to be how to get transparently encrypted data into the > database (as opposed to reading it out). It is required to use > protocol-level prepared statements (i.e., extended query) for this. > The client must first prepare a statement, then describe the statement > to get parameter metadata, which indicates which parameters are to be > encrypted and how. So this will require some care by applications > that want to do this, but, well, they probably should be careful > anyway. In libpq, the existing APIs make this difficult, because > there is no way to pass the result of a describe-statement call back > into execute-statement-with-parameters. I added new functions that do > this, so you then essentially do > > res0 = PQdescribePrepared(conn, ""); > res = PQexecPrepared2(conn, "", 2, values, NULL, NULL, 0, res0); > > (The name could obviously be improved.) Other client APIs that have a > "statement handle" concept could do this more elegantly and probably > without any API changes. > > Another challenge is that the parse analysis must check which > underlying column a parameter corresponds to. This is similar to > resorigtbl and resorigcol in the opposite direction. The current > implementation of this works for the test cases, but I know it has > some problems, so I'll continue working in this. This functionality > is in principle available to all prepared-statement variants, not only > protocol-level. So you can see in the tests that I expanded the > pg_prepared_statements view to show this information as well, which > also provides an easy way to test and debug this functionality > independent of column encryption. > > And also, psql doesn't use prepared statements, so writing into > encrypted columns currently doesn't work at all via psql. (Reading > works no problem.) All the test code currently uses custom libpq C > programs. We should think about a way to enable prepared statements > in psql, perhaps something like > > INSERT INTO t1 VALUES ($1, $2) \gg 'val1' 'val2' > > (\gexec and \gx are already taken.) > > ------ > > This is not targeting PostgreSQL 15. But I'd appreciate some feedback > on the direction. As I mentioned above, a lot of the remaining work > is arguably mostly straightforward. Some closer examination of the > issues surrounding the libpq API changes and psql would be useful. > Perhaps there are other projects where that kind of functionality > would also be useful.
Attachment
pgsql-hackers by date: