RFC: seccomp-bpf support - Mailing list pgsql-hackers
From | Joe Conway |
---|---|
Subject | RFC: seccomp-bpf support |
Date | |
Msg-id | bc032e95-7e8b-ed00-8d87-ed9db449bdd6@joeconway.com |
List | pgsql-hackers |
SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall
filtering mechanism which allows reduction of the kernel attack surface
by preventing (or at least audit logging) normally unused syscalls.

Quoting from this link:
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt

   "A large number of system calls are exposed to every userland process
   with many of them going unused for the entire lifetime of the
   process. As system calls change and mature, bugs are found and
   eradicated. A certain subset of userland applications benefit by
   having a reduced set of available system calls. The resulting set
   reduces the total kernel surface exposed to the application. System
   call filtering is meant for use with those applications."

Recent security best-practices recommend, and certain highly
security-conscious organizations are beginning to require, that SECCOMP
be used to the extent possible. The major web browsers, container
runtime engines, and systemd are all examples of software that already
support seccomp.

---------

A seccomp (bpf) filter consists of a default action and a set of rules
with actions pertaining to specific syscalls (possibly with even more
specific sets of arguments). Once loaded into the kernel, a filter is
inherited by all child processes and cannot be removed. It can, however,
be overlaid with another filter. For any given syscall match, the most
restrictive (a.k.a. highest precedence) action will be taken by the
kernel. PostgreSQL has already been run "in the wild" under seccomp
control in containers, and possibly systemd. Adding seccomp support into
PostgreSQL itself mitigates issues with these approaches, and has
several advantages:

* Container seccomp filters tend to be extremely broad/permissive,
  typically allowing about 6 out of 7 of all syscalls. They must do this
  because the use cases for containers vary widely.
* systemd does not implement seccomp filters by default.
  Packagers may decide to do so, but there is no guarantee. Adding them
  post install potentially requires cooperation by groups outside
  control of the database admins.
* In the container and systemd case there is no particularly good way to
  inspect what filters are active. It is possible to observe actions
  taken, but again, control is possibly outside the database admin
  group. For example, the best way to understand what happened is to
  review the auditd log, which is likely not readable by the DBA.
* With built-in support, it is possible to lock down backend processes
  more tightly than the postmaster.
* With built-in support, it is possible to lock down different backend
  processes differently than each other, for example by using
  ALTER ROLE ... SET or ALTER DATABASE ... SET.
* With built-in support, it is possible to calculate and return (in the
  form of an SRF) the effective filters being applied to the postmaster
  and the current backend.
* With built-in support, it could be possible (this part not yet
  implemented) to have separate filters for different backend types,
  e.g. autovac workers, background writer, etc.

---------

Attached is a patch for discussion, adding support for seccomp-bpf
(nowadays generally just called seccomp) syscall filtering at
configure-time using libseccomp. I would like to get this in shape to be
committed by the end of the November CF if possible.

The code itself has been through several rounds of revision based on
discussions I have had with the author of libseccomp as well as a few
other folks. However as of the moment:

* Documentation - general discussion missing entirely
* No regression tests

---------

For convenience, here are a couple of additional links to relevant
information regarding seccomp:
https://en.wikipedia.org/wiki/Seccomp
https://github.com/seccomp/libseccomp

---------

Specific feedback requested:
1. Placement of pg_get_seccomp_filter() in
   src/backend/utils/adt/genfile.c originally made sense but after
   several rewrites no longer does. Ideas where it *should* go?
2. Where should a general discussion section go in the docs, if at all?
3. Currently this supports a global filter at the postmaster level,
   which is inherited by all child processes, and a secondary filter
   at the client backend session level. It likely makes sense to
   support secondary filters for other types of child processes, e.g.
   autovacuum workers, etc. Add that now (pg13), later release, or
   never?
4. What is the best way to approach testing of this feature? Tap
   testing perhaps?
5. Default GUC values - should we provide "starter" lists, or only a
   procedure for generating a list (as below).

---------

Notes on usage:
===============

In order to determine your minimally required allow lists, do something
like the following on a non-production server with the same architecture
as production:

0. Setup:
 * install libseccomp, libseccomp-dev, and seccomp
 * install auditd if not already installed
 * configure postgres --with-seccomp and maybe --enable-tap-tests to
   improve feature coverage (see below)

1. Modify postgresql.conf and/or create
   <pg_source_dir>/postgresql_tmp.conf
8<--------------------
seccomp = on
global_syscall_default = allow
global_syscall_allow = ''
global_syscall_log = ''
global_syscall_error = ''
global_syscall_kill = ''
session_syscall_default = log
session_syscall_allow = '*'
session_syscall_log = '*'
session_syscall_error = '*'
session_syscall_kill = '*'
8<--------------------

2. Modify /etc/audit/auditd.conf
 * disp_qos = 'lossless'
 * change max_log_file_action = 'ignore'

3. Stop auditd, clear out all audit.logs, start auditd:
 * systemctl stop auditd.service # if running
 * echo -n "" > /var/log/audit/audit.log
 * systemctl start auditd.service

4. Start/restart postgres.

5. Exercise postgres as much as possible (one or more of the following):
 * make installcheck-world
 * make check-world \
     EXTRA_REGRESS_OPTS=--temp-config=<pg_source_dir>/postgresql_tmp.conf
 * run your application through its paces
 * other random testing of relevant postgres features

 Note: at this point audit.log will start growing quickly. During
 `make check-world` mine grew to just under 1 GB.

6. Process results:
 a) systemctl stop auditd.service
 b) Run the provided "get_syscalls.sh" script
 c) Cut and paste the result as the value of session_syscall_allow.

7. Optional:
 a) global_syscall_default = 'log'
 b) Repeat steps 3-5
 c) Repeat step 6a and 6b
 d) Cut and paste the result as the value of global_syscall_allow

8. Iterate steps 3-6b.
 * Output should be empty.
 * If there are any new syscalls, add to global_syscall_allow and
   session_syscall_allow.
 * Iterate until output of "get_syscalls.sh" script is empty.

9. Optional:
 * Change global and session defaults to "error" or "kill"
 * Reduce the allow lists if desired
 * This can be done for specific database users, by doing
   ALTER ROLE... SET session_syscall_allow to
   '<some reduced allow list>'

10. Adjust settings to taste, restart postgres, and monitor audit.log
    going forward.

Below are some values from my system. Note that I have made no attempt
thus far to do static code analysis -- this list was built using
`make check-world` several times.
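Regarding step 6b: the "get_syscalls.sh" script ships with the attached patch and is not reproduced in this message. As a rough illustration only, and not the script itself, the following Python sketch shows the kind of processing that step implies: collect the syscall numbers from the SECCOMP records in audit.log and print them as a comma-separated list. The record layout in SAMPLE is an assumption based on typical auditd output, the names logged_syscalls, allow_list, SAMPLE and NAMES are hypothetical, and NAMES holds only three x86_64 numbers taken from the pg_get_seccomp_filter() output further down. A complete number-to-name mapping would have to come from elsewhere (libseccomp's scmp_sys_resolver tool is one possibility).

```python
import re
from collections import Counter

# Matches the syscall=<number> field of an audit record.
SYSCALL_FIELD = re.compile(r"\bsyscall=(\d+)\b")


def logged_syscalls(lines):
    """Count syscall numbers found in type=SECCOMP audit records."""
    counts = Counter()
    for line in lines:
        if not line.startswith("type=SECCOMP "):
            continue  # ignore every other audit record type
        match = SYSCALL_FIELD.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def allow_list(counts, names):
    """Return a sorted, comma-separated list of syscall names.

    Numbers missing from `names` are emitted as bare numbers rather
    than being silently dropped.
    """
    return ",".join(sorted({names.get(num, str(num)) for num in counts}))


# Hypothetical records: the field layout is an assumption, not copied
# from a real audit.log.
SAMPLE = [
    'type=SECCOMP msg=audit(1567000000.001:101): pid=4242 comm="postgres" '
    'sig=0 arch=c000003e syscall=39 compat=0 code=0x7ffc0000',
    'type=SECCOMP msg=audit(1567000000.002:102): pid=4242 comm="postgres" '
    'sig=0 arch=c000003e syscall=0 compat=0 code=0x7ffc0000',
    'type=SECCOMP msg=audit(1567000000.003:103): pid=4242 comm="postgres" '
    'sig=0 arch=c000003e syscall=1 compat=0 code=0x7ffc0000',
    'type=SECCOMP msg=audit(1567000000.004:104): pid=4243 comm="postgres" '
    'sig=0 arch=c000003e syscall=39 compat=0 code=0x7ffc0000',
    # A non-SECCOMP record, which the sketch skips.
    'type=SYSCALL msg=audit(1567000000.005:105): arch=c000003e syscall=59',
]

# Deliberately tiny mapping (x86_64): read=0, write=1, getpid=39.
NAMES = {0: "read", 1: "write", 39: "getpid"}

if __name__ == "__main__":
    # For a real log, pass an open file object instead of SAMPLE.
    print(allow_list(logged_syscalls(SAMPLE), NAMES))
```

Numbers that cannot be resolved are kept as bare numbers so they stand out; whether the *_syscall_allow settings would accept a bare number is not something this sketch assumes. The sample settings mentioned above follow.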
8<-------------------------
seccomp = on
global_syscall_default = log
global_syscall_allow = 'accept,access,bind,brk,chmod,clone,close,connect,dup,epoll_create1,epoll_ctl,epoll_wait,exit_group,fadvise64,fallocate,fcntl,fdatasync,fstat,fsync,ftruncate,futex,getdents,getegid,geteuid,getgid,getpeername,getpid,getppid,getrandom,getrusage,getsockname,getsockopt,getuid,ioctl,kill,link,listen,lseek,lstat,mkdir,mmap,mprotect,mremap,munmap,openat,pipe,poll,prctl,pread64,prlimit64,pwrite64,read,readlink,recvfrom,recvmsg,rename,rmdir,rt_sigaction,rt_sigprocmask,rt_sigreturn,seccomp,select,sendto,setitimer,set_robust_list,setsid,setsockopt,shmat,shmctl,shmdt,shmget,shutdown,socket,stat,statfs,symlink,sync_file_range,sysinfo,umask,uname,unlink,utime,wait4,write'
global_syscall_log = ''
global_syscall_error = ''
global_syscall_kill = ''
session_syscall_default = log
session_syscall_allow = 'access,brk,chmod,close,connect,epoll_create1,epoll_ctl,epoll_wait,exit_group,fadvise64,fallocate,fcntl,fdatasync,fstat,fsync,ftruncate,futex,getdents,getegid,geteuid,getgid,getpeername,getpid,getrandom,getrusage,getsockname,getsockopt,getuid,ioctl,kill,link,lseek,lstat,mkdir,mmap,mprotect,mremap,munmap,openat,poll,pread64,pwrite64,read,readlink,recvfrom,recvmsg,rename,rmdir,rt_sigaction,rt_sigprocmask,rt_sigreturn,select,sendto,setitimer,setsockopt,shutdown,socket,stat,symlink,sync_file_range,sysinfo,umask,uname,unlink,utime,write'
session_syscall_log = '*'
session_syscall_error = '*'
session_syscall_kill = '*'
8<-------------------------

That results in the following effective filters at the ("context"
equals) global and session levels:

8<-------------------------
select * from pg_get_seccomp_filter() order by 4,1;
     syscall     | syscallnum | filter_action  | context
-----------------+------------+----------------+---------
 accept          |         43 | global->allow  | global
 access          |         21 | global->allow  | global
 bind            |         49 | global->allow  | global
 brk             |         12 | global->allow  | global
 chmod           |         90 | global->allow  | global
 clone           |         56 | global->allow  | global
 close           |          3 | global->allow  | global
 connect         |         42 | global->allow  | global
 <default>       |         -1 | global->log    | global
 dup             |         32 | global->allow  | global
 epoll_create1   |        291 | global->allow  | global
 epoll_ctl       |        233 | global->allow  | global
 epoll_wait      |        232 | global->allow  | global
 exit_group      |        231 | global->allow  | global
 fadvise64       |        221 | global->allow  | global
 fallocate       |        285 | global->allow  | global
 fcntl           |         72 | global->allow  | global
 fdatasync       |         75 | global->allow  | global
 fstat           |          5 | global->allow  | global
 fsync           |         74 | global->allow  | global
 ftruncate       |         77 | global->allow  | global
 futex           |        202 | global->allow  | global
 getdents        |         78 | global->allow  | global
 getegid         |        108 | global->allow  | global
 geteuid         |        107 | global->allow  | global
 getgid          |        104 | global->allow  | global
 getpeername     |         52 | global->allow  | global
 getpid          |         39 | global->allow  | global
 getppid         |        110 | global->allow  | global
 getrandom       |        318 | global->allow  | global
 getrusage       |         98 | global->allow  | global
 getsockname     |         51 | global->allow  | global
 getsockopt      |         55 | global->allow  | global
 getuid          |        102 | global->allow  | global
 ioctl           |         16 | global->allow  | global
 kill            |         62 | global->allow  | global
 link            |         86 | global->allow  | global
 listen          |         50 | global->allow  | global
 lseek           |          8 | global->allow  | global
 lstat           |          6 | global->allow  | global
 mkdir           |         83 | global->allow  | global
 mmap            |          9 | global->allow  | global
 mprotect        |         10 | global->allow  | global
 mremap          |         25 | global->allow  | global
 munmap          |         11 | global->allow  | global
 openat          |        257 | global->allow  | global
 pipe            |         22 | global->allow  | global
 poll            |          7 | global->allow  | global
 prctl           |        157 | global->allow  | global
 pread64         |         17 | global->allow  | global
 prlimit64       |        302 | global->allow  | global
 pwrite64        |         18 | global->allow  | global
 read            |          0 | global->allow  | global
 readlink        |         89 | global->allow  | global
 recvfrom        |         45 | global->allow  | global
 recvmsg         |         47 | global->allow  | global
 rename          |         82 | global->allow  | global
 rmdir           |         84 | global->allow  | global
 rt_sigaction    |         13 | global->allow  | global
 rt_sigprocmask  |         14 | global->allow  | global
 rt_sigreturn    |         15 | global->allow  | global
 seccomp         |        317 | global->allow  | global
 select          |         23 | global->allow  | global
 sendto          |         44 | global->allow  | global
 setitimer       |         38 | global->allow  | global
 set_robust_list |        273 | global->allow  | global
 setsid          |        112 | global->allow  | global
 setsockopt      |         54 | global->allow  | global
 shmat           |         30 | global->allow  | global
 shmctl          |         31 | global->allow  | global
 shmdt           |         67 | global->allow  | global
 shmget          |         29 | global->allow  | global
 shutdown        |         48 | global->allow  | global
 socket          |         41 | global->allow  | global
 stat            |          4 | global->allow  | global
 statfs          |        137 | global->allow  | global
 symlink         |         88 | global->allow  | global
 sync_file_range |        277 | global->allow  | global
 sysinfo         |         99 | global->allow  | global
 umask           |         95 | global->allow  | global
 uname           |         63 | global->allow  | global
 unlink          |         87 | global->allow  | global
 utime           |        132 | global->allow  | global
 wait4           |         61 | global->allow  | global
 write           |          1 | global->allow  | global
 accept          |         43 | session->log   | session
 access          |         21 | session->allow | session
 bind            |         49 | session->log   | session
 brk             |         12 | session->allow | session
 chmod           |         90 | session->allow | session
 clone           |         56 | session->log   | session
 close           |          3 | session->allow | session
 connect         |         42 | session->allow | session
 <default>       |         -1 | session->log   | session
 dup             |         32 | session->log   | session
 epoll_create1   |        291 | session->allow | session
 epoll_ctl       |        233 | session->allow | session
 epoll_wait      |        232 | session->allow | session
 exit_group      |        231 | session->allow | session
 fadvise64       |        221 | session->allow | session
 fallocate       |        285 | session->allow | session
 fcntl           |         72 | session->allow | session
 fdatasync       |         75 | session->allow | session
 fstat           |          5 | session->allow | session
 fsync           |         74 | session->allow | session
 ftruncate       |         77 | session->allow | session
 futex           |        202 | session->allow | session
 getdents        |         78 | session->allow | session
 getegid         |        108 | session->allow | session
 geteuid         |        107 | session->allow | session
 getgid          |        104 | session->allow | session
 getpeername     |         52 | session->allow | session
 getpid          |         39 | session->allow | session
 getppid         |        110 | session->log   | session
 getrandom       |        318 | session->allow | session
 getrusage       |         98 | session->allow | session
 getsockname     |         51 | session->allow | session
 getsockopt      |         55 | session->allow | session
 getuid          |        102 | session->allow | session
 ioctl           |         16 | session->allow | session
 kill            |         62 | session->allow | session
 link            |         86 | session->allow | session
 listen          |         50 | session->log   | session
 lseek           |          8 | session->allow | session
 lstat           |          6 | session->allow | session
 mkdir           |         83 | session->allow | session
 mmap            |          9 | session->allow | session
 mprotect        |         10 | session->allow | session
 mremap          |         25 | session->allow | session
 munmap          |         11 | session->allow | session
 openat          |        257 | session->allow | session
 pipe            |         22 | session->log   | session
 poll            |          7 | session->allow | session
 prctl           |        157 | session->log   | session
 pread64         |         17 | session->allow | session
 prlimit64       |        302 | session->log   | session
 pwrite64        |         18 | session->allow | session
 read            |          0 | session->allow | session
 readlink        |         89 | session->allow | session
 recvfrom        |         45 | session->allow | session
 recvmsg         |         47 | session->allow | session
 rename          |         82 | session->allow | session
 rmdir           |         84 | session->allow | session
 rt_sigaction    |         13 | session->allow | session
 rt_sigprocmask  |         14 | session->allow | session
 rt_sigreturn    |         15 | session->allow | session
 seccomp         |        317 | session->log   | session
 select          |         23 | session->allow | session
 sendto          |         44 | session->allow | session
 setitimer       |         38 | session->allow | session
 set_robust_list |        273 | session->log   | session
 setsid          |        112 | session->log   | session
 setsockopt      |         54 | session->allow | session
 shmat           |         30 | session->log   | session
 shmctl          |         31 | session->log   | session
 shmdt           |         67 | session->log   | session
 shmget          |         29 | session->log   | session
 shutdown        |         48 | session->allow | session
 socket          |         41 | session->allow | session
 stat            |          4 | session->allow | session
 statfs          |        137 | session->log   | session
 symlink         |         88 | session->allow | session
 sync_file_range |        277 | session->allow | session
 sysinfo         |         99 | session->allow | session
 umask           |         95 | session->allow | session
 uname           |         63 | session->allow | session
 unlink          |         87 | session->allow | session
 utime           |        132 | session->allow | session
 wait4           |         61 | session->log   | session
 write           |          1 | session->allow | session
(170 rows)
8<-------------------------

If you made it all the way to here, thank you for your attention :-)

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development