NUMA packaging and patch - Mailing list pgsql-hackers
| From | Kevin Grittner |
|---|---|
| Subject | NUMA packaging and patch |
| Date | |
| Msg-id | 1402267501.41111.YahooMailNeo@web122304.mail.ne1.yahoo.com |
| Responses | Re: NUMA packaging and patch; Re: NUMA packaging and patch |
| List | pgsql-hackers |
I ran into a situation where a machine with 4 NUMA memory nodes and 40 cores had performance problems due to NUMA. The problems were worst right after the OS was rebooted and the cache was warmed by running a script of queries that read all tables, all on a single connection. As it turned out, the database was just over one-quarter the size of RAM, and with default NUMA policies both the OS cache for the database and the PostgreSQL shared memory allocation were placed on a single NUMA segment, so access to the CPU package managing that segment became a bottleneck. On top of that, processes which happened to run on the CPU package holding all the cached data had to satisfy their own local allocations from more distant memory, because none was left on the nearer node. Through normal operations things eventually tended to shift around and get better, but only after several hours of heavy use with substandard performance.

I ran some benchmarks and found that even in long-running tests, spreading these allocations among the memory segments showed about a 2% benefit in a read-only load. The biggest difference I saw in a long-running read-write load was about a 20% hit for unbalanced allocations, but I only saw that once. I talked to someone at PGCon who managed to engineer much worse performance hits for an unbalanced load, although the circumstances were fairly artificial. Still, fixing this seems like something worth doing if further benchmarks confirm benefits at this level.

By default, the OS cache and buffers are allocated in the memory node with the shortest "distance" from the CPU a process is running on. This is determined by the "cpuset" associated with the process which reads or writes the disk page. Typically a NUMA machine starts with a single cpuset with a policy specifying this behavior. Fixing this aspect of things seems like an issue for packagers, although we should probably document it for those running from their own source builds.

To set an alternate policy for PostgreSQL, you first need to find or create the location for the cpuset specification, which uses a filesystem in a way similar to the /proc directory. On a machine with more than one memory node, the appropriate filesystem is probably already mounted, although different distributions use different filesystem names and mount locations. I will illustrate the process on my Ubuntu machine. Even though it has only one memory node (so this makes no difference there), I have it handy at the moment to confirm the commands as I put them into the email.
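On a real multi-node machine, the node and CPU numbers written to the cpus and mems files below depend on the hardware topology. As an aside (my addition, not part of the recipe that follows), the layout can be inspected first, assuming the numactl package is installed:

# Show each node's CPUs, memory size, and inter-node distances.
numactl --hardware
# A quicker summary of NUMA nodes and their CPUs.
lscpu | grep -i numa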
# Sysadmin must create the root cpuset if not already done.  (On a
# system with NUMA memory, this will probably already be mounted.)
# Location and options can vary by distro.
sudo mkdir /dev/cpuset
sudo mount -t cpuset none /dev/cpuset

# Sysadmin must create a cpuset for postgres and configure
# resources.  This will normally be all cores and all RAM.  This is
# where we specify that this cpuset will spread pages among its
# memory nodes.
sudo mkdir /dev/cpuset/postgres
sudo /bin/bash -c "echo 0-3 >/dev/cpuset/postgres/cpus"
sudo /bin/bash -c "echo 0 >/dev/cpuset/postgres/mems"
sudo /bin/bash -c "echo 1 >/dev/cpuset/postgres/memory_spread_page"

# Sysadmin must grant permissions to the desired setting(s).
# This could be by user or group.
sudo chown postgres /dev/cpuset/postgres/tasks

The pid of the postmaster or an ancestor process must be written to the tasks "file" of the cpuset. This can be a shell from which pg_ctl is run, at least for bash shells. It could also be written by the postmaster itself, essentially as an extra pid file. Possible snippet from a service script:

echo $$ >/dev/cpuset/postgres/tasks
pg_ctl start ...

Where the OS cache is larger than shared_buffers, the above is probably more important than the attached patch, which causes the main shared memory segment to be spread among all available memory nodes. The patch compiles in the relevant code only if configure is run with the --with-libnuma option, in which case a dependency on the numa library is created. It is v3 to avoid confusion with earlier versions I have shared with a few people off-list. (The only difference from v2 is fixing bitrot.) I'll add it to the next CF.
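As a rough sketch of how one might exercise the patch (my illustration, not from the patch itself): build with the new configure option, start the server from within the cpuset, and then look at how the postmaster's memory is distributed across nodes. The $PGDATA path and the use of numastat here are my assumptions; numastat -p reports a process's resident pages per node:

# Build with NUMA support (other configure options elided).
./configure --with-libnuma
make && make install

# With the server running inside the cpuset, show the postmaster's
# per-node memory; the first line of postmaster.pid is its pid.
numastat -p $(head -n 1 "$PGDATA/postmaster.pid")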
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment