Thread: Help in dealing with OOM

Help in dealing with OOM

From
Siraj G
Date:
Hello Experts!

My secondary instance has been unstable today. The service is crashing with Out of Memory. Please see below error (/var/log/postgresql/postgresql-2024-10-14.log):

10.2.52.50,2024-10-14 15:28:11 IST,686671,finance_revamp,finance_b2b,1,LOG:  duration: 1148.527 ms  statement: SELECT "tripschedule"."id", "tripschedule"."name", "tripschedule"."branch_id", "tripschedule"."route_id", "tripschedule"."route_name", "tripschedule"."bus_id", "tripschedule"."path", "tripschedule"."path_timings", "tripschedule"."recurring_days", "tripschedule"."start_time", "tripschedule"."end_time", "tripschedule"."start_date", "tripschedule"."end_date", "tripschedule"."created_at", "tripschedule"."created_by", "tripschedule"."shift", "tripschedule"."is_deleted", "tripschedule"."is_active", "tripschedule"."is_cancelled", "tripschedule"."branch_latitude", "tripschedule"."branch_longitude", "tripschedule"."polygon_id", "tripschedule"."is_after_noon_shift" FROM "tripschedule" INNER JOIN "bus" ON ("tripschedule"."bus_id" = "bus"."id") WHERE ("bus"."vehicle_no" = 'KA51AH1922' AND "tripschedule"."end_date" >= '2024-10-14'::date AND "tripschedule"."is_active" AND "tripschedule"."recurring_days" && ARRAY[1]::integer[] AND "tripschedule"."start_date" <= '2024-10-14'::date AND ("tripschedule"."start_time" BETWEEN '14:57:57.654167'::time AND '15:57:57.654167'::time OR "tripschedule"."end_time" BETWEEN '14:57:57.654167'::time AND '15:57:57.654167'::time OR ("tripschedule"."start_time" <= '15:27:57.654167'::time AND "tripschedule"."end_time" >= '15:27:57.654167'::time))) ORDER BY "tripschedule"."id" DESC LIMIT 1
10.2.52.22,2024-10-14 15:28:11 IST,686748,orchids_letseduvate_db,autoscaling,1,LOG:  duration: 468.028 ms  statement: SELECT (1) AS "a" FROM "test" INNER JOIN "test_section_mapping" ON ("test"."id" = "test_section_mapping"."test_id") INNER JOIN "test_subjects" ON ("test"."id" = "test_subjects"."test_id") INNER JOIN "user_response" ON ("test"."id" = "user_response"."test_id") WHERE (("test"."test_date")::date >= '2024-10-14'::date AND ("test"."test_date")::date <= '2024-10-17'::date AND NOT "test"."is_delete" AND "test_section_mapping"."sectionmapping_id" IN (136364) AND "test_subjects"."subject_id" = 16 AND NOT "user_response"."is_delete" AND "user_response"."submitted_by_id" = 61725) LIMIT 1
,2024-10-14 15:28:11 IST,2334064,,,8,LOG:  checkpointer process (PID 2334587) was terminated by signal 9: Killed
,2024-10-14 15:28:11 IST,2334064,,,9,LOG:  terminating any other active server processes
10.2.52.50,2024-10-14 15:28:11 IST,686752,mcollege_letseduvate_db,finance_b2b,1,WARNING:  terminating connection because

This is from the OS log (/var/log/kern.log):

oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.sli      ce/system-postgresql.slice/postgresql@12-main.service,task=postgres,pid=2334587,uid=114
  494 Oct 14 09:58:10 gce-k12-prod-as1-erp-pg-secondary kernel: [6905020.514569] Out of memory: Killed process 2334587 (postgres) total-vm:26349584kB, anon-rss:3464kB, file-rss:0kB, shmem-rs      s:21813032kB, UID:114 pgtables:49024kB oom_score_adj:0

Regards
Siraj 

Re: Help in dealing with OOM

From
Joe Conway
Date:
On 10/14/24 14:37, Siraj G wrote:
> This is from the OS log (/var/log/kern.log):
> 
> oom- 
> kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.sli    
 ce/system-postgresql.slice/postgresql@12-main.service,task=postgres,pid=2334587,uid=114
>    494 Oct 14 09:58:10 gce-k12-prod-as1-erp-pg-secondary kernel: 
> [6905020.514569] Out of memory: Killed process 2334587 (postgres) total- 
> vm:26349584kB, anon-rss:3464kB, file-rss:0kB, shmem-rs     
>   s:21813032kB, UID:114 pgtables:49024kB oom_score_adj:0


1. Do you happen to have swap disabled? If so, don't do that.

2. Does the postgres cgroup have memory.limit (cgroup v1) or memory.max
    (cgroup v2) set?

3. If #2 answer is no, have you followed the documented guidance here
    (in particular vm.overcommit_memory=2):

  
https://www.postgresql.org/docs/12/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Help in dealing with OOM

From
Siraj G
Date:
Thanks Joe, I will set these kernel parameters.

I also would like to highlight that the issue happened on SECONDARY. While the PRIMARY has less memory and computation in comparison to SECONDARY, not sure if there is anything wrong in the PgSQL.

PRIMARY: 48vCPUs & 48GB memory
SECONDARY: 64vCPUs & 64GB memory

I noticed a few things which do not sound tidy:
1. Total number of DBs are: 1860  (DB environment serves a product that has tenants - around 1100 tenants which means these many DBs are active)
     : Is there any metric for optimal performance on the number of DBs we should have per instance? I would assume NO (and it should be purely based on the overall operations), but just a question out of curiosity.
2. max_connections is set to 10000. 
I tried to reduce it to 4000 but was unable to do so (I tried this after reducing the max_connections in PRIMARY to 4000). This is the error:
FATAL:  hot standby is not possible because max_connections = 4000 is a lower setting than on the master server (its value was 10000)

If I am clubbing multiple things, sorry for the clutter.

Regards
Siraj

On Tue, Oct 15, 2024 at 12:39 AM Joe Conway <mail@joeconway.com> wrote:
On 10/14/24 14:37, Siraj G wrote:
> This is from the OS log (/var/log/kern.log):
>
> oom-
> kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.sli      ce/system-postgresql.slice/postgresql@12-main.service,task=postgres,pid=2334587,uid=114
>    494 Oct 14 09:58:10 gce-k12-prod-as1-erp-pg-secondary kernel:
> [6905020.514569] Out of memory: Killed process 2334587 (postgres) total-
> vm:26349584kB, anon-rss:3464kB, file-rss:0kB, shmem-rs     
>   s:21813032kB, UID:114 pgtables:49024kB oom_score_adj:0


1. Do you happen to have swap disabled? If so, don't do that.

2. Does the postgres cgroup have memory.limit (cgroup v1) or memory.max
    (cgroup v2) set?

3. If #2 answer is no, have you followed the documented guidance here
    (in particular vm.overcommit_memory=2):


https://www.postgresql.org/docs/12/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com