Thread: CLOSE_WAIT pileup and Application Timeout
List,
I am facing a TCP/IP connection-closing issue on the network side.
We run a vehicle tracking solution: an Android app installed on tablets in roughly 1000 vehicles, talking to a Java/WildFly application server backed by PostgreSQL 16.
Every 30 seconds each tablet sends the vehicle's location (lat/long coordinates) through the Java backend to the PostgreSQL DB, so that the latest position and movement of each vehicle can be rendered on a map-based front end.
The vehicles in the field connect on port 443, which is forwarded to port 8080 of WildFly (version 27), where the tracking application (Java 17) is deployed.
The tablets reach the backend over mobile data (4G/5G SIMs).
Vehicles on the move may disconnect or fail to send location data whenever they pass through areas where mobile data coverage is weak or absent.
Every week or so the server on which the backend application runs starts showing connection timeouts and can no longer serve vehicle tracking.
When we restart the WildFly server the application returns to normal, but the issue repeats after a week or two.
On the server machine, when this bottleneck occurs I see a large number of TCP connections stuck in CLOSE_WAIT (3000 to 5000) at the time the backend becomes unresponsive.
What is the root cause of this issue? Is it because the Android application is unable to ACK the connection close due to poor mobile data connectivity?
If so, how do people address this, and what might be a fix?
Any directions or reference material are most welcome.
Thank you,
Krishane
If I understand correctly, PostgreSQL is used as the data server for the backend, so the Android app does not connect directly to PostgreSQL.
My first thought is a problem with the backend closing or recycling connections after serving a request. Maybe the client-side connection pooling settings are wrong?
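For example, every request handler should obtain its connection from the pool and return it unconditionally, e.g. with try-with-resources. A rough sketch only (the datasource wiring and the table/column names here are invented, just to illustrate the pattern):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class LocationWriter {

    private final DataSource pool;   // the WildFly-managed pool, e.g. injected with @Resource

    public LocationWriter(DataSource pool) {
        this.pool = pool;
    }

    public void updatePosition(long vehicleId, double lat, double lon) throws SQLException {
        // try-with-resources returns the connection to the pool even if the
        // update throws or the mobile client has already gone away.
        try (Connection con = pool.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "UPDATE vehicle_position SET lat = ?, lon = ?, updated_at = now()"
                     + " WHERE vehicle_id = ?")) {
            ps.setDouble(1, lat);
            ps.setDouble(2, lon);
            ps.setLong(3, vehicleId);
            ps.executeUpdate();
        }
    }
}

If, instead, a connection is held across the whole request/response cycle with the tablet, a dropped client can pin a pooled connection until some timeout fires, which is worth checking.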
On Fri, Oct 4, 2024 at 06:29 KK CHN <kkchn.in@gmail.com> wrote:
On 10/3/24 21:29, KK CHN wrote:
> List,
>
> I am facing a network (TCP IP connection closing issue) .
>
> Running a mobile tablet application, Android application to update the
> status of vehicles fleet say around 1000 numbers installed with the app
> on each vehicle along with a vehicle tracking application server
> solution based on Java and Wildfly with PosrgreSQL16 backend.
>
> The running vehicles may disconnect or be unable to send the location
> data in between if the mobile data coverage is less or absent in a
> particular area where data coverage is nil or signal strength less.
>
> The server on which the backend application runs most often ( a week's
> time or so) shows connection timeout and is unable to serve tracking
> of the vehicles further.
>
> When we restart the Wildfly server the application returns to normal.
> again the issue repeats after a week or two.

Seems the issue is in the application server. What is not clear to me is
whether the connection timeout you refer to is from the mobile devices
to the application or the application to the Postgres server? I'm
guessing the latter, as I would expect the mobile devices to drop
connections more often than weekly.

> In the Server machine when this bottleneck occurs I am seeing a lot
> of TCP/IP CLOSE_WAIT ( 3000 to 5000 ) when the server backend becomes
> unresponsive.

Again not clear: are you referring to the application or the Postgres
database running on the server?

> What is the root cause of this issue ? Is it due to the android
> application unable to send the CLOSE_WAIT ACK due to poor mobile data
> connectivity ?
>
> If so, how do people address this issue ? and what may be a fix ?
>
> Any directions / or reference material most welcome.
>
> Thank you,
> Krishane

--
Adrian Klaver
adrian.klaver@aklaver.com
On Fri, Oct 4, 2024 at 9:17 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> On 10/3/24 21:29, KK CHN wrote:
>> List,
>>
>> I am facing a network (TCP IP connection closing issue) .
>>
>> Running a mobile tablet application, Android application to update the
>> status of vehicles fleet say around 1000 numbers installed with the app
>> on each vehicle along with a vehicle tracking application server
>> solution based on Java and Wildfly with PosrgreSQL16 backend.
>>
>> The running vehicles may disconnect or be unable to send the location
>> data in between if the mobile data coverage is less or absent in a
>> particular area where data coverage is nil or signal strength less.
>>
>> The server on which the backend application runs most often ( a week's
>> time or so) shows connection timeout and is unable to serve tracking
>> of the vehicles further.
>>
>> When we restart the Wildfly server the application returns to normal.
>> again the issue repeats after a week or two.
>
> Seems the issue is in the application server. What is not clear to me is
> whether the connection timeout you refer to is from the mobile devices
> to the application or the application to the Postgres server?
It's from the mobile devices to the application server. When I restart the application server everything goes back to normal, but after a period of time it seizes up again. At that point, when I run netstat on the application VM, I see lots of connections in the CLOSE_WAIT state as indicated.
> I'm guessing the latter as I would expect the mobile devices to drop
> connections more often than weekly.
Yes, mobile devices may drop connections at any point in time if they reach an area where signal strength is poor (e.g. underground parking, or areas where mobile data coverage is poor).
The topology is: mobile devices connect and update their location via the application VM, which finally writes to the PGSQL VM.
The application server and database server are separate virtual machines. It is the application server that hangs most often, not the database VM: other applications update the database VM without any issue, and the DB VM handles all the writes from those applications. But those are different applications, not the fleet management one.
>> In the Server machine when this bottleneck occurs I am seeing a lot
>> of TCP/IP CLOSE_WAIT ( 3000 to 5000 ) when the server backend becomes
>> unresponsive.
>
> Again not clear, are you referring to the application or the Postgres
> database running on the server?
>
>> What is the root cause of this issue ? Is it due to the android
>> application unable to send the CLOSE_WAIT ACK due to poor mobile data
>> connectivity ?
>>
>> If so, how do people address this issue ? and what may be a fix ?
>>
>> Any directions / or reference material most welcome.
>>
>> Thank you,
>> Krishane
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
On 10/6/24 06:26, KK CHN wrote:
> On Fri, Oct 4, 2024 at 9:17 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>> Seems the issue is in the application server. What is not clear to me is
>> whether the connection timeout you refer to is from the mobile devices
>> to the application or the application to the Postgres server?
>
> It's from the mobile devices to the application server. When I restart
> the application server everything goes back to normal, but after a
> period of time it seizes up again. At that point, when I run netstat on
> the application VM, I see lots of connections in the CLOSE_WAIT state
> as indicated.
>
>> I'm guessing the latter as I would expect the mobile devices to drop
>> connections more often than weekly.
>
> Yes, mobile devices may drop connections at any point in time if they
> reach an area where signal strength is poor (e.g. underground parking,
> or areas where mobile data coverage is poor).
>
> The topology is: mobile devices connect and update their location via
> the application VM, which finally writes to the PGSQL VM.
>
> The application server and database server are separate virtual
> machines. It is the application server that hangs most often, not the
> database VM: other applications update the database VM without any
> issue, and the DB VM handles all the writes from those applications.
> But those are different applications, not the fleet management one.

From what I see this really has nothing to do with the Postgres backend.
It is a matter of communication, actually lack of communication, between
the mobile devices and the application server.

A broad answer is that something needs to be done to gracefully deal
with mobile device connection drops.

--
Adrian Klaver
adrian.klaver@aklaver.com
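One quick way to confirm which side those CLOSE_WAIT sockets face is to group them by local address and port. With the iproute2 ss tool that is roughly (the column position may differ between ss versions):

ss -tn state close-wait | awk 'NR > 1 {print $3}' | sort | uniq -c | sort -rn

If the bulk of them have local port 8080 (or 443) they face the tablets; CLOSE_WAIT sockets belonging to the JDBC side would instead show ephemeral local ports with 5432 as the peer (swap $3 for $4 to group by peer).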
On 2024-Oct-04, KK CHN wrote:
> The mobile tablets are installed with the android based vehicle
> tracking app which updated every 30 seconds its location fitted inside the
> vehicle ( lat long coordinates) to the PostgreSQL DB through the java
> backend application to know the latest location of the vehicle and its
> movement which will be rendered in a map based front end.
>
> The vehicles on the field communicate via 443 to 8080 of the Wildfly
> (version 27 ) deployed with the vehicle tracking application developed with
> Java(version 17).

It sounds like setting TCP keepalives on the connections between
Wildfly and the vehicles might help get the number of dead connections
down to a reasonable level. Then it's up to Wildfly to close the
connections to Postgres in a timely fashion. (It's not clear from your
description how the vehicle connections to Wildfly relate to Postgres
connections.)

I wonder if the connections from Wildfly to Postgres use SSL? Because
there are reported cases where TCP connections are kept and accumulate,
causing problems -- but apparently SSL is a necessary piece for that to
happen.

--
Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Thou shalt study thy libraries and strive not to reinvent them without
cause, that thy code may be short and readable and thy days pleasant
and productive.  (7th Commandment for C Programmers)
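For the client-facing side of a stock WildFly 27, the usual place to switch that on is the Undertow listener configuration rather than application code. If I remember the attribute names correctly, something along these lines enables SO_KEEPALIVE on accepted sockets and also closes connections that sit idle with no request (the listener and server names below are the defaults and may differ in your setup):

$ ./bin/jboss-cli.sh --connect
/subsystem=undertow/server=default-server/http-listener=default:write-attribute(name=tcp-keep-alive,value=true)
/subsystem=undertow/server=default-server/http-listener=default:write-attribute(name=no-request-timeout,value=60000)
reload

On the JDBC side, the PostgreSQL driver also accepts a tcpKeepAlive=true connection parameter, should the Wildfly-to-Postgres connections ever become a concern.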
On Mon, Oct 7, 2024 at 12:07 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2024-Oct-04, KK CHN wrote:
>> The mobile tablets are installed with the android based vehicle
>> tracking app which updated every 30 seconds its location fitted inside the
>> vehicle ( lat long coordinates) to the PostgreSQL DB through the java
>> backend application to know the latest location of the vehicle and its
>> movement which will be rendered in a map based front end.
>>
>> The vehicles on the field communicate via 443 to 8080 of the Wildfly
>> (version 27 ) deployed with the vehicle tracking application developed with
>> Java(version 17).
>
> It sounds like setting TCP keepalives on the connections between
> Wildfly and the vehicles might help get the number of dead connections
> down to a reasonable level. Then it's up to Wildfly to close the
> connections to Postgres in a timely fashion. (It's not clear from your
> description how the vehicle connections to Wildfly relate to Postgres
> connections.)
Where do I have to introduce the TCP keepalives? At the OS level or in the application code?
[root@dbch wildfly-27.0.0.Final]# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
[root@dbch wildfly-27.0.0.Final]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[root@dbch wildfly-27.0.0.Final]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
[root@dbch wildfly-27.0.0.Final]#
These are the default values at the OS level. Do I need to reduce all three of the above values, to say 600, 20 and 5? Or does this need to be handled in the application backend code?
Any hints much appreciated.
> I wonder if the connections from Wildfly to Postgres use SSL? Because
> there are reported cases where TCP connections are kept and accumulate,
> causing problems -- but apparently SSL is a necessary piece for that to
> happen.
No SSL between Wildfly (8080) and PGSQL (5432); both machines are internal LAN VMs on the same network. Only the devices in the field (fitted in the vehicles) communicate with the application backend via a public URL on port 443, which connects to port 8080 of Wildfly, and then the Java code connects to the database server running on 5432 on the internal LAN.
> --
> Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
> Thou shalt study thy libraries and strive not to reinvent them without
> cause, that thy code may be short and readable and thy days pleasant
> and productive.  (7th Commandment for C Programmers)
On 2024-Oct-07, KK CHN wrote:
> On Mon, Oct 7, 2024 at 12:07 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Where do I have to introduce the TCP keepalives? At the OS level or in
> the application code?
>
> [root@dbch wildfly-27.0.0.Final]# cat /proc/sys/net/ipv4/tcp_keepalive_time
> 7200
> [root@dbch wildfly-27.0.0.Final]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
> 75
> [root@dbch wildfly-27.0.0.Final]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
> 9
> [root@dbch wildfly-27.0.0.Final]#
>
> These are the default values at the OS level. Do I need to reduce all
> three of the above values, to say 600, 20 and 5? Or does this need to
> be handled in the application backend code?

My understanding is that these values have no effect unless the socket
gets setsockopt( ... , SO_KEEPALIVE, ...), so that's definitely something
that the app needs to do -- it's not enabled automatically.

With these default settings, the connection would be closed about 2:11
after going quiet (roughly two hours and eleven minutes: 7200 s + 9 × 75 s),
so if your problem manifests only a week later, you would have enough
time for these to be cleaned up. But of course you should monitor what
happens.

>> I wonder if the connections from Wildfly to Postgres use SSL? Because
>> there are reported cases where TCP connections are kept and accumulate,
>> causing problems -- but apparently SSL is a necessary piece for that to
>> happen.
>
> No SSL between Wildfly (8080) and PGSQL (5432).

Okay, that's unlikely to be relevant then.

--
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Linux transformed my computer from a 'machine for getting things done'
into a truly entertaining device, from which I learn something new
every day."  (Jaime Salinas)
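For completeness, enabling it per socket from Java 17 looks roughly like the sketch below. This only applies where the backend manages sockets itself; connections accepted by WildFly/Undertow or opened by the JDBC driver are configured through those layers instead, and the numbers simply mirror the 600/20/5 mentioned above as an example:

import java.io.IOException;
import java.net.Socket;
import jdk.net.ExtendedSocketOptions;

public final class KeepAliveExample {

    private KeepAliveExample() {}

    public static void enable(Socket s) throws IOException {
        s.setKeepAlive(true);   // the setsockopt(SO_KEEPALIVE) mentioned above

        // Optional per-socket timing (Linux, JDK 11+), instead of lowering the
        // global /proc/sys/net/ipv4/tcp_keepalive_* values for every process:
        s.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 600);      // seconds idle before probing starts
        s.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 20);   // seconds between probes
        s.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 5);       // unanswered probes before the socket is dropped
    }
}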