[omniORB] question on behavior of client on server restart - possible hang?

Igor Lautar

2009-08-05 19:14:11 UTC

Hi All,

We are hunting down a problem with our client software, which uses omniORB
for communication (client and server are v4.1.1 on Windows).

Our problem could be explained if an omniORB call does not return. We think
we saw something like this in the past, but are not sure.

My question is:
Is it possible (under certain circumstances) that a client call hangs
indefinitely? We do not have clientCallTimeOutPeriod set (it's 0).
Would it be possible that the client ORB never detects that the server is
gone (other than for one-way calls)?

We suspect something like this may happen when the servant is restarted and
the call has already been waiting for some time.

The only reference I've found for this kind of behaviour is a comment by
Richard Hirst some time back:
http://www.omniorb-support.com/pipermail/omniorb-list/2007-January/028386.html
There were no replies to it, but it seems the patch was applied to 4.1.1.

Thx for ideas,
Igor

Duncan Grisby

2009-08-11 17:21:53 UTC

Post by Igor Lautar
We are hunting down a problem with our client SW using omniORB (client and
server are v4.1.1 on windows) for communication.

Can you update to 4.1.4? That has a number of bug fixes that may be
relevant.

Post by Igor Lautar
Our problem could be explained, if omniORB call does not return. We think we
saw something like this in past, but are not sure.
Is it possible (under certain circumstances) that client call hangs
indefinitely? We do not have clientCallTimeOutPeriod set (it's 0).
Would it be possible that client ORB never detects that server is gone (other
than one way calls)?

Yes, it can happen. When omniORB makes a call, it sends data on a TCP
connection, then blocks waiting for the response. If the server dies,
TCP can fail to notice, and the OS may leave the recv() call blocked for
ever. There's not much omniORB can do in this situation, other than have
a call timeout.
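
The blocking behaviour described above can be reproduced with plain sockets. The following is a minimal sketch, not omniORB code: a local listener stands in for a server that accepts the request and then never replies, and `call_with_timeout` is a hypothetical helper name. A socket-level timeout plays the role that a non-zero clientCallTimeOutPeriod plays at the ORB level. With the timeout set to `None`, the `recv()` call would block indefinitely.

```python
import socket
import threading


def call_with_timeout(timeout_s):
    """Send a request to a server that never replies; report what recv() did."""
    # Local listener standing in for the server (illustrative only).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    accepted = []

    def silent_server():
        conn, _ = listener.accept()
        accepted.append(conn)  # hold the connection open, never send a reply

    thread = threading.Thread(target=silent_server, daemon=True)
    thread.start()

    client = socket.create_connection(listener.getsockname())
    client.settimeout(timeout_s)  # None here would mean "block for ever"
    try:
        client.sendall(b"request")
        client.recv(1024)  # blocks until data arrives or the timeout expires
        outcome = "reply"
    except socket.timeout:
        outcome = "timeout"
    finally:
        client.close()
        thread.join(timeout=1.0)
        for conn in accepted:
            conn.close()
        listener.close()
    return outcome


if __name__ == "__main__":
    print(call_with_timeout(0.5))
```

The point of the sketch is that the client gets control back only because it imposed its own deadline; nothing on the TCP connection told it that a reply was not coming.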

You could try setting the SO_KEEPALIVE socket option on omniORB's
connections, by adding it to src/lib/omniORB/orbcore/tcp/tcpEndpoint.cc
where it sets other options. That will check the connection if it's been
idle for a long time.
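
As a rough illustration of what setting that option looks like at the socket level, here is a sketch in Python rather than in omniORB's C++ source. The helper name `enable_keepalive` and its parameters are assumptions made for this example. The per-socket tuning constants (`TCP_KEEPIDLE`, `TCP_KEEPINTVL`, `TCP_KEEPCNT`) are platform specific, so the sketch applies them only where the `socket` module exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=None, interval_s=None, probes=None):
    """Turn on SO_KEEPALIVE and, where supported, tune the probe timing."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Optional per-socket tuning; skipped on platforms lacking the constants.
    if idle_s is not None and hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if interval_s is not None and hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if probes is not None and hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

    # Read the option back to confirm that it took effect.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print(enable_keepalive(s))
    finally:
        s.close()
```

Without the optional tuning, the probe timing comes from the operating system's defaults, which are typically measured in hours.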

Cheers,

Duncan.

--
-- Duncan Grisby --
-- ***@grisby.org --
-- http://www.grisby.org --

Igor Lautar

2009-08-11 17:33:05 UTC

Post by Duncan Grisby

Post by Igor Lautar
We are hunting down a problem with our client SW using omniORB (client
and server are v4.1.1 on windows) for communication.

Can you update to 4.1.4? That has a number of bug fixes that may be
relevant.

Updating is not a big problem; I will try to reproduce with 4.1.4 (see my
other mail on the list, I had some luck actually getting into this state).

Post by Duncan Grisby
Yes, it can happen. When omniORB makes a call, it sends data on a TCP
connection, then blocks waiting for the response. If the server dies,
TCP can fail to notice, and the OS may leave the recv() call blocked for
ever. There's not much omniORB can do in this situation, other than have
a call timeout.

Yeah, this is how I thought about it as well.

Post by Duncan Grisby
You could try setting the SO_KEEPALIVE socket option on omniORB's
connections, by adding it to src/lib/omniORB/orbcore/tcp/tcpEndpoint.cc
where it sets other options. That will check the connection if it's been
idle for a long time.

Hmm, but SO_KEEPALIVE would not help in the case where the ORB closes
connections after the connection idle time?

Some calls we make can take a long time (this could be improved by making
these calls async), longer than the idle time on connections.

But we also make a lot of connections, so I'm not sure increasing the idle
time would be a good idea. One server can have 500+ connections at the same
time.

Thx,
Igor

Duncan Grisby

2009-08-11 19:33:56 UTC

On Tuesday 11 August, Igor Lautar wrote:

[...]

Post by Igor Lautar
Hmm, but SO_KEEPALIVE would not help in the case where the ORB closes
connections after the connection idle time?

From omniORB's point of view, a connection where the client is blocked
waiting for a reply from the server is not "idle", even though there's
no traffic. A connection is only idle (and a candidate for closure) if
there are no calls in progress on it.
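
The distinction between "no traffic" and "idle" can be made concrete with a toy model. This is only a sketch of the rule as stated above, not omniORB's actual implementation; the class `Connection`, its methods, and the timeout value are all hypothetical.

```python
class Connection:
    """Toy model: a connection is idle only when no call is in progress."""

    def __init__(self, now=0.0):
        self.calls_in_progress = 0
        self.last_activity = now

    def begin_call(self, now):
        # A request has been sent and the client is waiting for the reply.
        self.calls_in_progress += 1
        self.last_activity = now

    def end_call(self, now):
        # The reply arrived (or the call failed); the connection may go idle.
        self.calls_in_progress -= 1
        self.last_activity = now

    def is_idle(self, now, idle_timeout_s):
        # A blocked call keeps the connection busy however long it is silent.
        if self.calls_in_progress > 0:
            return False
        return now - self.last_activity >= idle_timeout_s


if __name__ == "__main__":
    conn = Connection()
    conn.begin_call(now=0.0)
    print(conn.is_idle(now=1000.0, idle_timeout_s=120.0))  # call still pending
    conn.end_call(now=1000.0)
    print(conn.is_idle(now=1120.0, idle_timeout_s=120.0))  # quiet long enough
```

In this model a long-running call never makes its connection a candidate for closure, so the idle time does not need to be raised to accommodate slow calls.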

Despite its name, SO_KEEPALIVE is actually there to kill a TCP
connection if the network breaks, not to keep the connection alive.

The situation you're in is that the client sends a request and the server
sends one or more TCP ACKs so the client knows the data has arrived.
Now the client is waiting for the server to reply, and there is no
traffic at all on the TCP connection. The nature of TCP means that if
the network breaks or the server is uncleanly shut down, the client will
not receive any indication of that, so as far as it's concerned, the
server might still be there. SO_KEEPALIVE sends a test packet every once
in a while, so the client will notice if the server is no longer
reachable, and will close the TCP connection.

Cheers,

Duncan.

Igor Lautar

2009-08-11 19:40:16 UTC

Post by Duncan Grisby
From omniORB's point of view, a connection where the client is blocked
waiting for a reply from the server is not "idle", even though there's
no traffic. A connection is only idle (and a candidate for closure) if
there are no calls in progress on it.
Despite its name, SO_KEEPALIVE is actually there to kill a TCP
connection if the network breaks, not to keep the connection alive.
The situation you're in is that the client sends a request and the server
sends one or more TCP ACKs so the client knows the data has arrived.
Now the client is waiting for the server to reply, and there is no
traffic at all on the TCP connection. The nature of TCP means that if
the network breaks or the server is uncleanly shut down, the client will
not receive any indication of that, so as far as it's concerned, the
server might still be there. SO_KEEPALIVE sends a test packet every once
in a while, so the client will notice if the server is no longer
reachable, and will close the TCP connection.

Thx for explanation.

So this would have the side effect that connections could break (more often)
if the network goes down for a few seconds? The keep-alive probe would not
reach the server, whereas a connection on which no probe is sent would
survive the downtime?

It could be worthwhile to make it configurable; maybe I'll get around to
making a patch... At least for me, it seems useful under certain
circumstances. But in our case, a timeout is the better solution.

Regards,

Duncan Grisby

2009-08-12 16:49:55 UTC

On Tuesday 11 August, Igor Lautar wrote:

[...]

Post by Igor Lautar
So this would have the side effect that connections could break (more often)
if the network goes down for a few seconds? The keep-alive probe would not
reach the server, whereas a connection on which no probe is sent would
survive the downtime?

It shouldn't drop a connection if the network is only down for a few
seconds. According to RFC 1122, section 4.2.3.6:

Keep-alive packets MUST only be sent when no data or acknowledgement
packets have been received for the connection within an interval.
This interval MUST be configurable and MUST default to no less than
two hours.

It is extremely important to remember that ACK segments that contain
no data are not reliably transmitted by TCP. Consequently, if a
keep-alive mechanism is implemented it MUST NOT interpret failure to
respond to any specific probe as a dead connection.

Note that the default time between keep-alives must be at least 2 hours,
and one failure shouldn't count as a dead connection.
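
To put numbers on this, the time for keep-alive to declare a silent peer dead can be estimated as the idle period before the first probe plus the probe interval times the number of unanswered probes. The sketch below uses a hypothetical helper, `keepalive_detection_time`. The example values (7200 s idle, 75 s between probes, 9 probes) are the usual Linux defaults and are used here only as an assumption; other platforms differ.

```python
def keepalive_detection_time(idle_s, interval_s, probes):
    """Approximate seconds until a dead peer is detected by TCP keep-alive.

    idle_s:     quiet time before the first probe is sent
    interval_s: spacing between successive unanswered probes
    probes:     unanswered probes tolerated before the connection is dropped
    """
    return idle_s + interval_s * probes


if __name__ == "__main__":
    # Assumed values: typical Linux defaults, not taken from this thread.
    total = keepalive_detection_time(idle_s=7200, interval_s=75, probes=9)
    print(total, "seconds, about", round(total / 3600, 2), "hours")
```

With these values detection takes 7875 seconds, a little over two hours. An outage of a few seconds is shorter than one probe interval, so at most one probe would be lost, which by itself does not count as a dead connection.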

Cheers,

Duncan.

Igor Lautar

2009-08-12 17:00:44 UTC

Post by Duncan Grisby
It shouldn't drop a connection if the network is only down for a few
seconds. According to RFC 1122, section 4.2.3.6:
Keep-alive packets MUST only be sent when no data or acknowledgement
packets have been received for the connection within an interval.
This interval MUST be configurable and MUST default to no less than
two hours.
It is extremely important to remember that ACK segments that contain
no data are not reliably transmitted by TCP. Consequently, if a
keep-alive mechanism is implemented it MUST NOT interpret failure to
respond to any specific probe as a dead connection.
Note that the default time between keep-alives must be at least 2 hours,
and one failure shouldn't count as a dead connection.

Makes sense, thx for all your help.

Regards,
Igor