[omniORB] Bidir connection problem

Andrea Venturoli

2014-03-12 15:05:19 UTC

Hello.

We've got a problem with bidir connection I can't get to solve.

This is our setup:

_ server side we've got an application based on omniORB 4.1.6;

_ clients use JacORB and connect to the server with a bidir policy;

_ the server periodically makes oneway calls to clients; in order to
avoid blocking on these, ClientCallTimeout was set to five seconds;

_ since we need to transfer huge amounts of data, we raised
giopMaxMsgSize. We tried several values and found out that the bigger
this parameter is, the higher is the chance of seeing the problem,
especially on slow lines. However using the default value does not mean
the problem never happens, it's just more unlikely to happen (and harder
to reproduce).

The sequence of events that lead to the problem is the following:

_ the client connects to the server: bidir policy works and the two ends
are able to exchange messages;

_ then the client makes a long call to the server: this seems to block
the only active connection for other uses;

_ meanwhile the server makes a call to the client, which times out; so

2014-Mar-12 14:10:11: omniORB: Timed out waiting for write lock: giop:tcp:10.1.2.26:61583
2014-Mar-12 14:10:11: omniORB: throw giopStream::CommFailure from giopStream.cc:331(0,NO,TRANSIENT_CallTimedout)
2014-Mar-12 14:10:11: omniORB: Unexpected error encountered in talking to the server giop:tcp:10.1.2.26:61583. The connection is closed immediately. GIOP_C state 2, strand state 0
2014-Mar-12 14:10:11: omniORB: throw TRANSIENT from omniObjRef.cc:809 (NO,TRANSIENT_CallTimedout)

Notice that while the notification is lost, the client side call completes.

[jacorb.orb.iiop] DEBUG : Transport to 10.1.2.15:60606: stream closed on read < 0
[jacorb.giop.conn] DEBUG : ClientGIOPConnection to 10.1.2.15:60606 (5f064398): getMessage() -- COMM_FAILURE
[jacorb.giop.conn] DEBUG : ClientGIOPConnection to 10.1.2.15:60606 (5f064398): streamClosed()
[jacorb.giop.conn] DEBUG : ClientGIOPConnection to 10.1.2.15:60606 (5f064398): closeAllowReopen()
[jacorb.orb.iiop] INFO : Client-side TCP transport to 10.1.2.15:60606 closed.
[jacorb.giop.conn] DEBUG : ClientGIOPConnection to 10.1.2.15:60606 (5f064398): will wait until connected
524589
[jacorb.giop.conn] DEBUG : ClientGIOPConnection to 10.1.2.15:60606 (5f064398): sendMessage() -- opening transport
[jacorb.orb.iiop] DEBUG : Trying to connect to 10.1.2.15:60606 with timeout=90000.
[jacorb.orb.iiop] INFO : Connected to 10.1.2.15:60606 from local port 47203

_ from now on, while the client can call the server fine, any call from

2014-Mar-12 14:10:21: 2014-Mar-12 14:10:21: omniORB: throw TRANSIENT from giopBiDir.cc:447 (NO,TRANSIENT_BiDirConnIsGone)
2014-Mar-12 14:10:21: omniORB: throw TRANSIENT from omniObjRef.cc:822 (NO,TRANSIENT_BiDirConnIsGone)

What we'd like to achieve:

_ ideally the "notification" should overlap with the client side call; I
know this is probably hard, if not impossible, to achive, especially
when limited bandwidth is the reason behind the timeout, so we could
live with a lost call;

_ what I really don't understand is why any subsequent call fails, even
if the client side traffic has dropped.

Should the old connection really close on a failed oneway call, even if
data is incoming fine on this connection?

Then again, why isn't the second connection used?

2014-Mar-12 14:10:20: omniORB: Server accepted connection from giop:tcp:10.1.2.26:47203
2014-Mar-12 14:10:20: omniORB: giopWorker task execute.
2014-Mar-12 14:10:20: omniORB: Accepted connection from giop:tcp:10.1.2.26:47203 because of this rule: "* tcp,ssl,bidir"
2014-Mar-12 14:10:20: omniORB: inputMessage: from giop:tcp:10.1.2.26:47203 179 bytes
2014-Mar-12 14:10:20: omniORB: Receive codeset service context and set TCS to (UTF-8,UTF-16)
2014-Mar-12 14:10:20: omniORB: Receive bidir IIOP service context: ( 10.1.2.26:39278 10.1.2.26:44795 )
2014-Mar-12 14:10:20: omniORB: Accepted request from giop:tcp:10.1.2.26:47203 to switch to bidirectional because of this rule: "* tcp,ssl,bidir"

Bidir is again established, so why is it not used?

Any help is appreciated.

Thanks
av.