[omniORB] Problems with corbaloc

Nigel Rantor

2006-11-21 18:06:43 UTC

Hi all,

I've used omniORB before and have decided to use it on a new project I
am a part of. I've used the C++ bindings in the past, now I'm
experimenting with Python.

I joined the list last week and it seems fairly low-volume so I'm going
to post without having been around too long.

I think I have some idea of what the underlying problem may be but I'm
not sure, so here goes.

I want to start up one specific service in such a way that I do not need
a bootstrapping service to get hold of it remotely.

My current solution is to provide the ORB with an endpoint, use the
omniINSPOA to house my service and provide a well-known name to it so
that I can construct a corbaloc URL by knowing the hostname, port and name.
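
For illustration, building such a URL might look like the Python sketch below. The helper name `make_corbaloc` and the object key `MyService` are hypothetical; the host and port are simply the ones that appear in the trace further down. It assumes the key is plain ASCII that needs no escaping.

```python
def make_corbaloc(host, port, object_key):
    """Build a corbaloc URL from a host name, a port and a well-known key.

    Assumes object_key contains only characters that need no URL escaping.
    """
    return "corbaloc:iiop:{0}:{1}/{2}".format(host, port, object_key)


# Hypothetical values: "MyService" is an invented key for illustration.
url = make_corbaloc("172.16.69.250", 9991, "MyService")
print(url)  # corbaloc:iiop:172.16.69.250:9991/MyService

# A client would then pass the URL to the standard CORBA call
# orb.string_to_object(url) and narrow the result to the right interface.
```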

This all works fine. I have no problems doing this, it's groovy. I have
code that works.

My problem appears when I test these services by killing them: when
they come back up and talk to each other, I get COMM_FAILURE
errors.

This happens as soon as they start up as they attempt to contact all the
other machines that should be running this service. The weird thing is
that the initiator seems to be okay, but when the receiver attempts to
call back to the initiator it dies with a COMM_FAILURE.

To make it more concrete, let's say I have this service running on two
machines, A and B.

1) start service on A

2) service on A attempts to contact B, B is not running yet, fine.

3) start service on B

4) service on B attempts to contact A, A is running and replies.

5) kill service on B

6) start service on B

7) service on B attempts to contact A, A is running and has an operation
invoked on it successfully by B. A then attempts to invoke an operation
on B and a CORBA.COMM_FAILURE is raised.

If I leave the service on B dead for long enough this problem does not
occur, so I turned tracing on and found that once the service on A gets
to the point where it prints the below message out I can then kill and
restart the service on B and everything works.

--------------------------------------------------------------------
omniORB: Scanning Python thread states.
omniORB: Scanning Python thread states.
omniORB: Scanning Python thread states.
omniORB: Scanning Python thread states.
omniORB: sendCloseConnection: to giop:tcp:172.16.69.250:9991 12 bytes
omniORB: Client connection refcount (forced) = 0
omniORB: Client close connection to giop:tcp:172.16.69.250:9991
omniORB: throw giopStream::CommFailure from
giopStream.cc:835(0,NO,COMM_FAILURE_UnMarshalArguments)
omniORB: Server connection refcount = 1
omniORB: Server connection refcount = 0
omniORB: Server close connection from giop:tcp:172.16.69.250:40464
omniORB: Deleting Python state for thread id 1085389744 (thread exit)
omniORB: AsyncInvoker: thread id = 4 has exited. Total threads = 3
--------------------------------------------------------------------

Now, I don't really have any good ideas, but it does strike me that the
line that says:

--------------------------------------------------------------------
omniORB: throw giopStream::CommFailure from
giopStream.cc:835(0,NO,COMM_FAILURE_UnMarshalArguments)
--------------------------------------------------------------------

isn't actually throwing anything to the app level at the time. I'm
wondering if this is possibly being held over until I next attempt to
invoke an operation on that same connection? In normal operation this
won't happen because the remote servant will have a different port/IOR
over different invocations, but in my case the corbaloc URL doesn't change.

Any thoughts or ideas would be greatly appreciated. I'm sure I can add
some code to work around this but I'd really rather have the system Just
Work(tm)

Thanks,

n

Duncan Grisby

2006-11-23 18:15:08 UTC

On Tuesday 21 November, Nigel Rantor wrote:

[...]

Post by Nigel Rantor
1) start service on A
2) service on A attempts to contact B, B is not running yet, fine.
3) start service on B
4) service on B attempts to contact A, A is running and replies.
5) kill service on B
6) start service on B
7) service on B attempts to contact A, A is running and has an
operation invoked on it successfully by B. A then attempts to invoke an
operation on B and a CORBA.COMM_FAILURE is raised.

The issue is that A has cached a connection to B. Since B restarts
listening on the same port, A thinks its cached connection is valid.
It's not until it tries to use it that it notices that it's broken.

Because of the way the network stack works, the socket send call that A
performs actually succeeds even though the data has nowhere to go. It's
only when it tries to do a receive that it finds out about the failure.
Since as far as it's concerned it has sent the message, it has no way to
know whether B did actually get the message, or whether it's safe to
retry, so it has to throw a COMM_FAILURE exception. In other situations
(like sending a big request message) the socket send fails, and omniORB
knows it's safe to retry.
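
The send-succeeds, receive-fails behaviour can be reproduced with plain TCP sockets, independently of omniORB. The sketch below is only an illustration using the Python standard library on the loopback interface; the function name is hypothetical, and closing the accepted socket stands in for B being killed. Depending on the platform, the receive either reports end-of-file or raises an error, so both are treated as "the failure shows up on receive".

```python
import socket


def send_then_receive_on_dead_peer(payload=b"request"):
    """Send on a connection whose peer has gone away, then try to receive.

    Returns (bytes_sent, outcome) where outcome is "eof", "error" or "data".
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(5)
    listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname(), timeout=5)
    server_side, _ = listener.accept()

    # Simulate the remote process dying: its sockets are closed.
    server_side.close()
    listener.close()

    try:
        # The send is accepted by the local network stack even though
        # nobody is left to read the data.
        sent = client.send(payload)
        try:
            data = client.recv(1024)
            outcome = "eof" if data == b"" else "data"
        except OSError:
            outcome = "error"
    finally:
        client.close()
    return sent, outcome


sent, outcome = send_then_receive_on_dead_peer()
print(sent, outcome)
```

The caller learns about the broken connection only at the receive step, by which point it can no longer tell whether the peer processed the request.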

The solution for you is to install a COMM_FAILURE exception handler that
retries once. After the COMM_FAILURE, omniORB will open a new connection
and it'll be fine.
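
A minimal sketch of such a handler in omniORBpy follows. It assumes the `omniORB.installCommFailureExceptionHandler(cookie, function [, object])` call from the omniORBpy API, whose handler receives `(cookie, retries, exc)` and returns a true value to request a retry. The names `retry_once_on_comm_failure`, `install_handler` and `MAX_RETRIES` are hypothetical, and the import is kept inside the install function so that the handler logic can be exercised without omniORBpy present.

```python
MAX_RETRIES = 1  # "retries once", as suggested above


def retry_once_on_comm_failure(cookie, retries, exc):
    """Exception handler: retry while fewer than MAX_RETRIES retries were made.

    `retries` is the number of times the call has already been retried.
    Returning True asks the ORB to retry; False lets the exception propagate.
    """
    return retries < MAX_RETRIES


def install_handler(obj_ref=None):
    """Install the handler globally, or for a single object reference."""
    import omniORB  # requires omniORBpy; imported lazily on purpose

    if obj_ref is None:
        omniORB.installCommFailureExceptionHandler(
            None, retry_once_on_comm_failure)
    else:
        omniORB.installCommFailureExceptionHandler(
            None, retry_once_on_comm_failure, obj_ref)
```

Calling `install_handler()` once after ORB initialisation would apply the policy to all object references; passing a specific reference limits it to that object. Retrying is only appropriate if the operation is safe to run twice, since the first request may have reached the peer.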

Post by Nigel Rantor
If I leave the service on B dead for long enough this problem does not
occur, so I turned tracing on and found that once the service on A
gets to the point where it prints the below message out I can then
kill and restart the service on B and everything works.

[...]

Post by Nigel Rantor
omniORB: sendCloseConnection: to giop:tcp:172.16.69.250:9991 12 bytes
omniORB: Client connection refcount (forced) = 0
omniORB: Client close connection to giop:tcp:172.16.69.250:9991

This is omniORB closing the idle connection (which is actually broken,
but it doesn't know that). Once it's closed, a new call will open a new
connection, hence the lack of exception.

Post by Nigel Rantor
omniORB: throw giopStream::CommFailure from
giopStream.cc:835(0,NO,COMM_FAILURE_UnMarshalArguments)

This is merely an internal implementation detail. It's caught by another
bit of omniORB, so it's not expected to propagate to your application
code.

Cheers,

Duncan.

--
-- Duncan Grisby --
-- ***@grisby.org --
-- http://www.grisby.org --

Nigel Rantor

2006-11-24 19:36:56 UTC

Post by Duncan Grisby
The solution for you is to install a COMM_FAILURE exception handler that
retries once. After the COMM_FAILURE, omniORB will open a new connection
and it'll be fine.

Thanks Duncan, that makes me feel a lot better :-) I'll try it next week.

Keep up the good work,

n