Endian problems (was Re: [omniORB] Client hangs while trying to narrow reference specified by corbaloc)

Patrick Hartling

2006-08-03 00:58:20 UTC

It appears that I did not diagnose this problem correctly. On further
testing, I determined that the machine-to-machine connection works
between two little-endian architectures but not between one big-endian
and one little-endian architecture. Removing the use of the omniORB
INS POA did not change that, so it is no longer a factor
(thankfully). As an aside, my previous test of the Linux-to-Windows
case failed because the Windows Firewall was getting in the way. That
was the result of a completely separate issue that I still need to
resolve.

If the client tries to connect to the server using a stringified
object reference, I get a SequenceIsTooLong marshaling exception from
within IOP::IOR::unmarshaltype_id(). Tracing into it, I find that the
sequence length is supposed to be 32, but the byte order does not get
swapped correctly, which causes omniORB to think that the sequence is
much, much longer. This happens when the client is big-endian and the
server is little-endian, and vice versa. If the endianness matches on
both machines, there are no problems.
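
To illustrate why a mis-swapped length of 32 looks enormous, here is a
minimal sketch. It is not omniORB code; swap32 is a hypothetical helper
written only for this example. It shows what a 32-bit length of 32
becomes when its bytes are read in the wrong order.

```cpp
#include <cassert>   // for the runtime check mentioned below
#include <cstdint>

// Hypothetical helper (not part of omniORB): reverse the byte order of
// a 32-bit value, as a receiver must do when the sender's byte order
// differs from its own.
constexpr std::uint32_t swap32(std::uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

// A length of 32 is 0x00000020. Read with the bytes reversed it is
// 0x20000000, i.e. 536870912, which any sane sequence-length limit
// would reject.
static_assert(swap32(32u) == 0x20000000u, "32 misread as 536870912");

// Swapping twice restores the original value.
static_assert(swap32(swap32(32u)) == 32u, "swap is its own inverse");
```

The same check can be made at run time with
assert(swap32(32u) == 536870912u). A length that large is consistent
with the unmarshaling code giving up with SequenceIsTooLong.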

This did not happen before with stringified references. My primary way
of testing the use of CORBA was to use a big-endian machine and a
little-endian machine. Recently, I have not been using that approach,
so I cannot pinpoint exactly when things went wrong. My current guess
is that the switch I made from omniORB 4.0.6 to 4.0.7 is the most
likely cause, but I have not yet tested this theory. I know that I
switched to omniORB 4.0.7 between the last phase of my project and the
current phase and that I was using the big-endian/little-endian
testing model throughout the last phase.

My next step will be to try backing off to omniORB 4.0.6. If that
works, I will probably stick with omniORB 4.0.6.

-Patrick

Everything works very well if I have the client and server running on
the same machine. If I run on two separate machines, however, the
client hangs while trying to narrow the reference to the bootstrap
object that it gets back from CORBA::ORB::string_to_object(). It

Does it hang? From the trace you sent, it looks like it got an
exception.

Well, I say it hangs because the _narrow() call never returns. It just
keeps going through a cycle of sending a message, waiting for a while,
getting the WaitingForReply exception, and then trying again. I should
have been more clear in my message about when the client and server
pause.

omniORB: Client attempt to connect to giop:tcp:192.168.1.199:42000
omniORB: AsyncInvoker: thread id = 1 has started. Total threads = 2
omniORB: giopRendezvouser task execute for giop:tcp:192.168.1.183:37128
omniORB: AsyncInvoker: thread id = 2 has started. Total threads = 2
omniORB: Scavenger task execute.
omniORB: Client opened connection to giop:tcp:192.168.1.199:42000
omniORB: sendChunk: to giop:tcp:192.168.1.199:42000 98 bytes

There is a long pause here, which I take to mean that the client is
waiting on the server.

omniORB: inputMessage: from giop:tcp:192.168.1.199:42000 12 bytes

This message is a CloseConnection message...

omniORB: throw giopStream::CommFailure from giopImpl10.cc:298(1,NO,COMM_FAILURE_WaitingForReply)

...so the client says communication failed.
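
For reference, a 12-byte message is exactly one GIOP header with no
body, which is what a CloseConnection message looks like on the wire.
The following is a rough sketch of decoding such a header. It assumes
the standard GIOP header layout; GiopHeader and parse_giop_header are
hypothetical names for this example and are not omniORB's own types.

```cpp
#include <cassert>   // for the runtime checks mentioned below
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical struct: the fields of the fixed 12-byte GIOP header.
struct GiopHeader {
    std::uint8_t  major;
    std::uint8_t  minor;
    bool          little_endian;  // byte order used by the sender
    std::uint8_t  message_type;   // 0 = Request, 1 = Reply, 5 = CloseConnection
    std::uint32_t message_size;   // body length, excluding the header
};

const std::uint8_t kCloseConnection = 5;

// Hypothetical parser: returns false if the buffer is too short or does
// not start with the "GIOP" magic.
bool parse_giop_header(const std::uint8_t* buf, std::size_t len,
                       GiopHeader& out) {
    if (len < 12 || std::memcmp(buf, "GIOP", 4) != 0) return false;
    out.major = buf[4];
    out.minor = buf[5];
    // GIOP 1.0 stores a byte-order boolean here; later versions use the
    // low bit of a flags octet. Testing bit 0 covers both.
    out.little_endian = (buf[6] & 0x01) != 0;
    out.message_type = buf[7];
    // The size field is encoded in the sender's byte order, so it must be
    // assembled according to the flag just read.
    const std::uint8_t* p = buf + 8;
    if (out.little_endian) {
        out.message_size = std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
                           (std::uint32_t(p[2]) << 16) |
                           (std::uint32_t(p[3]) << 24);
    } else {
        out.message_size = (std::uint32_t(p[0]) << 24) |
                           (std::uint32_t(p[1]) << 16) |
                           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
    }
    return true;
}
```

Feeding it the bytes 'G','I','O','P',1,0,1,5,0,0,0,0 yields a
message_type equal to kCloseConnection and a message_size of 0, so the
whole message is 12 bytes.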

The server reports a communication failure while trying to un-marshal
omniORB: Server accepted connection from giop:tcp:192.168.1.183:37129
omniORB: AsyncInvoker: thread id = 5 has started. Total threads = 3
omniORB: Scavenger task execute.
omniORB: AsyncInvoker: thread id = 6 has started. Total threads = 3
omniORB: giopWorker task execute.
omniORB: Accepted connection from giop:tcp:192.168.1.183:37129 because of this rule: "* bidir,tcp"
omniORB: inputMessage: from giop:tcp:192.168.1.183:37129 98 bytes

There is a long pause here before the next line of output is printed.

omniORB: sendCloseConnection: to giop:tcp:192.168.1.183:37129 12 bytes

The server immediately closed the connection when it got the call. I
can't explain why that happened. Try running with traceLevel 40 to see
the GIOP messages. That might give some idea of what's going on.

[trace output removed]

--
Patrick L. Hartling
http://www.137.org/patrick/