Discussion:
[omniORB] LocateRequest gets dropped on RHEL 6.1 system
Brad Fawcett
2011-09-07 23:26:24 UTC
Hello,

I am seeing a problem running my code using CORBA (omniORB 4.1.4) on
a RHEL 6.1 system.

The problem scenario is this:

1.) invoke the CORBA request
2.) this causes a LOCATE_REQUEST to be sent out
3.) a TCP ACK is sent back
    Note: looking through the CORBA server traces, this LOCATE_REQUEST
    is never received there. However, a suspicious-looking message
    is received in its place.
4.) but no LOCATE_REPLY is sent back
5.) then, after a while (what looks like a 3-minute timeout), the
    LOCATE_REQUEST is sent out again (on a different port)
6.) this time the LOCATE_REPLY is sent back.

======================= relevant TCPdump results ==================
349 18.856754 9.5.167.92 9.5.167.92 GIOP GIOP 1.2 LocateRequest s=26
id=20 op=LocateRequest
0000 00 00 00 00 00 00 00 00 00 00 00 00 08 00 45 00 ........ ......E.
0010 00 5a 68 92 40 00 40 06 71 49 09 05 a7 5c 09 05 ***@.@. qI...\..
0020 a7 5c cf d6 8c d5 3c 86 94 47 3c 63 0b b9 80 18 .\....<. .G<c....
0030 01 09 61 0f 00 00 01 01 08 0a 14 1a e9 34 14 1a ..a..... .....4..
0040 e7 3b 47 49 4f 50 01 02 01 03 1a 00 00 00 14 00 .;GIOP.. ........
0050 00 00 00 00 00 00 0e 00 00 00 fe ea 91 66 4e 00 ........ .....fN.
0060 00 37 53 00 00 00 00 00 .7S.....

350 18.856783 9.5.167.92 9.5.167.92 TCP 36053 > 53206 [ACK] Seq=299
Ack=1469 Win=38144 Len=0 TSV=337307956 TSER=337307956
0000 00 00 00 00 00 00 00 00 00 00 00 00 08 00 45 00 ........ ......E.
0010 00 34 ec c5 40 00 40 06 ed 3b 09 05 a7 5c 09 05 ***@.@. .;...\..
0020 a7 5c 8c d5 cf d6 3c 63 0b b9 3c 86 94 6d 80 10 .\....<c ..<..m..
0030 01 2a a4 76 00 00 01 01 08 0a 14 1a e9 34 14 1a .*.v.... .....4..
0040 e9 34 .4

// The LocateReply is never sent.

****** Also, the suspect GIOP message shown below in the CORBA trace never
appears in the TCP trace log.

============== CORBA server-side daemon trace ===================
Taken around the time that the LocateRequest was expected to be received.

omniORB: 2011-09-06 16:34:52.667839: SocketCollection idle. Sleeping.
omniORB: 2011-09-06 16:34:52.670531: inputMessage: from
giop:tcp:[::ffff:9.5.167.92]:53206 38 bytes
omniORB: 2011-09-06 16:34:52.670551:
4749 4f50 0102 0100 0501 0000 1200 0000 GIOP............
0000 0000 0000 0000 0e00 0000 feea 9166 ...............f
4e00 0037 5300
N..7S.
omniORB: 2011-09-06 16:34:55.919786: Scan for idle connections
(1315344895,919689000)
omniORB: 2011-09-06 16:34:55.919816: Scavenger reduce idle count for strand
0x7f70f80009b0 to 35
omniORB: 2011-09-06 16:34:55.919840: Scavenger reduce idle count for strand
0x7f70f80016b0 to 35

Does this message look suspicious?

This problem is very consistent. Full traces are easily available.

This test runs successfully every time on a RHEL 5.5 system, but fails
every time on a RHEL 6.1 system. (This should be qualified somewhat, in that
the machine type also matters here: it fails every time on a RHEL 6.1 /
HS21 IBM blade system but works every time on a RHEL 6.1 / HS22 IBM
blade system. This behavior has been verified on several machines of the
various types, and it was very consistent.)

Best Regards,

Brad Fawcett
***@us.ibm.com
Duncan Grisby
2011-09-08 21:40:35 UTC
Post by Brad Fawcett
I am seeing a problem running my code using CORBA (omniORB
4.1.4) on a RHEL 6.1 system.
1.) invoke the CORBA request
2.) this causes a LOCATE_REQUEST to be sent out
3.) TCP ACK is sent back
Note: looking through the CORBA server traces, this LOCATE_REQUEST
is never received there. However, a suspicious-looking message
is received in its place.
Can you send complete client and server traces with traceLevel 40,
traceTime 1 and traceThreadId 1? That will help show what omniORB thinks
is happening. Associated tcpdump output would be helpful too. It looks
as though the client and server are on the same machine; is that right?
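
For reference, those options can go in omniORB.cfg (or be passed on the
command line as -ORBtraceLevel 40 and so on), along these lines:

  # omniORB.cfg
  traceLevel    = 40
  traceTime     = 1
  traceThreadId = 1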

Unpicking the message received on the server:

4749 4f50
GIOP

0102
version 1.2

01
little endian

00
Request message

0501 0000
261 octets in message

1200 0000
Request id 18

0000 0000
SyncScope NONE (i.e. oneway call)

0000
KeyAddr

0000
padding

0e00 0000
14 octet object key

feea 9166 4e00 0037 5300
the first 10 octets of the object key

So that looks like the start of a valid oneway request message that
wasn't completed. It's definitely not the LocateRequest sent by the
client.
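
If you want to double-check that reading, here is a small standalone sketch
(not omniORB code; the struct and names are just for illustration) that
decodes the 12-byte GIOP header from the bytes above:

#include <cstdint>
#include <cstdio>
#include <cstring>

// GIOP message header: 4-byte magic, version, flags, type, body size.
// Message types: 0=Request, 1=Reply, 3=LocateRequest, 4=LocateReply.
struct GiopHeader {
  char     magic[4];
  uint8_t  major_, minor_;
  uint8_t  flags;     // bit 0 set => little-endian encoding
  uint8_t  msgType;
  uint32_t msgSize;   // size of the body that follows the 12-byte header
};

static GiopHeader parseGiopHeader(const uint8_t* b)
{
  GiopHeader h;
  memcpy(h.magic, b, 4);
  h.major_  = b[4];
  h.minor_  = b[5];
  h.flags   = b[6];
  h.msgType = b[7];
  if (h.flags & 1)   // little-endian body size
    h.msgSize = b[8] | (b[9] << 8) | (b[10] << 16) | ((uint32_t)b[11] << 24);
  else               // big-endian body size
    h.msgSize = b[11] | (b[10] << 8) | (b[9] << 16) | ((uint32_t)b[8] << 24);
  return h;
}

int main()
{
  // First 12 bytes of the message the server logged.
  const uint8_t suspect[12] = { 0x47,0x49,0x4f,0x50, 0x01,0x02, 0x01, 0x00,
                                0x05,0x01,0x00,0x00 };
  GiopHeader h = parseGiopHeader(suspect);
  printf("GIOP %d.%d  type=%d  size=%u\n", h.major_, h.minor_, h.msgType, h.msgSize);
  // Prints "GIOP 1.2  type=0  size=261": a Request header claiming a
  // 261-byte body, even though only 38 bytes arrived on the wire.
  return 0;
}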

Cheers,

Duncan.
--
-- Duncan Grisby --
-- ***@grisby.org --
-- http://www.grisby.org --
Brad Fawcett
2011-09-09 03:34:35 UTC
Skipped content of type multipart/alternative
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corbaTcpTraces.zip
Type: application/zip
Size: 14361 bytes
Desc: not available
Url : http://www.omniorb-support.com/pipermail/omniorb-list/attachments/20110908/efc70352/corbaTcpTraces-0001.zip
Brad Fawcett
2011-09-09 20:59:04 UTC
Ran into some technical difficulties with posting to the mailing list.
First, the attached logs were too big and weren't accepted, and then, for some
reason, even with smaller logs, it added the attachment but lost the text
that went with the logs.

So, here is the text to go with the previously posted logs
(corbaTcpTraces-0001.zip).

-----------------------------------------------------------------------

Yes, the client and server are on the same machine.
Attached are the client and server traces with traceLevel 40, traceTime 1
and traceThreadId 1, and the tcpdump as well.

Here are the interesting points in the traces:

client log (line 177 - time = 13:36:43.433568): the LocateRequest is sent

tcp trace (line 18 - time = 13:36:43.433590): the LocateRequest packet appears

server log (line 180 - time = 13:36:43.433633): an input message is
received // <<< this is the suspect packet, which I think started out
as the LocateRequest. If I'm reading it right, the GIOP internal message
size (261 bytes) is larger than the TCP packet size (38 bytes).

The original logs were too big, so they had to be truncated; the full logs
are available upon request.

Thanks,

Brad Fawcett
Duncan Grisby
2011-09-09 22:12:31 UTC
Post by Brad Fawcett
Yes, the client and server are on the same machine.
Attached are the client and server traces with traceLevel 40, traceTime 1
and traceThreadId 1, and the tcpdump as well.
The mailing list ate your big attachments, but you CCed me so I got
them...
Post by Brad Fawcett
server log (line 577 - time = 13:36:43.433633): an input message is
received // <<< this is the suspect packet, which I think started
out as the LocateRequest. If I'm reading it right, the GIOP internal
message size (261 bytes) is larger than the TCP packet size (38 bytes).
Something very odd is going on. The client sends this LocateRequest:

omniORB: (5) 2011-09-08 13:36:43.433568: sendChunk: to giop:tcp:10.6.36.36:59790 38 bytes
omniORB: (5) 2011-09-08 13:36:43.433575:
4749 4f50 0102 0103 1a00 0000 1400 0000 GIOP............
0000 0000 0e00 0000 fe19 fd68 4e00 0061 ...........hN..a
da00 0000 0000 ......

but the server sees this

omniORB: (6) 2011-09-08 13:36:43.433633: inputMessage: from giop:tcp:[::ffff:10.6.36.36]:33014 38 bytes
omniORB: (6) 2011-09-08 13:36:43.433657:
4749 4f50 0102 0100 0501 0000 1200 0000 GIOP............
0000 0000 0000 0000 0e00 0000 fe19 fd68 ...............h
4e00 0061 da00 N..a..

As I mentioned before, that's the start of a valid oneway request
message, rather than the LocateRequest. It's got the right length
though. Where did the data come from? Well, if we look at the previous
message the server got on that connection, we see...

omniORB: (6) 2011-09-08 13:36:42.792230: inputMessage: from giop:tcp:[::ffff:10.6.36.36]:33014 273 bytes
omniORB: (6) 2011-09-08 13:36:42.792246:
4749 4f50 0102 0100 0501 0000 1200 0000 GIOP............
0000 0000 0000 0000 0e00 0000 fe19 fd68 ...............h
4e00 0061 da00 0000 0000 0000 0d00 0000 N..a............
6c6f 675f 6163 7469 7669 7479 0000 0000 log_activity....
0000 0000 3130 2e36 8100 0000 3030 3732 ....10.6....0072
3033 6434 6236 3234 6261 3061 3731 3862 03d4b624ba0a718b
6364 3232 3535 3935 3634 6436 3932 3934 cd22559564d69294
3234 3737 3637 3530 3833 3139 3661 3563 2477675083196a5c
3938 6536 3031 3333 3336 3865 3635 3461 98e60133368e654a
3639 3464 3864 6135 3333 3431 3664 3064 694d8da533416d0d
3838 3839 3763 3637 3734 3037 6334 3462 88897c677407c44b
3333 3734 3864 6430 3931 3761 3139 3164 33748dd0917a191d
3733 3064 6431 6363 3963 3763 0064 3064 730dd1cc9c7c.d0d
0500 0000 6a6f 6273 0034 3037 1300 0000 ....jobs.407....
7265 6769 7374 6572 4a6f 6257 6974 6841 registerJobWithA
4153 0064 1900 0000 6a6f 6220 3020 6973 AS.d....job 0 is
2062 6569 6e67 2073 7562 6d69 7474 6564 being submitted
00 .

That's a valid and complete oneway request. Its first 38 bytes are the
data that was seen in the next message.

Somehow, when omniORB tries to receive a message, it gets the correct
length for the message, but has the old data. Now, omniORB gets the data
by calling recv() on the socket, giving it a buffer. The buffer is
reused between messages, so it contains the old data at the time recv()
is called. recv() returns the correct message length, but either it
fills the buffer with the old data, or it doesn't fill the buffer at
all, leaving the old data. Either way, it looks to me like an OS /
platform bug rather than an omniORB bug.

It might be illuminating to edit src/lib/omniORB/orbcore/giopStream.cc
in inputMessage so it uses memset to clear the data just before calling
Recv. That will show if recv is returning old data or not returning data
at all.
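
Schematically, the change is just something like the following standalone
sketch (not the actual giopStream code; sock, buf and bufsize are placeholder
names, and the real code calls the connection's Recv() rather than recv()
directly):

#include <cstdio>
#include <cstring>
#include <sys/socket.h>
#include <sys/types.h>

// Diagnostic sketch: wipe the reused buffer, do the receive, then report
// if the call claimed to return bytes without actually writing any data.
ssize_t checkedRecv(int sock, char* buf, size_t bufsize)
{
  memset(buf, 0, bufsize);                  // clear stale data from the previous message
  ssize_t n = recv(sock, buf, bufsize, 0);

  if (n > 0) {
    bool allZero = true;
    for (ssize_t i = 0; i < n; ++i)
      if (buf[i] != 0) { allZero = false; break; }
    if (allZero)
      fprintf(stderr, "recv() reported %zd bytes but wrote nothing "
                      "(a GIOP message always starts with \"GIOP\")\n", n);
  }
  return n;
}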

Cheers,

Duncan.
--
-- Duncan Grisby --
-- ***@grisby.org --
-- http://www.grisby.org --
Brad Fawcett
2011-09-24 03:08:17 UTC
Hi,
Post by Duncan Grisby
Somehow, when omniORB tries to receive a message, it gets the correct
length for the message, but has the old data. Now, omniORB gets the data
by calling recv() on the socket, giving it a buffer. The buffer is
reused between messages, so it contains the old data at the time recv()
is called. recv() returns the correct message length, but either it
fills the buffer with the old data, or it doesn't fill the buffer at
all, leaving the old data. Either way, it looks to me like an OS /
platform bug rather than an omniORB bug.
It might be illuminating to edit src/lib/omniORB/orbcore/giopStream.cc
in inputMessage so it uses memset to clear the data just before calling
Recv. That will show if recv is returning old data or not returning data
at all.
I inserted a memset to 0x00 right before the tcp::recv call. For the bad call,
the buffer remains all zeros, so it looks like no data is being transferred.

Yes, it does look like an OS/platform bug in most ways. I am currently working
through the TCP recv code to understand what conditions would cause it to
transfer 0 bytes while claiming otherwise.
Also, IOATDMA (DMA transfer support for TCP) seems to be involved in this
picture: when it is disabled, the data is transferred correctly; when it is
enabled, the data is not transferred.

But the curious thing is that it is so darn consistent in relation to
the CORBA calls.
It is always on the 23rd CORBA call of this test case, and it is not dependent
upon the timing between CORBA requests. And if it is a pure TCP/OS bug,
wouldn't it be appearing more often in other cases?

The current theory is that there is something special in the way CORBA manages
the TCP connections such that it exposes a bug in the tcp::recv code with the
ioatdma driver installed.

Can you think of anything on the CORBA side of things that might
interrupt a tcp::recv before it completes successfully? Does it do many
peeks? Or selects? Or something like that? Are there any trace points
you would suggest inserting to see what is happening during this time?

Thanks,

Brad.
Duncan Grisby
2011-09-29 17:56:28 UTC
On Fri, 2011-09-23 at 16:07 -0500, Brad Fawcett wrote:

[...]
Post by Brad Fawcett
I inserted a memset to 0x00 right before the tcp::recv call. For the bad
call, the buffer remains
all zeros, so it looks like no data is being transferred.
That seemed like the most likely situation.

[...]
Post by Brad Fawcett
But the curious thing is that it is so darn consistent in relation
to the CORBA calls.
It is always on the 23rd CORBA call of this test case, and it is not
dependent upon the timing between CORBA requests. And if it is a pure
TCP/OS bug, wouldn't it be appearing more often in other cases?
It is odd that it is so consistent. The only thing I can think of is
that it's something to do with the volume of data transferred up to that
point, but adding up the number of bytes in each message, none of the
numbers look like obvious boundaries.
Post by Brad Fawcett
Can you think of anything on the CORBA side of things that might
interrupt a tcp::recv before it completes
successfully? Does it do many peeks? Or selects? Or something like
that? Are there any trace points you would suggest inserting to see
what is happening during this time?
The only other thing that might be going on at the time is a "select"
thread that watches connections for activity. That could potentially be
doing poll() on the same socket as the thread doing recv(). You could
add a log message before and after the call to poll() at line 507 in
src/lib/omniORB/orbcore/SocketCollection.cc. If that indicates that a
poll() is happening at the same time as the recv(), that might be it. On
the other hand, I wouldn't expect it to be consistent enough in its
timing that it always clobbered the 23rd call if it was something to do
with that.
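
As a rough illustration (placeholder names, not the actual SocketCollection
code), the instrumentation could look something like this wrapped around the
existing poll() call:

#include <cstdio>
#include <ctime>
#include <poll.h>

// Timestamped logging before and after poll(), so the output can be lined up
// against the recv() that loses its data. fds, nfds and timeout stand in for
// whatever the real code at SocketCollection.cc line 507 passes to poll().
static int loggedPoll(struct pollfd* fds, nfds_t nfds, int timeout)
{
  struct timespec ts;

  clock_gettime(CLOCK_REALTIME, &ts);
  fprintf(stderr, "before poll(): %ld.%09ld nfds=%lu\n",
          (long)ts.tv_sec, (long)ts.tv_nsec, (unsigned long)nfds);

  int rc = poll(fds, nfds, timeout);

  clock_gettime(CLOCK_REALTIME, &ts);
  fprintf(stderr, "after  poll(): %ld.%09ld rc=%d\n",
          (long)ts.tv_sec, (long)ts.tv_nsec, rc);

  return rc;
}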

Other than that, omniORB's usage of sockets is really quite simple. I
can't think of anything that would be significantly different from other
applications.

Cheers,

Duncan.
--
-- Duncan Grisby --
-- ***@grisby.org --
-- http://www.grisby.org --