Igor Lautar
2009-08-11 14:19:09 UTC
Hi,
This is a follow up to my previous question on invoke hang in remote call.
(omniORB 4.1.1, VS8, Windows XP x86 client and Windows 2003 AMD64 for server)
It seams we are indeed seeing hang in certain cases. Currently, we cannot
reproduce it every time, but it seams something like this triggers it:
1) invoke remote call and stop call in servant (while (1) { sleep... })
2) do some other work with same ORBs from different threads, repeat this for
few hours
3) kill server
4) client does not detect COMM_FAILURE (or some other exception) for call from
point (1)
We will continue to pinpoint why this would occur, by talking to some other
guys with omniORB experience, it seams it has something to do with surviving
interruptions in TCP or maybe even a feature to support clustered
environments.
Now, the obvious solution is to use timeouts on calls. This is fine, but it
has a nasty sideeffect: transfer is slow as hell (20x decrease in 1 Gbit
network).
This is excerpt from omniORB debugs:
without timeout:
[00852 2009/08/11 09:54:38.374] D 2 omniORB: Invoke 'Send' on remote:
key<cmgrServer>
[00852 2009/08/11 09:54:38.374] D 2 omniORB: sendChunk: to
giop:tcp:10.41.4.49:23244 68 bytes
[00852 2009/08/11 09:54:38.374] D 2 omniORB:
[00852 2009/08/11 09:54:38.374] D 2 4749 4f50 0102 0300 3400 0400 d402 0000
GIOP....4.......
[00852 2009/08/11 09:54:38.374] D 2 0300 0000 0000 0000 0a00 0000 636d 6772
............cmgr
[00852 2009/08/11 09:54:38.374] D 2 5365 7276 6572 0000 0500 0000 5365 6e64
Server......Send
[00852 2009/08/11 09:54:38.374] D 2 0000 7400 0000 0000 0000 0000 8823 814a
..t..........#.J
[00852 2009/08/11 09:54:38.374] D 2 0000 0400
....
[00852 2009/08/11 09:54:38.374] D 2 omniORB: sendCopyChunk: to
giop:tcp:10.41.4.49:23244 262140 bytes
[00852 2009/08/11 09:54:38.374] D 2 omniORB: 128 bytes out of 262140
[00852 2009/08/11 09:54:38.374] D 2 afaf afaf afaf afaf 0000 0000 0000 0001
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0004 0000 0000 0000 0000 0000
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0000 0000 f5f5 f5f5 f5f5 f5f5
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0000 0002 0000 0000 0003 ffd8
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0000 0000 0000 0000 0000 0000
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0000 0001 0000 0000 0000 0002
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0000 0000 0000 0000 0000 0000
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0000 0000 2829 1bba 0c8f ec0b
........()......
[00852 2009/08/11 09:54:38.384] D 2 omniORB: sendChunk: to
giop:tcp:10.41.4.49:23244 20 bytes
[00852 2009/08/11 09:54:38.384] D 2 omniORB:
[00852 2009/08/11 09:54:38.384] D 2 4749 4f50 0102 0107 0800 0000 d402 0000
GIOP............
[00852 2009/08/11 09:54:38.384] D 2 0000 0000
....
[00852 2009/08/11 09:54:38.394] D 2 omniORB: inputMessage: from
giop:tcp:10.41.4.49:23244 24 bytes
[00852 2009/08/11 09:54:38.394] D 2 omniORB:
[00852 2009/08/11 09:54:38.394] D 2 4749 4f50 0102 0101 0c00 0000 d402 0000
GIOP............
[00852 2009/08/11 09:54:38.394] D 2 0000 0000 0000 0000
........
with timeout:
[00176 2009/08/11 09:53:33.268] D 2 omniORB: Invoke 'Send' on remote:
key<cmgrServer>
[00176 2009/08/11 09:53:33.268] D 2 omniORB: sendChunk: to
giop:tcp:10.41.4.49:23244 68 bytes
[00176 2009/08/11 09:53:33.268] D 2 omniORB:
[00176 2009/08/11 09:53:33.268] D 2 4749 4f50 0102 0300 3400 0400 c601 0000
GIOP....4.......
[00176 2009/08/11 09:53:33.268] D 2 0300 0000 0000 0000 0a00 0000 636d 6772
............cmgr
[00176 2009/08/11 09:53:33.268] D 2 5365 7276 6572 0000 0500 0000 5365 6e64
Server......Send
[00176 2009/08/11 09:53:33.268] D 2 0000 7400 0000 0000 0000 0000 2623 814a
..t.........&#.J
[00176 2009/08/11 09:53:33.268] D 2 0000 0400
....
[00176 2009/08/11 09:53:33.268] D 2 omniORB: sendCopyChunk: to
giop:tcp:10.41.4.49:23244 262140 bytes
[00176 2009/08/11 09:53:33.268] D 2 omniORB: 128 bytes out of 262140
[00176 2009/08/11 09:53:33.268] D 2 afaf afaf afaf afaf 0000 0000 0000 0001
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0004 0000 0000 0000 0000 0000
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0000 0000 f5f5 f5f5 f5f5 f5f5
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0000 0002 0000 0000 0003 ffd8
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0000 0000 0000 0000 0000 0000
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0000 0001 0000 0000 0000 0002
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0000 0000 0000 0000 0000 0000
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0000 0000 2829 1bba 0c8f ec0b
........()......
[00176 2009/08/11 09:53:33.278] D 2 omniORB: sendChunk: to
giop:tcp:10.41.4.49:23244 20 bytes
[00176 2009/08/11 09:53:33.278] D 2 omniORB:
[00176 2009/08/11 09:53:33.278] D 2 4749 4f50 0102 0107 0800 0000 c601 0000
GIOP............
[00176 2009/08/11 09:53:33.278] D 2 0000 0000
....
[00176 2009/08/11 09:53:33.469] D 2 omniORB: inputMessage: from
giop:tcp:10.41.4.49:23244 24 bytes
[00176 2009/08/11 09:53:33.469] D 2 omniORB:
[00176 2009/08/11 09:53:33.469] D 2 4749 4f50 0102 0101 0c00 0000 c601 0000
GIOP............
[00176 2009/08/11 09:53:33.469] D 2 0000 0000 0000 0000
........
Notice the difference between time taken to do sendChunk (size 20) and
inputMessage (size 24).
In case where there is no timeout set, time is 10 milliseconds, and where
there is a timeout, time is 191 milliseconds.
Now, I plan to dive into it to figure out where this is coming from. I'm
wondering if you have any ideas to try next (maybe some config option I'm not
aware of?), or maybe just pointer where should I start looking in code to
speed things up.
Thank you for comments/suggestions.
Igor
This is a follow up to my previous question on invoke hang in remote call.
(omniORB 4.1.1, VS8, Windows XP x86 client and Windows 2003 AMD64 for server)
It seams we are indeed seeing hang in certain cases. Currently, we cannot
reproduce it every time, but it seams something like this triggers it:
1) invoke remote call and stop call in servant (while (1) { sleep... })
2) do some other work with same ORBs from different threads, repeat this for
few hours
3) kill server
4) client does not detect COMM_FAILURE (or some other exception) for call from
point (1)
We will continue to pinpoint why this would occur, by talking to some other
guys with omniORB experience, it seams it has something to do with surviving
interruptions in TCP or maybe even a feature to support clustered
environments.
Now, the obvious solution is to use timeouts on calls. This is fine, but it
has a nasty sideeffect: transfer is slow as hell (20x decrease in 1 Gbit
network).
This is excerpt from omniORB debugs:
without timeout:
[00852 2009/08/11 09:54:38.374] D 2 omniORB: Invoke 'Send' on remote:
key<cmgrServer>
[00852 2009/08/11 09:54:38.374] D 2 omniORB: sendChunk: to
giop:tcp:10.41.4.49:23244 68 bytes
[00852 2009/08/11 09:54:38.374] D 2 omniORB:
[00852 2009/08/11 09:54:38.374] D 2 4749 4f50 0102 0300 3400 0400 d402 0000
GIOP....4.......
[00852 2009/08/11 09:54:38.374] D 2 0300 0000 0000 0000 0a00 0000 636d 6772
............cmgr
[00852 2009/08/11 09:54:38.374] D 2 5365 7276 6572 0000 0500 0000 5365 6e64
Server......Send
[00852 2009/08/11 09:54:38.374] D 2 0000 7400 0000 0000 0000 0000 8823 814a
..t..........#.J
[00852 2009/08/11 09:54:38.374] D 2 0000 0400
....
[00852 2009/08/11 09:54:38.374] D 2 omniORB: sendCopyChunk: to
giop:tcp:10.41.4.49:23244 262140 bytes
[00852 2009/08/11 09:54:38.374] D 2 omniORB: 128 bytes out of 262140
[00852 2009/08/11 09:54:38.374] D 2 afaf afaf afaf afaf 0000 0000 0000 0001
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0004 0000 0000 0000 0000 0000
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0000 0000 f5f5 f5f5 f5f5 f5f5
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0000 0002 0000 0000 0003 ffd8
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0000 0000 0000 0000 0000 0000
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0000 0001 0000 0000 0000 0002
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0000 0000 0000 0000 0000 0000
................
[00852 2009/08/11 09:54:38.374] D 2 0000 0000 0000 0000 2829 1bba 0c8f ec0b
........()......
[00852 2009/08/11 09:54:38.384] D 2 omniORB: sendChunk: to
giop:tcp:10.41.4.49:23244 20 bytes
[00852 2009/08/11 09:54:38.384] D 2 omniORB:
[00852 2009/08/11 09:54:38.384] D 2 4749 4f50 0102 0107 0800 0000 d402 0000
GIOP............
[00852 2009/08/11 09:54:38.384] D 2 0000 0000
....
[00852 2009/08/11 09:54:38.394] D 2 omniORB: inputMessage: from
giop:tcp:10.41.4.49:23244 24 bytes
[00852 2009/08/11 09:54:38.394] D 2 omniORB:
[00852 2009/08/11 09:54:38.394] D 2 4749 4f50 0102 0101 0c00 0000 d402 0000
GIOP............
[00852 2009/08/11 09:54:38.394] D 2 0000 0000 0000 0000
........
with timeout:
[00176 2009/08/11 09:53:33.268] D 2 omniORB: Invoke 'Send' on remote:
key<cmgrServer>
[00176 2009/08/11 09:53:33.268] D 2 omniORB: sendChunk: to
giop:tcp:10.41.4.49:23244 68 bytes
[00176 2009/08/11 09:53:33.268] D 2 omniORB:
[00176 2009/08/11 09:53:33.268] D 2 4749 4f50 0102 0300 3400 0400 c601 0000
GIOP....4.......
[00176 2009/08/11 09:53:33.268] D 2 0300 0000 0000 0000 0a00 0000 636d 6772
............cmgr
[00176 2009/08/11 09:53:33.268] D 2 5365 7276 6572 0000 0500 0000 5365 6e64
Server......Send
[00176 2009/08/11 09:53:33.268] D 2 0000 7400 0000 0000 0000 0000 2623 814a
..t.........&#.J
[00176 2009/08/11 09:53:33.268] D 2 0000 0400
....
[00176 2009/08/11 09:53:33.268] D 2 omniORB: sendCopyChunk: to
giop:tcp:10.41.4.49:23244 262140 bytes
[00176 2009/08/11 09:53:33.268] D 2 omniORB: 128 bytes out of 262140
[00176 2009/08/11 09:53:33.268] D 2 afaf afaf afaf afaf 0000 0000 0000 0001
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0004 0000 0000 0000 0000 0000
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0000 0000 f5f5 f5f5 f5f5 f5f5
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0000 0002 0000 0000 0003 ffd8
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0000 0000 0000 0000 0000 0000
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0000 0001 0000 0000 0000 0002
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0000 0000 0000 0000 0000 0000
................
[00176 2009/08/11 09:53:33.268] D 2 0000 0000 0000 0000 2829 1bba 0c8f ec0b
........()......
[00176 2009/08/11 09:53:33.278] D 2 omniORB: sendChunk: to
giop:tcp:10.41.4.49:23244 20 bytes
[00176 2009/08/11 09:53:33.278] D 2 omniORB:
[00176 2009/08/11 09:53:33.278] D 2 4749 4f50 0102 0107 0800 0000 c601 0000
GIOP............
[00176 2009/08/11 09:53:33.278] D 2 0000 0000
....
[00176 2009/08/11 09:53:33.469] D 2 omniORB: inputMessage: from
giop:tcp:10.41.4.49:23244 24 bytes
[00176 2009/08/11 09:53:33.469] D 2 omniORB:
[00176 2009/08/11 09:53:33.469] D 2 4749 4f50 0102 0101 0c00 0000 c601 0000
GIOP............
[00176 2009/08/11 09:53:33.469] D 2 0000 0000 0000 0000
........
Notice the difference between time taken to do sendChunk (size 20) and
inputMessage (size 24).
In case where there is no timeout set, time is 10 milliseconds, and where
there is a timeout, time is 191 milliseconds.
Now, I plan to dive into it to figure out where this is coming from. I'm
wondering if you have any ideas to try next (maybe some config option I'm not
aware of?), or maybe just pointer where should I start looking in code to
speed things up.
Thank you for comments/suggestions.
Igor