Dietmar May
2006-12-09 02:12:44 UTC
In omniORB 4.0.7 under Windows XP, the code for tcpConnection::Recv() stays
in an infinite loop unless the deadline_secs / deadline_nanosecs parameters
are non-zero. These parameters are apparently set to zero for client calls
to the COS Naming Service, causing the calling client thread to hang.
In this particular case, the Recv() method is being called by
CosNaming::_objref_NamingContext::bind_new_context(), in an attempt to
register an initial object reference with omniNames. Unfortunately,
omniNames is itself hung while replaying the omninames-*.log file during
startup, and fails to respond at all; because of this, every client app
that attempts to contact omniNames also hangs.
Essentially, if zero bytes are received (rx == 0 after the call rx =
select(...), i.e. the poll interval elapses with no data), and if
USE_FAKE_INTERRUPTABLE_RECV is defined (which is apparently the case on
Windows), then the code simply continues and keeps looping until something
IS received.
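To illustrate the pattern (this is a standalone sketch of the behavior I'm
describing, not the verbatim omniORB source; the helper name and the POSIX
calls are mine - the Windows build uses the Winsock equivalents):

  #include <sys/select.h>
  #include <sys/time.h>
  #include <unistd.h>

  // Hypothetical helper mimicking the fake-interruptable-recv pattern:
  // poll with a short select() timeout, and only give up if a non-zero
  // deadline has expired. With deadline_secs == deadline_nanosecs == 0,
  // the rx == 0 branch just loops forever.
  ssize_t recv_with_deadline(int sock, char* buf, size_t len,
                             unsigned long deadline_secs,
                             unsigned long deadline_nanosecs)
  {
    while (1) {
      fd_set rfds;
      FD_ZERO(&rfds);
      FD_SET(sock, &rfds);

      timeval t = { 0, 100000 };                  // 100ms poll interval
      int rx = select(sock + 1, &rfds, 0, 0, &t);

      if (rx == 0) {                              // interval elapsed, no data
        if (!deadline_secs && !deadline_nanosecs)
          continue;                               // zero deadline: loop forever

        timeval now;
        gettimeofday(&now, 0);
        if ((unsigned long)now.tv_sec > deadline_secs ||
            ((unsigned long)now.tv_sec == deadline_secs &&
             (unsigned long)now.tv_usec * 1000 >= deadline_nanosecs))
          return -1;                              // deadline expired: error out
        continue;
      }
      if (rx < 0)
        return -1;                                // select() failed
      return read(sock, buf, len);                // data is ready
    }
  }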
omniNames IS running (it was started before the hanging client app), has
been properly initialized (using -start), and has been demonstrated to work
correctly after the -start invocation.

However, on numerous occasions (stretching back at least to omniORB 2.8.0,
and maybe even 2.6.1), I've seen omniNames get confused and cause problems
for client apps; deleting the omninames*.log file and restarting omniNames
resolves the issue. It appears to be caused by applications failing to
deregister their initial object reference with omniNames (which happens a
lot when debugging an application, and is less likely in production use),
and omniNames then attempting to recreate the initref. In any case, what's
happening now is that omniNames is hanging: I've debugged into omniNames
and verified that it is stuck in an infinite loop, and attempts to contact
it using non-omniORB mechanisms, such as IIOP_NET, also fail - or more
properly, also hang. Terminating omniNames at least causes a socket
shutdown, so the hung clients are able to continue and throw an appropriate
exception.
Maybe the issue isn't with the tcpConnection::Recv() method; maybe it's
with the GIOP_C object, which has pd_deadline_secs == pd_deadline_nanosecs
== 0.
The call stack is:
omni::tcpConnection::Recv()
omni::giopStream::inputMessage()
omni::giopImpl10::inputMessageBegin()
omni::GIOP_C::ReceiveReply()
omniRemoteIdentity::dispatch()
omniObjRef::_invoke()
CosNaming::_objref_NamingContext::bind_new_context()
cos_naming_register() //based on omniORB examples
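(For reference, cos_naming_register() is my own helper, modeled on the
naming-service code in the omniORB examples; roughly the following, with
the context/object names as placeholders:)

  #include <omniORB4/CORBA.h>
  // plus the CosNaming stub header shipped with omniORB
  // (its location varies between installations)

  static CORBA::Boolean
  cos_naming_register(CORBA::ORB_ptr orb, CORBA::Object_ptr objref,
                      const char* ctx_id, const char* obj_id)
  {
    try {
      // Obtain the root naming context from omniNames.
      CORBA::Object_var o = orb->resolve_initial_references("NameService");
      CosNaming::NamingContext_var root =
        CosNaming::NamingContext::_narrow(o);
      if (CORBA::is_nil(root)) return 0;

      CosNaming::Name ctx;
      ctx.length(1);
      ctx[0].id   = (const char*) ctx_id;
      ctx[0].kind = (const char*) "Context";

      CosNaming::NamingContext_var nc;
      try {
        nc = root->bind_new_context(ctx);    // <-- the call that hangs
      }
      catch (CosNaming::NamingContext::AlreadyBound&) {
        CORBA::Object_var tmp = root->resolve(ctx);
        nc = CosNaming::NamingContext::_narrow(tmp);
      }

      CosNaming::Name name;
      name.length(1);
      name[0].id   = (const char*) obj_id;
      name[0].kind = (const char*) "Object";
      nc->rebind(name, objref);              // (re)bind the object itself
      return 1;
    }
    catch (CORBA::SystemException&) {
      return 0;
    }
  }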
Clearly, it's not a good thing for omniNames to hang. But it's also not a
good thing for a client program to hang while trying to access omniNames.
There should be some kind of timeout that kicks in, so that an error can be
properly reported and handled.
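(For what it's worth, a client-side ceiling on call time would at least
turn the hang into an exception. If I'm reading the omniORB 4 API
correctly, something like the following should do it - though I haven't
verified this against 4.0.7:)

  #include <omniORB4/CORBA.h>

  int main(int argc, char** argv)
  {
    CORBA::ORB_var orb = CORBA::ORB_init(argc, argv);

    // Bound every client call (including the naming-service traffic) to
    // 10 seconds; a wedged omniNames should then surface as a
    // CORBA::TRANSIENT instead of an indefinite hang. The config
    // parameter clientCallTimeOutPeriod should have the same effect.
    omniORB::setClientCallTimeout(10000);    // milliseconds

    // ... resolve NameService, bind_new_context(), etc. ...

    orb->destroy();
    return 0;
  }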
Meanwhile, I've built omniNames with debugging information, and have
verified that it is hanging on the first "bind" statement in the
omninames.log file. The call stack is:
omniObjRef::_invoke()
CosNaming::_objref_NamingContext::rebind_context(name, nc2);
omniNameslog::getBind(initf)
omniNameslog::init()
Perhaps significantly, inside _invoke() the code jumps past all of the
timeout initializers and issues the dispatch() without (apparently) any
timeout condition. Within dispatch(), GIOP_C::ReceiveReply() is called
(perhaps indirectly). This makes a call to impl()->inputMessageBegin(), and
from there control transfers to tcpEndpoint::AcceptAndMonitor(). At this
point, the code simply loops forever, waiting for a response that will
never come:
  while (1) {
    pd_new_conn_socket = RC_INVALID_SOCKET;
    if (!Select()) break;  //!!! Select always returns 1, because do_select() returns 0
    if (pd_new_conn_socket != RC_INVALID_SOCKET) {
      return new tcpConnection(pd_new_conn_socket, this);
    }
    if (pd_poked)
      return 0;
  }
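A loop of this shape has no time-based exit at all; only a new connection
or pd_poked ever gets out of it. Purely as a sketch (the deadline plumbing
below doesn't exist in 4.0.7; I'm only borrowing omni_thread::get_time() to
read the clock), something like this would give it one:

  while (1) {
    pd_new_conn_socket = RC_INVALID_SOCKET;

    // Hypothetical: bail out once the caller's absolute deadline passes.
    if (deadline_secs || deadline_nanosecs) {
      unsigned long now_s, now_ns;
      omni_thread::get_time(&now_s, &now_ns);
      if (now_s > deadline_secs ||
          (now_s == deadline_secs && now_ns >= deadline_nanosecs))
        return 0;                    // caller sees a timeout, not a hang
    }

    if (!Select()) break;
    if (pd_new_conn_socket != RC_INVALID_SOCKET) {
      return new tcpConnection(pd_new_conn_socket, this);
    }
    if (pd_poked)
      return 0;
  }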
The omninames.log file contains the following text. omniNames never gets
past the first "bind" to process the remaining entries.
port 2809
create 4e616d6553657276696365
create e6ef784501000c5000000000
bind
IOR:010000002b00000049444c3a6f6d672e6f72672f436f734e616d696e672f4e616d696e67
436f6e746578744578743a312e300000010000000000000060000000010102000e0000003139
322e3136382e332e32303000f90a0b0000004e616d6553657276696365000200000000000000
080000000100000000545441010000001c000000010000000100010001000000010001050901
01000100000009010100 DMIS DMIS\ Context ncontext
IOR:010000002b00000049444c3a6f6d672e6f72672f436f734e616d696e672f4e616d696e67
436f6e746578744578743a312e300000010000000000000064000000010102000e0000003139
322e3136382e332e32303000f90a0e000000ff00e6ef784501000c5000000000000002000000
00000000080000000100000000545441010000001c0000000100000001000100010000000100
0105090101000100000009010100
create e6ef784501000c5000000001
bind
IOR:010000002b00000049444c3a6f6d672e6f72672f436f734e616d696e672f4e616d696e67
436f6e746578744578743a312e300000010000000000000064000000010102000e0000003139
322e3136382e332e32303000f90a0e000000ff00e6ef784501000c5000000000000002000000
00000000080000000100000000545441010000001c0000000100000001000100010000000100
0105090101000100000009010100 KANGAROO DMIS\ Context ncontext
IOR:010000002b00000049444c3a6f6d672e6f72672f436f734e616d696e672f4e616d696e67
436f6e746578744578743a312e300000010000000000000064000000010102000e0000003139
322e3136382e332e32303000f90a0e000000ff00e6ef784501000c5000000001000002000000
00000000080000000100000000545441010000001c0000000100000001000100010000000100
0105090101000100000009010100
bind
IOR:010000002b00000049444c3a6f6d672e6f72672f436f734e616d696e672f4e616d696e67
436f6e746578744578743a312e300000010000000000000064000000010102000e0000003139
322e3136382e332e32303000f90a0e000000ff00e6ef784501000c5000000001000002000000
00000000080000000100000000545441010000001c0000000100000001000100010000000100
0105090101000100000009010100 Implementation Object nobject
IOR:010000001b00000049444c3a446d69732f4f626a656374466163746f72793a312e300000
010000000000000064000000010102000e0000003139322e3136382e332e323030001c080e00
0000fe0ff078450000056c000000000100000200000000000000080000000100000000545441
010000001c00000001000000010001000100000001000105090101000100000009010100
unbind
IOR:010000002b00000049444c3a6f6d672e6f72672f436f734e616d696e672f4e616d696e67
436f6e746578744578743a312e300000010000000000000064000000010102000e0000003139
322e3136382e332e32303000f90a0e000000ff00e6ef784501000c5000000001000002000000
00000000080000000100000000545441010000001c0000000100000001000100010000000100
0105090101000100000009010100 Implementation Object
It's easy enough to delete the omninames.log file and -start over (and that
does work); but it seems to me that there's a fundamental, underlying
problem here that should be resolved.
Any ideas?
Dietmar May