Discussion:
[omniORB] SocketCollection bug
Martin Kocian
2010-01-17 02:51:21 UTC
Hi,

When I run omniORB, a thread sometimes hangs. I traced this to a
conflict between one thread doing a blocking read in tcpConnection::Recv
and another thread doing a blocking select() call in the POSIX
implementation of SocketCollection::Select.
If both threads are listening on the same socket and data arrives, only
one of the two threads is woken up. If Recv is woken up, that's fine
because Select has a timeout, but if the Select thread is woken up,
Recv blocks indefinitely.
Please correct me if I'm wrong, but as far as I know BSD sockets do not
allow several threads to make blocking select/read calls on the same
socket at the same time, in which case this is a bug in omniORB. I'm
using omniORB release 4.1.4.
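One way to check this claim against a conforming POSIX stack is the sketch below. It is written in Python only for brevity; it is not omniORB code, and race_select_and_recv is a hypothetical name. One thread blocks in select() and another in recv() on the same socket, then data is sent. On a conforming stack the recv() thread should get the data whichever way the select() thread goes; a stack with the behaviour described above would make recv() time out instead.

```python
import select
import socket
import threading
import time


def race_select_and_recv(wait=0.2, timeout=1.0):
    """Block one thread in select() and another in recv() on the same socket,
    then send data. Returns (select_reported_ready, bytes_received)."""
    reader, writer = socket.socketpair()
    reader.settimeout(timeout)  # bound recv() so a broken stack fails instead of hanging
    outcome = {}

    def selector():
        # May return ready, or time out if recv() drained the socket first.
        ready, _, _ = select.select([reader], [], [], timeout)
        outcome["select"] = bool(ready)

    def receiver():
        try:
            outcome["recv"] = reader.recv(4)
        except socket.timeout:
            outcome["recv"] = b""  # the symptom reported in this thread

    threads = [threading.Thread(target=selector), threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    time.sleep(wait)  # give both threads time to block
    writer.sendall(b"ping")
    for t in threads:
        t.join()
    reader.close()
    writer.close()
    return outcome["select"], outcome["recv"]
```

The select() result is deliberately not checked: whether it returns ready depends on scheduling, and only the recv() outcome matters here.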

Thank you,

Martin
Duncan Grisby
2010-01-19 15:08:45 UTC
What platform are you using?

select() returns if one of the file descriptors in the read set can be
read without blocking. That doesn't change whether another thread doing
recv() is able to actually read the data. It would certainly be bad to
have two threads doing recv() on the same socket, but select() doesn't
consume any data, so it shouldn't prevent recv() from returning. What
diagnosis have you done to show that the problem you describe is
actually what is happening?
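The point that select() only reports readiness can be seen in a few lines. This is a minimal single-threaded Python sketch (the function name is hypothetical) using a socketpair in place of a real GIOP connection: the socket stays readable across repeated select() calls until recv() actually takes the bytes.

```python
import select
import socket


def readiness_is_level_triggered():
    """Show that select() reports readiness without removing any data."""
    reader, writer = socket.socketpair()
    try:
        writer.sendall(b"GIOP")
        first, _, _ = select.select([reader], [], [], 1.0)
        # Still readable: the first select() consumed nothing.
        second, _, _ = select.select([reader], [], [], 1.0)
        data = reader.recv(4)  # all the bytes are still there
        # Now the socket is drained, so a zero-timeout select() sees nothing.
        after, _, _ = select.select([reader], [], [], 0)
        return bool(first), bool(second), data, bool(after)
    finally:
        reader.close()
        writer.close()
```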

Cheers,

Duncan.
--
-- Duncan Grisby --
-- ***@grisby.org --
-- http://www.grisby.org --
Kevin Bailey
2010-01-19 22:47:16 UTC
Post by Duncan Grisby
> It would certainly be bad to
> have two threads doing recv() on the same socket, but select() doesn't
> consume any data, so it shouldn't prevent recv() from returning.

Forgive me for jumping in, but I'd like to make sure I understand
omniORB's thread model. select() doesn't consume data, but won't it
typically be followed by its own recv()? And if the other thread has
already drained the socket, wouldn't that recv() block?
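The hazard raised here is real in general for code that calls select() and then its own recv(): the readiness report can go stale if someone else reads first. The sketch below simulates the "other thread" inline so it stays deterministic. It is Python with a hypothetical function name, and it shows a generic pattern, not how omniORB's connection code is written. Putting the socket in non-blocking mode turns the would-be hang into BlockingIOError.

```python
import select
import socket


def stale_readiness_demo():
    """select() says readable, another reader drains the socket first, and the
    late reader's recv() would then block. Returns (ready, other, late)."""
    reader, writer = socket.socketpair()
    try:
        writer.sendall(b"call")
        ready, _, _ = select.select([reader], [], [], 1.0)  # readiness observed...
        other = reader.recv(4)  # ...but "another thread" consumes the data first
        reader.setblocking(False)  # without this, the next recv() would block
        try:
            late = reader.recv(4)
        except BlockingIOError:
            late = None  # stale readiness detected instead of hanging
        return bool(ready), other, late
    finally:
        reader.close()
        writer.close()
```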
Duncan Grisby
2010-01-19 23:05:48 UTC
That's not what's going on here. The threading and connection management
model is rather complex, due to the semantics of GIOP.

Imagine we're in the default thread-per-connection model (thread pool is
similar, but it's easier to explain in thread-per-connection). When no
call is happening, the connection's thread is blocked in recv() waiting
for an incoming call. Now, a call comes in so the recv() returns the
data, and the thread unmarshals the arguments and processes the up-call.
Once the call returns, the same thread marshals the return values and
goes back to blocking in recv() for the next call. Simple, right?
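The thread-per-connection loop described above can be sketched as follows. This is a simplified Python model, not omniORB's giopServer code: requests and replies are newline-terminated purely for illustration (real GIOP messages carry their size in the message header), and the hypothetical handle argument stands in for unmarshalling, the up-call and marshalling the reply.

```python
import socket
import threading


def connection_thread(sock, handle):
    """Hypothetical dedicated-thread loop: block in recv(), run the up-call,
    send the reply, then go back to recv() for the next call."""
    buf = b""
    while True:
        chunk = sock.recv(4096)  # idle: blocked waiting for the next call
        if not chunk:
            return  # peer closed the connection
        buf += chunk
        while b"\n" in buf:
            request, buf = buf.split(b"\n", 1)
            reply = handle(request)  # unmarshal + up-call + marshal, collapsed
            sock.sendall(reply + b"\n")


def run_demo():
    server, client = socket.socketpair()
    t = threading.Thread(target=connection_thread, args=(server, bytes.upper))
    t.start()
    client.sendall(b"first\nsecond\n")
    replies = b""
    while replies.count(b"\n") < 2:
        replies += client.recv(4096)
    client.close()  # the server's recv() returns b"" and its loop ends
    t.join()
    server.close()
    return replies.split(b"\n")[:2]
```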

Unfortunately for this simple picture, the GIOP specification allows
another concurrent call to be sent on the same connection while the
first call is being processed. So, before the connection's dedicated
thread starts the up-call, it marks the connection as "selectable".
Periodically, a thread looks at all the selectable connections and uses
select (or poll) to watch them. If data arrives while the dedicated
thread is still in its upcall, a new thread from the pool is triggered
to handle the concurrent upcall.
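A rough model of the "selectable" mechanism: the dedicated thread sets a flag before its up-call, and a separate select thread watches only flagged connections for a concurrent request. Everything here (Connection, watch_once, concurrent_call_detected) is a hypothetical Python illustration of the idea, not omniORB's SocketCollection API, and real code would need locking around the flag.

```python
import select
import socket


class Connection:
    """Hypothetical stand-in for a connection: a socket plus the flag the
    dedicated thread sets around its up-call."""

    def __init__(self, sock):
        self.sock = sock
        self.selectable = False


def watch_once(connections, timeout):
    """One pass of a hypothetical select thread: watch only connections marked
    selectable and return those with a concurrent call waiting."""
    watched = [c for c in connections if c.selectable]
    if not watched:
        return []
    ready, _, _ = select.select([c.sock for c in watched], [], [], timeout)
    return [c for c in watched if c.sock in ready]


def concurrent_call_detected():
    server, client = socket.socketpair()
    conn = Connection(server)
    conn.selectable = True  # dedicated thread is about to start an up-call
    client.sendall(b"second call")  # a concurrent request arrives on the connection
    found = watch_once([conn], timeout=1.0)
    # At this point a pool thread would be handed the connection.
    server.close()
    client.close()
    return found == [conn]
```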

When the first call returns, the connection is marked so it is no longer
selectable. However, it is expensive to stop the thread that's blocked
in select immediately, so it is allowed to keep watching the connection
until the next time it rescans the connections. If a new call comes in
while the select thread is still watching, it can wake up both the
select thread and the dedicated thread (which is back to blocking in
recv). However, the select thread can see that the connection is no
longer selectable, so it ignores the condition.
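The rescan behaviour can be modelled the same way. In this hypothetical Python sketch (names are illustrative, not omniORB's), the select thread works from a list captured at its last rescan, so it may still watch a connection that has since been marked non-selectable; it re-checks the flag when select() returns and ignores such connections. Because select() consumed nothing, the dedicated thread's recv() still gets the data, which is what a conforming stack should provide.

```python
import select
import socket


class Connection:
    """Hypothetical connection: a socket plus a 'selectable' flag."""

    def __init__(self, sock):
        self.sock = sock
        self.selectable = True


def select_pass(watched, timeout):
    """Select over a possibly stale list, then re-check each flag.
    Returns (to_dispatch, ignored)."""
    ready, _, _ = select.select([c.sock for c in watched], [], [], timeout)
    dispatch, ignored = [], []
    for c in watched:
        if c.sock in ready:
            (dispatch if c.selectable else ignored).append(c)
    return dispatch, ignored


def stale_watch_demo():
    server, client = socket.socketpair()
    conn = Connection(server)
    watched = [conn]  # snapshot taken while the up-call was running
    conn.selectable = False  # up-call finished; dedicated thread goes back to recv()
    client.sendall(b"next call")
    dispatch, ignored = select_pass(watched, timeout=1.0)
    data = server.recv(64)  # select() consumed nothing, so recv() still gets it
    server.close()
    client.close()
    return len(dispatch), len(ignored), data
```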

You can see all this in action in SocketCollection.cc and giopServer.cc
if you're interested...

Cheers,

Duncan.
Martin Kocian
2010-01-21 06:40:38 UTC
Hi Duncan,

Thank you for your reply. My answers are embedded below.
Post by Duncan Grisby
> What platform are you using?

RTEMS 4.9.2 on a PowerPC 405 in a Xilinx Virtex-4 chip.
Post by Duncan Grisby
> select() returns if one of the file descriptors in the read set can be
> read without blocking. That doesn't change whether another thread doing
> recv() is able to actually read the data. It would certainly be bad to
> have two threads doing recv() on the same socket, but select() doesn't
> consume any data, so it shouldn't prevent recv() from returning. What
> diagnosis have you done to show that the problem you describe is
> actually what is happening?

I made tcpConnection do a select() before every recv() call (rather than
just the first call), so the thread hangs in select() instead of recv().
I then added print statements before and after the blocking select() call
in RTEMS to show which thread calls select() on which socket and which
thread wakes up from it. If I also change the indefinitely blocking
select() call to one with a timeout of a few seconds, I observe that in
the cases where the read hangs, both the read thread and the
SocketCollection thread do a select() on the same socket, and the
SocketCollection thread returns immediately. The read thread's select()
call, on the other hand, times out, but when recv() is then called it
reads the data normally, as if nothing had happened. I then made
SocketCollection use a patched, non-blocking version of select() that
just looks at the socket without signalling this to RTEMS. With this
ad-hoc fix the problem is gone.
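The patch itself isn't shown in the thread. As a rough illustration of the same idea, the hypothetical helper below never blocks inside select(): it polls with a zero timeout and sleeps between polls, trading some latency and CPU for never waiting in the stack's select() path. Whether a plain zero-timeout select() avoids the RTEMS event problem is an assumption; the fix described above worked below the select() API.

```python
import select
import socket
import time


def poll_until_readable(socks, deadline_s, interval_s=0.01):
    """Hypothetical polling watcher: zero-timeout select() calls separated by
    short sleeps, so this thread never blocks inside select()."""
    end = time.monotonic() + deadline_s
    while True:
        ready, _, _ = select.select(socks, [], [], 0)  # look without waiting
        if ready or time.monotonic() >= end:
            return ready
        time.sleep(interval_s)
```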

From the RTEMS code it seems clear that two blocking select() calls on
the same socket won't work, because the first thread to be woken up
clears the RTEMS event that signalled that the socket had data waiting.
So I think the question is whether this is a bug in RTEMS, or whether
multiple blocking selects (or a select plus a recv) on the same socket
are not allowed for BSD sockets, in which case it's a bug in omniORB.
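The behaviour inferred from the RTEMS source can be captured in a toy model (Python, hypothetical names, not RTEMS code): readiness is delivered as a single event per socket, and whichever waiter wakes first clears it. With two waiters and one signal, exactly one succeeds and the other times out, even though the data is still there to read.

```python
import threading


class AutoClearingEvent:
    """Toy model: one 'data arrived' event per socket, cleared by whichever
    waiter wakes first."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = False

    def signal(self):
        with self._cond:
            self._pending = True
            self._cond.notify_all()  # every waiter wakes...

    def wait(self, timeout):
        with self._cond:
            if not self._cond.wait_for(lambda: self._pending, timeout):
                return False  # ...but only one finds the event still set
            self._pending = False  # the first waiter consumes it
            return True


def two_waiters(timeout=0.5):
    ev = AutoClearingEvent()
    results = {}

    def waiter(name):
        results[name] = ev.wait(timeout)

    threads = [threading.Thread(target=waiter, args=(n,)) for n in ("select", "recv")]
    for t in threads:
        t.start()
    ev.signal()
    for t in threads:
        t.join()
    return results
```

A level-triggered check like select() on a conforming stack would instead let both waiters see the socket as readable.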

Thanks,

Martin
| Martin Kocian |
| ***@slac.stanford.edu |
| Stanford Linear Accelerator Center |
| M.S. 98, P.O. Box 20450 |
| Stanford, CA 94309 |
| Tel. (650)926-2887 Fax (650)926-2923 |
Duncan Grisby
2010-01-27 05:53:44 UTC
On Wed, 2010-01-20 at 16:40 -0800, Martin Kocian wrote:

[...]
> From the rtems code it seems clear that doing two blocking selects on
> the same socket won't work because the first thread to be woken up
> clears the rtems event that signalled that the socket had data waiting.
> So I think the question is if this is a bug in rtems or if multiple
> blocking selects (or select plus recv) on the same socket are not
> allowed for BSD sockets in which case it's a bug in omniorb.

I think it's a bug in RTEMS. It might be justifiable for only one to
wake up when there are two select() calls on the same socket, but when
there is a select() and a recv(), it must be wrong for the recv() not to
wake up when a later recv() would successfully return data.

As a work-around, assuming you don't configure your clients to multiplex
calls on a single connection (omniORB's default is not to), you can
prevent omniORB from doing the select() in this case by setting the
maxServerThreadPerConnection parameter to 1. That will mean it doesn't
try to watch for interleaved calls, avoiding the problematic situation
altogether.
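For reference, a minimal sketch of the work-around as a line in the server's omniORB configuration file, assuming the standard omniORB.cfg "name = value" syntax. The same parameter should also be accepted on the server's command line as -ORBmaxServerThreadPerConnection 1; check the documentation for your omniORB release before relying on either form.

```
# omniORB.cfg (server side): with at most one thread per connection,
# omniORB does not watch busy connections for interleaved calls
maxServerThreadPerConnection = 1
```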

Cheers,

Duncan.