[omniORB] Omniorb threads dying without reason

Discussion:

Nicusor Tanase

2006-09-19 00:18:45 UTC

Hi everyone,

We are using Omniorb for high load application. The load on server side
is balanced on two HW nodes holding several application instances, and the
clients connect distinctivelly to one node based on some dispatching rule.

Recently we ran into following problem: sometimes, on server side, ORB
threads
are dying without being respawned again. This affects the client side,
as the whole deployment throughput drops dramatically.

This happens while application is running in thread pool mode, and an
Umarshalling exception occurs:

omniORB: (4) sendChunk: to giop:tcp:10.230.169.177:56894 204 bytes<AA>
omniORB: (4) inputMessage: from giop:tcp:10.230.169.177:56894 460 bytes<AA>
omniORB: (4) Dispatching remote call 'AccountBalanceUpdate5' to: root/my
poa<amServer> (active)<AA>
omniORB: (4) sendChunk: to giop:tcp:10.230.169.177:56894 204 bytes<AA>
omniORB: (4) inputMessage: from giop:tcp:10.230.169.177:56894 460 bytes<AA>
omniORB: (4) Dispatching remote call 'AccountBalanceUpdate5' to: root/my
poa<amServer> (active)<AA>
omniORB: (4) sendChunk: to giop:tcp:10.230.169.177:56894 204 bytes<AA>
omniORB: (4) throw giopStream::CommFailure from
giopStream.cc:834(0,NO,COMM_FAILURE_UnMarshalArguments)<AA>
omniORB: (4) Server connection refcount = 1<AA>
omniORB: (4) AsyncInvoker: thread id = 4 has exited. Total threads = 15<AA>

While checking the Omni code, I discovered that "giopStream.cc:834" line
should never be reached under
any conditions, at least the comments say so.
giopStream.cc:

784
////////////////////////////////////////////////////////////////////////
785 giopStream_Buffer*
786 giopStream::inputMessage() {
............................................
823 while ((buf->last - buf->start) < 12) {
824
825 int rsz = pd_strand->connection->Recv((void*)
826 ((omni::ptr_arith_t)buf+buf->last),
827 (size_t) (buf->end - buf->last),
828 pd_deadline_secs,
829 pd_deadline_nanosecs);
830 if (rsz > 0) {
831 buf->last += rsz;
832 }
833 else {
834 errorOnReceive(rsz,__FILE__,__LINE__,buf,0);
835 // never reaches here.
836 }
837 }

Checking up the call stack, I noticed in invoker.cc code, that any
unhandled exception
is being catched, and a warning is printed.
Moreover, the task wrapper catches any possible exception, and we never
got such warning in logs.
So the possibility of uncaught exception does not exists.

invoker.cc:
136 void real_run() {
137
138 if (omniORB::trace(10)) {
.........................................................
180 pd_pool->pd_lock->unlock();
181 try {
182 pd_task->execute();
183 }
184 catch(...) {
185 omniORB::logs(1, "AsyncInvoker: Warning: unexpected exception "
186 "caught while executing a task.");
187 }

On other hand, in other cases, we get UnmarshallingException, but the
threads continue to work normally.

omniORB: (18) Dispatching remote call 'SubscriptionQuery5' to: root/my
poa<smServer> (active)<AA>
omniORB: (18) sendChunk: to giop:tcp:10.230.169.177:56903 453 bytes<AA>
omniORB: (18) giopWorker task execute.<AA>
omniORB: (18) throw giopStream::CommFailure from
giopStream.cc:834(0,NO,COMM_FAILURE_UnMarshalArguments)<AA>
omniORB: (18) Server connection refcount = 1<AA>
omniORB: (18) Server connection refcount = 0<AA>
omniORB: (18) Server close connection from giop:tcp:10.230.169.177:56930<AA>
omniORB: (18) giopWorker task execute.<AA>

The difference that I could notice is that for second case, worker thread
is being used
for another job. It might be that due to low load, the thread is being
destroyed in the first case.
Also, another difference, is that the reference count towards the server
side object. In second case,
this reaches zero, while in the first case it remains 1.

Bellow is the server side configuration.
ORBthreadPerConnectionPolicy=1; export ORBthreadPerConnectionPolicy
ORBthreadPerConnectionUpperLimit=16; export
ORBthreadPerConnectionUpperLimit
ORBthreadPerConnectionLowerLimit=8; export
ORBthreadPerConnectionLowerLimit
ORBthreadPoolWatchConnection=0; export ORBthreadPoolWatchConnection

Any idea what could cause the thread problem is welcome (including
the explanation that there is no problem :) ).

Thanks in advance,
Nic

Duncan Grisby

2006-09-20 23:45:53 UTC

Permalink

On Monday 18 September, "Nicusor Tanase" wrote:

[...]

Post by Nicusor Tanase
omniORB: (4) throw giopStream::CommFailure from
giopStream.cc:834(0,NO,COMM_FAILURE_UnMarshalArguments)
omniORB: (4) Server connection refcount = 1
omniORB: (4) AsyncInvoker: thread id = 4 has exited. Total threads = 15
While checking the Omni code, I discovered that "giopStream.cc:834"
line should never be reached under any conditions, at least the

[...]

Post by Nicusor Tanase
834 errorOnReceive(rsz,__FILE__,__LINE__,buf,0);
835 // never reaches here.

You are misinterpreting the comment -- the comment means that
errorOnReceive throws an exception, so the flow of control never reaches
the following line. It doesn't mean that the call to errorOnReceive is
never reached.

Anyway, the exception is not a sign of a problem. It simply means that a
network connection was closed, so the server thread that was trying to
read from the connection saw a read failure. The exception is caught
higher up the code, and everything carries on fine.

Post by Nicusor Tanase
On other hand, in other cases, we get UnmarshallingException, but the
threads continue to work normally.
omniORB: (18) Dispatching remote call 'SubscriptionQuery5' to: root/my
poa<smServer> (active)
omniORB: (18) sendChunk: to giop:tcp:10.230.169.177:56903 453 bytes
omniORB: (18) giopWorker task execute.
omniORB: (18) throw giopStream::CommFailure from
giopStream.cc:834(0,NO,COMM_FAILURE_UnMarshalArguments)
omniORB: (18) Server connection refcount = 1
omniORB: (18) Server connection refcount = 0
omniORB: (18) Server close connection from giop:tcp:10.230.169.177:56930
omniORB: (18) giopWorker task execute.
The difference that I could notice is that for second case, worker
thread is being used for another job. It might be that due to low
load, the thread is being destroyed in the first case.

Yes, that's exactly what's happening. Threads are stopped if they're
idle for 10 seconds, so thread 4 didn't have anything to do and exited.

Post by Nicusor Tanase
Also, another difference, is that the reference count towards the
server side object. In second case, this reaches zero, while in the
first case it remains 1.

That just means that another thread was holding on to the connection at
the time the connection was closed. It's not in an of itself a sign of a
problem.

Post by Nicusor Tanase
Any idea what could cause the thread problem is welcome (including
the explanation that there is no problem :) ).

Everything you've posted so far looks completely normal, and is not a
sign of any kind of problem. Whatever is causing the slowdown you're
seeing is something else, not these connection closures.

Cheers,

Duncan.

--
-- Duncan Grisby --
-- ***@grisby.org --
-- http://www.grisby.org --

Nicusor Tanase

2006-09-21 00:07:43 UTC

Permalink

Hi Duncan,

Thanks very much :). Indeed helps.

Nic

Post by Duncan Grisby
[...]

[...]

Post by Nicusor Tanase
834 errorOnReceive(rsz,__FILE__,__LINE__,buf,0);
835 // never reaches here.

You are misinterpreting the comment -- the comment means that
errorOnReceive throws an exception, so the flow of control never reaches
the following line. It doesn't mean that the call to errorOnReceive is
never reached.
Anyway, the exception is not a sign of a problem. It simply means that a
network connection was closed, so the server thread that was trying to
read from the connection saw a read failure. The exception is caught
higher up the code, and everything carries on fine.

Yes, that's exactly what's happening. Threads are stopped if they're
idle for 10 seconds, so thread 4 didn't have anything to do and exited.

Post by Nicusor Tanase
Also, another difference, is that the reference count towards the
server side object. In second case, this reaches zero, while in the
first case it remains 1.

That just means that another thread was holding on to the connection at
the time the connection was closed. It's not in an of itself a sign of a
problem.

Post by Nicusor Tanase
Any idea what could cause the thread problem is welcome (including
the explanation that there is no problem :) ).