Nicusor Tanase
2006-09-19 00:18:45 UTC
Hi everyone,
We are using omniORB for a high-load application. The load on the server side
is balanced across two HW nodes holding several application instances, and
each client connects to a single node based on a dispatching rule.
Recently we ran into the following problem: sometimes, on the server side, ORB
threads die and are never respawned. This affects the client side, as the
throughput of the whole deployment drops dramatically.
This happens while the application is running in thread pool mode, and an
unmarshalling exception occurs:
omniORB: (4) sendChunk: to giop:tcp:10.230.169.177:56894 204 bytes
omniORB: (4) inputMessage: from giop:tcp:10.230.169.177:56894 460 bytes
omniORB: (4) Dispatching remote call 'AccountBalanceUpdate5' to: root/my poa<amServer> (active)
omniORB: (4) sendChunk: to giop:tcp:10.230.169.177:56894 204 bytes
omniORB: (4) inputMessage: from giop:tcp:10.230.169.177:56894 460 bytes
omniORB: (4) Dispatching remote call 'AccountBalanceUpdate5' to: root/my poa<amServer> (active)
omniORB: (4) sendChunk: to giop:tcp:10.230.169.177:56894 204 bytes
omniORB: (4) throw giopStream::CommFailure from giopStream.cc:834(0,NO,COMM_FAILURE_UnMarshalArguments)
omniORB: (4) Server connection refcount = 1
omniORB: (4) AsyncInvoker: thread id = 4 has exited. Total threads = 15
While checking the omniORB code, I discovered that line giopStream.cc:834
should never be reached under any conditions, at least that is what the
comments say.
giopStream.cc:
784 ////////////////////////////////////////////////////////////////////////
785 giopStream_Buffer*
786 giopStream::inputMessage() {
    ............................................
823   while ((buf->last - buf->start) < 12) {
824
825     int rsz = pd_strand->connection->Recv((void*)
826                          ((omni::ptr_arith_t)buf+buf->last),
827                          (size_t) (buf->end - buf->last),
828                          pd_deadline_secs,
829                          pd_deadline_nanosecs);
830     if (rsz > 0) {
831       buf->last += rsz;
832     }
833     else {
834       errorOnReceive(rsz,__FILE__,__LINE__,buf,0);
835       // never reaches here.
836     }
837   }
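For what it is worth, my reading is that the comment only holds because
errorOnReceive() is expected never to return normally; the CommFailure in the
trace would then simply be the exception it raises when Recv() reports a closed
or broken connection. Below is a minimal, self-contained sketch of that control
flow (hypothetical names, not the real omniORB internals), just to illustrate
the reasoning:

// Sketch only, NOT the real omniORB source: it illustrates why the comment
// can say "never reaches here" -- the assumption is that the error path
// always ends by throwing, which is what the trace above shows as
// "throw giopStream::CommFailure from giopStream.cc:834".
#include <iostream>
#include <stdexcept>
#include <string>

struct CommFailure : std::runtime_error {            // stand-in for giopStream::CommFailure
  explicit CommFailure(const std::string& where) : std::runtime_error(where) {}
};

// Hypothetical equivalent of errorOnReceive(): it never returns normally.
void errorOnReceive(int rsz, const char* file, int line) {
  (void)rsz;
  throw CommFailure(std::string(file) + ":" + std::to_string(line));
  // never reaches here (same situation as the comment in giopStream.cc)
}

int main() {
  try {
    int rsz = 0;                                     // pretend Recv() returned 0 (peer closed)
    if (rsz > 0) {
      // would advance buf->last here
    }
    else {
      errorOnReceive(rsz, __FILE__, __LINE__);       // always leaves via an exception
    }
  }
  catch (const CommFailure& e) {
    std::cout << "throw CommFailure from " << e.what() << std::endl;
  }
  return 0;
}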
Walking up the call stack, I noticed in the invoker.cc code that any unhandled
exception is caught and a warning is printed.
Moreover, the task wrapper catches every possible exception, and we never saw
such a warning in the logs. So an uncaught exception can be ruled out.
invoker.cc:
136 void real_run() {
137
138 if (omniORB::trace(10)) {
.........................................................
180 pd_pool->pd_lock->unlock();
181 try {
182 pd_task->execute();
183 }
184 catch(...) {
185 omniORB::logs(1, "AsyncInvoker: Warning: unexpected exception "
186 "caught while executing a task.");
187 }
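To convince myself of that, I reproduced the wrapper logic in isolation (with a
made-up Task type, not the real omniORB classes): anything escaping execute()
should land in the catch(...) and print the warning, so the missing warning
suggests the CommFailure is handled somewhere below this frame rather than
leaking out of the task.

// Standalone sketch of the try/catch above, with hypothetical types: any
// exception escaping execute() is swallowed by catch(...) and turned into the
// warning, and the calling thread keeps running.
#include <iostream>
#include <stdexcept>

struct Task {                                        // stand-in for the omniORB task interface
  virtual void execute() = 0;
  virtual ~Task() {}
};

struct ThrowingTask : Task {
  void execute() { throw std::runtime_error("simulated CommFailure"); }
};

void real_run(Task* pd_task) {                       // mirrors the quoted invoker.cc 181-187
  try {
    pd_task->execute();
  }
  catch(...) {
    std::cout << "AsyncInvoker: Warning: unexpected exception "
                 "caught while executing a task." << std::endl;
  }
}

int main() {
  ThrowingTask t;
  real_run(&t);                                      // prints the warning; the thread survives
  return 0;
}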
On the other hand, in other cases we get the same unmarshalling exception, but
the threads continue to work normally.
omniORB: (18) Dispatching remote call 'SubscriptionQuery5' to: root/my poa<smServer> (active)
omniORB: (18) sendChunk: to giop:tcp:10.230.169.177:56903 453 bytes
omniORB: (18) giopWorker task execute.
omniORB: (18) throw giopStream::CommFailure from giopStream.cc:834(0,NO,COMM_FAILURE_UnMarshalArguments)
omniORB: (18) Server connection refcount = 1
omniORB: (18) Server connection refcount = 0
omniORB: (18) Server close connection from giop:tcp:10.230.169.177:56930
omniORB: (18) giopWorker task execute.
The difference I could notice is that in the second case the worker thread is
reused for another job; it might be that, due to low load, the thread is
destroyed in the first case.
Another difference is the reference count reported on the server side: in the
second case it reaches zero, while in the first case it remains 1.
Below is the server-side configuration.
ORBthreadPerConnectionPolicy=1; export ORBthreadPerConnectionPolicy
ORBthreadPerConnectionUpperLimit=16; export ORBthreadPerConnectionUpperLimit
ORBthreadPerConnectionLowerLimit=8; export ORBthreadPerConnectionLowerLimit
ORBthreadPoolWatchConnection=0; export ORBthreadPoolWatchConnection
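For completeness, unless I am mistaken the same settings can also go into
omniORB.cfg (omniORB 4.x accepts the same parameter names there, just without
the ORB prefix); an equivalent of the environment variables above would be:

# omniORB.cfg equivalent of the environment variables above
# (assuming omniORB 4.x; parameter names without the "ORB" prefix)
threadPerConnectionPolicy     = 1
threadPerConnectionUpperLimit = 16
threadPerConnectionLowerLimit = 8
threadPoolWatchConnection     = 0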
Any idea about what could cause this thread problem is welcome (including
the explanation that there is no problem :) ).
Thanks in advance,
Nic