Discussion:
[omniORB] Scalability problems; 300+ clients
Slawomir Lisznianski
2006-07-05 21:45:35 UTC
Hello,

I'm curious whether anyone has seen this before. We have a farm of about
300 computers on which we run our CORBA (omniORB) infrastructure. When
the load reaches about 330 clients (connected to a single server), new
clients tend to see a COMM_FAILURE exception, or (rarely) TRANSIENT.
Under that load, existing clients sometimes start getting exceptions as
well. We're not running out of file descriptors in the server process.
We played with ORB parameters on the server side but couldn't see any
improvement. We use Linux Red Hat AS 3, FC4 and omniORB 4.0.7.


Thanks,

--
Slawomir Lisznianski
Paramay Group, Inc.
"Programs for Research Machinery"
Peter Klotz
2006-07-06 00:23:51 UTC
Post by Slawomir Lisznianski
Hello,
I'm curious whether anyone has seen this before. We have a farm of about
300 computers on which we run our CORBA (omniORB) infrastructure. When
the load reaches about 330 clients (connected to a single server), new
clients tend to see a COMM_FAILURE exception, or (rarely) TRANSIENT.
Under that load, existing clients sometimes start getting exceptions as
well. We're not running out of file descriptors in the server process.
We played with ORB parameters on the server side but couldn't see any
improvement. We use Linux Red Hat AS 3, FC4 and omniORB 4.0.7.
We encountered similar problems under Red Hat Enterprise Linux 3, and I
tracked them down to the stack size of the server threads. Each thread
seems to be created with a stack size of 10MB. The total memory one
process is allowed to use is 3GB (at least on i386). This limits the
number of threads to approximately 300.

We disabled threadPerConnectionPolicy (which is 1 by default) and use a
thread pool with far fewer than 300 threads instead.
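
For anyone who wants to try the same thing, a minimal sketch of the
relevant omniORB.cfg entries (the pool size of 64 below is only an
illustrative value, not a carefully tuned one):

  threadPerConnectionPolicy = 0   # use the thread pool, not a thread per connection
  maxServerThreadPoolSize   = 64  # cap on the number of worker threads

The same options can also be given on the command line as
-ORBthreadPerConnectionPolicy 0 and -ORBmaxServerThreadPoolSize 64.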

Best regards, Peter.
Slawomir Lisznianski
2006-07-06 00:52:32 UTC
Ironically, we are seeing problems earlier when we use
threadPerConnectionPolicy=0 ;-) When we set threadPerConnectionPolicy=1
we are able to serve more clients (~600) before exceptions start
occurring. The whole time, the memory footprint of the server stays
below 1GB, with over 6GB free on the box.

Next, we are going to try running the server on a 2.6 kernel and see if
it makes any difference.
Post by Peter Klotz
We encountered similar problems under Red Hat Enterprise Linux 3, and I
tracked them down to the stack size of the server threads. Each thread
seems to be created with a stack size of 10MB. The total memory one
process is allowed to use is 3GB (at least on i386). This limits the
number of threads to approximately 300.
We disabled threadPerConnectionPolicy (which is 1 by default) and use a
thread pool with far fewer than 300 threads instead.
--
Slawomir Lisznianski
Paramay Group, Inc.
"Programs for Research Machinery"
Slawomir Lisznianski
2006-07-06 01:40:35 UTC
OK guys, here is some additional info. When we use
threadPerConnectionPolicy=1 we do _NOT_ experience any connectivity
problems unless, of course, we run out of per-process memory on the
server side, at which point omniORB cannot spawn any more worker threads
(and that's the error it logs). So, with threadPerConnectionPolicy=1 and
the thread stack size set to 1MB, we were able to serve over 1100
clients concurrently without getting any errors. We used Red Hat AS 3
running on a 2.4 SMP kernel, 32-bit architecture.
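
In case it is useful to others: one way to get a smaller per-thread
stack (assuming the threading library derives its default thread stack
size from the stack rlimit, as NPTL does) is to lower that limit in the
shell that launches the server; 'my_server' below is just a placeholder
name:

  ulimit -s 1024    # 1MB default thread stack instead of the distribution default
  ./my_server -ORBthreadPerConnectionPolicy 1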

As soon as we set threadPerConnectionPolicy=0 we were able to serve at
most ~330 clients before we started seeing COMM_FAILURE exceptions on
the client side.


Thanks,

--
Slawomir Lisznianski
Paramay Group, Inc.
"Programs for Research Machinery"
Duncan Grisby
2006-07-06 02:39:50 UTC
Post by Slawomir Lisznianski
OK guys, here is some additional info. When we use
threadPerConnectionPolicy=1 we do _NOT_ experience any connectivity
problems unless, of course, we run out of per-process memory on the
server side, at which point omniORB cannot spawn any more worker threads
(and that's the error it logs). So, with threadPerConnectionPolicy=1 and
the thread stack size set to 1MB, we were able to serve over 1100
clients concurrently without getting any errors. We used Red Hat AS 3
running on a 2.4 SMP kernel, 32-bit architecture.
As soon as we set threadPerConnectionPolicy=0 we were able to serve at
most ~330 clients before we started seeing COMM_FAILURE exceptions on
the client side.
It may be due to having more file descriptors in use than can be put in
an fd_set. When you're in thread pool mode, omniORB has a thread doing
select() on all the open connections. If a connection's file descriptor
number is larger than FD_SETSIZE, omniORB can't service it in thread
pool mode, so it has to close the connection as soon as it is opened,
which causes the client to see a COMM_FAILURE. On Linux, FD_SETSIZE is
normally 1024, but if
your application opens lots of files, or the server uses callbacks to
the clients, you could easily hit that with 300 or so clients. If you
run with traceLevel 20, you'll see a message if omniORB has to close a
connection for that reason.
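
To illustrate the limit (a standalone sketch, not code from omniORB
itself):

  #include <stdio.h>
  #include <sys/select.h>

  int main(void)
  {
      /* On Linux, FD_SETSIZE is normally 1024, so an fd_set can only
         describe file descriptors 0..1023. */
      printf("FD_SETSIZE = %d\n", FD_SETSIZE);

      int fd = 1500;               /* pretend this came from accept() */
      if (fd >= FD_SETSIZE) {
          /* FD_SET(fd, &set) here would write past the end of the
             fd_set, so a select()-based server has no safe way to watch
             this descriptor and has to close it instead. */
          printf("fd %d cannot be put in an fd_set\n", fd);
          return 1;
      }

      fd_set readfds;
      FD_ZERO(&readfds);
      FD_SET(fd, &readfds);        /* safe only because fd < FD_SETSIZE */
      return 0;
  }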

You might try omniORB 4.1 beta, since that uses poll(), which doesn't
have a file descriptor limit.

Cheers,

Duncan.
--
-- Duncan Grisby --
-- ***@grisby.org --
-- http://www.grisby.org --
Slawomir Lisznianski
2006-07-06 03:34:51 UTC
Post by Duncan Grisby
It may be due to having more file descriptors in use than can be put in
an fd_set. When you're in thread pool mode, omniORB has a thread doing
select() on all the open connections.
Indeed, all our clients are also "servers" (via callbacks), and during
heavier loads we exceed 1024 FDs. Raising omniORB's trace level helped
confirm that this is what was going on under the hood.
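
For anyone wanting to watch this on their own box, counting the entries
in /proc/<pid>/fd of the server process is a quick way to see the
descriptor count climb toward 1024 ('my_server' below is just a
placeholder name):

  ls /proc/$(pidof my_server)/fd | wc -l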

We will try migrating to 4.1 in the near future. Meanwhile,
threadPerConnectionPolicy=1 ought to suffice.

Thanks,
--
Slawomir Lisznianski
Paramay Group, Inc.
"Programs for Research Machinery"
