Discussion:
[omniORB] How to configure the Naming Service to support many simultaneous connections
souchaud
2007-04-25 18:20:47 UTC
Hello,

When I launch my program on a cluster with more than ~200 nodes, my
application cannot start and the following error appears:

... Failed to resolve NameService ...

Here is the portion of the code used to resolve the Naming Service:

    // Obtain a reference to the Naming Service:
    COLCOWS_DEBUG(dblTest, "Obtain a reference of the Naming Service");
    try {
        obj_ref = orbp->resolve_initial_references("NameService");
        new_node_servant->_naming_ctxt =
            CosNaming::NamingContext::_narrow(obj_ref);
    }
    catch(...) {
        COLCOWS_ERROR("Failed to resolve NameService");
    }


I don't think the problem is in my source code, because below 200 nodes
it works fine. My program works like this: the program is launched on
every node with MPI, and at the beginning of the program, each node
tries to resolve an object in the Naming Service.

I would like to know if there is a way to configure the Naming Service
(or the client) so that the Naming Service can handle more than 200
connections at the same time.

Thanks in advance for your help,
Mathieu Souchaud
Duncan Grisby
2007-04-29 21:18:54 UTC
Using catch-all clauses is generally a bad idea since it can mask the
reasons behind problems.
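
To illustrate the point about catch-all clauses, here is a minimal, self-contained sketch of ordered handlers that report which kind of failure occurred. The types `SystemError`, `TransientError` and `CommFailure` and the function `diagnose` are hypothetical stand-ins, not omniORB API; in the real program the handlers would presumably name `CORBA::TRANSIENT`, `CORBA::COMM_FAILURE` and `CORBA::SystemException` instead.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for CORBA::SystemException and two of its
// subclasses, so that this sketch compiles without an ORB.
struct SystemError : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct TransientError : SystemError {
    TransientError() : SystemError("TRANSIENT") {}
};
struct CommFailure : SystemError {
    CommFailure() : SystemError("COMM_FAILURE") {}
};

// Runs `resolve` and returns a description of the outcome.
// The most specific handlers come first, so the report says which
// failure happened instead of hiding it behind catch(...).
// Exceptions that are not SystemError propagate to the caller.
std::string diagnose(const std::function<void()>& resolve) {
    try {
        resolve();
        return "ok";
    } catch (const TransientError& e) {
        return std::string("transient: ") + e.what();
    } catch (const CommFailure& e) {
        return std::string("comm failure: ") + e.what();
    } catch (const SystemError& e) {
        return std::string("other system exception: ") + e.what();
    }
}
```

In the original code, the body passed as `resolve` would be the `resolve_initial_references` and `_narrow` calls, and each handler would log through `COLCOWS_ERROR` with the exception name.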

In this case, you are probably getting a COMM_FAILURE exception because
omniNames is unable to service new connections. You can try varying the
omniORB parameters about whether to use thread per connection or thread
pool mode. Depending on your platform, you ought to be able to configure
omniNames to support at least 1000 concurrent connections. If you run
omniNames with -ORBtraceLevel 25 -ORBtraceThreadId 1, you will probably
get an idea of why the connections are failing.

The other thing you can do is to reduce the time omniNames will hold
open idle connections by setting the inConScanPeriod to a small number
of seconds, and setting scanGranularity to 1 second rather than the
default 5. That way it closes idle connections sooner and can reuse
file descriptors and threads.

Cheers,

Duncan.
--
-- Duncan Grisby --
-- ***@grisby.org --
-- http://www.grisby.org --
souchaud
2007-05-10 14:18:51 UTC
OK, it works well with 256 nodes now. The problem was that I did not
catch the TRANSIENT exception. Now, if a TRANSIENT or a COMM_FAILURE
exception is raised, I sleep for 1 second and try again. With 128 nodes
there are no retries; with 256 nodes some TRANSIENT exceptions are raised.
I did not have to tune the omniNames configuration.
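
The retry approach described above could be sketched roughly as follows. This is not the poster's actual code: `TransientError`, `CommFailure` and `resolve_with_retry` are hypothetical names standing in for `CORBA::TRANSIENT`, `CORBA::COMM_FAILURE` and the real resolve logic, and the cap on the number of attempts is an added assumption (the post does not mention one).

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical stand-ins for CORBA::TRANSIENT and CORBA::COMM_FAILURE.
struct TransientError : std::runtime_error {
    TransientError() : std::runtime_error("TRANSIENT") {}
};
struct CommFailure : std::runtime_error {
    CommFailure() : std::runtime_error("COMM_FAILURE") {}
};

// Calls `resolve` until it succeeds, sleeping `delay` between attempts.
// Returns the number of attempts used. At least one attempt is always
// made; once `max_attempts` is reached the last exception is rethrown.
// Only the two retryable exception types are caught; anything else
// propagates immediately.
int resolve_with_retry(const std::function<void()>& resolve,
                       int max_attempts,
                       std::chrono::milliseconds delay) {
    for (int attempt = 1; ; ++attempt) {
        try {
            resolve();
            return attempt;
        } catch (const TransientError&) {
            if (attempt >= max_attempts) throw;
        } catch (const CommFailure&) {
            if (attempt >= max_attempts) throw;
        }
        std::this_thread::sleep_for(delay);
    }
}
```

A call matching the one-second sleep mentioned above would look like `resolve_with_retry(resolve, 10, std::chrono::seconds(1))`, where the limit of 10 attempts is arbitrary.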

Thanks for your help,
Mathieu
