[omniORB] linux, windows communication problem + new problems
Karl Schulze
2006-10-10 14:50:52 UTC
Hello everybody

About 4 months ago I posted a strange problem that occurs when running
the omniORB naming service and servers under Linux and clients under
Windows. After I sent some trace output, nothing new about that topic
was posted here.

I have now redone all my tests with omniORB 4.1.0-rc1 to see whether
the problem had been fixed in this new release. It has not: the problem
remains the same.

The setup I use is the following: omniNames and eg3_impl are running on
a Linux box with these parameters set:
scanGranularity = 1
inConScanPeriod = 1
threadPerConnectionPolicy = 0
maxServerThreadPoolSize = 5

For the client I use a slightly modified version of eg3_clt, whose main
reads like this:
int
main (int argc, char **argv)
{
  try {
    CORBA::ORB_var orb = CORBA::ORB_init(argc, argv);
    for (CORBA::ULong count=0; count < 1000; count++)
    {
      cerr << count << "/1000" << endl;
      CORBA::Object_var obj = getObjectReference(orb);
      Echo_var echoref = Echo::_narrow(obj);
      hello(echoref);
      Sleep(2000); // sleep(2); for Linux
    }

    orb->destroy();
  } ...
with the corresponding include for the sleep call (#include <Windows.h>
on Windows or #include <unistd.h> on Linux).
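
As a side note, the platform-specific Sleep()/sleep() pair could be replaced by one portable call where a C++11 compiler is available. This is only a sketch: the helper name pause_ms is hypothetical and is not part of eg3_clt or omniORB.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper (not part of eg3_clt): blocks the calling thread
// for at least `ms` milliseconds on both Windows and Linux.
inline void pause_ms(unsigned ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}
```

In the loop above, pause_ms(2000) would stand in for Sleep(2000) and sleep(2), so the same source builds on both platforms.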

Usually the client gets a CORBA::COMM_FAILURE when count is around 200.

While performing these tests I also tried some different setups. In
particular, I ran the above eg3_clt in parallel from 2 different Linux
boxes (different from the one where the server and omniNames are
running). The server and omniNames use the same parameters stated above.
The strange thing is that after a while (count is around 50-200) all 4
communication partners involved seem to be stalled. Any new connection
attempt also stalls immediately. With just one client I was not able to
reproduce this problem.

If it is useful I can provide the trace output for all communication
partners, but the traces are rather long :)

Is this a known behavior with the parameters I use?
Am I doing something wrong (outside the specification)?

Thanks
Karl
Klaus Dieter Welast
2006-10-11 05:11:45 UTC
Hello Karl,

The problem you describe looks like one we had in the past with omniNames.

In our case the limit on the number of open file handles (256 by default) caused omniNames to stall.
omniNames closes the file handle of a socket only after a timeout (I can't remember how long the timeout was). If the interval between new connections is shorter than the timeout period, the number of open file handles grows until it reaches the maximum for omniNames, and omniNames stalls.
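
The arithmetic behind this can be sketched roughly as follows. This is a back-of-the-envelope model, not omniNames code: the function names are hypothetical, and any timeout value passed in is an assumption, since the actual timeout is not stated in this thread. It assumes every new connection holds one file handle until the timeout expires.

```cpp
#include <cassert>
#include <cmath>
#include <sys/resource.h>

// Rough steady-state estimate (assumption: each new connection keeps one
// handle open for close_timeout_s seconds, and a new connection arrives
// every connect_interval_s seconds).
long steady_state_handles(double close_timeout_s, double connect_interval_s) {
    assert(connect_interval_s > 0.0);
    return static_cast<long>(std::ceil(close_timeout_s / connect_interval_s));
}

// Compares the estimate with the soft file-descriptor limit of the
// current process (POSIX getrlimit). Returns false if the limit cannot
// be read or is unlimited.
bool would_exhaust_fds(double close_timeout_s, double connect_interval_s) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return false;
    if (lim.rlim_cur == RLIM_INFINITY) return false;
    long needed = steady_state_handles(close_timeout_s, connect_interval_s);
    return static_cast<rlim_t>(needed) >= lim.rlim_cur;
}
```

With a hypothetical timeout of 300 seconds and one new connection per second, the estimate is 300 handles, which is above a limit of 256; at one connection every 2 seconds it drops to 150.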

In our production environment we reduced the rate of new connections to solve the problem.

I hope it helps!


Best regards

Kl. D. Welast

Karl Schulze
2006-10-11 14:29:28 UTC
Post by Klaus Dieter Welast
In our case the limitation of max file handle (256 per default) caused the stall of omniNames
I have checked this. Neither netstat nor /proc shows any indication that
the number of file handles exceeds some maximum (it stays far below 20).
Also, in the test scenario I use (2 clients, 1 server, 1 omniNames) the
ORB in the client is not destroyed during the whole test, so the test
should not use many file handles.

But I have also seen a shortage of file handles in a completely
different scenario, which we also solved by tuning the parameters for
omniNames.

Thanks
Karl
Karl Schulze
2006-10-11 14:18:09 UTC
Is it possible that the servant becomes deadlocked due to resource
sharing? I'm just speculating but I've been reading up on the
threading in omniORB and (not having looked at the example servants)
this sounds like a likely problem.
Maybe you can try setting mutexes in the servant code ?
You are right, this may be a problem in the servant too, but the
communication stalls while both clients are contacting omniNames, and
any new connection attempt to omniNames fails too. So I am guessing
that the problem is related to omniNames and the parameters I use, not
to the servant code itself.

Karl
Duncan Grisby
2006-10-12 23:28:38 UTC
Post by Karl Schulze
The setup I use is the following: omniNames and the eg3_impl are
running on a Linux
box with some parameters set, that are
scanGranularity = 1
inConScanPeriod = 1
I haven't been able to reproduce the error you see. However, the
configuration you've set means that once per second omniORB scans for
idle connections, and any that are idle at that moment in time are
closed. There is therefore a potential race condition where a client
opens a new connection, and before it manages to send a request the
server decides the connection is idle and closes it. That ought to
result in the client retrying, and indeed it does when I tested it, but
perhaps that's the cause of the problem you see. Do you see the problem
if you set inConScanPeriod to 2?
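
The race described above can be illustrated with a small toy model. It is a sketch under stated assumptions, not omniORB's implementation: it assumes scans run at every multiple of scanGranularity seconds and that a connection is closed once it has been seen idle on inConScanPeriod / scanGranularity consecutive scans. All names in the code are hypothetical.

```cpp
#include <cassert>

// Toy model only; field names mirror the configuration parameters but
// the logic is an assumption, not taken from the omniORB sources.
struct ScanConfig {
    long scan_granularity_s;    // seconds between idle scans
    long in_con_scan_period_s;  // idle period after which a connection closes
};

// Number of consecutive scans that must see the connection idle.
long scans_until_close(const ScanConfig& cfg) {
    assert(cfg.scan_granularity_s > 0);
    long ticks = cfg.in_con_scan_period_s / cfg.scan_granularity_s;
    return ticks < 1 ? 1 : ticks;
}

// Time in ms at which a connection last active at last_active_ms would
// be closed if it stayed idle.
long close_time_ms(const ScanConfig& cfg, long last_active_ms) {
    long g = cfg.scan_granularity_s * 1000;
    long first_scan = (last_active_ms / g + 1) * g;  // first scan after activity
    return first_scan + (scans_until_close(cfg) - 1) * g;
}

// True if the server side closes a connection opened at opened_ms before
// the first request arrives at request_ms.
bool closed_before_request(const ScanConfig& cfg, long opened_ms, long request_ms) {
    return close_time_ms(cfg, opened_ms) <= request_ms;
}
```

In this model, a connection opened at 990 ms whose first request arrives at 1010 ms is closed by the scan at 1000 ms when both parameters are 1, but survives when inConScanPeriod is 2, because the first scan only counts one idle tick.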

To confirm that that is really what's going on, you could try getting a
trace with -ORBtraceLevel 25 -ORBtraceInvocations 1 -ORBtraceThreadId 1
-ORBtraceTime 1 on both omniNames and the client. That way we can
correlate the connection handling on both sides.

Cheers,

Duncan.
--
-- Duncan Grisby --
-- ***@grisby.org --
-- http://www.grisby.org --
Karl Schulze
2006-10-20 15:17:30 UTC
Post by Duncan Grisby
To confirm that that is really what's going on, you could try getting a
trace with -ORBtraceLevel 25 -ORBtraceInvocations 1 -ORBtraceThreadId 1
-ORBtraceTime 1 on both omniNames and the client. That way we can
correlate the connection handling on both sides.
Hello Duncan

I have now done a lot of new tests and have several logs for the
problem I described earlier. To keep the traffic on the mailing list
small I have put them on
http://www.synedra.com/downloads/omniORB/logs.zip

The zip file contains 3 folders with the results of the following
tests:

The folder called 'LinLinux' contains the logs for the communication
problem that arises when I connect with two Linux clients to a Linux
omniNames and a Linux server. At the end of the log I waited about an
hour for anything new on the connections, but everything seems to be
stalled. This problem only occurs with inConScanPeriod=1 and a sleep
time of 2 seconds in the two clients. With inConScanPeriod=2 it no
longer appears.

The folder called 'WinLinux' contains the logs for one Windows client
communicating with a Linux omniNames and a Linux server. In this
situation I was able to reproduce the 'crash' in the Windows client
with inConScanPeriod=1 and inConScanPeriod=2, but not with
inConScanPeriod=3. (The client uses a sleep of 2000 msec.)

The third folder, called 'WinLinux2', contains the logs for one Windows
client communicating with a Linux omniNames and a Linux server, but
with a longer sleep time for the client (10 sec). I have done this test
only for inConScanPeriod=1. The 'crash' appears in this situation too,
which I find strange: with such a long client sleep time I would not
expect any 'race condition'.

Hopefully the logs can help to find the reason for the strange behavior.
Thanks for the help
Karl
