[omniORB] OpenVMS problem

Discussion:

Bruce Visscher

2009-03-28 02:15:14 UTC

Recently I have discovered a problem on the OpenVMS platform that only
occurs in omniORB 4(12). It did not occur in omniORB 2 or 3.

Occasionally, a server will display the following:

omniORB: Error return from select(). errno = 65535
omniORB: Unrecoverable error for this endpoint: giop:tcp:xx.xx.xx.xx:xxxxx,
it will no longer be serviced.

In order to get more information, I have modified the code in
SocketCollection.cc to utilize ::perror on the OpenVMS platform so now I can
see:

Error return from select(): non-translatable vms error code: 0x13C
%system-f-ivchan, invalid i/o channel
omniORB: Unrecoverable error for this endpoint: giop:tcp:xx.xx.xx.xx:xxxxx,
it will no longer be serviced.

This looks like a bug in TCPIP for OpenVMS to me but I can't really prove
it, so I am trying to find a workaround for it. I have discovered that if I
configure omniORB to use poll rather than select, it has a longer mean time
between failures. It also helps to set -ORBconnectionWatchPeriod to (e.g.)
500000 (0.5 sec rather than 50 ms).

I am wondering what would happen if I turn off the timeout in select
altogether. (Iirc, you do that by passing a null in place of the timeout*)
Would this just make it more difficult to have a graceful shutdown? (The
way around that would be to issue a shutdown on the listening socket from a
different thread it seems to me.) Is there a way to do this already?
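
For readers following the question about the timeout, here is a minimal sketch of the two ways select() can be called. It is not omniORB's SocketCollection code; `wait_readable` is a hypothetical helper, and the convention that a negative value means "pass a null timeout" is an assumption made only for this illustration.

```cpp
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

// Hypothetical helper (not part of omniORB): wait until fd is readable.
// timeout_usec >= 0 : give select() a timeout, so it returns 0 when nothing
//                     happens and the caller gets a chance to rescan.
// timeout_usec <  0 : pass a null timeout, so select() blocks until the
//                     descriptor is ready or an error occurs.
// Returns select()'s own result: >0 ready, 0 timed out, -1 error.
int wait_readable(int fd, long timeout_usec) {
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    if (timeout_usec < 0) {
        // Null timeout: no periodic wake-up at all.
        return select(fd + 1, &rfds, nullptr, nullptr, nullptr);
    }

    timeval tv;
    tv.tv_sec = timeout_usec / 1000000;
    tv.tv_usec = timeout_usec % 1000000;
    return select(fd + 1, &rfds, nullptr, nullptr, &tv);
}
```

With a timeout of 50000 (50 ms) the call returns 0 on an idle descriptor and the caller regains control; with a negative value it only returns once something happens on the descriptor, which is why shutdown would need help from another thread.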

Has anyone ever had any problems with this on other platforms?

Thanks for any help,

Bruce Visscher

Duncan Grisby

2009-04-20 15:07:01 UTC

Recently I have discovered a problem on the OpenVMS platform that only occurs
in omniORB 4(12). It did not occur in omniORB 2 or 3.
omniORB: Error return from select(). errno = 65535
omniORB: Unrecoverable error for this endpoint: giop:tcp:xx.xx.xx.xx:xxxxx, it
will no longer be serviced.

[...]

I am wondering what would happen if I turn off the timeout in select
altogether. (Iirc, you do that by passing a null in place of the timeout*)
Would this just make it more difficult to have a graceful shutdown? (The way
around that would be to issue a shutdown on the listening socket from a
different thread it seems to me.) Is there a way to do this already?

Sorry for the delay in replying. It's not a good idea to remove the
timeout in select(), since that way you will sometimes fail to watch
connections when you should do. The timeout is used to cause the code to
rescan the set of file descriptors to watch.

Does it work if you modify the code to just carry on if you get that
error? i.e. duplicate the way it handles EBADF.
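
A minimal sketch of what "carry on, as for EBADF" could look like follows. It is not the actual omniORB code, which is not shown in this thread; `SelectErrorAction`, `classify_select_error` and `kObservedVmsErrno` are hypothetical names, and the value 65535 is simply the errno printed in the log above.

```cpp
#include <cerrno>

// Hypothetical names throughout; this is an illustration, not omniORB source.
enum class SelectErrorAction {
    CarryOn,  // ignore the failure and go round the loop again
    GiveUp    // report the endpoint as unrecoverable
};

// errno value reported in the log above ("errno = 65535").
const int kObservedVmsErrno = 65535;

// Decide what to do after select() has returned -1 with the given errno.
SelectErrorAction classify_select_error(int err) {
    if (err == EBADF) {
        // A descriptor in the set is no longer valid: carry on and let the
        // next rescan rebuild the set.
        return SelectErrorAction::CarryOn;
    }
    if (err == kObservedVmsErrno) {
        // Suggested change: treat the OpenVMS failure the same way.
        return SelectErrorAction::CarryOn;
    }
    return SelectErrorAction::GiveUp;
}
```

A select loop would call this after a -1 return and only log "Unrecoverable error for this endpoint" when the result is `GiveUp`.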

Cheers,

Duncan.

--
-- Duncan Grisby --
-- ***@grisby.org --
-- http://www.grisby.org --

Bruce Visscher

2009-04-20 16:05:44 UTC

Duncan,

Thanks for your reply.

Post by Duncan Grisby

[...]

I am wondering what would happen if I turn off the timeout in select
altogether. (Iirc, you do that by passing a null in place of the timeout*)
Would this just make it more difficult to have a graceful shutdown? (The way
around that would be to issue a shutdown on the listening socket from a
different thread it seems to me.) Is there a way to do this already?

Sorry for the delay in replying. It's not a good idea to remove the
timeout in select(), since that way you will sometimes fail to watch
connections when you should do. The timeout is used to cause the code to
rescan the set of file descriptors to watch.
Does it work if you modify the code to just carry on if you get that
error? i.e. duplicate the way it handles EBADF.

I had thought about doing that but wasn't sure it would lead to a
desirable outcome. I have developed a reproducer that works fairly
well so I will give this a try.

One thing I have done is to modify the code that reports
"Unrecoverable error for this endpoint: giop:tcp:xx.xx.xx.xx:xxxxx, it
will no longer be serviced." to actually shut down the ORB at that
point. For my purposes, it would be better to shut down a server that
no longer listens. It's a little bit of a sledgehammer, but it
works. I could submit a patch if you think it would benefit others.

Of course, I am pretty sure this is an OS level bug and I am trying to
get a resolution on that.

Bruce Visscher

2009-04-22 00:28:31 UTC

Post by Duncan Grisby

Does it work if you modify the code to just carry on if you get that
error? i.e. duplicate the way it handles EBADF.

I tried this and it seems to work just fine. I am not quite as up to
speed on the omniORB internals as I used to be. Is there a situation
that could validly occur where a socket is closed in a different
thread but not yet removed from the list? If that is the case, then
treating this as EBADF would seem to be the right thing to do.
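
The scenario in question can be reproduced on a POSIX system with a small sketch like the one below. It is not omniORB code: `select_errno_after_close` is a hypothetical helper, and it closes the descriptor in the same thread purely to keep the example deterministic.

```cpp
#include <cerrno>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

// Hypothetical demo (not omniORB code): build an fd_set that still contains
// a descriptor which has already been closed, then call select() on it.
// Returns the errno set by select(), 0 if select() did not fail, or -1 if
// the pipe could not be created.
int select_errno_after_close() {
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }

    int stale = fds[0];
    close(stale);  // "closed elsewhere, but not yet removed from the list"

    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(stale, &rfds);

    timeval tv;
    tv.tv_sec = 0;
    tv.tv_usec = 0;  // zero timeout: just poll once

    int rc = select(stale + 1, &rfds, nullptr, nullptr, &tv);
    int err = (rc < 0) ? errno : 0;  // capture errno before further calls

    close(fds[1]);
    return err;
}
```

On a POSIX system select() is expected to fail with EBADF here, which is the case the existing EBADF handling covers; whether OpenVMS reports the same situation as errno 65535 is the open question in this thread.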

Thanks,

Bruce