[omniORB] OmniOrb and CP1252 (Windows Latin 1) vs. ISO-8859-1

Discussion:

Steven Sauder

2008-07-29 03:18:28 UTC

Hi all!

We're a long-time user of OmniOrb with great success in our applications,
but something has recently come up which is causing problems for our
European customers. Our applications all speak the (full) Windows CP1252
(Windows Latin 1) character set, in which Microsoft has used the code point
0x80 to represent the Euro symbol (€). CP1252 and ISO-8859-1 are "almost"
the same, except that CP1252 utilizes the 0x80 code point to represent the
Euro, where ISO-8859-1 leaves this code point blank.

After a bit of investigation, it seems that OmniOrb by default uses
ISO-8859-1 as the "native" codeset, which I had thought would mean that the
Euro symbol (and a couple of other "special" characters such as the
trademark symbol, and the "curly" printers quotes), which are represented in
CP1252, but not in ISO-8859-1, could not be handled by OmniOrb using its
default codeset. However, digging into cs-8859-1.cc a little more, it looks
like the translation tables ARE passing 0x80 through to UCS as 0x0080, so
unless I'm reading this wrong, any OmniOrb-to-OmniOrb communications (on
Windows) should pass the (Windows-specific) Euro code point 0x80 through
without problem. Am I reading this right?

However, the difficulty arises because we have several CORBA components
which are written using the standard Java ORB, which (it appears) is not
providing the same amount of leeway with this symbol, and insists on
transmitting the Euro symbol in its "true" UCS16 representation (0x20AC),
which OmniOrb's codeset converters end up turning into a "?" when we receive
it on the Windows end.

Has anyone had any experience with this? From what I've read so far, it
seems the only viable solution would be to write our own NCS-C
implementation that handled the CP1252 Euro symbol (0x80) to Unicode
(0x20AC) and back-again conversion through the translation tables as is
currently happening in cs-8859-1.cc. Is this correct?

Any help would be hugely appreciated!
Thanks
Steve.
--
Steve Sauder
Chief Technology Officer
North Plains Systems Corp.
510 Front Street West, 4th Floor
Toronto, ON
Canada M5V 3H3
P: (416) 345-1900 ext. 500
F: (416) 599-0808
W: http://www.northplains.com/
E: ***@northplains.com

William Bauder

2008-07-29 05:04:11 UTC

I haven't had to deal with this myself, but it did trigger a memory of
something I saw in OrbConstants:

// The CHAR_CODESETS and WCHAR_CODESETS allow the user to override the default
// connection code sets. The value should be a comma separated list of OSF
// registry numbers. The first number in the list will be the native code
// set.
//
// Number can be specified as hex if preceded by 0x, otherwise they are
// interpreted as decimal.
//
// Code sets that we accept currently (see core/OSFCodeSetRegistry):
//
// char/string:
//
// ISO8859-1 (Latin-1) 0x00010001
// ISO646 (ASCII) 0x00010020
// UTF-8 0x05010001
//
// wchar/string:
//
// UTF-16 0x00010109
// UCS-2 0x00010100
// UTF-8 0x05010001
//
// Note: The ORB will let you assign any of the above values to
// either of the following properties, but the above assignments
// are the only ones that won't get you into trouble.
public static final String CHAR_CODESETS = SUN_PREFIX + "codeset.charsets";
public static final String WCHAR_CODESETS = SUN_PREFIX + "codeset.wcharsets";

Assuming that you're using strings, and the problem isn't in their
ISO-8859 encoding, you might be able to fix it on the Java side by changing
the default codeset.

-Bill

Ridgway, Richard (London)

2008-07-29 12:40:33 UTC

In JacORB it is controlled with -Dfile.encoding=ISO_8859_1 on the command line.

I had to start using that to get JacORB to interoperate with Orbix. I never had any problems with omniORB or TAO, but maybe I didn't hit the same situation.

Richard

Duncan Grisby

2008-07-29 16:49:44 UTC

On Monday 28 July, Steven Sauder wrote:

[...]

Post by Steven Sauder
After a bit of investigation, it seems that OmniOrb by default uses
ISO-8859-1 as the "native" codeset

Yes. That is required by the CORBA spec.

Post by Steven Sauder
, which I had thought would mean that the Euro symbol
(and a couple of other "special" characters such as the trademark symbol, and
the "curly" printers quotes), which are represented in CP1252, but not in
ISO-8859-1, could not be handled by OmniOrb using its default codeset.
However, digging into cs-8859-1.cc a little more, it looks like the
translation tables ARE passing 0x80 through to UCS as 0x0080, so unless I'm
reading this wrong, any OmniOrb-to-OmniOrb communications (on Windows) should
pass the (Windows-specific) Euro code point 0x80 through without problem. Am
I reading this right?

Yes, but 0x0080 in Unicode is not the Euro symbol. ISO 8859-1 0x80 maps
to Unicode 0x0080, which maps back to ISO 8859-1 0x80, so if you're
pretending to use ISO 8859-1 while actually using CP1252 at both ends it
will appear to work. It's only when someone tries to interpret the
Unicode as some other code set that you notice the error.

The same is true if you are using any other string codeset while
claiming to use ISO 8859-1 -- it's just things will be more obviously
wrong when conversions to other code sets occur.
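
[Editorial illustration] The round trip described above can be reproduced with Python's built-in codecs. This is only a sketch of the byte-level effect, assuming that Python's "latin-1" codec behaves like the ISO 8859-1 tables discussed here (each byte maps to the Unicode code point with the same value); it does not call omniORB, and the variable names are made up for the example.

```python
EURO = "\u20ac"  # the real Unicode Euro sign

# Bytes as a CP1252 application would produce them for the Euro sign.
cp1252_bytes = EURO.encode("cp1252")            # b'\x80'

# A converter that believes the data is ISO 8859-1 maps byte 0x80
# to the Unicode code point with the same value, U+0080.
as_unicode = cp1252_bytes.decode("latin-1")

# The receiving side maps U+0080 back to byte 0x80 ...
round_tripped = as_unicode.encode("latin-1")

# ... and a CP1252 application at that end displays the Euro sign again,
# so the mislabelled round trip appears to work.
shown_by_receiver = round_tripped.decode("cp1252")

# A peer that takes the Unicode value at face value does not see a Euro.
is_euro_in_unicode = (as_unicode == EURO)

print(repr(as_unicode), shown_by_receiver == EURO, is_euro_in_unicode)
```

The intermediate value is U+0080, a C1 control character, which is why the error only shows up once a third party interprets the Unicode form.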

Post by Steven Sauder
However, the difficulty arises because we have several CORBA
components which are written using the standard Java ORB, which (it
appears) is not providing the same amount of leeway with this symbol,
and insists on transmitting the Euro symbol in its "true" UCS16
representation (0x20AC), which OmniOrb's codeset converters end up
turning into a "?" when we receive it on the Windows end.

Actually, I'd expect you to get a CORBA::DATA_CONVERSION exception since
0x20AC can't be mapped to ISO 8859-1. The Java ORB must be substituting
the character rather than throwing the exception the CORBA spec says it
should.
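
[Editorial illustration] The two behaviours contrasted above (refusing the conversion versus substituting a character) have a close analogue in Python's codec error handling. The sketch below is an analogy only: it uses Python's "latin-1" codec as a stand-in for ISO 8859-1 and says nothing about what any particular ORB actually does internally.

```python
euro = "\u20ac"  # U+20AC has no ISO 8859-1 representation

# Strict conversion: the codec refuses, which is the analogue of an ORB
# raising DATA_CONVERSION.
try:
    euro.encode("latin-1")
    strict_result = "encoded"
except UnicodeEncodeError:
    strict_result = "error"

# Lenient conversion: the unmappable character is silently replaced,
# which is the analogue of receiving "?" at the other end.
substituted = euro.encode("latin-1", errors="replace")

print(strict_result, substituted)
```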

Post by Steven Sauder
Has anyone had any experience with this? From what I've read so far,
it seems the only viable solution would be to write our own NCS-C
implementation that handled the CP1252 Euro symbol (0x80) to Unicode
(0x20AC) and back-again conversion through the translation tables as
is currently happening in cs-8859-1.cc, is this correct?

Yes, that's the right thing to do. There are quite a few other 8 bit
code sets that it would be sensible to add too, including ISO 8859-15
which is equivalent to ISO 8859-1 but includes the Euro symbol at code
point 0xA4.
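
[Editorial illustration] To see where the Euro sign sits in each code set, one can compare Python's built-in codecs for the three encodings. This is a sketch using Python's codec tables as a reference, on the assumption that they match the official mappings; it also lists which byte values ISO 8859-15 assigns differently from ISO 8859-1.

```python
euro = "\u20ac"

euro_in_latin9 = euro.encode("iso-8859-15")   # ISO 8859-15 byte for the Euro
euro_in_cp1252 = euro.encode("cp1252")        # CP1252 byte for the Euro

# In ISO 8859-1 the byte 0xA4 is the generic currency sign, U+00A4.
a4_in_latin1 = b"\xa4".decode("latin-1")

# Byte values whose meaning differs between ISO 8859-1 and ISO 8859-15.
differing = [
    b for b in range(0xA0, 0x100)
    if bytes([b]).decode("latin-1") != bytes([b]).decode("iso-8859-15")
]

print(euro_in_latin9, euro_in_cp1252, [hex(b) for b in differing])
```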

Another alternative would be to use UTF-8 and manually convert your
strings to that before passing them into the CORBA layer.
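
[Editorial illustration] The manual conversion suggested above amounts to transcoding at the application boundary. A minimal sketch follows, with hypothetical helper names (cp1252_to_utf8, utf8_to_cp1252); how the ORB at each end is told that string data is UTF-8 is outside the scope of this example.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Transcode application bytes (CP1252) to UTF-8 before sending."""
    return raw.decode("cp1252").encode("utf-8")


def utf8_to_cp1252(raw: bytes) -> bytes:
    """Transcode received UTF-8 bytes back to CP1252 for the application."""
    return raw.decode("utf-8").encode("cp1252")


price = b"Price: \x80 10"          # CP1252 bytes containing the Euro sign
wire = cp1252_to_utf8(price)       # what would be handed to the CORBA layer
restored = utf8_to_cp1252(wire)    # what the receiving application sees

print(wire, restored == price)
```

Note that utf8_to_cp1252 raises UnicodeEncodeError for characters CP1252 cannot represent, so a real application would need to decide how to handle those.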

If you want to make a CP1252 codeset for omniORB, you can automatically
generate the tables using bin/scripts/make8bitcs.py giving it input from
here:

http://www.unicode.org/Public/MAPPINGS/
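
[Editorial illustration] The mapping files at that location are plain text, with one byte value and one Unicode value per line. The sketch below shows how such a file can be turned into forward and reverse lookup tables. It is not make8bitcs.py and does not reproduce its output format; the sample lines are written in the style of those files for illustration, and parse_mapping is a hypothetical helper.

```python
# Illustrative sample in the style of an 8-bit mapping file:
# <byte in hex> <Unicode in hex> <# comment>; unassigned bytes have no
# Unicode column.
SAMPLE = """\
# Sample mapping lines (illustrative only)
0x7F\t0x007F\t#DELETE
0x80\t0x20AC\t#EURO SIGN
0x81\t      \t#UNDEFINED
0x82\t0x201A\t#SINGLE LOW-9 QUOTATION MARK
"""


def parse_mapping(text):
    """Return a dict mapping byte value -> Unicode code point."""
    to_unicode = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue                           # byte with no assignment
        to_unicode[int(fields[0], 16)] = int(fields[1], 16)
    return to_unicode


to_unicode = parse_mapping(SAMPLE)
from_unicode = {u: b for b, u in to_unicode.items()}

print(hex(to_unicode[0x80]), hex(from_unicode[0x20AC]))
```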

The DCE codeset ids come from here:

ftp://ftp.opengroup.org/pub/code_set_registry/code_set_registry1.2g.txt

Any volunteers to make a patch containing all the tables for the
additional ISO 8859 and Windows codesets in it?

Cheers,

Duncan.

--
-- Duncan Grisby --
-- ***@grisby.org --
-- http://www.grisby.org --