Peter Klotz
2013-12-03 09:24:18 UTC
Hello Duncan
During our tests we encountered several cases where omniORB's "validateUTF8" switch does not complain although it should.
The attached patch fixes the following cases:
* A code point beyond U+10FFFF is encountered (e.g. "\xf4\x90\x80\x80" which would be U+110000)
* Encoding of code points reserved for UTF-16 surrogate pairs, i.e. the forbidden range U+D800..U+DFFF (e.g. "\xed\xa0\x81" which would be U+D801)
* Overlong encodings. The Unicode Standard allows only the shortest representation (e.g. the four-byte encoding "\xf0\x82\x82\xac" decodes to the Euro sign U+20AC, whose only correct encoding is the three-byte sequence "\xe2\x82\xac")
* Encoding of 5-byte characters. To be compatible with UTF-16, only up to 4 bytes are allowed per UTF-8 character. In theory UTF-8 is extensible up to 6 bytes per character. omniORB has always thrown exceptions on 5-byte encodings in several methods in cs-UTF-8.cc but, interestingly, not in method TCS_C_UTF_8::validateString().
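For readers without the patch at hand, the checks listed above can be sketched roughly as follows. This is a minimal standalone illustration, not the code from the attached patch and not omniORB's validateString(); the function name isValidUtf8 is hypothetical and it does not use the "utf8Count" / "utf8Mask" tables.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (NOT omniORB code): returns true if `s` is
// well-formed UTF-8 according to the cases listed above.
bool isValidUtf8(const std::string& s) {
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    if (lead < 0x80) { ++i; continue; }  // ASCII: a single comparison

    std::size_t len;    // total bytes in this sequence
    std::uint32_t cp;   // code point being assembled
    std::uint32_t min;  // smallest code point that needs `len` bytes
    if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
    else return false;  // stray continuation byte or 5/6-byte lead byte

    if (n - i < len) return false;  // sequence truncated at end of string
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min) return false;                      // overlong encoding
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    i += len;
  }
  return true;
}
```

With this sketch, the example sequences from the list ("\xf4\x90\x80\x80", "\xed\xa0\x81", "\xf0\x82\x82\xac" and any sequence starting with a 5-byte lead such as 0xf8) are rejected, while "\xe2\x82\xac" is accepted.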
I also made minor fixes for symmetry in the lookup tables "utf8Count" and "utf8Mask", although omniORB always throws in these cases (regardless of whether "validateUTF8" is set).
The implementation is now in sync with the one in the Boost.Locale library (see https://svn.boost.org/svn/boost/trunk/boost/locale/utf.hpp).
Wikipedia (see https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences) says that a UTF-8 decoder (especially when validating) should be prepared for all of the cases above.
The performance impact should be minimal; for ASCII characters it is zero.
Regards, Peter.
Peter Klotz
Software Engineer
Tel: +43 512 89059-424
E-Mail: peter.klotz at ith-icoserve.com
_____________________________________
ITH icoserve technology for healthcare GmbH
a siemens company - H CX HS INT CES ITH
Innrain 98, 6020 Innsbruck, Österreich - www.ith-icoserve.com
Rechtsform: Gesellschaft mit beschränkter Haftung - Firmensitz: 6020 Innsbruck, Innrain 98
Firmenbuchnummer: FN 174117f - Firmenbuchgericht: Innsbruck - DVR: 0983039
-------------- next part --------------
A non-text attachment was scrubbed...
Name: omniORB-4.1.7-UTF8Validation.patch
Type: application/octet-stream
Size: 2748 bytes
Desc: omniORB-4.1.7-UTF8Validation.patch
URL: <http://www.omniorb-support.com/pipermail/omniorb-list/attachments/20131203/4fdb0e22/attachment.obj>