Discussion:
[omniORB] Strict UTF-8 validation patch
Peter Klotz
2013-12-03 09:24:18 UTC
Hello Duncan

During our tests we encountered several cases where omniORB's "validateUTF8" switch does not complain although it should.

The attached patch fixes the following cases:

* A codepoint beyond U+10FFFF is encountered (e.g. "\xf4\x90\x80\x80" which would be U+110000)
* Encoding of code points reserved for UTF-16 surrogate pairs, i.e. the forbidden range U+D800..U+DFFF (e.g. "\xed\xa0\x81" which would be U+D801)
* Overlong encodings. The Unicode Standard allows only the shortest representation (e.g. the four-byte sequence "\xf0\x82\x82\xac" decodes to the Euro sign U+20AC, but the only correct encoding is the three-byte sequence "\xe2\x82\xac")
* Encoding of 5-byte characters. To be compatible with UTF-16, only up to 4 bytes are allowed per UTF-8 character. In theory UTF-8 is extensible up to 6 bytes. omniORB has always thrown exceptions on 5-byte encodings in several methods in cs-UTF-8.cc but, interestingly, not in method TCS_C_UTF_8::validateString().
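
For readers without access to the attachment, the checks listed above can be sketched roughly as follows. This is not the patch itself and not omniORB's code: the function name isStrictUtf8 is hypothetical, and the sketch validates a std::string rather than using omniORB's lookup tables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of omniORB): returns true only if `s`
// is well-formed UTF-8 under the strict rules listed above.
bool isStrictUtf8(const std::string& s) {
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    if (lead < 0x80) { ++i; continue; }  // ASCII: no further checks needed

    std::size_t len;        // total bytes in this sequence
    std::uint32_t cp;       // code point being assembled
    std::uint32_t minCp;    // smallest code point allowed for this length
    if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; minCp = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; minCp = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; minCp = 0x10000; }
    else return false;  // stray continuation byte, or a 5/6-byte lead byte

    if (n - i < len) return false;  // sequence is truncated
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }

    if (cp < minCp) return false;                     // overlong encoding
    if (cp > 0x10FFFF) return false;                  // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
    i += len;
  }
  return true;
}
```

With this sketch, the three-byte Euro sign "\xe2\x82\xac" is accepted, while "\xf4\x90\x80\x80", "\xed\xa0\x81" and "\xf0\x82\x82\xac" from the list above are all rejected.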

I also made minor fixes for symmetry in the lookup tables "utf8Count" and "utf8Mask", although omniORB always throws in these cases (regardless of whether "validateUTF8" is set).

The implementation is now in sync with the one in the Boost.Locale library (see https://svn.boost.org/svn/boost/trunk/boost/locale/utf.hpp).

Wikipedia (see https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences) says that a UTF-8 decoder (especially when validating) should be prepared for all cases above.

The performance impact should be minimal; for ASCII characters it is zero.

Regards, Peter.


Peter Klotz
Software Engineer

Tel: +43 512 89059-424
E-Mail: peter.klotz at ith-icoserve.com
_____________________________________
ITH icoserve technology for healthcare GmbH
a siemens company - H CX HS INT CES ITH
Innrain 98, 6020 Innsbruck, Österreich - www.ith-icoserve.com
Rechtsform: Gesellschaft mit beschränkter Haftung - Firmensitz: 6020 Innsbruck, Innrain 98
Firmenbuchnummer: FN 174117f - Firmenbuchgericht: Innsbruck - DVR: 0983039


-------------- next part --------------
A non-text attachment was scrubbed...
Name: omniORB-4.1.7-UTF8Validation.patch
Type: application/octet-stream
Size: 2748 bytes
Desc: omniORB-4.1.7-UTF8Validation.patch
URL: <http://www.omniorb-support.com/pipermail/omniorb-list/attachments/20131203/4fdb0e22/attachment.obj>
Duncan Grisby
2014-01-06 17:01:30 UTC
Post by Peter Klotz
During our tests we encountered several cases where omniORBs
"validateUTF8" switch does not complain although it should.
Thanks, and sorry for taking ages to reply. I've reformatted it a bit to
fit with the existing coding style and checked it in to trunk in svn.

Duncan.
--
-- Duncan Grisby --
-- duncan at grisby.org --
-- http://www.grisby.org --
Peter Klotz
2014-01-16 08:40:06 UTC
Hello Duncan
Post by Duncan Grisby
Post by Peter Klotz
During our tests we encountered several cases where omniORBs
"validateUTF8" switch does not complain although it should.
Thanks, and sorry for taking ages to reply. I've reformatted it a bit to
fit with the existing coding style and checked it in to trunk in svn.
Thanks for applying the patch.

SVN trunk means it will be part of omniORB 4.2.x, right?

Do you have any plans for a 4.2 release?

Regards, Peter.
