Welcome to the Second Life Forums Archive

Changes in UTF-8 support with Mono

Strife Onizuka
Moonchild
Join date: 3 Mar 2004
Posts: 5,887
03-31-2008 22:49
Just a heads up as to what is happening (I'm not happy about it btw).

With Mono, support for Unicode is changing. Several years ago UTF-8 was cut down from its original range of about 2 billion possible character codes (0x00000000 to 0x7FFFFFFF) to only the first 1,114,112 (0x00000000 to 0x0010FFFF). This was done because that is all UTF-16 supports (which leads me to wonder why they designed UTF-16 that way in the first place). Internally Mono uses UTF-16.

What this means:
Support for 5 and 6 byte UTF-8 characters has been dropped, along with a sizable portion of the upper part of the 4 byte range.

Currently when Mono encounters a string with characters outside of the new UTF-8 range it nukes the string in a very unfriendly way.

Old UTF-8 (bytes per character ~ character code range)
1 ~ 0x00000000 -> 0x0000007F
2 ~ 0x00000080 -> 0x000007FF
3 ~ 0x00000800 -> 0x0000FFFF
4 ~ 0x00010000 -> 0x001FFFFF
5 ~ 0x00200000 -> 0x03FFFFFF
6 ~ 0x04000000 -> 0x7FFFFFFF

New UTF-8 (bytes per character ~ character code range)
1 ~ 0x00000000 -> 0x0000007F
2 ~ 0x00000080 -> 0x000007FF
3 ~ 0x00000800 -> 0x0000FFFF
4 ~ 0x00010000 -> 0x0010FFFF
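
For anyone who wants to poke at the two tables, here is a minimal Python sketch that maps a character code to its byte count under either scheme. The function name utf8_length and the legacy flag are made up for illustration; they are not part of LSL, Mono, or any library. It only mirrors the ranges listed above and ignores the reserved surrogate block (0xD800 to 0xDFFF).

```python
def utf8_length(code_point: int, legacy: bool = False) -> int:
    """Return how many bytes UTF-8 needs for code_point.

    legacy=True follows the old 1-to-6-byte table; otherwise the
    new table, which stops at 0x10FFFF, is used.
    """
    if code_point < 0:
        raise ValueError("character code must be non-negative")
    # Upper bound of each byte-length bucket, shortest first.
    bounds = [0x7F, 0x7FF, 0xFFFF]
    if legacy:
        bounds += [0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    else:
        bounds += [0x10FFFF]
    for length, bound in enumerate(bounds, start=1):
        if code_point <= bound:
            return length
    scheme = "old" if legacy else "new"
    raise ValueError(f"0x{code_point:X} is outside the {scheme} UTF-8 range")


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x10FFFF, 0x110000, 0x7FFFFFFF):
        try:
            print(hex(cp), "old:", utf8_length(cp, legacy=True), "new:", utf8_length(cp))
        except ValueError as err:
            print(hex(cp), "old:", utf8_length(cp, legacy=True), "new:", err)
```

Anything from 0x110000 upward still has a length under the old table but raises an error under the new one, which is the range this thread is about.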

http://jira.secondlife.com/browse/SVC-1960
http://jira.secondlife.com/browse/SVC-1414
_____________________
Truth is a river that is always splitting up into arms that reunite. Islanded between the arms, the inhabitants argue for a lifetime as to which is the main river.
- Cyril Connolly

Without the political will to find common ground, the continual friction of tactic and counter tactic, only creates suspicion and hatred and vengeance, and perpetuates the cycle of violence.
- James Nachtwey
Day Oh
Registered User
Join date: 3 Feb 2007
Posts: 1,257
03-31-2008 23:05
Subscribed; awaiting further understanding ._.
_____________________
Nika Talaj
now you see her ...
Join date: 2 Jan 2007
Posts: 5,449
04-01-2008 06:02
hmm .... Strife, more words please? Can you (or someone) say what the implications of this are?
Winter Ventura
Eclectic Randomness
Join date: 18 Jul 2006
Posts: 2,579
04-01-2008 07:03
yeah I'm with Nika... "Uhm, what?" I really only understood 1/3 of the words you used.

It sounds like maybe you should file a JIRA bug report?

http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
_____________________

● Inworld Store: http://slurl.eclectic-randomness.com
● Website: http://www.eclectic-randomness.com
● Twitter: @WinterVentura
Day Oh
Registered User
Join date: 3 Feb 2007
Posts: 1,257
04-01-2008 07:51
http://jira.secondlife.com/browse/SVC-1960
http://jira.secondlife.com/browse/SVC-1414
Having a question mark somewhere in the url makes them always work with img tags
_____________________
Strife Onizuka
Moonchild
Join date: 3 Mar 2004
Posts: 5,887
04-01-2008 08:15
What this means? It's a little early to say what the final shape of things will be. They only just made Mono scripts not crash when they encounter old-range UTF-8 characters. Currently they cope with them by replacing your (entire!) string with the message "String allocation error: Invalid UTF-8". I wish they had chosen a better message; it's going to be a bitch to catch.

This is actually kinda interesting because you could attach an invalid UTF-8 character to your string in your LSO script and no Mono script could read it.

I can't say if this will break anything, but if you were using characters in the removed range, you would know you were. If you don't understand what I'm talking about then you aren't going to be affected by this.
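
To see what an "old range" character looks like on the wire, here is a rough Python sketch. encode_legacy_utf8 and is_valid_current_utf8 are hypothetical helper names, and Python's built-in decoder is only standing in for a strict modern decoder; this is not how Mono or the simulator implements anything. It builds the old-style byte sequence by hand and then checks whether a decoder that follows the new range accepts it.

```python
def encode_legacy_utf8(code_point: int) -> bytes:
    """Encode one character code with the old 1-to-6-byte UTF-8 layout."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("outside the old UTF-8 range")
    if code_point <= 0x7F:
        return bytes([code_point])
    # (upper bound, lead-byte prefix, number of continuation bytes)
    layouts = [
        (0x7FF, 0xC0, 1),
        (0xFFFF, 0xE0, 2),
        (0x1FFFFF, 0xF0, 3),
        (0x3FFFFFF, 0xF8, 4),
        (0x7FFFFFFF, 0xFC, 5),
    ]
    for bound, prefix, tail in layouts:
        if code_point <= bound:
            lead = prefix | (code_point >> (6 * tail))
            # Each continuation byte carries 6 payload bits, high bits first.
            rest = [0x80 | ((code_point >> (6 * i)) & 0x3F)
                    for i in range(tail - 1, -1, -1)]
            return bytes([lead] + rest)
    raise AssertionError("unreachable: range was checked above")


def is_valid_current_utf8(data: bytes) -> bool:
    """True if Python's strict decoder (new range only) accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    in_range = encode_legacy_utf8(0x10FFFF)
    out_of_range = encode_legacy_utf8(0x200000)
    print(in_range.hex(), is_valid_current_utf8(in_range))
    print(out_of_range.hex(), is_valid_current_utf8(out_of_range))
    # One bad character makes the whole byte string fail strict decoding.
    print(is_valid_current_utf8(b"Hello " + out_of_range))
```

A strict decoder rejects the whole input when a single character is out of range, which is at least consistent with the "entire string replaced" behaviour described above.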
Deanna Trollop
BZ Enterprises
Join date: 30 Jan 2006
Posts: 671
04-01-2008 10:41
From: Strife Onizuka
If you don't understand what I'm talking about then you aren't going to be affected by this.
...Unless you're unknowingly using a script which was written by someone who does. ;)
Kidd Krasner
Registered User
Join date: 1 Jan 2007
Posts: 1,938
04-01-2008 13:02
From: Strife Onizuka
Internally Mono uses UTF-16.

Not UCS-2 or UCS-4? That seems odd. Or are UCS-2 and UTF-16 the same? I'd expect not, but I don't remember ever looking at encodings in UTF-16.

For everyone else:

Full Unicode takes four bytes, aka UCS-4. This is huge, and two bytes are enough for most cases, so we often see Unicode in two bytes (16 bits), called UCS-2. With both these systems, all characters are always the same size - either four bytes or two, respectively. But in UTF-8, characters can be anywhere from one to six bytes long. Usually UTF-8 is used for files and network traffic, while UCS-2 or UCS-4 is used internally within a program, so there's a conversion step to translate UTF-8 to UCS-2/4 or back. For English characters, the translation is easy, just zero-extend the one byte UTF-8 character to 2/4 bytes, or simply omit the high-order zero bytes when going back to UTF-8. For characters outside the Latin alphabet, the translation is more work.

So what Strife is saying is that this conversion used to handle all of Unicode but apparently won't anymore. This will matter if you have scripts with embedded characters that aren't in the new subset, or if you try to read notecards with such characters (if that's even possible). But perhaps the most significant case is if you're trying to get stuff from a web site, it might send you stuff that used to be perfectly legal UTF-8, but some characters in it are no longer valid UTF-8 under the new specification. But if you know the web site is dealing with European characters only, you should be ok.
Strife Onizuka
Moonchild
Join date: 3 Mar 2004
Posts: 5,887
04-02-2008 01:57
UTF-16 supersedes UCS-2. I'm not going to go into the details because the wikipedia article does an ok job of it.

http://en.wikipedia.org/wiki/UTF-16
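
The short version of the difference, as a hedged Python sketch: UCS-2 is always one 16-bit unit per character, while UTF-16 uses a pair of units (a surrogate pair) for anything above 0xFFFF. The helper name to_utf16_units is invented for illustration. The pair only carries 20 bits of payload, which is where the 0x10FFFF ceiling comes from: 0x10000 + 2^20 = 0x110000 = 1,114,112 character codes.

```python
def to_utf16_units(code_point: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one character code."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("UTF-16 cannot represent codes above 0x10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("0xD800 to 0xDFFF is reserved for surrogates")
    if code_point <= 0xFFFF:
        # Same as UCS-2: a single 16-bit unit.
        return [code_point]
    offset = code_point - 0x10000           # fits in 20 bits
    high = 0xD800 | (offset >> 10)          # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)         # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    for cp in (0x41, 0xFFFF, 0x10000, 0x10FFFF):
        print(hex(cp), [hex(unit) for unit in to_utf16_units(cp)])
    # The total number of representable character codes:
    print(0x10000 + 2 ** 20)
```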

In all honesty, there aren't many applications for unassigned characters (that being every character outside the new range).

Mind you, when aliens invade the earth I'll have no sympathy for the ISO committee's shortsightedness.
Ollj Oh
Registered User
Join date: 28 Aug 2007
Posts: 522
04-02-2008 02:22
sad.

This means:
- All printable ASCII characters, from the 33rd to the 127th (including all Roman letters), are 16 bits in UTF-16 instead of 8 bits in UTF-8.
- This ultimately halves the maximum length of simple Roman-only strings, while other glyphs (Far East and Middle East glyphs) are barely affected, for better or worse!
- If you use UTF-8 encoding to store data in a prim's name/description or for compressed data transmission, that data will now be interpreted differently (or will need two different decoders and a switch to tell the format).
- It seems a string of UTF-16 encoded glyphs can store far fewer bits than a string of UTF-8 glyphs.

The purpose of UTF-8 is better backwards compatibility with the 8-bit era; it can run (faster) on hardware from the 1980s and has a bias toward Roman letters.
If you drop that compatibility, the code can be slightly faster on hardware from the 1990s, but that speed boost can cost some memory.
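
The size part of this is easy to check outside Second Life. Below is a small Python sketch (encoded_sizes is a made-up helper name) that compares the encoded size of the same text in UTF-8 and UTF-16. It only measures bytes; how Mono scripts are actually charged for string memory is not stated anywhere in this thread, so treat the link to script limits as an assumption.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for text.

    "utf-16-le" is used so that no byte-order mark is counted.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


if __name__ == "__main__":
    samples = {
        "Roman letters": "Hello",
        "Arabic": "مرحبا",
        "Japanese": "日本語",
    }
    for label, text in samples.items():
        utf8_size, utf16_size = encoded_sizes(text)
        print(f"{label}: {utf8_size} bytes in UTF-8, {utf16_size} bytes in UTF-16")
```

Roman-only text doubles in size under UTF-16, Arabic comes out the same, and Japanese text actually shrinks, since those characters take 3 bytes each in UTF-8 but 2 in UTF-16.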
Haravikk Mistral
Registered User
Join date: 8 Oct 2005
Posts: 2,482
04-02-2008 03:19
As I've commented in the relevant JIRA issues, LL needs to focus on compatibility. What is the point of Mono if some scripts can't be upgraded easily? Really, Mono should aim to ultimately replace LSL, maybe not in the near future but further down the line.

This means that LL either needs to use UTF-8 internally, or convert between UTF-8 and UTF-16 at all entry and exit points to the script, such as link-messages, llHTTPRequest() etc.
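
As a rough illustration of converting at the entry and exit points, here is a Python sketch. The names decode_at_boundary and encode_at_boundary are hypothetical, and Python's str is only standing in for the script's internal string type; nothing here reflects LL's actual code. The idea is that bytes which don't fit the new range are swapped for the replacement character U+FFFD on the way in, so the rest of the string survives instead of being thrown away.

```python
REPLACEMENT = "\ufffd"  # the standard Unicode replacement character


def decode_at_boundary(data: bytes) -> str:
    """Convert incoming UTF-8 bytes to an internal string.

    Invalid or out-of-range sequences become U+FFFD rather than
    causing the whole string to be rejected.
    """
    return data.decode("utf-8", errors="replace")


def encode_at_boundary(text: str) -> bytes:
    """Convert an internal string back to UTF-8 on the way out."""
    return text.encode("utf-8")


if __name__ == "__main__":
    # b"\xf8\x88\x80\x80\x80" is the old 5-byte form of 0x200000.
    incoming = b"ok \xf8\x88\x80\x80\x80 end"
    text = decode_at_boundary(incoming)
    print(repr(text))
    print(REPLACEMENT in text)
    print(encode_at_boundary("plain text"))
```

This loses the original out-of-range characters, so it is a compatibility trade-off rather than a fix: old scripts would keep running, but data packed into the removed range would still not round-trip.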
_____________________
Computer (Mac Pro):
2 x Quad Core 3.2ghz Xeon
10gb DDR2 800mhz FB-DIMMS
4 x 750gb, 32mb cache hard-drives (RAID-0/striped)
NVidia GeForce 8800GT (512mb)