Welcome to the Second Life Forums Archive

Detecting size of UTF8 strings

Deanfred Brandeis
one who programs
Join date: 20 Aug 2006
Posts: 20
04-07-2008 16:00
I'm just going to go through my logic here for how I would do this, and since it's not working, someone please correct me where I'm wrong. :)

(I'm pretty sure my logic is failing in the base64 conversion.)

1) LSL integers are 32 bits or 4 bytes long.
2) By running llBase64ToInteger(llStringToBase64(char)), where char is a single UTF8 character, I should get an accurate bit-for-bit translation unless the character is over 4 bytes long (which is unlikely).
3) Testing each byte's high bit should tell me how many bytes are in the string.

This is the basic logic of what I'm doing for a single character, but it's not even close to working (assuming the string s is >= 1 character long):

string char = llGetSubString(s, 0, 0); // Get the first character.
integer ichar = llBase64ToInteger(llStringToBase64(char)); // Get the byte-for-byte translation of the char, like casting to an int would do in C.
integer size = 1;

while ((ichar = ichar >> 8) & 0x80)
    size++;

// Now size should give the number of bytes in the character.

I have a suspicion that having no unsigned integers has something to do with why this isn't working--as well as the base64 conversions. Any thoughts?
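
For reference, UTF-8 records a character's total length in its first (lead) byte: 0xxxxxxx means 1 byte, 110xxxxx means 2, 1110xxxx means 3, and 11110xxx means 4. Below is a minimal sketch of reading that lead byte in LSL. It assumes llEscapeURL renders each non-ASCII byte as a %XX pair, and the name charBytesFromLead is made up for illustration, so treat it as one possible approach rather than a tested answer.

```lsl
// Sketch only: charBytesFromLead is a hypothetical helper, not from the thread.
// Assumes llEscapeURL renders every non-ASCII byte of the UTF-8 encoding as %XX.
integer charBytesFromLead(string s)
{
    if (s == "")
        return 0;                                  // nothing to measure

    string esc = llEscapeURL(llGetSubString(s, 0, 0));
    if (llGetSubString(esc, 0, 0) != "%")
        return 1;                                  // left unescaped, so plain ASCII

    // Parse the two hex digits after '%' to get the lead byte.
    integer lead = (integer)("0x" + llGetSubString(esc, 1, 2));

    if ((lead & 0x80) == 0x00) return 1;           // 0xxxxxxx: escaped ASCII, e.g. a space
    if ((lead & 0xE0) == 0xC0) return 2;           // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;           // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;           // 11110xxx
    return 0;                                      // continuation or invalid lead byte
}

default
{
    state_entry()
    {
        llOwnerSay((string)charBytesFromLead("résumé")); // first character is "r": 1
        llOwnerSay((string)charBytesFromLead("é"));      // lead byte 0xC3: 2
    }
}
```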
Hewee Zetkin
Registered User
Join date: 20 Jul 2006
Posts: 2,702
04-07-2008 22:19
Well, note that llBase64ToInteger() chokes (buffer overrun?) if it gets fewer than 6 characters, and returns zero if it gets more than 8. I assume that means it expects that the padding (trailing '=' characters) might be truncated, but whatever. Now a single-byte character would actually be encoded as 4 Base64 characters (e.g. the 'v' character is binary '01110110', which would get encoded as 'dg=='), so it falls below that 6-character minimum.

I would simply count the number of characters and trailing pad characters in the result of the llStringToBase64() conversion of your character. Something like:

CODE

integer getStrBytes(string str)
{
    string encoding = llStringToBase64(str);
    integer encodingLen = llStringLength(encoding);
    integer paddingIndex = llSubStringIndex(encoding, "=");

    integer bytes = 3*encodingLen/4;
    if (paddingIndex < 0)
    {
        return bytes;
    } else
    {
        return bytes-(encodingLen-paddingIndex);
    }
}

integer getCharBytes(string char)
{
    if (llStringLength(char) > 1)
    {
        return getStrBytes(llGetSubString(char, 0, 0));
    } else
    {
        return getStrBytes(char);
    }
}
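
To make the padding arithmetic concrete, here is a small self-contained sketch that repeats the counting logic in compact form and runs it on a few strings. It assumes llStringToBase64 encodes the string as UTF-8. The byte counts in the comments come from working the formula by hand, not from an in-world run.

```lsl
// Compact copy of the counting logic so this script compiles on its own.
integer getStrBytes(string str)
{
    string encoding = llStringToBase64(str);
    integer encodingLen = llStringLength(encoding);
    integer paddingIndex = llSubStringIndex(encoding, "=");

    // Every 4 Base64 characters carry 3 bytes; each '=' stands for one missing byte.
    integer bytes = 3 * encodingLen / 4;
    if (paddingIndex < 0)
        return bytes;
    return bytes - (encodingLen - paddingIndex);
}

default
{
    state_entry()
    {
        // "v" -> "dg==": 3*4/4 = 3, minus 2 pad characters = 1 byte
        llOwnerSay((string)getStrBytes("v"));
        // "é" is two bytes (C3 A9) -> "w6k=": 3 minus 1 pad character = 2 bytes
        llOwnerSay((string)getStrBytes("é"));
        // "résumé": 4 ASCII bytes + 2 * 2 bytes = 8 bytes (12 Base64 characters, 1 pad)
        llOwnerSay((string)getStrBytes("résumé"));
    }
}
```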
Deanfred Brandeis
one who programs
Join date: 20 Aug 2006
Posts: 20
04-08-2008 19:25
That's a really good solution, actually. I had a similar idea, which was to use llEscapeURL, but it ends up being much slower. (Time trials below.)

My second attempt at a solution:

integer strsize1(string s)
{
    s = llEscapeURL(s);
    integer slen = llStringLength(s);
    integer size = 0;

    integer i = 0;
    while (i < slen)
    {
        string sub = llGetSubString(s, i, i);
        if (sub == "%")
            i += 3;
        else
            i += 1;
        size++;
    }

    return size;
}

Your solution:

integer strsize2(string s)
{
    s = llStringToBase64(s);
    integer slen = llStringLength(s);
    integer size = 3 * slen / 4;
    integer padi = llSubStringIndex(s, "=");

    if (padi == -1)
        return size;

    return size - (slen - padi);
}

I also ran several time trials of these functions on the string "résumé". Using llStringLength's running time as a base value, here are the results of 4,000 runs:

llStringLength: 1x (average: 1.7387e-3 seconds/call)
strsize1: 24.018x (average: 41.7612e-3 s/call)
strsize2: 4.256x (average: 7.3997e-3 s/call)
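
The harness behind these numbers isn't shown, so the following is only a guess at how such a trial could be set up, using llResetTime and llGetTime around a fixed number of calls. RUNS and the variable names are assumptions for illustration. The loop overhead is included in both measurements, so the ratio is approximate.

```lsl
// Hypothetical timing harness; RUNS and the test string mirror the figures quoted above.
integer RUNS = 4000;

integer strsize2(string s)
{
    s = llStringToBase64(s);
    integer slen = llStringLength(s);
    integer size = 3 * slen / 4;
    integer padi = llSubStringIndex(s, "=");

    if (padi == -1)
        return size;

    return size - (slen - padi);
}

default
{
    state_entry()
    {
        string test = "résumé";
        integer i;

        // Baseline: the built-in length call.
        llResetTime();
        for (i = 0; i < RUNS; ++i)
            llStringLength(test);
        float baseTime = llGetTime();

        // Candidate: the Base64-based byte count.
        llResetTime();
        for (i = 0; i < RUNS; ++i)
            strsize2(test);
        float candTime = llGetTime();

        llOwnerSay("llStringLength: " + (string)(baseTime / RUNS) + " s/call");
        llOwnerSay("strsize2: " + (string)(candTime / RUNS) + " s/call");
        if (baseTime > 0.0)
            llOwnerSay("ratio: " + (string)(candTime / baseTime) + "x");
    }
}
```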