Welcome to the Second Life Forums Archive

list llSegmentString(string input, integer size, integer sizeIsInBytes)

Strife Onizuka
Moonchild
Join date: 3 Mar 2004
Posts: 5,887
11-06-2006 14:48
list llSegmentString(string input, integer size, integer sizeIsInBytes)

When sizeIsInBytes is FALSE:
Returns a list that, when typecast into a string, equals "input" for any value of "size":
input == (string)llSegmentString(input, size, FALSE)
Each segment in the resulting list is no longer than "size" characters.

When sizeIsInBytes is TRUE:
If "size" is too small to hold the longest character in the string, an empty list is returned; otherwise it returns a list that, when typecast into a string, equals "input" for any value of "size".
Each segment in the resulting list is no longer than "size" bytes.

Why this is needed & how it is useful:
When cutting a Unicode string to be sent by chat, instant message or email, it is incredibly hard to know where to segment the string so that chat does not lose parts of it. Before UTF-8 support was added to LSL it was easy, because characters == bytes, but UTF-8 supports multi-byte characters. There are no functions in LSL that specifically address this, and getting around this limitation in LSL is *very* cumbersome. Including support for segmenting the string by characters instead of by bytes is a no-brainer and would be useful in certain situations.
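
To illustrate the byte/character mismatch described above, here is a minimal C sketch (not LSL, and not part of the proposal itself). The helper name utf8_char_count is hypothetical, and the code assumes its input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count UTF-8 characters by counting every byte that is
 * NOT a continuation byte (continuation bytes look like 10xxxxxx). */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    return count;
}
```

For plain ASCII, strlen and utf8_char_count agree; as soon as the string contains a multi-byte character, strlen reports more bytes than there are characters, which is why a byte limit cannot simply be treated as a character limit.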

Some rough C code showing how to implement it, written and released by me into the public domain. The logic of the code is good; the syntax may not quite parse.
CODE

list llSegmentString(char * input, int size, int byte)
{//public domain, written by Strife Onizuka
    list stack = [];
    if(size > 0)
    {
        if(input == NULL || input[0] == '\x00')
            return [""];
        int pos = 0;
        if(byte)
        {
            int length = strlen(input);
            if(size >= length)
                return [input];
            char * buffer = malloc(size + 1);
            while(pos < length)
            {
                int len = size;
                if(len > length - pos)
                    len = length - pos;
                else
                    while((input[len + pos] & 0xC0) == 0x80)//is it the middle of a utf8 character?
                        --len;//back up the end pos until the cut lands on a char start.
                if(len)
                    memcpy(buffer, input + pos, len);
                else
                {//not enough space to hold a full char, most likely they specified a value for "size" less than 6
                    free(buffer);//don't leak the buffer on the early return
                    return [];
                }
                buffer[len] = '\x00';//make the buffer a null terminated string.
                stack += [buffer];
                pos += len;
            }
            free(buffer);
        }
        else
        {//this could be optimized to not require llStringLength, by using strncpy
            int length = llStringLength(input);
            if(size >= length)
                return [input];
            while(pos < length)
            {
                stack += [llGetSubString(input, pos, pos + size - 1)];
                pos += size;
            }
        }
    }
    return stack;
}
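
The boundary test at the heart of the byte mode can be tried in isolation. The following is a minimal, compilable C sketch of only that step, not the full function. The name next_segment_len is hypothetical, the code assumes valid UTF-8, and the extra len > 0 guard is an addition for safety on malformed input.

```c
#include <stddef.h>

/* Hypothetical helper: byte length of the next segment starting at s, given
 * that `remaining` bytes of input are left and a segment may hold at most
 * `size` bytes. Returns 0 when not even one whole character fits. */
size_t next_segment_len(const char *s, size_t remaining, size_t size)
{
    size_t len = size;
    if (len >= remaining)
        return remaining;   /* the rest of the string fits in one segment */
    /* s[len] is the first byte after the proposed cut. While it is a
     * continuation byte (10xxxxxx), the cut would split a character,
     * so move the cut back one byte at a time. */
    while (len > 0 && ((unsigned char)s[len] & 0xC0) == 0x80)
        --len;
    return len;
}
```

A caller would loop, advancing by the returned length each time, and treat a return value of 0 as the "size is too small" case that yields an empty list.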
_____________________
Truth is a river that is always splitting up into arms that reunite. Islanded between the arms, the inhabitants argue for a lifetime as to which is the main river.
- Cyril Connolly

Without the political will to find common ground, the continual friction of tactic and counter tactic, only creates suspicion and hatred and vengeance, and perpetuates the cycle of violence.
- James Nachtwey
Strife Onizuka
Moonchild
Join date: 3 Mar 2004
Posts: 5,887
12-20-2006 21:56
Two functions that make good use of llSegmentString and would benefit from its inclusion.

CODE

//The TightListType (TLT) class of functions is used to encode lists & decode strings, such that the list can be regenerated undamaged.
//TLT Parse uses llSegmentString to break the TLT header into a list so it can be used as separators for parsing the main body into a list.
list TightListTypeParse(string input) {
    if(llStringLength(input) < 7) return [];//invalid or null list, return an empty list
    string seperators = llGetSubString(input,0,5);//first 6 bytes are the separator list.

    list partial = llDeleteSubList( llParseStringKeepNulls( llDeleteSubString(input,0,5), [], llSegmentString(seperators, 1, FALSE)),0,0);//skip the first pos, it's null.

    //([] != partial) is the same as -llGetListLength(partial), but is faster and saves bytecode.
    integer pos = ([] != partial);//makes it even in length
    list current;//a buffer
    integer type = 0;//llGetListEntryType() - 1; we could use the TYPE_* constants but it would cost more,
    integer sub_pos = 0;//temp position value, makes the code faster and saves us a few bytes.
    do//we use -~x in two places, it is the same as 1+x, it saves in bytecode (and is faster, not sure about Mono)
    {
        current = [input=llList2String(partial, sub_pos = -~pos)];//TYPE_STRING || TYPE_INVALID (though we don't care about invalid)
        if(!(type = llSubStringIndex(seperators, llList2String(partial,pos))))//TYPE_INTEGER
            current = [(integer)input];
        else if(type == 1)//TYPE_FLOAT
            current = [(float)input];
        else if(type == 3)//TYPE_KEY
            current = [(key)input];
        else if(type == 4)//TYPE_VECTOR
            current = [(vector)input];
        else if(type == 5)//TYPE_ROTATION
            current = [(rotation)input];
        partial = llListReplaceList(partial, current, pos, sub_pos);
    }while((pos = -~sub_pos));//as long as it's not zero...
    return partial;
}

//Escape is a function used to replace characters (or strings of characters) with C style escapes.
//It uses llSegmentString to get around the return byte limits of both llEscapeURL and llUnescapeURL
string Escape(string input)
{//Use with Unescape
    list groups = llSegmentString(input, 84, TRUE);
    integer position = ([] != groups);//negative index of the first entry.
    string result;

    do
    {
        string temp = llList2String(groups, position);
        //We do the text replacement in here because we can save a lot of cpu time by not copying
        //the full string onto and off of the stack many times.
        temp = str_replace(str_replace(str_replace(str_replace(llEscapeURL(temp), "%5C","\\\\"),"%0D", "\\n"),"%0A", "\\n"), "%22", "\\q");
        integer loop = 6;

        //converts utf-8 multi-byte & chars < 32 into raw escaped string, easier than building \u codes.
        do//to do this, find the chars to escape, figure out length, strip the "%" and make it a \h code
        {//~-x is the same as (x - 1)
            string search = "%"+llGetSubString("01CDEF", loop = ~-loop,loop);
            @loop;//instead of a while loop, saves 5 bytes (and runs faster).
            integer pos = llSubStringIndex(temp, search);
            if(~pos)
            {
                integer char = (integer)("0x"+llGetSubString(temp, -~pos, pos + 2));//get the first byte of the utf-8 char
                integer len = (0xFC <= char) + (0xF8 <= char) + (0xF0 <= char) + (0xE0 <= char) - ~(0xC0 <= char);//calculate the char's length
                integer end = pos + ~(len * -3);
                //using the length get the char, strip the "%" and replace the char with the hex escape code version.
                temp = llInsertString(llDeleteSubString(temp, pos, end), pos, "\\h" +
                    (string)len + (string)(llParseString2List(llGetSubString(temp, pos, end), ["%"], [])));
                jump loop;//then loop back to see if there is another char.
            }//no more chars of this type, try another code
        }while(loop);
        result += llUnescapeURL(temp);//store the results into the result buffer,
    }while((position = -~position));//advance to the next entry; stops once the negative index reaches zero
    return str_replace(result, "\t", result = "\\t");//should be safe to feed to a notecard now.
}

//This function does multiple string replaces and is used by Escape
string str_replace(string str, string from, string to)
{
    integer len = ~-llStringLength(from);
    if(~len)
    {
        string buffer = str;
        integer b_pos = -1;
        integer to_len = ~-llStringLength(to);
        @loop;//instead of a while loop, saves 5 bytes (and runs faster).
        integer to_pos = ~llSubStringIndex(buffer, from);
        if(to_pos)
        {
//            b_pos -= to_pos;
//            str = llInsertString(llDeleteSubString(str, b_pos, b_pos + len), b_pos, to);
//            b_pos += to_len;
//            buffer = llGetSubString(str, -~b_pos, 0x8000);
            buffer = llGetSubString(str = llInsertString(llDeleteSubString(str, b_pos -= to_pos, b_pos + len), b_pos, to), -~(b_pos += to_len), 0x8000);
            jump loop;
        }
    }
    return str;
}
Argent Stonecutter
Emergency Mustelid
Join date: 20 Sep 2005
Posts: 20,263
I have a better proposal.
12-21-2006 10:39
They need to change the limits in SL that you're dealing with - the limits in chat and email and so on - to be in characters rather than bytes.

There's really no excuse for expecting people working in a high level language that operates in Unicode to have to deal with bytes at all.
Strife Onizuka
Moonchild
Join date: 3 Mar 2004
Posts: 5,887
12-21-2006 10:59
This is for segmenting strings not just by bytes but also by characters.

Yes, making the limits consistent would be nice, but it is impractical. People would start encoding their data in Unicode characters to shove more data downstream. To offset this, LL would lower the limits; our new 1023-byte limit for chat would be reduced to 170 characters. Either way, you could still use this function to quickly and painlessly split your strings.

This could also be useful if an external application that you interact with via email or HTTP/XML-RPC has specific byte limits.

Another example function:
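
The 170 figure is presumably the 1023-byte limit divided by the 6-byte worst case of the original UTF-8 design (the same 6 that appears in the comment of the C code earlier in the thread). That reading is an assumption, and the constant and function names below are hypothetical; a minimal C sketch of the arithmetic:

```c
/* Assumed values: the chat limit quoted in the post and the longest
 * sequence allowed by the original UTF-8 definition. */
enum { CHAT_LIMIT_BYTES = 1023, MAX_BYTES_PER_CHAR = 6 };

/* Largest character count that is guaranteed to fit in the byte limit
 * even if every character takes the maximum number of bytes. */
int worst_case_char_limit(void)
{
    return CHAT_LIMIT_BYTES / MAX_BYTES_PER_CHAR;   /* integer division */
}
```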

CODE

OwnerSay(string a)
{
    list b = llSegmentString(a, 1023, TRUE);
    integer c = ([] != b);
    do
        llOwnerSay(llList2String(b,c));
    while(++c);
}
Learjeff Innis
musician & coder
Join date: 27 Nov 2006
Posts: 817
04-13-2007 12:18
Excellent suggestion.

I recommend making the description clearer to someone who doesn't already know what it's for, something like this:

list llSegmentString(string input, integer size, integer sizeIsInBytes)

Return a list consisting of input chopped into segment strings, where each segment is no larger than size, and where the returned list matches input when typecast to string.

If sizeIsInBytes is False, the size parameter is interpreted as a size in characters. Otherwise, it is interpreted as a size in bytes. In the latter case, if any character is larger than size, an empty list is returned.
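
For the character-counted case in this description, a minimal C sketch of finding where one segment ends might look like the following. The name char_segment_len is hypothetical, and the code assumes valid, NUL-terminated UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: byte length of the first `size` UTF-8 characters of s,
 * or of the whole string if it has fewer characters than that. */
size_t char_segment_len(const char *s, size_t size)
{
    size_t len = 0;     /* bytes consumed so far */
    size_t chars = 0;   /* characters consumed so far */
    while (s[len] != '\0') {
        if (((unsigned char)s[len] & 0xC0) != 0x80) {   /* start of a character */
            if (chars == size)
                break;  /* the next character would exceed the limit */
            ++chars;
        }
        ++len;
    }
    return len;
}
```

Unlike the byte-counted case, this can never fail for a size of 1 or more, because a whole character always fits when the limit is counted in characters.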
Learjeff Innis
musician & coder
Join date: 27 Nov 2006
Posts: 817
04-13-2007 19:55
Strife, I don't understand your script:

CODE
OwnerSay(string a)
{
    list b = llSegmentString(a, 1023, TRUE);
    integer c = ([] != b);
    do
        llOwnerSay(llList2String(b,c));
    while(++b);
}


First, I have to assume that "++" on a list removes the first element from the list, presumably returning True if the list is nonempty and False if it's empty. Hmm, interesting. That sure would be nice to have documented somewhere. (Of course, any kind of professional quality documentation on LSL would be nice, but unfortunately LL doesn't take LSL documentation seriously.)

Second, why do you never print the first element of the list? Specifically, the statement:

CODE
integer c = ([] != b);


sets c to True unless b is the empty list. If b is the empty list, then it appears to call llOwnerSay(""); which presumably is harmless.

If b is not the empty list, then the first call to llOwnerSay() will print the results of llList2String(b,1), which is the second element in the list.

Or am I misreading the code somehow?

Thanks,
Jeff
Strife Onizuka
Moonchild
Join date: 3 Mar 2004
Posts: 5,887
07-27-2007 17:38
Sorry about that, I meant ++c, not ++b.

The not-equal operator on lists returns the difference between the lengths of the lists. "([] != b)" is equivalent to "(llGetListLength([]) - llGetListLength(b))" or just "-llGetListLength(b)" but is much faster.
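
The loop pattern that results (start at minus the length, count up until zero) can be sketched in C as follows. LSL accepts negative list indices that count from the end; C does not, so this sketch adds the length back when indexing. The function name is hypothetical, and the empty-input guard is an addition that is not in the LSL version.

```c
#include <stddef.h>

/* Copies items[0..count-1] into out[] in order, driven by a negative counter
 * that climbs to zero, mirroring "integer c = ([] != b); do ... while(++c);".
 * Returns the number of items visited. */
size_t visit_with_negative_index(const int *items, size_t count, int *out)
{
    size_t visited = 0;
    long c = -(long)count;      /* same value as ([] != b) in LSL */
    if (count == 0)
        return 0;               /* guard added for this sketch */
    do {
        out[visited++] = items[(long)count + c];  /* count + c runs 0, 1, 2, ... */
    } while (++c);              /* stops when the counter reaches zero */
    return visited;
}
```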
Learjeff Innis
musician & coder
Join date: 27 Nov 2006
Posts: 817
07-28-2007 05:40
Ah, interesting -- thanks for the tip! :)