list llSegmentString(string input, integer size, integer sizeIsInBytes)


Strife Onizuka Moonchild Join date: 3 Mar 2004 Posts: 5,887	11-06-2006 14:48
list llSegmentString(string input, integer size, integer sizeIsInBytes)

When sizeIsInBytes is FALSE:
Returns a list that, when typecast into a string, equals "input" for any value of "size":
input == (string)llSegmentString(input, size, FALSE)
Each segment in the resulting list is no longer than "size" characters.

When sizeIsInBytes is TRUE:
If the size is too small to hold the longest character in the string, a null (empty) list is returned; otherwise returns a list that, when typecast into a string, equals "input". Each segment in the resulting list is no longer than "size" bytes.

Why this is needed & how it is useful:
When cutting a Unicode string to be sent by chat, instant message or email, it is incredibly hard to know where to segment the string so that chat does not lose parts of it. Before UTF-8 support was added to LSL it was easy, because characters == bytes, but UTF-8 supports multi-byte characters. There are no functions in LSL that address this, and working around the limitation in LSL is very cumbersome. Including support to segment the string by characters instead of by bytes is a no-brainer and would be useful in certain situations.

Some rough C code as to how to implement it; written & released by me into the public domain. The logic of the code is good, the syntax may not quite parse.

CODE
list llSegmentString(char * input, int size, int byte)
{//public domain, written by Strife Onizuka
    list stack = [];
    if(size > 0)
    {
        if(input == NULL || input[0] == '\x00')
            return [""];
        int pos = 0;
        if(byte)
        {
            int length = strlen(input);
            if(size >= length )
                return [input];
            char * buffer = malloc(size + 1);
            while(pos < length)
            {
                int len = size;
                if(len > length - pos)
                    len = length - pos;
                else
                    while((input[len + pos] & 0xC0) == 0x80)//is it the middle of a utf8 character?
                        --len;//back up the end pos, till the byte after the cut is a char start.
                if(len)
                    memcpy(buffer, input + pos, len);
                else
                {//not enough space to hold a full char, most likely they specified a value for "size" less than 6
                    free(buffer);//release the scratch buffer before bailing out
                    return [];
                }
                buffer[len] = '\x00';//make the buffer a null terminated string.
                stack += [buffer];
                pos += len;
            }
            free(buffer);
        }
        else
        {//this could be optimized to not require llStringLength, by using strncpy
            int length = llStringLength(input);
            if(size >= length )
                return [input];
            while(pos < length)
            {
                stack += [llGetSubString(input, pos, pos + size - 1)];
                pos += size;
            }
        }
    }
    return stack;
}

_____________________
Truth is a river that is always splitting up into arms that reunite. Islanded between the arms, the inhabitants argue for a lifetime as to which is the main river. - Cyril Connolly
Without the political will to find common ground, the continual friction of tactic and counter tactic, only creates suspicion and hatred and vengeance, and perpetuates the cycle of violence. - James Nachtwey
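The rough code in the post above mixes C with LSL's list type, so it will not compile as written. The following is a minimal sketch of the byte-mode branch in plain C, assuming the result is returned as a malloc'd array of strings. The names segment_bytes and free_segments are hypothetical and are not part of LSL or of the original proposal.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: frees an array returned by segment_bytes. */
void free_segments(char **pieces, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        free(pieces[i]);
    free(pieces);
}

/* Hypothetical helper, not an LSL function: splits the UTF-8 string `input`
 * into pieces of at most `size` bytes without cutting a multi-byte character.
 * Returns a malloc'd array of malloc'd strings and stores the number of
 * pieces in *count. Returns NULL with *count == 0 when a character does not
 * fit into `size` bytes, when size is 0, or when allocation fails. */
char **segment_bytes(const char *input, size_t size, size_t *count)
{
    *count = 0;
    if (input == NULL || size == 0)
        return NULL;

    size_t length = strlen(input);
    /* Every piece holds at least one byte, so `length` pieces is the upper
     * bound; +1 leaves room for the single empty piece of an empty input. */
    char **pieces = malloc((length + 1) * sizeof *pieces);
    if (pieces == NULL)
        return NULL;

    size_t pos = 0;
    do {
        size_t len = size;
        if (len > length - pos) {
            len = length - pos;  /* the remaining tail fits as it is */
        } else {
            /* Back up while the byte after the cut is a continuation byte. */
            while (len > 0 && ((unsigned char)input[pos + len] & 0xC0) == 0x80)
                --len;
        }
        /* len == 0 on non-empty input means one character is wider than size. */
        char *piece = (len > 0 || length == 0) ? malloc(len + 1) : NULL;
        if (piece == NULL) {
            free_segments(pieces, *count);
            *count = 0;
            return NULL;
        }
        memcpy(piece, input + pos, len);
        piece[len] = '\0';
        pieces[(*count)++] = piece;
        pos += len;
    } while (pos < length);

    return pieces;
}
```

Calling segment_bytes("a\xC3\xA9", 2, &n) yields the two pieces "a" and "é", while a size of 1 on "é" alone yields NULL, mirroring the empty list described in the post.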
Strife Onizuka Moonchild Join date: 3 Mar 2004 Posts: 5,887	12-20-2006 21:56
Two functions that make good use of llSegmentString and would benefit from its inclusion.

CODE
//The TightListType (TLT) class of functions is used to encode lists & decode strings, such that the list can be regenerated undamaged.
//TLT Parse uses llSegmentString to break the TLT header into a list so it can be used as separators for parsing the main body into a list.
list TightListTypeParse(string input)
{
    if(llStringLength(input) < 7)
        return [];//invalid or null list, return an empty list
    string seperators = llGetSubString(input,0,5);//first 6 bytes are the separator list.
    list partial = llDeleteSubList(
        llParseStringKeepNulls(
            llDeleteSubString(input,0,5),
            [],
            llSegmentString(seperators, 1, FALSE)),0,0);//skip the first pos, it's null.
    //([] != partial) is the same as -llGetListLength(partial), but is faster and saves bytecode.
    integer pos = ([] != partial);//makes it even in length
    list current;//a buffer
    integer type = 0;//llGetListEntryType() - 1; we could use the TYPE_* constants but it would cost more,
    integer sub_pos = 0;//temp position value, makes the code faster and saves us a few bytes.
    do//we use -~x in two places, it is the same as 1+x, it saves in bytecode (and is faster, not sure about Mono)
    {
        current = [input=llList2String(partial, sub_pos = -~pos)];//TYPE_STRING || TYPE_INVALID (though we don't care about invalid)
        if(!(type = llSubStringIndex(seperators, llList2String(partial,pos))))//TYPE_INTEGER
            current = [(integer)input];
        else if(type == 1)//TYPE_FLOAT
            current = [(float)input];
        else if(type == 3)//TYPE_KEY
            current = [(key)input];
        else if(type == 4)//TYPE_VECTOR
            current = [(vector)input];
        else if(type == 5)//TYPE_ROTATION
            current = [(rotation)input];
        partial = llListReplaceList(partial, current, pos, sub_pos);
    }while((pos = -~sub_pos));//as long as it's not zero...
    return partial;
}

//Escape is a function used to replace characters (or strings of characters) with C style escapes.
//It uses llSegmentString to get around the return byte limits of both llEscapeURL and llUnescapeURL
string Escape(string input)
{//Use with Unescape
    list groups = llSegmentString(input, 84, TRUE);
    integer position = ([] != groups);//negative index of the first entry.
    string result;
    do
    {
        string temp = llList2String(groups, position);
        //We do the text replacement in here because we can save a lot of cpu time not copying
        //the full string onto and off of the stack many times.
        temp = str_replace(str_replace(str_replace(str_replace(llEscapeURL(temp), "%5C","\\\\"),"%0D", "\\n"),"%0A", "\\n"), "%22", "\\q");
        integer loop = 6;
        //converts utf-8 multi-byte & chars < 32 into raw escaped string, easier than building \u codes.
        do//to do this, find the chars to escape, figure out length, strip the "%" and make it a \h code
        {//~-x is the same as (x - 1)
            string search = "%"+llGetSubString("01CDEF", loop = ~-loop,loop);
            @loop;//instead of a while loop, saves 5 bytes (and runs faster).
            integer pos = llSubStringIndex(temp, search);
            if(~pos)
            {
                integer char = (integer)("0x"+llGetSubString(temp, -~pos, pos + 2));//get the first byte of the utf-8 char
                integer len = (0xFC <= char) + (0xF8 <= char) + (0xF0 <= char) + (0xE0 <= char) - ~(0xC0 <= char);//calculate the char's length
                integer end = pos + ~(len * -3);
                //using the length get the char, strip the "%" and replace the char with the hex escape code version.
                temp = llInsertString(llDeleteSubString(temp, pos, end), pos, "\\h" + (string)len + (string)(llParseString2List(llGetSubString(temp, pos, end), ["%"], [])));
                jump loop;//then loop back to see if there is another char.
            }//no more chars of this type, try another code
        }while(loop);
        result += llUnescapeURL(temp);//store the results into the result buffer,
    }while((position = -~position));//advance to the next group; stops when the index reaches zero
    return str_replace(result, "\t", result = "\\t");//should be safe to feed to a notecard now.
}

//This function does multiple string replaces and is used by Escape
string str_replace(string str, string from, string to)
{
    integer len = ~-llStringLength(from);
    if(~len)
    {
        string buffer = str;
        integer b_pos = -1;
        integer to_len = ~-llStringLength(to);
        @loop;//instead of a while loop, saves 5 bytes (and runs faster).
        integer to_pos = ~llSubStringIndex(buffer, from);
        if(to_pos)
        {
//            b_pos -= to_pos;
//            str = llInsertString(llDeleteSubString(str, b_pos, b_pos + len), b_pos, to);
//            b_pos += to_len;
//            buffer = llGetSubString(str, -~b_pos, 0x8000);
            buffer = llGetSubString(str = llInsertString(llDeleteSubString(str, b_pos -= to_pos, b_pos + len), b_pos, to), -~(b_pos += to_len), 0x8000);
            jump loop;
        }
    }
    return str;
}
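The length calculation inside Escape is dense, so here is a small C sketch of the same arithmetic, written out without the `- ~x` shorthand (which equals `x + 1`). The function name utf8_sequence_length is hypothetical. Like the post, it counts sequences of up to 6 bytes as in the original UTF-8 definition; the current standard (RFC 3629) allows at most 4.

```c
/* Hypothetical helper: returns the length in bytes of the UTF-8 sequence that
 * starts with `lead`, using the same threshold arithmetic as the post.
 * Each comparison yields 0 or 1, so every threshold passed adds one byte.
 * Assumption: `lead` really is a lead byte; a continuation byte
 * (0x80..0xBF) is not detected and simply reports 1. */
int utf8_sequence_length(unsigned char lead)
{
    return 1
         + (lead >= 0xC0)   /* 2-byte sequences start at 0xC0 */
         + (lead >= 0xE0)   /* 3-byte sequences start at 0xE0 */
         + (lead >= 0xF0)   /* 4-byte sequences start at 0xF0 */
         + (lead >= 0xF8)   /* 5-byte form, obsolete since RFC 3629 */
         + (lead >= 0xFC);  /* 6-byte form, obsolete since RFC 3629 */
}
```

For example, the lead byte 0xC3 of "é" gives 2 and the lead byte 0xE2 of "€" gives 3.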
Argent Stonecutter Emergency Mustelid Join date: 20 Sep 2005 Posts: 20,263	12-21-2006 10:39
I have a better proposal.

They need to change the limits in SL that you're dealing with - the limits in chat and email and so on - to be in characters rather than bytes. There's really no excuse for expecting people working in a high level language that operates in Unicode to have to deal with bytes at all.
Strife Onizuka Moonchild Join date: 3 Mar 2004 Posts: 5,887	12-21-2006 10:59
This is for segmenting strings not just by bytes but also by characters.

Yes, making the limits consistent would be nice, but it is impractical. People would start encoding their data in Unicode characters to shove more data downstream. To offset this, LL would lower the limits; our new 1023-byte limit for chat would be reduced to 170 characters. Either way, you could still use this function to quickly and painlessly split your strings. It could also be useful if an external application that you are interacting with via email or http/xml-rpc has specific byte limits.

Another example function:

CODE
OwnerSay(string a)
{
    list b = llSegmentString(a, 1023, TRUE);
    integer c = ([] != b);
    do
        llOwnerSay(llList2String(b,c));
    while(++c);
}
Learjeff Innis musician & coder Join date: 27 Nov 2006 Posts: 817	04-13-2007 12:18
Excellent suggestion. I recommend making the description clearer to someone who doesn't already know what it's for, something like this:

list llSegmentString(string input, integer size, integer sizeIsInBytes)

Return a list consisting of input chopped into segment strings, where each segment is no larger than size, and where the returned list matches input when typecast to string.

If sizeIsInBytes is False, the size parameter is interpreted as a size in characters. Otherwise, it is interpreted as a size in bytes. In the latter case, if any character is larger than size, an empty list is returned.
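The character-sized mode in the description above comes down to finding how many bytes the next "size" characters occupy. A minimal C sketch of that step follows, assuming well-formed UTF-8 input. Both function names are hypothetical and are not part of the proposal.

```c
#include <stddef.h>

/* Hypothetical helper: returns the number of bytes taken up by the first
 * `chars` UTF-8 characters of `s`, or by the whole string if it is shorter.
 * Assumption: `s` is well-formed, NUL-terminated UTF-8. */
size_t utf8_prefix_bytes(const char *s, size_t chars)
{
    size_t bytes = 0;
    while (s[bytes] != '\0' && chars > 0) {
        ++bytes;                                          /* lead byte */
        while (((unsigned char)s[bytes] & 0xC0) == 0x80)
            ++bytes;                                      /* continuation bytes */
        --chars;
    }
    return bytes;
}

/* Hypothetical helper: counts the segments of at most `size` characters that
 * `s` splits into. A size of 0 produces no segments. */
size_t count_char_segments(const char *s, size_t size)
{
    size_t segments = 0;
    if (size == 0)
        return 0;
    while (*s != '\0') {
        s += utf8_prefix_bytes(s, size);  /* always advances by >= 1 byte here */
        ++segments;
    }
    return segments;
}
```

A full character-mode splitter would copy each span of utf8_prefix_bytes(s, size) bytes into its own string instead of only counting the spans.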
Learjeff Innis musician & coder Join date: 27 Nov 2006 Posts: 817	04-13-2007 19:55
Strife, I don't understand your script:

CODE
OwnerSay(string a)
{
    list b = llSegmentString(a, 1023, TRUE);
    integer c = ([] != b);
    do
        llOwnerSay(llList2String(b,c));
    while(++b);
}

First, I have to assume that "++" on a list removes the first element from the list, presumably returning True if the list is nonempty and False if it's empty. Hmm, interesting. That sure would be nice to have documented somewhere. (Of course, any kind of professional quality documentation on LSL would be nice, but unfortunately LL doesn't take LSL documentation seriously.)

Second, why do you never print the first element of the list? Specifically, the statement:

CODE
integer c = ([] != b);

sets c to True unless b is the empty list. If b is the empty list, then it appears to call llOwnerSay(""); which presumably is harmless. If b is not the empty list, then the first call to llOwnerSay() will print the results of llList2String(b,1), which is the second element in the list.

Or am I misreading the code somehow?

Thanks,
Jeff
Strife Onizuka Moonchild Join date: 3 Mar 2004 Posts: 5,887	07-27-2007 17:38
Sorry about that, I meant ++c, not ++b.

The not-equal operator on lists returns the difference between the lengths of the lists. "([] != b)" is equivalent to "(llGetListLength([]) - llGetListLength(b))", or just "-llGetListLength(b)", but is much faster.
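The loop pattern explained above starts at the negative list length and counts up until the index reaches zero. The C sketch below emulates it under stated assumptions: C has no negative indexing, so the hypothetical item_at helper maps index -1 to the last element and returns "" when out of range, which is how llList2String is assumed to behave here. Note the guard for an empty list: with no elements the index would start at 0, and ++c would step past zero instead of stopping on it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper emulating LSL-style list indexing: a negative index
 * counts from the end, and an out-of-range index yields an empty string. */
const char *item_at(const char *const *items, int count, int index)
{
    if (index < 0)
        index += count;
    return (index >= 0 && index < count) ? items[index] : "";
}

/* Hypothetical helper: concatenates all items into `out` (capacity `cap`),
 * visiting them with a negative index that counts up to zero, as in the
 * OwnerSay example. */
void join_negative(const char *const *items, int count, char *out, size_t cap)
{
    if (cap == 0)
        return;
    out[0] = '\0';
    if (count <= 0)
        return;            /* guard: the do/while below needs c to start below 0 */
    int c = -count;        /* LSL equivalent: integer c = ([] != b); */
    do {
        strncat(out, item_at(items, count, c), cap - strlen(out) - 1);
    } while (++c);         /* the loop ends when c reaches 0 */
}
```

With the items "ab", "cd" and "ef", the index runs through -3, -2 and -1, and the output buffer ends up holding "abcdef".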
Learjeff Innis musician & coder Join date: 27 Nov 2006 Posts: 817	07-28-2007 05:40 Ah, interesting -- thanks for the tip!

Welcome to the Second Life Forums Archive
