Welcome to the Second Life Forums Archive

Trouble in 'quad core' cpu land?

eltee Statosky
Luskie
Join date: 23 Sep 2003
Posts: 1,258
06-07-2005 23:25
Has LL tried to set up a new type of server and had something burp on the code?

My prior thread here about simulator performance related to specific objects causing massive problems via a 'pending download'...

well after that abated... a new problem began to rear its ugly head... after several sim crashes... we ended up on some kind of *very* wonky simulator.

Essentially it couldn't decide if we should have half of the machine's time, or 1/4, and whenever it was 1/4 aka 0.25 sim cpu... the whole sim behaved HORRIBLY. random timeslices would take HUGE swaths of time, like half-way through processing them.. the sim itself stopped entirely... and then came back and finished the pass... causing sudden and intermittent just *stops* while moving, and doing other things... then the sim cpu would revert back to 0.5 and all would be well on the sim again...



Well, we heard a bit of a rumor about sims with 4 cpus, or at least 4 cores now... and when taco was having these problems... a linden noted that there were 'three sims running on this machine'... so okay.. that's all fine and dandy... except taco is totally *HOSED* with this problem that was until today plaguing lusk...

when people came to investigate, they blamed it on scripts, or avatar attachments, or other things..

but remember, one of the primary symptoms of this problem is that the simulator 'burps out' at random times, so each server pass that the cpu was at .25.. a different timeslice would jump to like 10 times its normal value.. so sure one pass it looks like scripts are HUGE and laggy, but then they go to normal.. then another pass it looks like run agents is HUGE... and then it goes to normal.. the whole time, the sim cpu is at .25

then suddenly it snaps back to sim cpu .5 and everything runs fine... all the slices are at their proper intervals etc..

now to really disprove the attachments, we cleaned out the simulator except for several of us in there, wearing *ruth* (aka the default avatar of SL) no attachments, no scripts, just ruth... and yeah.. it was still doing the problem continuously, and exhibited all the same problems as when the sim was full of people going about their normal business

(photos will be attached here to show, 6 people as ruth in sim with 6 people total, and within 1-2 minutes the sim's performance 'as a .25' versus its performance 'as a 0.5')



now im not going to just leave it at that though.. i want to pose a hypothetical scenario...

what if the simulator code running SL's machines got a little bit confused... or the mapping table/array as to which server is which class, somehow got some mis-entries...

lets just posit what might happen if you attempted to run say, three simulators, on a dual core server...

at any one point in time.. one of the cpu cores, would be running one simulator, as per the normal operating procedure, and everything would be well..

now due to the nature of parallel processing on a 'single cpu' the operating system is going to be presented with only one other cpu core, and two *very* demanding tasks to perform aka two entire live SL simulators. Now it's going to attempt to run them both at the same time, on the second cpu... it will start processing one.. stop after a maximum timeslice has passed, start processing the other, stop it after a maximum timeslice has passed...

the timers for timeslice measurement would 'count' all the time that was spent on the *OTHER* simulator process and report randomly large slices, different each sim frame (since exactly when the handoff between two sims on one core occurred, would vary, frame to frame)... since these timers would still be initialized from the start of their own process and the end-start time would be rather large (there was another sim being worked on inside of that gap).
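A minimal sketch of that timer effect, assuming the slice timers measure wall-clock time rather than CPU time (nothing in this thread confirms how the real simulator measures its slices). The names `timed_slice`, `busy` and `pause_s` are made up for illustration, and the `time.sleep` call is only a stand-in for the OS suspending this process to run another sim on the same core:

```python
import time

def timed_slice(work, pause_s=0.0):
    """Measure one 'timeslice' two ways: wall-clock time and CPU time.

    pause_s is a hypothetical stand-in for the scheduler taking the core
    away from this process in the middle of the slice.
    """
    wall_start = time.perf_counter()   # wall clock: keeps running while suspended
    cpu_start = time.process_time()    # CPU time: only counts time spent executing
    work()
    time.sleep(pause_s)                # process is off the cpu during this gap
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

def busy():
    """A small fixed chunk of work standing in for one slice of a sim frame."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total

wall_alone, cpu_alone = timed_slice(busy)
wall_shared, cpu_shared = timed_slice(busy, pause_s=0.2)

print(f"alone : wall={wall_alone:.3f}s cpu={cpu_alone:.3f}s")
print(f"shared: wall={wall_shared:.3f}s cpu={cpu_shared:.3f}s")
```

The work done is identical in both calls, but the wall-clock figure for the second one includes the whole gap, so whichever slice happens to be interrupted looks huge for that one frame.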


Now after a little while of say sim a running on its own cpu, and sim b and sim c sharing a cpu... the operating system would notice the second cpu is over-burdened more than the first, and attempt to cycle say sim b, over to the first 'less used' cpu where sim a has been running happily.

now sim a and sim b will be running the mad dash for cpu time dance, and sim c will get a brief respite, back on its own cpu at .5 sim cpu reported, and its in game content will run normally.. but that will not last, chances are good sim a or sim b will be coming back to join it shortly...

in such a scenario, you would expect a sim to spend roughly 2/3 of its time at .25 sim cpu, and roughly 1/3 of its time at .5 sim cpu, as things were juggled around... and if you look at the sim cpu time charts in those two screen captures, (and counted things not .25 as .5 sim cpu slices, with just little bits here and there shaved off between reporting back) that's almost exactly what you see in the simulator's run pattern.
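The 2/3 versus 1/3 split can be checked with a few lines of arithmetic. This is only a toy model, assuming the OS rotates evenly which of the three sims gets a core to itself; a real scheduler is far less tidy, and `rotate` and the constants are hypothetical names for illustration:

```python
from fractions import Fraction

SIMS = ["a", "b", "c"]
FULL_CORE = Fraction(1, 2)    # reported sim cpu when a sim has a core to itself
SHARED_CORE = Fraction(1, 4)  # reported sim cpu when two sims split one core

def rotate(steps):
    """Toy scheduler: at each step a different sim gets a core alone,
    and the other two share the remaining core."""
    alone_counts = {sim: 0 for sim in SIMS}
    for step in range(steps):
        alone_counts[SIMS[step % len(SIMS)]] += 1
    # Fraction of all steps each sim spent alone on a core
    return {sim: Fraction(count, steps) for sim, count in alone_counts.items()}

fraction_alone = rotate(300)
average_cpu = {
    sim: f * FULL_CORE + (1 - f) * SHARED_CORE
    for sim, f in fraction_alone.items()
}

for sim in SIMS:
    print(sim, "at .5:", fraction_alone[sim],
          "at .25:", 1 - fraction_alone[sim],
          "average:", average_cpu[sim])
```

Under that assumption each sim sits at .5 for 1/3 of the time and at .25 for 2/3 of it, which averages out to 1/3 of the machine per sim: two cores split three ways.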


Now is this the only possible explanation? of course not.. mebbe the new sims are quad core, but dual cpu, using the fancy new dual core opteron cpu's.. and mebbe the OS isn't able to actually *USE* the second core for some reason, that would still present a similar run down situation, where you have three sims sharing two 'active' cores (and the two other second cores would be sitting idle on those cpu's)... that's a possibility as well.. though somewhat less likely i believe, as we saw this exact profile happening on a series of simulators between crashes of lusk over the last 3-4 days... everything from the very oldest opterons of the sim455.agni vintage, to the very newest ones in the high 700's...

given we saw this exact profile happen on several simulators, and that it was happening on simulators *known* or at least very intelligently assumed, to be only dual core... i think it is probably more likely that there is some error within the server process, where more live simulators are being assigned to a server than it has cpu's (or cores) to concurrently process... aka it's giving 3 live sims to a 2 core server, thinking that it's actually 4 core.



again now this is *ALL CONJECTURE* on my part, albeit hopefully somewhat intelligently founded conjecture given my background in time critical multi-thread software design...

it's very possible it could be something else entirely (but NOT avatar attachments, as the forthcoming silly (but very meaningful data containing) ruth pictures will show)... but at least from what little data a second life resident can accumulate over several days, and a few dropped hints on the development/running environment... it would seem that this scenario is at least a *plausible* cause of the massive, and very erratic performance problems of late, especially on simulators that crash hard, and come back up on 'something' that was probably selected at random, and with great haste (possibly even if it already was 'full')


hopefully LL can eventually find whatever bug *IS* actually causing this issue, and nip it before it hurts too many other areas/people
_____________________
wash, rinse, repeat
eltee Statosky
Luskie
Join date: 23 Sep 2003
Posts: 1,258
the second picture (at .5 cpu just a minute later)
06-07-2005 23:29
just postin it here so it gets opened by default too, and people can see it

(i do apologize in advance for the large size of the images... thas jus how i run SL, and if i shrunk it down, you wouldn't be able to see either the numbers, or the ruths if i just cropped it)
_____________________
wash, rinse, repeat
Lee Linden
llBuildMonkey();
Join date: 31 Dec 1969
Posts: 743
06-08-2005 09:12
I'll make sure the grid people check this out as soon as they get in.
Lee Linden
llBuildMonkey();
Join date: 31 Dec 1969
Posts: 743
06-08-2005 10:42
It's being looked at now by several developers. Yes, something's misbehaving. As soon as we figure out where which is doing what to whom and why, we'll know when. ;^)
eltee Statosky
Luskie
Join date: 23 Sep 2003
Posts: 1,258
06-08-2005 10:48
thanks for the update lee ^.^ keep us posted
_____________________
wash, rinse, repeat
Baba Yamamoto
baba@slinked.net
Join date: 26 May 2003
Posts: 1,024
06-11-2005 10:30
From: eltee Statosky
thanks for the update lee ^.^ keep us posted



Very Interesting.... Taco is still acting up?
_____________________
Open Metaverse Foundation - http://www.openmetaverse.org

Meerkat viewer - http://meerkatviewer.org
Kim Anubis
The Magician
Join date: 3 Jun 2004
Posts: 921
06-11-2005 12:32
Just want to say thanks to eltee and the rest of the Lusk folks for all the time and effort you've been putting in to help the Lindens figure out the causes of simulator problems.
_____________________
http://www.TheMagicians.us
Lee Linden
llBuildMonkey();
Join date: 31 Dec 1969
Posts: 743
06-13-2005 09:47
It looks like we've got a small bug we're still tracking down that causes some regions to come back online on a server that's already in use.

The best symptoms I've seen so far are wildly fluctuating performance and varying sim CPU values. Unfortunately, seeing these doesn't automatically mean this is the cause... the only way to know for sure is for us to check on an internal diagnostic tool (though that's easy enough that even I can do it).

Fixing a sim that has this problem is also fairly easy; it just involves a restart and some work by the person on grid duty.

We'll keep an eye on this problem (I had another case of it this morning), and we'll do what we can to find and squash the bug as quickly as we can.
eltee Statosky
Luskie
Join date: 23 Sep 2003
Posts: 1,258
06-13-2005 10:10
yeah that's pretty much what it seemed like was happening... the best indicator is not actually a wildly fluctuating sim cpu, but more a sim cpu that seems to 'fix' at .25 and occasionally flicker up, and the time slices jump randomly (when the os scheduler suspends one sim in the middle of something to work on the other, then comes back)
_____________________
wash, rinse, repeat
eltee Statosky
Luskie
Join date: 23 Sep 2003
Posts: 1,258
06-17-2005 22:31
any eta on when this is gonna get patched lee? lusk got brought down by a crashed badly scripted vehicle and came back up in 'three is a crowd' land again sigh
_____________________
wash, rinse, repeat
Catherine Omega
Geometry Ninja
Join date: 10 Jan 2003
Posts: 2,053
06-17-2005 23:08
From: eltee Statosky
any eta on when this is gonna get patched lee? lusk got brought down by a crashed badly scripted vehicle and came back up in 'three is a crowd' land again sigh
Last time I checked, I wasn't Lee, and I'm sure this may not be much consolation, but vehicles and other physical objects shouldn't crash sims come 1.7. So at least you won't have to worry about THAT part of it in a few weeks. Everything seemed to be okay immediately after the last update, so perhaps the bug is limited to sims that restart after a crash only, and not ones that are started properly?
_____________________
Need scripting help? Visit the LSL Wiki!
Omega Point - Catherine Omega's Blog
eltee Statosky
Luskie
Join date: 23 Sep 2003
Posts: 1,258
06-18-2005 05:56
yeah it seems mostly to be part of the code that assigns a 'new' server to sims that are recovering from crashes so they can come back up faster
_____________________
wash, rinse, repeat
Jeffrey Gomez
Cubed™
Join date: 11 Jun 2004
Posts: 3,522
06-19-2005 23:26
I've confirmed this is a handoff problem on startup.

I just accidentally crashed Game Dev 3, and sure enough, it came back up with the CPU wondering whether it's a quad or dual processor sim. Whoopsie!

Soooo... yeah. We know when it's happening. Any idea on when it'll be fixed?
_____________________
---