Weekend Grid Outages
|
Catherine Linden
SL Ballyhoo 24/7
Join date: 17 Sep 2003
Posts: 175
|
01-19-2009 11:44
Although the second half of 2008 showed a big reduction in usage hours lost to outages (we reduced outage hours by over 50%), stability challenges have increased over the past month on the Grid. This weekend was especially painful, and it was the first time since joining Linden Lab that I've experienced a full MySQL crash (this occurred just after 4pm PT on Sunday). When the central database crashes, it takes approximately 1 hour to rebuild tables and indexes before it can accept queries and become fully functional. This is the main reason we employ the painful triage process of temporarily blocking logins while the database is in an overload state. These 5-10 minute "blocking periods" have substantially less impact on Residents than a full database crash and a 50-60 minute restart cycle. However, neither is acceptable, and I want to keep you updated on our efforts to stabilize the infrastructure.
We focus a great deal on the central database, but there are many other interdependent infrastructure components and services that have also been contributing to our stability problems.
One of the positives to come out of this weekend's outages was our ability to gather data and complete some detailed analysis of the patterns that have been causing failures during our highest load periods. In addition to having our best development and operations resources watching Grid activity this weekend, we've also brought in some of the best MySQL professional services teams to help us tune and optimize, as well as recommend long-term architectural changes.
As the leader of our operations and infrastructure team, my immediate priority is to tune and optimize queries to get us back to a position where we can manage our Resident transactions during peak load.
This mainly focuses on validating configurations (some of which were found to be in error this weekend) and moving high-load query processes that hit the central database to slave databases that have more headroom at peak (essentially spreading the load and protecting the central database).
In parallel, we have a separate engineering team that is poring over the existing code base and developing a long-term strategy for our data services that will scale properly.
I'm attaching a write-up below from one of our engineering leads detailing some of our efforts to re-architect the service. I'll also be monitoring the forums and responding to your questions. As I have said in previous posts, our execution and delivery on promises of stability are what count, but I also want to be open in communication, even if it is a difficult message to deliver.
Here is a more detailed view into our ongoing development efforts (authored by the Systems Infrastructure development lead, Sardonyx Linden):
When we started building Second Life, the unique nature and scale of the challenge we set ourselves posed many difficult questions. Among our difficulties was getting to grips with our data model: we started out by writing SQL queries against a single central database, and we added tables and columns whenever we needed new functionality. This intentional lack of architecture gave us a wonderful means to bootstrap ourselves: we had our hands full creating the machinery of a virtual world, and focusing on the perfect data architecture too early would have been inappropriate.
As Second Life has grown, our data model has matured, and we are moving away from this one-database-fits-all model. There are two reasons for this. At some point, a single database (even with numerous replicas) will clearly not be able to keep up with the increasing query load.
In addition, a clean internal architecture makes the system easier for our engineering and operations teams to maintain, extend, and scale.
Our existing data layout is sprawling: there are more than 100 tables in our main databases. This means that we have to be careful in choosing the order in which to reconstruct data services: we pick the busiest and most important services first. For instance, the vibrant nature of the Second Life economy generates a heavy query load, so Linden Dollar transactions are among our early targets for conversion.
Developing an internal REST-based Linden Dollar API has been a substantial process. We distilled over a hundred scattered SQL queries into a small, elegant interface. We developed correctness and stress tests for the interface. We converted simulators, other daemons, batch scripts, and data warehousing tools to the new APIs. With numerous short cycles of development and testing, we ensured that the new code base stayed close to our main line of development throughout. There are still databases behind the new API, but we can partition the data and scale to accommodate heavier load without touching any of the code that acts as a client of this API.
We will be rolling out the new API on a limited scale over the coming months. Residents should see no changes as a result of this work.
We have other, similar projects underway to give us cleaner, more modular access to other critical infrastructure, such as agent inventory ("where's my stuff?") and space services ("what piece of the world should a simulator own?"). These initiatives will help us to provide a more stable and responsive Second Life experience, even as our user base continues to grow.
In addition, we keep a close eye on high-quality open source technologies for internal use, so that we can deploy the best for the engineers behind Second Life to work with. Sometimes, these technologies augment or replace older approaches.
For instance, we have adopted Django as the framework for most of our internal web development needs. We chose Django after a comprehensive bake-off, in which we compared the performance and elegance of an application developed under several popular Python web frameworks. In other cases, we see gaps in our internal service offerings that we would like to fill, such as fast, robust messaging, and we are actively developing benchmarks and experience with contenders in those areas.
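As a rough illustration of the earlier point that data can be partitioned behind the Linden Dollar API without touching client code, here is a minimal Python sketch. Everything in it is hypothetical: the `LedgerAPI` and `Shard` names, the hash-based choice of shard, and the in-memory dictionaries standing in for databases are assumptions made for illustration, not the actual internal interface or schema.

```python
from dataclasses import dataclass, field
from zlib import crc32


@dataclass
class Shard:
    """Stand-in for one backing database (assumption: a plain in-memory dict)."""
    balances: dict = field(default_factory=dict)


class LedgerAPI:
    """Hypothetical facade: callers see balance/credit/transfer, never the shards."""

    def __init__(self, shard_count: int) -> None:
        if shard_count < 1:
            raise ValueError("need at least one shard")
        self._shards = [Shard() for _ in range(shard_count)]

    def _shard_for(self, agent_id: str) -> Shard:
        # Stable hash, so the same agent always maps to the same shard.
        index = crc32(agent_id.encode("utf-8")) % len(self._shards)
        return self._shards[index]

    def balance(self, agent_id: str) -> int:
        return self._shard_for(agent_id).balances.get(agent_id, 0)

    def credit(self, agent_id: str, amount: int) -> None:
        if amount <= 0:
            raise ValueError("amount must be positive")
        shard = self._shard_for(agent_id)
        shard.balances[agent_id] = shard.balances.get(agent_id, 0) + amount

    def transfer(self, payer: str, payee: str, amount: int) -> None:
        # Toy version: a real service would need an atomic cross-shard update.
        if amount <= 0:
            raise ValueError("amount must be positive")
        if self.balance(payer) < amount:
            raise ValueError("insufficient funds")
        self._shard_for(payer).balances[payer] -= amount
        self.credit(payee, amount)


# The calling code is identical whether there is one database or several.
for shard_count in (1, 4):
    api = LedgerAPI(shard_count)
    api.credit("alice", 100)
    api.transfer("alice", "bob", 30)
```

The point of the sketch is only the shape: the shard count is a constructor detail, so the loop at the bottom runs unchanged against one backing store or four.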
|
Quixotic Hermit
Registered User
Join date: 9 Nov 2008
Posts: 65
|
01-19-2009 11:47
OMG actual info! As if those who complain will even read this or the blog.
|
7Seas Sass
Registered User
Join date: 20 Jun 2008
Posts: 1
|
01-19-2009 11:54
Thanks for the post. Though it's been rough lately, this helps because it's a big-picture view of what's going on and what LL is doing to fix it.
Otherwise all we have to go by is the micro view of the status blog.
_____________________
7Seas Fishing: Collect cute wearable pet fish, socialize, relax! http://7seasfishing.com
|
Walentine Gazov
Registered User
Join date: 18 Mar 2007
Posts: 85
|
01-19-2009 11:57
From: Quixotic Hermit
OMG actual info! As if those who complain will even read this or the blog.
Yes! We did read it! The "funny" part was that when I read the blog, the database went down again.
|
Meade Paravane
Hedgehog
Join date: 21 Nov 2006
Posts: 4,845
|
01-19-2009 12:03
Heya Catherine! TY for the update! Even when things are running well, it's nice to hear the techo Lindens and grid monkeys talking about what's happening - please keep it up!! 
_____________________
Tired of shouting clubs and lucky chairs? Vote for llParcelSay!!!
- Go here: http://jira.secondlife.com/browse/SVC-1224
- If you see "if you were logged in.." on the left, click it and log in
- Click the "Vote for it" link on the left
|
Hardon Darkstone
Registered User
Join date: 25 Oct 2007
Posts: 3
|
Still Trying to Understand
01-19-2009 12:04
These past few months, it has been very hard to justify the expense of SL in my mind. I can't DJ because I can't rely on being logged on for more than a short period of time. Hell, I may as well communicate with my friends over regular email.
I know you guys are trying, and the effort is appreciated, but this is getting bad. Real bad.
When I can fairly well predict the times of day I will crash and the number of times I will crash, you guys have a serious problem.
Are your customers getting their money's worth?
I love SL, but these past few months, it's real hard to LIKE SL.
|
Dytska Vieria
+/- .00004™
Join date: 13 Dec 2006
Posts: 768
|
01-19-2009 12:07
So I guess open source means MySQL and not something like Oracle Enterprise & Times Ten...
_____________________
+/- 0.00004
|
Meade Paravane
Hedgehog
Join date: 21 Nov 2006
Posts: 4,845
|
01-19-2009 12:10
From: Dytska Vieria
So I guess open source means MySQL and not something like Oracle Enterprise & Times Ten...
/me does the licence fee calculation for MySQL on 10,000 servers. Er.. What was the question?
_____________________
Tired of shouting clubs and lucky chairs? Vote for llParcelSay!!!
- Go here: http://jira.secondlife.com/browse/SVC-1224
- If you see "if you were logged in.." on the left, click it and log in
- Click the "Vote for it" link on the left
|
Argent Stonecutter
Emergency Mustelid
Join date: 20 Sep 2005
Posts: 20,263
|
01-19-2009 12:11
If the database itself is an issue there's always PostgreSQL.
|
Astarte Artaud
Registered User
Join date: 10 Feb 2007
Posts: 116
|
And here we go again.....?
01-19-2009 12:11
mmm Interesting...65.8K online and we start getting hiccups with the database again.... I wonder if there may be a correlation.... hey guys we can get marvelous concurrency figures...but everything grinds to a halt when we do...Sound familiar ???
|
Milla Alexandre
Milla Alexandre
Join date: 22 Jan 2007
Posts: 1,759
|
01-19-2009 12:12
Thanks Catherine~! Feedback is always a plus to SL residents.....as so often we are left guessing and wondering, and cannot help being frustrated. These kinds of informative posts are what folks need in order to feel like they are in the loop.....that they 'matter'. It's got to be an enormous task, maintaining the grid AND keeping the natives happy, haha, so I applaud LL's efforts and willingness to include us, and explain the technicalities. (even if not all of us fully comprehend what we're reading lol) Bottom line....this is so so vital to the SL community. Communication is key.....it reinforces our trust and faith in LL to keep on top of issues, and not forget the little guy. The whiners will always be whiners.... but in the big picture, I think LL does a fantastic job at providing a stable and rather incredible worldwide virtual community.
|
Feliciana Zabaleta
Registered User
Join date: 17 Jan 2008
Posts: 6
|
Thanks, But What's Next
01-19-2009 12:13
Thank you for the report. However, Second Life has been degrading to the point of total failure lately, with status messages cycling continuously through issue, resolved, and reopened, which makes you look totally inept in your handling of a problem. Let's roll the clock back to last April, and that's about the scene we now have again, with database crashes, lost inventory and total dissatisfaction with Second Life, not that anyone at Linden Labs cares what we may think.
Keep up the band-aid fixes and disabling logins every 5 minutes; that is a surefire long-term cure for broken infrastructure that can be fixed. Nothing surprises me with Second Life anymore. It is not a matter of "if" but of when it will crash.
|
Persia Christensen
♥ Body Doubles ♥
Join date: 28 Dec 2005
Posts: 30
|
01-19-2009 12:16
This has me worried. I'm accustomed to the usual weekend outages, log-ins disabled, etc, and while that is pesky, it's understandable as there are more people off work on the weekends and it's just natural that there will be more people online at any given moment. However, this is Monday, usually the least busy time of the week, at least at our shop in terms of visitors and items sold, and already log-ins are disabled, the grid is a mess, and our customers are complaining because their transactions are stale. If it's this borked on a Monday, I shudder to think about the upcoming weekend. Thank you for the update, and I hope that the powers-that-be are working on this and giving some thought to getting rid of traffic as it is now.
_____________________
I heart Body Doubles
|
Ann Otoole
Registered User
Join date: 22 May 2007
Posts: 867
|
01-19-2009 12:22
Let us know when you hire real metadata and data architecture services to really lay the groundwork for Second Life version 2. Yes a total rewrite ground up. And no those resources are not MySql professional consultants rofl. Sheesh. You get what you pay for. That goes for software too.
You guys get paid whether your stuff works or not. We are the ones having to make exit strategies.
Let me know when the board of directors ties your executive and staff compensation to actual results.
|
Viktoria Dovgal
…
Join date: 29 Jul 2007
Posts: 3,593
|
01-19-2009 12:23
From: Persia Christensen
However, this is Monday, usually the least busy time of the week,
Today's a holiday in the US, and it's nasty outside in some of the country, so that's probably brought a lot of extra SL activity on today.
|
Ciaran Laval
Mostly Harmless
Join date: 11 Mar 2007
Posts: 7,951
|
01-19-2009 12:24
From: Persia Christensen
However, this is Monday, usually the least busy time of the week, at least at our shop in terms of visitors and items sold, and already log-ins are disabled, the grid is a mess, and our customers are complaining because their transactions are stale. If it's this borked on a Monday, I shudder to think about the upcoming weekend.
Today is a bank holiday in the USA so there are probably more people logging in from home than usual.
Thanks for the update Catherine/Frank. The long term plans are interesting, now what short term plans do you have and can residents do anything to help ease the load? Can we go on a green crusade and help save the planet!
|
Rhaorth Antonelli
Registered User
Join date: 15 Apr 2006
Posts: 7,425
|
01-19-2009 12:25
Is it FJ that will be responding to these posts, or Catherine?
just wondering because Frank is the one who posted on the blog, but Catherine posted here in the forums
my grid question, is a question that I think many folks are wondering.. it is a two part question kinda...
will the traffic metric ever be removed, and if so would it make any changes on stability? (less info to keep track of)
also will bots ever be dealt with, just think, get rid of the bots and you won't have this issue with too many logins
related to the bot question, is there a way you could tell us how many accounts logged in are bots (on average?)
_____________________
From: someone Morpheus Linden: But then I change avs pretty often too, so often, I look nothing like my avatar.  They are taking away the forums... it could be worse, they could be taking away the forums AND Second Life...
|
Eli Schlegal
Registered User
Join date: 20 Nov 2007
Posts: 2,387
|
01-19-2009 12:27
From: Persia Christensen
This has me worried. I'm accustomed to the usual weekend outages, log-ins disabled, etc, and while that is pesky, it's understandable as there are more people off work on the weekends and it's just natural that there will be more people online at any given moment. However, this is Monday, usually the least busy time of the week, at least at our shop in terms of visitors and items sold, and already log-ins are disabled, the grid is a mess, and our customers are complaining because their transactions are stale. If it's this borked on a Monday, I shudder to think about the upcoming weekend. Thank you for the update, and I hope that the powers-that-be are working on this and giving some thought to getting rid of traffic as it is now.
Today (Monday) is sort of a holiday in the US. (Even though I still had to come in to work)
|
Catherine Linden
SL Ballyhoo 24/7
Join date: 17 Sep 2003
Posts: 175
|
01-19-2009 12:30
Hey everyone,
Frank will be responding to the forums. I opened up the thread on his behalf but my Grid database expertise is....well.....not impressive. I think he's in meetings now but will be around this afternoon to respond to your questions.
thanks!
|
Jaymes Kjeller
Registered User
Join date: 26 Mar 2008
Posts: 1
|
01-19-2009 12:31
Sorting out the database into a "Normal Form" (if you can excuse the amount of geekonese in that sentence) might resolve the issues, at least that is what I can determine.
The bots issue seems to be a difficult problem to handle. If compulsory bot registration is enacted, those who use them for good will register, while those that use them for bad will try harder to slip under the radar. Then it's a case of "How can it be enforced?" But that's another topic.
Thanks for the update. It's nice to know something is at least being done.
|
Argent Stonecutter
Emergency Mustelid
Join date: 20 Sep 2005
Posts: 20,263
|
01-19-2009 12:35
From: Jaymes Kjeller
Sorting out the database into a "Normal Form" (if you can excuse the amount of geekonese in that sentence) might resolve the issues, at least that is what I can determine.
Ack. Ooop. Nope. I've had to deal with a couple of databases where the DBA had been too aggressive about making everything too-strictly normal form with bunchteen extra tables and extra lookups to do everything.
|
Darcie55 Kraus
Registered User
Join date: 4 Aug 2008
Posts: 8
|
Crashes
01-19-2009 12:43
I am grateful for the information, it even shows that they care about their users.
But I'm still very puzzled because when SL crashes, I get this message (Display driver nvlddmkm stopped responding and has successfully recovered.)
It only happens in SL (knock on wood) and this has happened since the start of Dec.
Last night was for sure like everyone else a pain to get on and I had a lot of lag, and this has been happening since late Nov. first of Dec.
I would like the crashes to be fixed if possible!
Thank you
|
richard Zhichao
Registered User
Join date: 9 Mar 2007
Posts: 113
|
Sunday crash
01-19-2009 12:47
Well, I think the limit on Second Life is a little over 70,000, and then the whole damn thing will crash, so the other 1,000,000 will have to wait to get online.
|
Smokey Newman
Registered User
Join date: 12 Feb 2007
Posts: 6
|
01-19-2009 12:47
Why don't you look at alt and bot usage? Then maybe the real people can get on.
|
richard Zhichao
Registered User
Join date: 9 Mar 2007
Posts: 113
|
bots
01-19-2009 12:50
Second Life will never do anything about bots because it makes their numbers look good, so don't go there.
|