Etchification Rolling Restart 2009/02/17-2009/02/20
|
Prospero Linden
Linden Lab Employee
Join date: 6 Aug 2007
Posts: 315
|
02-13-2009 06:59
This thread is for discussion of the etchification rolling restart described in the forum post http://status.secondlifegrid.net/2009/02/12/post500/
Some additional notes about this rolling restart:
* This is to convert our servers from Debian Sarge to Debian Etch. Etch is the current "stable" version of Debian, and will continue to be supported for a while even after the next Debian Stable ("Lenny") comes out. We need to update so that we can keep current with security updates and the like.
* Normally during rolling restarts, we lock regions to hosts. Most of the time when a region restarts, it will migrate to a different host, but they don't do that during rolling restarts. However, this rolling restart is different; because hosts are not just having a Second Life software upgrade, but are being completely wiped and reinstalled, we won't have the freedom to lock regions to hosts.
* One implication of that is that some regions will be restarted more than once. There's no way to anticipate exactly which regions will be restarted.
* These upgrades take longer than a standard software upgrade. As such, while many regions will be down for a typical rolling restart length of time (a few to 10 minutes), some regions will be down for 30-60 minutes.
|
bigmoe Whitfield
I>3 Foxes
Join date: 29 Jul 2007
Posts: 459
|
02-13-2009 07:18
Wow, that's cool. Is this "stable" version of Debian faster too? Does it handle things better than the previous version of Debian? I am a little lost on *nix, as it is a big learning step for me.
_____________________
GoodBye Forums we will miss you ~moe 2-2-2010~
|
Atashi Toshihiko
Frequently Befuddled
Join date: 7 Dec 2006
Posts: 1,423
|
02-13-2009 07:27
I know this has been discussed more than once with regards to rolling restarts, but there has to be some better way to handle scheduling on a region-by-region basis. For anyone expecting to see Second Life used in a business environment, it is simply not acceptable that the notification for an expected service outage is "Some time over the course of a few days, your region(s) will be offline once, or maybe more than once, for a duration that might be 10 minutes, but might be 30 or 60 minutes."
I understand that due to the dynamic way regions are assigned to physical hosts, there are some challenges, and I accept that *right now* it isn't possible to provide more specific downtime information. It really is something that Linden Lab should be aware of, however, particularly if LL is serious about attracting businesses (and educators).
By comparison, the ISP that provides my T1 connection for work (and whom I pay less than half what I pay LL per month) schedules downtime with a 14-day notice, and can tell to within a 2 hour period when the downtime will happen. They even go so far as to ask if it is ok, and if it isn't, I have the option of having them reschedule it.
Like I said, I understand that this isn't possible now, but it is something that has to be addressed going forward.
Good luck with the Etchification.
-Atashi
_____________________
Visit Atashi's Art and Oddities Store and the Waikiti Motor Works at beautiful Waikiti.
|
Kio Bade
Registered User
Join date: 23 May 2008
Posts: 1
|
02-13-2009 09:00
Good luck, and we all hope the grid gets more stable and performs better.
|
Shockwave Yareach
Registered User
Join date: 4 Oct 2006
Posts: 370
|
02-13-2009 09:20
I agree with Atashi. It's understandable to everyone if you announce a couple of weeks ahead of time that the entire service will be down for maintenance for a day. But what you have now is no different than the cable company telling you you have to take an entire day off to stay at home, and the installer might show up on that day between 8am and 6pm - no promises though. Why not just have 4 days in the year where you go down to cold iron and do what repairs need to be done? Call them "I brake for reality Breaks" when everyone is prodded into going out and getting an ice cream or some such treat in the real world.
|
Prospero Linden
Linden Lab Employee
Join date: 6 Aug 2007
Posts: 315
|
02-13-2009 11:13
Just scheduling 4 days of down time a year wouldn't work. First, we can't schedule that far in advance; we don't know that we will be ready. What would happen is that as we got close, we'd have to push off the down time, and then you'd have short notice for multiple days of downtime, which would be *far* more disruptive.
Second, the analogy to the cable company telling you to stay home all day falls down. You do not have to be there for the rolling restart. It will happen sometime, and, for most regions, after 10 minutes the region will be back. Just keep using Second Life as normal. If you get a notification that the sim is about to go down, then go somewhere else for a few minutes.
Yes, this can be disruptive if it hits an event in progress. For that reason, we do hope in the future to make things more predictable. But it's not a week of constant downtime and maintenance for which you have to completely disrupt your schedule.
|
JJValero Writer
Registered User
Join date: 28 Feb 2007
Posts: 10
|
Debian Lenny is near
02-13-2009 11:13
Perhaps it is better to wait one or two weeks for Lenny.
|
Prospero Linden
Linden Lab Employee
Join date: 6 Aug 2007
Posts: 315
|
02-13-2009 11:22
One or two weeks? I remember when Lenny was going to be out in September 2008. I'm not holding my breath.
But, also, it's not just a matter of, "Hey, new Debian's out, let's upgrade!" the way (say) I do on my desktop. We have to do extensive testing to make sure everything works. Inevitably, at the beginning, it doesn't, so we have to make sure that we get the right configuration of packages, that we update our customized config files, etc. And, we want to test the heck out of it before we put it on the 6000 servers that run the main Second Life environment.
We will assuredly go to Lenny at some point, but we're going to let Lenny be out and be "Debian stable" for a while before we do that. And, there'll be a couple of months of work behind the scenes as we get ready to go from etch to lenny before it hits the rolling restart stage. (It's been more than a couple of months to get etch ready-- it's been several months with several people working on it. But, what they've done will make this easier for us next time around.)
|
Argent Stonecutter
Emergency Mustelid
Join date: 20 Sep 2005
Posts: 20,263
|
02-13-2009 11:30
Why do the restarts take so long if there's migration going on as well? Waiting for a free server to migrate the region to? Or just handling migration of that many regions?
|
Atashi Toshihiko
Frequently Befuddled
Join date: 7 Dec 2006
Posts: 1,423
|
02-13-2009 11:45
From: Prospero Linden
Yes, this can be disruptive if it hits an event in progress. For that reason, we do hope in the future to make things more predictable. But it's not a week of constant downtime and maintenance for which you have to completely disrupt your schedule.
I do not disagree with what you are saying, Prospero. The problem is that because we cannot know which day of the rolling restart is 'our' day, all three days are unpredictable. To put it another way, were I trying to schedule an inworld business meeting, I would have to blacklist all of the rolling restart days because SL is going to be unreliable through that three-day period. (And I know your counterpoint that it's not three whole days, it's just a few hours each day. Unfortunately these things tend to happen during business hours for North America, so it's business days. For me at least.) The fact that odds are we will only have a 10 minute downtime is not relevant if we don't even know which day the 10 minutes fall on.
It's a debate that has been mentioned in these restart threads a few times in the past. Yes, it is inconvenient if the restart hits you in the middle of the event. So we want to plan around the restart and ensure that the restart does not hit our event (or our meeting, or whatever). How do we plan around it? We have to blacklist the entire restart period. Granted, we can continue to use SL for lots of other things, but what we cannot do is plan the Big Business Meeting, or the event, or whatever.
I do have a (hopefully) helpful suggestion. Since restarts in the past have followed a pattern of day1 = preliminary / test regions, day2 = evens (or odds), day3 = odds (or evens), I would propose this:
One week before the restart, lock all regions into either an even or odd server number (I'm not saying lock them to the exact server, just to the even or odd so they always come up on the same 'group'). At the same time, announce that (e.g.) "next Wednesday will be the odds, and Thursday will be the evens." For regions that are on the preliminary, notify the estate owner (if it is a private estate region). For mainland regions that are in the preliminary, just publish a list of them on a wiki page somewhere. This at least will let people know a week in advance which day their restart will hit.
Then if something happens on the prelim and it has to be rolled back, or the rest of it has to be delayed, delay the whole thing a full week. Make that the pattern: Tuesday for the preliminary, Wed for the evens, Thurs for the odds. A little consistency would not hurt any of us.
If the preliminary regions cannot be announced in advance, then announce what the server host names are for the preliminary. People who care enough about this will look it up and figure out if it affects them.
With a system like this in place, then at least I could be in a position to know that 'Ok I know that my region is an odd number so it's safe to do the presentation on ----day but not on ----day.'
-Atashi
_____________________
Visit Atashi's Art and Oddities Store and the Waikiti Motor Works at beautiful Waikiti.
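The even/odd scheduling idea proposed in the post above can be sketched in a few lines of Python. This is only an illustration of the suggestion, not how Linden Lab schedules restarts: the function name `restart_day`, the integer host numbers, and the `pilot_hosts` set are all hypothetical, and the Tuesday/Wednesday/Thursday pattern is taken from the post.

```python
from datetime import date, timedelta

def restart_day(host_number: int, pilot_hosts: set, first_day: date) -> date:
    """Return the day a host would restart under the proposed pattern.

    Hypothetical model: hosts are identified by an integer, and regions
    have been locked to a parity group one week ahead of the restart.
    """
    if host_number in pilot_hosts:
        return first_day                      # day 1: preliminary/pilot hosts
    if host_number % 2 == 0:
        return first_day + timedelta(days=1)  # day 2: even-numbered hosts
    return first_day + timedelta(days=2)      # day 3: odd-numbered hosts

# Example: a restart week that begins on Tuesday, 17 Feb 2009.
tuesday = date(2009, 2, 17)
for host in (7, 4, 5):
    print(host, restart_day(host, pilot_hosts={7}, first_day=tuesday))
```

With a published rule like this, a region owner who knows the parity of their host could look up their restart day a week ahead, which is the predictability the proposal is after.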
|
Prospero Linden
Linden Lab Employee
Join date: 6 Aug 2007
Posts: 315
|
02-13-2009 11:47
From: Argent Stonecutter
Why do the restarts take so long if there's migration going on as well? Waiting for a free server to migrate the region to? Or just handling migration of that many regions?
Most restarts will be relatively fast, as they just start up on another host. However, we'll be doing enough hosts at one time that there will not always be enough spares to start all regions immediately. As such, some will have to wait a bit for some of the downed hosts to come back.
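The capacity trade-off described above can be put into rough numbers. The sketch below is a toy model under stated assumptions, not the real deployment tooling: every host carries the same number of regions, a fixed pool of empty spare hosts exists, hosts are reinstalled in equal batches, and each batch takes the same time. Only the 6000-server figure comes from this thread; the batch sizes, spare count, regions per host, and 45-minute reinstall time are made up for illustration.

```python
import math

def plan_rolling_restart(total_hosts, batch_size, spare_hosts,
                         regions_per_host, reinstall_minutes):
    """Return (regions_waiting_per_batch, total_minutes) for a toy model.

    Regions displaced from a batch start at once on spare capacity if any
    is free; the rest wait until the reinstalled hosts come back.
    """
    displaced = batch_size * regions_per_host
    spare_slots = spare_hosts * regions_per_host
    waiting = max(0, displaced - spare_slots)      # regions with nowhere to go
    batches = math.ceil(total_hosts / batch_size)  # batches run one after another
    return waiting, batches * reinstall_minutes

# Larger batches finish sooner but leave more regions waiting for a host.
print(plan_rolling_restart(6000, 200, 100, 4, 45))
print(plan_rolling_restart(6000, 100, 100, 4, 45))
```

In this model, halving the batch size so that it never exceeds the spare pool removes the long waits but doubles the total duration, which is the kind of balance between speed and impact discussed in this thread.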
|
Argent Stonecutter
Emergency Mustelid
Join date: 20 Sep 2005
Posts: 20,263
|
02-13-2009 12:05
From: Prospero Linden
However, we'll be doing enough hosts at one time that there will not always be enough spares to start all regions immediately. As such, some will have to wait a bit for some of the downed hosts to come back.
That's what I thought... waiting for a free server to come up.
|
Elanthius Flagstaff
Registered User
Join date: 30 Apr 2006
Posts: 1,534
|
02-13-2009 15:04
From: Shockwave Yareach
I agree with Atashi. It's understandable to everyone if you announce a couple of weeks ahead of time that the entire service will be down for maintenance for a day. But what you have now is no different than the cable company telling you you have to take an entire day off to stay at home, and the installer might show up on that day between 8am and 6pm - no promises though. Why not just have 4 days in the year where you go down to cold iron and do what repairs need to be done? Call them "I brake for reality Breaks" when everyone is prodded into going out and getting an ice cream or some such treat in the real world.
Woah, woah, woah. We had that with the old weekly downtime that lasted all day and it totally sucked. This way is a million times better.
_____________________
Visit http://ninjaland.net for mainland and covenant rentals or visit our amazing land store at Steamboat (199, 56). Also, we pay L$0.15/sqm/week for tier donated to our group and we rent pure tier to your group for L$0.25/sqm/week. Free L$ for Everyone - http://ninjaland.net/tools/search-scumming/
|
Atashi Toshihiko
Frequently Befuddled
Join date: 7 Dec 2006
Posts: 1,423
|
02-13-2009 15:08
Yeah, I'm not eager to go back to the all-day-downtime that we used to have on update Wednesdays. All I'd really like is to know what day my downtime is going to be on.
I figure that Prospero maybe has me on mute though... or he's ignoring me.
-Atashi
_____________________
Visit Atashi's Art and Oddities Store and the Waikiti Motor Works at beautiful Waikiti.
|
Sindy Tsure
Will script for shoes
Join date: 18 Sep 2006
Posts: 4,103
|
02-13-2009 15:13
From: Prospero Linden
Most restarts will be relatively fast, as they just start up on another host. However, we'll be doing enough hosts at one time that there will not always be enough spares to start all regions immediately. As such, some will have to wait a bit for some of the downed hosts to come back.
Aside from "it's a pain in the bum to do that" why not slow the process down enough that regions don't get restarted until there are spares up and ready to handle them? Not giving you a hard time.. Honestly curious.
|
Abigail Merlin
Child av on the lose
Join date: 25 Mar 2007
Posts: 777
|
02-14-2009 12:30
From: Sindy Tsure
Aside from "it's a pain in the bum to do that" why not slow the process down enough that regions don't get restarted until there are spares up and ready to handle them? Not giving you a hard time.. Honestly curious.
I can take an educated guess on that one. Once you start the upgrade process on one machine, you can move on to the next, so it would be a waste of time to wait for enough spare servers to be available. With 6000+ servers, all the time you can save is needed; it will likely make the difference between 4 days and 1 week or more.
|
Sindy Tsure
Will script for shoes
Join date: 18 Sep 2006
Posts: 4,103
|
02-14-2009 12:55
That'd mean that the number of sims that stay down because there isn't a spare keeps increasing as the process moves along. At least until sometime near the end when Prospero is starting more than stopping.. Is that exactly what's going to happen?
If they wanted to avoid that, or if they wanted to clamp the number of sims down at once, I don't think that the whole process ends up being much faster. You just net the time that the max number of sims down for an extended time would have taken.
|
Prospero Linden
Linden Lab Employee
Join date: 6 Aug 2007
Posts: 315
|
02-14-2009 13:42
Sindy -- if we slowed it down, the process would continue into yet another week. It's a balance between getting the thing done in a short period of time and trying to make the impact as small as possible. Hopefully we've managed to get the balance *roughly* right.
Re: Atashi and complaints about rolling restarts, I fear I've had this conversation in *so many* rolling restart threads by now that I'm a little talked out on it. Yes, I know that people don't like rolling restarts. Yes, I know it's disruptive. Yes, we do hope to make it better as time goes on. No, I don't want to announce which servers will be hit ahead of time, because we can't really reliably predict that. However, I also think that people overblow the effects of them somewhat.
Right now, about 1/10 of the pilot regions are volunteers. If we can get more volunteers for the pilot regions, that will make things more predictable. However, we do need to fill out the pilot group if we don't have enough volunteers, so I continue to select them randomly.
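The pilot-group selection described above (volunteers first, then a random fill) might look something like the following sketch. It is an assumption-laden illustration rather than the actual selection process: the function name `pick_pilot_regions`, the region names, and the `pilot_size` and `seed` parameters are all hypothetical.

```python
import random

def pick_pilot_regions(all_regions, volunteers, pilot_size, seed=None):
    """Build a pilot group: volunteers first, the rest chosen at random."""
    rng = random.Random(seed)  # seeded only so the example is repeatable
    volunteer_set = set(volunteers)
    # Volunteers are taken first, up to the size of the pilot group.
    chosen = [r for r in all_regions if r in volunteer_set][:pilot_size]
    # Any shortfall is filled with randomly selected non-volunteer regions.
    remaining = [r for r in all_regions if r not in volunteer_set]
    shortfall = pilot_size - len(chosen)
    chosen += rng.sample(remaining, min(shortfall, len(remaining)))
    return chosen

regions = [f"region{i}" for i in range(20)]
pilot = pick_pilot_regions(regions, ["region3", "region7"], pilot_size=5, seed=1)
print(pilot)
```

The more volunteers there are, the smaller the random portion becomes, which matches the point that more volunteers would make the pilot stage more predictable.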
|
Linnrenate Crosby
Registered User
Join date: 5 Jun 2007
Posts: 49
|
02-15-2009 01:39
All I ask, Prospero, is that this time and in the future you issue a gridwide warning (dialogue pop-up) that the roll-out has started. Please do that on all the days you have scheduled the roll-out so your residents will be warned. After all, some of us are building things and would appreciate a warning so we can pick up our work and store it until later.
|
Tayra Dagostino
Registered User
Join date: 9 Jun 2007
Posts: 7
|
02-15-2009 04:06
From Sarge to Etch?
But today Lenny was announced as stable......
|
Chandra Magic
Registered User
Join date: 12 Jun 2008
Posts: 17
|
02-15-2009 04:53
From: Tayra Dagostino
from sarge to etch? but today lenny announced as stable......
Yah, but so is 'Windows'. And I think they want to drive it into the ground before they deploy it to their 6000+ servers. It wouldn't do to get through a huge rolling restart and then have to do it again because their servers didn't like it, as Prospero already mentioned earlier in this thread.
|
Atashi Toshihiko
Frequently Befuddled
Join date: 7 Dec 2006
Posts: 1,423
|
02-15-2009 05:43
From: Prospero Linden
Re: Atashi and complaints about rolling restarts, I fear I've had this conversation in *so many* rolling restart threads by now that I'm a little talked out on it. Yes, I know that people don't like rolling restarts. Yes, I know it's disruptive. Yes, we do hope to make it better as time goes on. No, I don't want to announce which servers will be hit ahead of time, because we can't really reliably predict that. However, I also think that people overblow the effects of them somewhat.
Prospero, thanks for responding. I'm a bit troubled by a few things you've said, which makes me think that I have done a poor job in communicating my concerns - you seem not to understand what I am trying to say. Either that, or you've attached my name to a response that seems to have little or nothing to do with what I had posted.
I am not complaining about rolling restarts, and I'm certainly not advocating we go back to "Update Wednesdays." What I am looking for is a more accurate prediction of when downtime will strike my regions, so that I can plan around it.
I do understand that at the moment you can't reliably predict what servers will go down when, and I respectfully suggest that this is a problem which needs to be addressed. I made a suggestion in my previous post which I thought could perhaps resolve that problem.
And finally, I appreciate that most of the time, these things are a minor disruption. It is those occasions when it is not a minor disruption that make it an issue. I honestly believe that your statement that you 'think people overblow the effects' is due to what many customers perceive as an overall lack of respect that Linden Lab has for its clients. When a customer tells you something concerns them, you don't tell the customer that they're just overreacting. Even if you think they are.
I don't want to belabour the issue further, but I do hope I have been able to phrase my concerns in a way that you understand - and I will conclude with a simple straightforward question: If we can't get accurate predictions of the restart downtime now, can steps be taken so that we will get accurate predictions of restart downtime in the future?
Thank you for your time.
-Atashi
_____________________
Visit Atashi's Art and Oddities Store and the Waikiti Motor Works at beautiful Waikiti.
|
Cincia Singh
Registered User
Join date: 26 Jun 2007
Posts: 79
|
02-15-2009 06:24
I'm confused by all this angst over the unpredictability of the restarts, and perhaps multiple restarts of a region. Exactly what is it that is going on in any region, that would be so negatively impacted by a rolling restart, that we see this constant stream of pleading to make things more predictable and limited to just one restart per sim? Actually I think I know the answer to my own question, since there are to my knowledge only a very few things that have to be manually restarted, re-logged and put back into service and I'm here to tell you that if this is what all the fuss is about I say we need more unpredictable rolling restarts, not fewer.
Good luck with the etch-i-fication of the grid Prospero! Things are slowly getting better and you and your team are one of the major reasons why!
|
Linnrenate Crosby
Registered User
Join date: 5 Jun 2007
Posts: 49
|
02-15-2009 06:41
From: Cincia Singh
Exactly what is it that is going on in any region, that would be so negatively impacted by a rolling restart, that we see this constant stream of pleading to make things more predictable and limited to just one restart per sim?
Cincia, people build, get married, run events... many things really. I know that planning marriages and events takes a long time and has to be done well in advance, so of course when attending events like that, people do not want to have to worry about a sim restart at the most critical time in the event. As for builders... well, we sometimes work on big builds that take time to put back in the inventory; a 2 min. warning will not be enough to do that safely. Again... that's why I also ask Prospero to send out a gridwide warning.
|
bikerchad69 Cooperstone
Registered User
Join date: 13 Aug 2008
Posts: 4
|
02-15-2009 12:16
From: Prospero Linden
Just keep using Second Life as normal. If you get a notification that the sim is about to go down, then go somewhere else for a few minutes.
Going somewhere else is easy to say, but not always easy to do when TPs fail, or worse, you pick a region that is also about to restart and end up getting kicked off the grid anyway. Not a complaint, just an observation. I'm all for anything that makes the grid more stable.
|