Welcome to the Second Life Forums Archive

So... was the hard drive crash reason a lie?

Aaron Levy
Medicated Lately?
Join date: 3 Jun 2004
Posts: 2,147
12-30-2004 22:15
I've replaced a hard drive before. In fact I've replaced dozens.

A hard drive crash this was not.
Jillian Callahan
Rotary-winged Neko Girl
Join date: 24 Jun 2004
Posts: 3,766
12-30-2004 22:17
A lie? That's just silly.

The dead hard drive appears to have been the first event in a cascade, though.
Aaron Levy
Medicated Lately?
Join date: 3 Jun 2004
Posts: 2,147
12-30-2004 22:20
EVERY time Linden Labs "fixes" something it starts a "cascade." Every time.
Lex Neva
wears dorky glasses
Join date: 27 Nov 2004
Posts: 1,361
12-30-2004 22:21
Oh come on, Aaron. I've replaced a disk in a machine, too, but I haven't done it on an enterprise machine running a database that many people access continuously. I'm sure there's a rebuild process involved; it's not like they can just go drop a new disk in and the world is happy.

I frankly have no idea what is involved here, and neither do you. You can't sit here and guess from the sidelines, and bitching or suggesting conspiracy only puts more negative pressure on the Lindens. Just be patient, ok? You don't know what's going on in there, so don't insult their abilities.
Khamon Fate
fategardens.net
Join date: 21 Nov 2003
Posts: 4,177
12-30-2004 22:23
if i had to rebuild a raid drive on a database server with no backup system in place, i'd take the grid down for a few hours myself. sometimes though, the corrupt data is just the end result of another hardware problem in the system.

we've had this happen twice this year on dell servers. file systems died, wouldn't rebuild properly, their answer was always that our linux kernel (up for several months each) must've been deranged. but it was their hardware gone buggy both times. won't be using those anymore. guarantee you that.
_____________________
Visit the Fate Gardens Website @ fategardens.net
Hank Ramos
Lifetime Scripter
Join date: 15 Nov 2003
Posts: 2,328
12-30-2004 22:25
Update on Server Issues and Downtime:
/invalid_link.html
Tread Whiplash
Crazy Crafter
Join date: 25 Dec 2004
Posts: 291
There's a Difference...
12-30-2004 22:30
Aaron -

How many 100 to 200-server systems have you run in your life?

I have been on 2 teams with that many servers to support; and let me tell you that there's a WORLD of difference between replacing a Hard Drive on your local PC, and replacing a Hard Drive on a distributed computing system or server farm. This is not an all-inclusive list; but let me hit some points for you:

1) The system was on a RAID array, from all appearances. This means that you have to replace the disk, reformat it, and restripe it / reconfigure your RAID array.

2) You have to restore all data from backup that you can / clone the proper data from your RAID system. This possibly means including all information up TO the moment of the crash - but not any possibly corrupt or improper data from the crash or the time thereafter (unless it was initiated by a system that is still "OK"). This can take time and research.

3) You need to check to see if there was a problem with other parts of the machine that caused the HD failure - it would be BAD to put a fresh disk in, only to have it get fried a day later!!

4) With the new hardware in place, you need to bring up the systems in a controlled manner, to ensure connectivity and such.

5) With users jumping in and out of the game, having problems connecting, etc.... you now have the task of straightening out the various interdependent systems and re-synchronizing anything that needs to be.

6) You may need to update all your system logs, monitoring programs, and backup systems - perhaps doing an immediate full system backup once you DO have the situation straightened out, in case another failure in the near-term messes things up.


...AND all of the above assumes that you already have the spare parts on hand AND the data-center is local. At Sierra / WON.net, we had data-centers all over the USA, but we were all based at an office in Seattle - so travel time could add to any crisis. We tried to be triply-redundant in everything, but our multiplayer systems weren't persistent worlds; AND we ran up frightfully expensive costs with all of our redundancy and load-balancing equipment. As a size comparison - Sierra at the time probably employed 500 - 800 people in the US. LL employs around 30, according to their website. That difference in size translates DIRECTLY into a difference in the amount of $$ and personnel you can throw at any given problem.

Oh, and what would lying gain them, anyways?? Save the conspiracy theories for the faked Moon-landing or LBJ's assassination of JFK. :-)

Just food for thought.

Take care,

--Noel "HB" Wade
(Tread Whiplash)

Edit: Good link, Hank.... man am I GLAD I'm not at LL for this! Lots of caffeine abuse there, for sure...
Aaron Levy
Medicated Lately?
Join date: 3 Jun 2004
Posts: 2,147
12-30-2004 22:37
From: someone
How many 100 to 200-server systems have you run in your life?


Three, actually. One for a Dana Corporation plant of 290 workstations and 50 servers, one for a Rolls-Royce Energy plant consisting of 190 servers (one for each supplier -- it was a complex system I'd rather forget, to be honest), and one for a Mega-Church in Arizona that housed 20 servers and some 400-500 computers all over Phoenix, tied together through a WAN connection that wanted to break down whenever a good dust storm came rolling into town.

Don't jump to conclusions about my experience. Save all your "look what I've done" stuff. I've done everything on your little list.

The lie comment was tongue-in-cheek anyway. Get to know me and you'll know that about me. Jump to conclusions about my professional experience and you'll be wrong on just about every point you make.
Antagonistic Protagonist
Zeta
Join date: 29 Jun 2003
Posts: 467
12-30-2004 22:41
From: someone
Oh come on, Aaron. I've replaced a disk in a machine, too, but I haven't done it on an enterprise machine running a database that many people access continuously.


I have, and it's a royal pain in the ass.

Hopefully the Lindens aren't silly enough to be using RAID 5.

-AP
Aaron Levy
Medicated Lately?
Join date: 3 Jun 2004
Posts: 2,147
12-30-2004 22:41
I've been on huge server teams for several different companies, in total charge of one, and no, it's not fun, pretty, or easy. But like Ian's post said, they weren't prepared for this.

One of my duties for the Dana plant I worked at was to sit and brainstorm things that could go wrong and have funds and emergency plans in place if they ever did. I had to have a hurricane procedure for a plant in Ohio, for crying out loud.

No conspiracy... just lack of planning... the NUMBER ONE killer of businesses no matter what their size or vision.
Azelda Garcia
Azelda Garcia
Join date: 3 Nov 2003
Posts: 819
12-31-2004 03:04
> It's not like they can just go drop a new disk in and the world is happy.

Actually it is.

Compaq server:
- pull the two red levers
- remove disk, give to nice Compaq technician
- insert new disk
- close red levers

Watch as RAID rebuilds, but everything was still running throughout this whole time; though performance will degrade somewhat when you put the new disk in, as the RAID rebuilds.

Note that RAID isn't the be-all and end-all, since RAID controllers themselves have a habit of dying. I've seen at least 3 die, and when they die, that's the end of that machine for maybe 2 hours till the Compaq technician gets there and replaces it.

Azelda
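For anyone wondering what "the RAID rebuilds" means for a parity-based level such as RAID 5, here is a minimal sketch of the idea. It is only an illustration: the block contents and the helper name `xor_blocks` are made up for this example, and a real controller works on whole stripes in hardware rather than on four-byte strings. The point is simply that the parity block is the XOR of the data blocks, so any one missing block can be recomputed from the others.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks from one stripe (made-up contents for illustration).
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block written alongside them is the XOR of all data blocks.
parity = xor_blocks(data)

# Pretend the disk holding data[1] died. XOR-ing the surviving data blocks
# with the parity block recomputes the missing one.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

The array can keep serving reads during this process by doing the same XOR on the fly, which is one reason performance degrades while the rebuild runs.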
Hiro Pendragon
bye bye f0rums!
Join date: 22 Jan 2004
Posts: 5,905
12-31-2004 03:10
From: Aaron Levy
EVERY time Linden Labs "fixes" something it starts a "cascade." Every time.

I doubt it. I'm sure there are many fixes invisible to users that never get mentioned.

This post is just baseless gossip-hounding, Aaron. Come on, now. You're not an old woman at a beauty salon, are you? We've had problems where they didn't even know what was wrong, and they've told us that straight out. What reason do they have to lie? Do you have some sort of inside knowledge of SL's hardware and software network, so that you know how everything interacts?
_____________________
Hiro Pendragon
------------------
http://www.involve3d.com - Involve - Metaverse / Emerging Media Studio

Visit my SL blog: http://secondtense.blogspot.com
Eggy Lippmann
Wiktator
Join date: 1 May 2003
Posts: 7,939
12-31-2004 03:15
*chuckles at LL's antics* :)
Hiro Pendragon
bye bye f0rums!
Join date: 22 Jan 2004
Posts: 5,905
12-31-2004 03:17
From: Azelda Garcia
> It's not like they can just go drop a new disk in and the world is happy.

Actually it is.

Compaq server:
- pull the two red levers
- remove disk, give to nice Compaq technician
- insert new disk
- close red levers

Watch as RAID rebuilds, but everything was still running throughout this whole time; though performance will degrade somewhat when you put the new disk in, as the RAID rebuilds.

Note that RAID isn't the be-all and end-all, since RAID controllers themselves have a habit of dying. I've seen at least 3 die, and when they die, that's the end of that machine for maybe 2 hours till the Compaq technician gets there and replaces it.

Azelda

Actually, it isn't.

Have you dealt with anything besides RAID 0 or 1?
http://www.adaptec.com/worldwide/product/markeditorial.html?sess=no&prodkey=quick_explanation_of_raid
I guaran-frigging-tee that the Compaq techs have never touched a disk-striping RAID 5 array. :)
_____________________
Hiro Pendragon
------------------
http://www.involve3d.com - Involve - Metaverse / Emerging Media Studio

Visit my SL blog: http://secondtense.blogspot.com
Azelda Garcia
Azelda Garcia
Join date: 3 Nov 2003
Posts: 819
12-31-2004 04:19
Most people just wing everything into hardware RAID 5, then partition the whole RAID 5 a little.

In theory, one should use mirroring for the OS partition and for the journalling partition (for speed), and RAID 5 for the data partition (allows more data for the same disk space), but no one ever does that because it's too much hassle.

Not sure why you think that Compaq wouldn't support RAID 5. http://h30099.www3.hp.com/configurator/DataArea.asp

Azelda
nonnux white
NN Dez!gns
Join date: 8 Oct 2004
Posts: 90
12-31-2004 04:49
it is possible to change hard drives with only 1 click. no data lost, no hardware reboot, no nothing. if lindens say the main reason is 1 single hard drive, i don't believe it. i know some servers (already touched them) serving about 4000 students where u can remove 1 of the 4 discs just by clicking on the front panel. replacing is not so fast, but maybe 5 seconds more (if the disc is not wrapped)
Maxx Monde
Registered User
Join date: 14 Nov 2003
Posts: 1,848
12-31-2004 04:52
I heard it was nose-goblins.
Shadow Weaver
Ancient
Join date: 13 Jan 2003
Posts: 2,808
12-31-2004 07:32
Ok, this may be off the topic of the thread, but let me ask this. Even if LL was using a RAID 5 controller, that is only a one-step backup process. Granted, the information was striped across several drives, but it's not a totally secure redundancy, and SL needs multiple redundant systems for critical asset servers. My question is: if asset servers are crashing due to an overage of inventory items, wouldn't this be a good reason for people to start having the ability to store "THEIR" content on the client side?

Just a thought now mind you.

Shadow
_____________________
Everyone here is an adult. This ain't DisneyLand, and Mickey Mouse isn't going to swat you with a stick if you say "holy crapola."<Pathfinder Linden>

New Worlds new Adventures
Formerly known as Jade Wolf my business name has now changed to Dragon Shadow.

IM me in world for locations of my apparel

Online Authorized Trademark Licensed Apparel
http://www.cafepress.com/slvisions
OR Visit The Website @
www.slvisions.com
Tread Whiplash
Crazy Crafter
Join date: 25 Dec 2004
Posts: 291
...Client Side BAD!
12-31-2004 11:28
Storing ANYTHING beyond basic, non-critical information on the Client is bad bad bad!

I say this as an experienced Web Developer and an experienced Games-Industry-Vet (3 years at Sierra, running the multiplayer gaming servers for Half-Life, Homeworld, Ground-Control, and many others).

Why, you ask? Because griefers and bored trolls take it as an invitation to start trying to hack & exploit the hell out of the game. The rule of thumb is this: If you put it on the client, someone SOMEWHERE will figure out how to get to it and mess with it. Then, they will post it on the web, or start telling all their friends. It eventually turns into a never-ending "Running Gun-Battle" between the devs and the exploiters - each taking a turn at being "ahead of the curve"; and everyone else having to download patch after patch after patch...

The only way to truly protect the information, is to make the server responsible for it - and then make SURE your server is well-protected from remote attacks.

Look at all the problems with cheating online in FPS games like Counter-Strike - because the Server relies on the client to calculate some things. There's just no way around it.

You can spend a lot of time and energy with "security through obscurity" - hiding and trying to use trickery to conceal when and where you store information on the client... But ultimately that's a "hack-ish" way to do it - and you will still be vulnerable. You can also make the server "authenticate" every client-side object against a record of certain specs for that object that are stored on the server - but then you're just duplicating your data and slowing down the server for no necessary reason (just store it only on the server)! I suppose you COULD use some sort of public/private key encryption scheme to store the data on the client's drive; and only allow the server to decrypt it - BUT that still leaves you vulnerable to exploit when the object is in system memory on the client, getting ready to be encrypted & stored.

Bottom line: The client is unreliable and should be treated like a "black box" from the server's perspective. Unlike the client, the Server can be controlled and secured by the creators of the product - and they can ensure that everyone has the same gameplay experience. It sucks, but such is life...

Take care,

--Noel "HB" Wade
(Tread Whiplash)
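To make the "authenticate every client-side object against a record stored on the server" idea a bit more concrete, here is a rough sketch. It is not how SL's asset system works; the `AssetRegistry` class, the asset ID, and the payloads are all hypothetical names invented for this example. It assumes the server keeps only a SHA-256 digest per asset rather than a full copy, and rejects anything the client sends back that does not match.

```python
import hashlib
import hmac


class AssetRegistry:
    """Hypothetical server-side record of digests for client-stored assets."""

    def __init__(self):
        self._digests = {}

    def register(self, asset_id, payload):
        # Store only the digest of the asset, not the asset itself.
        self._digests[asset_id] = hashlib.sha256(payload).hexdigest()

    def verify(self, asset_id, payload):
        # Unknown assets are rejected outright.
        expected = self._digests.get(asset_id)
        if expected is None:
            return False
        actual = hashlib.sha256(payload).hexdigest()
        # Constant-time comparison of the two hex digests.
        return hmac.compare_digest(expected, actual)


registry = AssetRegistry()
registry.register("chair-001", b"original prim data")

print(registry.verify("chair-001", b"original prim data"))
print(registry.verify("chair-001", b"tampered prim data"))
```

Even in this stripped-down form the server still does a lookup and a hash for every object the client presents, which is the extra server work the post above is warning about. It also does nothing for the case where the data is modified in client memory before the server ever registers it.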
Dallas Moreau
Registered User
Join date: 7 Dec 2004
Posts: 146
12-31-2004 11:38
Perhaps the real test will come when it happens again. We grant the company understanding and a big-cheer-when-fixed the first time something like this happens. The next time, assuming they've had the foresight and done the planning, we cheer them again when the crisis is handled well.

edited to add a missing word
Azelda Garcia
Azelda Garcia
Join date: 3 Nov 2003
Posts: 819
01-01-2005 03:25
Tread,

The suggestion is not to distribute the asset server to the client, but to let people store their own creations on the client (basically anything to which they have full perms). There's not much of a security issue here, beyond basic buffer overflows etc. on the import.

Azelda
Cutter Rubio
Hopeless Romantic
Join date: 7 Feb 2004
Posts: 264
01-01-2005 08:16
The big difference here is your definition of Enterprise Computing. Having 200 or more servers does not make an enterprise system in itself. It's much more about design and management than that.

I manage a clearinghouse system for healthcare billing and eligibility transactions, which as a rule has to be up all the time. We run clustered HP-UX systems to accomplish that, on top of fully redundant server hardware and dual-redundant HP Enterprise Virtual Arrays. We lose disk drives all the time - we never go down for it. We've lost EVA controllers, which are themselves redundant inside the array. Not a lick of down time for that. That's what you get when you spend a quarter of a million each on competent disk arrays.

This is what LL needs to decide - do they intend to run this service, and that's what it is, on an Enterprise class platform? If so, it's time to start spending that $8 million on infrastructure. They need to be running a serious database on the backend, like Oracle 10g, that can scale to the kinds of volume they need and take advantage of an underlying, fully redundant hardware platform.

I'm a huge fan of Open Source, and anything in general that takes Microsoft down a peg or two, but MySQL is not, and never will be, an enterprise database system. It was probably fine for a startup company, but plans to migrate should have been made a long time ago. LL has some great vision, but they missed looking far enough forward on this one...

I've been doing what the Linden grid monkeys are doing for a bit more than 20 years, first on VMS and VMS clusters, then on Tru64 UNIX, AIX, and HP-UX clusters. I've done it almost exclusively in the health care arena, where downtime just doesn't cut it. It CAN be done - the question is "Will they?"
_____________________
The early bird may get the worm, but the second mouse gets the cheese.
Antagonistic Protagonist
Zeta
Join date: 29 Jun 2003
Posts: 467
01-01-2005 09:29
Using RAID5 these days is silly.

It was designed to be less expensive than mirroring a drive, and with the low price of drives these days, it just doesn't (usually) make much sense to use RAID5. RAID 1 + 0 is the way to go .. and be sure to use disks from different vendors for your mirrors, or at a minimum disks from different manufacturing batches.

-AP
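To put rough numbers on the RAID 5 versus RAID 1+0 trade-off discussed in this thread, here is a small capacity calculation. The function name `usable_capacity` and the six 73 GB disks are assumptions picked purely for illustration, not anything known about LL's hardware. RAID 5 spends one disk's worth of space on parity, while RAID 1+0 spends half of the disks on mirrors.

```python
def usable_capacity(level, disks, disk_gb):
    """Return usable space in GB for an array of identical disks."""
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        # One disk's worth of capacity goes to parity.
        return (disks - 1) * disk_gb
    if level == "raid10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 1+0 needs an even number of disks, 4 or more")
        # Every disk is mirrored, so half the raw capacity is usable.
        return (disks // 2) * disk_gb
    raise ValueError(f"unsupported level: {level}")


# Hypothetical array: six 73 GB disks.
for level in ("raid5", "raid10"):
    print(level, usable_capacity(level, 6, 73), "GB usable")
```

So the same six disks give noticeably more room under RAID 5, which is the "more data for the same disk space" argument, while RAID 1+0 pays for its mirrors in capacity.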
Deklax Fairplay
Black Sun
Join date: 2 Jul 2004
Posts: 357
01-01-2005 10:30
It didn't take them that long to fix, they just locked the rest of us out and let their favorites play. Their system can't handle us all and yes they need scalability. Fat chance.

Dallas Moreau: You obviously haven't been around very long.
MrsJakal Suavage
Purple Butterfly
Join date: 18 Jul 2004
Posts: 1,434
01-01-2005 10:42
From: Deklax Fairplay
It didn't take them that long to fix, they just locked the rest of us out and let their favorites play. Their system can't handle us all and yes they need scalability. Fat chance.

Dallas Moreau: You obviously haven't been around very long.


I must be a favorite then :p