Welcome to the Second Life Forums Archive

It Is Official---nothing will ever be fixed

Calliope Simon
Registered User
Join date: 21 May 2006
Posts: 154
08-12-2007 12:59
Just took a look at the blog, seems they took the advice of just coming out with the truth.

Unfortunately, they probably should have just stayed quiet.

IPSEC tunneling to throw SL traffic between hubs over the INTERNET? That's not how you query databases in San Francisco from Texas. You do that kind of thing on an absolutely static route over a single backbone provider that does hard frame encryption between their routers and has nothing to do with VPNing---because that's fast and stable. Not slow and unreliable.

And a single hard drive in a RAID array capable of bringing down AUTHENTICATION? This just proves beyond any doubt that the people who designed this stuff are IDIOTS. There are NO circumstances in which it is excusable to allow a single hard drive failure to bring down a RAID array. Let's just remind ourselves that RAID stands for:

Redundant Array of Inexpensive Disks.

REDUNDANT array. REDUNDANT. That means that if one goes down, you don't lose the array.

In fact, there are three RAID hardware manufacturers that I can think of off the top of my head who will send you a new drive often before you've even realized that one went bad---less than four hours usually. Swapping shouldn't be any more than walking up to the rack, yanking out the broken drive and slapping the new one in, and then watching it rebuild itself automatically.

That a failed drive brought down an entire array can mean either or both of the following:

1. More than one drive had failed in that array, and the array lost parity, and therefore failed. It is so highly unlikely that two or more drives would fail at the same time that it's safe to assume that it didn't happen---and that it had *already had a failed drive or drives in it that hadn't been replaced when the last one failed*.

2. It was set up without parity.

Both are absolutely novice moves and entirely without excuse.
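
To put rough numbers on point 1, here is a back-of-envelope sketch. It assumes drive failures are independent with a constant failure rate, which is a simplification, and the 120,000-hour MTBF and the five-drive array are illustrative assumptions rather than anything from the blog post. It compares the chance of a second failure inside a four-hour replacement window with the chance over a month of leaving a degraded array unattended.

```python
import math


def second_failure_probability(drives_left: int, hours_exposed: float, mtbf_hours: float) -> float:
    """Chance that at least one remaining drive fails within the exposure window.

    Assumes independent drives with a constant failure rate of 1/MTBF.
    Real drives from the same batch can fail closer together than this model suggests.
    """
    combined_rate = drives_left / mtbf_hours
    return 1.0 - math.exp(-combined_rate * hours_exposed)


# Illustrative assumptions, not Linden Lab's actual hardware.
MTBF_HOURS = 120_000
DRIVES_LEFT = 4  # hypothetical 5-drive array that has already lost one drive

prompt_swap = second_failure_probability(DRIVES_LEFT, 4, MTBF_HOURS)
neglected = second_failure_probability(DRIVES_LEFT, 30 * 24, MTBF_HOURS)

print(f"second failure within 4 hours: {prompt_swap:.4%}")
print(f"second failure within 30 days: {neglected:.4%}")
```

Under these assumptions the exposure grows with every hour the dead drive stays in the rack (rebuild time adds to it as well), which is the case for swapping a failed drive right away.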
Kascha Matova
Bus Bench Supermodel
Join date: 30 Mar 2007
Posts: 342
08-12-2007 13:45
Does this mean RAID doesn't kill roaches dead? :D

No really - that did seem kinda peculiar. The whole purpose of RAID is to prevent single points of failure in the storage system. Even a simple mirroring setup can prevent catastrophe with one drive. Not the most efficient way to do it but...

Point 2 is frightening. A striped set? And that's it? I know mirroring has no parity either but it would have survived a single disk failure. That only leaves RAID 0 since the rest have either dedicated or distributed parity. Egads!
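
For anyone wondering what parity actually buys you, here is a minimal sketch of the XOR idea behind parity-based levels such as RAID 5. It is a toy model over byte strings, not how any real controller is implemented, and the block contents and function names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks, one per data disk.
data = [b"AUTH", b"DATA", b"SL07"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data)

# Simulate losing the second disk: XOR the survivors with the parity block
# to get the missing block back.
rebuilt = xor_blocks([data[0], data[2], parity])

print(rebuilt == data[1])
```

A plain striped set (RAID 0) keeps no parity block at all, so in this toy model there would be nothing to XOR the survivors against.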
Day Oh
Registered User
Join date: 3 Feb 2007
Posts: 1,257
08-12-2007 14:04
They did say the drive didn't really fail, but half-failed in such a way that it kept working, but very very poorly.

I don't really know this hardware stuff, I really just want this info in here so it's responded to
Sindy Tsure
Will script for shoes
Join date: 18 Sep 2006
Posts: 4,103
08-12-2007 14:28
From: Day Oh
They did say the drive didn't really fail, but half-failed in such a way that it kept working, but very very poorly.

I don't really know this hardware stuff, I really just want this info in here so it's responded to

Don't go mixing facts in, Day.. This isn't about facts, it's about FUD.
Rusty Satyr
Meadow Mythfit
Join date: 19 Feb 2004
Posts: 610
08-12-2007 14:45
Wow, wish I could do IT in your universe, where problems are always crystal clear and obvious, and nothing ever partially fails in an un-trappable way.

VPN for connectivity between sites is clearly something they've adopted as necessary... if they're going to open-source the SIM side of SecondLife, they are not going to be able to set up dedicated secure lines between each SIM server and their back-end asset/auth services.
Malachi Petunia
Gentle Miscreant
Join date: 21 Sep 2003
Posts: 3,414
08-12-2007 15:59
From: someone
VPN for connectivity between sites is clearly something they've adopted as necessary... if they're going to open-source the SIM side of SecondLife, they are not going to be able to set up dedicated secure lines between each SIM server and their back-end asset/auth services.
But it isn't something that need be done *now* for California<->Texas traffic especially given its now known failure mode.

If they really are "co-locating" across the public Internet (and not just writing poorly in the blog) they should be happy that it works at all, ever. Contrariwise, if they are using IPSec over a dedicated link, they really need to buy better site-to-site transport.

I had the same reaction to the blog entry as did the OP, that this is how I'd expect "Fred's IP Hosting" system to be put together.
Calliope Simon
Registered User
Join date: 21 May 2006
Posts: 154
08-12-2007 16:53
From: Day Oh
They did say the drive didn't really fail, but half-failed in such a way that it kept working, but very very poorly.

I don't really know this hardware stuff, I really just want this info in here so it's responded to


Any modern RAID array would have detected that sort of thing immediately---since it will always, always produce massive amounts of channel errors. And then you yank it and slap a new drive in, then go home and let it rebuild itself (while it continues to function normally through the entire process)
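
A rough sketch of the kind of check being described: flag a member drive whose error count or latency stands out from its peers even though it still answers requests. The DriveStats fields and the thresholds are hypothetical; real controllers report this through their own tools and SMART data, and this is not any vendor's API.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class DriveStats:
    name: str
    io_errors: int         # errors seen in the sample window (hypothetical counter)
    avg_latency_ms: float  # average request latency in the sample window


def find_degraded(drives, max_errors=50, latency_factor=5.0):
    """Return names of drives that still respond but look unhealthy."""
    typical_latency = median(d.avg_latency_ms for d in drives)
    return [
        d.name
        for d in drives
        if d.io_errors > max_errors
        or d.avg_latency_ms > latency_factor * typical_latency
    ]


sample = [
    DriveStats("disk0", io_errors=0, avg_latency_ms=8.0),
    DriveStats("disk1", io_errors=2, avg_latency_ms=9.5),
    DriveStats("disk2", io_errors=4100, avg_latency_ms=310.0),  # the "half-failed" drive
    DriveStats("disk3", io_errors=1, avg_latency_ms=8.7),
]

degraded = find_degraded(sample)
print(degraded)
```

Comparing each drive against the median of its peers is one simple design choice; it catches a drive that is slow but not dead without needing a fixed latency number for every model of disk.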
Calliope Simon
Registered User
Join date: 21 May 2006
Posts: 154
08-12-2007 16:55
From: Rusty Satyr
Wow, wish I could do IT in your universe, where problems are always crystal clear and obvious, and nothing ever partially fails in an un-trappable way.

VPN for connectivity between sites is clearly something they've adopted as necessary... if they're going to open-source the SIM side of SecondLife, they are not going to be able to set up dedicated secure lines between each SIM server and their back-end asset/auth services.


They don't have to set them up---they're already there. Every backbone provider will lease bandwidth directly to anyone willing to pay them---and it's a lot less expensive, generally, than is assumed.
Brenda Connolly
Un United Avatar
Join date: 10 Jan 2007
Posts: 25,000
08-12-2007 17:19
Who's going to be responsible for translating all that into English?
_____________________
Don't you ever try to look behind my eyes. You don't want to know what they have seen.

http://brenda-connolly.blogspot.com
Rusty Satyr
Meadow Mythfit
Join date: 19 Feb 2004
Posts: 610
08-12-2007 17:43
From: Calliope Simon
since it will always


Really? Will it now?


I've been in IT for 20 years... The only things I'm sure of are these:

Complex systems will eventually fail in ways that even the best of people can not plan for.

And there will usually be some armchair quarterback prattling on about coulda-woulda-shoulda.
Osgeld Barmy
Registered User
Join date: 22 Mar 2005
Posts: 3,336
08-12-2007 18:08
i do agree with the RAID points, the entire point of running a raid in a situation like SL (or any other networking system) is not speed, it's redundancy

heck like i really give a crap if the disk systems are faster when i cannot log in because of a failure, this is truly a novice move, and if you need speed and redundancy use a stripe set with parity! if this is beyond you please let me know i'll show you how to do it in 1 min flat

linden labs, seriously contact me and i'll send you my (ms) networking essentials book that i got in a 101 class, there's like 3 chapters on exactly how to use this, written in a way that a housewife with no computer experience at all could understand
Osgeld Barmy
Registered User
Join date: 22 Mar 2005
Posts: 3,336
08-12-2007 18:12
From: Rusty Satyr
Really? Will it now?


I've been in IT for 20 years... The only things I'm sure of are these:

Complex systems will eventually fail in ways that even the best of people can not plan for.

And there will usually be some armchair quarterback prattling on about coulda-woulda-shoulda.



yea it will 99.999% of the time, unlike 20 years ago it's pretty easy for the system to notice that it has to do a lot of corrections for the data to be valid

and forgive me for not believing in your 20 year experience, but our 30 year veteran just last Friday tried to wire a 100BASE-T connection using straight-through serial wire (i'd be giving him credit by saying it was cat 1 but i won't because this shit was insulated with stone), and proceeded to fuss and cuss at it for almost an hour before i dropped a cat 5 cable on his desk, so ...

and i would rather trust an armchair that knows what they're doing vs a novice charging me money any day
Tod69 Talamasca
The Human Tripod ;)
Join date: 20 Sep 2005
Posts: 4,107
08-12-2007 18:19
At least they didn't call in "Geek Squad" (that we know of!) :eek:
_____________________
really pissy & mean right now and NOT happy with Life.
Malachi Petunia
Gentle Miscreant
Join date: 21 Sep 2003
Posts: 3,414
08-12-2007 18:19
From: someone
I've been in IT for 20 years... The only things I'm sure of are these:

Complex systems will eventually fail in ways that even the best of people can not plan for.

And there will usually be some armchair quarterback prattling on about coulda-woulda-shoulda.
So with all that experience do you find yourself making the same errors today as you did 10 years ago? I assume not.

The failures - as LL reported - "woulda" been novel or interesting 15 years ago but are now so passe as to be silly. As noted above, they have a mission critical system that can't tell them when a disk is in distress? Shame on them; that's what we now call a "solved problem". Can't contact your remote site and your operations software doesn't tell you before you notice? Another solved problem - unless you are LL, it seems.
Draco18s Majestic
Registered User
Join date: 19 Sep 2005
Posts: 2,744
08-12-2007 22:51
Worse is Better

Edit:
oh right. BBCode is down.

Worse is Better:
http://www.jwz.org/doc/worse-is-better.html
Dnali Anabuki
Still Crazy
Join date: 17 Oct 2006
Posts: 1,633
08-13-2007 01:05
From: Draco18s Majestic
Worse is Better

Edit:
oh right. BBCode is down.

Worse is Better:
http://www.jwz.org/doc/worse-is-better.html


Wonderful article..thanks for posting it..I've been wondering about whether to cul de sac myself by learning Lisp...now to find out where Java fits...
Kascha Matova
Bus Bench Supermodel
Join date: 30 Mar 2007
Posts: 342
08-13-2007 02:35
From: Rusty Satyr
Wow, wish I could do IT in your universe, where problems are always crystal clear and obvious, and nothing ever partially fails in an un-trappable way.

VPN for connectivity between sites is clearly something they've adopted as necessary... if they're going to open-source the SIM side of SecondLife, they are not going to be able to set up dedicated secure lines between each SIM server and their back-end asset/auth services.


Are failing drives 100% efficient and reliable in your universe? To the point where there would be no noticeable increase in read/write errors or other performance factors?

Can you pick me up at 8? :D
Kitty Barnett
Registered User
Join date: 10 May 2006
Posts: 5,586
08-13-2007 03:06
From: Rusty Satyr
VPN for connectivity between sites is clearly something they've adopted as necessary... if they're going to open-source the SIM side of SecondLife, they are not going to be able to set up dedicated secure lines between each SIM server and their back-end asset/auth services.
VPN might simply have been a necessity that came up, rather than a conscious decision.

If they originally developed the whole architecture with the assumption that everything would always be in the same colocation they might not have seen much of a need to implement secure communication.

When it became clear that the SF colo would no longer meet their needs, they would have had the option to either hurry and implement it, or go with something that would do the job without extra coding which is to VPN the two colocations together into one virtual network.

Just a guess for a scenario where VPN would make sense.
Rusty Satyr
Meadow Mythfit
Join date: 19 Feb 2004
Posts: 610
08-13-2007 12:04
From: Osgeld Barmy
yea it will 99.999% of the time, unlike 20 years ago it's pretty easy for the system to notice that it has to do a lot of corrections for the data to be valid

and forgive me for not believing in your 20 year experience, but our 30 year veteran just last Friday tried to wire a 100BASE-T connection using straight-through serial wire (i'd be giving him credit by saying it was cat 1 but i won't because this shit was insulated with stone), and proceeded to fuss and cuss at it for almost an hour before i dropped a cat 5 cable on his desk, so ...

and i would rather trust an armchair that knows what they're doing vs a novice charging me money any day


People make mistakes. Vendors release drivers and updates that cause strange problems. Heavy loads in a dynamic environment can cause new failure conditions that weren't tested for by vendors or implementation staff. Yes, MOST of the failures are expected and planned for.

There are still times when you have to call the vendor and beat them up for a while to get them to acknowledge that there are problems with their product not behaving according to spec.

I'd very happily retire today if it meant that hardware & software wouldn't need the likes of me anymore.
Jotheph Nemeth
Registered User
Join date: 9 Aug 2007
Posts: 142
08-13-2007 14:40
From: Rusty Satyr
Wow, wish I could do IT in your universe, where problems are always crystal clear and obvious, and nothing ever partially fails in an un-trappable way.

VPN for connectivity between sites is clearly something they've adopted as necessary... if they're going to open-source the SIM side of SecondLife, they are not going to be able to set up dedicated secure lines between each SIM server and their back-end asset/auth services.


Why does the idea of them doing this strike me as a bad idea?

It almost seems like they are trying to position themselves as only software, and in control of the money.

If someone else sets up their own servers, what's to prevent them from making counterfeit lindens? Or even a whole new money? They connect, but almost immediately they will start to diverge from the linden version.

Ok. This might not be so bad. In fact, it might mean real competition in terms of software and money. But it could also mean with real competition comes the end of the Lindens being in control.

If they insist on overseeing any other servers that connect, there might be little reason to do so.
AWM Mars
Scarey Dude :¬)
Join date: 10 Apr 2004
Posts: 3,398
08-14-2007 05:19
What's most worrying is, none of the VPN or RAID systems are new technology... I run a personal (for my business) RAID setup with 2 TB of HD space over a 4 disk setup. I use 120,000 hr MTBF HDs and have a throughput of 500 GB of precious data per month through this system without a single glitch. I don't consider my supporting system to be of corporate status either.
Unless LL have set up the RAID as OBD (One Big Disk), rather than mirror sets or fast access over singular disk sets, I can't see how their whole system went down..... geeez... for all my hosting services across the world supporting our business, I have never come across such flaky service, especially from a company with such a high $ value throughput. And they charge how much for basic hosting services?
_____________________
*** Politeness is priceless when received, cost nothing to own or give, yet many cannot afford -

Why do you only see typo's AFTER you have clicked submit? **
http://www.wba-advertising.com
http://www.nex-core-mm.com
http://www.eml-entertainments.com
http://www.v-innovate.com
Rusty Satyr
Meadow Mythfit
Join date: 19 Feb 2004
Posts: 610
08-14-2007 09:34
From: AWM Mars
and have a throughput of 500 GB of precious data per month


SL serves up almost that much data every minute. (Obviously, not from the same raid.)

I'd love to see the back-end architecture supporting SL and how it was laid out. I've been piecing together bits over the years as I hear specific mention of parts, but unlike sim servers (which probably fill more than 50 server racks) which are easy to estimate... asset servers, inventory servers and the like could be partitioned off in any quantity, with even more, depending on redundancy.

--
"Shouldn't" happens.
AWM Mars
Scarey Dude :¬)
Join date: 10 Apr 2004
Posts: 3,398
08-14-2007 10:02
From: Rusty Satyr
SL serves up almost that much data every minute. (Obviously, not from the same raid.)

Oops, typo time.. I meant per day on average.. however, I also said that my support systems are not considered corporate standard, which I would expect from LL.
Rusty Satyr
Meadow Mythfit
Join date: 19 Feb 2004
Posts: 610
08-14-2007 12:01
From: AWM Mars
Oops, typo time.. I meant per day on average.. however, I also said that my support systems are not considered corporate standard, which I would expect from LL.


(nod) I expect the same. I just know that in my small shop of 300+ misc servers that I, and my peers, get patched through to developer level engineers at our primary hardware & software vendors to resolve "unexpected problems", even when our deployment follows that vendor's suggested best practices.

I can only imagine how much more grief LL has with 10x as many servers.
Bobbyb30 Zohari
SL Mentor Coach
Join date: 11 Nov 2006
Posts: 466
12-06-2007 12:37
From: Draco18s Majestic
Worse is Better

Edit:
oh right. BBCode is down.

Worse is Better:
http://www.jwz.org/doc/worse-is-better.html


BBC will never get fixed...