Welcome to the Second Life Forums Archive

A single network cable breaks the grid? Where's the redundancy?

Scalar Tardis
SL Scientist/Engineer
Join date: 5 Nov 2005
Posts: 249
01-02-2007 14:48
(Comments are disabled in the blog, which is why this post is in here.)

This latest grid problem from yesterday suggests to me that LL does not use redundant network cable paths for its hardware, and just a single network cabling failure is enough to disrupt a large portion of the grid. For a company as large and important as LL, something needs to be done about this.

Are you planning to build a more redundant network for the grid, so that single-point failures like this don't happen again in the future?


The blog post says a GBIC failed. For non-technical people who do not know what this is, a GBIC is used to plug a fiber-optic network cable pair into a network switch.

[Image: an HP ProCurve 2824 switch, a GBIC, and a fiber-optic cable pair]

The thing is, any part of that link can fail. The glass-fiber cable can be damaged, the GBIC module can fail, the switch can fail, the power can fail, and so forth.

To deal with this, the network can be made redundant. This means there are at least two separate paths for information to follow from each server. If one of the paths fails, the system can auto-detect the failure, re-route information along the alternate path, and inform network operations of the failed path.
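
For anyone who prefers to see the idea as code, here is a minimal Python sketch of that failover logic. It is only an illustration, not how any real switch or LL's network works: the Path and RedundantLink names, the "primary"/"secondary" labels, and the alerts list standing in for a notice to network operations are all made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Path:
    """One physical route out of a server (hypothetical model)."""
    name: str
    healthy: bool = True


@dataclass
class RedundantLink:
    """A set of alternate paths, tried in the order given."""
    paths: list
    alerts: list = field(default_factory=list)  # stand-in for notifying ops

    def mark_failed(self, name):
        # Detect the failure once and record an alert for operations.
        for p in self.paths:
            if p.name == name and p.healthy:
                p.healthy = False
                self.alerts.append(f"path {name} failed")

    def active_path(self):
        # Traffic follows the first path that is still healthy.
        for p in self.paths:
            if p.healthy:
                return p.name
        return None  # every path is down: this is the outage case


link = RedundantLink([Path("primary"), Path("secondary")])
link.mark_failed("primary")
print(link.active_path())  # traffic now uses the remaining healthy path
print(link.alerts)
```

With only one path in the list, a single failure leaves active_path() with nothing to return, which is the single-point failure being discussed here.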

Many high-end HP and Cisco switches support redundant paths as well as parallel trunking of multiple cables, which adds to the total bandwidth capacity when there are no problems, and can fall back to non-trunked mode if any one of the cables fails. Even the network switch power supplies can be redundant, powered from parallel battery backup systems.
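
The bandwidth side of trunking is simple arithmetic, sketched below in Python. The function name and the 1 Gbps per-cable figure are assumptions chosen for illustration; they are not taken from any vendor's specification or from LL's setup.

```python
def trunk_capacity_gbps(link_up, per_link_gbps=1.0):
    """Total usable bandwidth of a trunk, counting only cables that are up.

    link_up is a list of booleans, one per cable in the trunk.
    per_link_gbps is an assumed per-cable speed, for illustration only.
    """
    return sum(per_link_gbps for up in link_up if up)


# Two cables trunked together, both working: double the bandwidth.
print(trunk_capacity_gbps([True, True]))
# One cable fails: the trunk keeps running at single-cable speed.
print(trunk_capacity_gbps([True, False]))
```

The point is that a cable failure in a trunk costs capacity rather than connectivity, as long as at least one cable stays up.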

Sometimes only the core of the network backbone is redundant, but for really critical work it can extend all the way down to the servers themselves, with two network cables running out from each server to two different redundant switches.



Now if you're some tiny company running on a shoestring budget, you probably cannot afford to buy duplicate network equipment and set up at least two paths away from any given network switch for data to follow. If something breaks, you may need to shut down your whole business to fix the break before you can return to normal.

But this grid that LL is making isn't run on a shoestring budget. And eventually it's going to need to get to a point where it is so reliable that it almost never shuts down for any reason. I really hope you're going to spend the money necessary to prevent a problem like this from happening again.


On a related note, has LL purchased its Internet service from two different companies, following two different physical paths out of your colos, so that if one of your OC-48s is cut by a backhoe, you'll still be immediately available on the 'net via the other ISP's path?

Pathfinder Linden
Administrator
Join date: 15 Mar 2005
Posts: 507
01-02-2007 16:39
Hi Scalar,

I've asked someone from our ops group to answer this one for you.

-Pathfinder
_____________________
beez Linden
Studio Director
Join date: 16 Mar 2006
Posts: 30
01-02-2007 17:04
I won't gloss over it, Scalar. You're absolutely correct.

We haven't been in the affected colo very long, and we set up initially without the redundancy we would have preferred. The redundant links have been on order since before the holiday, and we expect that they'll be in and online shortly.

Second, when part of the grid goes dark for any reason, the case could be handled more gracefully. That's now on the bug list. This outage didn't take down the entire grid, but the login problems were bad enough that we chose to close the doors.

We are absolutely working to introduce more redundancy to the grid so that we don't have single-point failures in the future.

~~b