Yesterday's grid problem suggests to me that LL does not use redundant network cable paths for its hardware, and that a single network cabling failure is enough to disrupt a large portion of the grid. For a company as large and important as LL, something needs to be done about this.
Are you planning to build a more redundant network for the grid, so that single-point failures like this don't happen again in the future?
The blog post says a GBIC failed. For non-technical people who do not know what this is, a GBIC is used to plug a fiber-optic network cable pair into a network switch.
[Image: an HP ProCurve 2824 switch, a GBIC, and a fiber-optic cable pair]
The thing is, any part of a network path can fail. The glass-fiber cable can be damaged, the GBIC module can fail, the switch can fail, the power can fail, and so forth.
To deal with this, the network can be made redundant. This means there are at least two alternate paths for information to follow from each server. If one of the paths fails, the system can auto-detect the failure, re-route information along the alternate path, and inform network operations of the failed path.
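For anyone who thinks better in code, here is a rough sketch of that failover idea in Python. It is only an illustration: the names `Path` and `RedundantLink` are made up for this post, and real switches do this in firmware with protocols I am not modeling here.

```python
from dataclasses import dataclass, field


@dataclass
class Path:
    """One physical route out of a server (hypothetical model)."""
    name: str
    up: bool = True


@dataclass
class RedundantLink:
    """Two or more alternate paths, tried in order of preference."""
    paths: list
    alerts: list = field(default_factory=list)

    def send(self, packet):
        """Return the name of the first healthy path, which carries the packet.

        Every failed path met along the way is recorded once in `alerts`,
        standing in for a notice to network operations.
        """
        for path in self.paths:
            if path.up:
                return path.name
            note = f"path {path.name} is down"
            if note not in self.alerts:
                self.alerts.append(note)
        raise ConnectionError("all paths are down")
```

With two paths, traffic uses the first one until it is marked down, then quietly moves to the second while an alert is queued. Only when every path is down does the sender see an error.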
Many high-end HP and Cisco switches support redundant paths as well as parallel trunking of multiple cables, which adds to the total bandwidth capacity when there are no problems, and can fall back to non-trunked mode if any one of the cables fails. Even the network switch power supplies can be redundant, powered from parallel battery backup systems.
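The arithmetic behind trunking is simple enough to sketch. The function below is a toy model with assumed numbers (1 Gbit/s per cable, and a made-up function name), not a description of how any particular switch reports capacity.

```python
def trunk_capacity_gbps(link_states, per_link_gbps=1.0):
    """Total usable bandwidth of a trunk, counting only the healthy cables.

    link_states: list of booleans, True if that cable is working.
    per_link_gbps: assumed speed of each cable (an illustrative figure).
    """
    healthy = sum(1 for up in link_states if up)
    return per_link_gbps * healthy
```

In this model a four-cable trunk gives four times the bandwidth of one cable while everything works. Losing one cable costs a quarter of the capacity instead of taking the connection down.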
Sometimes only the core of the network backbone is redundant, but for really critical work it can extend all the way down to the servers themselves, with two network cables running out from each server to two different redundant switches.
Now if you're some tiny company running on a shoestring budget, you probably cannot afford to buy duplicate network equipment and give every network switch at least two paths for its data to follow. If something breaks, you may need to shut down your whole business to fix the break before you can return to normal.
But this grid that LL is making isn't run on a shoestring budget. And eventually it's going to need to get to a point where it is so reliable that it almost never shuts down for any reason. I really hope you're going to spend the money necessary to prevent a problem like this from happening again.
On a related note, have you purchased your Internet service from two different companies whose lines follow two different physical paths out of your colos, so that if one of your OC-48s is cut by a backhoe, you'll still be immediately available on the 'net via the other ISP's path?