Firstly, thanks for your excellent game, and the hard work that goes into it.
Secondly, I am puzzled by the latest failure, ~7 on 1/1, said to be the result of a single GBIC failure. I would expect the SL server network to be fully redundant, with live failover when a server, switch, router, or link fails, and load balanced as well, as all large enterprise datacenters are. Your LL posts say that you are actually colocated with your SP, so I would think they would insist on that.
This is simple network design, and it provides robustness beyond that of grid/cluster technology, which protects your processor resources and database content but does not cope as well with equipment failure.
The previous massive failure over the NY weekend seemed to point to unwonted sensitivity to single-point hardware failures, though I understand that ultimately there were two points of failure.
Is the SL network not redundant?
Thirdly, the ongoing need to take residents off the grid when it is in trouble points to a need for better QoS, so that all management traffic gets through regardless of resident traffic.
Also, I would think that it would be possible to do more proactive CAC (admission control), limiting the number of residents online at a single locale, to prevent the ongoing phenomenon of individual locale sims crashing repeatedly as they become overpopulated. I think all residents would prefer to be refused entry because a locale is full than to get into a discussion with another resident in a sim that crashes repeatedly.
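To make the admission-control idea concrete, here is a minimal sketch of a per-locale cap. It is only an illustration of the principle, not a description of how SL actually works: the class name, the method names, and the capacity figure are all my own assumptions.

```python
class LocaleAdmission:
    """Hypothetical per-locale admission control: refuse entry once full."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity   # assumed safe population for this locale
        self.present = set()       # residents currently in the locale

    def try_enter(self, resident):
        """Return True if the resident is admitted, False if the locale is full."""
        if resident in self.present:
            return True            # already inside; nothing to do
        if len(self.present) >= self.capacity:
            return False           # refuse rather than risk overloading the sim
        self.present.add(resident)
        return True

    def leave(self, resident):
        """Free the resident's slot; harmless if they were not present."""
        self.present.discard(resident)


# Example with an assumed cap of 2 residents.
locale = LocaleAdmission(capacity=2)
results = [locale.try_enter(name) for name in ("Ann", "Bob", "Cyd")]
locale.leave("Ann")
retry = locale.try_enter("Cyd")
```

The point of the design is that the refusal happens at the door, before the newcomer adds any load, so the residents already inside are not affected.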
Cheers,
Nika