Welcome to the Second Life Forums Archive


Now that the dust has settled....

Rock Vacirca
riches to rags
Join date: 18 Oct 2006
Posts: 1,093
06-21-2009 02:29
The recent 'emergency' and its aftermath are not the first time this has happened.

My question is, for those with network experience:

Is it simply not possible to have a server/network/backup system that is immune to grid-wide outages, as we have seen over the weekend?

The organisation I work for has a massive datacentre (providing all the satellite weather data for the whole of Europe), and we have never had a system outage. Is that down to good luck, or good management? Our network engineers tell me about diversity of routing, hot redundancy, zero singularities (whatever that means), etc, etc.

(Incidentally, we exchange data with NOAA from the States. At every morning briefing our data guys report 'no problems' in data transfer to the States in the last 24 hours, while our NOAA liaison engineer invariably is telling us about the 'reason' for the latest outage of data from the States to us. If you sat in on our meetings you would definitely get the impression that it was an 'American' thing).

Have the problems that SL has experienced with its servers since day 1 been to do with the design, just the way things are in the States, poor choice of equipment, too much reliance on 3rd party services, or what?

My motivation in asking this is because I am wondering if new players, such as Blue Mars, are likely to run into exactly the same problems as SL (Blue Mars is also US based, in Hawaii), or were there lessons to be learnt for these newcomers, which would ensure a competitive edge if heeded.

Rock
Sling Trebuchet
Deleted User
Join date: 20 Jan 2007
Posts: 4,548
06-21-2009 02:32
American engineers see Europe as damage and route around it.
_____________________
Maggie: We give our residents a lot of tools, to build, create, and manage their lands and objects. That flexibility also requires people to exercise judgment about when things should be used.
http://www.ace-exchange.com/home/story/BDVR/589
Novis Dyrssen
Girl Geek
Join date: 6 May 2007
Posts: 1,452
06-21-2009 02:33
From: Rock Vacirca
Is it simply not possible to have a server/network/backup system that is immune to grid-wide outages


Show me one single computer/server/network system that is completely immune against outages under heavy use. That should answer your question.

Incidentally, my first brush with Blue Mars? Was the message that I couldn't connect because the server was down.
_____________________
~~ immortal words of Rob Thomas ~~
Hey-yeah, welcome to the Real World
Nobody told you it was gonna be hard
Ciaran Laval
Mostly Harmless
Join date: 11 Mar 2007
Posts: 7,951
06-21-2009 02:50
Do you want to pay for it Rock? I sure don't want to see my tier fees doubled to pay for super duper redundancy.
Rock Vacirca
riches to rags
Join date: 18 Oct 2006
Posts: 1,093
06-21-2009 03:08
From: Novis Dyrssen
Show me one single computer/server/network system that is completely immune against outages under heavy use. That should answer your question.



I did. Pay attention.

Rock
Rock Vacirca
riches to rags
Join date: 18 Oct 2006
Posts: 1,093
06-21-2009 03:10
From: Ciaran Laval
Do you want to pay for it Rock? I sure don't want to see my tier fees doubled to pay for super duper redundancy.


I don't think that super duper redundancy is the only solution. Worlds and games that have a distributed server system, rather than a centralised server system, don't have the huge problems that SL has.

Rock
Ciaran Laval
Mostly Harmless
Join date: 11 Mar 2007
Posts: 7,951
06-21-2009 03:18
From: Rock Vacirca
I don't think that super duper redundancy is the only solution. Worlds and games that have a distributed server system, rather than a centralised server system, don't have the huge problems that SL has.

Rock


Which worlds and games? Unless they've changed recently both City of Heroes and World of Warcraft used to crash and burn due to a network failure.

Linden Lab have datacenters in San Francisco and Dallas, I've read that they're looking for datacenters in Europe and the far east too.
Tegg Bode
FrootLoop Roo Overlord
Join date: 12 Jan 2007
Posts: 5,707
06-21-2009 03:45
I test a beta game at the moment and they did a planned surprise server wipe over the weekend. Imagine if the SL or World of Warcraft servers got wiped by a failure: there would be mass suicides, and possibly a world better off without some of those people :)
_____________________
Level 38 Builder [Roo Clan]

Free Waterside & Roadside Vehicle Rez Platform, Desire (88, 17, 107)

Avatars & Roadside Seaview shops and vendorspace for rent, $2.00/prim/week, Desire (175,48,107)
Rock Vacirca
riches to rags
Join date: 18 Oct 2006
Posts: 1,093
06-21-2009 03:49
From: Ciaran Laval
Which worlds and games? Unless they've changed recently both City of Heroes and World of Warcraft used to crash and burn due to a network failure.

Linden Lab have datacenters in San Francisco and Dallas, I've read that they're looking for datacenters in Europe and the far east too.


WoW also uses centralized servers. Just Google virtual worlds or 3d virtual games with the keyword p2p to get lists. The first few that came up for me were Open Croquet, Twinverse, Moove.

There is an excellent paper on Solipsis, presented at the 2008 IEEE Conference on Virtual Reality. http://www.pap.vs.uni-due.de/MMVE08/slides/MMVE08-piegay.pdf

It also defines what a virtual world is on page 8 (slide 11) (Argent, don't go there, you will not be pleased).

I am not a network engineer, that is why I asked the original question (and I did ask for replies from those with network experience, but that seems to have gone largely ignored so far). It would appear that SL ARE trying to decentralize, or are there other reasons for SL to look for datacenters in Europe and the Far East?

Rock
Rock Vacirca
riches to rags
Join date: 18 Oct 2006
Posts: 1,093
06-21-2009 03:51
From: Tegg Bode
I test a beta game at the moment and they did a planned surprise server wipe over the weekend. Imagine if the SL or World of Warcraft servers got wiped by a failure: there would be mass suicides, and possibly a world better off without some of those people :)


They did the same in Far Realms, but they did give notice that all would be wiped. I cannot imagine why any beta world would do this by surprise though.

Rock
Ciaran Laval
Mostly Harmless
Join date: 11 Mar 2007
Posts: 7,951
06-21-2009 03:55
From: Rock Vacirca
I am not a network engineer, that is why I asked the original question (and I did ask for replies from those with network experience, but that seems to have gone largely ignored so far). It would appear that SL ARE trying to decentralize, or are there other reasons for SL to look for datacenters in Europe and the Far East?

Rock


I have got network experience, that's why I talked of the cost implications :p
Novis Dyrssen
Girl Geek
Join date: 6 May 2007
Posts: 1,452
06-21-2009 03:57
From: Rock Vacirca
I did. Pay attention.


Your system is sure as hell not immune just because they say they have no problems. And I dare say that SL servers have to manage a whole different league of "heavy use" than a server providing data for downstream.
_____________________
~~ immortal words of Rob Thomas ~~
Hey-yeah, welcome to the Real World
Nobody told you it was gonna be hard
Domchi Underwood
Registered User
Join date: 4 Aug 2007
Posts: 44
06-21-2009 03:58
From: Rock Vacirca
Is it simply not possible to have a server/network/backup system that is immune to grid-wide outages, as we have seen over the weekend?


It's not a server/network/backup system, it's a bleeding edge technology experiment of unprecedented scale.

See how many scaling problems Twitter has, and they're dealing only with 140-character messages. Technologically, Twitter can be compared to a sim-local chat (group chat is a bit more complicated than Twitter). And keep in mind that chat messages in SL can be up to 1024 characters long. ;)

Even GMail was once down for hours, and Google throws a lot of resources at its reliability and redundancy.
Windsweptgold Wopat
Registered User
Join date: 24 May 2007
Posts: 1,003
06-21-2009 05:56
From: Rock Vacirca
WoW also uses centralized servers. Just Google virtual worlds or 3d virtual games with the keyword p2p to get lists. The first few that came up for me were Open Croquet, Twinverse, Moove.

There is an excellent paper on Solipsis, presented at the 2008 IEEE Conference on Virtual Reality. http://www.pap.vs.uni-due.de/MMVE08/slides/MMVE08-piegay.pdf

It also defines what a virtual world is on page 8 (slide 11) (Argent, don't go there, you will not be pleased).

I am not a network engineer, that is why I asked the original question (and I did ask for replies from those with network experience, but that seems to have gone largely ignored so far). It would appear that SL ARE trying to decentralize, or are there other reasons for SL to look for datacenters in Europe and the Far East?

Rock

Unless Moove has had major changes recently it can be out for days.

I do find it funny how angry people seem to get when SL is down and how they claim SL should have a way for this not to happen. As I see it, it is like driving down a major road when a big accident happens: sometimes you can go around, but sometimes you just have to sit and wait for what could be hours.
_____________________
"Mushrooms grow well in BS, trust and honesty do not"
Qie Niangao
Coin-operated
Join date: 24 May 2006
Posts: 7,138
06-21-2009 06:01
Getting just the network redundancy is a little costly, but it's doable. It basically doubles the cost of the datacenter network, adds to operations cost and complexity, requires more and different hardware in each host, and--more costly--requires fully non-intersecting paths between centers and to the Internet.

But that's not really the big cost. The real cost is in the software, and it's not at all a solved problem in any general sense. Especially where real-time performance is at stake, completely transparent end-to-end failure tolerance is pretty much impossible across any system of interesting size.

All that said, however, there are major improvements possible, even with the requirements of the Second Life application. They are, in fact, doing at least some of the right things to realize those improvements--changes that will also open up orders of magnitude better scalability in the backend servers. (This is my interpretation based on some of Babbage's comments at his office hours.) To an outsider, it seems that they are adopting some of the very large scale architecture approaches that make Google so highly available--which makes sense, given that they've made many of the same component choices (the frequent bitching about them notwithstanding).
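
As a rough illustration of why the fully non-intersecting paths mentioned above matter, here is a minimal sketch of the usual parallel-availability arithmetic. It assumes the paths fail independently of each other, and the 99.9% per-path figure is purely illustrative, not a number from Linden Lab or any provider. The function names are made up for this example.

```python
HOURS_PER_YEAR = 24 * 365


def parallel_availability(path_availability: float, paths: int) -> float:
    """Chance that at least one of `paths` independent paths is up."""
    if not 0.0 <= path_availability <= 1.0:
        raise ValueError("path_availability must be between 0 and 1")
    if paths < 1:
        raise ValueError("paths must be at least 1")
    # The link is only down when every path is down at the same time.
    return 1.0 - (1.0 - path_availability) ** paths


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours per year spent down at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    assumed_path_availability = 0.999  # illustrative assumption only
    for paths in (1, 2):
        a = parallel_availability(assumed_path_availability, paths)
        print(f"{paths} path(s): availability {a:.6f}, "
              f"about {downtime_hours_per_year(a):.2f} hours down per year")
```

If the two paths share a conduit, a router, or a power feed, the independence assumption breaks and the real figure drifts back toward the single-path one. None of this touches the software cost, which the post calls the bigger problem.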
Clarissa Lowell
Gone. G'bye.
Join date: 10 Apr 2006
Posts: 3,020
06-21-2009 09:28
So...let me guess, Rock.

This is another reason why people should try Blue Mars.
_____________________
Deira Llanfair
Deira to rhyme with Myra
Join date: 16 Oct 2006
Posts: 2,315
06-21-2009 09:32
Rock, what makes you think the dust has settled? I see no ships!
_____________________
Deira :)
Must create animations for head-desk and palm-face!.
23rdDjin Negulesco
Unfinished Build Master
Join date: 30 May 2007
Posts: 661
06-21-2009 09:35
perhaps the dust has settled to muddy the waters.
_____________________
"What am I in the eyes of most people--a nonentity, an eccentric, or an unpleasant person--somebody who has no position in society and will never have; in short, the lowest of the low. All right, then--even if that were absolutely true, then I should one day like to show by my work what such an eccentric, such a nobody, has in his heart." -Vincent van Gogh
Dagmar Heideman
Bokko Dancer
Join date: 2 Feb 2007
Posts: 989
06-21-2009 11:05
The dust has settled as particles floating all about the air like wispy clouds of dirt particles swirling around the surface of SL. Oh wait.....that's still dust.....never mind.....
Rock Vacirca
riches to rags
Join date: 18 Oct 2006
Posts: 1,093
06-21-2009 11:07
From: Clarissa Lowell
So...let me guess, Rock.

This is another reason why people should try Blue Mars.


Not at all. I don't know if BM have learnt any lessons from the networking experience of SL. Time will tell if they have or not. It has been known for some years about SL's scalability problems, and the Lindens themselves acknowledge that scalability is a serious weakness. It would be a big plus for any competitor to SL to outdo SL in terms of availability.

As far as the outages in SL are concerned, it is not just the inconvenience of downtime, it is also all the content that has been lost as well (US$$$), for which there is no redress from SL.

I have been intrigued by the approach by Solipsis (p2p network), but this is open source, and not very attractive to commercial companies, who see centralized servers as the way to maintain control, content, and profits.

But, disregarding the commercial aspects for a moment, which IS better, in terms of network availability? Is it centralized or distributed (P2P)? The few Google sources I have read so far all favour P2P as the winner of that one.

Rock
Qie Niangao
Coin-operated
Join date: 24 May 2006
Posts: 7,138
06-21-2009 11:38
From: Rock Vacirca
But, disregarding the commercial aspects for a moment, which IS better, in terms of network availability? Is it centralized or distributed (P2P)? The few Google sources I have read so far all favour P2P as the winner of that one.
Sure, because it (almost) reduces the problem to pure network availability: if a peering node goes down, only that node suffers.

But that's kind of how OpenSim ends up with such an intractable Intellectual Property problem: all content authorization is distributed to the individual sims (where the sims are seen as peers for this purpose).

If there is to be centralized access control to anything (not just IP-protected content, but presence, etc.), that centralized service has to somehow achieve high availability. That's possible, and the web services (lowercase "w" "s";) approach being pursued now makes that a lot easier than the current approach of distributed database access all the way from the sims themselves.

I guess one might refer to this sort of distributed, redundant web service architecture as a kind of peer-to-peer *layer*. Well, okay, no, not really. :o But it's trying to achieve the same sort of availability advantage with a distributed service implementation.
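
The availability argument above (a failed peer hurts only itself, a failed central service hurts everyone) can be sketched as a toy comparison. This is only an illustration under simplifying assumptions: users are pinned to one node, a peer failure has no knock-on effects, and the function names and user counts are hypothetical.

```python
from typing import Sequence, Set


def affected_centralized(users_per_node: Sequence[int], central_up: bool) -> int:
    """Users locked out when every node depends on one central service."""
    # If the central service is down, nobody gets in, whatever node they use.
    return 0 if central_up else sum(users_per_node)


def affected_p2p(users_per_node: Sequence[int], failed_nodes: Set[int]) -> int:
    """Users locked out when only the failed peers themselves go dark."""
    return sum(
        count
        for index, count in enumerate(users_per_node)
        if index in failed_nodes
    )


if __name__ == "__main__":
    users = [40, 25, 35]  # hypothetical user counts on three nodes
    print("central service down:", affected_centralized(users, central_up=False))
    print("peer 1 down:", affected_p2p(users, failed_nodes={1}))
```

The sketch leaves out exactly the part the post is worried about: once anything such as content permissions or presence has to be checked centrally, that central check needs its own high availability.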
Jesse Barnett
500,000 scoville units
Join date: 21 May 2006
Posts: 4,160
06-21-2009 13:35
From: Qie Niangao
Getting just the network redundancy is a little costly, but it's doable. It basically doubles the cost of the datacenter network, adds to operations cost and complexity, requires more and different hardware in each host, and--more costly--requires fully non-intersecting paths between centers and to the Internet.

But that's not really the big cost. The real cost is in the software, and it's not at all a solved problem in any general sense. Especially where real-time performance is at stake, completely transparent end-to-end failure tolerance is pretty much impossible across any system of interesting size.

All that said, however, there are major improvements possible, even with the requirements of the Second Life application. They are, in fact, doing at least some of the right things to realize those improvements--changes that will also open up orders of magnitude better scalability in the backend servers. (This is my interpretation based on some of Babbage's comments at his office hours.) To an outsider, it seems that they are adopting some of the very large scale architecture approaches that make Google so highly available--which makes sense, given that they've made many of the same component choices (the frequent bitching about them notwithstanding).


Linden Lab may leave a lot to be desired when it comes to customer policies, but they do a great job on upgrading the service. The latest is contracting with Terremark's NAP facility, which is better than state-of-the-art. The fail-safes, security and backups are mind-boggling there, and since it is about 30 minutes from my house, I wish there was some way I could see it (Cold day in hell!). Look at the specs on this place:

http://www.terremark.com/technology-platform/nap-of-the-capital-region.aspx

That's right, this is not a typo: "Terremark offers 100% service level agreements on power and environmentals for the NAP of the Capital Region." They can offer that because of their redundancy in multiple power grids and their generators and on-site fuel storage:

"Each of the five planned 50,000 square foot data centers at the site will be supported by 11 Caterpillar 2.25 megawatt generators, for a total of 55 generators once the campus is completed. To support those generators, the NAP of the Capital Region can store up to 520,000 gallons of diesel fuel on site."

http://www.datacenterknowledge.com/archives/2009/01/20/terremark-begins-second-facility-at-culpeper/

The security at the place is like you see only in the movies with stuff like:

"Security is a major focus of the NAP of the Capital Region, which was designed around the requirements of federal customers. The data center is ringed with a 10-foot-high fence topped with barbed wire, which stands at the outer edge of a 150-foot security perimeter populated with large earthen berms to slow intruders. The front entrance to the facility is protected by a 14-inch thick wall of solid concrete. All staff and visitors must pass through an enclosed “man trap” and several layers of biometric security before entering the facility."

http://www.datacenterknowledge.com/archives/2008/06/16/inside-terremarks-culpeper-data-fortress/

EDIT: COOL! Found a video tour of the place. Power is simply amazing with 110 megawatts of UPS. Each megawatt is supplied by three 660 lb flywheels spinning at over 7,700 rpm.
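
For anyone who wants to check the numbers, here is a quick sketch of the arithmetic behind the figures quoted above. It takes the quoted values at face value (11 generators of 2.25 megawatts per data center, five planned data centers, 110 megawatts of UPS, three flywheels per megawatt) and assumes they all describe the completed campus, which the sources do not spell out. The variable names are just for illustration.

```python
# Figures as quoted in the post and the linked articles.
DATA_CENTERS_PLANNED = 5
GENERATORS_PER_DATA_CENTER = 11
GENERATOR_MEGAWATTS = 2.25
UPS_MEGAWATTS = 110
FLYWHEELS_PER_MEGAWATT = 3
FLYWHEEL_POUNDS = 660

# Generator totals for the completed campus.
total_generators = DATA_CENTERS_PLANNED * GENERATORS_PER_DATA_CENTER
total_generator_megawatts = total_generators * GENERATOR_MEGAWATTS

# Flywheel totals, assuming three flywheels for every megawatt of UPS.
total_flywheels = UPS_MEGAWATTS * FLYWHEELS_PER_MEGAWATT
total_flywheel_pounds = total_flywheels * FLYWHEEL_POUNDS

if __name__ == "__main__":
    print(f"generators: {total_generators}")
    print(f"generator capacity: {total_generator_megawatts} MW")
    print(f"flywheels: {total_flywheels}")
    print(f"flywheel mass: {total_flywheel_pounds} lb")
```

The generator count matches the 55 stated in the quoted article; the flywheel totals follow only if the per-megawatt figure applies to all 110 megawatts.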

Can anyone tell that I am a geekette? :p
_____________________
I (who is a she not a he) reserve the right to exercise selective comprehension of the OP's question at anytime.
From: someone
I am still around, just no longer here. See you across the aisle. Hope LL burns in hell for archiving this forum