Second Life Forums Archive - Here's the deal with network b0rkedness

Calliope Simon

Registered User

Join date: 21 May 2006

Posts: 154

08-09-2007 05:02

I've been in this scene professionally for a little over 20 years. I've seen fiber cables cut by back hoes, routers the size of a refrigerator catch fire, and electrical arcs cross 20" of nothing but air from the back of a bad power supply.

I've seen multiplexers and demuxes melt down. I've seen RJ45 connectors snapped off while still plugged in by clumsy n00bs.

I've seen configuration mistakes. I've seen configurations deleted completely. I've been on the phone with USR, Lucent, and Cisco asking for the back door into a device that's otherwise completely shot.

And I know two things:

1. There is no excuse for *any* network problem, short of the building that all the network stuff is in actually being destroyed, taking more than a few hours to fix. Every network hardware manufacturer has the ability to ship any part you need (up to entire gigantic routers, or even many of them) in four hours. Unpacking whatever it is, slapping it in place of whatever is broken, configuring it, and turning it on shouldn't take any more than another hour and a half. And that's if you're already incompetent and don't have a cold spare standing by on site.

2. Linden Labs "ops" (whatever that is---but it sounds very impressive) are not "fixing" ANYTHING that has to do with a network in a colo space that's worth its weight in camel poop. The point of a colo space is that you don't have to spend a lot of money on "ops" "fixing" "network" "problems". That's what you're paying through the nose for the colo monkeys to do.

It would be really nice if linden labs just once told the truth and gave some details. Why *exactly* is this taking so long, Lindens? The very fact that you're not really telling the truth in the first place speaks volumes. Maybe you could comment on those volumes.

Wise Clapsaddle

Registered User

Join date: 14 Aug 2006

Posts: 29

08-09-2007 05:42

*claps*

Finally a post from someone who seems to have a grasp when it comes to network administration.

I would be very interested in hearing your thoughts on the variety of problems Second Life is currently having, for a few reasons:

1. It would help those who don't understand much about networking and its administration to see how easy some of these problems are to fix for engineers with a basic understanding of the field.

2. It would cast a stronger light on Linden Labs, who for the most part are shielded from a full-on investigation into their actions because the general population are not network or software engineers. With a little more technical help, translated out of jargon, people may start to realize that Linden Labs is indeed acting with incompetence and using people's lack of information on the subject to get away with anything they see fit.

3. It might, just might, give Second Life residents enough ammunition to confront Linden Labs as something as close to a single entity as we can manage and say: fix it, or watch your competition (fast approaching) take your userbase and enjoy a virtual world once again.

I'm not a professional in the field, but I know people who are either very close to it or are professionals, and who pass their knowledge on to me from time to time.

I would therefore like to put a question to you, if I may. In your professional opinion, given your background, what steps would you expect to be taken if a serious problem causing long-term outages (fatal for providers of hosting solutions) is found in a platform's code? And if something is misconfigured, what would the symptoms be?

Everyone who supports the idea of Second Life, even those who openly complain when something is obviously wrong, has one common ideal: a desire to remain in Second Life because of its appeal to many. We just want it fixed, a simple goal for most companies to honour. Why not Linden Labs?

Brenda Connolly

Un United Avatar

Join date: 10 Jan 2007

Posts: 25,000

08-09-2007 06:34

While I have no idea what you are talking about, and don't really care, just based on my average knowledge from how I use my computer and the internet, I'm going to agree with you for the most part. Bugs, slowdowns and the like are understandable. Even outages, to a point. But the frequency of them, and the duration and subsequent aftereffects, are inexcusable in my opinion. And Linden being forthright and clear on anything they tell us has been nonexistent as far as I can tell in the 8 months I've been here. But I keep coming back, so shame on me I guess.

_____________________

Don't you ever try to look behind my eyes. You don't want to know what they have seen.

http://brenda-connolly.blogspot.com

Draco18s Majestic

Registered User

Join date: 19 Sep 2005

Posts: 2,744

08-09-2007 06:39

You understand, of course, that LL's colo facility is not a redundancy of the grid, right? If one of the facilities (either the one in San Fran or the one in Texas) goes asplody, half the grid goes with it.

It takes several hours to boot the grid if it's offline: they have to turn on/reset each sim, verify that each sim came up correctly, wait for all of them, then connect them en masse to the grid (or connect them as they come up, whichever is best at the time).

The colo facility has a redundancy on possibly the asset, presence, login, and other non-sim servers, but not having them puts incredible strain on the other set. It's not true redundancy.

Calliope Simon

Registered User

Join date: 21 May 2006

Posts: 154

08-09-2007 06:39

From: Wise Clapsaddle

*claps*

Finally a post from someone who seems to have a grasp when it comes to network administration.

I would be very interested in hearing your thoughts on the variety of problems Second Life is currently having, for a few reasons:

1. It would help those who don't understand much about networking and its administration to see how easy some of these problems are to fix for engineers with a basic understanding of the field.

2. It would cast a stronger light on Linden Labs, who for the most part are shielded from a full-on investigation into their actions because the general population are not network or software engineers. With a little more technical help, translated out of jargon, people may start to realize that Linden Labs is indeed acting with incompetence and using people's lack of information on the subject to get away with anything they see fit.

3. It might, just might, give Second Life residents enough ammunition to confront Linden Labs as something as close to a single entity as we can manage and say: fix it, or watch your competition (fast approaching) take your userbase and enjoy a virtual world once again.

I'm not a professional in the field, but I know people who are either very close to it or are professionals, and who pass their knowledge on to me from time to time.

I would therefore like to put a question to you, if I may. In your professional opinion, given your background, what steps would you expect to be taken if a serious problem causing long-term outages (fatal for providers of hosting solutions) is found in a platform's code? And if something is misconfigured, what would the symptoms be?

Everyone who supports the idea of Second Life, even those who openly complain when something is obviously wrong, has one common ideal: a desire to remain in Second Life because of its appeal to many. We just want it fixed, a simple goal for most companies to honour. Why not Linden Labs?

Ok, to answer the question:

--any outage at all of a "mission critical application", no matter what the cause, should be discovered within 60 seconds of occurrence. There are a million automated systems that can handle this easily, and even SMS the people involved automatically the second they sniff a problem---any problem. If it can be reduced to a perl script, it can be checked.

--Once the outage is discovered, its cause should be known within sixty minutes. Often the cause is obvious--like a router on fire. But just as often it's not obvious at all, like some dingbat introduced a weird syntax error to a router config file that results in a rare bug in DS3 multiplexing to surface and confuse everyone. Even if it's not obvious at all, it should still be discovered within 60 minutes. Very rare cases can take longer, but they are *very rare*.

--If the problem is with software, and it's causing *loss of revenue* (which this is) through extended outages (which this is), then the standard course of action is to roll back whatever update broke everything and fix it. The operative term here is "roll back". Not fix it for the next update while everyone sits around and waits. Roll it back, so that it works while people are waiting. It seems to me that Linden Labs has built an application that does not lend itself easily to being rolled back, which is absolutely a problem of design.

--This particular problem may or may not be a network issue, as is claimed on the official blog. It could be anything. That's another problem--Linden Labs doesn't exactly like to tell the whole truth about outages. If it is a pure network problem that they have control over, as is implied currently in the blog, then it should have taken a few hours at the MOST to fix, and I can say that even not knowing exactly what the problem is. The longest fix I've ever seen was for a horribly misconfigured Veritas Volume Replicator (which Linden Labs does not use), where the misconfiguration wasn't discovered for weeks. It took about two hours to figure out what the problem was, and then about 12 hours for the fix to be effective.

--If this problem is database corruption, and they're running database software appropriate for such a large application, and the problem is as bad as it can possibly be, then retrieving the last good DB dumps, restoring them, syncing and going live with it again, assuming their largest database is something less than 500gb, should take less than six hours.
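
The first point above (automated checks that notice an outage quickly and notify someone) can be sketched roughly as follows. This is only an illustration in Python rather than perl, and it assumes the simplest possible check: a plain TCP connect. The host names, ports, the `notify` callback, and the 30-second interval are all hypothetical and are not taken from anything Linden Labs actually runs. Whether a loop like this really detects a failure within 60 seconds depends on how many targets there are and on the timeout chosen.

```python
import socket
import time
from typing import Callable, Iterable, List, Optional, Tuple

Target = Tuple[str, int]  # (host, port); every target used with this sketch is hypothetical


def tcp_probe(target: Target, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection(target, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def failed_targets(
    targets: Iterable[Target],
    probe: Callable[[Target], bool] = tcp_probe,
) -> List[Target]:
    """Probe every target once and return the ones that did not respond."""
    return [t for t in targets if not probe(t)]


def watch(
    targets: List[Target],
    notify: Callable[[Target], None],
    probe: Callable[[Target], bool] = tcp_probe,
    interval: float = 30.0,
    rounds: Optional[int] = None,
) -> None:
    """Probe all targets every `interval` seconds and call notify(target) per failure.

    rounds=None loops forever; an integer limits the number of passes.
    `notify` stands in for whatever pages or texts the on-call person.
    """
    done = 0
    while rounds is None or done < rounds:
        for target in failed_targets(targets, probe):
            notify(target)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

In practice `notify` would be whatever sends the SMS or page, and a real deployment would more likely use an existing monitoring system than a hand-rolled loop. The probe is passed in as a parameter so that the same loop can run any check that can be reduced to a yes/no answer.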

Calliope Simon

Registered User

Join date: 21 May 2006

Posts: 154

08-09-2007 06:40

From: Draco18s Majestic

You understand, of course, that LL's colo facility is not a redundancy of the grid, right? If one of the facilities (either the one in San Fran or the one in Texas) goes asplody, half the grid goes with it.

Yes, and that's a problem with bad, uninspired design. The correct design would have cost much less and been much more reliable.

Draco18s Majestic

Registered User

Join date: 19 Sep 2005

Posts: 2,744

08-09-2007 06:43

From: Calliope Simon

Yes, and that's a problem with bad, uninspired design. The correct design would have cost much less and been much more reliable.

If you care to suggest how to create a redundancy in the sim servers, /be my guest./

Calliope Simon

Registered User

Join date: 21 May 2006

Posts: 154

08-09-2007 06:49

From: Draco18s Majestic

If you care to suggest how to create a redundancy in the sim servers, /be my guest./

I've made the suggestion many times over the years actually. But it would require admitting that a classic "grid" topology is not appropriate for Second Life, and converting it to a HA/HPC hybrid in one location.

The "grid" idea is usually implemented by people who are worried about latency and bandwidth, and is almost always misused. High internode bandwidth applications (like the SL server software) are NOT appropriate for grid topologies. They shot themselves in the foot with this one.

Draco18s Majestic

Registered User

Join date: 19 Sep 2005

Posts: 2,744

08-09-2007 06:55

Yes, it's a bad system. I know that. That doesn't exactly solve the problem does it?
Solving the problem means they have to rebuild the "grid" and sim code from the ground up and find a way to replace the existing code without breaking anything.

And do it for 5000 simulators.

Overnight.

It ain't happening.

Care to make a suggestion that is feasible at this stage in the game?

Calliope Simon

Registered User

Join date: 21 May 2006

Posts: 154

08-09-2007 07:13

From: Draco18s Majestic

Yes, it's a bad system. I know that. That doesn't exactly solve the problem does it?
Solving the problem means they have to rebuild the "grid" and sim code from the ground up and find a way to replace the existing code without breaking anything.

And do it for 5000 simulators.

Overnight.

It ain't happening.

Care to make a suggestion that is feasible at this stage in the game?

No, solving the problem does not mean rebuilding the "grid", nor the "sim code". This is exactly why Linden Labs fails. If you have the sort of brain that can recognize this as a solution, then it's not hard to figure out how to implement it with the least impact.

For example, one could invest in a small IBM machine capable of virtualizing Linux under Z/OS or MVS or whatever they like the most, stick it in a rack, and use it to test permutations of this solution. My first instinct would be to build a miniature version of the SL "grid", but as a HA/HPC hybrid on that one little machine with 5K or so Linux virtualizations. It's likely that very little of the "sim code" would have to be changed, and it certainly would not have to be re-designed to function on a hybrid cluster.

Then once it all works, you *replace the main grid completely*, hardware and all, with the new hybrid cluster---or even better, you replace it with a virtual cluster running on big iron. Sell off the 5K or so idiot boxes and cancel the expensive colo contracts (which would undoubtedly go some way toward replacing what you spent on the big iron), and it's as simple as that.

And it wouldn't have to be "overnight".

It's exactly this kind of negative attitude, filled with miscalculated assumptions, that makes Linden Labs utterly incompetent.

Samuel Geiger

Registered User

Join date: 16 Oct 2006

Posts: 12

08-09-2007 07:43

From: Calliope Simon

Ok, to answer the question:
--If this problem is database corruption, and they're running database software appropriate for such a large application, and the problem is as bad as it can possibly be, then retrieving the last good DB dumps, restoring them, syncing and going live with it again, assuming their largest database is something less than 500gb, should take less than six hours.

This is the part that's going to make you laugh... they're using MYSQL! Which is great for a home network running a VERY small intranet at best, perhaps hosting phpnuke or something similar, but it was NEVER meant to run the way they're making it run. If they want to take our money and use free software, they should have hired REAL db admins and went with postgres or something along those lines.

Calliope Simon

Registered User

Join date: 21 May 2006

Posts: 154

08-09-2007 09:43

From: Samuel Geiger

This is the part that's going to make you laugh... they're using MYSQL! Which is great for a home network running a VERY small intranet at best, perhaps hosting phpnuke or something similar, but it was NEVER meant to run the way they're making it run. If they want to take our money and use free software, they should have hired REAL db admins and went with postgres or something along those lines.

Yeah, I suspected as much. But you know, even MYSQL can be tuned to support this sort of load a hell of a lot better than they have. It will never be the right database for their needs, but it can absolutely work much better for them than it does.

Even though, to this day, you STILL can't do true hot dumps of MYSQL databases.

Parsimony Paragon

SL Post-Anarchist, I Hope

Join date: 26 Oct 2006

Posts: 195

Isn't this what consultant-firms are for?

08-09-2007 10:51

There is a constant in all this, that I believe to be the fundamental and insurmountable obstacle to any *real* change. Linden insists on running a structured system, requiring a central organizing control, using "decentralized coordination" as its organizing philosophy. This dysfunctional chaos-theory-driven (notice I didn't say "Tao-driven") group of system controllers created the dysfunction, and the management recognize this, yet, who are they giving the problem back to, in order to solve it? Their own design/programming team!

At this point, long ago, any responsible management team would have both brought in outside efficiency analysts to demystify the dysfunction, AND hired outside consultants to come on board and finally really *fix* the system!

I sure do begin to feel like a lemming, here!

Slade Christensen

Liquid Heat CTO

Join date: 25 Dec 2005

Posts: 31

08-09-2007 10:57

MySQL????? *groans* Someone call Oracle please

Wiseguy Capra

Resident Wenzel Hopper

Join date: 21 Jan 2007

Posts: 160

08-09-2007 11:05

Well, it would be a start if LL would stop bringing new features into SL all the time that make the source code larger and larger without fixing the parts of that same code that (as we all know) don't work. Same issues all the time, with messed-up L$ balances, lost inventory, etc. These issues should be a top priority above any other updates and new features.

And yes, I agree, SQL sucks for this, but it's what they have chosen. I wish the server software would go open source so we could host our own servers, have our own continents, and do our own patches.

Johan Laurasia

Fully Rezzed

Join date: 31 Oct 2006

Posts: 1,394

08-09-2007 11:42

I know a thing or two about networks, Calliope, and getting equipment shipped in "four hours" seems too fast. 24 hrs, yeah, but 4? Don't think so. Seems funny to me how you can comment on a setup you know nothing about with such detail. How much networking equipment does LL have? In how many locations? Since you seem to have the ability to estimate to the minute how fast stuff should be fixed, you must know this information. If not (which I'm sure is the case), you really don't stand in a position to say anything. If you're so sure about all this, why don't you go to work for LL? I'm sure they'd love to have an employee who can fix any problem that arises in 5 1/2 hrs. But then again, you'd take away everyone's reason to bitch all the time in the forum.

Cocoanut Koala

Coco's Cottages

Join date: 7 Feb 2005

Posts: 7,903

08-09-2007 12:12

From: Calliope Simon

I've been in this scene professionally for a little over 20 years. I've seen fiber cables cut by back hoes, routers the size of a refrigerator catch fire, and electrical arcs cross 20" of nothing but air from the back of a bad power supply.

I've seen multiplexers and demuxes melt down. I've seen RJ45 connectors snapped off while still plugged in by clumsy n00bs.

I've seen configuration mistakes. I've seen configurations deleted completely. I've been on the phone with USR, Lucent, and Cisco asking for the back door into a device that's otherwise completely shot.

And I know two things:

1. There is no excuse for *any* network problem, short of the building that all the network stuff is in actually being destroyed, taking more than a few hours to fix. Every network hardware manufacturer has the ability to ship any part you need (up to entire gigantic routers, or even many of them) in four hours. Unpacking whatever it is, slapping it in place of whatever is broken, configuring it, and turning it on shouldn't take any more than another hour and a half. And that's if you're already incompetent and don't have a cold spare standing by on site.

2. Linden Labs "ops" (whatever that is---but it sounds very impressive) are not "fixing" ANYTHING that has to do with a network in a colo space that's worth its weight in camel poop. The point of a colo space is that you don't have to spend a lot of money on "ops" "fixing" "network" "problems". That's what you're paying through the nose for the colo monkeys to do.

It would be really nice if linden labs just once told the truth and gave some details. Why *exactly* is this taking so long, Lindens? The very fact that you're not really telling the truth in the first place speaks volumes. Maybe you could comment on those volumes.

My guess - as a completely non-techy person who knows zilch about colos, networks, or whatever - is what they did was get the cheapest colo space possible, and that is the one in Texas.

It's like the billing going all to hell in a handbasket, with old residents all of a sudden finding their cc's and paypal will no longer work, because LL farmed that out to someplace in England, doubtless due to its cheapness or some kind of backroom deal.

This is all just a guess. Problem is, then they get stubborn about it, and refuse to change it, even though it doesn't work right.

coco

_____________________

VALENTINE BOUTIQUE
at Coco's Cottages

http://slurl.com/secondlife/Rosieri/85/166/87

Nika Talaj

now you see her ...

Join date: 2 Jan 2007

Posts: 5,449

08-09-2007 12:26

MYSQL??? omg *runs and pulls all lindens out of her account, and makes a note to spend hours trimming inventory*

I know something about grid/cluster/HA designs, but don't know LL's deployment. So I'm not going to speculate on what they should be doing, or the quality of their colo design/provider, etc.

However, I will say this: the last two days do not smell like straightforward colo issues to me. I believe the colo has issues ... I just don't know why ... grid attack? server instability due to new bug(s)? voice presence/routing creating unforeseen bottlenecks?

I don't think LL is lying, I think they have little incentive to practice full disclosure.

Dytska Vieria

+/- .00004™

Join date: 13 Dec 2006

Posts: 768

08-09-2007 12:41

From: Johan Laurasia

I know a thing or two about networks Calliope, and getting equipment shipped in "four hours" seems too fast, 24 hrs, yeah, but 4? Don't think so.

Ummm, actually, for years the big companies, especially NSPs and ISPs, have had 4-hour agreements with vendors to deliver network and server equipment on site. It's not unusual; however, it costs a lot, and I doubt LL subscribes to such services.

_____________________

+/- 0.00004

Calliope Simon

Registered User

Join date: 21 May 2006

Posts: 154

08-09-2007 13:22

From: Johan Laurasia

I know a thing or two about networks, Calliope, and getting equipment shipped in "four hours" seems too fast. 24 hrs, yeah, but 4? Don't think so. Seems funny to me how you can comment on a setup you know nothing about with such detail. How much networking equipment does LL have? In how many locations? Since you seem to have the ability to estimate to the minute how fast stuff should be fixed, you must know this information. If not (which I'm sure is the case), you really don't stand in a position to say anything. If you're so sure about all this, why don't you go to work for LL? I'm sure they'd love to have an employee who can fix any problem that arises in 5 1/2 hrs. But then again, you'd take away everyone's reason to bitch all the time in the forum.

Yes, four hours---and if you don't think so, then you've never worked under a good contract with a major hardware vendor. I have seen four hour parts replacement from IBM, Cisco, Sun, and Network Appliance.

And, I stand in a position to say quite a lot of things, having had a hand in designing and implementing networks far, far, far larger and more complicated than anything Second Life will ever use.

Now, if you knew the thing or two about networks that you claim you do, you would know that no matter how good a network plan is, management can always completely ruin it in implementation. I'm only one of thousands of people who, given the budget, could fix all of Linden Labs' technical problems in about 12 months---and I'm only one of thousands of people who would never attempt such a thing with their current management.

Wise Clapsaddle

Registered User

Join date: 14 Aug 2006

Posts: 29

08-09-2007 15:57

Strangely enough, I understand everything you guys are saying, and although I'm not in the field, I have known ISP hardware to fail (big pipes, large-scale switches), and it always seems to be fixed 1 (assuming they have the equipment) to 5 hours after a major failure.

We did see Linden Labs go through this sometime last year, when they had what seemed to be a major hardware failure that took them around 12-16 hours to fix. This isn't the same breed of problem. The system is failing in many areas, one after the other, in a repeating pattern with no sign of change. I'm no expert, but that seems to me as though the problem lies much deeper in the design.

If these were more isolated problems, or isolated areas of code or hardware, then I think the problem would have been solved, if not 100%, then at least in each update or soon after. Maybe they just know the problem now, but the cost in many ways is too rich for them.

We all see the potential of Second Life, and we all tend to see the dream in one way or another, but I think people have to start facing the fact that Linden Labs are not the guys to be working on this. As was mentioned above, there are likely thousands of people who could have a go at this, and the only thing none of them could stand is working for Linden Labs. Give them funding, give them a reason, and they can knock up something far more scalable and far more stable.

Brenda Connolly

Un United Avatar

Join date: 10 Jan 2007

Posts: 25,000

08-09-2007 16:12

I love when Computer guys play "Mine is bigger than yours".

*makes popcorn*

_____________________

Don't you ever try to look behind my eyes. You don't want to know what they have seen.

http://brenda-connolly.blogspot.com

Wise Clapsaddle

Registered User

Join date: 14 Aug 2006

Posts: 29

08-09-2007 16:16

Just fueling the fire for further debate from the masters. Seriously guys, go on, we're listening.

*grabs popcorn*

Parsimony Paragon

SL Post-Anarchist, I Hope

Join date: 26 Oct 2006

Posts: 195

Who says we need schoolhouses?

08-10-2007 09:42

*Filches a small handful of popcorn*

Thanks, learning...I'm learning
