Load Balancer
|
|
SuezanneC Baskerville
Forums Rock!
Join date: 22 Dec 2003
Posts: 14,229
|
02-03-2008 11:42
Anyone have any further information to add to the blog's last post on the website problems?
From: old blog post, latest I see
More Intermittent Website Problems
Monday, January 28th, 2008 at 9:04 AM PST by: Teeple Linden
Those of us who post here occasionally find ourselves in the position of play-by-play announcers. We think it’s important to let you know that we’re aware of a problem as quickly as possible, instead of letting it sit until we can blog a tidy post mortem.
Because problems in a heavily distributed environment can be subtle and difficult to thoroughly trace, we sometimes have to relay unpleasant surprises (in the form of the infamous [RE-UN-RESOLVED] tags) as our tech team closes in iteratively on an operational glitch.
Having said all that: These intermittent failures with our Website seem to be setting some kind of record for aggravation.
Here’s what we know so far:
* It looks like a load balancer’s mostly to blame. A really stubborn load balancer.
* If you get a 500 Error, a 503 Error, or a redirect error, hitting ‘refresh’ in your browser will generally get you a good connection the second or third time around.
* Your irritation is thoroughly understandable, and we apologize for contributing to it with premature reports of resolution.
* Multiple Web-savvy Lindens are focused on this. Some of them, still on the road to work, don’t know that yet. They’ll find out very shortly.
* We’re not going to call this [RESOLVED] again until we’re sure the tag fits.
–teeple
What is a load balancer? (Thanks in advance for the post that says "someone or something that balances a load".)
What made it go bad?
I recall the forums were moved recently. Where were they moved from, and where to?
_____________________
-
So long to these forums, the vBulletin forums that used to be at forums.secondlife.com. I will miss them.
I can be found on the web by searching for "SuezanneC Baskerville", or go to
http://www.google.com/profiles/suezanne
-
http://lindenlab.tribe.net/ created on 11/19/03.
Members: Ben, Catherine, Colin, Cory, Dan, Doug, Jim, Philip, Phoenix, Richard, Robin, and Ryan
-
|
|
Bradley Bracken
Goodbye, Farewell, Amen
Join date: 2 Apr 2007
Posts: 3,856
|
02-03-2008 11:51
Here's load balancer from Wikipedia: http://en.wikipedia.org/wiki/Load_balancing_(computing)
Maybe someone will come along and describe it in English.
My favorite line from the blog is this one:
From: someone
If you get a 500 Error, a 503 Error, or a redirect error, hitting ‘refresh’ in your browser will generally get you a good connection the second or third time around.
Ha! If only it took two or three tries.
_____________________
My interest in SL has simply died. Thanks for all the laughs
|
|
Brenda Connolly
Un United Avatar
Join date: 10 Jan 2007
Posts: 25,000
|
02-03-2008 11:54
It's not just the Forum though; the entire website is 503'ed, at least for me.
*18 tries to make this post.*
_____________________
Don't you ever try to look behind my eyes. You don't want to know what they have seen.
http://brenda-connolly.blogspot.com
|
|
Laurence Corleone
Registered User
Join date: 12 Oct 2006
Posts: 126
|
02-03-2008 12:04
Two scenarios....
In situation A there is a server (server 1) that gets a lot of hits. So many hits that some are turned away because the server is busy. This is not good.
In situation B there are several servers (servers 1 through whatever) that are exact mirrors of each other. Instead of going directly to a server (as in situation A), each request is intercepted by a load balancer. The load balancer's job is to take that request and send it to one of the mirrored servers. In most cases it is sent to the server that has the least traffic on it. This is good.
This is a simplistic explanation. There might be more than one load balancer (likely), and each load balancer might be programmed to send certain calls to certain servers or groups of servers, or even to another load balancer. An example would be a cluster of servers that just handle forums.
As for what made it go bad? Poor programming, a bad cable... it could be just about anything, the same sort of thing that could affect your internet access at home.
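To make situation B concrete, here is a minimal Python sketch of the "send it to the server with the least traffic" rule. Everything in it (the class names, the server names, counting in-flight requests as the measure of traffic) is an assumption for illustration; nobody here knows what LL's balancers actually do.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    active: int = 0  # requests currently in flight on this server


class LeastConnectionsBalancer:
    """Toy balancer: always pick the mirror with the fewest active requests."""

    def __init__(self, names):
        self.backends = [Backend(n) for n in names]

    def acquire(self):
        # min() returns the first backend among those tied for least busy.
        backend = min(self.backends, key=lambda b: b.active)
        backend.active += 1
        return backend

    def release(self, backend):
        # Call this when the request has been answered.
        backend.active -= 1


# Hypothetical mirrored servers, as in situation B.
balancer = LeastConnectionsBalancer(["server1", "server2", "server3"])
picked = [balancer.acquire().name for _ in range(6)]
print(picked)
```

Six requests arriving back to back are spread two per server, so no single box takes the whole load.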
_____________________
There are no stupid questions, just stupid people.
|
|
Nika Talaj
now you see her ...
Join date: 2 Jan 2007
Posts: 5,449
|
02-03-2008 12:08
In this context, a load balancer is a device (or set of devices) that sits in front of a web farm and distributes traffic fairly among the servers in the farm. The purpose is to be able to optimally use a whole pile of servers that all do the same thing (think of Google, which has many buildings of identical servers). In this case, the web farm is a bunch of servers that all run this website, accessing a shared database for, for example, the forum postings.
Load balancers have a bunch of different ways to decide "fairness". I don't see that this website would require a particularly smart load balancer.
Teeple seems to be saying that one of the load balancers appears to be messed up ... maybe it thinks there are more servers than there actually are, or maybe it thinks there's more traffic than there actually is. Really obscure things could cause this ... maybe even a hardware failure, like a chunk of bad RAM (memory). Or, a traffic routing issue with the web farm, or outside of it. To find the problem, you'd look at recent routing changes in the data center/web farm, and also review the error logs from all the load balancers, etc., blah blah blah, the usual. Really, I'm puzzled as to why they haven't fixed this yet.
[I've noticed that some of the errors are coming from squid ... that is, their web cache, which also sits in front of the server farm and supposedly serves up unchanged pages faster than the application servers do. There are lots of ways for web cache devices to screw up a traffic stream as well.]
_____________________
.  To contact forum folks, join the inworld group "The Forum Cartel". New residents with questions about SL more than welcome! We has parties!  To contact forum scripters, join the inworld group "Scriptoratti" (thanks Void!). New scripter questions welcome!
|
|
Alexin Bismark
Annoying Bastard
Join date: 7 May 2004
Posts: 208
|
02-03-2008 12:10
From: Bradley Bracken
Here's load balancer from Wikipedia: Maybe someone will come along and describe it in English.
A picture usually saves a lot of words so....  This is a very simple load balancing setup but it should get the idea across. The "web servers" in the pic could be, for example, the servers that host the forum instances. Share the load across multiple servers so everyone isn't hammering one box. An example involving firewall load would look something like this.  Hope that helps.
|
|
Qie Niangao
Coin-operated
Join date: 24 May 2006
Posts: 7,138
|
02-03-2008 12:29
That's a pretty good article, actually. In a former life, I had to buy these things, and the general problem is that the more features they have, the more complicated they are to configure, since most settings are interdependent. And it's hard to be very smart about what's going wrong since we don't know what load-balancers LL is using or what they're really trying to do with them. (E.g., are they terminating SSL at the load-balancers? Are they using them for DDoS protection? Are they--god forbid--doing web caching right at the load-balancer itself? Who knows?)
If I had to guess, I'd venture that, perhaps:
* (least likely) the whole site is under a very extended Distributed Denial of Service attack. But lasting for over a week without seeing anything in the media makes this seem pretty unlikely. (To effectively counter such a thing, they'd need to get law enforcement involved, and stuff would get public by now, I'd think.)
* something about the network routing between colocation sites is not matching the configuration of the load balancer, maybe causing loops and leading the load balancer to conclude that the destination server is out-of-service--for all servers so routed. And LL's different network/server architectures for different services may be at the high end of complexity for these boxes anyway.
* (most likely) LL SNAFU. In my experience, load-balancer vendors are useless knobs when it comes to support, so LL Operations would be pretty much on their own, whether they're using a commercial product or something open source.
And the fact we've gotten no update at all suggests they're kinda embarrassed. Or so completely flummoxed that they don't even know what to say.
|
|
Kerry McCallen
Contents Under Pressure
Join date: 22 Dec 2007
Posts: 6
|
02-03-2008 12:34
Not knowing how LL's network is configured, I'd have to take a guess as to what's going on, but it seems that either they:
1. have a single load balancing device that's b0rked, acting as a single point of failure and causing problems for both the website and inworld as well. I can't imagine this being the case as having a non-redundant load balancer managing a network with this heavy of a load is foolish IMHO, as it creates a single point of failure for the entire network.
2. or, they have multiple load balancers that aren't communicating with each other properly and thus not correctly and evenly spreading the load out to the appropriate servers. Some servers are apparently getting a heavier load than others and aren't able to respond quickly enough to all of the requests. The load balancer eventually gets tired of waiting for a response and returns the dreaded "503 Service Unavailable" to the user.
In the interest of full disclosure, neither Kerry nor his typist work for LL, although his typist plays a network admin elsewhere in real life... I certainly wouldn't want to be on LL's network staff right now - I doubt they have slept much over the past couple of weeks...
Edit: You beat me to the punch, Qie, and made some points that didn't come to mind to me right away. It's safe to say that the network I manage is *somewhat* smaller than LL's.
The long and the short of the matter is that something is *seriously* fracked up in their network infrastructure...
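Scenario 2 can be sketched as a toy model in Python. The numbers are made up (a 2000 ms balancer timeout, 400 ms of work per request, a burst of ten requests) and the queueing model is deliberately crude; it is only meant to show how an uneven split turns into 503s while an even split of the same burst does not.

```python
TIMEOUT_MS = 2000  # hypothetical: how long the balancer waits for a backend


class Backend:
    def __init__(self, name, ms_per_request):
        self.name = name
        self.ms_per_request = ms_per_request
        self.queued = 0  # requests piled up on this server during the burst


def forward(backend, timeout_ms=TIMEOUT_MS):
    """Send one request to a backend and return the status the user sees."""
    backend.queued += 1
    # Crude model: a request waits behind everything already queued.
    wait_ms = backend.queued * backend.ms_per_request
    return 503 if wait_ms > timeout_ms else 200


def run_burst(split):
    """split[i] is how many requests of the burst land on backend i."""
    backends = [Backend(f"web{i}", 400) for i in range(len(split))]
    statuses = []
    for backend, count in zip(backends, split):
        statuses += [forward(backend) for _ in range(count)]
    return statuses


uneven = run_burst([8, 2])  # balancers disagree, one server gets hammered
even = run_burst([5, 5])    # same ten requests, spread evenly
print(uneven.count(503), even.count(503))
```

With these assumed numbers the uneven split produces three 503s and the even split produces none, even though the total load is identical.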
_____________________
Brutal Truth #27: If you and your friend encounter a bear in the woods, just remember that you don't have to be faster than the bear; just faster than your friend.
|
|
Desmond Shang
Guvnah of Caledon
Join date: 14 Mar 2005
Posts: 5,250
|
02-03-2008 12:39
I'm not quite familiar with the particular details of this situation, but load balancing can be a lot trickier than it sounds, theoretically.
Imagine two freeways A and B. A is more trafficky, so the morning news says: B has less traffic, yay!!!! ... doh.
There are lots of strange little issues like this. Wouldn't surprise me if something like that was happening here. Or there's a DoS attack or something.
I'm actually very, very impressed we don't have 'the whole grid is down Sundays' any more. THAT was annoying, and I have no idea how the Company fixed that other than setting loose the FBI and other authorities on the perpetrators.
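The freeway problem is easy to reproduce in a few lines of Python. This is a sketch under invented assumptions (100 cars, everyone hears the same traffic report and everyone obeys it); it says nothing about what is actually wrong at LL, only why acting on a shared, slightly stale load report can make things worse.

```python
def simulate(rounds, cars=100):
    """Return which freeway the news recommends in each round."""
    load = {"A": 70, "B": 30}  # starting traffic: A is the busier road
    history = []
    for _ in range(rounds):
        # The morning news reports whichever road was quieter a moment ago.
        quieter = min(load, key=load.get)
        # Every driver hears the same report and piles onto that road.
        load = {road: (cars if road == quieter else 0) for road in load}
        history.append(quieter)
    return history


print(simulate(4))
```

The recommendation flips every round (B, A, B, A): whichever road is announced as empty immediately becomes the jammed one. A balancer that routes on load figures that are out of date can oscillate in the same way.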
_____________________
 Steampunk Victorian, Well-Mannered Caledon!
|
|
Bradley Bracken
Goodbye, Farewell, Amen
Join date: 2 Apr 2007
Posts: 3,856
|
02-03-2008 12:43
From: Qie Niangao
That's a pretty good article, actually.
Yes, just a bit much for this very sleepy, groggy brain to catch on this rainy day. The brief descriptions and Alexin's nice pictures were easier for my limited brainwaves to grasp today. Interesting how easy it is once you understand it. It is exactly what the name implies.
|
|
SuezanneC Baskerville
Forums Rock!
Join date: 22 Dec 2003
Posts: 14,229
|
02-03-2008 13:27
Is there something peculiar about Linden Lab's web site that makes it more difficult to keep working properly than sites with vastly more traffic, such as the New York Times or CNN sites and such?
Is the load balancer a distinct piece of hardware or a program?
Do the forums have a single physical location, or are they now scattered about, the way the grid is partly in SF, partly in Dallas, and maybe somewhere else I can't remember?
I recall that the LSL wiki, back in the old days, fell victim to what I think was an attack targeted at it. I might well be wrong about that, but that's what I remember from back when Catherine was running it. That attack was why LL took the wiki in for a while, so they could deal with the pain of handling the attacks.
Currently there have been the folks doing the TMNT/Goatse/Cosby/Mario/Etc. using a client that spoofs IP numbers and hardware ids to keep from getting shut out. I wonder if that group or some other group is working on screwing up the website.
I hope it's just a technical problem that can get fixed and go away.
|
|
2k Suisei
Registered User
Join date: 9 Nov 2006
Posts: 2,150
|
02-03-2008 13:40
LL just needs to clear their cache.
|
|
Atashi Yue
Registered User
Join date: 24 Jan 2007
Posts: 703
|
02-03-2008 15:24
From: 2k Suisei
LL just needs to clear their cache.
And reboot the load balancer..
|
|
Rebecca Proudhon
(TM)
Join date: 3 May 2006
Posts: 1,686
|
02-03-2008 15:28
From: 2k Suisei
LL just needs to clear their cache.
And pay their ISP bill.
|
|
Day Oh
Registered User
Join date: 3 Feb 2007
Posts: 1,257
|
02-03-2008 15:29
From: someone
Tracing route to loadbalancer2.lindenlab.com [208.67.217.130] over a maximum of 30 hops:
 1    8 ms    7 ms   15 ms  10.120.32.1
 2    6 ms    9 ms    9 ms  172.22.32.250
 3   10 ms   11 ms   11 ms  atl-edge-18.inet.qwest.net [216.206.221.149]
 4   10 ms   13 ms   11 ms  atl-core-01.inet.qwest.net [205.171.21.161]
 5   10 ms   12 ms    9 ms  atl-brdr-04.inet.qwest.net [205.171.21.174]
 6   12 ms    9 ms   11 ms  0.so-3-0-0.BR1.ATL4.ALTER.NET [204.255.169.81]
 7   11 ms   11 ms   19 ms  0.so-1-1-0.XT1.ATL4.ALTER.NET [152.63.86.170]
 8   31 ms   32 ms   33 ms  0.so-3-2-0.XL3.NYC4.ALTER.NET [152.63.0.182]
 9   33 ms   31 ms   31 ms  POS6-0.GW2.NYC4.ALTER.NET [152.63.19.221]
10   39 ms   31 ms   31 ms  splicetelecom-NewYork-gw.customer.alter.net [157.130.14.214]
11    *       *       *     Request timed out.
12    *       *       *     Request timed out.
13    *       *       *     Request timed out.
14    *       *       *     Request timed out.
15    *       *       *     Request timed out.
16    *       *       *     Request timed out.
17    *       *       *     Request timed out.
18    *       *       *     Request timed out.
19    *       *       *     Request timed out.
20    *       *       *     Request timed out.
21    *     ^C
Does this mean anything? loadbalancer1, loadbalancer3, all the way up through 15 work fine, but loadbalancer2 doesn't? I mean, I don't know anything about networking... surely if this means anything, they would've figured it out a week ago?
|