SL Scalability -- Issues to address.
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
10-22-2004 05:53
From: Hiro Pendragon
While I agree the cell site analogy is weak and really not applicable, the fact is, Morgaine, you can't make all machines unassigned if you expect SL to expand past LL.

Yes indeed Hiro, that is largely true. Beyond SL, in a community metaverse, the issues would be significantly harder.

From: someone
If you have all machines unassigned, then you have a sharing of servers by all providers. It's a simple matter of real world ownership. Let me lay it out in a series of steps how I see SL growing. (Granted, this is my opinion based on my analysis alone.)
From: someone
1. Linden Lab, in the short to mid run, can do whatever shared resource optimization they desire since they own 100% of the servers.
2. At some point, Linden Lab's server farm grows at a pace that will lead to it becoming impractically huge to manage by one company.

Very true. Hehe, I've yet to find a manager who thinks that something is too large for them to manage, but that's a different issue. And incredibly, I've come across product managers who fervently believed that in 5 years their managed product would be universal and the diverse anarchy of the Internet would be a thing of the past, with them in control. Managers are an odd lot. We'll have to see how LL managers move on this one, but I think you're right: the size of a management domain is bounded. Beyond some bound, things often go rapidly downhill.

From: someone
3. As Linden Lab strains to continue to manage all the servers, companies and organizations will demand the flexibility and ownership of their own servers to best provide service to their visitors / customers on the Metaverse / SL.

I like this analysis. Actually, it'll be more of a continuous process, and only accelerated by the problem you highlight. Many of us would like to diversify that ownership even this early in the game, if it were distributable. Some of us just because it would be fun, others for commercial reasons.

From: someone
4. Linden Lab makes the decision to lease out / open source the server code to 3rd parties - the same people as in 3. This results in the decentralization of LL as the primary content server.

You're combining three issues here: machine resourcing (powering the content), dev resourcing (creating software), and, for want of a better term, artistic resourcing (creating content). They are all needed, but they don't come as a package deal, and therefore I would want to consider them separately. My own feeling is that there is no alternative to open-sourcing development, simply on economic grounds. The open source community is the only almost-limitless development resource on the planet, and to a large extent it can be harnessed at very little cost (the cost is largely paid in good will and openness). Contrast that with the sky-rocketing cost of labour everywhere, even in the 3rd world. No contest, in my view.

From: someone
5. As Linden Lab is no longer the primary content server, shared resources between companies becomes improbable. "Joe's Autos" server does not want to share resources with "ABC Bank". Each wants its own dedicated resources that they will be happy to pay for to ensure visitors to THEIR section of the Metaverse are served. They frankly don't care if other parts of the Metaverse are laggy.

Indeed. There is ample precedent for people taking this approach, eg. websites using client-killing content, games requiring the latest graphics cards, and so on. I think you're right, but I don't see it as a bad thing, because it pushes the envelope of what is possible, and that drives progress.

From: someone
And that's the crux of things. When the servers become decentralized away from Linden Lab, the natural consequence is that the responsibility for keeping lag low in the Metaverse will fall on the individual server owners. Ergo, shared resources are improbable.

Sharing resources is actually a move toward centralization, not away from it as you assert. I agree with your premise and probably with your assessment about how things will evolve. However, it's not a showstopper; there are possible solutions to this.
In particular, resources don't always have to be applied only to local objects: you can set up a reverse-direction task stream to help out the client using server power. I don't think that is the way to go here, because client numbers rise far more rapidly than server numbers, but I've used something similar for decoupling purposes, so it is possible. For now I'm concentrating on how one could scale the single cohesive LL system, but scaling the metaverse is a wonderfully more ambitious goal which I'm glad to see that someone has started to analyse here. Great stuff.
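As an aside, here is a minimal sketch in Python of the static-versus-pooled contrast that underlies all of this. It is purely illustrative: the zone names, loads, and capacities are invented, and this is nothing like LL's actual code.

    zones = {"Kamba": 10, "Gibson": 350}       # pending work units per zone
    servers = ["box1", "box2", "box3", "box4"]
    CAPACITY = 100                             # work units per box per tick

    def static_tick():
        # Static grid: each zone is pinned to one box, so only one box's
        # capacity ever applies to a zone, and unpinned boxes sit idle.
        return {zone: min(load, CAPACITY) for zone, load in zones.items()}

    def pooled_tick():
        # Dynamic pool: every box draws from the combined workload stream,
        # busiest zone first.
        remaining = CAPACITY * len(servers)
        done = {}
        for zone, load in sorted(zones.items(), key=lambda kv: -kv[1]):
            done[zone] = min(load, remaining)
            remaining -= done[zone]
        return done

    print(static_tick())   # {'Kamba': 10, 'Gibson': 100}: Gibson lags, two boxes idle
    print(pooled_tick())   # {'Gibson': 350, 'Kamba': 10}: the event is fully served

Same four boxes in both cases; only the assignment policy differs.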
|
Jack Digeridoo
machinimaniac
Join date: 29 Jul 2003
Posts: 1,170
|
10-22-2004 06:54
From: Morgaine Dinova
But Jack, that ridiculous idea came from Eggy, not from me. A mainframe is the least scalable implementation that there could possibly be, and it just shows how totally unfamiliar Eggy is with the concept of scalability that he even proposed it. Or with economic viability. It's his usual straw man approach --- invent some idea which wasn't proposed so that he can knock it down. The hardware of a scalable architecture for SL would be no different to that used currently, ie. just a collection of Linux boxes, subdivided into 3 or 4 load-balanced clusters or pools to deal with different aspects of it. LL seem to be doing that already in the database area.

Morgaine, if you are an expert, please state the make and model number of the mainframe that you say is the least scalable. You wouldn't talk about it if you didn't know the make & model, would you? If you really do have all this experience, producing a make and model # should be pretty easy, right? Actually, based on how much you claim to know, I'd expect you to list 5 and tell me which one was cheaper.

And Morgaine, you brought up scaling mobile objects. You're saying each AV gets assigned its own server. Instead of 1 server per sim, you login and get assigned a server, it builds the area around your AV dynamically, pulling massive amounts of data from the now-essential cluster of DBs, and streams the data to your client. If that was not what you meant by mobile, please enlighten me. That would be an incomprehensible increase in bandwidth between cells in the grid. Not to mention years of man-hours to redesign the software.

So far Morgaine, it seems like all your wordy posts boil down to "get a server iron man... they rawk."
_____________________
If you'll excuse me, it's, it's time to make the world safe for democracy.
|
Hiro Pendragon
bye bye f0rums!
Join date: 22 Jan 2004
Posts: 5,905
|
10-22-2004 07:03
From: Morgaine Dinova ... not a showstopper, there are possible solutions to this.
That's what we're all pulling for, and certainly Philip & Co. too.

From: someone
For now I'm concentrating on how one could scale the single cohesive LL system, but scaling the metaverse is a wonderfully more ambitious goal which I'm glad to see that someone has started to analyse here. Great stuff.

Well, since Philip declared we're the Metaverse, we gotta keep our eye on biggity big. =) You may want to assume there has to be a server per sim; maybe your efforts might work best looking at the shared server farm as a supplement?
_____________________
Hiro Pendragon ------------------ http://www.involve3d.com - Involve - Metaverse / Emerging Media Studio
Visit my SL blog: http://secondtense.blogspot.com
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
10-22-2004 07:36
From: Hiro Pendragon
Well, since Philip declared we're the Metaverse, we gotta keep our eye on biggity big.

I'm not 100% sold on that, Hiro --- buyer beware. Yes, he's said some very encouraging things which I wholeheartedly endorse and am trying to assist, but remember that he's a top-league power manager, so one has to assume that many things are said for effect. Maybe even the "millions of customers" thing is more a desire than a plan, or not even a desire but just a way to encourage interest from the progressive community, so maybe we're all barking up the wrong tree even bothering to help. For now though, I'm taking the face-value interpretation, for lack of counter-indications. And it's more interesting, anyway.
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
10-22-2004 07:53
From: Jack Digeridoo
Morgaine, if you are an expert, please state the make and model number of the mainframe that you say is the least scalable.

LOL, Eggy brought up mainframes; you talk to him about it, not me. And by all means offer us a scalable solution based on mainframes, it would be great to see --- I mean that honestly. Personally, I'm not even considering solutions that don't use LL's existing grid boxes as their basic workhorse, but that's just commercial pragmatism on my part, and it most certainly doesn't mean that you cannot find a viable package if you are creative and look for it. Even the cost argument may not apply, because mainframe salesmen sometimes offer "discounts" that are stunningly huge, to get a foot in the door. There's room for plenty of other types of solution --- go for it.

Btw, I've seen first-hand one serious proposal for replacing a 1000-node network of Unix boxes with a single Cray back in academia, and it was certainly not rejected out of hand despite not being taken up. Managerially it was a terrific idea; the MIS types loved it. But don't forget to factor in disposal of LL's grid boxes in your proposal: all the money that will be lost from PC depreciation will take a lot of discounts to overcome. Still, it's your proposal, or Eggy's, so you worry about it. We're listening.

From: someone
You're saying each AV gets assigned its own server.

No I did not, and therefore the rest of your paragraph wasn't relevant. That would be static assignment of another kind, which is a bad idea.
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
10-23-2004 05:56
Here's a reply of mine about scaling posted to Philip's weblog. Philip has responded to various earlier comments and suggestions made by posters about load and scalability, as well as other subjects. It's a very positive discussion, but oddly low volume. Here's my new post, which highlights a number of other benefits besides events scalability:

From: Philip Linden (weblog)
"Agree that repartitioning server resources to match load is an important step. However, our architecture does support this concept, in that we have a variety of ways to partition load and a distributed network with enormous capacity. You may have noticed that recently we have been testing 'ocean' sims for example, which are cases where we run many simulators on one physical machine to allow creating large sparse areas of land. This is an example of the sort of things we can do."

But that is not an example of dynamic assignment of server resources to match load. It merely statically assigns multiple zones to a single box, which is to some extent the opposite of scaling up, since it reduces the power available for a single zone rather than raising it. It doesn't throw multiple boxes at a given load problem, and so it cannot scale up resources for popular event handling.

For anyone who is interested, we have a long-running thread on SL scalability in the forums at: <<<link to this forum thread>>>. Scaling services to millions is what I've been doing professionally, so the info in the thread is an adaptation of well-tested methods to the particular needs of server-side processing in SL. The service does actually seem very well suited to such restructuring. [Disclaimer: I'm not at work here, just enjoying my SL hugely and wanting it to improve by using the best techniques.]

While it is theoretically possible to scale up a statically tiled grid to cope with large events in the greatly expanded SL of "Philip's millions" (and that's a wonderful vision in my view), ie. by putting massive resource into every zone, I'm sure it's clear that this isn't sensible commercially. The vast majority of resources would be wasted most of the time. A dynamic system would virtualize the grid of zones and allow all servers to work on the overall workload stream, or preferably on several workload streams reflecting the various different kinds of processing activity.

In addition to scaling up SL for mobile objects, which the static grid cannot do, and hence scaling for events, a large number of other beneficial properties would result from the architectural change. Graceful degradation is a major one, since loss of a server has no critical impact on service when there is no longer any single point of failure and the failed server is only one out of many in the pool. For exactly the same reason, maintenance becomes much easier, and the benefits to staff of no longer being called out in the middle of the night to replace a failed server cannot be overstated. The benefit to customers is obvious.

Other benefits are no less important. Upgrading a static grid with better machines all at once is rarely tenable, yet doing it piecemeal is unfair to those customers who are stuck with the older server in their zone. (I noticed a complaint about that in the forums today.) That would not happen when an individual server's contribution is averaged out across all zones. And even old, half-obsolete kit can continue to play a part, instead of needing to be removed from the grid to avoid major unfairness.
Once a dynamically scalable framework is in place, scaling becomes a gradual process of renewal and growth to remain in step with customer numbers and expectations across the whole platform, without upheavals or local disparity. And geographic zone expansion becomes decoupled from network expansion, more a matter of policy than a matter of resources. It's a good package of benefits all 'round.

The key issue though is just the scalability for mobile objects, I believe. Without a dynamic approach to that, there will be no scaling of SL to a customer base of millions, because the vast majority of resources will always be stuck idling away their cycles where they are not needed. And the trend is terminal: the better and more popular SL events become, the more resource elsewhere idles and the greater the resource deficit at the event. A vicious trend to be in.

[Posted on October 23, 2004 05:25 AM]
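For the curious, here is one standard mechanism (consistent hashing; this is my illustration under made-up names, not LL's actual design) that delivers exactly the graceful-degradation property described above: when a box dies, only that box's share of the zones gets reassigned, and the rest of the pool carries on untouched.

    import hashlib
    from bisect import bisect

    def h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, servers, vnodes=64):
            # Each server appears at many pseudo-random points on the ring.
            self.ring = sorted((h(f"{s}#{i}"), s)
                               for s in servers for i in range(vnodes))
            self.keys = [k for k, _ in self.ring]

        def owner(self, zone: str) -> str:
            # A zone belongs to the first server clockwise from its hash.
            return self.ring[bisect(self.keys, h(zone)) % len(self.ring)][1]

    boxes = [f"box{i}" for i in range(10)]
    zones = [f"zone{i}" for i in range(1000)]

    full = Ring(boxes)
    degraded = Ring(boxes[:-1])          # box9 dies in the night
    moved = sum(full.owner(z) != degraded.owner(z) for z in zones)
    print(f"{moved / len(zones):.0%} of zones reassigned")

The printout lands around 10% for a ten-box pool, ie. losing one box in ten disturbs roughly one zone in ten rather than taking down the whole grid.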
|
Artillo Fredericks
Friendly Orange Demon
Join date: 1 Jun 2004
Posts: 1,327
|
10-25-2004 09:47
((( sends out group hug to everyone on this thread )))) just because! Regardless of your individual technical savvy in whatever area, I'm glad that there is some constructive discussion going on!!! Let's keep it that way and try not to let our egos get in the way of truly fruitful discussions, shall we? Arti

PS - HEY LINDENS WHERE ARE YOU???? YALL NEED TA BE IN ON THIS MORE EHH!!! :: BUMP::
_____________________
"I, for one, am thouroughly entertained by the mass freakout." - Nephilaine Protagonist --== www.artillodesign.com ==--
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
10-28-2004 13:54
A little bit of info on a future dynamic server architecture appeared in today's Town Hall:

From: Philip Linden (Town Hall log)
Beatfox Xevious: Are there any plans to distribute processing of the grid evenly across the servers, instead of having a dedicated server per sim? This would be a major boon to high-traffic sims.
Philip Linden: We will work on ways to better distribute load. This is a hard problem....
Philip Linden: we are thinking about different designs.
Philip Linden: Long term I'm sure it will be something we'll be able to do.

The tenses used probably indicate the current state of play: LL will work on load distribution, and they are thinking about different designs for it. Well, that's better than a "No", lol. Dynamic assignment is unavoidable if LL is to scale, and they know it --- no surprise there. Having worked for clients where customer growth outpaced the scalability of the platform, I sure hope that they hurry. It's one of the worst things that can happen to a service to be in a permanent server-support firefight owing to insufficient scalability, because there is little you can do about it apart from throwing extremely expensive hardware at the problem. You're fighting mathematics at that point, and that's pretty much a lost cause.
|
Hiro Pendragon
bye bye f0rums!
Join date: 22 Jan 2004
Posts: 5,905
|
10-29-2004 02:03
From: Morgaine Dinova
A little bit of info on a future dynamic server architecture appeared in today's Town Hall... The tenses used probably indicate the current state of play: LL will work on load distribution, and they are thinking about different designs for it. Well, that's better than a "No", lol. Dynamic assignment is unavoidable if LL is to scale, and they know it --- no surprise there. Having worked for clients where customer growth outpaced the scalability of the platform, I sure hope that they hurry. It's one of the worst things that can happen to a service to be in a permanent server-support firefight owing to insufficient scalability, because there is little you can do about it apart from throwing extremely expensive hardware at the problem. You're fighting mathematics at that point, and that's pretty much a lost cause.

Morgaine, the Lindens are watching. More and more, I'm finding discussions from the technical issues and feature feedback forums mentioned in town halls and implemented in releases. Either we're prophetic or we're being heard.
_____________________
Hiro Pendragon ------------------ http://www.involve3d.com - Involve - Metaverse / Emerging Media Studio
Visit my SL blog: http://secondtense.blogspot.com
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
10-29-2004 02:47
From: Hiro Pendragon
Morgaine, the Lindens are watching. More and more, I'm finding discussions from the technical issues and feature feedback forums mentioned in town halls and implemented in releases. Either we're prophetic or we're being heard.

Yes, it's reassuring. Maybe it's a combination of both of the above. I wouldn't call it prophecy though when one proposes industry-standard methods for scaling and they suddenly appear. More like common sense. It's also worth remembering that the forums aren't the only channels for this; I bet other tech people are using direct email to LL as well --- I'm starting to use this more, as it avoids wasting time responding to forum detractors. Mind you, it won't stop the fanboys and other regressives when they're shown to be wrong though; they'll continue saying that everything is impossible until either Philip says they're doing it in Town Hall or it appears magically in a release.
|
Hiro Pendragon
bye bye f0rums!
Join date: 22 Jan 2004
Posts: 5,905
|
10-29-2004 03:15
From: Morgaine Dinova
Yes, it's reassuring. Maybe it's a combination of both of the above. I wouldn't call it prophecy though when one proposes industry-standard methods for scaling and they suddenly appear. More like common sense. It's also worth remembering that the forums aren't the only channels for this; I bet other tech people are using direct email to LL as well --- I'm starting to use this more, as it avoids wasting time responding to forum detractors.

Common sense assumes the common person understands computer architecture? heh. Oh, next aspect of scalability - reverse compatibility and integrating WWW stuff into SL. Thoughts?
_____________________
Hiro Pendragon ------------------ http://www.involve3d.com - Involve - Metaverse / Emerging Media Studio
Visit my SL blog: http://secondtense.blogspot.com
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
10-29-2004 05:21
From: Hiro Pendragon
Oh, next aspect of scalability - reverse compatibility and integrating WWW stuff into SL. Thoughts?

Good point. There are indeed a few reverse-compatibility issues associated with moving to a dynamic architecture, because after all, some people may have been using knowledge of the restrictions in the current static one in their builds. For example, a horse-racing track owner might have built his gambling enterprise on the basis that a 40- or 50-seat grandstand is enough, and without including a grandstand booking limit, in the knowledge that the poor little single server will reject any more people than this from entering the zone. In a dynamic architecture, you would no longer be limited by the power of a single box at the server end, so the entrepreneur might find that his design wasn't adequate to meet demand. On the whole though, that kind of reverse-compatibility issue is a matter for the people who are building enterprises within SL, ie. they need to consider the impact of their popularity. It's not a reason for holding back progress for everyone else.

There are also reverse-compatibility arguments that need to be considered client-side. Once the servers can handle not 40 but 400 or 4000 people in a zone, then you require much more sophisticated variable rendering techniques in the client to give people control over their FPS, since rendering every polygon within visible range becomes completely untenable, except for those who enjoy an FPS of 0.0000001.

Rendering crowds is currently a research area. While the research is ongoing, the pragmatic engineer's approach is to separate out av+clothes rendering from the rest, give it its own range of visibility for rendering avs (beyond that, grey shadows only), and allow the client to specify a max rendered av count as well, counting from range zero outwards. If you apply enough pragmatics and heuristics, it usually works well enough. (A sketch of one such policy follows at the end of this post.)

And of course, they should use the programmable hardware shaders in our modern GPUs, as I mentioned in that thread about shader effects. Our poor little CPUs are glowing red hot while the GPU shader hardware is twiddling its thumbs. And the client rendering thread needs to be decoupled from the network thread too before scaling is possible, or it's lag city. There's a lot of room for improvement.
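Here's that sketch, in Python and purely illustrative: the thresholds, the Avatar type, and render_plan are all invented here, nothing from the SL client.

    from dataclasses import dataclass

    @dataclass
    class Avatar:
        name: str
        distance: float          # metres from the camera

    def render_plan(avatars, max_full=40, full_range=30.0, shadow_range=100.0):
        """Render the nearest avatars fully, grey-shadow the mid-range, skip the rest."""
        plan = {"full": [], "shadow": [], "skip": []}
        for i, av in enumerate(sorted(avatars, key=lambda a: a.distance)):
            if i < max_full and av.distance <= full_range:
                plan["full"].append(av.name)      # full mesh, clothes, animations
            elif av.distance <= shadow_range:
                plan["shadow"].append(av.name)    # cheap grey silhouette only
            else:
                plan["skip"].append(av.name)      # beyond av visibility range
        return plan

    crowd = [Avatar(f"av{i}", i * 1.5) for i in range(400)]   # a 400-av event
    plan = render_plan(crowd)
    print({k: len(v) for k, v in plan.items()})   # {'full': 21, 'shadow': 46, 'skip': 333}

The budget (max_full) caps the rendering cost no matter how big the crowd gets, which is the property that matters once zones stop being 40-av affairs.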
|
Marker Dinova
I eat yellow paperclips.
Join date: 13 Sep 2004
Posts: 608
|
10-29-2004 05:44
From: Morgaine Dinova
Rendering crowds is currently a research area. While the research is ongoing, the pragmatic engineer's approach is to separate out av+clothes rendering from the rest, give it its own range of visibility for rendering avs (beyond that, grey shadows only), and allow the client to specify a max rendered av count as well, counting from range zero outwards. If you apply enough pragmatics and heuristics, it usually works well enough.

I've seen this problem addressed on another metaverse by using what they call "blockheads", which act as placeholders for the excess avies but are rendered with very little detail, both image- and animation-wise. The cutoff for blockheads is related to both number and distance, but is not configurable on the client - at least in their implementation.
_____________________
The difference between you and me = me - you. The difference between me and you = you - me. Add them up and we have 2(The difference between me and you) = 0. The difference between me and you = 0/2. The difference between me and you = 0. I never thought we were so similar.
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
10-29-2004 06:10
From: Marker Dinova
I've seen this problem addressed on another metaverse by using what they call "blockheads", which act as placeholders for the excess avies but are rendered with very little detail, both image- and animation-wise. The cutoff for blockheads is related to both number and distance, but is not configurable on the client - at least in their implementation.

"Blockheads"? How uncharitable. Sounds like that metaverse is starting to think ahead though, ready to scale up. That's good. Don't be afraid to tell us which it is. Pity that the word "ghost" in SL has come to mean the bugged ghosted-object thingie, since otherwise our rather nice grey shadowy outlines for avs would suit the word admirably.
|
Marker Dinova
I eat yellow paperclips.
Join date: 13 Sep 2004
Posts: 608
|
10-29-2004 06:29
From: Morgaine Dinova "Blockheads"? How uncharitable.  Sounds like that metaverse is starting to think ahead though, ready to scale up. That's good. Don't be afraid to tell us which it is.  LOL... The "other" one is "There"... I just wanted to keep the discussion on the solution, avoiding the chance of it seeming like a faceslapping comparison 
_____________________
The difference between you and me = me - you. The difference between me and you = you - me. Add them up and we have 2(The difference between me and you) = 0. The difference between me and you = 0/2. The difference between me and you = 0. I never thought we were so similar.
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
10-29-2004 06:46
From: Marker Dinova
LOL... The "other" one is "There"... I just wanted to keep the discussion on the solution, avoiding the chance of it seeming like a face-slapping comparison

Uh oh, not There! I didn't want to start another metaverse war thread. It's actually very useful for people to report any downsides that they see in SL compared to other virtual worlds, so that the issues can be addressed. "Being the best/most advanced" is meaningless unless one compares SL's performance and features against those of other systems. The blockhead rendering you described from There seems a very useful scalability feature which we need to adopt. One can't expect balanced reporting on this from LL, nor from the perpetual fanboys, so it has to come from those who have wider experience of the state of the art in online worlds.

As Philip said yesterday, he wants SL to cover MMOG territory and high-speed gaming, whereas SL currently trails massively in rendering and interaction speed and in interactive gaming interfaces (the first link in my sig covers one aspect of that). So we have a long way to go, and many battles to fight.
|
Hiro Pendragon
bye bye f0rums!
Join date: 22 Jan 2004
Posts: 5,905
|
10-29-2004 15:41
From: Morgaine Dinova
As Philip said yesterday, he wants SL to cover MMOG territory and high-speed gaming, whereas SL currently trails massively in rendering and interaction speed and in interactive gaming interfaces (the first link in my sig covers one aspect of that). So we have a long way to go, and many battles to fight.
Hmmm, don't be TOO sure. How many people does it take to crash a sim? 50? I seem to remember the early days of EQ when 50 people in one zone meant a standstill.
_____________________
Hiro Pendragon ------------------ http://www.involve3d.com - Involve - Metaverse / Emerging Media Studio
Visit my SL blog: http://secondtense.blogspot.com
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
10-29-2004 16:08
From: Hiro Pendragon
I seem to remember the early days of EQ when 50 people in one zone meant a standstill.

Ah, you predated me in EQ by a lot then. I came in just after Velious was introduced, and subsequently Luclin's wonderful Bazaar brought 340 avatars together in very close proximity without any server crashes. My client would start lagging at around 250 people in the Bazaar as I recall (but I had really old machinery), so I'd have to wind the clip plane down to keep FPS up. With a full 340 people in the Bazaar, my clip plane was down to 30 metres or I'd be walking through treacle.

This never affected the raiding game where speed mattered though, since even our biggest raids rarely used more than the 6 groups of 6 people supported by the raid window. At raid time, with 36 people and another 50-80 mobs in densely populated dungeons, there was no discernible lag for me except under fault conditions. And fault conditions were pretty rare. In retrospect I hated the EQ world because of the built-in hardships of life, but one certainly couldn't fault them on reliability. Sony Customer Relations/Support may hate their customers with a passion, but their development branch must have a fantastically thorough QA/Testing department.
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
Customer numbers and areas of scaling.
10-30-2004 05:14
From: Hiro Pendragon
How many people does it take to crash a sim? 50?

Crash is probably the wrong word; more like grind to a halt. Or was there a time when the server code actually crashed? I guess it's hard to test, since we'd have to fill a sim up with people to the 40-avatar zone-crossing limit and then teleport in another 10 at least. And then there's the issue of different kinds of boxes being able to carry different loads to consider.

Relevant to this, Philip's vision on customer numbers became a lot clearer after the last Town Halls of 27/28 Oct. There was a small question mark before over where he'd find his millions of customers if SL remained only a 3D chat client and artistic creation tool, because the audience for that is limited compared to the wider gaming market. But now that he has clearly stated what many of us had assumed was the case, that he aims for SL to provide a world capable of MMOG-type and other fast interactive gaming, the earlier statements about customer numbers become crystal clear. Indeed, merely millions is no longer even ambitious, since there already are many millions of online game players spread across the existing major offerings. By the end of the decade we can easily expect the size of the world's online 3D player population to be in the hundreds of millions, and therefore even a relatively unambitious visionary would be expected to be aiming at dozens of millions... and nobody could accuse Philip of being unambitious.

So, let's assume that he does mean dozens of millions, as a small but respectable slice of the overall pie of online gamers across all genres by the end of the decade. That's a nice goal for service scalability. The full target size is itself directly relevant to scaling customer databases etc. When you divide the total number of customers down to give the number concurrently online, then you're into overall world scaling. Then below that you need to scale at the local level, which is event-oriented (there won't be any zones as we know them in a dynamic architecture, and a static one isn't even in the picture, on the grounds of being a toy at that level). Then below that you get scaling viewer locality as part of rendering load management, and a whole range of solutions there are clear even at this early stage.

This is a brilliant engineering goal. I'm in my element.
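To put rough numbers on those layers (everything below is my own assumption for illustration; neither the concurrency ratio nor the event size comes from LL or the Town Halls):

    total_customers = 24_000_000     # "dozens of millions" by decade's end
    concurrency_ratio = 0.05         # assume 1 in 20 online at any moment
    avg_event_size = 300             # assume a large event draws ~300 avatars

    concurrent = int(total_customers * concurrency_ratio)
    print(f"concurrently online: {concurrent:,}")                    # 1,200,000
    print(f"simultaneous 300-av events, worst case: "
          f"{concurrent // avg_event_size:,}")                       # 4,000

However you tweak the ratios, the shape stays the same: a database layer sized for tens of millions, a world layer sized for around a million concurrent, and an event layer that must repeatedly concentrate hundreds of avatars into small areas.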
|
Bran Brodie
Registered User
Join date: 5 Jun 2004
Posts: 134
|
11-01-2004 18:13
From: Eggy Lippmann
160 avatars worth of textures, and would it even fit on my crappy graphics card with its limited memory? How about the 800 prims worth of fancily textured and scripted attachments?

Naked people, that is the only answer.
_____________________
Someday there will be a Metaverse that puts users first. Sadly LL does not want to be that Metaverse.
|
Bran Brodie
Registered User
Join date: 5 Jun 2004
Posts: 134
|
11-01-2004 18:46
From: Morgaine Dinova
My local sim serves me fine when I am at home in Kamba, but it cannot help reduce the load that I cause when I visit Gibson, to use my previous example. It has been partitioned off, it is no longer available for dynamic assignment, and it represents wasted resource when its home avs or objects move elsewhere for an event.

One shouldn't get too hung up on "wasted resources"; that can be a distraction from a good pragmatic solution. In any event, in order to provide dynamically configured server resources, a scheme must be created that allows more than one server to serve an area. This is not an easy problem. Eggy: this is not the inverse of one server serving more than one sim.
_____________________
Someday there will be a Metaverse that puts users first. Sadly LL does not want to be that Metaverse.
|
Gwyneth Llewelyn
Winking Loudmouth
Join date: 31 Jul 2004
Posts: 1,336
|
Region-free grid vs. metaverse building blocks
11-05-2004 06:06
Hello all,

Since this thread reminds me of some thoughts I posted on my blog in early October (see the section "Bottlenecks and improving performance"), I'd like to bring them into the discussion here. Another disclaimer: unlike Morgaine, I haven't been designing multi-million-user architectures professionally for a long while, so I'm quite outdated in several ways, and it's highly likely there are better, newer ways of doing things that we never dreamed of when the Internet was so much smaller a space...

Unlike Eggy, I also don't view the "client" as being the major limitation on improving performance. SL is too ambitious for my taste - it wants the client to get all the data inside a sim (and the neighbouring sims as well) and try to render that data at the same time. I understand the approach, since SL is so dynamic and uses lots of transparent textures and quirky tricks. A better approach would just stream the data that is "needed" - but this approach needs much more server-side code (ie. the wall conveniently blocking lots of objects suddenly disappears from one frame to the next - and you have to get all the data that was previously hidden, fast). I've given some thought to those issues, but, sincerely, in my day, "graphics cards" were something you got on Silicon Graphics workstations for US $10,000 or more, and "normal people" did ray-tracing in software - so I don't have a clue what a modern GPU can do these days. Things I've read on other threads have left my mouth open with wonder. Seems that GPUs nowadays can do things with pixels that didn't even exist conceptually or algorithmically 15 years ago. My, I'm really getting old.

So, back to the grid's architecture! While Morgaine has the Solution with a capital S - abolish regions, run the whole of SL in a small cluster of machines, distribute the load heavily with caches and nifty tricks (very similar to what Akamai does with streaming) - Hiro raised a very important point. This solution works wonderfully well (and scales to zillions of users) if Second Life is something "restricted" to Linden Lab, and the whole metaverse "is" just Second Life and nothing else. Now this is the MSN approach - and if you remember Philip's words, he does not believe in that way of building the metaverse. He wants to have several grids (or isolated sims) spread around the world. He wants to be able to point at a physical machine and say "this computer serves the region for company X". Why? Because it's the way Web hosting has worked for the past 15 years or so, and it's the way customers expect to be billed. So currently "content" is something you can usually point at a machine and say "this is MY content, I pay a fee per month for having it online".

Putting it another way... a static grid is marketable; it follows the same concepts as Web hosting, and it enables non-LL regions to exist all around the world by simply running LL's software. So, a static grid makes commercial sense. However, it cannot deal with scalability issues - CPU is wasted on "empty" regions, and no matter how much hardware you throw at busy regions, there will still be lag there. My point is, it's hard to get both views - the technical issues and the marketing issues - working together. Will this mean that LL will fail because it can't address both issues at the same time? Well, again referring to my blog, there are certain compromises that could be implemented (and, again, this has a very similar counterpart on the Web).

Some of you are "Internet-old" enough to remember when you could just have one Web server running on a physical box - way before "virtual hosting" appeared. If you needed to host a second Web server, you had to add a second box. If one box had one zillion hits and the other one was hosting a personal homepage, CPU cycles were wasted. This sounds remarkably like what's happening with SL right now. The Web is also a "static grid" - each server hosts its own content, and you don't share CPU power among Web servers. (Of course, unlike what happens in SL, the Web does not have moving, dynamic objects jumping from one place to another.)

What seems to follow logically is running "virtual sims" inside the same hardware, and that's what LL has been doing for a while. Running virtual Linux boxes, each one with its own virtual IP, is a trivial feat nowadays. If you really, really want to put 400 (or 1000) boxes inside the same hardware, buy an IBM mainframe - one of the models (I think it's the zSeries, but I don't really understand mainframes) was designed for exactly that purpose, back in 1999, and if I remember correctly, you could get a USD $0.5 million machine to host something like 10,000 Linux virtual machines - at a cost of about $50 per virtual server. Buy a couple of those for redundancy, and Second Life, without changing a line of code in the server software, would be able to host perhaps 400,000 users or so, and you'd need to maintain only one machine! With the USD $8 million VC that LL raised, this means supporting 6.4 million users... on just 16 machines, which would take up much less rack space than the current grid... and without wasting any CPU cycles!

So is this the ultimate solution? Well, the way I think about it is slightly different. LL has probably spent at least USD $1 million on the current hardware, and I'm quite certain that they won't throw it away to "go mainframe". But imagine that the server code is able to be licensed soon (or even better, open-sourced). Now imagine that someone has half a million dollars in their pockets without anything to do (my, I used to know these kinds of people before the recession) and sets up their own grid. They probably won't announce things like "we can now support 400,000 users, come and join us" - but much more likely announce things like sims with more prims per plot for the same price... and people living in the "mainland sims" run by Linden Lab would certainly think twice about "moving out" to the "competing grid". While, at the same time, many of us would run our own private sims connected to the Internet, hosted on lowly Pentium computers with scrapped spare parts.

You see, the marketing side of the "static grid" makes sense for the metaverse. You can have old and new hardware side-by-side. You can offer better performance for a lower price by using alternate hardware. You can use mainframes to run virtual servers - or run Beowulf clusters to achieve similar results for a lower price. And you can still use the "older" hardware from the current grid to run "empty" sims.

Is this solution infinitely scalable? Unfortunately, not really. But let us assume that you can define the size of the region by configuring the server software. So instead of having sims of 65,536 sq. m., you can have sims of only 4,096 sq. m., or even - why not? - just 16 sq. m.! Now imagine you have 15,000 prims and the "25-or-so-avatar limit" on 16 sq. m. You can't physically fit 25 people in that tiny space, so even a club would actually have the combined CPU power of several dozen ("virtual") computers to allow for, say, 200 or 300 people on a 2,048 sq. m. plot. Of course, 16 sq. m. per server is taking things to an extreme limit - since you can just view things on neighbouring sims, this would make a very short "horizon", ie. you would probably just see people in a radius of about 6 meters - not fun! But I think that the concept is valid!

So, what changes need to be done for this to become "reality"?

1) Allow distribution of the central databases - asset server and login/user server - since these will be "global" databases and work in a way very similar to how DNS works today (more precisely, how OpenLDAP works). According to the latest reports, that's exactly what happened with the "new" asset server.

2) Allow a different prim limit per sim. Seems to be trivial, just a configuration parameter somewhere.

3) Allow sims to have non-standard sizes. Currently hard to implement - just think about all the LSL functions relying on 256 x 256 sized sims, or the way terrain maps are implemented. However, "hard to implement" does not mean "impossible". That's perhaps 15 days of work with 15 days of testing for 2-3 programmers. Heh.

So I'm actually combining both Morgaine's and Hiro's (or rather LL's) ideas - use a clustered/mainframe approach with "almost region-less sims" for "continents" (LL-maintained or maintained by external companies), and have "islands on the net" using the current model. Glue it all together with the (currently existing) distributed/clustered approach used for the backend databases (asset and user). And change the "usage patterns" of the residents: want to host a 200/300-people event (or set up a mega-mall without lag)? Well, pick a cluster running small sims. Want to have your quiet and peaceful home somewhere, just for a few chats with your friends? Use a "standard" sim. And, of course, pay your land usage fees accordingly.
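For the record, Gwyneth's numbers do check out; here is the arithmetic made explicit (every input is her figure or recollection, only the multiplication is mine):

    mainframe_cost = 500_000        # USD, one machine
    vms_per_machine = 10_000        # Linux virtual machines hosted on it
    print(mainframe_cost / vms_per_machine)     # 50.0 USD per virtual server

    machines = 8_000_000 // mainframe_cost      # the $8M VC buys 16 machines
    users_per_sim = 40                          # today's rough per-sim ceiling
    print(machines * vms_per_machine * users_per_sim)   # 6,400,000 users

    # And the sim-size variants she mentions are just squares:
    print(256 * 256, 64 * 64, 4 * 4)            # 65536, 4096, 16 sq. m.

Whether a 1999-vintage mainframe VM could actually carry a full sim's physics load is a separate question, of course.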
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
11-05-2004 23:22
Gwyneth, that was a terrific and insightful post, very well analysed and very precise in picking out possible problem areas and addressing them. Rate++

From: Gwyneth Llewelyn
While Morgaine has the Solution with a capital S - abolish regions, run the whole of SL in a small cluster of machines, distribute the load heavily with caches and nifty tricks (very similar to what Akamai does with streaming) - Hiro raised a very important point. This solution works wonderfully well (and scales to zillions of users) if Second Life is something "restricted" to Linden Lab, and the whole metaverse "is" just Second Life and nothing else.

That's entirely true. However, bear in mind that Hiro addressed (perfectly accurately) only the original cacheless proposal, because caches weren't mentioned until later. Yes indeed, the pure dynamic architecture scales well and cheaply only at a single physical site. The scalability suffers a very important slowdown if extended to multi-site operation, not because its order of scalability is any less, but because the constant of scalability worsens as a result of extended RTTs. (A toy model at the end of this post makes that concrete.)

Caches came into the picture here in a rather odd way: not because I particularly felt the need for them to be on the horizon yet, but because Philip's latest weblog entry indicated that he was considering moving zone servers abroad. From the perspective of scalability, this told me that he hadn't yet fully absorbed the fact that his basic static grid wasn't scalable for mobile objects/events, since the move would be compounding his original mistake. The regionalized zone hardware would be even less able to contribute its power to events in the CA grid; it would have to rely on the zone servers in its region alone if it changed to a dynamic architecture (and hence its support for large events would be reduced); and zone handover problems would (presumably) become even worse than now when moving between the foreign zones and CA ones, owing to longer RTTs and international backbone congestion and packet loss over which LL have no control.

As a consequence of this, I brought caches into the dynamic architecture, not to satisfy any current or near-term technical requirement, but as a bit of a situation-saver for Philip in case he had already gone ahead with it. Zone servers abroad will still be usable as part of the million-customer dynamic scalability apparatus if reassigned into caching clusters. That's got to be a worthwhile $$$-saving observation.

That said, and while I am extremely interested in thinking ahead about the issues that caches might introduce and how to overcome them, we're so close to major fire-fighting through non-scalability of the existing grid right now that I'm focussing more on getting that resolved. If the monthly increase in customer numbers is as large as is being suggested, then we're about to hit non-scalability limits pretty much any time now, not in some distant future. In fact, we may already be in the red zone, saved from disaster only because we allegedly creative lot seem to be unable to put together attractive large-scale events that would stress the existing toy system.

From: someone
My point is, it's hard to get both views - the technical issues and the marketing issues - working together. Will this mean that LL will fail because it can't address both issues at the same time?

My experience with people from marketing departments suggests that they could successfully market dead lightbulbs as daytime excess light absorbers if given half a chance.

Seriously though, while marketing considerations can and must influence the choice of one viable option over other viable options, they cannot override technical issues that determine viability. Static assignment of servers doesn't scale for events, and no amount of marketing considerations can override that. Your "reduce zone size" suggestion for dealing with increased event load densities in a static grid doesn't overcome the basic scaling problem, and it introduces new difficulties by increasing the amount of handover traffic. Of course, LL may decide NOT to scale for events, and that would be a perfectly sound business decision, at least in the absence of competition that does better. Philip hasn't given us the word on that yet, and if the word is NO to event scaling, then we won't hear about it either, since I wouldn't expect him to wash dirty linen in public.
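To make the "same order, worse constant" point concrete, here is a toy throughput model (my own simplification, assuming each unit of zone work costs one CPU time-slice t_cpu plus, when the pool spans sites, one cross-site round trip t_rtt):

\[
\mathrm{Throughput}(N) \;\approx\; \frac{N}{t_{\mathrm{cpu}} + t_{\mathrm{rtt}}}
\]

Both deployments scale linearly in the number of servers $N$, but going multi-site divides the constant by $(t_{\mathrm{cpu}} + t_{\mathrm{rtt}})/t_{\mathrm{cpu}}$. With an assumed $t_{\mathrm{cpu}} = 10$ ms and a transatlantic $t_{\mathrm{rtt}} = 150$ ms, that's a 16x penalty per box: same order, much worse constant.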
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
"Old land" hurdle for SL's dynamic scalability.
11-06-2004 07:17
This note follows from Hiro and Gwyneth's lead in considering how SL might grow into a community metaverse. (I get there via distributed caches, but that's for another post.) I'm going to deal with two related issues which weren't raised before, but which could become small stumbling blocks.
First the setting: I'm placing this in the context of a dynamic architecture for SL, on the assumption that SL will either change its architecture to scale up for events, or else it will die when another metaverse system that does scale for events takes the lead from it. The following discussion doesn't apply if the ship is sinking, so I'm not going to go there.
Let's assume that Cory's R&D has come up with a working design so that Philip can start virtualizing the grid, progressively taking two adjacent zones and merging their sims into a cluster that can support the home activity of both of them, plus somewhat less than double the event size of either. Ie. let's assume "we have the technology".
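As a rough capacity model for that merge (my own toy accounting, with made-up symbols, not anything from Cory's R&D): if each box has capacity $C$ and the two zones carry steady home loads $H_1$ and $H_2$, a single zone's event headroom is $E_i = C - H_i$, while the merged two-box cluster offers

\[
E_{\mathrm{merged}} \;\approx\; 2C - (H_1 + H_2) - \delta
\]

where $\delta$ is the clustering overhead. With comparable home loads, that sits just short of twice either zone's headroom, which is the "somewhat less than double" figure above.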
Now comes the first of the two problems, neither of which is technical. As the merging of grid servers progresses, the events that can be held by people owning land in the merged zones can become more and more ambitious: higher av populations, higher prim counts, higher script loads, less zone-handover overhead and associated artifacts. Land owners (especially event holders) on the merged sims will rejoice.
But what about those land owners who have purchased their own sims? Now the benefits of lack of contention for CPU resources turn into a resource peak limitation for them. Compared to land owners on merged zones, their islands can support only puny events. Visitors to their sites will notice the ever greater disparity between mainland capabilities and private sim zones, and sim owners who think in terms of "investment" will see that somewhat intangible metric decrease rather tangibly if they try to sell.
(Disclaimer: many sim owners have no interest in large events nor in investment, and buy whole sims only for the privacy, seclusion, and peace and quiet that stem from large separate plots of land. They would not be affected by the improvements on the mainland, so this does not apply to them.)
So, there is a static land depreciation issue resulting from improved dynamic scaling elsewhere. That is just the beginning though, because there is another even greater effect in operation, mentioned previously in this thread.
The only purpose of tying prim counts to land size was to pay for statically assigned servers. LL can't use a grid server's CPU and bandwidth to assist elsewhere in the static grid, so the owners of the land to which it is assigned must pay its entire cost, even if they use zero prims and are never present in the zone. It's a simple ledger issue: the server doesn't stop depreciating just because nobody is using it. That no longer holds in a dynamic architecture --- every box in an object-serving cluster is permanently busy chipping away at the overall workload. The only cost associated with land empty of prims and avatars is its disk storage cost, and that is minuscule. This is an inherent property of virtualized zones. *Not* reducing land prices to reflect the 10-mile drop in costs through change of technology would be an artificial situation, and one rapidly punished by competitors who accept the change.
So, we have a double whammy for whole-sim owner/investors. Not only is their land obsolete technically, but the bottom has completely fallen out of the price of acreage.
Let me describe slightly philosophically what is really happening here: whole-sim owners have bought into a system of scarcity created quite sensibly but artificially to finance a static architecture, and that investment entirely loses its underlying meaning when the shackles of the static assignment are released. (It's a little like the RIAA and MP3s.)
I think we can safely conclude that many such owners will not be impressed, and that brings to mind an obvious question: will the large amount of money that they pay to the Lindens each month force LL to remain stuck in a static past in order not to disadvantage their whole-sim plots? In part, I'm sure that the answer is "no", because LL can of course swipe the static server from under them and re-interpret the meaning of "your own sim" into anything they like, eg. a minimum guaranteed object service rate.
As I said in my previous note, I have great faith in the abilities of marketing people to solve marketing problems, and this is a problem of marketing. We should expect major regressive advocacy from those with large land investments who have more interest in not losing their dollars than in technical progress, but that's a bit of a side issue. The more far-sighted land owners would (I hope) focus their energies on making LL aware that they will require a different benefit to make up for the dramatic changes that technology will bring.
|
Morgaine Dinova
Active Carbon Unit
Join date: 25 Aug 2004
Posts: 968
|
State of play?
04-19-2005 04:23
It's been a few months since the last post here, and I haven't seen any updates on SL scalability on Philip's blog either. Has anyone seen any other forum or blog threads on the topic? And, has anyone noticed any increase in the number of people being allowed into a zone yet?
The customer base continues to expand at its regular rate, we are told, and the grid keeps getting larger as new sims are added, so the number of people who could potentially be attracted to a popular event keeps rising too. If the number of people that a sim can admit for events is not rising in proportion to that increase, then SL is not scaling for mobile objects like avatars, attachments and vehicles. "Not scaling for events" is probably the simplest description.
The owners of the most popular event sites should know how things stand. I'll try to ask one or two of them.
|