A brief note on a statistic.

It used to be that the daydream of every programmer was to write the next great Unix shell or the next great text editor. Nowadays it seems to be writing the next great web framework.

But how well do they perform? Consider this little ditty, involving a benchmarking effort over several popular frameworks written in PHP (the same programming language as WordPress). After several rounds of tweaking and tuning, the benchmark author coaxes the fastest framework in his test to spit out 118 pages per second without going to the database.

Scroll up to the top for the important part, though: plain-Jane HTML files, served by the arguably rather top-heavy Apache web server, get pumped out at a rate of 1,327 per second. I’ve never tested the Club Troppo server on this basis, but I expect it could easily serve 7,000-9,000 pages per second on the same terms. Yet I can bring our brutish monster to its knees with 200 simultaneous requests.

This suggests that a lot more needs to be done. Previously I have argued that the classic ‘LAMP stack’ is a tightly coupled monster riddled with duplications and unnecessary functionality. I reckon this example goes some of the way to demonstrating my point that the business of squirting bits and bytes at web browsers is actually very easy — it’s all the moving parts behind the web servers that are making life difficult.
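
If you want a feel for that sort of comparison yourself, a crude harness along the following lines will do the job. It’s an illustrative Python sketch — the URL, request count and concurrency level are made up, and a proper tool like ApacheBench is the real way to do this:

    # Crude concurrent load tester; illustrative sketch only.
    # URL, request count and concurrency level are hypothetical.
    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://localhost/index.html"   # point at a static page, then a dynamic one
    REQUESTS = 2000
    CONCURRENCY = 200                     # the load that floors our server

    def fetch(_):
        # issue one GET and read the whole body
        with urllib.request.urlopen(URL) as response:
            response.read()

    start = time.time()
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        list(pool.map(fetch, range(REQUESTS)))    # run all requests, 200 at a time
    elapsed = time.time() - start
    print("%.0f pages per second" % (REQUESTS / elapsed))

Run it once against a static file and once against a dynamically generated page, and the disparity in the benchmark shows up immediately.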

Next stop: the virtual server.

I am starting to think that virtual servers are the wave of the future, with implications that a lot of people haven’t realised yet. The economics point in that direction: the price of hardware continues to plummet, while the cost of administering a traditional shared-hosting arrangement continues to rise. But with a virtual server those costs are contained, because the host has less to do in terms of preventing customers from interfering with each other’s quality of service. The trendlines will cross at some point, and the virtual server will become cheaper than the shared host.

That’s if the trendlines haven’t already crossed. Dreamhost charge me $7.95 for my account with them; meanwhile I also have Ozblogistan on a virtual server that costs $20/month right now — and the hardware it lives on continues to get cheaper every day.

One of the things that has always held up the LAMP stack’s dominance is the wide installation of Linux, Apache, MySQL and PHP by shared hosts. But when the server is virtual, the host no longer defines the execution environment — that is entirely up to the client. This means that web applications can unshackle themselves from LAMP and adopt any architecture they wish, so long as it can be deployed as a virtual machine.

That’s the direction my own thinking has been taking in my quest to write the Great Australian Blog Engine — tentatively nicknamed ‘Project Wordpreth’. The appeal is that I can eliminate any element I don’t need, and that I can fully specify the operating-system elements and rely on them directly. That allows for a lot of simplifying assumptions — for example, I can better deal with the ‘double buffering problem’.

It’s going to be an interesting couple of years if I’m right.


21 Responses to A brief note on a statistic.

  1. But will it get you laid?

  2. Jacques Chester says:

    It’s getting-laid neutral.

  3. JM says:

    Maybe not. I keep hearing this idea that nerds have finally come into their own as an alternative form of babe magnet to jocks and musicians.

    Unfortunately, I haven’t seen any personal evidence of it yet.

  4. dr. faustus says:

    There seems to be a lot of buzz these days about virtual servers, especially since Microsoft has just released its latest virtual server software (or is about to, or something).

    There are a number of downloadable “virtual server appliances” – pre-cooked virtual server images for running specific programs – and a quick google reveals that there are at least a couple of appliances with LAMP stacks available.

    The only problem with virtual servers is that, while you do get fine-grained control over your environment, they carry a bit of processing overhead (not much these days, especially with hypervisors, but some), and you still have to share your hardware with everyone else’s virtual servers.

    Still, I agree that it’s something that we’re going to see a lot more of in the near future.

  5. David Rubie says:

    What’s interesting about those benchmarks is that a desktop PC with specs roughly equivalent to a 2001-era machine came close to matching a much higher-performance box. Disk performance is the major bottleneck, from what I can tell of the article. That means issues with I/O throughput that are common to x86 machines – physically getting stuff off the disk into memory.

    If that’s the case, dumping 50 virtual servers onto a machine grinding away on the same hard disks is going to be a total waste of time – you’d need to make sure your databases and scripts are distributed across different spindles to ensure the I/O wait times don’t tie up your machine – clearly the CPU and network performance are not where the problems are. The only real issue I have with server virtualisation is the sheer number of copies of the heavyweight operating system you need to have loaded up – such a waste.

    You could mitigate it somewhat by putting many tens of gigabytes of physical memory into the machine and caching the important stuff (and Apache can do that if you muck around with it), but MySQL needs an awful lot of fiddling to do things that Oracle and SQL Server do automatically (if not 100% optimally). MySQL and PHP are shaping up as unscalable junk – fine for hobby projects, useless for actual work. I’d also bet that some of those frameworks’ performance would improve immensely with a different database underneath them.

  6. Jacques Chester says:

    I think a good analogy for what I’m talking about as a trend is the way AMD and Intel moved their microarchitectures to embrace RISC internals while maintaining a CISC facade.

    At first it was a bold step to give up so much die space to decoding x86 instructions just to reissue internal micro-ops. But as time goes on, the number of transistors given to this task remains about constant while the total budget increases due to Moore’s Law. In the long run it’s an obvious move.

    I think the same is true of virtualisation. The performance overhead will continue to shrink as a proportion of underlying system performance because it will remain about constant. But what’s nicest about VMs is that the cost of administering them is much lower, because the administrator only deals with opaque, high-level units that are easy to constrain and schedule.

    In general most VM hosts use the same sorts of servers as shared hosts — multi-socket, multi-core, oodles of RAM, fast RAID and so on. The difference is that it takes fewer administrators per customer, because it is easier to automate the management and deployment of VMs. It’s also easier to load-balance, back up and so on than in the classic shared-hosting scenario.

    So like I say, it’s a matter not so much of the current state of things, but the way in which the trendlines are heading.

  7. Jacques Chester says:

    Also, David, it depends a lot on the exact virtualisation strategy. Emulation carries more overhead than ‘classic’ virtualisation, which carries more than paravirtualisation, which carries more than kernel-level virtualisation.

    There are some nice kernel-level virtualisation projects for Linux, and Solaris has its Zones and Containers. In those cases only one actual kernel runs on the machine; it’s the operating environment, rather than the hardware, that is virtualised.

    And of course there’s hardware-level virtualisation like you get with IBM LPARs and Sun LDoms, but that’s not really what I’m talking about.

  8. JM says:

    David (5) “MySQL and PHP are shaping up as unscalable junk”

    I wouldn’t be so sure about that – a couple of old friends of mine worked for Yahoo for about two years, and they tell me that Yahoo use MySQL and PHP (but recode the heavy-use PHP in C++ when it needs optimizing).

    I also had some benchmarking done about 3 years ago on MySQL and was surprised to find that it came close to rivalling an in-memory database (albeit the application didn’t involve huge amounts of data).

    However, this was with transactional support turned off – which is what Yahoo do. If you need transactions, though, I agree that MySQL slows right down, and doesn’t do it all that well either.

  9. Jacques Chester says:

    I suppose it depends on what is meant by scalability.

    There’s the brute reality of how many concurrent people can be served. In that respect PHP and MySQL can be made to do it.

    But there are also questions of how efficiently and easily this can be done. In this respect PHP and MySQL are a real hassle — you need to do a lot to make them scale. It’s not “out of the box”.

    So if scalability means “very large firm got it to work”, then yes. If it means “happens even semi-automatically”, then hell no.

  10. David says:

    Jacques – your comment about AMD and Intel moving over to RISC under the bonnet is interesting. The last time I paid any attention to chip architecture (about 10 years ago) the chip wars were said to have been won by CISC. I’ll have to read up on it.

  11. Jacques Chester says:

    They were won by volume, really, more than anything. The insight of RISC was that when the instruction set essentially constrains the design, RISC is better because you can optimise the common case.

    What Intel and AMD did was to decouple the instruction set from the microarchitecture. From then on it was down to economics.

    What’s kept the remaining RISC architectures alive has been twofold: the rise of embedded processors, where the chip budget can’t justify the decode-reissue hardware; and private third-party manufacture by firms like TSMC and IBM. So in fact MIPS, ARM, Power, PowerPC and SPARC are still around; though Alpha and PA-RISC died due to Intel press releases about the Itanium.

    My advice is that you should head over to Ars Technica and read everything by Jon Stokes. He’s a brilliant writer and very good at explaining what changes between generations and why. I studied chip architecture at a high level for computer science, but I understood it better thanks to Stokes.

  12. Stephen says:

    David: Why do PHP and MySQL still run an awfully large number of websites across the Internet?

    Because they’re simple to understand, easy to set up and they work.

    The Internet is almost entirely “niche” — filled with sites only of interest to a comparatively tiny minority. For this purpose PHP and MySQL are perfect.

    Read Tim Bray’s discussion of the 80/20 point for another take on why imperfect technologies succeed.

    I’ll take the “unscalable junk” of PHP over JSP for my next niche project any day, thank you very much.

  13. Jacques Chester says:

    They run on a lot of sites because MySQL was fast and PHP was easy back in 2001. Subsequently every $10 shared host installed them and here we are today.

    Never mind the formidable Tim Bray; the original essays to consult are the ‘Worse is Better’ writings by Richard Gabriel.

    I agree that most sites are ‘niche’ — the long tail. But scalability matters because nowadays massive traffic is one front page away. Dugg, Slashdotted, Reddited … now any and every site could be facing a million visitors without warning. And in that situation PHP and MySQL are not much help without a lot of tuning and caching and pissfarting about.
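
    To give a flavour of the tuning involved: the first thing a flash crowd forces on you is a whole-page cache, something like the Python sketch below. It’s illustrative only — in practice you’d reach for memcached or a caching proxy such as Squid:

        # Minimal whole-page cache: serve rendered HTML from memory and only
        # touch the expensive backend once per TTL. Illustrative sketch only.
        import time

        CACHE = {}           # path -> (expiry timestamp, rendered HTML)
        TTL_SECONDS = 60     # a flash crowd can live with minute-old pages

        def render(path):
            # stand-in for the expensive template and database work
            return "<html>expensively generated page for %s</html>" % path

        def serve(path):
            now = time.time()
            hit = CACHE.get(path)
            if hit and hit[0] > now:
                return hit[1]            # cache hit: no database touched
            html = render(path)          # cache miss: pay the full cost once
            CACHE[path] = (now + TTL_SECONDS, html)
            return html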

  14. David Rubie says:

    Stephen wrote:

    I’ll take the “unscalable junk” of PHP over JSP for my next niche project any day, thank you very much.

    No skin off my nose, Stephen, but most people I know who disdain PHP and MySQL don’t program for a hobby, they do it for a living, and fighting your tools is a sure-fire way of losing money on a contract. I don’t want to explain to a customer that my tools are “very cool” but their program doesn’t work.

    I have no doubt that for small, simple projects, they work OK, (and I’ve used them for that myself) but if you’ve got some monster project that has to re-use your code in a big back-end batch or spit it up onto a web page, I’ll take Java or .NET and I’ll be finished with the cash in my pocket before you. It’s not a technical decision, it’s a financial one.

  15. Andrew Reynolds says:

    Jacques,

    I would have thought that a lot of the disk thrash will be taken out over the next few years as solid-state HDDs become more common. They are already bigger (for an equivalent price) than the 160 or 320 SCSI HDDs were about 6-7 years ago, and much, much faster on random seeks.

    Once the HDD bottleneck is reduced in this way, I would have thought the architecture again becomes less relevant – ease of use comes back to the fore.

  16. Jacques Chester says:

    Andrew;

    Quite possible. It comes down to a lot of factors including the way the file and virtual memory subsystems work.

    My argument is still from the trend and not from the current situation, if that makes sense. I think I will need to follow this post up with one with diagrams.

  17. David says:

    Thanks for the suggestion/reference, Jacques.

    I’m aware, of course, that RISC-based machines are still around – the application I’m responsible for has its back end on a Sun box – and I am also aware of the theoretical superiority. It’s just, as you pointed out, the economics that marginalised RISC.

  18. Jacques Chester says:

    David — no worries. I’d be interested in your feedback on my latest post.

  19. Dave Roberts says:

    Jacques, you’re probably right here, but don’t fall into the trap of comparing performance numbers in the abstract. Saying that raw Apache can do more than 1,000 requests per second while PHP only delivers in the hundreds is an interesting factoid that could be used to plan a web site, but it’s hardly the basis for spitting on PHP. Let me say that I’m no PHP fanboy and I won’t argue with it if you don’t like it, but I’m a realist when it comes to technology.

    Here are some other factoids. I have had a corporate web site Slashdotted three times where I was able to monitor the load on the server. The server was a low-end dual-core machine with a bunch of pages created in PHP. At the highest load during the Slashdotting, we’d find something like 30 requests per second hitting the machine. On a typical day, we’d average about 5 requests per second. We found that even with that load, our server was chugging along with only 15% CPU utilization. Obviously, there were other bottlenecks. The first time we got Slashdotted, we held up under the load, but the page load times did get pretty long. We figured out that we were swapping and boosted the RAM. On the latest incident, because we offer a free download of our software that runs to more than 100 MB, our 10 Mbps colo pipe was pegged.

    The takeaways here are:

    1. Internet traffic distribution has a very long tail. There are a few mega-sites that have huge sustained request loads (e.g. Yahoo, Google, MSN, Amazon, etc.). But once you get past the few thousand sites that run a sustained load of more than 100 requests per second, there are tens of thousands of sites that run at between 1 and 100, and then there are millions of sites that run at less than one request per second. Heck, I’ve heard a statistic that most blogs have a readership of less than one person. To put some numbers around this, our corporate site is actually ranked in the top 100,000 sites on Alexa and we do about 5 requests per second. That’s the reality for the majority of sites out there.

    2. Given the low load seen by most sites, PHP will handle it fine. The best flash-crowd that an average site could hope for is a good Slashdotting or “Reddit-dotting”. That’s it.

    So, if you want to write the Great Australian Blog Engine and do it in another, non-LAMP language, running on a VPS, that’s cool. I have thought about doing the same thing myself (Great American Blog Engine, though ;-) ). But do it knowing that you’re really doing it to feed your own interests and quest for knowledge (both still noble motivations, IMO), not because PHP can’t handle the load for serving up most web sites.

  20. Dave Roberts says:

    Whoops, forgot my 3rd takeaway:

    3. You’ll often bottleneck something else in the system before you’ll find that PHP is your bottleneck. If I was running a PHP-based site (which I actually do), I’d check my RAM, check my disk speed, check the database, and check the Internet pipe before I’d actually check PHP itself. The reality is, most of the time the bottlenecks are in those other places, and they’d still be there even if you used another language to write the code. Put another way, if you have bad programmers and bad sysadmins, you’re screwed no matter what language you choose and whether you run on a VPS, a shared host, etc.

  21. Jacques Chester says:

    Dave;

    Ta for the remarks. My hate for PHP is mostly informed by programming in it. But I look at the raw performance of Apache vs PHP and wonder if there isn’t some kind of architectural flaw at work here — ditto going out to MySQL.

    Static HTML has its advantages. I’ve been musing about whether the gap between the Movable Type and WordPress approaches (static pre-generation vs dynamic on-the-fly) could be closed with a long-running task that runs in parallel to the web server, rebuilding pages out of a prioritised queue — pages with more traffic promoted, and so forth.
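
    In rough Python, the idea might look like the sketch below. It’s a sketch only — record_hit() would be fed by the web server or a log tailer, and rebuild_page() is a hypothetical stand-in for regenerating a page as static HTML:

        # Background rebuilder working off a prioritised queue:
        # pages with more traffic get regenerated first.
        import heapq
        import threading
        import time

        lock = threading.Lock()
        hits = {}     # page path -> traffic since its last rebuild
        queue = []    # heap of (-hits, page); busiest pages pop first.
                      # Stale duplicate entries mean the occasional
                      # redundant rebuild, which is fine for a sketch.

        def record_hit(page):
            # called on every request; promotes busy pages up the queue
            with lock:
                hits[page] = hits.get(page, 0) + 1
                heapq.heappush(queue, (-hits[page], page))

        def rebuild_page(page):
            # hypothetical: re-render the page, write static HTML to disk
            print("rebuilding", page)

        def rebuild_worker():
            while True:
                with lock:
                    entry = heapq.heappop(queue) if queue else None
                    if entry:
                        hits[entry[1]] = 0
                if entry is None:
                    time.sleep(0.1)   # idle; a Condition would be tidier
                else:
                    rebuild_page(entry[1])

        threading.Thread(target=rebuild_worker, daemon=True).start()

    The web server keeps serving whatever static copy already exists, so a rebuild never blocks a visitor’s request.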

    All part of my GABE/Project Wordpreth thinking :)
