Monday, 19 April 2010

Web Application Architecture - do I really need Apache?

Following on from my earlier post on web application architecture, there's another aspect of the common pattern I'm curious about - do I really need Apache?


I had a frustrating time recently performance-tuning an application. This was partly because we were shooting at a moving target (performance test servers on a shared virtual environment, where resources may be claimed by another server at any moment, are not a great idea...), and partly because I still need to improve my knowledge of Apache, Jetty, Linux and performance tuning methodologies. However, I did feel that having Apache in the mix added an extra layer of variables, measurement and configuration complexity. Matching the sockets and open files available on the operating system to the number of Apache workers and the number of Jetty connections, while bearing in mind cache hits and misses at the Apache level, all got quite complicated.

It made me question why we need Apache. It serves three purposes in our architecture.
  1. Block external access to certain URLs
  2. Put in place redirects when we, or systems we depend on, have problems
  3. Act as our HTTP cache
1 & 2 we could easily do via filters at the application server layer - indeed we're moving that way anyway, as it's then trivial to implement admin pages to toggle these settings and it's also considerably easier to functionally test them.
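
As a rough sketch of the kind of logic such a filter might delegate to, here is a small rule holder in plain Java. The class name `AccessRules` and all of its methods are hypothetical, invented for illustration; prefix matching is an assumption too, and a real application might prefer patterns.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical rule holder; a servlet filter would consult it for each request. */
final class AccessRules {
    // Path prefixes that external callers must not reach; can be toggled at runtime.
    private final Set<String> blockedPrefixes = ConcurrentHashMap.newKeySet();
    // Path prefix -> redirect target, switched on when we or a dependency have problems.
    private final Map<String, String> redirects = new ConcurrentHashMap<>();

    void block(String prefix) { blockedPrefixes.add(prefix); }
    void unblock(String prefix) { blockedPrefixes.remove(prefix); }
    void redirect(String prefix, String target) { redirects.put(prefix, target); }
    void clearRedirect(String prefix) { redirects.remove(prefix); }

    /** True if the path starts with any currently blocked prefix. */
    boolean isBlocked(String path) {
        return blockedPrefixes.stream().anyMatch(path::startsWith);
    }

    /** The redirect target for the first matching prefix, if any is active. */
    Optional<String> redirectFor(String path) {
        return redirects.entrySet().stream()
                .filter(e -> path.startsWith(e.getKey()))
                .map(Map.Entry::getValue)
                .findFirst();
    }
}
```

Inside a filter's `doFilter` the flow would presumably be: if `isBlocked` is true, call `sendError(403)` on the response; otherwise, if `redirectFor` returns a target, call `sendRedirect`; otherwise pass the request down the chain. Because the rules are just objects in memory, an admin page can flip them and a functional test can assert on them directly.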

Which leaves the HTTP cache. I've been reading up on HTTP caching recently, particularly in an excellent book called "High Performance Web Sites". It got me thinking, and it seemed to me that it wouldn't be that hard to implement an internal HTTP cache in a servlet filter. A quick Google search revealed that, unsurprisingly, I wasn't the first to think of this: it is the subject of an O'Reilly article, and Ehcache has something along those lines.
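
To make the idea concrete, here is a minimal sketch of the store that might sit behind such a filter: response bodies kept in memory with a short time-to-live. This is not how Ehcache or the O'Reilly article does it; `ShortLivedCache`, the injectable clock and the loader callback are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical in-memory store for response bodies with a fixed time-to-live. */
final class ShortLivedCache {
    private static final class Entry {
        final byte[] body;
        final long expiresAtMillis;
        Entry(byte[] body, long expiresAtMillis) {
            this.body = body;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so tests can control time

    ShortLivedCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /**
     * Returns the cached body for the key, calling the loader only when the
     * entry is missing or expired. compute() is atomic per key, so concurrent
     * requests for the same key trigger a single load.
     */
    byte[] get(String key, Supplier<byte[]> loader) {
        long now = clock.getAsLong();
        Entry entry = entries.compute(key, (k, old) ->
                (old != null && old.expiresAtMillis > now)
                        ? old
                        : new Entry(loader.get(), now + ttlMillis));
        return entry.body;
    }
}
```

In production the clock would be `System::currentTimeMillis` and the loader would run the rest of the filter chain while capturing the response. The sketch ignores eviction, memory limits and HTTP cache headers, all of which a real implementation would need.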

The advantages I see to using such a cache are fivefold:
  1. You get the reduced complexity of taking a layer out of your architecture. You no longer need to worry about how many connections & threads both Apache and the application server need - there's just the one to get right.
  2. You get to escape from Apache's rather arcane configuration (or is it just me who winces in distaste whenever delving into httpd.conf?)
  3. You can easily get a distributed cache. Astonishingly, Apache doesn't seem to have an off-the-shelf distributed implementation of mod_cache, which has been an issue for us: we get requests for rapidly changing data, which we really need to cache for short periods of time to protect ourselves from a user perpetually refreshing.
  4. The cache will be shared across all threads in the application server, regardless of how you choose to store it. Did you know that mod_mem_cache is a per-process cache if you run Apache on a process-per-request basis (as RedHat does out of the box)? So the cache for a resource has to be loaded separately in every single process. Worse, by default those processes are configured to die when idle, or after a set period of time even if not idle, to guard against memory leaks. So that resource you thought was being cached eternally is actually getting requested rather a lot. Apache recommend using mod_disk_cache instead, where the cache is shared across all processes, and relying on the operating system to cache it in memory to speed things up, but my measurements showed a drastic drop-off in performance when we tested this. I may of course simply have got something wrong.
  5. A personal one - as a Java dev I'm a lot happier digging into Ehcache's codebase to work out what it's actually up to than I would be in Apache's.
It would be quite reasonable to read that list and snort that perhaps I need to improve my Apache skills - and perhaps that is indeed the case. However there's a level at which that's my point; if I need to read up and gain painful experience in order to know how to bend Apache to my will and really know what it's up to, isn't it reasonable to look for an alternative that I will understand better and faster?

So what are the negatives? Well, I've seen it suggested that Apache, being native code, is faster at shoving binary data to a socket than a JVM is. On the same basis I guess that if you are using a disk based store for your cache Apache may be more efficient at reading data from the disk. I've no proof for either of those theories, though, and it may equally be that modern JVMs are pretty competitive at these things - you'd think they'd be the sort of thing Sun had been optimising like crazy all these years. More reading and/or testing needed.

Presumably on the same basis, Apache has a reputation for serving static content much faster than an application server. (This doesn't really apply to my current project, as we use JAWR for much-needed efficiency savings, which means the application server has to serve our static content the first time; hopefully thereafter it's being served from a cache anyway.) I seem to remember being told, however, that modern servlet containers are actually quite competitive on this score now (a good wikipedian would add a nasty little [citation needed] or [where?] to that sentence!).

There may also, I guess, be security issues with running a servlet container or application server on port 80; Apache has had the benefit of being the market leader, and hence the target of so many hackers that you'd hope most of the holes have been stopped up. Then again, there may be an element of security through obscurity in stepping away from Apache.

I'd certainly be interested in experimenting with dropping Apache, running the servlet container on port 80 and using Ehcache or similar to do caching & gzipping down at the Java layer. Perhaps if I do I will find out why no-one else does!
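
For the gzipping half of that experiment, the JDK already has what's needed. Below is a minimal sketch, assuming the approach of compressing a response body once and keeping the compressed bytes in the cache; the class and method names (`GzipBodies`, `gzip`, `gunzip`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helpers for storing response bodies in compressed form. */
final class GzipBodies {
    private GzipBodies() {}

    /** Compresses a body once so the cached bytes can be served as-is. */
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the gzip stream above has already written the trailer.
        return out.toByteArray();
    }

    /** Decompresses a cached body for clients that do not accept gzip. */
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gz.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

A filter using this would send the stored bytes with a `Content-Encoding: gzip` header when the request's `Accept-Encoding` allows it, and fall back to `gunzip` otherwise.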

10 comments:

  1. Robert

    Ehcache web caching has been in production use on some of Australia's busiest and highest performing e-commerce sites for the last 6 years. We recently released version 2.01. On the sites I have architected I have not had Apache in front for years.

    Ehcache web has amazing performance. The responses are gzipped and stored as byte[] in the cache. One big advantage of servlet filter approaches is that they can also do fragments. And both whole pages and fragments can calculate the precise key they want. For example, affiliate IDs often get added to a URL, which would cause a cache miss. Using calculateKey() you can include only the parts of the URL that are relevant.

    Apache in front is an article of faith among sys admins. But Java app servers simply don't have the same buffer overflow vulnerabilities that a C based server does.

    I think there is a place for an IPS in front, but this tends to be in front of the entire hosting installation. For static content serving, you really want to use a CDN. These are getting very inexpensive.

  2. Apache is often used in front of app servers to serve the static content, leaving the app servers to do all the dynamic stuff.

    That said, you could just set up caching on the app servers, which has other benefits as Greg has listed, especially caching of fragments. Then the box running Apache can be added as another app server :)

  3. I posed the same question to myself several years ago when architecting a Java web app, and ultimately decided *not* to front our servlet container (Caucho Resin) with Apache. There was never a single instance where we wished we'd had Apache. It worked great, and really simplified things when we needed to tune for higher volumes.

    Another reason you might want to use Apache in front of your servlet container is if you need to do HTTPS. As far as I know, Apache is still much better and faster at this than doing it directly in the servlet container.

    We terminated the SSL on our BIG-IPs, so that wasn't an issue for us, but something to keep in mind if your HTTPS connections go all the way to your web servers.

  4. @Greg Luck & @Alex - thanks, very interesting to hear people are doing this successfully. Makes me keen to try the switch. A CDN would make sense for our static content.

    @Alex - good point about SSL. Our load balancer takes care of that, so we're back in unencrypted http by the time Apache & Jetty are involved, but it would certainly be a good reason to have Apache if this were not the case.

  5. Question: how do you get the application server listening on port 80 without having Apache sitting in front? Do you download Tomcat, disable the HTTP server, and modify the app instance to listen on port 80?

  6. Web servers such as nginx or lighttpd are leaps and bounds ahead of Apache when used as a reverse proxy and to serve static content. They have very lightweight memory footprints and have been used reliably on several high-throughput websites.

    Consider this - an nginx worker process has a 3 MB footprint, whereas an httpd (Apache) process has a 60 MB footprint. Unless I need to serve dynamic content using Perl or PHP, I would not even look at Apache.

    For web caching, consider Squid or Varnish. A typical deployment would look like this:

    LBR -> Varnish Cache -> nginx (reverse proxy) -> Jetty (or Tomcat)

  7. @Andy - yes, turn off or uninstall Apache. Then you've basically got two options - configure your app server to listen to port 80 and run it as root, or run another bit of software that forwards all traffic on port 80 to your app server. Here's how Jetty recommend doing it - I'm sure there are similar guides for Tomcat etc. http://docs.codehaus.org/display/JETTY/port80

  8. @Anonymous - I guess the title is a little misleading, I'm really interested in not having a separate cache layer or reverse proxy, whether Apache or otherwise, and doing it all in the application server.

  9. The most valuable reason to use Apache (IMHO) is performance. Apache spawns HTTP threads better than Tomcat. We ran performance/load tests and saw a drastic difference in speed between pure Tomcat and Apache+Tomcat.

  10. Apache is the Swiss Army knife of web servers.

    You can use it as a front end for one or several applications, you can use it for load balancing, caching but also for gzip compression ...

    It's free, it's standard, it's stable and mature...
