Scalability and Performance
This document is a short summary of our infrastructure and software for developing high-performance web apps. When a new project is being considered or written, it should be planned with the following in mind so it can scale well.
Note: Zeus is the replacement for the Netscaler.
One of the best ways to lower the load on a server is to reduce its traffic, which is essentially the goal of the Netscaler. The Netscaler sits in front of the webservers and provides:
- SSL offloading: SSL connections terminate at the Netscaler (which has hardware dedicated to setting up SSL connections) instead of at the webservers. Connections to the webservers are always over plain http. If you're writing code that expects https, realize that the webheads never see the SSL connection.
- Content caching: outbound content (webhead -> internet) is cached according to rules on the Netscaler. To get the current set of rules, you'll need to talk to IT. The default set of rules is on page 336 of the 2nd install guide.
- Monitoring and balancing of the webservers: webservers that respond faster are given more traffic. If a webserver becomes unresponsive, the Netscaler stops sending requests to it until it comes back to life.
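Because the webheads only ever see plain http, code that needs to know whether the original request was secure has to look at something the load balancer adds. A minimal sketch, assuming the balancer sets an `X-Forwarded-Proto` header (a common convention, not confirmed for our Netscaler setup; check with IT for the header it actually sets):

```python
# Sketch: deciding whether the original client connection used SSL when the
# app only ever sees plain http from the load balancer. The header name
# "X-Forwarded-Proto" is an assumption, not our confirmed Netscaler config.

def request_was_ssl(headers):
    """headers: dict of HTTP request headers as seen by the webhead."""
    # The webhead never terminates SSL itself, so the request scheme is
    # always "http"; trust the proxy-supplied header instead.
    return headers.get("X-Forwarded-Proto", "http").lower() == "https"

print(request_was_ssl({"X-Forwarded-Proto": "https"}))  # True
print(request_was_ssl({}))                              # False
```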
Production Webserver Cluster
Behind the Netscaler is a cluster of 12 webservers (+/- a few at any given time). Most sites are spread across one or more webservers. When developing code, realize that requests can go to any of the webservers at any time, so it's important to make your code independent of the specific server (use the db for sessions, etc.).
- Load graphs (ldap login) for the webheads are available
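Since any webhead can receive any request, per-server state like file-based sessions breaks. A minimal sketch of db-backed sessions, using sqlite3 as a stand-in for MySQL (the table and function names are made up for illustration):

```python
# Sketch of db-backed sessions so any webhead can serve any request.
# sqlite3 stands in for MySQL here; table/column names are hypothetical.
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, data TEXT)")

def save_session(data):
    sid = uuid.uuid4().hex
    db.execute("INSERT INTO sessions VALUES (?, ?)", (sid, data))
    return sid  # handed back to the user in a cookie

def load_session(sid):
    # Works no matter which webhead handles the request, unlike
    # file- or memory-based sessions tied to one machine.
    row = db.execute("SELECT data FROM sessions WHERE id = ?",
                     (sid,)).fetchone()
    return row[0] if row else None

sid = save_session("user=42")
print(load_session(sid))  # user=42
```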
There are 4 (+/- a couple) database servers running MySQL.
- A read-only slave is available, and is generally only a few seconds (under 10) behind the master.
- When doing large batch jobs, or expensive queries that will lock db tables, it's best to use the read-only slave so the master can keep working.
- Load graphs (ldap login) for the db servers are available
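One way to follow the slave rule above is to pick the connection by query type. A hypothetical sketch (the hostnames are placeholders, not our real servers):

```python
# Sketch: route long-running, read-only work to the slave so the master
# stays free for writes. Hostnames below are placeholders.

MASTER = "db-master.example.internal"
SLAVE = "db-slave.example.internal"   # may lag the master by a few seconds

def pick_host(sql, batch=False):
    is_read = sql.lstrip().upper().startswith("SELECT")
    # Batch jobs and expensive table-locking reads go to the slave;
    # anything that writes must go to the master.
    if is_read and batch:
        return SLAVE
    return MASTER

print(pick_host("SELECT * FROM addons", batch=True))   # -> the slave host
print(pick_host("UPDATE addons SET x=1", batch=True))  # -> the master host
```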
Staging servers are physical (non-virtual) machines, but they are not clustered.
- Sites can be set up to run `svn up` on themselves via cron jobs, so changes pushed to a tag show up automatically.
Currently, development is done on standalone virtual machines running all the software (mysql, apache, etc.). Generally, development servers are only reachable from inside the VPN.
Reasonably efficient code should always be the first step when considering performance (don't put queries in loops, etc.). Due to the large amount of traffic we get, we also need to supplement our code with caching software.
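The classic queries-in-loops mistake is fetching one row per item instead of one query for all items. A small illustration, using sqlite3 as a stand-in for MySQL:

```python
# "Don't put queries in loops": one round-trip per user (the N+1 pattern)
# vs a single query fetching everything at once. sqlite3 stands in for MySQL.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE emails (user_id INTEGER, email TEXT)")
db.executemany("INSERT INTO emails VALUES (?, ?)",
               [(1, "a@example.com"), (2, "b@example.com")])

user_ids = [1, 2]

# Bad: one database round-trip per user.
slow = [db.execute("SELECT email FROM emails WHERE user_id = ?",
                   (uid,)).fetchone()[0] for uid in user_ids]

# Better: a single query for all users.
marks = ",".join("?" * len(user_ids))
fast = [row[0] for row in db.execute(
    f"SELECT email FROM emails WHERE user_id IN ({marks}) ORDER BY user_id",
    user_ids)]

assert slow == fast  # same results, far fewer round-trips
```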
This is a PHP accelerator (it caches compiled code).
- We've had issues with segfaults when it was enabled. Currently none of our projects use this product.
This is a PHP accelerator that compiles the PHP and stores the result; when the page is requested again, it can skip recompiling and serve what it already has.
- AMOv3 is using this, and we're relatively happy with it. We had some strange issues with Apache segfaulting (bug 375300), but after removing a file from eAccelerator's cache they stopped. The file was doing nothing special, and it's still a mystery why it caused segfaults.
This has been a great app for us, since it's simple and effective. Any data that is accessed often and can be hashed is a good candidate for memcache.
- In AMOv2 we stored the complete page output in memcache, using the URL as the key.
- In AMOv3 (remora) we're storing db query results, using the query as the key.
- Additional Memcache Info
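The remora approach of keying the cache on the query is a cache-aside pattern. A sketch with a plain dict standing in for memcached (a real version would use a memcached client and set an expiry time):

```python
# Cache-aside sketch of the remora approach: key the cache on the SQL text
# and only hit the database on a miss. A dict stands in for memcached here.
import hashlib

cache = {}
db_hits = 0

def run_query(sql):
    global db_hits
    db_hits += 1
    return f"rows-for:{sql}"  # pretend database result

def cached_query(sql):
    # Hash the query so the key stays short (memcached limits key length).
    key = hashlib.md5(sql.encode()).hexdigest()
    if key not in cache:
        cache[key] = run_query(sql)
    return cache[key]

cached_query("SELECT * FROM addons")
cached_query("SELECT * FROM addons")
print(db_hits)  # 1 -- the second call was served from cache
```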
If you're using a lot of CPU on the webservers, profiling the code is a great way to find the bottlenecks. You should get similar profiles no matter what machine you run on, so the development machines are fine.
- They provide documentation on using it
I'm adding links to the two most popular PHP profiling tools because I've had mixed results with both; if one is giving you segfaults, try the other. Both generate files that can be read by KCacheGrind, a good tool for visualizing the data, and both have command-line utilities as well.
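The workflow with either tool is the same: run the code under a profiler, then examine per-function timings. For illustration only, here is the equivalent in Python using the built-in cProfile (the PHP tools produce the same kind of per-function data, which KCacheGrind can visualize):

```python
# Profiling workflow illustrated with Python's built-in cProfile.
import cProfile
import io
import pstats

def slow_concat(n):
    s = ""
    for i in range(n):
        s += str(i)   # repeated string building -- a typical hotspot
    return s

pr = cProfile.Profile()
pr.enable()
slow_concat(10000)
pr.disable()

# Sort by cumulative time to see which functions dominate the run.
out = io.StringIO()
pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(5)
print(next(line for line in out.getvalue().splitlines()
           if "function calls" in line))
```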
Once a site is written, it's a good idea to load test it to get an idea of how many hits per second it can handle. With all of these programs, keep the infrastructure you're testing on in mind: testing from one machine against one server will give you an idea of what your code can do, but it's definitely not the same as a cluster of machines. The same goes for server -> database connections.
A pretty basic benchmarking program. It works, but it's important to realize that it isn't distributed, so you might be maxing out the source machine instead of the server.
Distributed benchmarking is a good idea, but we couldn't get this to work as advertised. Overall disappointing; maybe revisit when it matures.
This is a Python script that oremj wrote. Given a log, it will replay the hits. This gives you the advantage of replaying an actual set of requests across multiple pages, which means a realistic distribution of hits. It's mostly useful once people have actually been using the site.
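The idea behind a replay script can be sketched in a few lines: pull the request paths out of an access log and fire them at a target host. This is not oremj's actual script, just an illustration under stated assumptions (the hostname is a placeholder, and a real replayer would issue the requests and ideally preserve the original timing):

```python
# Minimal sketch of access-log replay: extract GET/HEAD paths from
# common-log-format lines and build the URLs to hit. A real replayer
# would issue these requests against the target server.
import re

LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/')

def replay_urls(log_lines, host="http://staging.example.internal"):
    urls = []
    for line in log_lines:
        m = LOG_LINE.search(line)
        if m:
            urls.append(host + m.group(1))  # only safe, re-playable methods
    return urls

sample = ['1.2.3.4 - - [x] "GET /addon/1 HTTP/1.1" 200 512',
          '1.2.3.4 - - [x] "POST /login HTTP/1.1" 302 0']
print(replay_urls(sample))  # only the GET line is replayed
```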