ReleaseEngineering/Development Best Practices
Much of the work of release engineering involves development of a variety of tools and automation. With a lot of tools written by a number of people over a long time period, each one ends up different from the next. This makes it harder for newcomers to work on a tool, and increases the likelihood that a bug or problem fixed in one tool will remain un-fixed in another tool.
This document outlines the best known choices for building a new application. It's not ironclad, and is subject to change by consensus. However, if you find yourself making a decision contrary to what's described here, give some extra thought as to why, and how it may cause you or someone else pain down the road.
Build many, simple-to-understand tools that communicate using well-defined interfaces and have clear dependencies. Define the boundaries to these tools clearly:
- dedicated version control repo
- dedicated package for deployment
- dedicated puppetagain class
- dedicated documentation
Avoid the temptation to toss new projects into the build/tools repository, even if they are small.
Programming Language & Framework
For server-side stuff, use Python 2.7, aiming to be as Python-3 compatible as possible, as we'll switch someday.
- for web services: flask
- for DB interfaces: sqlalchemy
- for an HTTP client: requests
- for messaging: kombu (talking to RabbitMQ)
- for job scheduling: celery
..and the don'ts:
- don't manage daemonization yourself; plan to use supervisord in production
- do not use async libraries like twisted or gevent if you can help it
- target MySQL only, or if you're ambitious, MySQL in prod and SQLite for development, noting that the latter can get you in trouble
Note: What JS libraries do we commonly use?
- For DBs: MySQL
- If you want to use SQLite for development, that's OK, but be aware that you must test thoroughly on MySQL as well - the two are not the same!
- Caching: memcached
- Messaging: rabbitmq
- Workers: celery
Think about resiliency from the beginning, as reliable resiliency is generally embedded in the architecture of the tool, making it hard to change later.
- If you're building a daemon, build it so that it can be restarted at will without any serious ill effects.
- To this end, do not store state in memory, including in the call stack. If you have a long-running process that you don't want to restart during a transient failure, then break that task up, persist intermediate state somewhere, and be able to pick up where you left off. This is often most easily accomplished with an explicit state machine. It's impossible to achieve with a single long-running function, because its progress lives only in local variables and the call stack, and is lost on restart.
- Retry everything: HTTP requests, DB queries, DNS lookups, etc.
- Fail softly: don't let a small failure cause a much larger outage. One way to do this is to build multiple levels of retry. For example, after you've retried that HTTP fetch 10 times, fail the job, but then retry the entire job.
- When talking to external services, support connecting to multiple endpoints and handling endpoint failure gracefully (db, redis, rabbitmq, pypi, etc). This allows the production config to point at several servers without requiring a load balancer in between. Then we can take down one of those servers at a time with zero impact to production.
- Avoid using remote data "live". For example, if your service is based on data from inventory or LDAP, periodically sync the data locally, and use the local cache. When the sync fails, send an email or some other low-priority alert, but know that the service will continue to operate, perhaps with stale data, until the sync succeeds.
- Support resiliency and horizontal scaling by building your app to run in multiple, parallel instances on different hosts.
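The "multiple levels of retry" idea above can be sketched as follows. This is a minimal illustration, not an existing releng library: the names `retry` and `run_job`, the default attempt counts, and the placeholder "work" done on the fetched data are all assumptions made for the example.

```python
import time


def retry(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of retries at this level; let the next level up decide
            time.sleep(delay)


def run_job(fetch, fetch_attempts=10, job_attempts=3, delay=0.0):
    """Inner level retries the fetch; outer level retries the entire job."""
    def job():
        data = retry(fetch, attempts=fetch_attempts, delay=delay)
        return data.upper()  # hypothetical stand-in for the job's real work
    return retry(job, attempts=job_attempts, delay=delay)
```

A fetch that fails 12 times in a row exhausts the inner retries once, fails the first job attempt, and then succeeds during the second job attempt. In a real tool you would likely narrow `exceptions` to the transient errors you expect, and use a non-zero, growing `delay`.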
Talk about your app to as many people as possible, as early as possible. This helps avoid nasty surprises and back-to-the-drawing-board moments, while also increasing awareness of the project. Specifically, talk to:
- potential users
- other relengers
- people from relops
Server-side deployments will either be to dedicated boxes managed by PuppetAgain, or as webapps in the releng cluster.
- Install Python apps as pip-installable tarballs, with semantic versions, rather than from hg repositories
- List dependencies explicitly, and err on the side of requiring the specific versions that you test against
- Use supervisord to run daemons - do not try to daemonize on your own
- Install with Puppet
- Use the Python logging toolkit, and log to stdout by default, where supervisord can pick things up and log appropriately
- Minimize the complexity of the deployment by making simple interfaces. For example, if the app needs a cron task installed, create a simple script using a setuptools entry point that takes no or very few arguments, and gets its config from a config file.
- Plan to run under mod_wsgi in Apache
- Plan to run across multiple webheads with dedicated disk, if possible; shared netapp storage is available too if necessary
- Do not use a complex JS build process.
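As a rough sketch of the logging advice above, the following uses only the standard `logging` module to send everything to stdout, where supervisord can capture it. The function name `setup_logging`, the logger name `myapp`, and the format string are assumptions for illustration.

```python
import logging
import sys


def setup_logging(level=logging.INFO, stream=None):
    """Attach one handler to the root logger that writes to stdout (or `stream`)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.setLevel(level)
    root.addHandler(handler)
    return handler


def main():
    # A function like this could be exposed as a setuptools entry point.
    setup_logging()
    logging.getLogger("myapp").info("starting up")  # "myapp" is a hypothetical name
```

The app itself does no log rotation or file handling; that is left to whatever runs the process.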
All LDAP authentication for web apps should be handled by Apache in the releng cluster. Apache will make the LDAP username -- but not the password -- available to the application.
- Build the source for every application as if others will hack on it, no matter how unlikely you think that is. A detailed README, a guide to the source, instructions for submitting patches (github? bugzilla?), and good comments all make the codebase much more approachable later.
- Clearly document, in the codebase, the security model your tool follows. For example, if all operations are performed with an unprivileged account, future hackers will need to know that lest they perform actions without switching to that userid first.
- For packages with releases (including the python apps that get deployed with puppet), include a changelog describing changes in each version
- Check that relops has created operational documentation for the service, so that other relengers and IT people can jump in quickly when there are problems
- Add a page to ReleaseEngineering/Applications
Be conscious of the security implications of your application, and realize that the best approach to writing secure code is layered security and many eyes. Talk to other relengers, to members of relops, and to the security team during the early phases. This will allow you to identify the security-sensitive bits of your application, and describe the general security architecture. Once you have this established, you'll know what parts of the app need no further security review, where you need to be careful, and probably what *not* to do.
A few general points to get this thinking started:
- Passwords available to your application are assumed to be compromised when the application is. If this would represent a privilege escalation, then your entire application becomes security-sensitive. Also, as employees come and go, the passwords must be rotated, so plan ahead for easy password changes.
- Don't re-use generic SSH keys, e.g., 'cltbld' or 'id_dsa'. Make a purpose-specific key, document it, and if possible limit its capabilities using authorized_keys on the destination host.
- Handle secrets carefully, so that they don't end up checked into repositories, pasted into pastebins or etherpads, or sitting in world-readable logfiles
- The Releng Network is isolated from the Internet and the rest of the company, and parts of it deny all but requested flows. Still, this is only one layer. Consider, too, that we often allow less-trusted individuals onto the network for debugging purposes. You should consider the Releng Network a hostile environment: encrypt, authenticate, resist spoofing, and so on.
- This generally goes without saying at Mozilla, but don't rely on an attacker not knowing something that isn't explicitly handled as a secret.
Systems don't really scale up very well anymore: you can't get substantially faster processors, faster disk, or faster network. So build your application to scale horizontally -- by adding more instances.
This is easy for a webapp, as webapps must be designed from the start to run across multiple webheads. But it's important for a service, too. Multiple hosts, particularly if they are not in the same location, can achieve both scalability and availability.
Multi-threading can help with performance on a single host, but is really only useful in IO-bound situations, where the threading model makes it easy to deal with blocking operations.
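To illustrate the IO-bound case, here is a small sketch using `multiprocessing.pool.ThreadPool`, which is in the standard library for both Python 2.7 and Python 3. The `fetch` function is a hypothetical stand-in for a blocking call such as an HTTP request; the sleep simply simulates waiting on the network.

```python
import time
from multiprocessing.pool import ThreadPool


def fetch(name):
    """Hypothetical blocking IO call; the sleep simulates network latency."""
    time.sleep(0.1)
    return name.upper()


def fetch_all(names, workers=4):
    """Run fetch() for each name on a pool of threads, preserving input order."""
    pool = ThreadPool(workers)
    try:
        return pool.map(fetch, names)
    finally:
        pool.close()
        pool.join()
```

Because the threads spend their time blocked rather than computing, the waits overlap. For CPU-bound work this approach would not be expected to help, which is where adding more instances comes in.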
Applications should read from a single config file, generally in a ConfigParser-compatible format. It's also OK to use YAML. JSON is frowned upon because some parsers are too picky (e.g., about trailing commas).
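A minimal sketch of reading a single ConfigParser-style file, written to work on both Python 2.7 and Python 3. The function name `load_config` and the section and option names in the usage note are assumptions for illustration.

```python
try:
    from configparser import ConfigParser  # Python 3
except ImportError:
    from ConfigParser import SafeConfigParser as ConfigParser  # Python 2.7


def load_config(path):
    """Read one INI-style file and return {section: {option: value}}."""
    parser = ConfigParser()
    # read() returns the list of files it managed to parse; empty means failure.
    if not parser.read(path):
        raise IOError("could not read config file: %s" % path)
    return dict((section, dict(parser.items(section)))
                for section in parser.sections())
```

Given a file containing a `[database]` section with `url = mysql://localhost/myapp`, `load_config(path)["database"]["url"]` returns that URL as a string. All values come back as strings, so numeric options need explicit conversion.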