ReleaseEngineering/Funsize/Troubleshooting

From MozillaWiki

Deployment and Troubleshooting

Senbonzakura/Funsize is still a fairly new application and there are probably a lot of kinks in it.
If you're changing things around or deploying it somewhere, sooner or later you'll most likely run into errors and leave the application stuck in an unusable state. This document lists how to get out of that state.

Important Notes

If you're dealing with this application when it's deployed via Docker (locally or on Elastic Beanstalk), you will probably only have shell access to the host, not to the container. This means you cannot SSH into the container itself.

To make it easier to extract the cache, database and logs, a folder from the host is mounted within the container. Please take a look at the dockerrun.aws.json file for the exact locations. Typically the relevant host folder is /var/funsize.
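To see which host folders are mounted where without reading the whole file, the volume entries can be pulled out with grep. This is only a sketch: it assumes the file uses the single-container layout in which each volume has "HostDirectory" and "ContainerDirectory" keys, and the name list_mounts is made up for illustration.

```shell
# list_mounts is a hypothetical helper, not part of the Funsize repository.
# Assumes each volume entry has "HostDirectory" and "ContainerDirectory" keys.
list_mounts() {
    file="${1:-dockerrun.aws.json}"
    # Print only the key/value pairs that describe mounts, one per line.
    grep -oE '"(HostDirectory|ContainerDirectory)"[[:space:]]*:[[:space:]]*"[^"]*"' "$file"
}

# Usage:
#   list_mounts dockerrun.aws.json
```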

Normally you will not be able to "stop" individual services inside the container. Thus the best option when dealing with containers is to shut them down and/or destroy them.
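Shutting down and destroying a container from the host could look roughly like the following. The helper name destroy_container and the DOCKER override are assumptions made for this sketch; only "docker ps", "docker stop" and "docker rm" are real commands.

```shell
# Hypothetical helper: stop a container, then remove it.
# DOCKER can be overridden (e.g. DOCKER="sudo docker"); it defaults to "docker".
destroy_container() {
    container="${1:?usage: destroy_container <container name or id>}"
    # Only remove the container if it stopped cleanly.
    ${DOCKER:-docker} stop "$container" && ${DOCKER:-docker} rm "$container"
}

# Usage: find the container with "docker ps", then:
#   destroy_container <container id>
```

Anything kept in the host-mounted folder survives this, which is why the cache, database and logs are stored there.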

Nuclear Option

TL;DR

# Stop everything
killall -9 python python2.7 # Kill Flask and celery
kill -9 $(ps aux | grep rabbitmq | grep -v "grep" | awk '{print $2}') # kill rabbitmq-server
rm -rf <cache location>/* # Cleanup cache
mysql -u root -e "Truncate partial;" # For MySQL
rm <database file>.db # For SQLite
# Make sure virtualenv is active
<root of repo>/startup.sh
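Before restarting, it can be worth confirming that nothing survived the kill commands. The filter below is a sketch of that check; leftover_procs is a hypothetical name, and the patterns are simply the process names mentioned on this page.

```shell
# Hypothetical helper: read a process listing on stdin and print the lines
# that belong to Flask (api.py), celery or rabbitmq (beam.smp, epmd).
# Exit status is 0 when something is left over, non-zero otherwise.
leftover_procs() {
    grep -E 'api\.py|celery|rabbitmq|beam\.smp|epmd' | grep -v grep
}

# Usage: prints nothing when everything has been stopped.
#   ps aux | leftover_procs
```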

Please look at the "Things to keep in Mind" section below to help with debugging.

It's still suggested that you read through at least the rest of this section before copy-pasting the commands above, unless of course you're aware of the pitfalls/consequences.

Full

If you don't want to spend time figuring out which bits need to be cleaned out and only want to get the application back in a working state ASAP, do the following:

  1. Stop everything that's running. This means:
    1. Stop flask (should be running as api.py)
    2. Stop celery
    3. Stop the rabbitmq-server

    A good way to do this is:

    killall -9 python python2.7 # This should get rid of Flask and celery
    # Don't worry if one of python2.7 or python is not found by killall, just confirm no python is running.
    # Confirm with "ps aux | grep celery" and "ps aux | grep api.py"
    
    # Killing rabbitmq is a little trickier
    # The things you need to kill are "beam.smp" and "epmd"
    kill -9 $(ps aux | grep rabbitmq | grep -v "grep" | awk '{print $2}') # Should ideally kill both.
    # Confirm with "ps aux | grep rabbitmq"
  2. Clean out the Cache
    You can simply do:

    # The location of the cache is specified in "default.ini" and "worker.ini" under senbonzakura/configs/ in the app dir.
    # The application dir in docker is /app/ by default, on your local machine it's wherever you cloned the repo
    # On docker the default cache is /perma/cache
    rm -rf <cache location>/* # Note: not rm -rf <cache location>, we need that folder to exist

  3. Clean out the Database
    You simply need to delete/empty the table that contains the data.

    For MySQL: mysql -u root -e "Truncate partial;"

    For SQLite: rm <database file>.db

  4. Restart everything
    You can either start everything by hand, or use one of the existing scripts to start things up.
    You need to have the virtualenv which contains the repository activated before anything else.

    If you're inside docker just run the ./docker_init.sh script. If you're on your own machine run ./startup.sh. Both of these are at the top level of the repository.

If you want to restart things by hand, you can use multiple tabs/panes/terminals or use &. Essentially, you need to run the following 3 commands.

The following instructions assume you're in the root of the repository.

```
rabbitmq-server # add -detached if you want to daemonize instead

celery worker -A senbonzakura.backend.tasks -l INFO # Use -f <log file location> for logging to file, --detach to run as a daemon

python senbonzakura/frontend/api.py
```
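When the three commands are started from one script rather than separate terminals, it may help to wait until the broker answers before starting celery. The retry helper below is a generic sketch, and wait_for is a hypothetical name; the usage comment assumes "rabbitmqctl status" is available on the machine.

```shell
# Hypothetical helper: retry a command once per second until it succeeds,
# giving up after the given number of attempts.
wait_for() {
    tries="$1"; shift
    while [ "$tries" -gt 0 ]; do
        "$@" >/dev/null 2>&1 && return 0
        tries=$((tries - 1))
        # Pause only if another attempt is coming.
        [ "$tries" -gt 0 ] && sleep 1
    done
    return 1
}

# Usage (from the root of the repository, with the virtualenv active):
#   rabbitmq-server -detached
#   wait_for 30 rabbitmqctl status && celery worker -A senbonzakura.backend.tasks -l INFO --detach
#   python senbonzakura/frontend/api.py
```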

Other Methods

Crucial Bits

Essentially the only things that keep any sort of state are:

  1. The Database
  2. The Cache

Database

The database maintains state of the partial requests, so if the database gets corrupted, or if a partial generation aborts, then the database will prevent you from re-triggering the request.

The best non-nuclear way to clean this up is to stop the running services, so that no new entries are added while you're editing the database, and then clean up the database.

Next, remove all entries in the database that have the status field set to a non-zero value. You should be able to do this like so:

delete from partial where status!=0;
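For SQLite, the same statement can be run from the shell with the sqlite3 command-line tool. The sketch below prints the rows first so there is a record of what was removed; purge_stale_partials is a hypothetical name, and it assumes sqlite3 is installed and the services are already stopped.

```shell
# Hypothetical helper (SQLite only): print, then delete, every row of the
# "partial" table whose status is non-zero.
purge_stale_partials() {
    db="${1:?usage: purge_stale_partials <database file>}"
    # Show what is about to be removed.
    sqlite3 "$db" 'SELECT * FROM partial WHERE status != 0;'
    sqlite3 "$db" 'DELETE FROM partial WHERE status != 0;'
}

# Usage:
#   purge_stale_partials <database file>.db
```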

Cache

The cache implicitly maintains some state of the application, because if a partial exists in the cache, it means that the partial generation request was completed. Sometimes there might be a mismatch between the state tracked in the database and the one in the cache (especially after a database modification/cleanup/purge and so on).

The best way to resolve this is to go into the cache and remove the offending partial if you know which one it is. If you don't know or don't want to know which partial is causing the problem, you can simply delete the partial sub-directory in the cache directory.

Nuking the entire cache directory also works (see Nuclear option above), but the cache directory has cached complete MARs that have been downloaded over the course of time and it's probably a good idea to keep them around unless you have reason to do otherwise.
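Emptying just one sub-directory of the cache, while keeping the downloaded complete MARs, could be done along these lines. This is a sketch: clear_cache_subdir is a hypothetical helper, and the sub-directory name used in the usage comment ("partial") is an assumption, so check the actual layout of your cache first.

```shell
# Hypothetical helper: empty one sub-directory of the cache, keeping the
# directory itself. Refuses to run when either argument is missing, so an
# unset variable can never expand to "rm -rf /*".
clear_cache_subdir() {
    cache="${1:?usage: clear_cache_subdir <cache location> <sub-directory>}"
    sub="${2:?usage: clear_cache_subdir <cache location> <sub-directory>}"
    [ -d "$cache/$sub" ] || { echo "no such directory: $cache/$sub" >&2; return 1; }
    # Delete the contents (including dotfiles) but not the directory.
    find "$cache/$sub" -mindepth 1 -delete
}

# Usage (sub-directory name is an assumption):
#   clear_cache_subdir /perma/cache partial
```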

Things to keep in Mind

If you're working with a deployed version of the application and do not want to debug it, but would still like someone else to be able to do so later, you can do some of the following steps to help:

  1. Shut down everything
  2. Back up the database in its current state
    i.e. get a dump of the database, and the current configuration file being used, e.g. my.cnf for MySQL
  3. Make a copy of the cache in its current state.
    Find the location of the cache in the .ini files; it should be /perma/cache by default.
  4. Save a copy of the flask and celery logs as they are.
    The location of the log files is also mentioned in the .ini files (worker.ini and default.ini under senbonzakura/configs)
    They are stored in /var/log/celerylog.log or /usr/local/var/log/celerylog.log by default.
  5. ... Anything else?
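The cache and log steps above can be sketched as a small script that collects everything into one timestamped folder. snapshot_state is a hypothetical helper and the destination path in the usage comment is made up; the database dump is left out because it depends on which database is in use.

```shell
# Hypothetical helper: archive the cache and copy a log file into a
# timestamped folder so someone can debug the failure later.
snapshot_state() {
    cache="${1:?usage: snapshot_state <cache location> <log file> <destination>}"
    log="${2:?usage: snapshot_state <cache location> <log file> <destination>}"
    dest="${3:?usage: snapshot_state <cache location> <log file> <destination>}/funsize-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || return 1
    # -C keeps paths inside the archive relative to the cache directory.
    tar -czf "$dest/cache.tar.gz" -C "$cache" . || return 1
    cp "$log" "$dest/" || return 1
    # Print where the snapshot ended up.
    echo "$dest"
}

# Usage (destination folder is an assumption):
#   snapshot_state /perma/cache /var/log/celerylog.log /var/funsize/debug
```

Run it only after everything has been shut down, so the cache and logs do not change while they are being copied.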