Auto-tools/Projects/ActiveData/Redshift


Redshift

Preamble

For Q1 of 2015, the goal was a basic sanity check that Elasticsearch is a reasonable backend for the ActiveData service, which meant comparing it against alternatives like Redshift. Redshift appears to be designed for data warehouse applications, but is too new to have many third-party performance summaries. When looking at Redshift performance, we have specific demands: our team receives about 300 million test results per week, so queries should be run against a billion records, at a minimum. In the longer term, the ActiveData project is to provide low-latency responses to aggregation and filter queries for a multitude of unknown applications and yet-to-be-known data.

Summary

Redshift is a database optimized for data warehouse workloads.

Features we can use

  • Excellent query language – based on PostgreSQL 8's SQL, with some exceptions
  • Excellent monitoring - see Amazon docs for summary
  • No machine management – The machines supporting the database are wholly managed by Amazon. This is good because a lot of time can be sunk into managing the machines in the ES cluster. Redshift’s administrative tool allows you to grow and shrink the number of nodes, so we have some cost control, but we lose the ability to leverage low-priced spot instances.
  • Can handle a billion records with ease

Missing features

  • Manually managed schema – Redshift is a database, and like most databases, has a fundamentally different philosophy about data than a document store does. ActiveData should be agnostic about the shape of the data it stores, leaving it to the application endpoints to determine that. Redshift demands these schemas be managed manually, or that a program be written to extend the schema as new fields are encountered.
  • JSON is just a string – Redshift does not index the properties found in JSON, but does provide JSON functions so they can be used in a query. We must pull out the interesting JSON properties to their own columns before the full benefit of Redshift is realized.
  • No concept of ID – There is no UNIQUE constraint checking in Redshift, which demands extra programmer diligence when upserting records. Even NoSQL solutions have some way to distinguish whether a document/record is being overwritten, so I consider this an odd omission by the Redshift team.
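Because there is no UNIQUE enforcement, Amazon's documented workaround for upserts is a staged merge: load new rows into a staging table, delete the matching rows from the target, then insert. A minimal sketch of that statement sequence is below; the `test_results` table and key column are from this review, but the staging-table name and the helper itself are illustrative assumptions.

```python
def staged_merge_sql(target, staging, key):
    """Build the delete-then-insert statement sequence Redshift's docs
    suggest for emulating an upsert (no UNIQUE constraints exist)."""
    return [
        f"CREATE TEMP TABLE {staging} (LIKE {target});",
        # ...COPY or INSERT the new rows into the staging table here...
        "BEGIN;",
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key};",
        f"INSERT INTO {target} SELECT * FROM {staging};",
        "COMMIT;",
        f"DROP TABLE {staging};",
    ]

stmts = staged_merge_sql("test_results", "stage", '"etl.source.source.id"')
```

Running the delete and insert inside one transaction keeps readers from observing the table mid-merge.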

Configuration

Tests were performed using a 3-node cluster, in an attempt to mimic the 3-node cluster we are using for Elasticsearch. The test data could fit on one node, but I also wanted to include any network coordination costs and any multi-machine benefits.

There are three nodes, each with

  CPU      2 cores
  ECU      4.4
  MEMORY   15 GB
  DRIVE    2 TB
  NETWORK  0.30 GB/s

Note: the cluster was reporting 20% full with 1.1 billion records

Pricing

This covers only Redshift itself, and does not include the peripheral machines required for ETL and web services. For a fair comparison, weigh this against the cost of the EC2 instances and EBS volumes that make up the Elasticsearch cluster.

3 x $0.850 per Hour = 3 x $612/month = $1,836/month

Importing Data

The speed of loading data is technically out of scope for this product review, but getting a billion records into Redshift turned out to be non-trivial. Three methods of import were attempted:

  • Direct insert via INSERT statements and psycopg2 – 30 Python processes were used to insert data, but achieved only about 1,000 records per second. This is fast enough to keep up with current load, but prevents us from re-processing old data or dealing with future increases. In retrospect, the major bottleneck here was non-optimized Redshift work queues: each work queue executes requests in order, and with only one work queue, most time was spent waiting for other threads’ work to complete; what would usually take 3 seconds took 30 seconds from the client's perspective.
  • Redshift’s COPY command – Requires all records or JSON documents to be formatted as flat files in S3. The ETL process had to be rearranged to do this. Importing ½ billion records took under 7 hours (20,000 records per second), which is excellent. The only issue is that it is an all-or-nothing import; there is no incremental importing feature.
  • Incremental COPY – The COPY command is flexible enough to accept a list of specific S3 files to COPY into the Redshift database. A single-process program was written that compares the records in Redshift with the S3 files to generate the list that requires import. The COPY command is then issued on those files to get them into Redshift. In this case, the files were sent in small blocks of about 500, so that the success of each COPY command could be confirmed approximately every minute. With this hybrid model we get about 10,000 records per second.
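The incremental-COPY loop above can be sketched as: diff the keys already loaded into Redshift against the S3 listing, split the missing files into blocks of ~500, and issue one COPY per block (Redshift accepts an explicit file list via a manifest file). A minimal sketch, with hypothetical bucket/credential placeholders:

```python
def files_to_load(s3_keys, loaded_keys):
    """S3 files not yet reflected in Redshift, in a stable order."""
    return sorted(set(s3_keys) - set(loaded_keys))

def batches(keys, size=500):
    """Split into blocks so each COPY can be confirmed (~once a minute)."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

def copy_command(table, manifest_url):
    # each block's keys would first be written to a manifest JSON file;
    # COPY ... MANIFEST is how Redshift takes an explicit file list
    return (f"COPY {table} FROM '{manifest_url}' "
            "CREDENTIALS '...' JSON 'auto' MANIFEST;")

pending = files_to_load([f"key{i}" for i in range(1200)], ["key0", "key1"])
blocks = list(batches(pending))
```

The block size trades throughput for feedback: smaller blocks confirm success more often, larger blocks amortize per-COPY overhead.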

More on loading flattened data into Redshift

Performance

These queries were run multiple times with approximately the same result. Despite the precision of the reported numbers, only the first digit is significant.

Let’s start with a simple count!

sql> select count(*) from test_results
1,156,126,589
1 row(s) retrieved starting from 0 in 7/6154 ms

Here is a slightly more complicated query. This is a case where the work queue got in the way: The process performing COPY commands was still running, and blocked the query from running for the majority of the 35 seconds.

sql> select distinct "etl.source.source.id" from test_results where "build.date" is null
500 row(s) retrieved starting from 0 in 194/35642 ms

During loading I realized a majority of the records were corrupted by the ETL. Good to know you can wipe out your database quickly!

sql> DELETE FROM test_results where "etl.source.source.id" in (
	SELECT DISTINCT
		"etl.source.source.id"
	FROM test_results
	WHERE "build.date" IS NULL
)
702313854 row(s) affected in 70943 ms
sql> select count(1) from test_results
441,619,970
1 row(s) retrieved starting from 0 in 5/1891 ms

I do not know what pathologically bad thing happened here. Again, probably multiple COPY commands filling the work queue.

sql> select * from test_results where "build.date" is null limit 100
3 row(s) retrieved starting from 0 in 262/2708556 ms

We used the incremental COPY to reload our data, and then confirmed the numbers:

sql> select count(*) from test_results
1,156,126,589
1 row(s) retrieved starting from 0 in 7/6154 ms

Finally, the canonical representative query: chunk timings by platform for the past full week:

sql> SELECT
	"machine.platform",
	"machine.os",
	"run.suite",
	"run.chunk",
	COUNT("run.stats.duration"),
	AVG("run.stats.duration")
FROM
	test_results
WHERE
	"etl.id" = 0 AND -- SINCE THE CUBE IS OF TESTS, WE PICK TEST 0 AS SUITE REPRESENTATIVE
	date_trunc('week', dateadd(DAY, -7, GETDATE())) <= TIMESTAMP 'epoch' + "run.stats.start_time" * INTERVAL '1 second' AND
	TIMESTAMP 'epoch' + "run.stats.start_time" * INTERVAL '1 second' < date_trunc('week', GETDATE())
GROUP BY
	"machine.platform",
	"machine.os",
	"run.suite",
	"run.chunk"
ORDER BY
	"machine.platform",
	"machine.os",
	"run.suite",
	"run.chunk"
;
364 row(s) retrieved starting from 0 in 268/257515 ms
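The WHERE clause above converts `run.stats.start_time` (epoch seconds) to a timestamp and keeps rows from the previous full week; `date_trunc('week', ...)` in Redshift truncates to Monday. The same boundaries can be precomputed client-side, as this sketch shows (the helper name is our own):

```python
from datetime import date, timedelta

def previous_full_week(today):
    """[Monday of last week, Monday of this week) as date bounds,
    mirroring date_trunc('week', ...), which truncates to Monday."""
    this_monday = today - timedelta(days=today.weekday())
    return this_monday - timedelta(days=7), this_monday

start, end = previous_full_week(date(2015, 2, 11))  # a Wednesday
```

Precomputing the epoch bounds in the client would also let the filter compare raw numbers instead of converting every row's `start_time` to a timestamp.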

More Optimizations

This analysis only scratches the surface, and gives us no idea of how much more we can get out of Redshift.

Further optimizations are possible:

  • Separate work queues can be set up to handle differing response-time characteristics
  • The query planner might help with optimization, but I do not believe it will help in this situation; Redshift already indexes the columns for fast filtering and aggregation. In the case of joins, though, you can control which node your data resides on to minimize communication overhead between nodes.
  • SSD drives might improve query performance.
  • Other hidden “shallow optimizations” – The number of unknowns in Redshift still feels quite large to me. One simple oversight, and all my numbers are irrelevant. “With enough eyeballs, all [optimizations] are shallow.”
  • More nodes – I have no doubt more nodes can make the whole thing faster, but this must be balanced with cost.
  • More efficient data shape – There is an endless set of transformations you can apply to your data to get better query performance. The ActiveData philosophy is against putting effort into this time sink: Software is good enough that it should be performing this in the background given the data volume, data shape, and given the queries performed on it.
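The first bullet, separate work queues, would directly address the COPY contention seen during loading: Redshift's `wlm_json_configuration` cluster parameter defines multiple queues with their own concurrency slots, and clients opt into a queue with `SET query_group`. A hedged sketch of such a configuration; the `etl` label and slot counts are assumptions, not tested settings:

```python
import json

# two WLM queues: one reserved for ETL COPY jobs, one default queue for
# ad-hoc queries, so long-running loads no longer block interactive work
wlm_config = json.dumps([
    {"query_group": ["etl"], "query_concurrency": 2},
    {"query_concurrency": 5},  # the default (unnamed) queue
])
# the loader would then issue: SET query_group TO 'etl';
```

This JSON would be applied to the cluster's parameter group, after which interactive queries and COPY jobs draw from separate slots.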

Summary

This analysis is of limited use for comparison with Elasticsearch. Both Redshift and Elasticsearch have many more opportunities to optimize performance, and this project gives little indication of which one is the ultimate winner.

The success here, I hope, is for those readers who have an order of magnitude (or two) LESS data: Redshift is whiz-bang fast, with minimal setup and reasonable cost (on a single node).

Philosophical Lesson

It is apparent to me now that a performance comparison cannot be thorough without a product specialist to guide the optimization process. Absent a specialist, how can one decide whether a new product is better? We run the risk of over-analysis, driven by fear of missing a shallow optimization. Clearly, if the product performs better out of the box than your existing (assumed optimized) solution, then the answer is easy. But even “out of the box” can take a significant amount of time, as it did in this case with Redshift.

More References

These links were a help, and are more relevant to scaling Redshift in general.