Telemetry/Custom analysis with spark

=== Spark SQL and Spark DataFrames/Datasets ===
Spark also supports traditional SQL, along with special data structures that require schemas. The Spark SQL API can be accessed with the `spark` object. For example:
longitudinal = spark.sql('SELECT * FROM longitudinal')
creates a DataFrame that contains all the longitudinal data. A Spark DataFrame is essentially a distributed table, à la Pandas or R DataFrames. Under the covers it is an RDD of Row objects, so the entire RDD API is available for DataFrames, alongside a DataFrame-specific API. For example, a SQL-like way to get the count of a specific OS:
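(A minimal sketch; it assumes the longitudinal table exposes a top-level os field, which may differ from the dataset's real schema.)

spark.sql("SELECT count(*) FROM longitudinal WHERE os = 'Linux'").show()

The same count can be obtained through the DataFrame-specific API directly:

longitudinal.where(longitudinal.os == 'Linux').count()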
=== Accessing the Spark UI ===
Go to localhost:8888/spark after ssh-ing into the Spark cluster to see the Spark UI. It has information about job statuses and task completion, and may help you debug your job.
== The MozTelemetry Library ==
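The moztelemetry library exposes raw telemetry pings as RDDs. As a minimal sketch of its Dataset API (the filter names docType, submissionDate, and appUpdateChannel, the date value, and the sample fraction here are illustrative assumptions, not values prescribed by this page):

from moztelemetry.dataset import Dataset

# Fetch a 1% sample of main pings from the nightly channel for one day.
# Filter names and values are assumptions; consult the moztelemetry docs.
pings = (Dataset.from_source('telemetry')
         .where(docType='main',
                submissionDate='20170401',
                appUpdateChannel='nightly')
         .records(sc, sample=0.01))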
=== How can I load parquet datasets in a Jupyter notebook? ===
Use spark.read.parquet, e.g.:
dataset = spark.read.parquet("s3://the_bucket/the_prefix/the_version")
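Once loaded, dataset is an ordinary DataFrame, so the standard inspection methods apply:

dataset.printSchema()
dataset.count()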
=== I got a REMOTE HOST IDENTIFICATION HAS CHANGED! error ===
This usually means the cluster's hostname or IP address has been reused by a new machine, so its host key no longer matches the one cached in your ~/.ssh/known_hosts. Remove the stale entry (for example with ssh-keygen -R <hostname>) and connect again.