Telemetry/Custom analysis with spark

=== Spark SQL and Spark DataFrames/Datasets ===
Spark also supports traditional SQL, along with special data structures that require schemas. The Spark SQL API can be accessed with the `spark` object. For example:
longitudinal = spark.sql('SELECT * FROM longitudinal')
creates a DataFrame that contains all the longitudinal data. A Spark DataFrame is essentially a distributed table, à la Pandas or R DataFrames. Under the covers it is an RDD of Row objects, so the entire RDD API is available for DataFrames, alongside a DataFrame-specific API. For example, a SQL-like way to get the count of a specific OS:
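(A minimal sketch; it assumes the longitudinal table exposes a top-level os field, which may differ from the dataset's real schema.)

spark.sql("SELECT count(*) FROM longitudinal WHERE os = 'Linux'").show()

The same count can be obtained through the DataFrame-specific API directly:

longitudinal.where(longitudinal.os == 'Linux').count()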
=== Accessing the Spark UI ===
Go to localhost:8888/spark after ssh-ing into the Spark cluster to see the Spark UI. It has information about job statuses and task completion, and may help you debug your job.
== The MozTelemetry Library ==
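The moztelemetry library exposes raw telemetry pings as RDDs. As a minimal sketch of its Dataset API (the filter names docType, submissionDate, and appUpdateChannel, the date value, and the sample fraction here are illustrative assumptions, not values prescribed by this page):

from moztelemetry.dataset import Dataset

# Fetch a 1% sample of main pings from the nightly channel for one day.
# Filter names and values are assumptions; consult the moztelemetry docs.
pings = (Dataset.from_source('telemetry')
         .where(docType='main',
                submissionDate='20170401',
                appUpdateChannel='nightly')
         .records(sc, sample=0.01))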
=== How can I load parquet datasets in a Jupyter notebook? ===
Use spark.read.parquet, e.g.:
dataset = spark.read.parquet("s3://the_bucket/the_prefix/the_version")
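Once loaded, dataset is an ordinary DataFrame, so the standard inspection methods apply:

dataset.printSchema()
dataset.count()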
=== I got a REMOTE HOST IDENTIFICATION HAS CHANGED! error ===
This usually means the cluster's hostname or IP address has been reused by a new machine, so its host key no longer matches the one cached in your ~/.ssh/known_hosts. Remove the stale entry (for example with ssh-keygen -R <hostname>) and connect again.