
How Spark runs on a cluster and How to write Spark Applications

Rahul_S_3
Beginner
Get an idea of, and hands-on experience with, how Spark runs on a cluster and how to write Spark applications. Useful video: https://www.youtube.com/watch?v=CMJLRs4ehEk and details: http://www.osscube.in/hadoop/training/cloudera-developer-training-for-apache-spark
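
For readers who just want a feel for what a Spark application looks like, here is a minimal word-count program in Scala. It is only a sketch: the object name and the input/output paths passed as arguments are placeholders, not taken from the linked course.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // The app name is arbitrary; the master URL is supplied by spark-submit / the cluster manager.
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

        // args(0) = input path, args(1) = output path (both placeholders).
        sc.textFile(args(0))
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile(args(1))

        sc.stop()
      }
    }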
yaswanth_k_
New Contributor I

Spark is a fast and general-purpose cluster computing system which makes parallel jobs easy to write.

Installing Spark on DCOS

Prerequisite
It is recommended that you have a minimum of two nodes, each with at least 2 CPUs and 2 GB of RAM, available for the Spark service and for running a Spark job.
  1. Install the Spark service and CLI (these may take up to 15 minutes to download and complete deployment, depending on your internet connection):

    $ dcos package install spark
    

    Tip: It can take a few minutes for Spark to initialize with DCOS.

  2. Verify that Spark is running:

    • From the DCOS web interface, go to the Services tab and confirm that Spark is running. Click the spark row item to view the Spark web interface.
    • From the DCOS CLI: dcos package list
    • From the Mesos web interface at http://<hostname>/mesos, verify that the Spark framework has registered and is starting tasks. There should be several journalnodes, namenodes, and datanodes running as tasks. Wait for all of these to show the RUNNING state.

Spark CLI Usage

You can run and manage Spark jobs by using the Spark CLI.

--help, -h

Show a description of all command options and positional arguments for the command.

--info

Show a brief description of the command.

--version

Show the version of the installed Spark CLI.

--config-schema

Show the Spark CLI configuration schema.

run --help

Show a description of all dcos spark run command options and positional arguments.

run --submit-args=<spark-args>

Run a Spark job with the required <spark-args> specified.

status <submissionId>

Show the status of the Spark job with the specified <submissionId>.

kill <submissionId>

Kill the Spark job with the specified <submissionId>.

webui

Show the URL of the Spark web interface.

Usage Example

In this example, a Spark job is run.

Prerequisite: Before running a Spark job, you must upload the Spark application .jar to an external file server that is reachable by the DCOS cluster. For example: http://external.website/mysparkapp.jar.

  1. To run a job whose main method is in a class called MySampleClass, with argument 10 (a minimal sketch of such a class appears after this list):

    $ dcos spark run --submit-args='--class MySampleClass http://external.website/mysparkapp.jar 10'
    

    You can also upload jar files into HDFS, and reference those jars in your job. For example:

    $ dcos spark run --submit-args='-Dspark.mesos.coarse=true --driver-cores 1 --driver-memory 1024M --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.4.0-SNAPSHOT.jar 30'
    
  2. View the Spark scheduler's progress in the Spark web interface; this command shows its URL:

    $ dcos spark webui
    
  3. Set the --supervise flag when running the job so that the scheduler restarts the job if it fails:

    $ dcos spark run --submit-args='--supervise --class MySampleClass http://external.website/mysparkapp.jar 10'
    
  4. You can set the number of cores and the amount of memory that the driver requires in order to be scheduled by passing the --driver-cores and --driver-memory flags. Any options that the usual spark-submit script accepts are also accepted by the DCOS Spark CLI.

  5. To set any other Spark properties (e.g., coarse-grained or fine-grained mode), you can also provide a custom spark.properties file and set the environment variable SPARK_CONF_DIR to the directory that contains it.
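
As promised in step 1, here is a hypothetical sketch of what MySampleClass might look like: a Scala object whose main method receives the trailing "10" from the dcos spark run command and runs a simple parallel computation. The computation itself is illustrative only, not from the DCOS docs.

    import org.apache.spark.{SparkConf, SparkContext}

    object MySampleClass {
      def main(args: Array[String]): Unit = {
        // The trailing "10" in the dcos spark run command arrives here as args(0);
        // this sketch treats it as the number of partitions to use.
        val slices = if (args.nonEmpty) args(0).toInt else 2
        val sc = new SparkContext(new SparkConf().setAppName("MySampleClass"))

        // A trivial parallel job: sum the integers 1 to 1,000,000 across `slices` partitions.
        val total = sc.parallelize(1 to 1000000, slices).map(_.toLong).reduce(_ + _)
        println(s"Sum = $total")

        sc.stop()
      }
    }

Packaged into mysparkapp.jar and uploaded as described in the prerequisite, it can be submitted with the command from step 1, optionally adding --driver-cores and --driver-memory as noted in step 4.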

Uninstalling Spark

  1. From the DCOS CLI, enter this command:

    $ dcos package uninstall spark
    
  2. Open the Zookeeper Exhibitor web interface at <hostname>/exhibitor, where <hostname> is the Mesos Master hostname.

    1. Click on the Explorer tab and navigate to the spark_mesos_dispatcher folder.

    2. Choose the Delete type, fill in the required Username, Ticket/Code, and Reason fields, and click Next.

    3. Click OK to confirm your deletion.

For more information:

yaswanth_k_
New Contributor I

MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
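
To make the "interactively query a dataset" claim concrete, here is a rough Scala sketch of the pattern the abstract describes: load a dataset once, cache it in memory across the cluster, then run repeated queries against the cached RDD. The file path and the query predicates are made up for illustration.

    import org.apache.spark.{SparkConf, SparkContext}

    object InteractiveQueries {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("InteractiveQueries"))

        // Load the dataset once and keep it cached in memory across the cluster (hypothetical path).
        val logs = sc.textFile("hdfs:///data/logs").cache()

        // Repeated ad-hoc queries now hit the in-memory copy instead of re-reading from disk.
        val total  = logs.count()
        val errors = logs.filter(_.contains("ERROR")).count()
        println(s"total=$total errors=$errors")

        sc.stop()
      }
    }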

 

yaswanth_k_
New Contributor I

A new model of cluster computing has become widely popular, in which data-parallel computations are executed on clusters of unreliable machines by systems that automatically provide locality-aware scheduling, fault tolerance, and load balancing. MapReduce pioneered this model, while systems like Dryad and Map-Reduce-Merge generalized the types of data flows supported. These systems achieve their scalability and fault tolerance by providing a programming model where the user creates acyclic data flow graphs to pass input data through a set of operators. This allows the underlying system to manage scheduling and to react to faults without user intervention.

While this data flow programming model is useful for a large class of applications, there are applications that cannot be expressed efficiently as acyclic data flows. In this paper, we focus on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes two use cases where we have seen Hadoop users report that MapReduce is deficient:

• Iterative jobs: Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.

• Interactive analytics: Hadoop is often used to run ad-hoc exploratory queries on large datasets, through SQL interfaces such as Pig and Hive. Ideally, a user would be able to load a dataset of interest into memory across a number of machines and query it repeatedly. However, with Hadoop, each query incurs significant latency (tens of seconds) because it runs as a separate MapReduce job and reads data from disk.

This paper presents a new cluster computing framework called Spark, which supports applications with working sets while providing similar scalability and fault tolerance properties to MapReduce.

The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition. Although RDDs are not a general shared memory abstraction, they represent a sweet spot between expressivity on the one hand and scalability and reliability on the other hand, and we have found them well-suited for a variety of applications.

Spark is implemented in Scala, a statically typed high-level programming language for the Java VM, and exposes a functional programming interface similar to DryadLINQ. In addition, Spark can be used interactively from a modified version of the Scala interpreter, which allows the user to define RDDs, functions, variables and classes and use them in parallel operations on a cluster. We believe that Spark is the first system to allow an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster.

Although our implementation of Spark is still a prototype, early experience with the system is encouraging. We show that Spark can outperform Hadoop by 10x in iterative machine learning workloads and can be used interactively to scan a 39 GB dataset with sub-second latency.
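
A rough Scala sketch of the "iterative jobs" pattern described above: cache an RDD once, then reuse it across the iterations of a gradient-descent-style loop instead of re-reading the data on every pass. The file layout, update rule, and number of iterations are placeholders chosen only to show the caching pattern.

    import org.apache.spark.{SparkConf, SparkContext}

    object IterativeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("IterativeSketch"))

        // Parse a hypothetical "x,y" text file once and cache the resulting RDD.
        val points = sc.textFile("hdfs:///data/points.csv")
          .map { line =>
            val Array(x, y) = line.split(",")
            (x.toDouble, y.toDouble)
          }
          .cache()

        // Fit y ~ w * x by gradient descent; every iteration reuses the cached RDD
        // instead of reloading the data from disk.
        var w = 0.0
        for (_ <- 1 to 10) {
          val gradient = points.map { case (x, y) => (w * x - y) * x }.mean()
          w -= 0.1 * gradient
        }
        println(s"w = $w")

        sc.stop()
      }
    }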

 

 

robert_m_6
Beginner

 

  1. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. Spark is agnostic to the underlying cluster manager.
  2. The driver program must listen for and accept incoming connections from its executors throughout its lifetime. As such, the driver program must be network addressable from the worker nodes.
  3. For example, counting buy and sell orders per symbol with an RDD built from an iterator of order messages:

     class OrdersFunctions(@transient sc: SparkContext, orders: Iterator[OpenBookMsg]) extends Serializable {

       // Materialize the incoming orders as an RDD so they can be processed in parallel.
       private val ordersRDD = sc.parallelize(orders.toSeq)

       def countBuyOrders(): Map[String, Long] = countOrders(OrderFunctions.isBuySide)

       def countSellOrders(): Map[String, Long] = countOrders(OrderFunctions.isSellSide)

       // Count the orders that match the given predicate, grouped by symbol.
       private def countOrders(filter: OpenBookMsg => Boolean): Map[String, Long] =
         ordersRDD.filter(filter).
           map(order => (order.symbol, order)).
           countByKey().toMap
     }
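
A hypothetical usage sketch for the class above, assuming OpenBookMsg has a symbol field and that OrderFunctions provides the isBuySide/isSellSide predicates the snippet refers to (none of these types are shown in the post):

     // Hypothetical driver code: `sc` is an existing SparkContext and `messages`
     // is an Iterator[OpenBookMsg] obtained elsewhere (e.g. parsed from a market data file).
     val orderFunctions = new OrdersFunctions(sc, messages)
     val buysPerSymbol: Map[String, Long] = orderFunctions.countBuyOrders()
     val sellsPerSymbol: Map[String, Long] = orderFunctions.countSellOrders()
     buysPerSymbol.foreach { case (symbol, count) => println(s"$symbol buys=$count") }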

 

Gaurav_Gogia
Beginner

yaswanth k. wrote:

[Quotes, in full, the Spark-on-DCOS installation and CLI guide from the reply above.]

Is there a way to add this to my Azure-based server?
