Running Spark in standalone mode in Cloudera
Cloudera Version : CDH-5.8.0-1.cdh5.8.0.p0.42
Set the master node IP in spark-env.sh:
sudo vi /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/etc/spark/conf.dist/spark-env.sh
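For reference, a minimal spark-env.sh fragment for standalone mode could look like the following. The values are illustrative, not taken from the actual file: webanalytics09 is the master host used below, and the ports are the common defaults.

```shell
# Illustrative spark-env.sh fragment for standalone mode (Spark 1.6 / CDH 5.8).
# SPARK_MASTER_IP is the Spark 1.x variable name (2.x renamed it SPARK_MASTER_HOST).
export SPARK_MASTER_IP=webanalytics09
export SPARK_MASTER_PORT=7077          # master RPC port (matches spark://webanalytics09:7077 below)
export SPARK_MASTER_WEBUI_PORT=18080   # master web UI port
```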
Start the master with: /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/sbin/start-master.sh
Start a slave with: /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/sbin/start-slave.sh -d /var/run/spark/work spark://webanalytics09:7077
One of the machines is not able to start its worker: executing start-slave.sh throws the error below.
Spark Command: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.x86_64/jre/bin/java -cp /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/conf/:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/lib/spark-assembly-1.6.0-cdh5.8.0-hadoop2.6.0-cdh5.8.0.jar:/etc/hadoop/conf/:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/:/usr/lib/hadoop/:/usr/lib/hadoop-hdfs/lib/:/usr/lib/hadoop-hdfs/:/usr/lib/hadoop-mapreduce/lib/:/usr/lib/hadoop-mapreduce/:/usr/lib/hadoop-yarn/lib/:/usr/lib/hadoop-yarn/:/usr/lib/hive/lib/:/usr/lib/flume-ng/lib/:/usr/lib/paquet/lib/:/usr/lib/avro/lib/ -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 18081 --port 7078 -d /var/run/spark/work spark://webanalytics09:7077
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 7 more
The script is not able to configure the classpath. As a workaround, I added the following line to /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/etc/spark/conf.dist/spark-env.sh:
export SPARK_DIST_CLASSPATH=$(paste -sd: "/etc/spark/conf.cloudera.spark_on_yarn/classpath.txt")
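The paste -sd: in that line is what turns classpath.txt (one jar path per line) into the single colon-separated string SPARK_DIST_CLASSPATH expects. A quick sketch with a throwaway file instead of the real classpath.txt:

```shell
# Demonstrate how `paste -sd:` joins a one-entry-per-line classpath file
# into a colon-separated classpath string.
# (Throwaway file; the real input is classpath.txt under conf.cloudera.spark_on_yarn.)
printf 'a.jar\nb.jar\nc.jar\n' > /tmp/classpath.txt
paste -sd: /tmp/classpath.txt    # prints a.jar:b.jar:c.jar
```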
Now the worker comes up, but on port 7180.
Finally got the actual problem: the symlink for /etc/spark/conf is pointing to the wrong directory, /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/etc/spark/conf.dist, when it should point to /etc/spark/conf.cloudera.spark_on_yarn. After fixing the symlink, the worker runs without any change to spark-env.sh:
rm /etc/alternatives/spark-conf                                            (remove the old symlink)
ls /etc/alternatives/                                                      (check that it is removed)
ln -s /etc/spark/conf.cloudera.spark_on_yarn /etc/alternatives/spark-conf  (create a new symlink to the correct folder)
Now start-slave.sh works; no other changes are required.
Submitting PySpark jobs using the Spark REST URL
(Spark REST URL: http://
Request to submit a job:
POST --> http://
{
  "action": "CreateSubmissionRequest",
  "appArgs": ["file:/root/SparkSetup/lalit/pysparkframework/SparkFramework.py",
              "/root/SparkSetup/lalit/pysparkframework/resources/newSparkStart.xml"],
  "appResource": "file:/root/SparkSetup/lalit/pysparkframework/SparkFramework.py",
  "clientSparkVersion": "1.6.0",
  "mainClass": "org.apache.spark.deploy.SparkSubmit",
  "environmentVariables": {"PYSPARK_PYTHON": "python3.4", "SPARK_ENV_LOADED": "1"},
  "sparkProperties": {
    "spark.driver.supervise": "false",
    "spark.app.name": "SparkClientJob",
    "spark.eventLog.enabled": "false",
    "spark.submit.deployMode": "client",
    "spark.master": "local[*]",
    "spark.submit.pyFiles": "file:/root/SparkSetup/lalit/pysparkframework/dist/SparkTensorApp-0.1-py3.4.egg",
    "spark.jars.packages": "com.datastax.spark:spark-cassandra-connector_2.10:1.6.0",
    "sparkRestServer": "10.203.238.214:6066"
  }
}
Response:
{
  "action": "CreateSubmissionResponse",
  "message": "Driver successfully submitted as driver-20160903005315-0002",
  "serverSparkVersion": "1.6.0",
  "submissionId": "driver-20160903005315-0002",
  "success": true
}
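The same request can be sent from Python with only the standard library. A sketch, assuming the REST server is on port 6066 as above and that /v1/submissions/create is the submission endpoint (it is the standard one for Spark's REST submission server); paths and the egg name in the commented example are the ones from this setup:

```python
# Build and POST a CreateSubmissionRequest to Spark's REST submission server.
import json
from urllib.request import Request, urlopen

def build_submission(app_file, app_args, py_files):
    """Assemble the request body shown above as a Python dict."""
    return {
        "action": "CreateSubmissionRequest",
        "appResource": app_file,
        "appArgs": [app_file] + list(app_args),
        "clientSparkVersion": "1.6.0",
        "mainClass": "org.apache.spark.deploy.SparkSubmit",
        "environmentVariables": {"PYSPARK_PYTHON": "python3.4",
                                 "SPARK_ENV_LOADED": "1"},
        "sparkProperties": {
            "spark.app.name": "SparkClientJob",
            "spark.master": "local[*]",
            "spark.submit.deployMode": "client",
            "spark.submit.pyFiles": py_files,
            "spark.driver.supervise": "false",
            "spark.eventLog.enabled": "false",
        },
    }

def submit(rest_url, body):
    """POST the body; return the parsed CreateSubmissionResponse."""
    req = Request(rest_url + "/v1/submissions/create",
                  data=json.dumps(body).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (needs a running REST server):
# body = build_submission(
#     "file:/root/SparkSetup/lalit/pysparkframework/SparkFramework.py",
#     ["/root/SparkSetup/lalit/pysparkframework/resources/newSparkStart.xml"],
#     "file:/root/SparkSetup/lalit/pysparkframework/dist/SparkTensorApp-0.1-py3.4.egg")
# print(submit("http://10.203.238.214:6066", body)["submissionId"])
```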
Request for status:
GET --> http://10.203.238.214:6066/v1/submissions/status/driver-20160903022151-0012
Response:
{
  "action": "SubmissionStatusResponse",
  "driverState": "FINISHED",
  "serverSparkVersion": "1.6.0",
  "submissionId": "driver-20160903022151-0012",
  "success": true,
  "workerHostPort": "10.203.238.222:39339",
  "workerId": "worker-20160902221408-10.203.238.222-39339"
}
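Since the status call returns driverState, it is straightforward to poll until the driver finishes. A sketch; the terminal-state names come from Spark's standalone DriverState, and the URL shape matches the status request above:

```python
# Poll /v1/submissions/status/<id> until the driver reaches a terminal state.
import json
import time
from urllib.request import urlopen

# Terminal values of driverState in Spark standalone mode.
TERMINAL_STATES = {"FINISHED", "FAILED", "ERROR", "KILLED"}

def get_status(rest_url, submission_id):
    """GET and parse a SubmissionStatusResponse."""
    with urlopen("%s/v1/submissions/status/%s" % (rest_url, submission_id)) as resp:
        return json.loads(resp.read().decode("utf-8"))

def wait_for_driver(rest_url, submission_id, interval=5.0):
    """Block until driverState is terminal; return the final response."""
    while True:
        status = get_status(rest_url, submission_id)
        if status.get("driverState") in TERMINAL_STATES:
            return status
        time.sleep(interval)

# Example (needs a running REST server):
# final = wait_for_driver("http://10.203.238.214:6066", "driver-20160903022151-0012")
# print(final["driverState"])
```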
Running a multi-module PySpark job on the command line:
- Add spark
- export PYSPARK_PYTHON=python3
- python3 setup.py bdist_egg
- spark-submit SparkFramework.py resources/TrainedInceptionModel.xml
- spark-submit --py-files pysparkframework/dist/SparkTensorApp-0.1-py3.4.egg SparkFramework.py pysparkframework/resources/TrainedInceptionModel.xml
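The bdist_egg step assumes a setup.py at the project root. A minimal, hypothetical one matching the egg filename SparkTensorApp-0.1-py3.4.egg might look like this (the real project's metadata and package layout will differ):

```python
# Hypothetical setup.py producing SparkTensorApp-0.1-py3.4.egg via
# `python3 setup.py bdist_egg`; name/version mirror the egg filename.
from setuptools import setup, find_packages

setup(
    name="SparkTensorApp",
    version="0.1",
    packages=find_packages(),  # picks up every package under the project root
)
```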
http://rustyrazorblade.com/2015/05/spark-streaming-with-python-and-kafka/