zuo_z_
Beginner

Running Spark k-means with DAAL throws a JVM error

I run SampleKmeans.java on a Spark cluster with DAAL (data file Kmeans10w.csv, 174 MB):

bin/spark-submit \
--class daal.SampleKmeans \
--master yarn-client \
--num-executors 10 \
--executor-memory 30g \
--driver-memory 30g \
--executor-cores 2 \
--jars /opt/zzb/algebird-core_2.10-0.5.0.jar,/opt/zzb/daal.jar \
/opt/zzb/grider_2.10-0.1.0.jar \
yarn-client /hdfs/matrix/Kmeans10w.csv 100 20 5 1000
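For readability, here is how the positional arguments are consumed, based on the modified main method quoted later in the thread:

```shell
# Mapping of the positional arguments (per the modified main method):
#   args[0] = yarn-client                 -> Spark master
#   args[1] = /hdfs/matrix/Kmeans10w.csv  -> HDFS input path
#   args[2] = 100                         -> nBlocks
#   args[3] = 20                          -> nClusters
#   args[4] = 5                           -> nIterations
#   args[5] = 1000                        -> nVectorsInBlock
```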

It throws this error:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fc7f0d9892f, pid=19166, tid=140499757270784
#
# JRE version: Java(TM) SE Runtime Environment (7.0_71-b14) (build 1.7.0_71-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.71-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libJavaAPI.so+0x255b92f]  daal::algorithms::kmeans::init::internal::KMeansinitStep2MasterKernel<(daal::algorithms::kmeans::init::Method)1, double, (daal::CpuType)3>::compute(unsigned long, daal::data_management::interface1::NumericTable const* const*, unsigned long, daal::data_management::interface1::NumericTable const* const*, daal::algorithms::kmeans::init::interface1::Parameter const*)+0x35f
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /opt/spark/hs_err_pid19166.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
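As the crash report itself notes, core dumps were disabled. A minimal sketch of enabling them in the launching shell (assuming a bash shell on the host where the JVM crashes), so the native crash in libJavaAPI.so leaves a core file to inspect:

```shell
# Remove the core-file size limit for this shell, as the JRE message suggests,
# before starting Java / spark-submit again.
ulimit -c unlimited
ulimit -c   # verify; should print "unlimited"
```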

Heap
PSYoungGen      total 1529856K, used 759185K [0x0000000795500000, 0x0000000800000000, 0x0000000800000000)
  eden space 1311744K, 55% used [0x0000000795500000,0x00000007c19ba6a0,0x00000007e5600000)
  from space 218112K, 15% used [0x00000007e5600000,0x00000007e76aa088,0x00000007f2b00000)
  to   space 218112K, 0% used [0x00000007f2b00000,0x00000007f2b00000,0x0000000800000000)
ParOldGen       total 3495424K, used 80K [0x00000006bff80000, 0x0000000795500000, 0x0000000795500000)
  object space 3495424K, 0% used [0x00000006bff80000,0x00000006bff94020,0x0000000795500000)
PSPermGen       total 56320K, used 56063K [0x00000006aff80000, 0x00000006b3680000, 0x00000006bff80000)
  object space 56320K, 99% used [0x00000006aff80000,0x00000006b363fd68,0x00000006b3680000)
 

Then I set the JVM parameters:

bin/spark-submit \
--class daal.SampleKmeans \
--master yarn-client \
--num-executors 10 \
--executor-memory 30g \
--driver-memory 30g \
--executor-cores 2 \
--conf "spark.driver.extraJavaOptions=-Xms5g -Xmx5g -XX:PermSize=1g -XX:MaxPermSize=1g" \
--jars /opt/zzb/algebird-core_2.10-0.5.0.jar,/opt/zzb/daal.jar \
/opt/zzb/grider_2.10-0.1.0.jar \

It still throws the same error. In the log I can see:

GC Heap History (2 events):
Event: 34.345 GC heap before
{Heap before GC invocations=1 (full 0):
 PSYoungGen      total 1529856K, used 1311744K [0x0000000795500000, 0x0000000800000000, 0x0000000800000000)
  eden space 1311744K, 100% used [0x0000000795500000,0x00000007e5600000,0x00000007e5600000)
  from space 218112K, 0% used [0x00000007f2b00000,0x00000007f2b00000,0x0000000800000000)
  to   space 218112K, 0% used [0x00000007e5600000,0x00000007e5600000,0x00000007f2b00000)
 ParOldGen       total 3495424K, used 0K [0x00000006bff80000, 0x0000000795500000, 0x0000000795500000)
  object space 3495424K, 0% used [0x00000006bff80000,0x00000006bff80000,0x0000000795500000)
 PSPermGen       total 1048576K, used 40067K [0x000000067ff80000, 0x00000006bff80000, 0x00000006bff80000)
  object space 1048576K, 3% used [0x000000067ff80000,0x00000006826a0d88,0x00000006bff80000)
Event: 34.382 GC heap after
Heap after GC invocations=1 (full 0):
 PSYoungGen      total 1529856K, used 33582K [0x0000000795500000, 0x0000000800000000, 0x0000000800000000)
  eden space 1311744K, 0% used [0x0000000795500000,0x0000000795500000,0x00000007e5600000)
  from space 218112K, 15% used [0x00000007e5600000,0x00000007e76cba88,0x00000007f2b00000)
  to   space 218112K, 0% used [0x00000007f2b00000,0x00000007f2b00000,0x0000000800000000)
 ParOldGen       total 3495424K, used 80K [0x00000006bff80000, 0x0000000795500000, 0x0000000795500000)
  object space 3495424K, 0% used [0x00000006bff80000,0x00000006bff94020,0x0000000795500000)
 PSPermGen       total 1048576K, used 40067K [0x000000067ff80000, 0x00000006bff80000, 0x00000006bff80000)
  object space 1048576K, 3% used [0x000000067ff80000,0x00000006826a0d88,0x00000006bff80000)
}

Then I set the JVM parameters bigger, like: --conf "spark.driver.extraJavaOptions=-Xms10g -Xmx10g -XX:PermSize=1g -XX:MaxPermSize=1g"

It still throws the same error. In the log I can see:

Event: 39.652 GC heap before
{Heap before GC invocations=1 (full 0):
 PSYoungGen      total 3058688K, used 2621952K [0x000000072aa80000, 0x0000000800000000, 0x0000000800000000)
  eden space 2621952K, 100% used [0x000000072aa80000,0x00000007cab00000,0x00000007cab00000)
  from space 436736K, 0% used [0x00000007e5580000,0x00000007e5580000,0x0000000800000000)
  to   space 436736K, 0% used [0x00000007cab00000,0x00000007cab00000,0x00000007e5580000)
 ParOldGen       total 6990848K, used 0K [0x000000057ff80000, 0x000000072aa80000, 0x000000072aa80000)
  object space 6990848K, 0% used [0x000000057ff80000,0x000000057ff80000,0x000000072aa80000)
 PSPermGen       total 1048576K, used 42037K [0x000000053ff80000, 0x000000057ff80000, 0x000000057ff80000)
  object space 1048576K, 4% used [0x000000053ff80000,0x000000054288d580,0x000000057ff80000)
Event: 39.695 GC heap after
Heap after GC invocations=1 (full 0):
 PSYoungGen      total 3058688K, used 35054K [0x000000072aa80000, 0x0000000800000000, 0x0000000800000000)
  eden space 2621952K, 0% used [0x000000072aa80000,0x000000072aa80000,0x00000007cab00000)
  from space 436736K, 8% used [0x00000007cab00000,0x00000007ccd3b9a8,0x00000007e5580000)
  to   space 436736K, 0% used [0x00000007e5580000,0x00000007e5580000,0x0000000800000000)
 ParOldGen       total 6990848K, used 80K [0x000000057ff80000, 0x000000072aa80000, 0x000000072aa80000)
  object space 6990848K, 0% used [0x000000057ff80000,0x000000057ff94020,0x000000072aa80000)
 PSPermGen       total 1048576K, used 42037K [0x000000053ff80000, 0x000000057ff80000, 0x000000057ff80000)
  object space 1048576K, 4% used [0x000000053ff80000,0x000000054288d580,0x000000057ff80000)
}
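One variant worth noting (an assumption on my part, not something verified in this thread): spark.driver.extraJavaOptions only affects the driver JVM, and spark.executor.extraJavaOptions is the executor-side analogue; heap size is normally controlled by --driver-memory / --executor-memory rather than -Xms/-Xmx inside these options. A sketch passing the PermGen settings to both sides:

```shell
# Sketch (assumption): pass the PermGen options to both the driver and the
# executor JVMs, leaving heap sizing to --driver-memory / --executor-memory.
bin/spark-submit \
--class daal.SampleKmeans \
--master yarn-client \
--num-executors 10 \
--executor-memory 30g \
--driver-memory 30g \
--executor-cores 2 \
--conf "spark.driver.extraJavaOptions=-XX:PermSize=1g -XX:MaxPermSize=1g" \
--conf "spark.executor.extraJavaOptions=-XX:PermSize=1g -XX:MaxPermSize=1g" \
--jars /opt/zzb/algebird-core_2.10-0.5.0.jar,/opt/zzb/daal.jar \
/opt/zzb/grider_2.10-0.1.0.jar \
yarn-client /hdfs/matrix/Kmeans10w.csv 100 20 5 1000
```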

SampleKmeans.java is the example included in the DAAL project. The size of the test data "Kmeans10w.csv" is 174 MB, and I can run the same data with MLlib without any problem.

It seems DAAL does not work well here. Can anybody help me? Thanks!

Ilya_B_Intel
Employee

Hi Zuo,

Can you also tell us how you modified SampleKmeans.java to load your data file?

Did you modify SparkKmeans.java in the following part?

    private static final long nBlocks   = 1;
    private static final long nClusters = 20;
    private static final int nIterations = 5;
    private static final int nVectorsInBlock = 10000;

Do I understand correctly that your data has 174*10^6 rows? How many columns are there?

zuo_z_
Beginner

Hi Ilya B., thanks for your reply.

I modified the main method of SampleKmeans.java as follows:

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        DaalContext context = new DaalContext();

        /* Create a JavaSparkContext that loads defaults from system properties and the classpath and sets the name */
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("Spark Kmeans").setMaster(args[0]));

        /* Read from distributed HDFS data set from path */
        StringDataSource templateDataSource = new StringDataSource( context, "" );
        DistributedHDFSDataSet dd = new DistributedHDFSDataSet(args[1], templateDataSource );

        long nBlocks   = Integer.valueOf(args[2]);
        long nClusters = Integer.valueOf(args[3]);
        int nIterations = Integer.valueOf(args[4]);
        int nVectorsInBlock = Integer.valueOf(args[5]);

        JavaPairRDD<Integer, HomogenNumericTable> dataRDD = dd.getAsPairRDDPartitioned(sc, Integer.valueOf(nBlocks + ""), nVectorsInBlock);
        long start1 = System.currentTimeMillis();

       // JavaPairRDD<Integer, HomogenNumericTable> repartition = dataRDD.repartition(Integer.valueOf(nBlocks + ""));

        /* Compute k-means for dataRDD */
        SparkKmeans.nBlocks = nBlocks;
        SparkKmeans.nClusters = nClusters;
        SparkKmeans.nIterations = nIterations;
        SparkKmeans.nVectorsInBlock = nVectorsInBlock;

        SparkKmeans.KmeansResult result = SparkKmeans.runKmeans(context, dataRDD);
        long end = System.currentTimeMillis();

        /* Print results */
        //HomogenNumericTable Centroids  = result.centroids;
        //printNumericTable("First 10 dimensions of centroids:", Centroids, 20, 10);
        System.out.println("load time " + (start1 - start)/1000 + "s.");
        System.out.println("run time " + (end - start1)/1000 + "s.");
        System.out.println("part " + dataRDD.partitions().size());

        sc.stop();
        context.dispose();
    }

and SparkKmeans.java as follows:

    static long nBlocks   = 1;
    static long nClusters = 20;
    static int nIterations = 5;
    static int nVectorsInBlock = 10000;

My data has 100 * 1000 rows, and there are 100 features per row. The size of the data file is 174.3 MB.
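As a rough sanity check on those numbers (my own arithmetic, not from the sample):

```java
// Rough arithmetic check: 100 * 1000 = 100,000 rows with 100 features each,
// stored as a 174.3 MB CSV, gives the average text width per value.
public class DataSizeCheck {
    public static void main(String[] args) {
        long rows = 100L * 1000L;        // 100,000 rows ("10w" in the file name)
        long cols = 100L;                // features per row
        long fileBytes = 174_300_000L;   // ~174.3 MB
        double bytesPerValue = (double) fileBytes / (rows * cols);
        System.out.printf("~%.1f bytes per CSV value%n", bytesPerValue);
        // At roughly 17 bytes of text per value, 174 MB is consistent with
        // 100,000 rows of 100 doubles, not with 174*10^6 rows.
    }
}
```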

Are there any errors in my code? Thanks!