Spark has a sophisticated machine learning offering in the form of MLlib. Here, we'll use MLlib to do binary classification with a linear SVM.
Add the MLlib dependency in build.sbt:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.0",
  "org.apache.spark" %% "spark-sql" % "1.4.0",
  "org.apache.spark" %% "spark-mllib" % "1.4.0"
)
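The %% operator tells sbt to append your project's Scala binary version to each artifact name, so these resolve to spark-core_2.10 and friends when building with Scala 2.10 (the default for Spark 1.4.0).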
Add the imports:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils
Main skeleton:
object Main {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[*]", "hello-spark")
    sc.stop()
  }
}
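The snippets that follow all go inside main, between the SparkContext creation and the call to sc.stop().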
Load the data as an RDD of labeled points and split it into training and test sets:
val data = MLUtils.loadLibSVMFile(sc, "./data/svm/sample_libsvm_data.txt")
// 60% for training, 40% for testing; the seed makes the split reproducible.
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
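For reference, each line of a LIBSVM file is a label followed by sparse index:value feature pairs (1-based indices in the file, which MLUtils converts to 0-based). The values below are made up, but the shape matches the sample file:

0 128:51.0 129:159.0 130:253.0
1 159:124.0 160:253.0 161:255.0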
Build the model and evaluate it on the test set:
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
// Clear the default threshold so predict() returns raw scores rather than
// 0/1 labels; BinaryClassificationMetrics needs the raw scores.
model.clearThreshold()
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
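If you want hard 0/1 predictions instead of raw scores, restore a decision threshold before predicting. A minimal sketch; setThreshold and predict come from MLlib, while the accuracy calculation is our own:

// Restore the default decision threshold (0.0) so predict() returns 0/1 labels.
model.setThreshold(0.0)
val predictionAndLabels = test.map { point =>
  (model.predict(point.features), point.label)
}
// Fraction of test points classified correctly.
val accuracy = predictionAndLabels.filter { case (p, l) => p == l }.count.toDouble / test.count()
println("Test accuracy = " + accuracy)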
You can save the model to disk and reuse it later:
model.save(sc, "./data/svm/model")
val sameModel = SVMModel.load(sc, "./data/svm/model")
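The reloaded model scores points exactly like the original. Note that save fails if the target directory already exists, so delete ./data/svm/model between runs. As a quick sanity check, you could score a single hand-built vector; the indices and values here are made up, but the sample data does have 692 features:

import org.apache.spark.mllib.linalg.Vectors

// Hypothetical sparse vector: 692 features, non-zero values at indices 100 and 200.
val score = sameModel.predict(Vectors.sparse(692, Array(100, 200), Array(25.0, 121.0)))
println("Score for the hand-built vector = " + score)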