
README

This is a quick guide on how to use the experiment framework and the parallel Radon machine. In the sources, you will find two implementations: one using PySpark and one in Scala. We start with the PySpark variant.

PySpark Radon Machine

Contents

  • Get the code running
  • Run an experiment
  • Use the parallel Radon machine independently

Get the code running

  • Please ensure that you have the following software installed:
  1. Spark with PySpark and Hadoop
  2. Python with numpy, scipy, sklearn, findspark, pandas (and all necessary dependencies; if a package is missing from this list, please let me know)
  • Download the entire folder "ParallelRadonMachine_python". Now, you're good to go.
  • To test that the software is running, go to /experiments/SPARK/sparkDatasets and execute exp.py (e.g., by typing python exp.py in the console)
  • This runs an experiment that compares the parallel Radon machine against Spark MLlib parallel learners on a number of classification datasets

Run an experiment

  • As you have seen above, an experiment is run by simply executing an experiment script, e.g., /experiments/SPARK/sparkDatasets/exp.py
  • Each experiment script specifies a list of learners, metrics, parallelisation methods, and datasets
  • It then creates an experiment object (ParallelRadonExperiment or SparkParallelRadonExperiment) with these parameters
  • The experiment is started by calling the run() method
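The steps above can be sketched as follows. Note that the Experiment class, its parameter names, and the configuration values below are illustrative stand-ins, not the repository's actual ParallelRadonExperiment/SparkParallelRadonExperiment API:

```python
# Illustrative sketch of the experiment-script pattern described above.
# Class name, constructor parameters, and configuration values are
# assumptions, not the repository's actual API.

class Experiment:
    def __init__(self, learners, metrics, parallelisations, datasets):
        self.learners = learners
        self.metrics = metrics
        self.parallelisations = parallelisations
        self.datasets = datasets

    def run(self):
        # Evaluate every combination of learner, parallelisation method,
        # and dataset, recording each requested metric.
        results = []
        for learner in self.learners:
            for method in self.parallelisations:
                for dataset in self.datasets:
                    for metric in self.metrics:
                        results.append((learner, method, dataset, metric))
        return results

# An experiment script then only lists its configuration and calls run():
exp = Experiment(
    learners=["logistic_regression"],
    metrics=["accuracy"],
    parallelisations=["radon_machine"],
    datasets=["codrna"],
)
results = exp.run()
```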

Use the parallel Radon machine independently

  • There are two implementations of the parallel Radon machine: radonPoints.radnonBoosting.RadonBoost and spark.radonBoosting.RadonBoost.
  1. Use radonPoints.radnonBoosting.RadonBoost for a sequential version of the algorithm, written entirely in Python, that is well suited to controlled, simulated environments. This class receives the whole training dataset as two numpy arrays X, y.
  2. Use spark.radonBoosting.RadonBoost for a parallel implementation of the algorithm on Spark using PySpark. This class receives the whole training set as an RDD of LabeledPoints.
  • The RadonBoost class takes as an initialization parameter an instance of the sklearn machine learning algorithm to be parallelised. You can use any algorithm with a linear model (and thus a coef_ attribute).
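The aggregation step at the heart of the Radon machine computes a Radon point of d + 2 model parameter vectors in R^d. The following numpy sketch illustrates the underlying geometry; it is not the repository's implementation:

```python
import numpy as np

def radon_point(points):
    """Radon point of n = d + 2 points in R^d (rows of `points`).

    Solves sum_i lam_i * x_i = 0 with sum_i lam_i = 0 for a nontrivial
    lam, then averages the points with positive coefficients. By Radon's
    theorem the result lies in the convex hulls of both parts of the
    induced partition.
    """
    n, d = points.shape
    assert n == d + 2, "a Radon point needs d + 2 points in R^d"
    # Homogeneous system: coordinates stacked on a row of ones.
    A = np.vstack([points.T, np.ones(n)])      # shape (d + 1, d + 2)
    # A nontrivial null-space vector: last right-singular vector of A.
    lam = np.linalg.svd(A)[2][-1]
    pos = lam > 0
    return lam[pos] @ points[pos] / lam[pos].sum()
```

The parallel Radon machine trains d + 2 linear models on disjoint chunks of the data, replaces them by the Radon point of their coefficient vectors, and iterates this aggregation up a tree until a single model remains.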

Known issues

The implementation of the learning framework is not perfect, and there are some issues we experienced with it, which I will list below. If you experience other problems, please let me know. I'll try to fix all issues as soon as possible.

  • Stratified sampling on Spark is slow, and sometimes produces wrong results.
  • When executing the Spark algorithm on a real cluster, reading files from a shared filesystem does not work. Instead, please upload the files to HDFS, into a folder called "sparkDatasets" in your home folder.
  • Applying base learners to small samples of the data is unusually slow. This seems to be a PySpark limitation: the data must be serialized out of the RDD in order to feed it to the Python learning algorithm. I'm currently working on a Scala implementation of the parallel Radon machine, which will solve this issue and probably some of the others as well.

Radon Machine in Scala

Contents

  • Get the code running
  • Run an experiment
  • Use the parallel Radon machine independently

Get the code running

  • Please ensure that you have the following software installed:
  1. Spark with Hadoop
  2. Scala
  • Download the entire folder "ParallelRadonMachine_scala".
  • Import the project in the folder "annotations" as a Java project, build it, and install it into your local Maven repository
  • Import the project in the folder "parallelradonmachine" as a Maven project
  • To test that the software is running, run one of the test cases in the folder "unittests"
  • Build the project as a "fat jar" using Maven

Run an experiment locally

  • take one of the experiment descriptions in the package "experiments/mlj/classification/"
  • set the base path of the experiments (the location of the experiments folder) in the variable expBasePath
  • ensure that clusterMode is set to "false"
  • run the Scala file as a Scala application

Run an experiment on a cluster

  • take one of the experiment descriptions in the package "experiments/mlj/classification/"
  • set the base path of the experiments (the location of the experiments folder) in the variable expBasePath
  • ensure that clusterMode is set to "true"
  • ensure that the dataset you want to use has been uploaded to the cluster's HDFS (you might have to adapt the base path to the dataset in the Scala file HdfsDataUtil.scala)
  • build a fat jar using Maven
  • copy the fat jar to the cluster's file system
  • submit a Spark application using this fat jar, setting the selected experiment description class as the entry point
  • results will be written to the cluster's file system, not to HDFS

Use the parallel Radon machine independently

  • import eu.ferari.spark.prm.core.ParallelRadonMachine
  1. either import a base learner (wrappers for Spark and WEKA learners can be found in the core.learner.wrapper package),
  2. or use your own base learner. In the latter case, make sure that it extends the class Learner[M <: Model]
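The contract a base learner has to fulfil is roughly the following: it must produce a model whose parameters can be read out as a vector and overwritten with a new one, since that is what the Radon aggregation operates on. A hypothetical Python analogue of this idea (the actual contract is defined by Learner[M <: Model] in the repository; all names below are illustrative):

```python
import numpy as np

# Hypothetical analogue of the Learner contract: a wrapper exposes the
# model parameters as a vector and accepts a new vector back, so that
# Radon aggregation can replace them. Ordinary least squares serves as
# a stand-in base learner here.

class LeastSquaresLearner:
    def fit(self, X, y):
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def get_parameters(self):
        # Read the model parameters out as a vector.
        return self.coef_

    def set_parameters(self, theta):
        # Overwrite the model parameters with an aggregated vector.
        self.coef_ = theta

    def predict(self, X):
        return X @ self.coef_

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
model = LeastSquaresLearner().fit(X, y)
```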

Known issues

The implementation of the learning framework is not perfect, and there are some issues we experienced with it, which I will list below. If you experience other problems, please let me know. I'll try to fix all issues as soon as possible.

  • the runtime of the experiments depends strongly on the cluster and Spark settings. Please make sure that your Spark configuration provides a large number of available processors per worker, and that all workers have enough RAM to process their local chunks of data.
  • we have used a Spark cluster managed by YARN and have not tested this implementation on any other cluster setup. If you have issues, please contact us.
  • if you want to write your own wrapper for a base learner, the biggest difficulty is usually extracting the model parameters from the base learner's implementation and later overwriting them with new ones. We used Scala reflection and some blatant resetting of object privacy settings for the WEKA learners. If you have written a wrapper for some learner class, it would be great if you could commit it to this repository. Thanks a lot!

License

Copyright 2017

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.