
Weka for Java noobs

For the project I talked about, I wanted to do some prediction. The data set had about 50k entries and 200 attributes, and I first tried caret for R. It was incredibly slow: matrix operations in R are fast, but other algorithms are just slow. So I looked around a bit for alternatives. I knew some people use C# for such tasks, but since I don’t work on Windows I didn’t think that was the ideal setup.

Lately, the JVM has received a lot of attention, so I looked for ML libraries for Java, Clojure and Scala and settled on Weka. I had used it a few years ago, but only through its GUI.

I have never really programmed in Java, so here are the first steps in Weka for Java noobs.

Install Java and Weka

I work on Linux for most of my data work. If you are on Debian or Ubuntu, you have to install the following packages:

sudo apt-get install java-common

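This package provides the common infrastructure for Java runtimes on Debian. The JRE (Java Runtime Environment) itself, which allows you to run .jar files, comes with the OpenJDK packages below.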

sudo apt-get install openjdk-7-jdk openjdk-7-source openjdk-7-doc openjdk-7-jre-headless openjdk-7-jre-lib

This installs the JDK (Java Development Kit), which includes the compiler (javac) along with the sources and documentation.

sudo apt-get install weka

Finally, you can install Weka, which includes tons of different algorithms. Here’s a short overview:

Classification algorithms (examples)

| Bayes | Functions | Trees | Meta |
|---|---|---|---|
| NaiveBayes | LibSVM | J48 | Additive Regression |
| BayesNet | MultilayerPerceptron | M5 Model trees | Bagging |
| Bayesian Logistic Regression | Linear Regression | RandomForest | Random Subspace |

Full list: Classification schemes in Weka

Then there are clusterers (EM, Simple K Means, Hierarchical Clusterer, …), algorithms for attribute selection such as PCA, Stepwise and Forward Selection, and tools for preprocessing: Resampling, Stratification, Normalization, etc. It’s pretty mature.

Set the right class path

export CLASSPATH=".:/usr/share/java/weka.jar"

At the very least, I had to set the CLASSPATH so that Java finds the Weka library. You also have to add the current directory (“.”) so that Java finds the classes you compile yourself.
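If you are not sure whether the CLASSPATH is picked up, a quick check can help. The following is only a small sketch using the standard library; the class name ClasspathCheck and the helper isOnClasspath are made up for this illustration and are not part of Weka. It prints the classpath the JVM actually sees and tries to locate weka.core.Instances, the class used in the example later in this post.

```java
public class ClasspathCheck {
    // Hypothetical helper: returns true if the named class can be located
    // on the current classpath, false otherwise.
    public static boolean isOnClasspath(String className) {
        try {
            // initialize = false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The classpath as seen by the running JVM
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        // weka.core.Instances is the class imported in the example further down
        System.out.println("Weka found: " + isOnClasspath("weka.core.Instances"));
    }
}
```

If the last line reports false, the path to weka.jar in your CLASSPATH is probably wrong for your system.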

Convert your data

Weka uses its own file format called ARFF. It’s basically a CSV file with a header that defines the data type of each column. If you are working with CSV files, Weka provides an easy way to convert them:

java weka.core.converters.CSVLoader data.csv > data.arff
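To make the format a bit more concrete, here is a rough sketch that writes a tiny ARFF document by hand. It is not how CSVLoader works internally, and it assumes that all columns are numeric and that attribute names contain no spaces or special characters. The class ArffSketch, the method toArff and the sample values are invented for this illustration.

```java
public class ArffSketch {
    // Builds a minimal ARFF document: a @relation line, one @attribute line
    // per column (all assumed numeric), then the rows as comma-separated values.
    public static String toArff(String relation, String[] names, double[][] rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String name : names) {
            sb.append("@attribute ").append(name).append(" numeric\n");
        }
        sb.append("\n@data\n");
        for (double[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(row[i]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data, only to show the layout
        String[] names = {"price", "size", "rooms"};
        double[][] rows = {{250000.0, 80.5, 3.0}, {310000.0, 95.0, 4.0}};
        System.out.print(toArff("houses", names, rows));
    }
}
```

For real data the converter above is the better choice, since it also handles nominal and string columns.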

Write your environment

Now you can load the file in Java, apply preprocessing, estimate a model and print it. Here’s a complete example using M5 rules to estimate a numeric value.

import weka.classifiers.rules.M5Rules;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class M5Test {
    public static void main(String[] args) throws Exception {
        // read the ARFF file
        DataSource source = new DataSource("./data.arff");
        Instances data = source.getDataSet();

        // set the outcome column (index starts at 0);
        // for the last column use data.numAttributes() - 1
        data.setClassIndex(0);

        // set up the classifier options (see the doc)
        String[] options = {"-R"};

        // create the classifier, set its options and fit it to the data
        M5Rules rule = new M5Rules();
        rule.setOptions(options);
        rule.buildClassifier(data);

        // print the fitted model
        System.out.println(rule.toString());
    }
}

You need to import weka.core.Instances and weka.core.converters.ConverterUtils.DataSource for reading the file. Once your file is read, the next step is to set the outcome, i.e. the column you want to predict. Afterwards you set the parameters for your classifier. You first have to import it, with import weka.classifiers.rules.M5Rules in this example. Then set up the options, which you can find in the dev doc. Once you are done with this, you can create a classifier object, set its options and build the classifier. Afterwards you can easily output your model using the toString() method. This is a basic file which just works.

Compile and run your code

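Now you can save your .java file (the file name has to match the name of the public class, e.g. M5Test.java for the example above), then compile and run it with: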

javac ClassName.java && java ClassName

After a few seconds you should get your model. And that’s all.

More information

For more examples and ideas on how to use preprocessing and testing check out: Use Weka in your Java code.

The development docs also come in very handy; they list all classes and functions along with their API specifications.
