Weka for Java noobs

For the project I mentioned earlier, I wanted to do some prediction. The data set had about 50k entries and 200 attributes, and first I tried caret for R. It was incredibly slow. Matrix operations are fast in R, but other algorithms are just slow. So I looked around a bit for alternatives. I knew some people use C# for such tasks, but since I don’t work on Windows I didn’t think this was the ideal setup.

Lately, the JVM has received a lot of attention, so I looked for ML libraries for Java, Clojure and Scala and settled on Weka. I had used it a few years ago, but only through its GUI.

I had never really programmed in Java, so here are the first steps in Weka for Java noobs.

Install Java and Weka

I work on Linux for most of my data work. If you work on Debian or Ubuntu, you have to install the following packages:

This package installs the JRE (Java Runtime Environment), which allows you to run compiled Java programs (.class and .jar files).

This installs the JDK (Java Development Kit), which includes the compiler (javac) and the other development tools.

Finally, you can install Weka, which includes tons of different algorithms. Here’s a short overview:

Classification algorithms (examples)

Bayes                         Functions             Trees            Meta
NaiveBayes                    LibSVM                J48              Additive Regression
BayesNet                      MultilayerPerceptron  M5 Model trees   Bagging
Bayesian Logistic Regression  Linear Regression     RandomForest     Random Subspace

Full list: Classification schemes in Weka

Then there are clustering algorithms (EM, Simple K-Means, Hierarchical Clusterer, …), algorithms for attribute selection such as PCA, stepwise and forward selection, and preprocessing filters: resampling, stratification, normalization, etc. It’s pretty mature.

Set the right class path

Next, I had to set the CLASSPATH so that Java can find the Weka library. You also have to add the current directory (“.”) so that Java finds your own compiled classes.
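If you are not sure whether the class path is set correctly, a small check like the following can help. This is only a sketch: the class name ClasspathCheck and the helper isOnClasspath are made up for this post, and the only Weka name it relies on is weka.core.Instances, which is used later on. It prints every class path entry the JVM sees and reports whether Weka can be found.

```java
import java.io.File;

public class ClasspathCheck {
    // Hypothetical helper: returns true if the named class can be located
    // on the class path of the running JVM.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // java.class.path holds the effective class path of the running JVM
        String classpath = System.getProperty("java.class.path");
        for (String entry : classpath.split(File.pathSeparator)) {
            System.out.println("classpath entry: " + entry);
        }
        // weka.core.Instances is the class imported later in this post
        System.out.println("Weka found: " + isOnClasspath("weka.core.Instances"));
    }
}
```

If the last line says false, the Weka jar is most likely missing from the CLASSPATH.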

Convert your data

Weka uses its own file format called ARFF. It’s basically a CSV file with a header that defines the data type of each column. If you are working with CSV files, Weka provides an easy way to convert them.
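To give an idea of what such a file looks like, here is a rough sketch in plain Java that writes the ARFF text for a tiny table by hand. It is not Weka’s converter, just an illustration of the format. The class ArffSketch, the method toArff and the example data (houses, rooms, area, price) are invented for this post, and the sketch assumes that every column is numeric and that attribute names contain no spaces.

```java
import java.util.Arrays;
import java.util.List;

public class ArffSketch {
    // Hypothetical helper: builds ARFF text for a table with numeric columns only.
    static String toArff(String relation, String[] columns, List<double[]> rows) {
        StringBuilder sb = new StringBuilder();
        // Header: the relation name, then one @attribute line per column
        sb.append("@relation ").append(relation).append("\n\n");
        for (String column : columns) {
            sb.append("@attribute ").append(column).append(" numeric\n");
        }
        // Data section: plain comma-separated rows, just like a CSV file
        sb.append("\n@data\n");
        for (double[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) sb.append(',');
                sb.append(row[i]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up example data
        String arff = toArff("houses",
                new String[]{"rooms", "area", "price"},
                Arrays.asList(new double[]{3, 75.5, 210000},
                              new double[]{5, 120.0, 340000}));
        System.out.print(arff);
    }
}
```

The header consists of the @relation and @attribute lines; everything after @data is ordinary comma-separated values. Non-numeric columns need other attribute types (for example nominal ones), which this sketch leaves out.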

Write your environment

Now you can load the file into Java, apply preprocessing, and estimate and output a model. Here’s a complete example using M5 rules to estimate a numeric value.

You need to import weka.core.Instances and weka.core.converters.ConverterUtils.DataSource for reading the file. Once your file is read, the next step is to set the outcome, i.e. the column you want to predict. Afterwards you set the parameters for your classifier. You first have to import it, with import weka.classifiers.rules.M5Rules in this example. Then set up the options, which you can find in the dev doc. Once that is done, you can create a classifier object, set its options and build the classifier. Afterwards you can easily output your model using the toString() method. This is a basic file that just works.

Compile and run your code

Now you can save your .java file, compile and run it with:

After a few seconds you should get your model. And that’s all.

More information

For more examples and ideas on how to use preprocessing and testing check out: Use Weka in your Java code.

The development docs also come in very handy; they list all classes and methods along with their API specifications.
