Bottle, Flup and lighttpd

This is just a short post because I spent a few hours yesterday getting this running, although it’s pretty easy.

Bottle application

In case you didn’t know, Bottle is a micro web framework for Python. Here’s a simple application (which is totally insecure):

If you want to run your bottle application with lighttpd and FastCGI you need to install flup.

Flup

What is flup? Flup is a random assortment of WSGI servers, including the one that matters most for us: fcgi (FastCGI). You can install it with pip:

Now we can run our application with flup.

Run bottle with flup

As you can see you just add server="flup". That’s it. Also notice that I route to /app/. You will see in the next step why I do this. The next step is to install and configure lighttpd.
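For reference, here is a minimal sketch of what such an application could look like. It is not the original listing: the handler name greet, the host and the port 8080 are assumptions for illustration, and unlike the insecure version mentioned above it escapes the name before echoing it back.

```python
import html


def greet(name):
    """Handler for /app/<name>; escapes the name before echoing it back."""
    return "Hello %s!" % html.escape(name)


def serve():
    # Imported here so greet() can be tried out without Bottle installed.
    from bottle import Bottle

    app = Bottle()
    # Route under /app/ so it matches the prefix lighttpd forwards to us.
    app.route("/app/<name>", callback=greet)
    # server="flup" makes Bottle speak FastCGI instead of plain HTTP.
    # Host and port are assumptions; they must match the lighttpd config.
    app.run(server="flup", host="127.0.0.1", port=8080)
```

Calling serve() at the end of the file starts the FastCGI server and blocks until you stop it.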

Get lighttpd running

Again if you’re running a Debian-based distribution it’s just:

Start by making a backup copy of the lighttpd.conf which is in /etc/lighttpd/lighttpd.conf:

Now we can edit it and connect it to our bottle application. First you need to add mod_fastcgi with:

Now you can add the fastcgi handlers which look like this:

In line 2 you can see that we define a directory (/app/). That means that every request to our server under /app/ will be handled by bottle. The host in line 5 is your bottle server, which runs on localhost. The port in line 6 should be the same as in your bottle app. check-local makes lighttpd check first whether there’s a local /app/ directory; you want bottle to handle these requests, so disable it.

Get everything working together

Start lighttpd:

Next you want to start your bottle application:

And that’s it. If you connect to your HTTP server at http://localhost:80 you will see the standard lighttpd welcome page. If you now go to http://localhost/app/friend you will see bottle’s output.

Why did this take you so long?

Good question. The problem was that a lot of configuration examples don’t use fastcgi.server with the host and port settings but with bin-path and socket. In this case lighttpd will start your bottle server itself and connect to it over a UNIX socket. The problem is that the latest version of bottle will try to bind to a host:port pair. I tried it out, but it doesn’t seem that you can override this. So yeah, that’s the simple solution.

Let’s start with the updates. Yesterday, I announced that I would publish more projects on github. The first one is the norepost bot for Reddit (blog post):

Norepost bot for Reddit (on github)

This bot crawls the /new page of a subreddit to detect reposts. It uses special filtering for YouTube links; for everything else it checks whether the same URL was submitted before.
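To give an idea of what such filtering can look like, here is a small sketch. It is not taken from the bot itself: the function names repost_key and is_repost are made up, and the assumption is that YouTube links are compared by video id while everything else is compared by the full URL.

```python
from urllib.parse import urlparse, parse_qs


def repost_key(url):
    """Return a key that is equal for links pointing at the same content."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # YouTube: the same video can be linked in several ways,
    # so compare by video id instead of by the raw URL.
    if host == "youtu.be":
        return "youtube:" + parts.path.lstrip("/")
    if host in ("youtube.com", "m.youtube.com"):
        video_ids = parse_qs(parts.query).get("v")
        if video_ids:
            return "youtube:" + video_ids[0]
    # Everything else: fall back to the URL as submitted.
    return url


seen = set()


def is_repost(url):
    """Remember every key we have seen and report repeats."""
    key = repost_key(url)
    if key in seen:
        return True
    seen.add(key)
    return False
```

A real bot would keep the seen keys in a database instead of an in-memory set so they survive a restart.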

The second one is the mlg soundboard (blog post):

MLG Soundboard using Web Audio API (on github)

This features a simple soundboard using the Web Audio API and distortion effects.

Let’s go to the main feature: BaddieBot

Yesterday, I saw a request in /r/RequestABot for a bot which posts other submissions by the same user and allows people to subscribe to other users’ submissions. This was quite interesting, though the author mentioned another bot that does the same. So I looked for that bot’s source code, just out of interest, and found a github repository which could be related to the bot. It was written in Java. Somehow, I don’t know exactly why, I thought: “Maybe I should rewrite it in Python because I like Python and it would be fun”.

Using old source code

The two main requirements for the bot were:

2. Allow users to subscribe for a pm on Reddit when a certain user posts a new story in /r/badpeoplestories (via a link in the comment showing past submissions)

The first point was easily done because my other bot does a similar thing. I knew that I can use reddit’s search to find other people’s submissions. I looked up modifiers and saw that there’s an “author:” modifier.
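As a rough sketch, a search restricted to one author and one subreddit can be expressed as a URL like the one built below. The helper name author_search_url is made up, and whether the bot talks to the search endpoint directly or goes through a wrapper library is not something this post shows.

```python
from urllib.parse import urlencode


def author_search_url(subreddit, author):
    """Build a search URL for all submissions by `author` in `subreddit`."""
    params = urlencode({
        "q": "author:" + author,   # the "author:" search modifier
        "restrict_sr": "on",       # only search inside this subreddit
        "sort": "new",             # newest submissions first
    })
    return "https://www.reddit.com/r/%s/search.json?%s" % (subreddit, params)


print(author_search_url("badpeoplestories", "some_user"))
```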

I copied my other bot and started deleting stuff I no longer needed. I edited the search parameters and the first step, and changed the message it should post. Done.

The second point was a bit more interesting. I expanded my database to add subscribers, added a loop that looks for new messages and parses them, and added an unsubscribe function. While doing this I felt like all the time I had spent on bots and scrapers in the past just came together in that moment, and I knew exactly what I should do and what I should expect. It was fantastic.
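The subscriber handling can be sketched roughly like this. The table layout, the function names and the message format ("subscribe <author>" / "unsubscribe <author>") are assumptions for illustration, not the bot’s actual schema or protocol.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and make sure the subscriptions table exists."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS subscriptions ("
        " subscriber TEXT NOT NULL,"
        " author TEXT NOT NULL,"
        " PRIMARY KEY (subscriber, author))"
    )
    return db


def subscribe(db, subscriber, author):
    with db:  # commits on success
        db.execute("INSERT OR IGNORE INTO subscriptions VALUES (?, ?)",
                   (subscriber, author))


def unsubscribe(db, subscriber, author):
    with db:
        db.execute("DELETE FROM subscriptions WHERE subscriber = ? AND author = ?",
                   (subscriber, author))


def subscribers_of(db, author):
    """Everyone who wants a pm when `author` posts something new."""
    rows = db.execute(
        "SELECT subscriber FROM subscriptions WHERE author = ? ORDER BY subscriber",
        (author,))
    return [row[0] for row in rows]


def handle_message(db, sender, body):
    """Parse one inbox message; return True if it was a valid command."""
    words = body.strip().split()
    if len(words) != 2:
        return False
    command, author = words[0].lower(), words[1]
    if command == "subscribe":
        subscribe(db, sender, author)
    elif command == "unsubscribe":
        unsubscribe(db, sender, author)
    else:
        return False
    return True
```

The message loop then only has to call handle_message for every unread message and subscribers_of whenever a new submission shows up.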

I tested it a bit and messaged the requester. He quickly created an account and tested it again. So far it works fine, and the best thing is that it took me less than 2 hours from start to finish, including testing.

The source code is online on Github: BaddieBot for Reddit

Notes on Data Mining Cookbook

I remember a post on Hacker News about 4 years ago by some guy who built a cool app that examined when you should ideally post on Hacker News to get your post to the front page. He recommended a book called Data Mining Cookbook by Olivia Parr Rud. I have had a copy lying around since then and never looked at it.

Chapter 1

She describes using genetic programming for model selection. I found this idea really interesting and had never seen it before. I may try it out.

Chapter 2: Selecting the Data

Offer History Database: The idea is to log all offers you made to a specific person. You can track the customer id, campaign id and the response.

If you build a new model, check whether the data was filtered. This is an interesting observation: if you use data which was pre-selected, you can’t really build a model for the whole population.

Suppose you have multiple mailings to successively smaller groups. For example, you mail 50k prospects, then the best 50% based on some score, and then again the best 25%. You can combine the data and create a column for each response. You can still build your models; the predicted probabilities are no longer correct, but the ranking still works.

Chapter 4: Selecting and transforming the variables

Find interactions with tree-based algorithms and use them in your logistic models

Chapter 6: Validating the model

For discrete outcomes: Sort by model score and create percentile groups; compare outcomes and attributes in these groups

Chapter 7: Implementing and maintaining the model

Calculate the model life-time and recheck every period

Chapter 8: Understanding your customer: Profiling and Segmentation

Market-driven segmentation: use customer attributes to segment your data

Penetration analysis: compare demographic data of your customers against your market. Calculate your penetration index (% market / % customer) * 100 and try to acquire more customers in the segments where your penetration index is the highest

Customer Value Analysis: 2×2 matrix (risk vs. revenue) then split up each cell into its demographics and/or behavioral attributes which can lead to groups like “business builders” or “Risky Revenue”

Chapter 9: Target New Prospects: Modeling Response

For each continuous attribute check if there are possible segments and transformations; regress with stepwise on your outcome and select best fitting variables

Chapter 12: Targeting Profitable Customers: Modeling Lifetime Value

$\text{discount rate} = ((1+\text{credit})(\text{risk factor})) ^ {(\text{year} + AR/365)}$

Conclusion

The book was written in 2001 and for that it’s fantastic. I was too young to be interested in data mining or analytics in 2001, but if I had been older this book would have been a gem. If you have never worked with data before, I can recommend this book to you. The author focuses less on the model (she mainly uses logistic regression with stepwise and best-subset selection) and more on the work around models: finding outliers, handling missing values, finding good attributes and presenting the results. In my opinion most books neglect this, and lots of beginners know about SVMs and Random Forests but have no idea how to properly apply them.

Weka for Java noobs

For the one project I talked about, I wanted to do some prediction. The data set had about 50k entries and 200 attributes, and first I tried caret for R. It was incredibly slow. Matrix operations are fast, but other algorithms are just slow. So I looked around a bit for alternatives. I knew some people use C# for such tasks, but since I’m not working on Windows I didn’t think that this was the ideal setup.

Lately, the JVM has received a lot of attention, so I looked for ML libraries for Java, Clojure and Scala and settled on Weka. I used it a few years ago, but only with its GUI.

I have never really programmed in Java, so here are the first steps in Weka for Java noobs.

Install Java and Weka

I work on Linux for most of the data work. If you work on Debian or Ubuntu you have to install the following packages:

This package installs the JRE (Java Runtime Environment) which allows you to run .jar files.

This installs the JDK (Java Development Kit), which includes the compiler and all that stuff.

Finally you can install weka which includes tons of different algorithms. Here’s a short overview:

Classification algorithms (examples)

| Bayes | Functions | Trees | Meta |
| --- | --- | --- | --- |
| BayesNet | MultilayerPerceptron | M5 Model trees | Bagging |
| Bayesian Logistic Regression | Linear Regression | RandomForest | Random Subspace |

Full list: Classification schemes in Weka

Then there are clusterers (EM, Simple K Means, Hierarchical Clusterer, …), algorithms for attribute selection like PCA, Stepwise and Forward Selection, and tools for preprocessing: Resampling, Stratification, Normalization, etc. It’s pretty mature.

Set the right class path

Next I had to set the CLASSPATH so that Java can find the libraries. You also have to add the current directory (“.”) so that Java finds your own files.

Weka uses its own file format called arff. It’s basically a csv file with a header which defines the data-types of each column. If you are working with CSV files Weka provides an easy way to convert your files.
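To show what that header looks like, here is a small sketch that turns CSV text into ARFF text. It only illustrates the format and is no replacement for Weka’s own converter: the function name csv_to_arff is made up, column types are guessed (numeric if every value parses as a number, nominal otherwise), and quoting and missing values are ignored.

```python
import csv
import io


def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="data"):
    """Convert CSV text (first row = column names) to ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]

    lines = ["@relation " + relation, ""]
    for index, name in enumerate(header):
        column = [record[index] for record in records]
        if all(is_number(value) for value in column):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: the header lists all allowed values.
            values = sorted(set(column))
            lines.append("@attribute %s {%s}" % (name, ",".join(values)))

    # After the header the data follows as plain comma-separated rows.
    lines.append("")
    lines.append("@data")
    lines.extend(",".join(record) for record in records)
    return "\n".join(lines) + "\n"


print(csv_to_arff("size,color,price\n1.5,red,10\n2.0,blue,12\n", relation="demo"))
```

The printed result starts with @relation demo, declares size and price as numeric and color as a nominal attribute with the values blue and red, and then repeats the two data rows unchanged after @data.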

Now you can load the file into Java, apply preprocessing, and estimate and output a model. Here’s a complete example using M5 rules to estimate a numeric value.

You need to import weka.core.Instances and weka.core.converters.ConverterUtils.DataSource for reading the file. Once your file is read, the next step is to set the outcome, i.e. which column you want to predict. Afterwards you set the parameters for your classifier. You first have to import it, with import weka.classifiers.rules.M5Rules in this example. Then set up the options, which you can find in the dev doc. When you are done with this you can create a classifier object, set its options and build the classifier. Afterwards you can easily output your model using the toString() method. This is a basic file which just works.