Explanation vs. Prediction

Again, an older post I had lying around. Nonetheless, the topic is still relevant.

I recently read a great paper named To Explain or To Predict? by Galit Shmueli. She explains the differences between “old-school” explanatory statistics and predictive statistics. I have observed many of the things she describes myself.

For one, prediction is often regarded as unscientific, so good literature on it is somewhat scarce; lately the situation has improved with the rise of machine learning.
Nonetheless, most students don’t learn how to make predictions, and you still see people using $R^2$ to validate models.
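
To illustrate why $R^2$ on the fitting data is a poor way to validate a predictive model, here is a minimal sketch. The simulated data and the 1-nearest-neighbour "model" are my own toy assumptions, not something taken from Shmueli's paper: a model that simply memorizes the data reaches a perfect in-sample $R^2$ and still predicts new data poorly.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def nearest_neighbour_predict(train_x, train_y, x):
    """Predict y for x by copying the y of the closest training point."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]


# Simulated data (assumption): a linear signal plus a lot of noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0, 5) for x in xs]

train_x, train_y = xs[:100], ys[:100]
test_x, test_y = xs[100:], ys[100:]

# In-sample: every training point is its own nearest neighbour.
r2_in = r_squared(
    train_y, [nearest_neighbour_predict(train_x, train_y, x) for x in train_x]
)
# Out-of-sample: the memorized noise does not carry over to new data.
r2_out = r_squared(
    test_y, [nearest_neighbour_predict(train_x, train_y, x) for x in test_x]
)

print(f"in-sample R^2:     {r2_in:.3f}")
print(f"out-of-sample R^2: {r2_out:.3f}")
```

The in-sample value says nothing about predictive quality; only the score on data the model has not seen does.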

Sure, there are some departments that teach how to predict, but they are still in the minority. Of course, there’s this other trend with Big Data. Personally, I’m not so much excited by Big Data as by data in general.

More Info: http://galitshmueli.com/explain-predict

I wrote this post more than 2 years ago. Since then machine learning has become something of a commodity, at least on a smaller scale, and something strange has happened. Some of the people who work with data but never learned sound statistical techniques have started trying to explain data, which is pretty terrible. It even seems that they are trying to reinvent statistics. Yesterday I read a post called Why big data is in trouble: they forgot about applied statistics, which captured this pretty nicely.

The table at the bottom is just unbelievable. It lists different fields and the application of “big data” or “data science” in each. It also claims that in 2012 these approaches finally started to enter fields like biology, economics, engineering, etc., which is more sad than hilarious. So yeah, I didn’t expect this turn.

Furthermore, I have seen more and more “data science” boot camps and programs popping up that still neglect statistical foundations, resulting in even more terrible studies. This trend will probably follow the Gartner Hype Cycle. As far as I can tell, the peak has already been reached; disappointment will set in next, and in a few years the field will actually reach its plateau. Here is the latest Hype Cycle from July 2013:

I see the term “prescriptive analytics” on there and just looked it up. It’s astonishing that people keep inventing new terms for the same stuff and it still works. Business intelligence is basic statistics; then came predictive analysis (still statistics), data science (hey, statistics), and now prescriptive analytics (still statistics).

I just have to share one of my favorite quotes on this topic:

Someone (can’t recall the source, sorry) recently defined “data scientist” as “a data analyst who lives in California.” —baconner

#18/25: Statistics for technology

I love reading great books about statistics, and Christopher Chatfield is probably one of the greatest educators in statistics. Statistics for technology is a great introduction to statistics that isn’t too theoretical. I won’t provide a summary because the topics are rather elementary. However, I found some interesting things in this book, e.g.

A scientific experiment has some or all of the following characteristics.

The physical laws governing the experiment are not entirely understood
The experiment may not have been done before, at least successfully
There are strong incentives to run the smallest number of the cheapest tests as quickly as possible
The experimenter may not be objective, as for example when an inventor tests his own invention or when a company tests competitive products
Experimental results are unexpected or disappointing
Although experimental uncertainty may be present, many industrial situations require decisions to be made without additional testing or theoretical study

It is often equally important to know how spread out the data is. For example suppose that a study of people affected by a certain disease revealed that most people affected were under two years old or over seventy years old; then it would be very misleading to summarize the data by saying ‘average age of persons affected is thirty-five years’.
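
The age example from the quote is easy to reproduce. The following sketch uses made-up ages (my assumption, chosen so that the mean comes out at thirty-five) to show how the mean alone hides the fact that nobody in the data is anywhere near that age.

```python
import statistics

# Hypothetical ages of affected people: only the very young and the very old.
ages = [0, 1, 1, 1, 1, 2, 71, 73, 75, 78, 82]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
spread = statistics.stdev(ages)

# Distance between the mean and the person whose age is closest to it.
closest_gap = min(abs(age - mean_age) for age in ages)

print(f"mean={mean_age}, median={median_age}, stdev={spread:.1f}")
print(f"closest person is {closest_gap} years away from the mean")
```

Reporting the spread (or simply plotting the ages) next to the mean makes the two separate groups visible immediately.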

There are some great examples in this book which, in my opinion, make statistics more interesting for students. The examples are rather technical, which is to be expected given the title.

All in all, I can recommend this book if you want to learn a bit about (technical) statistics. Great book!!

#6/25: Problem Solving: A statistician’s guide

Rules:

Do not attempt to analyse the data until you understand what is being measured and why. Find out whether there is any prior information about likely effects.

Find out how the data were collected.

Look at the structure of the data.

The data then need to be carefully examined in an exploratory way, before attempting a more sophisticated analysis.

Use your common sense at all times.

Report the results in a clear, self-explanatory way.

Thus a statistician needs to understand the general principles involved in tackling statistical problems, and at some stage it is more important to study the strategy of problem solving rather than learn yet more techniques (which can always be looked up in a book).

What’s the objective? What’s the aim? What’s important and why?
How was the data selected? How good is its quality?
How will the results be used? Simple vs. complicated models

Check the existing literature => it can make the study redundant, or help you collect better data and avoid repeating fundamental errors

collecting

Test as much as possible in your collection, i.e. pretesting surveys, account for time effects, order of different studies, etc.
Getting the right sample size is often difficult as well; sometimes it is too small, other times too large. Medical research in particular often relies on rules of thumb like 20 patients instead of properly derived sizes => Tip: look for previous research
Try to iterate over and over again to make the study better
Learn by experience. Do studies by yourself. It’s often harder than you think, esp. random samples. E.g. selecting random pigs in a herd
Anecdote: A pregnant woman had to wait for 3 hours and therefore had higher blood pressure -> the medical personnel assumed this blood pressure was constant and admitted her to a hospital.
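
As a rough illustration of what a "proper" sample size can look like compared with a rule of thumb, here is a minimal sketch. It uses the standard normal-approximation formula for comparing two group means, n = 2 * ((z_alpha + z_beta) * sigma / delta)^2 per group. The function name and the numbers (standard deviation 10, smallest relevant difference 5) are my own assumptions for illustration; a real study would take them from previous research, as the tip above suggests.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided comparison of two means.

    sigma: assumed common standard deviation
    delta: smallest difference in means worth detecting
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)


# Hypothetical planning values: sd of 10 units, difference of 5 units.
n_needed = sample_size_per_group(sigma=10.0, delta=5.0)
print(f"patients needed per group: {n_needed}")

# Halving the difference you want to detect roughly quadruples the size.
print(sample_size_per_group(sigma=10.0, delta=2.5))
```

Under these assumptions a flat "20 patients" would be far too few, which is exactly the kind of thing a quick calculation reveals before any data is collected.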

Always check the environment of the study

Non-responses can say a lot, don’t ignore them
questionnaire design: important! Learn about halo effects, social desirability, moral effects, etc.
Always pretest with a pilot study, if possible
The human element is often the weakest factor
Try to find pitfalls in your study, like James Randi

phases of analysis:

Look at data
Formulate a sensible model
Fit the model
Check the fit
Utilize the model and present conclusions

Whatever the situation, one overall message is that the analyst should not be tempted to rush into using a standard statistical technique without first having a careful look at the data.

model formulation:

Ask lots of questions and listen
Incorporate background theory
Look at the data
Experience and inspiration are important
trying many models is helpful, but can be dangerous; don’t select the best model based solely on the highest $R^2$ or a similar measure, and present the alternative models in your paper
alternatively: use a Bayesian approach for model selection
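
Why trying many models and keeping the one with the highest $R^2$ is dangerous can be shown with pure noise. In the sketch below (simulated data, all names and sizes are my assumptions) none of the 200 candidate predictors has any relation to the response, yet the best of them looks respectable on 20 observations. For a single predictor, $R^2$ is just the squared Pearson correlation.

```python
import random


def pearson_r2(x, y):
    """Squared Pearson correlation, i.e. R^2 of a one-predictor regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


rng = random.Random(1)
n_obs, n_candidates = 20, 200

# Response and candidate predictors are independent noise by construction.
y = [rng.gauss(0, 1) for _ in range(n_obs)]
candidates = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_candidates)]

# "Model selection": keep whichever predictor has the highest R^2.
best_r2 = max(pearson_r2(x, y) for x in candidates)

# Fresh data from the same process: the winner is still pure noise.
fresh_x = [rng.gauss(0, 1) for _ in range(1000)]
fresh_y = [rng.gauss(0, 1) for _ in range(1000)]
fresh_r2 = pearson_r2(fresh_x, fresh_y)

print(f"best R^2 among {n_candidates} noise predictors: {best_r2:.2f}")
print(f"R^2 of a noise predictor on fresh data:       {fresh_r2:.3f}")
```

Reporting how many models were tried, and showing the alternatives, lets the reader judge how much of the winning fit is selection rather than signal.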

model validation:

Is model specification satisfactory?
How about random component?
A few influential observations?
important feature overlooked?
alternative models which are as good as the used model?
Then iterate, iterate, iterate

Initial examination of data (IDA)

data structure, how many variables? categorical/binary/continuous?
Useful to reduce dimensionality?
ordinal data -> coded as numerical or with dummies?

data cleaning: coding errors, OCR errors, etc.
data quality: collection, errors & outliers => eyeballing is very helpful, 5-point summaries
missing values: MCAR (missing completely at random), imputation, EM algorithm
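
Imputation is not free of side effects. As a minimal sketch (toy numbers of my own, and plain mean imputation as the simplest possible variant, not a method the book prescribes), filling gaps with the observed mean keeps the mean intact but artificially shrinks the variance:

```python
import statistics

# Toy measurements with two missing values (None).
raw = [2.0, 4.0, None, 8.0, None, 6.0]

observed = [v for v in raw if v is not None]
fill_value = statistics.mean(observed)

# Mean imputation: replace every missing value with the observed mean.
imputed = [fill_value if v is None else v for v in raw]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(f"mean before/after:     {fill_value} / {statistics.mean(imputed)}")
print(f"variance before/after: {var_observed:.2f} / {var_imputed:.2f}")
```

Anything computed from the spread afterwards (standard errors, tests) will look more precise than the data justifies, which is one reason to think about why values are missing before filling them in.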

descriptive statistics

for the complete data set & interesting subgroups
5-point summary, IQR, tables, graphs
Tufte’s lie factor = apparent size of effect shown in the graph / actual size of effect in the data
graphs: units, title, legend
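
A 5-point summary and the IQR take only a few lines. The sketch below uses Python's statistics module on made-up data; the 1.5 x IQR fences for flagging suspicious values are a common convention I am assuming here, not a rule from the book.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Made-up data with one suspicious value at the end.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]

summary = five_number_summary(data)
low, q1, median, q3, high = summary
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles.
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print("summary:", summary)
print("IQR:", iqr)
print("flagged for a closer look:", outliers)
```

Flagged values are candidates for eyeballing, not for automatic deletion; whether 40 is an error or a real observation is a question about how the data were collected.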

data modification

test data transformation
estimate missing values
adjust extreme values
create new variables
try a Box-Cox transformation
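
For reference, the Box-Cox transformation is (x^lambda - 1) / lambda for lambda != 0 and log(x) for lambda = 0, defined for positive x. A minimal sketch on made-up right-skewed values (the data and the chosen lambdas are assumptions for illustration; in practice lambda is usually estimated from the data):

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a single positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


# Right-skewed toy data: each value doubles the previous one.
skewed = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]

for lam in (1.0, 0.5, 0.0):
    transformed = [box_cox(v, lam) for v in skewed]
    print(f"lambda={lam}:", [round(t, 2) for t in transformed])
```

With lambda = 1 the data are only shifted, while smaller lambdas pull in the long right tail more and more; lambda = 0 (the log) makes this doubling sequence evenly spaced.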

analysis

significance tests are widely overused, esp. in medicine, biology and psychology.
Statistically significant effects are not always interesting, esp. with big samples
non-significant is not always the same as no difference (the opposite of the previous point, esp. with small samples)
rigid enforcement of significance levels: why five percent and not four or one or whatever? This can lead to publication bias.
Estimates are more important, because they communicate relationships
Often the null hypothesis is silly, e.g. water doesn’t affect the growth of a plant

Better: Interesting results should be repeatable in general and under different conditions. (Nelder: significant sameness)
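
How much the sample size drives significance can be seen with a simple calculation. The sketch below uses a two-sample z-test with a known standard deviation; the function name and the numbers are my own assumptions for illustration. A practically irrelevant difference becomes "highly significant" with a million observations per group, while a sizeable difference stays "non-significant" with ten.

```python
import math
from statistics import NormalDist


def two_sample_z_p_value(mean_diff, sigma, n_per_group):
    """Two-sided p-value for an observed difference in means (known sigma)."""
    standard_error = sigma * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Tiny effect (1% of a standard deviation), huge sample.
p_tiny_effect = two_sample_z_p_value(mean_diff=0.01, sigma=1.0, n_per_group=1_000_000)

# Large effect (half a standard deviation), small sample.
p_large_effect = two_sample_z_p_value(mean_diff=0.5, sigma=1.0, n_per_group=10)

print(f"tiny effect, n = 1,000,000: p = {p_tiny_effect:.2e}")
print(f"large effect, n = 10:       p = {p_large_effect:.3f}")
```

The p-value mixes effect size and sample size into one number, which is why the estimate itself (with an interval) communicates far more than "significant" or "not significant".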

appropriate procedure

do more than just one type of analysis, e.g. parametric vs. non-parametric or robust
good robust methods are better than optimal methods with lots of assumptions
don’t use a method just because you are familiar with it
think in different ways about the problem
be prepared to make ad hoc modifications
you cannot know everything
analysis is more than just fitting the model

philosophical

the assumed model is often more important than frequentist vs. Bayesian

generally

learn your statistics software and a scientific programming language
learn to use a library, Google Scholar, and search in general

statistical consulting

work with the people; statistics isn’t about numbers, it’s about people
understand the problem and the objective
ask lots of questions
be patient
bear in mind resource constraints
write in clear language

numeracy

be skeptical
understand numbers
learn estimating
check dimensions
My book recommendation: Innumeracy
check for silly statistics: e.g. a mean outside of the range
avoid graphs without titles and labels
don’t use linear regression for non-linear data
check assumptions, e.g. multiple regression: more variables than observations
the first time I worked with real data, I saw how different the process was
=> Real work isn’t like your statistics 101 course; data is messy, and you don’t have an unlimited amount of time or money
courses let you think that you get the data, look for the perfect model and you’re done. In reality it is about 70% searching for data & thinking about pitfalls, 25% cleaning up the data and understanding it, and about 5% doing the actual analysis

The second half of the book is filled with awesome exercises. I’d recommend that everybody working with statistical techniques or with data check them out. They are insightful, interesting and stimulating. Furthermore, Chatfield shows that you can reveal insights with simple techniques.
Problem Solving: A statistician’s guide is a clear recommendation for everybody working with data on a daily basis, especially people with less than 2 to 5 years of experience. I close with a quote from D. J. Finney: Don’t analyze numbers, analyze data.

25 Books in 2012

So, I decided to do another book challenge this year. The starting date is maybe a bit late, but that’s OK. I want to do a reading challenge again, not because I haven’t read any books but rather because I was too lazy to write reviews or summaries of the books I’ve read.

In comparison to last year’s challenge where I read mostly business books, this year I will read mostly books about economics and statistics. You can see the preliminary reading list in the picture. Some books may change but the volume will probably be the same.

264 days and 25 books left. Let’s start!