#6/25: Problem Solving: A statistician’s guide

Rules:

  1. Do not attempt to analyse the data until you understand what is being measured and why. Find out whether there is any prior information about likely effects.
  2. Find out how the data were collected.
  3. Look at the structure of the data.
  4. The data then need to be carefully examined in an exploratory way, before attempting a more sophisticated analysis.
  5. Use your common sense at all times.
  6. Report the results in a clear, self-explanatory way.

Thus a statistician needs to understand the general principles involved in tackling statistical problems, and at some stage it is more important to study the strategy of problem solving than to learn yet more techniques (which can always be looked up in a book).

  • What’s the objective? What are the specific aims? What’s important and why?
  • How was the data selected? What is its quality?
  • How will the results be used? Simple vs. complicated models
  • Check the existing literature => it can make the study redundant, or it can help you collect better data and avoid repeating fundamental errors

collecting

  • Test as much as possible during collection, e.g. pretest surveys, account for time effects, the order of different studies, etc.
  • Getting the right sample size is often difficult; sometimes it is too small, other times too large; esp. medical research often relies on rules of thumb like 20 patients instead of properly derived sizes => Tip: look at previous research (see the power-analysis sketch after this list)
  • Try to iterate over and over again to make the study better
  • Learn by experience. Do studies yourself. It’s often harder than you think, esp. drawing random samples, e.g. selecting random pigs from a herd
  • Anecdote: A pregnant woman had to wait for 3 hours and therefore had elevated blood pressure -> the medical personnel assumed this blood pressure was her usual level and admitted her to hospital.
    • Always check the environment of the study
  • Non-responses can say a lot; don’t ignore them
  • questionnaire design is important! Learn about halo effects, social desirability, moral effects, etc.
  • Always pretest with a pilot study, if possible
  • The human element is often the weakest factor
  • Try to find pitfalls in your study, like James Randi
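
A quick way to replace the “20 patients” rule of thumb is a power calculation. Here is a minimal sketch with statsmodels, assuming a two-sample t-test design; the effect size of 0.5 (Cohen’s d) is a placeholder you would take from previous research:

```python
# Sketch: derive the sample size from a target power instead of a rule of thumb.
# Assumes a two-sample t-test; the effect size 0.5 is a placeholder value.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required sample size per group: {n_per_group:.0f}")  # roughly 64, not 20
```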

phases of analysis:

  1. Look at data
  2. Formulate a sensible model
  3. Fit the model
  4. Check the fit
  5. Utilize the model and present conclusions

Whatever the situation, one overall message is that the analyst should not be tempted to rush into using a standard statistical technique without first having a careful look at the data.

model formulation:

  • Ask lots of questions and listen
  • Incorporate background theory
  • Look at the data
  • Experience and inspiration are important
  • trying many models is helpful but can be dangerous; don’t select the best model based on the highest R^2 or similar, and offer different models in your paper (see the sketch after this list)
  • alternatively: use a Bayesian approach to model selection
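
A toy sketch (my own simulated data, not from the book) of why the highest in-sample R^2 is a poor selection criterion: it keeps rising with model complexity, while cross-validated error typically does not.

```python
# Sketch: in-sample R^2 always favours the more complex model; cross-validated
# error does not. Simulated data: linear truth plus noise.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 60).reshape(-1, 1)
y = 2.0 + 0.5 * x.ravel() + rng.normal(scale=1.0, size=60)

for degree in (1, 3, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    r2 = model.fit(x, y).score(x, y)                      # in-sample R^2
    cv_mse = -cross_val_score(model, x, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: R^2 = {r2:.3f}, CV MSE = {cv_mse:.3f}")
```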

model validation:

  • Is the model specification satisfactory?
  • How about the random component?
  • Are there a few influential observations? (see the sketch after this list)
  • Has an important feature been overlooked?
  • Are there alternative models that fit as well as the chosen one?
  • Then iterate, iterate, iterate
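
A minimal sketch of such a validation pass on simulated data (my example, not Chatfield’s): look at the residuals and at Cook’s distances to spot influential observations.

```python
# Sketch: basic checks on a fitted OLS model -- residuals and influential points.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)
fit = sm.OLS(y, sm.add_constant(x)).fit()

cooks_d, _ = fit.get_influence().cooks_distance
print("residual mean (should be near 0):", fit.resid.mean().round(3))
print("largest Cook's distance:", cooks_d.max().round(3))
print("observations with Cook's d > 4/n:", np.where(cooks_d > 4 / len(y))[0])
```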

Initial examination of data (IDA)

  • data structure: how many variables? categorical/binary/continuous? (see the pandas sketch after this list)
  • Useful to reduce dimensionality?
  • ordinal data -> coded as numerical or with dummies?
  • data cleaning: coding errors, OCR, etc.
  • data quality: collection, errors & outliers => eyeballing is very helpful, 5-point summaries
  • missing values: are they MCAR? impute them, e.g. with the EM algorithm
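
A minimal IDA sketch with pandas; survey.csv and its columns are placeholders for whatever data set is at hand.

```python
# Sketch: first look at structure, types, missing values and summaries.
import pandas as pd

df = pd.read_csv("survey.csv")           # placeholder file name

print(df.shape)                          # how many observations and variables?
print(df.dtypes)                         # categorical / binary / continuous?
print(df.isna().sum())                   # missing values per variable
print(df.describe())                     # min, quartiles, median, max, mean, sd
print(df.describe(include="object"))     # counts/modes, if there are text columns
```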

descriptive statistics

  • for complete data set & interesting sub groups
  • 5-point summary, IQR, tables, graphs
  • Tufte’s lie factor = apparent size of effect shown in the graph / actual size of effect in the data (a worked example follows this list)
  • graphs: units, title, legend
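
A worked toy calculation of the lie factor (numbers made up): a graph that doubles a bar’s height to show a 5% increase in the data has a lie factor of 100/5 = 20.

```python
# Sketch: Tufte's lie factor with invented numbers.
apparent_change = (2.0 - 1.0) / 1.0 * 100      # bar height doubles: +100 %
actual_change = (105.0 - 100.0) / 100.0 * 100  # the data only grew by +5 %

print(apparent_change / actual_change)         # 20.0; far from 1 = misleading graph
```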

data modification

  • test data transformations
  • estimate missing values
  • adjust extreme values
  • create new variables
  • try a Box-Cox transformation (see the sketch after this list)
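
A short Box-Cox sketch with SciPy on made-up, right-skewed data; note the transformation needs strictly positive values.

```python
# Sketch: let SciPy pick the Box-Cox lambda for skewed positive data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)

transformed, lam = stats.boxcox(skewed)
print(f"estimated lambda: {lam:.2f}")    # near 0 means a log transform fits well
print(f"skewness before: {stats.skew(skewed):.2f}, after: {stats.skew(transformed):.2f}")
```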

analysis

  • significance tests are widely overused, esp. in medicine, biology and psychology.
  • Statistically significant effects are not always interesting, esp. with big samples (see the sketch after this list)
  • non-significant is not the same as no difference; the flip side of the previous point
  • enforcement of significance levels: why five percent and not four or one or whatever? This can lead to publication bias.
  • Estimates are more important, because they communicate relationships
  • Often the null hypothesis is silly, e.g. that water doesn’t affect the growth of a plant
    • Better: Interesting results should be repeatable in general and under different conditions. (Nelder: significant sameness)
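
A quick simulation (my own toy numbers) of the big-sample point: a difference of 0.1 on a scale with standard deviation 15 is practically irrelevant, yet with a million observations per group the test calls it significant.

```python
# Sketch: a trivially small difference becomes "significant" with huge samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 1_000_000
group_a = rng.normal(loc=100.0, scale=15, size=n)
group_b = rng.normal(loc=100.1, scale=15, size=n)   # practically irrelevant shift

res = stats.ttest_ind(group_a, group_b)
print(f"p-value: {res.pvalue:.1e}")                  # far below 0.05
print(f"difference in means: {group_b.mean() - group_a.mean():.2f}")  # about 0.1
```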

appropriate procedure

  • do more than just one type of analysis, e.g. parametric vs. non-parametric or robust (see the sketch after this list)
  • good robust methods are better than optimal methods that require lots of assumptions
  • don’t use a method just because you are familiar with it
  • think in different ways about the problem
  • be prepared to make ad hoc modifications
  • you cannot know everything
  • analysis is more than just fitting the model
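
A sketch of running more than one analysis on the same made-up numbers: a single gross outlier (perhaps a coding error) inflates the variance so the classical t-test sees little, while a rank-based test still picks up the shift.

```python
# Sketch: classical vs. rank-based comparison on data with one gross outlier.
import numpy as np
from scipy import stats

control = np.array([4.1, 3.8, 4.5, 4.0, 3.9, 4.2, 4.3, 4.1])
treated = np.array([4.6, 4.9, 4.4, 4.8, 4.7, 4.5, 4.6, 39.0])  # 39.0: outlier

print("means:  ", control.mean().round(2), treated.mean().round(2))
print("medians:", np.median(control), np.median(treated))
print("t-test p:      ", stats.ttest_ind(control, treated).pvalue.round(3))
print("Mann-Whitney p:", stats.mannwhitneyu(control, treated).pvalue.round(3))
```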

philosophical

  • the assumed model is often more important than frequentist vs. Bayesian

generally

  • learn your statistics software and a scientific programming language
  • learn to use a library, Google Scholar, and search in general

statistical consulting

  • work with the people; statistics isn’t about numbers, it’s about people
  • understand the problem and the objective
  • ask lots of questions
  • be patient
  • bear in mind resource constraints
  • write in clear language

numeracy

  • be skeptical
  • understand numbers
  • learn to estimate
  • check dimensions
  • My book recommendation: Innumeracy
  • check silly statistics, e.g. a mean outside the observed range (see the sketch after this list)
  • avoid graphs without titles and labels
  • don’t use linear regression for non-linear data
  • check assumptions, e.g. in multiple regression: are there more variables than observations?
  • the first time I worked with real data, I saw how different the process was
  • => Real work isn’t like your statistics 101 course; data is messy and you don’t have unlimited time or money
  • courses make you think you get the data, look for your perfect model, and you’re done; in reality it is 70% searching for data & thinking about pitfalls, 25% cleaning up data and understanding it, and about 5% doing the actual analysis
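
A small sketch of such sanity checks (all numbers invented): a reported mean outside the observed range is impossible, and a regression with more predictors than observations is ill-posed.

```python
# Sketch: quick numeracy checks on a reported summary table.
import pandas as pd

summary = pd.DataFrame(
    {"min": [18, 0.0], "mean": [42.3, 1.7], "max": [65, 1.0]},
    index=["age", "score"],
)

impossible = summary[(summary["mean"] < summary["min"]) |
                     (summary["mean"] > summary["max"])]
print("rows where the mean falls outside [min, max]:")
print(impossible)                        # "score": mean 1.7 but max 1.0

n_obs, n_predictors = 30, 45
if n_obs <= n_predictors:
    print("warning: more predictors than observations -- regression is ill-posed")
```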

The second half of the book is filled with awesome exercises. I’d recommend that everybody working with statistical techniques or with data check them out. They are insightful, interesting and stimulating. Furthermore, Chatfield shows that you can reveal insights with simple techniques.
Problem Solving: A statistician’s guide is a clear recommendation for everybody working with data on a daily basis, especially people with less than 2 to 5 years of experience. I close with a quote from D. J. Finney: “Don’t analyze numbers, analyze data.”