#6/25: Problem Solving: A statistician’s guide

Rules:

  1. Do not attempt to analyse the data until you understand what is being measured and why. Find out whether there is any prior information about likely effects.
  2. Find out how the data were collected.
  3. Look at the structure of the data.
  4. The data then need to be carefully examined in an exploratory way, before attempting a more sophisticated analysis.
  5. Use your common sense at all times.
  6. Report the results in a clear, self-explanatory way.

Thus a statistician needs to understand the general principles involved in tackling statistical problems, and at some stage it is more important to study the strategy of problem solving than to learn yet more techniques (which can always be looked up in a book).

  • What’s the objective? What are the specific aims? What’s important and why?
  • How was the data selected? What is its quality?
  • How will the results be used? Simple vs. complicated models
  • Check the existing literature => it can make the study redundant, or it can help you collect better data and avoid repeating fundamental errors

collecting

  • Test as much as possible during collection, e.g. pretest surveys, account for time effects, the order of different studies, etc.
  • Getting the right sample size is often difficult; sometimes it is too small, other times too large; esp. medical research often relies on rules of thumb like 20 patients instead of properly derived sizes => Tip: look at previous research (see the power-analysis sketch after this list)
  • Try to iterate over and over again to make the study better
  • Learn by experience. Do studies yourself. It’s often harder than you think, esp. drawing random samples, e.g. selecting random pigs from a herd
  • Anecdote: A pregnant woman had to wait for 3 hours and therefore had elevated blood pressure -> the medical personnel assumed this blood pressure was her usual level and admitted her to hospital.
    • Always check the environment of the study
  • Non-responses can say a lot; don’t ignore them
  • questionnaire design is important! Learn about halo effects, social desirability, moral effects, etc.
  • Always pretest with a pilot study, if possible
  • The human element is often the weakest factor
  • Try to find pitfalls in your study, like James Randi
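
A quick way to replace the “20 patients” rule of thumb is a power calculation. Here is a minimal sketch with statsmodels, assuming a two-sample t-test design; the effect size of 0.5 (Cohen’s d) is a placeholder you would take from previous research:

```python
# Sketch: derive the sample size from a target power instead of a rule of thumb.
# Assumes a two-sample t-test; the effect size 0.5 is a placeholder value.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required sample size per group: {n_per_group:.0f}")  # roughly 64, not 20
```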

phases of analysis:

  1. Look at data
  2. Formulate a sensible model
  3. Fit the model
  4. Check the fit
  5. Utilize the model and present conclusions

Whatever the situation, one overall message is that the analyst should not be tempted to rush into using a standard statistical technique without first having a careful look at the data.

model formulation:

  • Ask lots of questions and listen
  • Incorporate background theory
  • Look at the data
  • Experience and inspiration are important
  • trying many models is helpful but can be dangerous; don’t select the best model based on the highest R^2 or similar, and offer different models in your paper (see the sketch after this list)
  • alternatively: use a Bayesian approach to model selection
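
A toy sketch (my own simulated data, not from the book) of why the highest in-sample R^2 is a poor selection criterion: it keeps rising with model complexity, while cross-validated error typically does not.

```python
# Sketch: in-sample R^2 always favours the more complex model; cross-validated
# error does not. Simulated data: linear truth plus noise.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 60).reshape(-1, 1)
y = 2.0 + 0.5 * x.ravel() + rng.normal(scale=1.0, size=60)

for degree in (1, 3, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    r2 = model.fit(x, y).score(x, y)                      # in-sample R^2
    cv_mse = -cross_val_score(model, x, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: R^2 = {r2:.3f}, CV MSE = {cv_mse:.3f}")
```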

model validation:

  • Is the model specification satisfactory?
  • How about the random component?
  • Are there a few influential observations? (see the sketch after this list)
  • Has an important feature been overlooked?
  • Are there alternative models that fit as well as the chosen one?
  • Then iterate, iterate, iterate
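
A minimal sketch of such a validation pass on simulated data (my example, not Chatfield’s): look at the residuals and at Cook’s distances to spot influential observations.

```python
# Sketch: basic checks on a fitted OLS model -- residuals and influential points.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)
fit = sm.OLS(y, sm.add_constant(x)).fit()

cooks_d, _ = fit.get_influence().cooks_distance
print("residual mean (should be near 0):", fit.resid.mean().round(3))
print("largest Cook's distance:", cooks_d.max().round(3))
print("observations with Cook's d > 4/n:", np.where(cooks_d > 4 / len(y))[0])
```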

Initial examination of data (IDA)

  • data structure: how many variables? categorical/binary/continuous? (see the pandas sketch after this list)
  • Useful to reduce dimensionality?
  • ordinal data -> coded as numerical or with dummies?
  • data cleaning: coding errors, OCR, etc.
  • data quality: collection, errors & outliers => eyeballing is very helpful, 5-point summaries
  • missing values: are they MCAR? impute them, e.g. with the EM algorithm
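
A minimal IDA sketch with pandas; survey.csv and its columns are placeholders for whatever data set is at hand.

```python
# Sketch: first look at structure, types, missing values and summaries.
import pandas as pd

df = pd.read_csv("survey.csv")           # placeholder file name

print(df.shape)                          # how many observations and variables?
print(df.dtypes)                         # categorical / binary / continuous?
print(df.isna().sum())                   # missing values per variable
print(df.describe())                     # min, quartiles, median, max, mean, sd
print(df.describe(include="object"))     # counts/modes, if there are text columns
```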

descriptive statistics

  • for complete data set & interesting sub groups
  • 5-point summary, IQR, tables, graphs
  • Tufte’s lie factor = apparent size of effect shown in the graph / actual size of effect in the data (a worked example follows this list)
  • graphs: units, title, legend
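
A worked toy calculation of the lie factor (numbers made up): a graph that doubles a bar’s height to show a 5% increase in the data has a lie factor of 100/5 = 20.

```python
# Sketch: Tufte's lie factor with invented numbers.
apparent_change = (2.0 - 1.0) / 1.0 * 100      # bar height doubles: +100 %
actual_change = (105.0 - 100.0) / 100.0 * 100  # the data only grew by +5 %

print(apparent_change / actual_change)         # 20.0; far from 1 = misleading graph
```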

data modification

  • test data transformations
  • estimate missing values
  • adjust extreme values
  • create new variables
  • try a Box-Cox transformation (see the sketch after this list)
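
A short Box-Cox sketch with SciPy on made-up, right-skewed data; note the transformation needs strictly positive values.

```python
# Sketch: let SciPy pick the Box-Cox lambda for skewed positive data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)

transformed, lam = stats.boxcox(skewed)
print(f"estimated lambda: {lam:.2f}")    # near 0 means a log transform fits well
print(f"skewness before: {stats.skew(skewed):.2f}, after: {stats.skew(transformed):.2f}")
```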

analysis

  • significance tests are widely overused, esp. in medicine, biology and psychology.
  • Statistically significant effects are not always interesting, esp. with big samples (see the sketch after this list)
  • non-significant is not the same as no difference; the flip side of the previous point
  • enforcement of significance levels: why five percent and not four or one or whatever? This can lead to publication bias.
  • Estimates are more important, because they communicate relationships
  • Often the null hypothesis is silly, e.g. that water doesn’t affect the growth of a plant
    • Better: Interesting results should be repeatable in general and under different conditions. (Nelder: significant sameness)
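
A quick simulation (my own toy numbers) of the big-sample point: a difference of 0.1 on a scale with standard deviation 15 is practically irrelevant, yet with a million observations per group the test calls it significant.

```python
# Sketch: a trivially small difference becomes "significant" with huge samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 1_000_000
group_a = rng.normal(loc=100.0, scale=15, size=n)
group_b = rng.normal(loc=100.1, scale=15, size=n)   # practically irrelevant shift

res = stats.ttest_ind(group_a, group_b)
print(f"p-value: {res.pvalue:.1e}")                  # far below 0.05
print(f"difference in means: {group_b.mean() - group_a.mean():.2f}")  # about 0.1
```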

appropriate procedure

  • do more than just one type of analysis, e.g. parametric vs. non-parametric or robust (see the sketch after this list)
  • good robust methods are better than optimal methods that require lots of assumptions
  • don’t use a method just because you are familiar with it
  • think in different ways about the problem
  • be prepared to make ad hoc modifications
  • you cannot know everything
  • analysis is more than just fitting the model
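
A sketch of running more than one analysis on the same made-up numbers: a single gross outlier (perhaps a coding error) inflates the variance so the classical t-test sees little, while a rank-based test still picks up the shift.

```python
# Sketch: classical vs. rank-based comparison on data with one gross outlier.
import numpy as np
from scipy import stats

control = np.array([4.1, 3.8, 4.5, 4.0, 3.9, 4.2, 4.3, 4.1])
treated = np.array([4.6, 4.9, 4.4, 4.8, 4.7, 4.5, 4.6, 39.0])  # 39.0: outlier

print("means:  ", control.mean().round(2), treated.mean().round(2))
print("medians:", np.median(control), np.median(treated))
print("t-test p:      ", stats.ttest_ind(control, treated).pvalue.round(3))
print("Mann-Whitney p:", stats.mannwhitneyu(control, treated).pvalue.round(3))
```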

philosophical

  • the assumed model is often more important than frequentist vs. Bayesian

generally

  • learn your statistics software and a scientific programming language
  • learn to use a library, Google Scholar, and search in general

statistical consulting

  • work with the people; statistics isn’t about numbers, it’s about people
  • understand the problem and the objective
  • ask lots of questions
  • be patient
  • bear in mind resource constraints
  • write in clear language

numeracy

  • be skeptical
  • understand numbers
  • learn to estimate
  • check dimensions
  • My book recommendation: Innumeracy
  • check silly statistics, e.g. a mean outside the observed range (see the sketch after this list)
  • avoid graphs without titles and labels
  • don’t use linear regression for non-linear data
  • check assumptions, e.g. in multiple regression: are there more variables than observations?
  • the first time I worked with real data, I saw how different the process was
  • => Real work isn’t like your statistics 101 course; data is messy and you don’t have unlimited time or money
  • courses make you think you get the data, look for your perfect model, and you’re done; in reality it is 70% searching for data & thinking about pitfalls, 25% cleaning up data and understanding it, and about 5% doing the actual analysis
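
A small sketch of such sanity checks (all numbers invented): a reported mean outside the observed range is impossible, and a regression with more predictors than observations is ill-posed.

```python
# Sketch: quick numeracy checks on a reported summary table.
import pandas as pd

summary = pd.DataFrame(
    {"min": [18, 0.0], "mean": [42.3, 1.7], "max": [65, 1.0]},
    index=["age", "score"],
)

impossible = summary[(summary["mean"] < summary["min"]) |
                     (summary["mean"] > summary["max"])]
print("rows where the mean falls outside [min, max]:")
print(impossible)                        # "score": mean 1.7 but max 1.0

n_obs, n_predictors = 30, 45
if n_obs <= n_predictors:
    print("warning: more predictors than observations -- regression is ill-posed")
```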

The second half of the book is filled with awesome exercises. I’d recommend that everybody working with statistical techniques or with data check them out. They are insightful, interesting and stimulating. Furthermore, Chatfield shows that you can reveal insights with simple techniques.
Problem Solving: A statistician’s guide is a clear recommendation for everybody working with data on a daily basis, especially people with less than 2 to 5 years of experience. I close with a quote from D. J. Finney: “Don’t analyze numbers, analyze data.”