Reading Atlanta Analytics

All of this business about paid tools vs free tools, and dare I say the whole concept of #measure, all boils down to the fact that today, we are a tool-centric industry, often to the detriment of being an expert-centric industry. — Stop giving web analytics tools the credit YOU deserve

Atlanta Analytics is an interesting blog, although it doesn’t have many posts. The author, Evan LaPointe, has some nice visions and an interesting perspective because he comes from a finance background.
He makes some important points:

  • It isn’t about page views or uniques – it’s about money
  • Drive actions not data
  • Be a business person not a technologist
  • Demand your share – if you increase your company’s profit by $500,000 per year, you should demand a share of it

What is web analytics?

  • Quantify today’s success and uncover usability, design, architecture, copy, product, advertising, pricing and marketing optimization that will breed even more success tomorrow
  • Web analytics isn’t:
    • WA is not the measurement of something
    • WA is not defining success but translating it
    • WA is not Omniture, Google Analytics or Clicktracks
  • Web analytics answers the following questions:
    1. Who is coming to my web site?
    2. What are they trying to do?
    3. What is the gap between what they are doing and the ideal?
    4. What are some concrete ways we can close the gaps?
    5. How can we get more of these people?
  • These questions should be answered in the context of growth and profitability
  • Analysts shouldn’t become married to one discipline; otherwise they lose the big picture
  • Analysts are central, and their recommendations should be driven by company impact, not personal impact
  • Even if you cannot solve a problem by yourself, you have uncovered an important problem

Three enormous wastes of your web analytics time

  1. Analytics isn’t implemented in the dev process but afterwards
  2. You care about the exact unique visitor count
  3. You are trying to match numbers from two different tools: think trends, not accounting

3.5 things that keep you from finding good web analytics people

  • 1: Good WAs may already be in your company
  • 2: A lot of experienced WAs are actually report writers
  • 3: Your interview process prevents you from hiring good people: if you fear change, or fear that your flaws will be revealed, and the applicant is able, then you probably won’t hire them
  • 3.5: Your salary is too low: increasing your conversion rate by 0.3% can mean hundreds of thousands of dollars of additional revenue per month
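The revenue claim in the last point is easy to verify with back-of-the-envelope math. All traffic and order-value numbers below are hypothetical assumptions, not figures from the original post:

```python
# Back-of-the-envelope: what a 0.3 percentage-point conversion lift is worth.
# monthly_visits and avg_order_value are assumed example numbers.
monthly_visits = 1_000_000
avg_order_value = 75.0   # USD per conversion
lift = 0.003             # +0.3 percentage points

extra_revenue = monthly_visits * lift * avg_order_value
print(f"Additional revenue per month: ${extra_revenue:,.0f}")  # $225,000
```

Even a mid-sized site can easily clear "hundreds of thousands per month" with a lift that small.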

Web analytics sucks, and it’s nobody’s fault

This is a handmade description for yet another propellerhead analyst who will sit around and run reports for people, get in arguments with other people (or those same people), “agree to disagree” with other departments, and will eventually call everyone else an idiot and will recede into their cave before ultimately quitting for a director-level position at a different, big, resume-enhancing company where the process will repeat itself.

It’s not their fault because a good position for a web analytics person does not exist in the companies that can use these people most. The bigger the company, the more important a small difference becomes. For a site with 10,000 visits a month, an analytics person would have to improve conversion by double-digit percentages to scarcely pay for themselves. For Wal Mart, moving the conversion needle a tenth of a percent probably pays their lifetime salary in a week.

The effective web analytics person knows usability, they know some design, they know information architecture, they know HTML, they are good communicators and can thusly write good web copy, and ultimately they are businesspeople who realize the purpose behind all of these crafts is cash flow […] Rather than being careful, politically aware employees, effective analytics people are data-driven, quickdraw decision makers because they have two key assets:

1. Cold, hard facts in the form of data (and I don’t mean just Omniture data)
2. The ability to not have to decide: they can TEST

Big companies are ruled by coalitions of opinions, meetings, conference calls, and semi-educated executives. Data is actually a threat. Data is what gets people fired in big companies, not what gets them bonuses. Data is scary.

What are the REAL web analytics tools?

  • Question: How can you improve the long-term cash flow?
  • Where you need a decent degree of competency:
    • Usability
    • Information Architecture
    • SEO
    • Web marketing (PPC, display, email)
    • Social Media
    • Design
    • Copywriting
    • Website technology (HTML, CSS, SQL, JS, PHP/Ruby/Python/whatever)
    • Communication skills
  • Learn business goals -> department goals -> campaign goals -> personal goals

Have you lost faith in web analytics?

  • Make decisions as often as possible – aka fail faster
  • It isn’t about the newest technology – it’s about money
  • Don’t live in a vacuum – interact with different people and viewpoints

The purpose of web (or any) analytics

  • “We talk about being data-driven businesses. But these aren’t businesses built around a culture of measurement. They’re built around a culture of accountability.”
  • “The purpose of web analytics, or any analytics, is to give your organization the confidence needed to accelerate the pace of decisions.”
  • “We’re talking about being accountable to outcomes, not to some Tyrannosaurus on a power trip. That’s a big deal.”
  • “It’s about making big decisions often.” – Iterate, iterate, iterate

#6/25: Problem Solving: A statistician’s guide

Rules:

  1. Do not attempt to analyse the data until you understand what is being measured and why. Find out whether there is any prior information about likely effects.
  2. Find out how the data were collected.
  3. Look at the structure of the data.
  4. The data then need to be carefully examined in an exploratory way, before attempting a more sophisticated analysis.
  5. Use your common sense at all times.
  6. Report the results in a clear, self-explanatory way.

Thus a statistician needs to understand the general principles involved in tackling statistical problems, and at some stage it is more important to study the strategy of problem solving rather than learn yet more techniques (which can always be looked up in a book).

  • What’s the objective? Which aim? What’s important and why?
  • How was the data selected? What is its quality?
  • How are the results used? Simple vs. complicated models
  • Check the existing literature => it can make the study redundant, or it helps you collect better data and avoid repeating fundamental errors

collecting

  • Test as much as possible in your collection, i.e. pretesting surveys; account for time effects, the order of different studies, etc.
  • Getting the right sample size is often difficult; sometimes it is too small, other times too large. Esp. medical research often relies on rules of thumb like 20 patients instead of properly derived sizes => Tip: look for previous research
  • Iterate over and over again to make the study better
  • Learn by experience. Do studies yourself. It’s often harder than you think, esp. drawing random samples, e.g. selecting random pigs in a herd
  • Anecdote: A pregnant woman had to wait for three hours and therefore had elevated blood pressure -> the medical personnel assumed this blood pressure was constant and admitted her to a hospital.
    • Always check the environment of the study
  • Non-responses can say a lot, don’t ignore them
  • Questionnaire design is important! Learn about halo effects, social desirability, moral effects, etc.
  • Always pretest with a pilot study, if possible
  • The human element is often the weakest factor
  • Try to find pitfalls in your study, like Randy James

phases of analysis:

  1. Look at data
  2. Formulate a sensible model
  3. Fit the model
  4. Check the fit
  5. Utilize the model and present conclusions
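The five phases can be walked through on synthetic data. This is a minimal sketch with assumed toy numbers, not an example from the book:

```python
# The five phases of analysis on synthetic data: look, model, fit, check, use.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)  # true line plus noise

# 1. Look at the data first: range, center, obvious oddities
print(f"x in [{x.min():.1f}, {x.max():.1f}], y mean = {y.mean():.2f}")

# 2. + 3. Formulate a sensible model (a straight line) and fit it
slope, intercept = np.polyfit(x, y, deg=1)

# 4. Check the fit: residuals should be small and structureless
residuals = y - (slope * x + intercept)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, "
      f"residual sd = {residuals.std():.2f}")

# 5. Utilize the model: predict at a new point
print(f"prediction at x = 12: {slope * 12 + intercept:.1f}")
```

The point of step 4 is Chatfield's: never skip from fitting straight to conclusions without examining the residuals.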

Whatever the situation, one overall message is that the analyst should not be tempted to rush into using a standard statistical technique without first having a careful look at the data.

model formulation:

  • Ask lots of questions and listen
  • Incorporate background theory
  • Look at the data
  • Experience and inspiration are important
  • Trying many models is helpful but can be dangerous; don’t select the best model based solely on the highest R^2 or similar, and present the alternative models in your paper
  • alternatively: use Bayesian approach for model selection

model validation:

  • Is the model specification satisfactory?
  • How about the random component?
  • Are there a few influential observations?
  • Has an important feature been overlooked?
  • Are there alternative models that are as good as the chosen model?
  • Then iterate, iterate, iterate

Initial examination of data (IDA)

  • data structure, how many variables? categorical/binary/continuous?
  • Useful to reduce dimensionality?
  • ordinal data -> coded as numerical or with dummies?
  • data cleaning: coding errors, OCR, etc.
  • data quality: collection, errors & outliers => eyeballing is very helpful, 5-point summaries
  • missing values: MCAR, imputation, EM algorithm
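The IDA steps above can be sketched in a few lines. The data here is an assumed toy example, and plain mean imputation stands in for the fancier EM-style approaches the notes mention:

```python
# Initial examination of data: count missing values, eyeball a five-number
# summary, flag an obvious coding error, and mean-impute the gaps.
import numpy as np

data = np.array([3.1, 2.9, np.nan, 3.4, 250.0, 3.0, 2.8])  # 250.0: likely a coding error

print("missing values:", int(np.isnan(data).sum()))
print("five-number summary:", np.nanpercentile(data, [0, 25, 50, 75, 100]))

# Treat the implausible value as missing, then impute with the mean
data = np.where(data > 100, np.nan, data)
imputed = np.where(np.isnan(data), np.nanmean(data), data)
print("imputed:", imputed.round(2))
```

Eyeballing the five-number summary immediately exposes the 250.0 outlier that a blind model fit would have swallowed.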

descriptive statistics

  • for complete data set & interesting sub groups
  • 5-point summary, IQR, tables, graphs
  • Tufte’s lie factor = apparent size of effect shown in the graph / actual size of effect in the data
  • graphs: units, title, legend
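Tufte's lie factor from the list above is a one-line computation; the graph numbers here are hypothetical:

```python
# Tufte's lie factor: how much a graphic exaggerates the underlying effect.
# Example: a bar drawn 5x taller while the underlying value only doubled.
apparent_effect = 5.0   # size ratio shown in the graphic (assumed)
actual_effect = 2.0     # size ratio in the data (assumed)

lie_factor = apparent_effect / actual_effect
print(lie_factor)  # 2.5; a faithful graphic has a lie factor close to 1
```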

data modification

  • test data transformation
  • estimating missings
  • adjust extreme values
  • create new variables
  • try box-cox transformation

analysis

  • significance tests are widely overused, esp. in medicine, biology and psychology.
  • Statistically significant effects are not always interesting, esp. with big samples
  • Non-significant is not always the same as no difference; the opposite of the previous point
  • Rigid enforcement of significance levels: why five percent and not four or one or whatever? This can lead to publication bias.
  • Estimates are more important, because they communicate relationships
  • The null hypothesis is often silly, e.g. that water doesn’t affect the growth of a plant
    • Better: Interesting results should be repeatable in general and under different conditions. (Nelder: significant sameness)
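The big-sample point can be made concrete with a one-sample z-test done by hand; every number below is a hypothetical illustration:

```python
# With a big enough sample, a trivially small effect becomes "statistically
# significant" even though it is practically irrelevant.
import math

n = 1_000_000
mean_diff = 0.01   # tiny effect: one hundredth of a standard deviation
sd = 1.0

z = mean_diff / (sd / math.sqrt(n))
print(f"z = {z:.1f}  (p << 0.001), yet the effect is only {mean_diff} sd")
```

The test screams "significant" while the estimate itself tells you the effect is negligible, which is exactly why estimates communicate more than p-values.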

appropriate procedure

  • Do more than just one type of analysis, e.g. parametric vs. non-parametric, or robust methods
  • Good robust methods are better than optimal methods with lots of assumptions
  • Don’t use a method simply because you’re familiar with it
  • think in different ways about the problem
  • be prepared to make ad hoc modifications
  • you cannot know everything
  • analysis is more than just fitting the model
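A tiny illustration of why robust methods matter, on assumed toy data: one gross error wrecks the mean but barely moves the median.

```python
# Robust vs. non-robust location estimates under a single gross error.
import statistics

data = [10, 11, 9, 10, 12, 10, 11, 500]  # 500 is a recording error

print(statistics.mean(data))    # 71.625: dragged far up by the outlier
print(statistics.median(data))  # 10.5: barely affected
```

The "optimal" estimator under normality (the mean) fails badly the moment its assumptions break; the median keeps working.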

philosophical

  • The assumed model is often more important than frequentist vs. Bayesian

generally

  • learn your statistics software and a scientific programming language
  • learn to use a library, Google Scholar, and searching in general

statistical consulting

  • work with the people; statistics isn’t about numbers, it’s about people
  • understand the problem and the objective
  • ask lots of questions
  • be patient
  • bear in mind resource constraints
  • write in clear language

numeracy

  • be skeptical
  • understand numbers
  • learn estimating
  • check dimensions
  • My book recommendation: Innumeracy
  • check silly statistics: e.g. mean outside of range
  • avoid graph without title and labels
  • don’t use linear regression for non-linear data
  • check assumptions, e.g. in multiple regression: more variables than observations is a problem
  • My first time working with real data showed me how different the process is
  • => Real work isn’t like your statistics 101 course; data is messy, and you don’t have unlimited time or money
  • Courses make you think you get the data, look for your perfect model and you’re done. Rather, it is about 70% searching for data and thinking about pitfalls, 25% cleaning up data and understanding it, and about 5% doing the actual analysis
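The "silly statistics" check from the numeracy list is trivial to automate; the helper name here is my own, hypothetical:

```python
# A reported mean must lie inside the observed range of the data,
# or something is wrong upstream (wrong units, wrong subset, typo).
def sane_mean(values, reported_mean):
    return min(values) <= reported_mean <= max(values)

print(sane_mean([1, 2, 3, 4], 2.5))  # True: plausible
print(sane_mean([1, 2, 3, 4], 7.0))  # False: silly statistic, investigate
```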

The second half of the book is filled with awesome exercises. I’d recommend that everybody working with statistical techniques or with data check them out. They are insightful, interesting and stimulating. Furthermore, Chatfield shows that you can reveal insights with simple techniques.
Problem Solving: A statistician’s guide is a clear recommendation for everybody working with data on a daily basis, especially people with less than 2 to 5 years of experience. I close with a quote from D. J. Finney: Don’t analyze numbers, analyze data.

Forbes: The World’s Billionaires

You may know that Forbes publishes tons of different and interesting lists; one of them is The World’s Billionaires. I took a look at the data and deepened my R knowledge along the way.

Countries

Let’s start with the countries. There are 1007 persons in this data set with valid Country entries.

You can see that most billionaires were born in the US, followed with a big gap by China, India, Germany and Turkey. The relative distribution of billionaires worldwide looks a bit different:

It usually takes some time to become a billionaire, so I expect China and probably India to become stronger in the future.

Age

There are a few famous young billionaires like Mark Zuckerberg, but most of them are considerably older. The first quartile actually starts at 54, the average billionaire is 63, and the oldest one in this data set is 100.

Education

Only 51% of these people hold at least a bachelor’s degree, 20% hold a master’s degree and only 8.7% earned a doctorate or PhD.
Interestingly, about 5% are dropouts, and 40 of these 50 dropouts are from the US.

Marital Status

Marriage rates are still high among billionaires. About 83% are married, only 7% are divorced. About 5% are widowed, and there are 30 singles.

Children

Billionaires also tend to have many children. There are 176 entries with no data, i.e. either missing or no children.
Most billionaires have either two or three children, but there are some outliers with 10 children or more. And there is Sulaiman Al Rajhi, who has 23 children.

Net Worth

Most billionaires own between 1 and 3.6 billion USD. The median billionaire owns 2.1 billion USD. There are, of course, some famous outliers like Bill Gates, Warren Buffett, Larry Ellison and Carlos Slim Helu. The combined net worth of all these billionaires is 3.7 trillion USD. For comparison, the total US debt is 14 trillion, nearly four times as much.

Self Made?

This was, for me, one of the most interesting questions. I asked myself whether the split between inherited and self-made billionaires differs between countries.

I’ll pick some examples:

  • US: 69% are self-made
  • China: 96% are self-made
  • Germany: only 33% are self-made

Generally, emerging countries have more self-made billionaires, which is what you would expect, but the US is a positive example among older countries.

Source

Let’s talk about the sources. Rather interesting is that industries that sound really profitable, like oil and software, are rather minor here. I think that has three reasons.

Firstly, becoming a billionaire takes time. It would be great to look at the first time someone became a billionaire; I think you would probably see clusters, e.g. manufacturing probably started around 1940, but now there are fewer manufacturing billionaires.
Secondly, some markets aren’t that big. The global real estate market is probably a lot bigger than the global software market.
Thirdly, the ability to oligopolize markets. Take oil and real estate: oil is pretty much a commodity and you can distribute it worldwide without problems. Real estate is locally limited; you can’t take a building and just put it somewhere else, and this limits the market power of companies in that industry.

You can access the data.csv here.

Passionate Programmer: Programming Language and Wage Premium

In the first chapter of Passionate Programmer, Chad Fowler talks about the supply of programmers for new technologies and really old ones, and about wage premiums.

Data

I’ll consider programming languages instead of technologies in general, because the latter are too diverse. My first thought was to look at TIOBE, which publishes a programming language popularity index every month. However, that doesn’t necessarily reflect market demand. So I looked for other sources and found this blog post, which used the Indeed job trends tool. This gives us a nice idea of trends.

Graphs / Interpretation

Long term and short term growth

I imported the data into Stata, log transformed it ( \log(1+x) ) for readability, and plotted long-term (6 years) against short-term (2 years) growth. Here you can see the whole graph, which is quite unreadable beyond C#.

Therefore I split it up into two charts. The first chart contains all languages with less than 100% growth in the last 6 years, i.e. about 12% annual growth on average. This threshold is arbitrary but helps to split the data, so that it is more readable.

For the interpretation: Cobol, for example, is at -0.6 on loglong, i.e. e^{-0.6} - 1 \approx -0.45, or about -45% long-term growth. Zero growth corresponds to a log value of 0.
We see the expected candidates here, Fortran and Cobol. Ada is quite high, which was surprising, at least for me.
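The back-transformation used in this interpretation can be written down directly: growth was stored as log(1 + x), so the original growth rate is recovered with e^value − 1.

```python
# Invert the log(1 + x) growth transform used for the charts.
import math

def to_growth(log_value):
    return math.exp(log_value) - 1.0

print(f"{to_growth(-0.6):+.0%}")  # about -45%, matching Cobol's loglong of -0.6
print(f"{to_growth(0.0):+.0%}")   # a log value of 0 means zero growth
```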
Here’s the other half of the chart:

Some great newcomers are Clojure, CoffeeScript and Scala. PHP is still strong, which surprised me, too.
However, it’s important to keep in mind that, for example, the demand for Clojure developers increased dramatically, but it’s still a niche language.

Salary and growth

As the next step, I took the average salary from Indeed for each language and normalized it (\frac{\text{salary} - \text{average salary}}{\text{std. dev. of salary}}). If we plot this normalized average salary against the log-transformed short-term growth, we get this graph:

Interpretation: avgsalary indicates how far a language’s salary lies above or below the average (~$88,367), measured in standard deviations (about $12,308). For example, ABAP has an avgsalary of about 2, so its actual salary is roughly 88,367 + 2 * 12,308 ≈ $112,983.
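Undoing the normalization is just mean + z · standard deviation, using the salary figures from the text:

```python
# Recover a dollar salary from the normalized avgsalary score.
mean_salary = 88_367.0   # average salary across languages (from the text)
sd_salary = 12_308.0     # standard deviation of salaries (from the text)

def denormalize(z):
    return mean_salary + z * sd_salary

print(f"${denormalize(2.0):,.0f}")  # ABAP at z = 2 -> $112,983
```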

Also, I added a linear regression line whose slope is actually significant (\beta_1 = 0.26 with std. err. \sigma_1 = 0.095).
The data is quite fuzzy, so don’t get overly excited. For example, the data for ABAP is quite skewed because it also includes e.g. consultants. However, we can see a general trend towards higher wages for trendier languages, which is to be expected.
If we exclude our outliers, i.e. ABAP, Ada and Visual Basic, the picture changes.
The average salary increases to $89,720 and its std. dev. decreases to $8,369 (a drop of about a third!). Our estimate gets a lot better (\beta_1 = 0.34 with std. err. \sigma_1 = 0.085). And our graph looks a bit different:

We can even see some kind of clustering: one cluster of languages with logshort > 3, and then the Java, C++, C# cluster. Quite interesting!
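The slope-significance claim above can be sanity-checked with the usual t statistic \beta_1 / \sigma_1; the rule of thumb that |t| > 2 is roughly significant at the 5% level is standard, and the helper below is my own illustration:

```python
# t statistic for a regression slope: coefficient divided by its std. error.
def t_stat(beta, std_err):
    return beta / std_err

print(round(t_stat(0.26, 0.095), 2))  # full sample: t = 2.74, significant
print(round(t_stat(0.34, 0.085), 2))  # without outliers: t = 4.0, stronger
```

Both fits clear the |t| > 2 threshold, and dropping the outliers roughly halves the relative uncertainty of the slope.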