Passionate Programmer: Programming Language and Wage Premium

In the first chapter of Passionate Programmer Chad Fowler talks about the supply of programmers for new technologies and really old ones and wage premiums.

Data

I’ll consider programming language instead of general technology because its too diverse. My first thought was to look at TOIBE which publish a programming language popularity index every month. However, that doesn’t necessary reflect the market demand. So, I looked for other sources and found this blog post which used the indeed jobtrend tool. This gives us a nice idea of trends.

Jan 05 - Nov 11, Jan 09 - Nov 11

Java: +28%, +5%
C: +23%, +1%
C++: -15%, -18%, 
C#: +120%, +24%
PHP: +325%, +140%
Objective-C: +11,000%, +8400%
Visual Basic: -8%, +24%
Python: +610%, +330%
Javascript: +160%, +95%
Perl: +20%, +10%
Ruby: +2,300%, +1,300%
SQL: +30%, +5%
Pascal: -26%, -3%
Lua: +20,000%, +10,000%
Ada: +22%, +16%
Cobol: -45%, -17%
Fortran: -27%, -51%
Erlang: +4,500%, +2,500%
Prolog: -27%, -49%
Haskell: +300%, +225%
F#: +3,250%, +3,250%
Groovy: +4,200%, +3,250%
Scala: +5,500%, +5,500%
Ada: +25%, +17%
CoffeeScript: +2,750%, +2,750%
Clojure: +12,000%, +12,000%
Lisp: +23%, -5%
Delphi: -5%, +7%
ABAP: +15%, +0%

Graphs / Interpretation

Long term and short term growth

I imported the data into Stata, log transformed ( \log(1+x)) it for readability and plotted it with long (6 years) and short (2 years) term growth. Here you can see the whole graph which is quite unreadable beyond C#.

Therefore I split it up into two charts. The first chart contains all languages with less than 100% growth in the last 6 years, i.e. about 12% annual growth on average. This threshold is arbitrary but helps to split the data, so that it is more readable.

For the interpretation: I.e. Cobol is at -0.6 on loglong, i.e. e^{-0.6} -1 \approx -0.45 long term growth. And zero percentage growth means a log value of 0.
We see the expected candidates here, Fortran and Cobol. Ada is quite high which was surprising, at least for me.
Here’s the other half ot the chart:

Some great newcomers are Clojure, CoffeeScript and Scala. PHP is still strong which surprised me, too.
However, it’s important to consider that e.g. that the demand of Clojure developers increased dramatically but it’s still a niche language.

Salary and growth

At the next step, I took the average salary from indeed for each language and normalized it (\frac{\text{salary} - \text{Average salary}}{\text{Std. Dev. salary}}). If we plot this normalized average salary and the log transformed short term growth we get this graph:

Interpretation: avgsalary indicated the percentage of higher/lower salary to the average (~$88,367) in std. dev (about $12,308). For example ABAP got a avgsalary of about 2, therefore the actual salary is 88,367 + 2 * 12,308 = 111,983.

Also I added a linear regression line which slope is actually significant. (\beta_1 = 0.26 and std. err. of \sigma_1 = 0.095
The data is quite fuzzy so don’t get overly excited. For example the data for ABAP is quite skewed because this also includes e.g. consultants. However, we can see a general trend for higher wages for trendier languages which is to be expected.
If we exclude our outliners, i.e. ABAP, Ada and Visual Basic, we get other data.
Average salary increases to $89,720 and its std. dev. decreases to $8,369 (about a third!). Our estimate gets a lot better (\beta_1 = 0.34 and std dev. \sigma_1 = 0.085). And our graph looks a bit different:

We can see even see some kind of clustering. One with languages with logshort > 3 and then there’s this Java, C++, C# cluster. Quite interesting!

SOPA: Donations and Preferences

Data

I saw on Hacker News a neat website posted called (http://www.sopaopera.org/). The comments stated some hypothesis, e.g. that donations of entertainment and internet companies predict the support or opposition of the SOPA bill.
The data is directly from sopaopera.org which itself aggregates it from various sites.

Graphs & Tests

After cleaning the data and importing it into Stata. I looked through it and plotted this box plot which shows how much contributions each group got by entertainment companies in comparison to entertainment and internet company contributions.

In case you don’t know how to read such plot. The thin bars indicate min and max values and the blue box indicates how many people are between the first and third quantile, i.e. 25% to 75% of the population. The line in the blue box shows the median.

You can see that the median for the opposition is about 35% contribution ratio in contrast to the 65% contribution ratio of the supporters. Afterwards, I wanted to test if this difference is significant. In fact, it is highly significant (95%, t = -4.73).

Furthermore, here’s a plot of absolute contributions log-transformed:

The next step is to do a logistic regression to check the prediction quality of each attribute. I regressed with age, party (is_democrat), seniority and quota of entertainment contributions (quota_ent). You can see the results:

 ------------------------------------------------------------------------------
       support |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]  
  -------------+----------------------------------------------------------------
           age |   .0258551   .0358136     0.72   0.470    -.0443382    .0960485
   is_democrat |  -1.252883   .6243361    -2.01   0.045    -2.476559   -.0292067
     seniority |  -.0262688   .0381962    -0.69   0.492     -.101132    .0485943
     quota_ent |   5.839435   1.447732     4.03   0.000     3.001933    8.676938
         _cons |  -1.968467    2.01512    -0.98   0.329    -5.918029    1.981096
  ------------------------------------------------------------------------------

We can see that is_democrat and quota_ent are significant not zero whereby quota_ent is the most significant. This isn’t so much of a surprise.