An anthropologist colleague who did a post-doc in a population center has been trying to get a group of people at his university together to think about population issues. This is something I'm all for, and I am happy to help in whatever little way I can, especially when it means anthropologists developing their expertise in demography. One of the activities they have planned for this population interest group is a workshop on the R statistical programming language. The other day he wrote me with the following very reasonable question that has been put to him by several of the people in his group: Sure, R is free, but other than that, why should someone bother to learn new software when there is perfectly acceptable commercial software out there? This question is particularly relevant when one works for an institution like a university, where there are typically site licenses and other mechanisms for subsidizing the expense of commercial software (which can be substantial). What follows is, more or less, what I said to him.
I should start out by saying that there is a lot to be said for free. I pay several hundred dollars a year for commercial software that I don't actually use that often. Now, when I need it, it's certainly nice to know it's there, but if I didn't have a research account paying for this software, I might let at least one or two of these licenses slide. I very occasionally use Stata because the R package that does generalized linear mixed models has had a bug in the routine that fits logistic mixed models, and this is something that Stata does quite well. So I regularly get mailings about updates, and I am always just blown away at the expense involved in maintaining the most current version of this software, particularly when you use the Intercooled version. It's relatively frequently updated (a good thing), but these updates are expensive (a bad thing for people without generous institutional subsidies). So, let me just start by saying that free is good.
This actually brings up a bit of a pet peeve of mine regarding training in US population centers. We have these generous programs to train population scientists and policy-makers from the poor countries of the world. We bring them into our American universities and train them in demographic and statistical methods on machines run by proprietary (and expensive!) operating systems and using extremely expensive proprietary software. These future leaders will graduate and go back home to Africa, Asia, eastern Europe, or Latin America. There, they probably won't have access to computers with the latest hardware running the most recent software. Most of their institutions can't afford expensive site licenses to the software that was on every lab machine back at Princeton or UCLA or Michigan or [fill in your school's name here]. This makes it all the more challenging to do the work that they were trained to do and leaves them just that much further behind scholars in advanced industrial nations. If our population centers had labs with computers running Linux, taught statistics and numerical methods using R, and had students write LaTeX papers, lecture slides, and meeting posters using, say, Emacs rather than some bloated word-processor whose menu structure seems to change every release, then I think we would be doing a real service to the future population leaders of the developing world. But let's return to the question at hand: other than the fact that it's free -- which isn't such an issue for someone with a funded lab at an American university -- why should anyone take the trouble to learn R? I can think of seven reasons off the top of my head.
(1) R is what is used by the majority of academic statisticians. This is where new developments are going to be implemented and, perhaps more importantly, when you seek help from a statistician or collaborate with one, you are in a much better position to benefit from the interaction if you share a common language.
(2) R is effectively platform independent. If you live in an all-Windows environment, this may not be such a big deal, but for those of us who use Linux/Mac and work with people who use Windows, it's a tremendous advantage.
(3) R has unrivaled help resources. There is absolutely nothing like it. First, the single best statistics book ever is written for R (Venables & Ripley, Modern Applied Statistics with S -- remember R is a dialect of S). Second, there are the many online help resources, both from r-project.org and from many specific lists and interest groups. Third, there are proliferating publications of excellent quality. For example, there is the new Use R! series. The quantity and quality of help resources are not even close to being matched by any other statistics application. Part of the nature of R -- community constructed, free software -- means that the developers and power users are going to be more willing to provide help through lists, etc. than someone in a commercial software company. The quality and quantity of help for R is particularly relevant when one is trying to teach oneself a new technique or statistical method.
(4) R makes the best graphics. Full stop. I use R, Matlab, and Mathematica. The latter two applications have a well-deserved reputation for making great graphics, but I think that R is best. I quite regularly will do a calculation in Matlab and export the results to R to make the figure. The level of fine control, the intuitiveness of the command syntax (cf. Matlab!), and the general quality of drivers, etc. make R the hands-down best. And let's face it, cool graphics sell papers to reviewers, editors, etc.
(5) The command-line interface -- perhaps counterintuitively -- is much, much better for teaching. You can post your code exactly and students can reproduce your work exactly. Learning then comes from tinkering. Now, both Stata and SAS allow for doing everything from the command line with scripts like do-files. But how many people really do that? And SPSS...
(6) R is more than a statistics application. It is a full programming language. It is designed to seamlessly incorporate compiled code (like C or Fortran), which gives you all the benefits of an interactive language while allowing you to capitalize on the speed of compiled code.
(7) The online distribution system beats anything out there.
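As a rough illustration of reasons (5) and (7), here is a minimal sketch of the kind of script one might post for students to rerun and tinker with. Nothing in it comes from the original exchange: the function name simulate_fit, the seed, and the simulated data are made up for the example, and lme4 is named only as one example of a package distributed through CRAN (which I am assuming is the distribution system meant in reason 7). The install line is commented out because it needs a network connection.

```r
# Reason (7): an add-on package installs from CRAN with a single call.
# Uncomment to run; lme4 is just an example package.
# install.packages("lme4")

# Reason (5): a script that anyone can rerun and get the same numbers.
# simulate_fit is a hypothetical helper written for this illustration.
simulate_fit <- function(seed = 42, n = 100) {
  set.seed(seed)                  # fix the random number stream
  x <- rnorm(n)                   # simulated predictor
  y <- 2 + 3 * x + rnorm(n)       # simulated response, true slope is 3
  coef(lm(y ~ x))                 # least-squares intercept and slope
}

first  <- simulate_fit()
second <- simulate_fit()

print(first)                      # estimates should land near 2 and 3
print(identical(first, second))   # same seed, same script, same answer
```

Students can then change the seed, the sample size, or the true slope and watch what happens to the estimates, which is where the tinkering comes in.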
Oh, and let's face it, all the cool kids use it...
Well-written. I was trying to explain R to someone who just completed an introductory stats course using SPSS right before I read this - I surely would have done a better job if I'd read this first.
An up-and-coming R help resource which, frankly, spanks R-help in terms of friendliness is the R section of stackoverflow.com (http://stackoverflow.com/questions/tagged/r). Uncannily, I might be asking a question there about the best packages for mixed models in the next couple of days...
Agreed! I think it's a huge disservice to teach students statistics and econometrics in proprietary software. When they get a job, they'll have to learn whatever software that specific company uses. So too will the R student, but how hard would it be to convince a company to install some free software? They can hit the ground running from day 1.