Coding for Data Science: A Get-Started Guide for Social Welfare Academics

Greetings from Princeton University!

I’ve been attending the Summer Institute in Computational Social Science (SICSS), hosted by the Russell Sage Foundation. This two-week program has been pretty intense so far, but the amount of learning I’ve done in the last few days has been fantastic. When I leave, I’ll have a couple of method books/articles to read, at least 10 R packages to learn, and about 15 new papers to get started.

Looks like I’m gonna need to meditate more.

Anywho, I promised a few weeks ago that we would work our way through the 5 basic data science skills social welfare academics need. Today, we’re gonna talk about the most fundamental of those skills: coding.

Coding? Are you talking qualitative analysis to me?

When social welfare academics think about coding, they think about manual text analysis. Coding in that sense refers to tagging text to indicate themes of interest to the researcher.

Coding in data science means something very different. Remember that data science is the study of computer-assisted methods to collect, store, and analyze large (usually unstructured) datasets. In order to get a computer to help us wrangle all of this data, we have to communicate with it directly.

In order to get the computer’s help, we need to use a computer programming language. BUT, we also need the capacity to retrieve our data from storage, analyze it, and visualize our results. The best way to meet both of these needs is with a computer language like R or Python. Why? Because both of these languages are object-oriented: that is, they store data in objects whose fields can be easily accessed by code. Plus, these objects can interact with one another, if we need them to.
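
If "objects that store data in fields" sounds abstract, here is a minimal Python sketch of the idea. The Respondent and Survey classes, their field names, and the numbers are all made up for illustration; they don't come from any real package or dataset.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Respondent:
    """One survey respondent. The fields are hypothetical."""
    name: str
    age: int
    household_size: int


@dataclass
class Survey:
    """A collection of respondents that can answer questions about itself."""
    title: str
    respondents: list = field(default_factory=list)

    def add(self, respondent):
        # One object (the survey) holds and interacts with others (respondents).
        self.respondents.append(respondent)

    def mean_age(self):
        # Each respondent's data is read directly from its field: r.age
        return mean(r.age for r in self.respondents)


survey = Survey("Example household survey")
survey.add(Respondent("A", 34, 3))
survey.add(Respondent("B", 52, 1))
survey.add(Respondent("C", 28, 4))

# Prints the average of the three made-up ages (34, 52 and 28).
print(survey.mean_age())
```

The point isn't the arithmetic. It's that the data (ages, household sizes) and the code that works on it live together, so asking a new question usually means adding one short method.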

Why do we care about object-oriented programming? Let’s compare it to SAS (largely written in the C computer language) and SPSS (whose interface is written in Java). Both of these software packages are pretty easy to learn, because everything you need to do you can access via pull-down menus. But, let’s say you want to write some of your own functions, or better yet, manipulate your data beyond your options in the pull-down menus. In SPSS, you can write your own functions, but guess what languages you need? That’s right – R and Python. It’s probably possible to manipulate your data in Java, but the jury is out on how easy Java is to learn (even compared to R and Python). In SAS, in order to manipulate your data beyond the pull-down menus, you need to learn the SAS language. That can be confusing: it isn’t C, but a separate language SAS developed itself.

More importantly, SAS and SPSS cost money. Lots of money. A subscription to SPSS costs about $99 per month, per user. It’s unclear from their website how much a subscription to SAS costs, but I suspect it’s close to the SPSS cost. R and Python are open source, which means anyone, anywhere in the world can use them and modify them however they want…for free.

So, let’s say you agree with my poorly developed argument above and are ready to start learning R and Python: which one do you learn first? There are many debates about which language is better: no one’s cornered the market yet. Personally, I prefer R, but I’ve been working with it longer.

In general, if you are trying to decide which language to learn, you want to think about the kind of data you want to analyze. Python tends to be better with text data, although R has gotten some new packages that have really improved its text capabilities. Both are great for data visualization and data retrieval, but R is (arguably) better at statistical analysis. Both have fairly steep learning curves (about 6 months to have working knowledge; a solid year to be moderately proficient), especially if you don’t have a computer programming background. If you’re interested in learning about each language more in depth, drop me a line and I’ll write a longer post for each.
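
To give you a feel for what working with text data looks like in Python, here is a small sketch that counts words using only the standard library. The word_counts function, the sample sentence, and the stopword list are my own illustrative assumptions, not a recommended text-analysis pipeline.

```python
import re
from collections import Counter


def word_counts(text, stopwords=frozenset()):
    """Count how often each word appears, skipping any stopwords.

    A hypothetical helper for illustration only.
    """
    # Lowercase the text, then pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)


# A made-up snippet standing in for interview notes or case files.
notes = "Housing was unstable. Housing costs rose, and income did not."

counts = word_counts(notes, stopwords={"was", "and", "did", "not"})

# Show the most frequent word and its count.
print(counts.most_common(1))
```

This is also a tiny example of writing your own function: once word_counts exists, you can run it on one sentence or on ten thousand documents without touching a menu.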

Resources for learning more

Because R and Python are open source, there are plenty of free tutorials to get your hands dirty. Here are some of my favorites:

Try-R should be your first stop. This cool tutorial from Code School gives you hands-on programming experience, helping you figure out the syntax and usage for R’s basic functions. Plus, you earn badges along the way: winning!

Quick-R is my go-to site for all things basic R. The book, R in Action, is even nicer, especially if you feel like you need a hard copy reference manual in your life.

SICSS introduced me to DataCamp, and I really can’t recommend it enough. This is more like coding college (including tuition), but if you want to learn R or Python from home, this is a fantastic resource. They even have ‘career tracks’ to help you build the basic skills you need to move from no coding experience to Data Scientist.

Finally, the SICSS organizers (Matt Salganik and Chris Bail) are committed to open source data and making code available for all, for free. As a result, they have made all of the materials from the summer institute available online. A lot of the material is pretty advanced, in terms of coding, but I still think beginners can find a lot of useful information.

In particular, the SICSS website has a lot of important work thinking through the use of Big Data (which they call digital trace data) in social science. My next four posts will be all about this type of large, unstructured data: how to store it, how to analyze it, and how to visualize your results.

See you in two weeks!