Coding for Data Science: A Get-Started Guide for Social Welfare Academics

Greetings from Princeton University!

I’ve been attending the Summer Institute in Computational Social Science (SICSS), hosted by the Russell Sage Foundation. This two-week program has been pretty intense so far, but the amount of learning I’ve done in the last few days has been fantastic. When I leave, I’ll have a couple of methods books/articles to read, at least 10 R packages to learn, and about 15 new papers to get started.


Looks like I’m gonna need to meditate more.

Anywho, I promised a few weeks ago that we would work our way through the 5 basic data science skills social welfare academics need. Today, we’re gonna talk about the most fundamental of those skills: coding.

Coding? Are you talking qualitative analysis to me?

When social welfare academics think about coding, they think about manual text analysis. Coding in that sense refers to tagging text to indicate themes of interest to the researcher.

Coding in data science means something very different. Remember that data science is the study of computer-assisted methods to collect, store, and analyze large (usually unstructured) datasets. In order to get a computer to help us wrangle all of this data, we have to communicate with it directly.


In order to get the computer’s help, we need to use a computer programming language. BUT, we also need the capacity to retrieve our data from storage, analyze it, and visualize our results. The best way to meet both of these needs is with a computer language like R or Python. Why? Because both of these languages are object oriented: that is, they bundle data into objects whose fields can be easily accessed by code. Plus, these objects can interact with one another, if we need them to.
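To make "object oriented" a little more concrete, here is a minimal sketch in Python. The `Survey` class, its field names, and the numbers are all made up for illustration; the only point is that the data lives inside the object, code can reach in and use it, and two objects can interact.

```python
from statistics import mean


class Survey:
    """A hypothetical object that stores survey responses in named fields."""

    def __init__(self, name, responses):
        self.name = name            # field: a label for the survey
        self.responses = responses  # field: a list of numeric scores

    def average(self):
        """Return the mean of the stored responses."""
        return mean(self.responses)

    def combine(self, other):
        """Interact with another Survey by pooling both sets of responses."""
        return Survey(f"{self.name}+{other.name}", self.responses + other.responses)


wave1 = Survey("wave1", [3, 4, 5])
wave2 = Survey("wave2", [2, 4])
pooled = wave1.combine(wave2)

print(pooled.name)       # wave1+wave2
print(pooled.average())  # 3.6
```

R has its own ways of building objects, so treat this as a picture of the idea rather than a recipe for both languages.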

Why do we care about object-based programming? Let’s compare it to SAS (written in the C computer language) and SPSS (written in Java). Both of these software packages are pretty easy to learn, because everything you need to do you can access via pull-down menus. But, let’s say you want to write some of your own functions, or better yet, manipulate your data beyond your options in the pull-down menus. In SPSS, you can write your own functions, but guess what languages you need? That’s right – R and Python. It’s probably possible to manipulate your data in Java, but the jury is out on how easy Java is to learn (even compared to R and Python). In SAS, in order to manipulate your data beyond the pull-down menus, you need to learn the SAS language. That’s pretty confusing at first: it isn’t C, but a separate proprietary language that SAS came up with.


More importantly, SAS and SPSS cost money. Lots of money. A subscription to SPSS costs about $99 per month, per user. It’s unclear from their website how much a subscription to SAS costs, but I suspect it’s close to the SPSS cost. R and Python are open source, which means anyone, anywhere in the world can use them and modify them however they want…for free.


So, let’s say you agree with my poorly developed argument above and are ready to start learning R and Python: which one do you learn first? There are many debates about which language is better: no one’s cornered the market yet. Personally, I prefer R, but I’ve been working with it longer.

In general, if you are trying to decide which language to learn, you want to think about the kind of data you want to analyze. Python tends to be better with text data, although R has gotten some new packages that have really improved its text capabilities. Both are great for data visualization and data retrieval, but R is (arguably) better at statistical analysis. Both have fairly steep learning curves (about 6 months to have working knowledge; a solid year to be moderately proficient), especially if you don’t have a computer programming background. If you’re interested in learning about each language more in depth, drop me a line and I’ll write a longer post for each.

Resources for learning more

Because R and Python are open source, there are plenty of free tutorials to get your hands dirty. Here are some of my favorites:

Try-R should be your first stop. This cool tutorial from Code School gives you hands-on programming experience, helping you figure out the syntax and usage for R’s basic functions. Plus, you earn badges along the way: winning!

Quick-R is my go-to site for all things basic R. The book, R in Action, is even nicer, especially if you feel like you need a hard copy reference manual in your life.

SICSS introduced me to DataCamp, and I really can’t recommend it enough. This is more like coding college (including tuition), but if you want to learn R or Python from home, this is a fantastic resource. They even have ‘career tracks’ to help you build the basic skills you need to move from no coding experience to Data Scientist.

Finally, the SICSS organizers (Matt Salganik and Chris Bail) are committed to open source data and making code available for all, for free. As a result, they have made all of the materials from the summer institute available online. A lot of the material is pretty advanced, in terms of coding, but I still think beginners can find a lot of useful information.

In particular, the SICSS website has a lot of important work thinking through the use of Big Data (which they call digital trace data) in social science. My next four posts will be all about this type of large, unstructured data: how to store it, how to analyze it, and how to visualize your results.

See you in two weeks!

Data Science for Social Welfare Academics

Iii’mm baaack!!

After successfully making it through my first year on the academic tenure track, I realized two things: (1) organization is key, and (2) I need to get back to this blog!

I had high hopes for what I could accomplish here last year, but I got in my own way by not making it a part of my strategic plan. No more! We are going to kick this summer semester off right with a series on a subject near and dear to my nerdy little heart: data science!

What’s data science?

Data science is one of the new buzzwords being whispered in the darkest corners of academic locker rooms across the world.

Academic locker rooms are a thing….trust me 😉

Together with Big Data, the phrase ‘data science’ conjures up images of unkempt, slouched geeks tapping away at a keyboard, with a screen looking like a scene from the Matrix.

images <- cool computer bro

The reality is much simpler. Data science is the study of computer-assisted methods to collect, store, and analyze large data sets. By computer-assisted, I mean that the data analyzed in data science problems requires a researcher to understand how computers process and analyze information, in order to get the most meaning from the data. This is because the data in data science problems is usually unstructured – in other words, it’s not produced with research in mind.

Where does unstructured data come from? The information age [think 1970 to present] has created hundreds of thousands of new devices: those devices produce data. Most of the data is about how humans live their lives: cell phones, activity trackers, social media, websites – things that are part of our everyday life now that weren’t even 15 years ago. Data science helps us use this data to answer social science questions.

For example, say you noticed a hashtag trending on Twitter that relates to your research interests. Ideally, you’d like to collect those tweets, store them securely, and analyze them to answer some research question. But how do you get the data? There are over 100k tweets using this hashtag – how do you store it all? Also, tweets have super complicated content – emojis, URLs, videos, pics, not to mention a whole lot of abbreviated words because of the 140-character limit: what’s the best way to analyze all this stuff?

Answer: data science!
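To give a taste of what untangling that content can look like, here is a small sketch in Python that pulls the hashtags and URLs out of a few tweets and counts the tags. The tweets, the `parse_tweet` function, and the two patterns are invented for illustration, and the patterns are deliberately simple; real tweet text is messier than this.

```python
import re
from collections import Counter

# Made-up example tweets; real ones would come from whatever collection method you use.
tweets = [
    "Rent is too high! #HousingCrisis https://example.com/report",
    "Join us Saturday #housingcrisis #Organize",
    "No tags here, just thoughts.",
]

# Simplified patterns (an assumption, not a full spec of tweet syntax).
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")


def parse_tweet(text):
    """Split one tweet into its hashtags, URLs, and leftover plain text."""
    hashtags = [tag.lower() for tag in HASHTAG.findall(text)]
    urls = URL.findall(text)
    # Remove URLs and hashtags, then collapse extra whitespace.
    plain = " ".join(HASHTAG.sub("", URL.sub("", text)).split())
    return {"hashtags": hashtags, "urls": urls, "text": plain}


parsed = [parse_tweet(t) for t in tweets]
tag_counts = Counter(tag for p in parsed for tag in p["hashtags"])

print(parsed[0]["text"])          # Rent is too high!
print(tag_counts.most_common(1))  # [('housingcrisis', 2)]
```

Lowercasing the tags is a design choice so that #HousingCrisis and #housingcrisis get counted together.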

You had me at data! Let’s get analyzing!


Remember that data science is about collecting, storing, and analyzing unstructured data. Before you can use data science as part of your research, you need to learn 5 things:

 5 Basic Data Science Skills

  1. How to code in the Python and R computer languages
  2. How to store large amounts of data
  3. How to process a lot of data (ex. parallel processing, MapReduce, batch processing)
  4. Methods to analyze large amounts of data (ex. machine learning)
  5. How to visualize your results (imagine using a table to summarize 100k tweets?! Eek!)
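
Skill 3 can sound abstract, so here is a rough sketch of the batch and map-reduce idea in Python: split the data into batches, count words in each batch separately (the "map" step), then merge the partial counts (the "reduce" step). The documents and function names are made up, and I'm using worker threads only to keep the example short; a real workload would more likely use separate processes or a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Made-up stand-in for a large pile of documents.
documents = ["housing policy", "housing market", "policy reform", "market reform policy"]


def make_batches(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(batch):
    """Map step: count the words in one batch."""
    return Counter(word for doc in batch for word in doc.split())


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


batches = make_batches(documents, size=2)

# Hand the batches to a small pool of worker threads.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, batches))

totals = reduce(merge_counts, partial_counts, Counter())
print(totals["policy"])  # 3
```

The nice thing about this shape is that no single worker ever needs to see all of the data at once.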

Every two weeks, I’ll post a new article discussing each of the five skills (don’t worry, I have an alert set!).

I’m off to nerd out at a computational social science institute. In the meantime, stay curious!


Housing the City in the Trump Era

Although I built the infrastructure of this blog a few weeks ago, I had a hard time figuring out what my first blog post should be. And then, Donald J. Trump was elected President of the United States. Hillary Clinton may have won the popular vote, but the Electoral College is expected to go for Trump on December 19th. And my first post became crystal clear.

Like many in the academic world, I have been wondering what this election result means for federal social programs. Although Trump has promised not to touch things like Social Security, Obamacare is on the chopping block, as is Dodd-Frank.

Housing has not been explicitly discussed by Trump, but several news articles and interviews offer clues about how he might feel. Trump’s father, Fred C. Trump, was known for the affordable rental housing he built around New York City, though he was far from altruistic in his reasons for doing so. Donald, from what I can tell, has never built an affordable housing unit in his life.

Donald is opposed to HUD’s Affirmatively Furthering Fair Housing Rule, which adds some teeth to the Fair Housing Act of 1968 by standardizing the fair housing assessment and planning process. In fact, Donald is anti-regulation of any kind, wants to dismantle the CFPB, and wants only private capital in the mortgage market. And we all know how he feels about marginalized populations who might have trouble accessing the mortgage market, or affordable rental housing.

Based on these clues, it doesn’t look good for affordable housing in the city for the next four years. But, the protests that have erupted in the days directly after the election suggest that progressives will not give up without a fight. ‘Don’t mourn; organize!’ and its variations have been a constant refrain in the days since this historic election. Now, more than ever, we need to be focused on solutions.

Over the next several weeks, I’ll be writing about strategies to safeguard the progress we have made on affordable housing and tactics for building momentum to secure the American dream of homeownership for those who still seek it.