R is for everything

R is free, open-source statistical programming software descended from S, which came out of Bell Labs. RStudio is a commonly used user interface for R. Both can be downloaded for Mac, Windows, or Linux. R is widely used and well established, so it is highly unlikely to disappear anytime soon.

R is great for custom data visualizations and advanced statistical analysis. It also forces you to be structured and repeatable in your data analysis–the process of interacting with your data requires explicitly writing out the steps of interaction, unlike Excel or similar approaches. Once you have powered through the learning curve you can quickly summarize and visualize your data.

Many statisticians (perhaps a majority?) use R and share their most recent work through R packages that extend the functionality of “base R” (the initial installation). Packages that I commonly use include RColorBrewer, plyr, ggplot2, lattice, stringr, and reshape2, and there are many other useful packages out there. Some additional suggestions can be found here and googling will lead to many more results. R also offers a variety of open-source datasets, either bundled as part of a package or as the whole purpose of a package, such as the census data. R also has communities supporting particular aims, such as the rOpenGov project.

R does a good job of handling situations common to real data analysis, such as missing values or cleaning strings. It can handle large data (and even Big Data) through a variety of packages such as pbdR. It can also be used with qualitative or social science data. It can be used to create maps. It can be used with LaTeX (via, for example, Sweave) and websites (via, for example, shiny) so your analysis can be directly embedded in your output files. This can be very convenient and can reduce errors as your data processes update or your datasets change based on new information.

R is somewhat difficult to learn, though there are extensive online resources that help the process. Resources include:

The R-help mailing list. A great resource, but use with caution–google first! Someone has probably asked your question already (especially in the beginning).
A collection of R blogs. Great for keeping up with new work in the area and getting a scan of what’s out there.
Blogs for starting off with R, for example or resource lists.
Blogs for newer R users, for example, or this, or many others.
R FAQ. Useful, but not the most easily accessible document when you’re first starting.
The R Conference. An intense group, but a lot of fun and very informative.

R does some fun things too, like:

displaying your favorite xkcd cartoon
creating animations
telling your fortune
playing games (minesweeper, sliding puzzles…) with the fun package
talking to twitter

I would (and have!) definitely recommend R to a friend. I’d like to do something more physical than visual for my final data story, but I plan to use R for the initial data exploration and cleaning…and it’s possible I’ll get so sucked into that work that I’ll end up staying in the visualization space.

Timeline.js – creating interactive timelines

What can you do? What kind of stories is it good for?

With timeline.js, one can quickly make interactive timelines that contain various types of embedded media, such as images, maps, videos, and tweets. The timeline is automatically generated from a Google spreadsheet, so one just needs to enter the data in the right format.

This tool would be useful for stories where one needs to create a timeline quickly. The creators recommend choosing stories with a “strong chronological narrative,” as opposed to those that involve jumping around in the timeline.

The media ends up being the focus of each event in the timeline, so one should have a lot of media in mind for the timeline; otherwise, it will look bare/repetitive with only text. The creators of timeline.js also recommend that

I can also see a lot of uses for non data story contexts – you could make a timeline of a person’s life, a timeline of a breaking news event, a timeline of government policies, etc. There are many examples of real publications using this tool on the website.

spreadsheet_screenshot — Timeline.js template with examples

houston-example — Timeline of Whitney Houston’s life. Source: http://timeline.knightlab.com/examples/houston/

How do you get started?

The documentation on the timeline.js website makes it very easy to get started. I was able to set up their provided template timeline for editing in ~5 minutes.

To publish a new timeline:

1) Open a copy of the template and edit the data in the spreadsheet
2) Publish the spreadsheet to the web
3) Copy the URL of the spreadsheet into the online generator box
4) Embed the iframe into your website. I was able to make a new timeline.html document, paste the generated code into the document, and open the file locally in Chrome to see the timeline.
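Step 4 can also be scripted. The following is a minimal sketch, assuming you have already copied the embed code from the online generator. The `EMBED_SNIPPET` value is a placeholder (not real generator output), and `build_page` is a hypothetical helper name.

```python
from pathlib import Path

# Placeholder: replace this with the iframe code the online generator gives you.
EMBED_SNIPPET = '<iframe src="PASTE-GENERATED-URL-HERE" width="100%" height="650"></iframe>'


def build_page(embed_snippet, title="My timeline"):
    """Wrap an embed snippet in a minimal HTML page."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f'<head><meta charset="utf-8"><title>{title}</title></head>\n'
        f"<body>\n{embed_snippet}\n</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # Writes timeline.html in the current directory; open it in a browser.
    Path("timeline.html").write_text(build_page(EMBED_SNIPPET), encoding="utf-8")
```

Because the page only wraps the iframe, it does not need to be regenerated when the spreadsheet changes.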

How easy/hard is it?

The tool is pretty straightforward. There are two main things to learn: the structure of the spreadsheet (i.e. where to paste items, and how the spreadsheet corresponds to the generated UI), and setting up the test html page to see your changes. The example timeline and the corresponding template make this pretty easy to figure out. One thing to note is that there cannot be any empty rows in the spreadsheet.
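The no-empty-rows rule is easy to check before publishing. Here is a minimal sketch, assuming you have exported the spreadsheet as CSV. The function name `find_empty_rows` and the sample column names are hypothetical and are not part of timeline.js or its template.

```python
import csv
import io


def find_empty_rows(csv_text):
    """Return 1-based row numbers whose cells are all blank."""
    reader = csv.reader(io.StringIO(csv_text))
    empty = []
    for row_number, row in enumerate(reader, start=1):
        # A row counts as empty if it has no cells or only whitespace cells.
        if all(cell.strip() == "" for cell in row):
            empty.append(row_number)
    return empty


# Hypothetical export: these header names are placeholders,
# not the template's real column names.
sample = "Start Date,Headline\n2015-03-01,First event\n,\n2015-03-09,Second event\n"
print(find_empty_rows(sample))
```

Any row number it reports is a row to delete or fill in before publishing.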

One nice thing is that once you’ve set up the html document with the iframe, any changes you make in the spreadsheet will be reflected if you refresh the page – no need to generate new iframes/copy paste every time.

No coding is necessary to use this tool (besides pasting the iframe into a webpage). Though the tool is open source, and one can download the source code to further customize timelines, the online interface already provides many options for customization, such as font choice, default zoom level, which slide to start at, etc, making it suitable for most use cases.

Would you recommend this to a friend? Will you consider using it for your final data story?

I’d recommend this tool to a friend! It’s straightforward to set up, and you can embed many different types of media. I’d consider using it for my final data story if there was a need for a timeline.

Also, though this is nominally a tool to make timelines, it’s also a nice slideshow viewer. I can imagine downloading the source code and modifying the display to only show the slideshow parts, while hiding the actual timeline ticker.

spring_break_example — A card I made: spring break

nltk: all the computational linguistics you could ever want, and then some

What can you do?

nltk is a Python module that contains probably every text processing tool you’ve ever had a vague inkling of a need for. It contains corpora of language for machine learning/training; word tokenizers (which split sentences into individual words or n-grams); part-of-speech taggers; parsers that break sentences into parts of speech (with trees!); and much, much more. It’s good for analyzing lots of text for sentiment analysis, text classification, and tagging mentions of named entities (people, places, and companies).
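To make “tokenizer” and “n-gram” concrete, here is a minimal sketch in plain Python. It is a standard-library stand-in for illustration, not nltk’s own implementation; the function names `tokenize` and `ngrams` are our own, and the regex is a deliberately crude assumption about what counts as a word.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a crude regex tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("One does not simply download nltk.")
print(tokens)
print(ngrams(tokens, 2))
```

nltk’s real tokenizers handle punctuation, contractions, and sentence boundaries far more carefully; this only shows the shape of the input and output.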

How do you get started?

The creators of nltk have published a book for free online that explains how to use many of the features nltk has. It explains how to do things like access the corpora that nltk has; categorize words; classify text; and even build grammars. Basically, the best way to get started is to install nltk, then go through the book and try the examples they present. They include code examples in the book so you can follow along and practice using different functions and corpora. There’s also a wiki attached to the GitHub repository, and Stack Overflow, where programmers go when they’re lost, is of course a useful (but often very specific) resource. The learning curve required to become comfortable leveraging the different functions available is fairly steep because there are so many and they are so specialized, and in my opinion the best way to gain that comfort level is to simply play around with nltk and build cool things to gain experience. Simply reading the book, while interesting, won’t be enough to become good at using nltk.

How easy or hard is it?

Well, it’s certainly easier than writing all of this from scratch, no matter how competent a programmer you are. The one thing that can be difficult with Python modules is that you’re not entirely sure what’s under the hood unless you get cozy with the source code. That means you might not be sure what’s causing a performance issue, why it doesn’t like your input, or why your output looks a certain way. Also, figuring out exactly which function to use for a specific task might be somewhat confusing as well unless you have a certain amount of experience in machine learning or know exactly what you want (it’s hard to go wrong with tokenization). For example, the built-in classifier is only as good as the features you feed it; giving it too many high-dimensionality items might result in overfitting or just horrendously slow code, and giving it low-dimensionality items might mean it can’t classify the items effectively. Experience with Python datatypes and object-oriented programming is also very, very important; if you don’t understand what a function is, what list comprehensions look like, and how Python dictionaries work, the example code given in the book will be incomprehensible. Even though the printouts from the example code look very nice and fancy and clean, the knowledge behind their creation (how do you print things that look nice? what is a development set? how do you use/leverage helper functions like tokenizer and the nltk function that gets the n most common words/letters? how do decision trees work?) is far from simple. Anyone with programming experience can use the simpler functions very effectively and the less simple functions with probable success, but in my opinion knowing how classifiers and parsers work is important to use them well. The bottom line is that they’re only as good as what you feed them, and understanding how definitive or accurate their output is requires a degree of understanding of what’s under the hood.
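The point about features can be made concrete. Below is a minimal sketch in plain Python, not nltk’s own code: each tokenized document becomes a dictionary of boolean features, and the vocabulary size controls how many features a classifier would see. The names `build_vocabulary` and `document_features` and the toy documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents, size):
    """Keep only the `size` most frequent words across all documents."""
    counts = Counter(word for doc in documents for word in doc)
    return [word for word, _ in counts.most_common(size)]


def document_features(doc, vocabulary):
    """Map a tokenized document to one boolean feature per vocabulary word."""
    present = set(doc)
    return {f"contains({word})": word in present for word in vocabulary}


# Hypothetical, already-tokenized toy documents.
docs = [["great", "fun", "movie"], ["dull", "movie"], ["great", "cast"]]
vocab = build_vocabulary(docs, size=2)
print(vocab)
print(document_features(docs[1], vocab))
```

Raising `size` gives the classifier more to work with but also more chances to overfit and more work per document, which is the trade-off described above.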

Would I recommend this to a friend?

If that friend had a similar programming background to me (can write Python code pretty well; knows a little bit about machine learning) I’d recommend it with few reservations other than a warning about the learning curve and the overwhelming abundance of options. I’d still suggest they at least skim the book and keep Stack Overflow close at hand (although that’s true for most programming projects that venture into unknown territory). If my friend wasn’t comfortable with machine learning, I’d suggest they read up on Wikipedia about whatever classifiers they use so they have an idea of why the classifier misbehaves, if it does, or what errors it’s likely to make. And if they weren’t comfortable with programming, I’d suggest they look into other natural language processing tools. This is a tool that’s made by programmers and scientists, and it shows in the documentation, the resources, and the wealth of options available to those who know how to use them.

tl;dr: nltk has a ton of really cool natural language processing tools. However, they are by no means idiot-proof, and you will be sad if you don’t know Python. One does not simply download nltk and spit out useful results in five minutes.

RAW: Create Simple Visualizations Quickly

What is it?

RAW is an online drag-and-drop tool for uploading csv data and creating common visualizations such as scatterplots, treemaps, and circle packing diagrams.

RAW is open source and provides guides for adding your own visualization types (using D3.js).

What is it good for?

RAW has 16 visualization types which are built using drag-and-drop and can be customized to a minor degree. If you need to generate several common visualizations to support your data story, RAW can make them very quickly.

Be warned that RAW runs in a web browser and cannot handle large datasets (i.e. more than a few MB). Furthermore, since many of the visualizations display all the data points, a visualization produced from a large dataset will be cluttered and unreadable.

Thus, RAW is good for stories that require several simple visualizations built on a dataset consisting of small to medium sized csv files.

How do you get started?

Since RAW is simple to learn, you can jump right in and start using it. For a quick intro, consult the video tutorial. For further information, consult the Github wiki.

If you are a developer trying to add a new chart type to RAW, consult the developer guide.

Is it easy? What skills do you need?

RAW guides you step-by-step through building the visualization. Therefore, it’s easy to learn. Beyond understanding what each visualization means, RAW requires no additional skillset, which makes it very easy to use.

The primary challenge in using RAW is understanding each type of visualization. For example, if you don’t know what a Voronoi Tessellation is, then RAW gives you no guidance on how to interpret the visualization.

For developers, extending RAW requires a knowledge of the JavaScript language and the D3.js library. Familiarity with Scalable Vector Graphics (SVG) and Angular.js may also be useful.

Would I recommend it?

I would highly recommend RAW as a tool for building visualizations to support a data story or for finding possible stories. Visualizations can be built quickly with RAW, so it’s useful for exploring your dataset by building visualizations. Furthermore, since the visualizations can be exported as SVG, HTML, PNG, and JSON, it’s easy to embed them into an article or similar data story.

If you are working with a large dataset (e.g. several MB or more), RAW may not be able to handle all your data. Furthermore, the visualizations may be too cluttered.

If you want precise control over your visualization, RAW may be too restrictive for you. Although it’s possible to add features to the code, it may be quicker to build the visualization using a different tool.

Would I use it?

I think I will use RAW to help me generate ideas as I peruse my datasets. Since I am interested in maps, games, and interactive data stories, I don’t think I will use RAW to create my final product.

Usage

Here’s how RAW can make a circle packing diagram using a dataset about the 2014 Global Hunger Index around the world.

International Food Policy Research Institute (IFPRI); Welthungerhilfe (WHH); Concern Worldwide, 2014, “2014 Global Hunger Index Data”, doi:10.7910/DVN/27557 International Food Policy Research Institute [Distributor] V1 [Version]

Analyzing Text Data

After a great introduction to text analysis and PMFs (probability mass functions) from Allen Downey of Olin College (author of Think Stats), students had a chance to play with quantitative text analysis. Grabbing lyrics from a website and analyzing them with our WordCounter tool, the students looked at the words and phrases used most often by various artists.
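The idea behind this activity can be sketched in a few lines. This is a minimal illustration in plain Python, not the WordCounter tool’s code: it counts words and turns the counts into a PMF, where each word’s value is its share of all words. The function name `word_pmf` is our own and the text is a placeholder, not real lyrics.

```python
from collections import Counter


def word_pmf(words):
    """Return a dict mapping each word to its share of all words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Placeholder text, not real lyrics.
words = "you and me and you".split()
pmf = word_pmf(words)
print(Counter(words).most_common(2))
print(pmf["you"])
```

Comparing PMFs rather than raw counts lets you compare artists whose lyrics differ in total length.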

Here are some notes on running this text analysis activity.

Here are some pictures of what they sketched out in the 20 minutes I gave them:

which artists talk about “you”, “I” and “me”

which parts of the body Nicki Minaj and Eminem talk about

the repeating chorus refrains of the Indigo Girls

how much different artists talk about love

Our Questions about Food Security

We’ll be forming teams for final projects. Each project will explore some topics related to food security. To help start forming teams, we brainstormed topics and questions you are interested in exploring. Read this list and decide which topic is most interesting to you:

Local Food / Sustainability

role of small farming in current global food economy
farmers market prices vs. other supermarkets (how can we make local products more accessible, esp to low-income households)
local food & community-building
effectiveness / scalability of CSA/local food (also freegans)
how can we address local food suppliers? what problems are they facing?
how far do people around the world have to travel to get food?
comparative look at success of community garden types (public, private, school, government-run, non-profit, etc)

Environment / Climate Change

how has climate change affected food security?
climate change & crop production in the midwest USA
relation between climate change & food security & what we can do to affect either / both
impact of projected climate change on food suppliers
fluctuations in the amount of food aid a community needs as seasons change
environmental / toxicological food security impact from long-term contamination of food staples in populations

Nutrition

nutrition distribution (ex % fat, % protein) around the world
how and what are we gonna eat in 20 years? food vs. future
what is the total monetary and nutritional value at the cut-off value for food insecurity, and how do these values vary by state or by country
what are the best practices for feeding children in a safe and healthy way?
how can we improve nutrition in local public schools?

Economics / Indicators

compare / analyze food insecurity level and other health stats (obesity, heart disease, mental disorder-depression, addiction, etc)
food security as an indicator of other measurements / inequalities of wellbeing (ie. economics, education)
relationship between economic condition and what is considered an average / balanced meal (beliefs about food / nutrition)
price fluctuations and impact on poor
relationship between economic condition and nutrition availability – what can you eat on a certain budget (by country)
how does food insecurity impact education / academic performance among children in the US?
relationship between school food programs and school performance by economic condition
quantifying social & economic outcomes of food security?

Outreach / Education

food security education around Boston area – in school (for kids), for parents
how can we better educate people about food security?

Policy

investigating gov policies that exacerbate (or help address) food insecurity, what are the drivers / forces at work? who benefits from such policies?
comparisons of different strategies of tackling food security issue
how have governments been keeping their promises wrt food security goals?
how to identify policy gaps and other factors outside of food supply levels that correlate with food insecurity?
do efforts directed at food security end up being counter-productive? (the source of the problem lies elsewhere?)

Waste

how can you confirm that food is properly utilized (ex. eaten)?
how can we improve food usage and limit waste?
how can we use data to characterize and understand the causes and scope of food waste around the world?
how do the locations with the highest food waste compare to those with the highest food insecurity? Is there a correlation between food waste and food security?

Food System

genetically engineered crops (starting w/green revolution) and food security
effect of GMOs on food supply / safety
pesticides / GMO and food security
animal agriculture / factory farming
how does availability of natural resources (water, land, etc) affect food security of a community?
how is agriculture being impacted (urbanization, desertification) and how does it affect food security?

Other Stuff

relating yelp (micro) and dining data to macro scale / views such as food security indicators published by UN
how to solve food deserts
food stamps and how that affects # of visitors to soup kitchens, etc per week (I think they’re given out @ the beginning of the month)
prison food security
how food insecurity changes the decisions you make every day
how do you ensure accuracy of data? (regarding nutrition, diet, etc)

Demographics of Boston Districts and Neighborhoods

Author: Tami Forrester

I chose to look at one dataset that showed the race distribution by city council districts in 2010, based on information from the census. Unfortunately, it was presented in pdf form, which limited interactivity, though I found it interesting that of a total population of 6.5 million, white people accounted for around 81 percent of them, and led the population totals in all but three districts – Districts 4, 5 and 7. After looking at the dataset, I thought of the following questions:

Neighborhoods vs Districts? Could these be mapped out?

Looking through this table and other datasets left me confused as to how city council districts compared or related to neighborhoods. According to a link on the City of Boston website, the districts are mapped out as so:

I was also able to find another map showing crowdsourced neighborhood boundaries based on a survey.

I tried to overlay the two images to see if it would make for an easy comparison, though it is mostly confusing to look at.

overlaymaps — Dark black lines refer to city council district boundaries, and shadings refer to the crowdsourced neighborhoods

I also searched through datasets on the City of Boston site, but most only contained data about neighborhoods, and didn’t show relationships between them and districts. I was able to find a document on the City of Boston site, which compared the racial distribution over both districts and neighborhoods. However, trying to convert this data into an interactive form proved very tedious because it was locked in a pdf. Even after using an online tool to convert pdfs to excel spreadsheets, the formatting made it difficult to work with in Tableau.

How have these demographics changed over time?

Another Google search led me to yet another pdf of data showing how racial demographics have changed for the specific years 1990, 1993 and 2002. I wasn’t sure why these specific years were chosen, and I didn’t try to analyze this in Tableau, but I was able to look it over and see trends. I found it interesting that the number of people identifying as each race was exactly the same in 1993 and 2002, though the distribution across districts differed between the two years. For example, 140,305 people identified as black in both 1993 and 2002. The number of black people per district was not the same in both years, however.

What are some characteristics of the different districts?

For two characteristics, I looked at datasets listing the public schools in Boston and the crime incidents reported by Boston police in 2012. Unfortunately, the schools were not mapped to their zones, but the crime incidents also included the zip code and region they were reported in. Using Tableau, I mapped the number of incidents that were reported in a particular region, and created a pie chart with neighborhoods mapped to the percentage of incidents reported.

Screen Shot 2015-03-09 at 11.58.10 PM — Mapping of zip-codes colored by the number of crime incidents from Tableau. Regions that were more “green” had the most reports.

Pie chart as generated in Tableau. I couldn’t figure out how to place the labels for the currently unlabeled sections (which did actually have neighborhoods)

The crime incident reports also had a field called “reptdistrict”, which was presumably another metric used to characterize a particular region, though it was unclear what it meant.

Snow and Icy Sidewalks of Cambridge

Authors: Desi Gonzalez, Stephen Suen

One interesting finding from looking at the data:

We chose to look at two open datasets from the city of Cambridge: the first documented unshoveled and icy sidewalk complaints since January 1, 2008, and the second recorded snow and ice sidewalk ordinance violations since December 1, 2007. Looking at the datasets, we noticed that snowfall complaints seem to be grouped around a day or a span of a few days. This made sense, considering that these entries likely correspond to major snowfalls. However, we noticed a few entries that are unusually out of season (one in September here, one in May there), which might be due to human error when entering data.

Are schools more likely to be closed when there are more unshoveled/icy sidewalks?

Public school closures – We found this data by using Twitter search (which was recently updated to include all historical tweets) on the Cambridge Public Schools account for “Cambridge Public Schools will be closed,” the boilerplate language the CPSD uses to announce school closings. However, these results only go as far back as the Twitter account and do not cover the entire range of the sidewalk data set.

(2015) Jan 27-28; Feb 2-3, 9-10
(2014) Jan 3, 22; Feb 5
(2013) Feb 8, 11
(2012) Oct 29 – Hurricane Sandy (not relevant)

University closures – Once again, we used Twitter search on @MIT, but this time there was no standard template so we just searched for “closed” and manually went through the tweets to include/exclude dates as appropriate. This process could be repeated for every university; another option would be to use the Twitter API to automate this given a list of university Twitter handles.

(2015) Jan 27-28; Feb 9-10
(2014) Jan 2
(2013) Feb 8
(2012) Oct 29 – Hurricane Sandy (not relevant)
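With both lists in hand, the overlap can be computed directly. A minimal sketch using the dates transcribed from the two lists above; the variable names are our own, and Hurricane Sandy is dropped because it is flagged as not relevant.

```python
from datetime import date

# Dates transcribed from the two closure lists above.
school_closures = {
    date(2015, 1, 27), date(2015, 1, 28), date(2015, 2, 2), date(2015, 2, 3),
    date(2015, 2, 9), date(2015, 2, 10),
    date(2014, 1, 3), date(2014, 1, 22), date(2014, 2, 5),
    date(2013, 2, 8), date(2013, 2, 11),
    date(2012, 10, 29),
}
mit_closures = {
    date(2015, 1, 27), date(2015, 1, 28), date(2015, 2, 9), date(2015, 2, 10),
    date(2014, 1, 2),
    date(2013, 2, 8),
    date(2012, 10, 29),
}

# Hurricane Sandy is marked as not relevant to snow, so leave it out.
not_snow = {date(2012, 10, 29)}
school_snow = school_closures - not_snow
mit_snow = mit_closures - not_snow

both = sorted(school_snow & mit_snow)
schools_only = sorted(school_snow - mit_snow)
mit_only = sorted(mit_snow - school_snow)

print("both closed:", [d.isoformat() for d in both])
print("schools only:", [d.isoformat() for d in schools_only])
print("MIT only:", [d.isoformat() for d in mit_only])
```

The same set operations could later be applied to the dates with the most sidewalk complaints.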

How does the frequency of unshoveled/icy sidewalks relate to weather data (temperature/precipitation)?

Weather Underground has tables of temperature, precipitation, and events (e.g. “snow”) going back to 1920. The maximum query is about 13 months from the specified start date, so 7 different queries would be required to get all the data since 12/1/2007. The tables can be downloaded as CSVs and combined into a single table. At this link, we tracked down a query from 12/1/2007 to 1/1/2009.
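The query count can be checked by splitting the date range into 13-month windows. A minimal sketch in plain Python; it does not call Weather Underground, the helper names are our own, and the end date of March 1, 2015 is an assumption rather than something stated above.

```python
from datetime import date


def add_months(day, months):
    """Shift a date forward by whole months, landing on the 1st of the month."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def query_windows(start, end, months_per_query=13):
    """Split [start, end) into consecutive windows of at most 13 months."""
    windows = []
    while start < end:
        stop = min(add_months(start, months_per_query), end)
        windows.append((start, stop))
        start = stop
    return windows


# The start date comes from the text; the end date is an assumption.
windows = query_windows(date(2007, 12, 1), date(2015, 3, 1))
for begin, stop in windows:
    print(begin.isoformat(), "to", stop.isoformat())
print(len(windows), "queries")
```

The first window runs from 12/1/2007 to 1/1/2009, matching the query we tracked down.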

Are the major roadways that are deemed “snow emergency routes” more or less likely than smaller streets to have snow or icy sidewalk complaints or violations?

The City of Cambridge has identified several major arteries on which, during a snow emergency, cars are not allowed to park. A quick Google search led to cambridgema.gov’s map of snow emergency parking restrictions. We also found a PDF that lists the streets from the intersection where the restriction starts until the intersection where it ends, as well as whether the sides affected are the odd-numbered buildings, the even-numbered buildings, or both sides of the street. Neither source is easy to access or plug into visualization tools like Tableau, so we would have to do some creative copy-and-paste work or research which building numbers are included within these parameters.
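Once building numbers have been researched, the restriction list could be represented and queried as below. This is a sketch under assumptions: the `Restriction` class, its field names, and the example street and numbers are all made up for illustration and are not taken from the city’s PDF.

```python
from dataclasses import dataclass


@dataclass
class Restriction:
    street: str
    first_number: int  # lowest affected building number (assumed known)
    last_number: int   # highest affected building number (assumed known)
    side: str          # "odd", "even", or "both"

    def covers(self, street, number):
        """Return True if the building falls under this restriction."""
        if street != self.street:
            return False
        if not self.first_number <= number <= self.last_number:
            return False
        if self.side == "both":
            return True
        # Odd numbers match "odd"; even numbers match "even".
        return (number % 2 == 1) == (self.side == "odd")


# Made-up street and numbers, for illustration only.
rule = Restriction("Example St", 100, 300, "odd")
print(rule.covers("Example St", 151))
print(rule.covers("Example St", 152))
```

Each complaint or violation address could then be tagged as on or off a snow emergency route before comparing counts.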

Boston’s Urban Orchards

The dataset we looked at was a record of fruit-bearing trees available for urban foraging (with the caveat that you should ask for permission before foraging). The dataset included the GPS coordinates of the tree and the address near where it was found, as well as the organization responsible for the tree in cases where such an organization existed; the species of the tree; and its condition.

The data questions we came up with were primarily about the characterization of neighborhoods containing more fruit trees. One interesting thing we noticed was that many of the trees were near schools (the location label included a school name); maybe this was a consequence of many schools having gardens. We found the following school locations and school gardens datasets (from data.cityofboston.gov) that would help answer this question: we could color or highlight the locations of school trees, or use overlaid heat maps of school density and tree density in order to show these relationships. We also wondered whether there was a correlation between fruit tree density and income, specifically whether higher income neighborhoods were more likely to have more fruit trees, and found the following economic characteristics of Boston dataset. However, we discovered an even easier way to get economic and population data within the Tableau Public app.

We mapped the Urban Orchards data using Tableau Public, coloring the trees by fruit and overlaying maps of per capita income and also the density of housing units. We found, surprisingly, that per capita income appeared to be negatively correlated with the presence of fruit trees; this could be a result of selection bias, or of the fact that schools and other public community buildings in Boston are not in high-income residential neighborhoods, or of other reasons we have not thought of. As expected, we see few fruit trees in very densely populated residential areas, and we see that the areas with lower income and fewer trees appear to have lower housing density as well, suggesting neighborhoods that may have been designed to be low-cost public housing.

Data Hunt: Food Pantries

Team: Mary Delaney, Edwin Zhang

We began by selecting a dataset on food banks and food pantries in the city of Boston. This data set included the names, addresses, and hours for food pantries throughout the city. In total, it had eighty-three unique food pantries and food banks.

One interesting fact that we noticed in looking at the data is that many of the food pantries were centralized to a few zip codes. Over one-quarter of all the listed food pantries were located in either the 02118 or 02139 zip codes, corresponding to Boston and Cambridge, respectively.

When looking at the data, we sought to answer three questions.

How are food pantries distributed geographically throughout the Boston area?
How do food pantry locations compare with the locations where food is grown?
How does food pantry density compare with the income of an area?

To answer the first question, we only looked at the Food Pantries dataset. We found that the food pantries were distributed among twenty-seven zip codes. However, further examination showed that twenty-three of the eighty-three food pantries are localized to two zip codes, and fourteen zip codes had only one food pantry. On average, there were about three food pantries per zip code.
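These figures can be checked with quick arithmetic. A small sketch using only the numbers stated above; the variable names are our own.

```python
pantries = 83
zip_codes = 27
in_top_two_zip_codes = 23
single_pantry_zip_codes = 14

average_per_zip = pantries / zip_codes
top_two_share = in_top_two_zip_codes / pantries

# What is left after removing the top two and the single-pantry zip codes.
other_zip_codes = zip_codes - 2 - single_pantry_zip_codes
other_pantries = pantries - in_top_two_zip_codes - single_pantry_zip_codes

print(f"average pantries per zip code: {average_per_zip:.2f}")
print(f"share in the top two zip codes: {top_two_share:.1%}")
print(f"{other_pantries} pantries in the other {other_zip_codes} zip codes")
```

The average of about three hides a skewed distribution: half the zip codes have a single pantry while two zip codes hold over a quarter of them.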

Answering the second question required finding an additional dataset that contained information about where food is grown. We found this data in the Urban Orchards dataset on the Boston City Data Portal. Urban orchards aren’t intended for large-scale food production, but rather indicate a community emphasis on growing fruit trees for learning or preservation.

We then reduced the data to the number of food banks and the number of urban orchards in each zip code. Using zip code for location revealed that urban orchards were also largely localized to a few zip codes, much like food pantries were. However, urban orchards and food pantries were centralized in different locations. In addition, five zip codes that contained food pantries did not have any urban orchards.
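The reduction to per-zip-code counts can be sketched as follows. This is an illustration with made-up records, not our actual data or results: each list stands in for the zip code column of one dataset, and the variable names are our own.

```python
from collections import Counter

# Made-up records: one zip code per pantry or orchard row.
pantry_zips = ["02118", "02118", "02139", "02121", "02130"]
orchard_zips = ["02130", "02130", "02121", "02119"]

pantries_per_zip = Counter(pantry_zips)
orchards_per_zip = Counter(orchard_zips)

# Zip codes that have at least one pantry but no orchard.
pantry_only = sorted(set(pantries_per_zip) - set(orchards_per_zip))

for zip_code in sorted(set(pantries_per_zip) | set(orchards_per_zip)):
    # A Counter returns 0 for zip codes it has never seen.
    print(zip_code, pantries_per_zip[zip_code], orchards_per_zip[zip_code])
print("pantries but no orchards:", pantry_only)
```

The resulting pairs of counts per zip code are what we graphed against each other.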

By graphing the data, we can also see a vague relationship between the number of food pantries and the number of urban orchards by area. Generally, areas with more food pantries have urban orchards.

This also seems to indicate that food pantries exist where a sense of community is more prevalent, as the upkeep of both urban orchards and food pantries takes the willpower of a community.

We looked at getting income information by zip code from city-data.com, which provides information like median household income and population around Boston and Cambridge. While the page exists as a map, the information is provided also in text form and can be scraped and then compared to both the data on urban orchards and food pantries.

Sources:

Food Pantries (https://data.cityofboston.gov/Health/Food-Pantries/vjvb-2kg6)

Urban Orchards (https://data.cityofboston.gov/Health/Urban-Orchards/c7cz-29ak)

Boston Income (http://www.city-data.com/zipmaps/Boston-Massachusetts.html)