While many worry about the privacy of data, it is incredible the volume of public data. Much of this data comes from government agencies. Health, crime, transportation and other data is there for the taking by anyone with an internet connection. For developers this is a good source of data both for testing and for applications. For marketers, people interested in public policy or the simply curious, you may be able to do some interesting data analysis using such data.
One major player in Open Data is the company Socrata. They make cloud based open data platforms for government bodies of all sizes. Much of their work is at the municipal level. I’ll be using two sets of this local data for Seattle Washington and Chicago Illinois in this and in upcoming lessons. For this lesson we’ll discuss the basic of exporting and using data from the portal. In upcoming lessons, we’ll work with more complex data and how to use different data formats.
Biking Through the Center of the Universe
The Fremont neighborhood of Seattle Has proclaimed itself the Center of the Known Universe. This quirky neighborhood has its own troll, a rocket, and has a statue of Vladimir Lenin. Adobe and Google both have offices here. Fremont also has one of the few bridges between the northwest neighborhoods of Seattle and downtown. In October 2012, Seattle installed bike sensors on the sidewalks of the bridge to count the number of bicycles that cross the bridge. That data is a rather simple data set we can look at to see patterns on bicycle traffic over this bridge.
Looking at Data on a Portal
Open a new web browser page and go to the Seattle data portal at data.seattle.gov. You’ll get a home page something like this:
While there are lots of good stuff on the home page, for each city it’s different. There will be a search box there on all pages. In the upper right, you’ll see a search box. Type Fremont in the box
Press the search button. The search results appear. Scroll down until you find Fremont Bridge Hourly Bicycle Counts by Month October 2012 to present
Each entry in the search box has several attributes to be aware of
The Category breaks down the data into major categories, such as Heath, Transportation, and Public safety. The data information section tells you how updated this data is. Most importantly is the type of data. Scrolling through the list you may have seen maps, filtered views, charts. and datasets. If you want data already set up for quick analysis, use maps, filtered views and charts. For the raw data, you need to select an entry with a Dataset type.
Click open the Fremont Bridge Hourly Bicycle Counts by Month October 2012 to present You’ll get the table of data like this:
When we talk about data we talk about Rows and fields. A field, sometimes called a column is one piece of information we are tracking. In this data set we have three fields: Date, Fremont Bridge West Sidewalk, and Fremont Bridge East Sidewalk. Each set of fields is a row, grouping the data together. Row one on our table has the October 3rd 2012 data for 12:00AM, when 4 bikes were on the West sidewalk, and 9 on the east side walk.
You can use the scroll bar on the right of the rows to scroll down the data set. Scrolling tot he bottom tells us how many rows we have.
At the time of this post, there were 31,344 rows. Some data sets will tell you in their search description how many rows there are, some do not. Scrolling to the bottom of the data set will always give you the size of the data set.
That’s a lot of data. You may not be interested in all of it. On the upper right side of the page you’ll find a control bar.
One of the most important buttons here is the Filter button, which reduces your data set. Click the filter button, and a new window appears:
Click the Add a New Filter Condition button. When a condition is true we keep the row, when a condition is false we ignore the row.
Each filter has three or four parts. The first is a field name, which defaults to the first field in the row. We’ll find all the rows for April 2016, so the Date field is the one we want. Next is the logical expression field, which has several choices we can pick from by clicking it:
Since want a range of April 1 2016 to April 30 2016, We’ll pick is between. For most rows, you’ll pick the one criterion that you are looking for. For example If I only wanted April 1 at midnight, I’d pick Date is 04/01/2016 12:00AM. However in between is a range, so it needs two fields. I chose this data
A convenient calendar shows up, but sets the time for the day to 12:00AM. That would exclude April 30, so I changed the time here to 11:00 PM to get all of April 30.
You’ll notice the check mark next to the filter. When the check is there, the filter is active. With the check there, our data changes to show the new filter.
Another button on the control bar is the export button
Press the export button, and you get a window like below. If the Download window isn’t open. Click the arrow to open it.
Most of this might not make sense, so let me quickly go through these choices. CSV, JSON, and XML are three ways to represent rows and fields in a text file. The CSV separates the fields into rows by a new line character, and fields usually by commas. There are variations in this, and one of the most common is the CSV for Excel.
XML uses a series of tags to represent rows and columns. There is a lot of ways of presenting data in XML. Newer Excel files are XML files and that is why these files have a X on the end. So XLS is an excel spreadsheet in the old format, XLSX is a spreadsheet in the XML format.
Finally we have PDF, or Portable Document format. If you want a printable format that is difficult to edit and have no need for number crunching, this is your choice. It needs a special reader. However many browsers and e-book application have a PDF reader built-in.
We’ll use the standard CSV for our data set. Click the CSV button. You will likely get a question of what to do with the file. You can open CSV files directly into Excel, even the ones not directly formatted for Excel. CSV files will work with any spread sheet and are importable into almost any database. I tend to download them then start an application. Save the file.
With notepad or some similar application, open the file. You’ll find this:
Let’s look a little closer at this data. In a CSV file, the first line of data is the header, which has titles for the fields. Excel is smart enough to take these three titles and make them the titles for our rows. For rows 2 on we have our data in the same order as the header.
If you are a developer, you can use this file in your app for testing. For everyone else, you are probably going to want to look at it in Excel. Find the file on your disk, and double-click it. Most likely it will open automatically into your preferred spreadsheet such as Excel or Numbers.
You’ll notice Excel has hash marks ######## where some of the dates are. Excel didn’t adjust for the width of the column, where numbers did. Click and drag the separator between Column A and B to adjust the width. Do the same between B and C and C and D so you can see both your titles and data easily
I’m going to use Excel for the rest of this post. This data may make more sense as a bar chart. Click on the A1 cell.
Now scroll down to the end of the data. Shift-Click on the last cell in column C to select all the data:
In the charts ribbon, select a Clustered Column bar chart. This will compare data between bikes on the west side of the bridge and bikes on the right side.
When you click the button, we get a small chart:
By dragging the blue bars we can make a bigger chart.
Excel did a few things here automatically. It formatted the chart for you. It also does something we may not have wanted. It grouped the data for the chart into days instead of hours. Excel’s smallest date unit for an axis is days. If you want hours, you’ll need to change the axis type in the format axis dialog. Right click the labels on the x axis. You’ll get a menu.
Click Format Axis… Click the Scale option and you’ll get a dialog box like this one, depending on your version.
We have an automatic axis type, which is a date, so we see hours. A hack to get around this is to change the axis type to Text.
I made one other change when I did this. We’ll have hundreds of labels that will just mush into one another. To clean things up I changed the Interval between labels to 24 so I only have one label per day. Press Okay, and you get a very different chart:
Both our charts give us some patterns seen in a graph not seen in the raw data. The Fremont bridge has its highest traffic on weekdays, suggesting that it is people commuting to work. The other bit of data is the higher frequency for the east side of the bridge than the west side. If bikes followed traffic, they should be equal, but there is a significant difference. Even between the morning commute and the afternoon commute, there’s a difference. One could explore why, by looking at data by the day like this:
You could go to other sources of data. Looking at a map for geographic or topographic information might be helpful. We might do some more number crunching to find a numerical ratio between east and west traffic. For the month of April 2016 that ratio is 46905:46734 or 1.003. We do have pretty close to equal traffic by that number, so we can keep digging from there to explain the peaks we see.
Exploring data can be fun. Figuring out mysteries like the inequality of traffic on the bridge can be entertainment, or tell you which side of the road is best for a bike shop and what hours to have more help than others. There’s more we can do with the Socrata platform. In our next lesson we’ll go into another way to quickly get data from the platform.