- 1 Knight-Mozilla-MIT "Insider/Outsider" Hack Day
- 1.1 Logistics
- 1.2 HackDash: Project Teams and Ideas
- 1.3 Data "White Whales"
- 1.4 Tools & APIs
- 1.5 Communications
- 1.6 Places to Eat and Drink
Knight-Mozilla-MIT "Insider/Outsider" Hack Day
The Knight-Mozilla OpenNews project is sponsoring a 24-hour hack day as a lead-in to the 2013 MIT-Knight Civic Media Conference. While the conference is invite-only, the hack day is open to talented developers who want to spend their weekend working with others to build amazing things.
Following the conference theme of "Insider/Outsider," this hack day will be focused on civic data and opening up challenging data sets.
If you're tweeting about this hackday, please use the #datahack hashtag.
Full writeup of the hack day is also on Source.
- Where: MIT Media Lab, 5th Floor, 75 Amherst Street, Cambridge, MA 02139, USA
- When: 3pm sharp Saturday June 22 to 4pm Sunday June 23
- Will there be food? Yes, there will be food. We will be providing dinner on Saturday night, breakfast and lunch on Sunday, and snacks throughout.
- What to bring: You will need to bring your own laptop and power supply. Also, bring any challenging civic data sets you've been wanting to wrangle.
- We'll supply the WiFi, the plugs, and collaboration and brainstorming materials like post-its, sharpies, etc.
Heading into Sunday, here are some of the requests people had for assistance with their projects.
- "Judgmental" Court Decision Scraper - could use help creating an easy search/index. Would like to do search with Elasticsearch or AWS; have experience with Sphinx/Solr, but want to do something easier.
- CivOmega - anybody with better ideas about taking sentences and making them into queries rather than just using regular expressions, e.g. natural language processing
- Open Gov Data Guide - if anyone has a particular data set they'd want to share, please add it. If you have any good examples of environmental data, talk to Saul.
- NY Drug Price Data - looking for an expert at data analysis; would like to know the best way to pick intervals for coloring and to normalize the data.
- OpenOpenNewsNews - help from person who is interested in data visualization.
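For the NY Drug Price Data request above, one common starting point (not the only answer) is to use quantile breaks for the color intervals, so that each color covers roughly the same number of data points, and min-max scaling for normalization. The sketch below is only an illustration: the prices are made-up sample values, and the function names are hypothetical.

```python
from statistics import quantiles


def quantile_breaks(values, n_classes=5):
    """Return the n_classes - 1 cut points that split values into equal-count bins."""
    return quantiles(values, n=n_classes, method="inclusive")


def classify(value, breaks):
    """Return the 0-based bin index of value, given ascending break points."""
    return sum(value > b for b in breaks)


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical, heavily skewed prices (not real drug price data).
prices = [4.0, 5.5, 6.0, 7.25, 9.0, 12.0, 15.0, 18.5, 40.0, 120.0]

breaks = quantile_breaks(prices, n_classes=5)
bins = [classify(p, breaks) for p in prices]
normalized = min_max_normalize(prices)
```

With skewed data like this, equal-width intervals would put almost every price into the lowest color class, while quantile breaks spread the colors evenly. The trade-off is that quantile classes can hide how extreme the top values are, so it may be worth showing the break values in the map legend.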
- 3pm Opening Circle
- 7pm Dinner
- 10pm Building closes
- 9am Building opens
- 9am Breakfast
- 11am Brunch
- 2:45pm Show and Tell (with pizza)
- 3:45pm Closing Circle
Want to get in touch with other hack day participants?
- Join the #opennews channel on irc.mozilla.org
- Tweet questions/ideas with the #datahack hashtag. Tag @opennews with any questions about the event.
- Email OpenNews if there are any questions/concerns/ideas that should be emailed to the group on off hours.
HackDash: Project Teams and Ideas
How to use HackDash
- Go to our HackDash page. There's an example project listed to show the basic project format.
- To create or join a project, log in with Twitter (if you don't have a Twitter account, email Erika for assistance).
- To create a project:
- Click create a project.
- If your project exists on GitHub, you can import some of the fields from your GitHub repo using the GitHub importer.
- Give your project a title and description. Both of these items will be shown on the project card on the HackDash page. You can also include a photo associated with your project.
- If there's a link associated with your project, you can include that as well as any topical tags.
- The final drop-down list is for the "state" of your project, which you can update as the project progresses from brainstorming to wireframing and so on.
- Click create project!
- Once a project exists, anyone can join, like, or follow the project. The Twitter avatars for team members are displayed at the bottom of the project card.
- Each project card also includes a Disqus comment thread where team members can communicate or other people can offer feedback on the idea.
It should be that simple. We'll have an easy way to collaborate and see all of the projects from the hack day. Please go ahead and start adding project ideas. If a project looks interesting to you, join the team.
Let Erika know if you have any questions or run into issues with the setup. This tool is in active development by Dan Zajdband, so we can get help with any questions and any feedback is much appreciated.
Data "White Whales"
As a way of giving attendees some things to chew on right from the start, we invited a number of civic data experts to give us lists of their data "white whales"--high-value datasets that are currently difficult to access.
From Derek Willis, New York Times:
- U.S. House of Representatives Foreign Travel Reports
- U.S. State Department Public Schedule
- White House Office of Management and Budget Meeting Records
From Waldo Jaquith, Virginia Decoded:
Waldo notes: It's a huge obstacle that I simply haven't put any time into dealing with. Every few months I spend half an hour on trying to put together a system to systematically scrape data out, get discouraged, and give up. Footnotes, blockquotes, and page numbers just kill me, although even if I could get the raw text decently, rendered terribly, I could still extract great metadata from them.
From John Keefe & Stephen Menendez, WNYC:
- 2013 New York City Council budget document (warning: large PDF download)
- NYPD Motor Vehicle Accident Data
From Phil Ashlock, Civic Agency:
- City officials contact info (per state)
From Daniel X O'Neil, Smart Chicago Collaborative/Everyblock: Daniel, check out: http://stopfrisknyc.github.io/.
The data is amazingly detailed (here's a great primer), and lends itself to great visualizations (here's one re: 2009 data). The data itself is published in a format that is highly inaccessible to regular people (notwithstanding the fact that it is extremely well-structured as an SPSS portable file). Publishing this info as an easy-to-search, RSS-ready list of items would be high value.
This is a gem of a lookup tool that cries out for scraping and simple display. The disparity in drug prices is often profound, even in the space of a few blocks. This fits into a general new trend/huge opportunity to call out disparate health care costs, given the Medicare Provider Charge Data provided by the U.S. government recently.
Given the reality of a new mayor in LA, it would be good to look for some data in this enormous city with very little available civic data. It has always bothered me that the LAPD has an exclusive relationship with the LA Times on crime data: http://www.lapdonline.org/crime_prevention/content_basic_view/42390. Muffing that up might be fun.
This is an enormous, underutilized cache of crime data. Chicago gets lots of attention and plaudits for their crime data, but the Dallas stuff goes even farther back (2000!) and contains narrative that will make your eyes bleed. They have the actual comments typed into the system by actual police officers, including graphic details about horrible crimes and a huge amount of profanity. This is a researcher's treasure chest.
Tools & APIs
Have tools or APIs that would be helpful for this hack weekend? List them here.
Manuel Aristarán and Mike Tigas will release Tabula 1.0 on the Hack Day. We will also work on the tool itself. If you want to get your hands dirty with Ruby code and fun table extraction heuristics, feel free to join!
Tabula is split in two components:
There's a lot of room for improvement in both components. If you feel like helping out, reach out and we'll find you something to work on :)
City of Boston Data Portal
You can browse the Data Portal using a D3 Interactive.
Get in touch with Nick Doiron during the hack day for help with Boston data (especially maps).
Data for the Boston metro area and dozens of other cities are available for download at http://metro.teczno.com/
What would you do with 100,000+ documents and extracted text from 170 city and town municipalities in Vermont? We collect city and town documents from select board meeting minutes, planning and zoning committees and other local government legislation. These are often published as PDFs and difficult-to-scrape HTML. We classify the documents, extract entities [People, Companies, Locations] and terms, and make them searchable. This is a corpus of partially structured raw text from hundreds of cities and towns.
Open Data Tech Review by ODI
From Marcio Vasconcelos: a wiki compiled by the Open Data Institute listing several tools.
The DemocracyMap API aims to provide normalized structured data for all the contact details and other primary information for every government body and government official that represents you (limited to the US for now). Simply enter an address or lat/long and get back the full stack of government bodies and elected officials. Much of this API relies on third parties, so it essentially aggregates, normalizes, and caches a variety of data sources, including geospatial boundary queries and scrapers on ScraperWiki. It does not yet sync all data in a central datastore, so performance is not nearly as efficient as it could be, because much of the aggregation happens on the fly.
The current coverage includes primary contact information for every city, county, and state in the United States as well as contact information for all state and national legislators, all governors, all county officials, and over 100,000 municipal officials.
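The aggregate, normalize, and cache pattern described above can be sketched roughly as follows. This is not DemocracyMap's code or its API: the two source functions, the field names, and the `Aggregator` class are all hypothetical stand-ins, and real sources would be network calls rather than hard-coded records.

```python
from typing import Callable, Dict, List, Tuple

Location = Tuple[float, float]  # (latitude, longitude)
Record = Dict[str, str]


def city_source(location: Location) -> List[Record]:
    """Hypothetical third-party source with its own field names."""
    return [{"Name": "Example City Hall", "Phone": "555-0100", "level": "city"}]


def state_source(location: Location) -> List[Record]:
    """Hypothetical second source that names the same fields differently."""
    return [{"title": "Example State Legislature", "telephone": "555-0199", "level": "state"}]


def normalize(raw: Record) -> Record:
    """Map each source's field names onto one shared schema."""
    return {
        "name": raw.get("Name") or raw.get("title") or "",
        "phone": raw.get("Phone") or raw.get("telephone") or "",
        "level": raw.get("level", ""),
    }


class Aggregator:
    """Query every source on the fly, normalize the results, cache per location."""

    def __init__(self, sources: List[Callable[[Location], List[Record]]]):
        self.sources = sources
        self.cache: Dict[Location, List[Record]] = {}
        self.upstream_calls = 0  # counts how often a source is actually hit

    def lookup(self, location: Location) -> List[Record]:
        # Round the coordinates so nearby repeat queries share a cache entry.
        key = (round(location[0], 4), round(location[1], 4))
        if key not in self.cache:
            records: List[Record] = []
            for source in self.sources:
                self.upstream_calls += 1
                records.extend(normalize(r) for r in source(location))
            self.cache[key] = records
        return self.cache[key]


agg = Aggregator([city_source, state_source])
first = agg.lookup((42.3601, -71.0942))
second = agg.lookup((42.3601, -71.0942))  # served from the cache
```

The first lookup pays the cost of calling every source, which is the on-the-fly aggregation that makes uncached requests slow. The second lookup for the same location returns the cached list without touching the sources again.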
Thanks to Waldo, the folks at Semantria have offered participants access to their Text Analytics and Sentiment Analysis APIs. Here's what they have to say:
We do have great documentation on the support section of our website. Here's a link to all of our support pages: http://support.semantria.com
There are so many articles I feel like it can be a bit overwhelming. Here is one that is strictly on sentiment analysis: http://support.semantria.com/customer/portal/articles/834168-about-semantria-s-sentiment-analysis
Here is a link to our video page: https://semantria.com/excel/tutorial
Once again, there are quite a few videos, so here are the two I recommend people to check out:
Building Categories for Survey Analysis (it's nice because it's a use case): https://www.youtube.com/watch?v=_pYsJdOqKE4&feature=player_embedded
Sentiment Analysis: https://www.youtube.com/watch?v=Ypdf4QbokXo&feature=player_embedded
Here is a link that participants can register with: https://semantria.com/user/login_register
All of our accounts come with 10k documents for free. Please let them know that if they need more than 10k, they simply have to contact someone from Semantria and we can load their account with as many calls as they need. Please let me know if there's anything else I can do to help out from my end before the event starts.
Google Civic Information API
The Google Civic Information API allows developers to build applications that display civic information including polling place, early vote location, candidate data, and election official information to users. The initial version of the API is geared towards election-related information for the United States. We will have data for the upcoming New York City Mayoral election.
Archive.org TV News Closed Caption Data
You can use this script to query Archive.org's TV News archive (http://archive.org/details/tv) and return a JSON dump. Background and example: http://www.niemanlab.org/2013/03/tracking-memes-across-television-news-a-tool-for-analyzing-how-stories-move-through-broadcast/
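As a rough illustration of the query-and-get-JSON idea, and not a copy of the script linked above, here is a minimal sketch that only builds a search URL for Archive.org's general advanced search endpoint. The collection identifier `tvnews` is an assumption, the function name is hypothetical, and this endpoint searches item metadata, so it may not match closed-caption text the way the linked script does.

```python
from urllib.parse import urlencode

# Archive.org's general advanced search endpoint.
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(phrase, rows=50, collection="tvnews"):
    """Build a JSON search URL; the default collection id is an assumption."""
    params = [
        ("q", f'collection:{collection} AND "{phrase}"'),
        ("fl[]", "identifier"),  # fields to return for each result
        ("fl[]", "title"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url("stop and frisk", rows=10)
```

Fetching the resulting URL with `urllib.request.urlopen` and parsing the body with `json.load` would give a JSON result set, provided the collection name is right.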
LazyTruth Misinformation Database
Get in touch with @mstem if you want to talk about creative uses for a credibility API, consisting of rumors and myths and their associated debunks.
If you're blogging about the event, please link to it here.
Places to Eat and Drink
We'll be feeding you dinner on Saturday and breakfast and lunch on Sunday. Additionally, we'll be providing snacks and plenty of caffeine throughout. That said, if you're looking to find food some other time while you're in town:
- Za: Homemade pickles, excellent wine list. Rest of menu is pizza and salad and THAT'S IT. Across the street from the Marriott.
- Mary Chung's: Classic MIT hangout. Chinese food. Eat the Suan La Chow Show. In Central Square. Lip-dragging distance from Le Meridien; an eight-minute walk from the Marriott.
Do You Want To Eat A Vegetable?
Lisa Williams' Guide to Things To Do, See and Eat in Boston: From a native.