Our next big goal, The Fifty State Project
- Written by
- Joshua Ruihley
- Date
- 02/26/2009 6:32 p.m.
Those of you who are familiar with Open Congress know that its power lies not in making legislative information available, but instead in how it makes legislation accessible by allowing people to interact with and repurpose what Congress produces. Unfortunately, hurdles remain in creating a better democracy at the local level and shedding light on state legislation. At Sunlight Labs, we've been thinking about this problem for a while and now is the time for a fix.
When Open Congress was launched and the source code released, it didn't take long for Jim Caralis to build Open Mass, a site that follows the legislative events of Massachusetts. Just like Open Congress, Open Mass makes data accessible that was once merely available by keeping track of the latest news, hot issues, and popular bills and legislators in the Massachusetts legislature. The release of Open Mass created the hope that one by one, open sites for each state legislature would be built, but sadly, the cascading effect that we all hoped for didn't happen. The problem? Data for state legislatures is available, but not yet accessible.
Many of the tools needed to build an Open Congress for all fifty states already exist, but the one (and most important) missing link is an openly available, structured database of state legislation. Just as Open Congress relied on structured legislative data made available by GovTrack.us, usable data must exist for state-level legislation before we can start knocking out sites for every state.
Out of curiosity, I visited the website of the Kentucky Legislature to see how simple it would be to scrape each bill from the site and store it in a database. It turned out to be surprisingly simple. Using Python and the Beautiful Soup library, I quickly set up a script to scrape and store bills from sessions of the Kentucky Legislature dating back to 2001. The entire process took about five hours and the vast majority of that was spent staring at the screen watching the scraper do its thing. Granted, not all states will be as simple as Kentucky, but it remains a relatively simple task to squeeze state legislation into a structured format.
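For anyone wondering what a scraper like this involves, here is a minimal sketch of the scrape-and-store idea. It is not the script I ran against the Kentucky site: the sample markup, the `bills` table layout, and the names `BillListParser`, `scrape`, and `store_bills` are all made up for illustration, and it uses Python's built-in `html.parser` and `sqlite3` modules in place of Beautiful Soup so that it runs without installing anything. A real state site will need its own parsing rules.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup for illustration only; real legislature pages differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/bills/HB1.html">HB 1</a> - AN ACT relating to roads.</li>
  <li><a href="/bills/SB2.html">SB 2</a> - AN ACT relating to schools.</li>
</ul>
"""


class BillListParser(HTMLParser):
    """Collect (bill_id, url, title) tuples from <li><a>ID</a> - title</li> items."""

    def __init__(self):
        super().__init__()
        self.bills = []
        self._in_item = False   # inside an <li>
        self._in_link = False   # inside the <a> of the current <li>
        self._href = None
        self._bill_id = ""
        self._title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_item = True
            self._href, self._bill_id, self._title = None, "", ""
        elif tag == "a" and self._in_item:
            self._in_link = True
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "li" and self._in_item:
            self._in_item = False
            if self._href:  # skip list items that do not link to a bill
                self.bills.append(
                    (self._bill_id.strip(), self._href, self._title.strip(" -\n"))
                )

    def handle_data(self, data):
        if self._in_link:
            self._bill_id += data   # link text is the bill identifier
        elif self._in_item:
            self._title += data     # remaining text in the item is the title


def scrape(html):
    """Parse one page of HTML and return the bills found on it."""
    parser = BillListParser()
    parser.feed(html)
    parser.close()
    return parser.bills


def store_bills(conn, session, bills):
    """Write bills for one legislative session; re-running replaces old rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bills ("
        "session TEXT, bill_id TEXT, url TEXT, title TEXT, "
        "PRIMARY KEY (session, bill_id))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO bills VALUES (?, ?, ?, ?)",
        [(session, bill_id, url, title) for bill_id, url, title in bills],
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_bills(conn, "2009", scrape(SAMPLE_HTML))
    for row in conn.execute("SELECT * FROM bills ORDER BY bill_id"):
        print(row)
```

Keying the table on session and bill ID means the scraper can be run again over the same session without piling up duplicate rows, which helps when a long crawl has to be restarted.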
While no single developer has the time to volunteer writing a custom scraper for each state, the goal of having data for all fifty states is entirely attainable if we come together and share the workload. This is where you come in. We need your help databasing state legislation. To coordinate, we've set up project pages on the Sunlight Labs Wiki and GitHub to share scraping utilities, data, and ideas. We will also be promoting the Fifty State Project at a series of "hackathons" that we're hosting at various events around the country. The hope is that soon, we'll have a standardized database and APIs to work from, putting the goal of "an Open Congress for all fifty states" within reach.
Give it a shot. Pick a state and start scraping! It doesn't matter what language you're proficient in; the point is to get the data to a state where we can work together to make sense of it. If your experience is anything like mine was, it will only take a few hours of your time and you'll leave with the satisfaction of knowing that you freed important government information, allowing it to be used in new and meaningful ways.
Discussion
It would be great to get a central spot where people working in various states could find each other, even if they don't want to collaborate once they do.
Wonderful project, guys. I think you'll find the legislative branch sites to be pretty good. Far better than the disclosure and ethics sites.
One "key" to think about as you're scraping is the names of the committees through which the bills passed. Our L-CAT tool uses Project Vote Smart's APIs to group our campaign-finance data. Getting as much uniformity in committee names as possible will allow us to build legislation into our existing tool, as well as vice versa.
Or, working on unique committee IDs may solve this as a next step.
Great work.
Is there going to be some kind of universal data format? That way, a single web interface could easily access all of the scraped data without needing a custom reader for each data set.
What's the target format for the data? I.e., after it's scraped, how do you want it delivered?
I am working on a Library of Congress NDIIPP grant project that is taking a similar approach, but this is much more ambitious. Several years ago there was an unsuccessful attempt to create a standard XML schema for state bill drafting systems to facilitate data sharing. The Minnesota project is exploring a legislative metadata standard and a native XML database pilot to bring together legislative data from some of our partner states with an API to expose the data for reuse. The aim is to promote data standards and help states not only address preservation concerns but also improve the accessibility of legislative data. I believe our project will ultimately further this effort (and with any luck improve the quality of the data also). Researchers in legislatures across the country would love to be able to see how other states have handled similar issues. I see this as a way to further transparency, improve the quality of government, and help ensure preservation because valuable information is maintained.
Great project. I put a note about California on the wiki; it will be straightforward to get the raw data from 1993 onward without any page parsing.
Kyle - Yes, the goal is to have a universal format to represent all state legislation. Right now we're at the point where "we don't know what we don't know," so the thinking is that we can grab the data for each state, put it in a relatively flat format, then take a look at what we have. Only when we have all of the data--or at least the structure of all fifty states--will we know what the final format will be.
We were busy with Transparency Camp this weekend but will be updating the wiki with our vision of what the initial flat files should look like.
Great stuff, Nancy. Thanks. Is there a URL for more information on your project?
Amanda - We set up a page on the Sunlight Labs Wiki to share with everyone what you're up to. Sunlight Labs also has a Google Group to communicate with other developers in the labs community. Thanks!
Josh - I know that it's completely buried, but the information is at the URL linked from my profile. The metadata draft is in the XML schema working group's 10/29 meeting summary. A team member also did a survey of XML bill drafting systems, which is posted there. The meeting summary for the XML database planning will be online soon, along with an overview of government data mashups intended to help raise awareness of projects like 50 states.
I forgot to ask - is anyone thinking about creating triple stores for the data acquired (like govtrack)? I saw the entity extraction initiative listed on the wiki.
"universal format to represent all state legislation"!!!????
As much as I love Sunlight's advocacy for transparency in government, transparency is not going to be achieved with a single computer application. To start, every state has a different organizational structure. Plus, even if you could in theory find common themes and build an application to scrape the data, so much happens behind closed committee doors, and the majority of the states are not legally required to record the majority of the legislative process.
Sunlight needs to start funding the grunt work that only hardworking researchers can provide to the public.
That being said, I do want to offer constructive advice to aid any and all efforts at government transparency. Try taking a hard look at Project Vote Smart's State Key Votes Program, as they were able to identify the variety of ways states organize their legislative process and the many challenges you'll run into.
I wish you the best of luck!
Great job, excellent article. Thank you.
FYI - http://www.mnhs.org/preserve/records/legislativerecords/
Thank you for this article. I've looked at the end.