Channel: ProgrammableWeb - Government

How Indiana’s Legislative Site Foiled Attempts to Scrape It

Related APIs: 
Sunlight Foundation Open States
Headline on Actual Article : 
Opening up Indiana’s hard to reach legislative data
URL To Article: 
http://sunlightfoundation.com/blog/2015/04/14/opening-up-indianas-hard-to-reach-legislative-data/
Name of Host Site: 
Sunlight Foundation
URL To Home Page of Host Site: 
http://sunlightfoundation.com/
Summary: 
The Sunlight Foundation ran into technical obstacles scraping bill data from Indiana’s website while trying to make legislative information available to the public.

After Indiana enacted the Religious Freedom Restoration Act earlier this year, the Open States team wanted to make the bill text available to the public. In a subsequent post on the Sunlight Foundation’s blog, however, software developer Rachel Shorey described the trouble the team ran into while trying to scrape the data.

States usually provide bill text as a PDF, and Open States links to that document on the state’s website. The organization gets most of its Indiana information through the state’s legislative API, MyIGA, which requires an API key even for PDFs. Because Open States users could not download a PDF through the API without a key of their own, the nonprofit resorted to scraping an ungated download link from the bill’s page.

However, the link URL appeared to be generated on the fly and required a document-specific hash value that the team had to find. Custom code was able to locate this ID, but the scraper then had to visit a slow site several times for a single bill, often hitting multiple timeouts, which left it crashed or hung on most attempts.
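A rough sketch of the kind of scraping step described above, not the Open States code itself: it pulls a hash-bearing download link out of a bill page and retries a slow fetch a bounded number of times instead of hanging. The `/pdf/` marker, the sample markup in the usage note, and the function names are all hypothetical, since the article does not show the real page structure.

```python
import time
from html.parser import HTMLParser
from typing import Callable, Optional


class PdfLinkFinder(HTMLParser):
    """Remembers the first <a href> whose URL contains a given marker."""

    def __init__(self, marker: str) -> None:
        super().__init__()
        self.marker = marker
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            self.href = href


def find_download_link(html: str, marker: str = "/pdf/") -> Optional[str]:
    """Return the first link containing `marker` (an assumed URL pattern), or None."""
    finder = PdfLinkFinder(marker)
    finder.feed(html)
    return finder.href


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    attempts: int = 3,
    delay: float = 1.0,
) -> str:
    """Call `fetch(url)`, retrying on TimeoutError with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except TimeoutError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Wait longer after each failure: delay, 2*delay, 4*delay, ...
                time.sleep(delay * (2 ** attempt))
    raise RuntimeError(f"gave up on {url} after {attempts} attempts") from last_error
```

Passing the fetch function in as an argument keeps the retry logic testable without touching the network. With made-up markup such as `<a href="/documents/pdf/3f9a1c">Download</a>`, `find_download_link` returns `/documents/pdf/3f9a1c`. Bounded retries only soften the problem the article describes: every bill still costs several requests to a slow site.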

Generating link URLs on the fly in this way could be viewed as a preventive measure against website scraping, even though the data is meant to be open, and may have applications elsewhere. Reverse engineering the document IDs revealed nothing about how they were constructed, so the team returned to the API.

The state legislative service did not return Open States’ calls about the matter, but the API’s terms of service allow Open States to use data obtained with its API key to create an app. So the team used the available data to construct a simple proxy URL that retrieves the document for download, bypassing the hash-based URLs.
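One way such a proxy could be wired up is sketched below. This is a guess at the general shape, not the team's implementation: the proxy host, the document path layout, and the `Authorization: Token …` header are all assumptions made for illustration, and the real API may expect its key in a different form.

```python
from urllib.parse import quote, urljoin
from urllib.request import Request

# Hypothetical public address of the proxy; not a real Open States URL.
PROXY_BASE = "https://proxy.example.org/in/"


def proxy_url(document_path: str, base: str = PROXY_BASE) -> str:
    """Build the key-free link that users would click.

    `document_path` stands for whatever document identifier the API
    already returns for a bill version (its format here is assumed).
    """
    return urljoin(base, quote(document_path.lstrip("/")))


def upstream_request(document_path: str, api_base: str, api_key: str) -> Request:
    """Build the request the proxy itself would send to the keyed API.

    The key is attached on the server side, so it never reaches the user.
    The header name and scheme are assumptions.
    """
    url = urljoin(api_base, quote(document_path.lstrip("/")))
    headers = {
        "Authorization": f"Token {api_key}",
        "Accept": "application/pdf",
    }
    return Request(url, headers=headers)
```

For a hypothetical path `2015/bills/hb1001/versions/v1`, `proxy_url` yields `https://proxy.example.org/in/2015/bills/hb1001/versions/v1`. The proxy would then pass the same path to `upstream_request` and stream the PDF back, which avoids both the per-user key and the hash lookup.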

Content type group: 
Articles
