How Indiana’s Legislative Site Foiled Attempts to Scrape It

Primary Target Audience:

Enterprise

Primary Channel:

ProgrammableWeb

Primary category:

Open Data

Secondary category:

Government

Headline on Actual Article:

Opening up Indiana’s hard to reach legislative data

URL To Article:

http://sunlightfoundation.com/blog/2015/04/14/opening-up-indianas-hard-to-reach-legislative-data/

Name of Host Site:

Sunlight Foundation

URL To Home Page of Host Site:

http://sunlightfoundation.com/

Author's Name:

Rachel Shorey

Summary:

The Sunlight Foundation ran into technical obstacles when it tried to scrape bill data from Indiana’s website in order to make legislative information available to the public.

After Indiana enacted the Religious Freedom Restoration Act earlier this year, the Open States team wanted to make the text of the bill available to the public. In a post on the Sunlight Foundation’s blog, software developer Rachel Shorey described the trouble the team ran into while trying to scrape that data.

Bill text is usually published by the state as a PDF, and Open States links to that specific bill on the state’s website. The organization gets most of its Indiana information from the state’s legislative API, MyIGA, which requires an API key even for PDFs. Because Open States users without a key could not download the PDF that way, the nonprofit resorted to scraping an ungated download link from the bill’s page.

However, the link URL appeared to be generated on the fly and required a document-specific hash value that the team had to find. Custom code was able to locate this ID, but the scraper then had to visit a slow site several times for a single bill. The requests often timed out, and most attempts left the scraper crashed or hung.
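One common way to keep a scraper from crashing or hanging on a slow site is to bound each request with a timeout and retry with increasing delays. The following is a minimal sketch of that general idea, not the Open States code: the function name `fetch_with_retries`, its parameters, and the choice of exceptions to retry on are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or the attempts run out.

    fetch is a hypothetical zero-argument callable that performs one
    request with its own timeout and raises TimeoutError or
    ConnectionError on failure. A real HTTP client may raise its own
    exception types, which would need to be added to the except clause.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing the request in as a callable keeps the retry policy separate from the HTTP client, so the same wrapper can cover each of the several page visits a single bill requires.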

Generating link URLs on the fly in this way could be viewed as a measure against website scraping, despite the open nature of the data, and may have applications elsewhere. Attempts to reverse engineer the document ID revealed nothing about how the IDs were constructed, so the team returned to the API.

The state legislative service did not return Open States’ calls about the matter, but the API’s terms of service allow Open States to use data obtained with the API key to create an app. So the team used the available data to build a simple proxy URL that retrieves the document for download, circumventing the hash-generated URLs.
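The proxy idea can be sketched roughly as follows: a public, key-free path is mapped to a keyed request that the server makes on the user's behalf, so the API key never reaches the browser. This is only an illustration under stated assumptions. The host `api.example.invalid`, the path layout, and the `Token` authorization header are placeholders, since the article does not give MyIGA's actual endpoints or authentication scheme.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Hypothetical upstream host: the real MyIGA base URL is not given here.
UPSTREAM_BASE = "https://api.example.invalid"


def build_upstream_request(doc_path: str, api_key: str) -> Request:
    """Translate a public proxy path into a keyed upstream request."""
    # Refuse path traversal so the proxy cannot be pointed elsewhere.
    if ".." in doc_path.split("/"):
        raise ValueError("path traversal is not allowed")
    safe_path = quote(doc_path.strip("/"), safe="/")
    return Request(
        f"{UPSTREAM_BASE}/{safe_path}",
        headers={
            # Assumed header format; the key stays on the server side.
            "Authorization": f"Token {api_key}",
            "Accept": "application/pdf",
        },
    )


def fetch_document(doc_path: str, api_key: str, timeout: float = 30.0) -> bytes:
    """Fetch the document bytes for the proxy to return to the user."""
    request = build_upstream_request(doc_path, api_key)
    with urlopen(request, timeout=timeout) as response:
        return response.read()
```

A web handler would call `fetch_document` with the path from the incoming request and stream the result back as a PDF, which gives users a stable download link that does not depend on the hash-based URLs.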

Content type group:

Articles
