An enterprising fellow from Milwaukee built a site to automatically pay parking tickets in his city and in Madison. Unfortunately, the city of Madison changed its ticket-payment website, so his site no longer works for Madison tickets. I thought it would be a good programming challenge for me.
How Do You Work with Web Services When They Have No API?
Since the City of Madison doesn’t have an API on their website, it’s not a simple matter of sending a query and getting back a nicely-formatted JSON response to use in my app. So how can I get the information that I need from the website?
Fortunately, all websites have at least one available point of entry - via the web, as a user! There’s a Python package to emulate user behavior in a browser and return the HTTP response, which we can further manipulate as we want.
This is Mechanize, originally developed for Perl. I install the Python package and make a script to go to the City of Madison’s parking ticket payment site at https://www.cityofmadison.com/epayment/parkingTicket/. I based this on an example at http://stockrt.github.io/p/emulating-a-browser-in-python-with-mechanize/.
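A minimal sketch of such a script, assuming the `mechanize` package is installed (`pip install mechanize`). The helper names `make_browser` and `fetch_html` are my own, for illustration, and are not taken from the linked example.

```python
PARKING_URL = "https://www.cityofmadison.com/epayment/parkingTicket/"


def make_browser():
    """Create a stateful mechanize browser (hypothetical helper name)."""
    # Third-party package; imported lazily so the helper below can be
    # exercised with any browser-like object.
    import mechanize

    return mechanize.Browser()


def fetch_html(browser, url=PARKING_URL):
    """Open `url` in the given browser and return the page body as text."""
    response = browser.open(url)
    body = response.read()
    # mechanize hands back bytes on Python 3; decode so callers get a str.
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    return body
```

Calling `fetch_html(make_browser())` would request the live page, so whether it still works depends on the city keeping the page at that address.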
This returns the city’s parking ticket page as one big string of HTML.
Searching for Tickets
In order to search for a parking ticket, it looks like I need to first accept the terms and continue.
Since Mechanize is ‘stateful’ - it acts like a browser tab that a user would have open - I can submit this form and continue working with my browser class.
Before we do that, here are some other things you can do with Mechanize:
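As a rough illustration, the sketch below collects a few facts about whatever page the browser currently has open. It assumes a `mechanize.Browser` that has already opened a page; `describe_page` is a hypothetical helper name, while `title()`, `geturl()`, `forms()` and `links()` are methods of the mechanize browser.

```python
def describe_page(browser):
    """Summarize the page currently loaded in a mechanize-style browser."""
    return {
        "title": browser.title(),                          # text of <title>
        "url": browser.geturl(),                           # URL after redirects
        "forms": [form.name for form in browser.forms()],  # form names (may be None)
        "links": [link.url for link in browser.links()],   # href of every link
    }
```

The form names this reports are a quick way to find out what to pass to `select_form` later on.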
Back to the main objective. First, we need to:
1. Find the form on the page with the checkbox
2. Select the checkbox
3. Submit the form to continue
To do this, I needed the names of the form and the checkbox control, which I found by inspecting the page with Chrome Developer Tools.
So we follow that action in our simulated browser and are brought to the Search page:
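The three steps might look something like the sketch below. The real form and checkbox names are not reproduced in this post, so `form_name` and `checkbox_name` are parameters you would fill in from your own inspection of the page. `accept_terms` is a hypothetical helper name.

```python
def accept_terms(browser, form_name, checkbox_name):
    """Tick the terms checkbox on the named form and submit it.

    Both names are assumptions supplied by the caller; read them off the
    live page before using this.
    """
    # 1. Find the form on the page with the checkbox.
    browser.select_form(name=form_name)
    # 2. Select the checkbox (a checkbox control exposes its boxes as items).
    checkbox = browser.form.find_control(checkbox_name)
    for item in checkbox.items:
        item.selected = True
    # 3. Submit the form; the browser object now sits on the next page.
    return browser.submit()
```

Because the browser is stateful, the same `browser` object can be reused for the search step that comes next.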
Great news! This is where we can enter our first dynamic input - a license plate to test. I do a similar inspection and find the names of the form and controls I need to manipulate:
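A hedged sketch of the search step follows. As before, the real form and field names are not shown in this post, so they are passed in rather than hard-coded, and `search_tickets` is a hypothetical helper name.

```python
def search_tickets(browser, form_name, fields):
    """Fill in the search form and return the results page as text.

    `fields` maps control names to values, e.g. a plate field name to a
    plate string. Text inputs take a plain string; select and radio
    controls expect a list of values.
    """
    browser.select_form(name=form_name)
    for control_name, value in fields.items():
        browser.form[control_name] = value
    response = browser.submit()
    body = response.read()
    # mechanize hands back bytes on Python 3; decode so callers get a str.
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    return body
```

A call would look like `search_tickets(br, "<form name>", {"<plate field>": "ABC123"})`, with the placeholders replaced by the names found in Developer Tools.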
Here’s what the HTML of the search returns for a plate with existing tickets:
Manipulating the HTML Search Results to Work With Them in Python
We need to take the HTML output and parse the page. The problem is that the output is not always regular. We need something like regular expressions that will let us recognize certain, sometimes repeated parts of the page.
The lxml library is a good starting place for processing XML and HTML in Python.
We’ll also take advantage of XPath, a query language for selecting nodes in an XML or HTML tree. It plays the role here that regex plays for plain text:
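To get a feel for the two together, here is a small sketch that parses an HTML string with `lxml.html` and runs an XPath query against it. It assumes `lxml` is installed (`pip install lxml`), and `SAMPLE_PAGE` is a made-up snippet, not the city's markup.

```python
from lxml import html

# Invented stand-in for a downloaded page.
SAMPLE_PAGE = """
<html><body>
  <h1>Search results</h1>
  <table>
    <tr><td>first cell</td><td>second cell</td></tr>
  </table>
</body></html>
"""

# Parse the string into an element tree we can query.
tree = html.fromstring(SAMPLE_PAGE)

# xpath() returns a list; text() asks for the text inside each match.
cells = tree.xpath("//td/text()")
heading = tree.xpath("//h1/text()")

print(cells)
print(heading)
```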
Let’s inspect the HTML of the search results to find what is regular about these ticket results.
I’ve skipped over the header of the page and gone right to the meat of where the tickets are:
We see that each discrete piece of data about each ticket is contained in a table cell <td>, and that each of these is the child of a <tr> that stands for each ticket. To get each ticket, we’ll figure out what is common to those <tr>s.
Each <tr> sits in a table that is the direct child of a form. Since there’s only one form on the page, we can start there. Note that XPath syntax is strict, and the path that works won’t necessarily match the element nesting you see when inspecting the page.
Here’s the XPath expression that finally worked to return each ticket and nothing else:
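Pieced together from the breakdown that follows, the expression would be `//form/table/tr[position()>2]//td//text()`. The sketch below runs it against a made-up table with two leading rows to skip. The ticket numbers and amounts are invented for illustration and are not real search results.

```python
from lxml import html

# Reconstructed from the piece-by-piece explanation in this post.
TICKET_XPATH = "//form/table/tr[position()>2]//td//text()"

# Invented markup: two rows that are not tickets, then two ticket rows.
SAMPLE_RESULTS = """
<html><body><form>
<table>
<tr><td>Results header</td></tr>
<tr><td>Ticket</td><td>Amount</td></tr>
<tr><td> T-0001 </td><td> 20.00 </td></tr>
<tr><td> T-0002 </td><td> 35.00 </td></tr>
</table>
</form></body></html>
"""

tree = html.fromstring(SAMPLE_RESULTS)

# Strip each text node, since the raw cells carry stray whitespace.
ticket_text = [text.strip() for text in tree.xpath(TICKET_XPATH)]
print(ticket_text)
```

Only the text from the third row onward should come back, which is the behavior the `position()>2` predicate is there to provide.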
Here’s how we read this, from right to left:
//text() - Two slashes select ALL matching nodes at any depth, not just direct children. text() selects the text inside each element (as opposed to the class, the href, or another attribute).
//td - Select ALL <td> elements (even if they’re not direct children).
/tr[position()>2] - This is the complicated one. One slash selects only <tr>s that are direct children of the parent, and the predicate keeps only those that are the 3rd or later <tr> child. This skips the two <tr>s on our page that come before the tickets.
//form/table - Select tables that are the direct children of ANY form on the page.
Whew.
This gives us (semi) nice lists like:
There’s a lot of garbage in there, like carriage returns, tabs, and spaces. Additionally, we’re only interested in some of the list elements:
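One way to do that cleanup, sketched under the assumption that each ticket arrives as its own list of raw cell strings. The field labels and indexes passed to `build_tickets` are hypothetical and would need to be matched to the real page.

```python
def clean_cells(raw_cells):
    """Collapse whitespace runs in each cell and drop cells left empty."""
    cleaned = (" ".join(cell.split()) for cell in raw_cells)
    return [cell for cell in cleaned if cell]


def build_tickets(rows, fields, key):
    """Turn per-ticket cell lists into a dict of ticket records.

    `fields` maps a label to the index of the cleaned cell to keep, and
    `key` names the label used as the dictionary key. Dropping empty cells
    shifts indexes, so pick the indexes after looking at cleaned output.
    """
    tickets = {}
    for raw in rows:
        cells = clean_cells(raw)
        record = {label: cells[index] for label, index in fields.items()}
        tickets[record[key]] = record
    return tickets
```

For example, `build_tickets(rows, {"number": 0, "amount": 2}, key="number")` would keep the first and third cleaned cells of every ticket and index the result by ticket number.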
Apart from the neatly printed output, we also now have a dictionary of tickets. We’ve achieved the goal of a fake ‘API’ for searching for parking tickets.