I’ve moved to London last September. Like many new Londoners, I have changed accommodation fairly quickly, being already after one removal and with another looming in a couple of months; my current flat was largely unfurnished when I moved in, so I had to buy some basic homeware. I didn’t want to invest much in it, since it’d be only for a few months. Luckily, it is not hard to do that cheaply: many people are moving out and getting rid of their stuff, so quite often you can search for the desired item on Gumtree and find there’s a cheap one a short bike ride away.
Except when there isn’t. In this case, it’s worthwhile to check again within a few days as new items are constantly being posted. Being lazy, I’ve decided to automate this. A few hours and a hundred lines of Clojure later, gumtree-scraper was born.
I’ve packaged it using
lein uberjar into a standalone jar, which,
when run, produces a
gumtree.rss that is included in my Google
Reader subscriptions. This way, whenever something I’m interested in
appears, I get notified within an hour or so.
It’s driven by a Google spreadsheet. I’ve created a sheet that has three columns: item name, minimum price, maximum price; then I’ve made it available to anyone who knows the URL. This way I can edit it pretty much from everywhere without touching the script. Each time the script is run (by cron), it downloads that spreadsheet as a CSV that looks like this:
hand blender,,5 bike rack,,15
For each row the script queries Gumtree’s category “For Sale” within London given the price range, gets each result and transforms it to a RSS entry.
Gumtree has no API, so I’m using screenscraping to retrieve all the data. Because the structure of the pages is much simpler, I’m actually scraping the mobile version; a technical twist here is that the mobile version is only served to actual browsers so I’m supplying a custom User-Agent, pretending to be Safari. For actual scraping, the code uses Enlive; it works out nicely.
About half of the code is RSS generation — mostly XML emitting. I’d
clojure.xml/emit but it’s known to produce malformed XML at
times, so I include a variant that should work.
In case anyone wants to tries it out, be aware that the location and
category are hardcoded in the search URL template; if you want, change
the template line in
get-page. The controller spreadsheet URL is
not, however, hardcoded; it’s built up using the
system property. Here’s the wrapper script I use that is actually run
1 2 3 4 5 6 7 8
Now let me remove that entry for a blender — I’ve bought one yesterday for £4…