Daniel Janus’s blog
Lifehacking: How to get cheap home equipment using Clojure
12 April 2012
I moved to London last September. Like many new Londoners, I have changed accommodation fairly quickly: I have one move behind me already, with another looming in a couple of months. My current flat was largely unfurnished when I moved in, so I had to buy some basic homeware. I didn’t want to invest much in it, since it would only be for a few months. Luckily, it is not hard to do that cheaply: many people are moving out and getting rid of their stuff, so quite often you can search for the desired item on Gumtree and find there’s a cheap one a short bike ride away.
Except when there isn’t. In that case, it’s worthwhile to check again within a few days, as new items are constantly being posted. Being lazy, I decided to automate this. A few hours and a hundred lines of Clojure later, gumtree-scraper was born.
I’ve packaged it using lein uberjar into a standalone jar, which, when run, produces a gumtree.rss that is included in my Google Reader subscriptions. This way, whenever something I’m interested in appears, I get notified within an hour or so.
It’s driven by a Google spreadsheet. I’ve created a sheet with three columns (item name, minimum price, maximum price) and made it available to anyone who knows the URL. This way I can edit it from pretty much anywhere without touching the script. Each time the script is run (by cron), it downloads that spreadsheet as a CSV that looks like this:
For each row, the script queries Gumtree’s “For Sale” category within London for the given price range, fetches each result and transforms it into an RSS entry.
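As a rough illustration of that per-row loop, here is a minimal sketch of turning the downloaded CSV into a list of search specifications. The sample rows, the function names and the naive comma splitting are all assumptions made for this example, not code taken from gumtree-scraper; a CSV with quoted fields would need a proper parser.

```clojure
(require '[clojure.string :as str])

;; Hypothetical CSV in the three-column layout described above:
;; item name, minimum price, maximum price.
(def sample-csv "blender,0,10\ndesk lamp,2,15\n")

(defn parse-row
  "Turns one CSV line into a map. Naive: assumes no quoted commas."
  [line]
  (let [[item min-price max-price] (map str/trim (str/split line #","))]
    {:item item
     :min  (Integer/parseInt min-price)
     :max  (Integer/parseInt max-price)}))

(defn parse-sheet
  "Parses the whole CSV, skipping blank lines."
  [csv]
  (->> (str/split-lines csv)
       (remove str/blank?)
       (mapv parse-row)))

(parse-sheet sample-csv)
;; => [{:item "blender", :min 0, :max 10} {:item "desk lamp", :min 2, :max 15}]
```

Each resulting map would then drive one search query, with :min and :max supplying the price range.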
Gumtree has no API, so I’m using screen scraping to retrieve all the data. Because the structure of its pages is much simpler, I’m actually scraping the mobile version; a technical twist here is that the mobile version is only served to actual browsers, so I’m supplying a custom User-Agent, pretending to be Safari. For the scraping itself, the code uses Enlive, which works out nicely.
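The User-Agent trick can be sketched with plain java.net interop. This is only an illustration under stated assumptions: the function names are made up, and the header value is just an example of a mobile Safari string, not necessarily the one the scraper sends.

```clojure
(import '(java.net URL URLConnection))

;; Assumed value: an example mobile Safari User-Agent string.
(def mobile-user-agent
  (str "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X) "
       "AppleWebKit/534.46 (KHTML, like Gecko) "
       "Version/5.1 Mobile/9A334 Safari/7534.48.3"))

(defn open-as-browser
  "Prepares (but does not yet open) a connection to url with a
  browser-like User-Agent header set."
  ^URLConnection [^String url]
  (doto (.openConnection (URL. url))
    (.setRequestProperty "User-Agent" mobile-user-agent)))

(defn fetch-page
  "Downloads the page body as a string. Performs network I/O."
  [url]
  (with-open [in (.getInputStream (open-as-browser url))]
    (slurp in)))
```

The string returned by fetch-page (or the underlying input stream) could then be handed to an HTML parser such as Enlive.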
About half of the code is RSS generation, mostly XML emitting. I’d use clojure.xml/emit but it’s known to produce malformed XML at times, so I include a variant that should work.
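To give an idea of what such a variant can look like, here is a minimal sketch of an emitter that escapes text and attribute values. It is not the code from gumtree-scraper: the function names are invented, and the guess that unescaped special characters are the culprit is an assumption, since the exact failure isn’t spelled out here.

```clojure
(require '[clojure.string :as str])

(defn escape-xml
  "Escapes the characters that are special in XML text and attributes."
  [s]
  (str/escape (str s) {\& "&amp;" \< "&lt;" \> "&gt;" \" "&quot;"}))

(defn emit-element
  "Serialises a {:tag :attrs :content} map (the shape clojure.xml uses)
  to a string, escaping text nodes and attribute values."
  [node]
  (if (string? node)
    (escape-xml node)
    (let [{:keys [tag attrs content]} node
          attr-str (apply str (for [[k v] attrs]
                                (str " " (name k) "=\"" (escape-xml v) "\"")))]
      (if (seq content)
        (str "<" (name tag) attr-str ">"
             (apply str (map emit-element content))
             "</" (name tag) ">")
        (str "<" (name tag) attr-str "/>")))))

(emit-element {:tag :item
               :content [{:tag :title :content ["Blender & jug, <£10"]}]})
;; => "<item><title>Blender &amp; jug, &lt;£10</title></item>"
```

Returning a string instead of printing keeps the function easy to test; a real feed would also need the XML declaration and the surrounding rss and channel elements.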
In case anyone wants to try it out, be aware that the location and category are hardcoded in the search URL template; if you want different ones, change the template line in get-page. The controller spreadsheet URL, however, is not hardcoded; it’s built up using the spreadsheet.key system property. Here’s the wrapper script I use that is actually run by cron:
# Don't start a second copy if a previous run is still going.
if [ "`ps ax | grep java | grep gumtree`" ]; then
    echo "already running, exiting"
    exit 0
fi
cd "`dirname $0`"
java -Dspreadsheet.key=MY_SECRET_KEY -jar $HOME/gumtree/gumtree.jar
cp $HOME/gumtree/gumtree.rss $HOME/public_html
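The -Dspreadsheet.key flag in the script above ends up as a JVM system property. A minimal sketch of how the Clojure side might read it follows; the URL template is a made-up placeholder (the real spreadsheet URL format is not shown here), and spreadsheet-url is a hypothetical name.

```clojure
;; Hypothetical URL template: a placeholder, not the real spreadsheet URL.
(def sheet-url-template "https://example.invalid/sheet?key=%s&output=csv")

(defn spreadsheet-url
  "Builds the CSV URL from the spreadsheet.key system property,
  failing loudly if it was not passed with -D."
  []
  (if-let [k (System/getProperty "spreadsheet.key")]
    (format sheet-url-template k)
    (throw (IllegalStateException.
            "Run with -Dspreadsheet.key=... to select the spreadsheet"))))
```

Keeping the key out of the jar means the same build can be pointed at a different spreadsheet just by editing the wrapper script.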
Now let me remove that entry for a blender: I bought one yesterday for £4…