musings of a Lispnik

blog index | homepage | RSS

Sun, 18 Apr 2010

Sunflower

The program I’ve been writing about recently has come to a point where I think it can be shown to the wide public. It’s called Sunflower and has its home on GitHub. It’s nowhere near being completed, and of alpha quality right now, but even at this stage it might be useful.

Just as sunflower seed kernels come wrapped in hulls, most HTML documents seen in the wild come wrapped in noise that is not really part of the document itself. Take any news site: a document from such a site contains things such as advertisements, header, footer, and many links. Now suppose you have many documents grabbed from the same site. Is it possible to somehow automate the extraction of the document “essences”?

Sunflower to the rescue. It relies on the assumption that documents coming from the same source have the same structure. It presents a list of strings to the user, and asks to pick those that are contained in the text essence. Then it finds the coordinates of the smallest HTML subtree that contains all those strings, and uses those coordinates to extract information from all documents. And it comes with a nice, easily understandable GUI for that.

This technique works remarkably well for many collections, although not all. An earlier, proof-of-concept implementation (in Common Lisp) has been used to extract many press texts for the National Corpus of Polish.

I’ve given up on the symbol-capturing approach to wizards I’ve presented in my previous posts. Inspired by the DOM tree in Web apps, with a bag of elements with identifiers, I now have a central bag of Swing widgets (implemented as an atom) identified by keywords. This bag contains tidbits of the mutable state of Sunflower. This means that I can write callback functions like this:

#(with-components [strings-model selected-dir]
   (.removeAllElements strings-model)
   (let [p (-> selected-dir htmls first parse)]
     (add-component :parsed p)
     (doseq [x (strings p)]
       (.addElement strings-model x))))

Name and conquer: having parts of state explicitly named mean that I can reliably access them from just about anywhere. This reduces confusion and allows for less tangled, more self-contained and understandable code.

permanent link | comments

Mon, 05 Apr 2010

A case for symbol capture

Clojure by default protects macro authors from incidentally capturing a local symbol. Stuart Halloway describes this in more detail, explaining why this is a Good Thing. However, sometimes this kind of symbol capture is called for. I’ve encountered one such case today while hacking a Swing application.

As I develop the app, I find new ways to express Swing concepts and interact with Swing objects in a more Clojuresque way, so a library of GUI macros and functions gets written. One of them is a wizard macro for easy creation of installer-like wizards, where there is a sequence of screens that can be navigated with Back and Next buttons at the bottom of the window.

The API (certainly not finished yet) currently looks like this:

(wizard & components)

where each Swing component corresponding to one wizard screen can be augmented by a supplementary map, which can contain, inter alia, a function to execute upon showing the screen in question.

Now, I want those functions to be able to access the Back and Next buttons in case they want to disable or enable them at need. I thus want the API user to be able to use two symbols, back-button and next-button, in the macro body, and have them bound to the corresponding buttons.

It is crucial that these bindings be lexical and not dynamic. If they were dynamic, they would be only effective during the definition of the wizard, but not when my closures are invoked later on. Thus, my implementation looks like this:

(defmacro wizard [& panels]
  `(let [~'back-button (button "< Back")
         ~'next-button (button "Next >")]
   (do-wizard ~'back-button ~'next-button ~(vec panels))))

where do-wizard is a private function implementing the actual wizard creation, and the ~'foo syntax forces symbol capture.

By the way, if all goes well, this blog post should be the first one syndicated to Planet Clojure. Hello, Planet Clojure readers!

permanent link | comments

Sun, 04 Apr 2010

Hiking in the Apennines

I’ve recently done a week-long hike in the Umbria-Marche region of the Italian Apennines (the vicinity of Monte Catria, near Cantiano, to be more precise), and here are some tips I’d like to share.

Rifugio Fonte del Faggio

permanent link | comments