musings of a Lispnik


Tue, 04 May 2010

Keyword arguments

There’s been an ongoing debate about how to pass optional named arguments to Clojure functions. One way to do this is the defnk macro from clojure.contrib.def; I hesitate to call it canonical, since apparently not everyone uses it, but I’ve found it useful a number of times. Here’s a sample:

user> (use 'clojure.contrib.def)
nil
user> (defnk f [:b 43] (inc b))
#'user/f
user> (f)
44
user> (f :b 100)
101

This is an example of keyword arguments in action. Keyword arguments are a core feature of some languages, notably Common Lisp and Objective Caml. Clojure doesn’t have them, but it’s pretty easy to emulate their basic usage with macros, as defnk does.

But there’s more to Common Lisp’s keyword arguments than defnk provides. In CL, the default value of a keyword argument can be an expression referring to other arguments of the same function. For example:

CL-USER> (defun f (&key (a 1) (b a)) 
           (+ a b))
F 
CL-USER> (f)
2
CL-USER> (f :a 45)
90 
CL-USER> (f :b 101)
102

I wish defnk had this feature. Or is there some better way that I don’t know of?
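One possible answer, as a minimal sketch: a macro that binds the keyword arguments one after another in a let, so that each default can see the arguments listed before it. The name defnk-seq is made up for this illustration and is not part of clojure.contrib.def. The argument vector follows the flat key/default layout shown above, and positional arguments are left out for brevity.

```clojure
;; Hypothetical macro, not part of clojure.contrib: keyword arguments whose
;; defaults may refer to the keyword arguments listed before them.
(defmacro defnk-seq [fname kvs & body]
  (let [opts     (gensym "opts")
        bindings (mapcat (fn [[k default]]
                           ;; :b becomes the local symbol b; the default form
                           ;; is only evaluated when the key was not passed
                           [(symbol (name k))
                            `(if (contains? ~opts ~k)
                               (get ~opts ~k)
                               ~default)])
                         (partition 2 kvs))]
    `(defn ~fname [& args#]
       (let [~opts (apply hash-map args#)
             ~@bindings]
         ~@body))))

(defnk-seq f [:a 1 :b a]
  (+ a b))

(f)        ; => 2
(f :a 45)  ; => 90
(f :b 101) ; => 102
```

Because let binds sequentially, a default can only refer to arguments that appear earlier in the vector, which matches the left-to-right rule in the Common Lisp example.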


Sun, 18 Apr 2010

Sunflower

The program I’ve been writing about recently has come to a point where I think it can be shown to the general public. It’s called Sunflower and has its home on GitHub. It’s nowhere near complete, and of alpha quality right now, but even at this stage it might be useful.

Just as sunflower seed kernels come wrapped in hulls, most HTML documents seen in the wild come wrapped in noise that is not really part of the document itself. Take any news site: a document from such a site contains things such as advertisements, a header, a footer, and many links. Now suppose you have many documents grabbed from the same site. Is it possible to somehow automate the extraction of the document “essences”?

Sunflower to the rescue. It relies on the assumption that documents coming from the same source have the same structure. It presents a list of strings to the user and asks them to pick those that are contained in the text essence. Then it finds the coordinates of the smallest HTML subtree that contains all those strings, and uses those coordinates to extract information from all documents. And it comes with a nice, easily understandable GUI for that.
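The subtree search can be sketched roughly as follows. This is an illustration only, not Sunflower's actual code: it assumes documents are already parsed into clojure.xml-style maps with a :content vector and plain strings as text leaves, it takes "coordinates" to mean a vector of child indices, and all function names are made up.

```clojure
;; Assumed tree shape: {:tag :div :content [child ...]}, text leaves are strings.

(defn leaf-strings
  "All text leaves below node, in document order."
  [node]
  (if (string? node)
    [node]
    (mapcat leaf-strings (:content node))))

(defn covers?
  "True if every string in wanted is a text leaf somewhere below node."
  [node wanted]
  (every? (set (leaf-strings node)) wanted))

(defn smallest-subtree-path
  "Child indices leading from root to the deepest node that still contains
  all the wanted strings, or nil if even the root does not contain them."
  [root wanted]
  (when (covers? root wanted)
    (loop [node root, path []]
      (let [children (if (string? node) [] (:content node))
            hit      (first (keep-indexed
                              (fn [i child] (when (covers? child wanted) i))
                              children))]
        (if hit
          (recur (nth children hit) (conj path hit))
          path)))))

(defn extract
  "Follows a path of child indices down from root."
  [root path]
  (reduce (fn [node i] (nth (:content node) i)) root path))

(def page
  {:tag :html
   :content [{:tag :div :content ["Ad"]}
             {:tag :div :content [{:tag :h1 :content ["Title"]}
                                  {:tag :p :content ["Body text"]}]}
             {:tag :div :content ["Footer"]}]})

(smallest-subtree-path page ["Title" "Body text"]) ; => [1]
(leaf-strings (extract page [1]))                  ; => ("Title" "Body text")
```

The path found on one sample document can then be passed to extract for every other document from the same source, which is where the same-structure assumption comes in.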

This technique works remarkably well for many collections, although not all. An earlier, proof-of-concept implementation (in Common Lisp) has been used to extract many press texts for the National Corpus of Polish.

I’ve given up on the symbol-capturing approach to wizards I’ve presented in my previous posts. Inspired by the DOM tree in Web apps, with a bag of elements with identifiers, I now have a central bag of Swing widgets (implemented as an atom) identified by keywords. This bag contains tidbits of the mutable state of Sunflower. This means that I can write callback functions like this:

#(with-components [strings-model selected-dir]
   (.removeAllElements strings-model)
   (let [p (-> selected-dir htmls first parse)]
     (add-component :parsed p)
     (doseq [x (strings p)]
       (.addElement strings-model x))))

Name and conquer: having parts of the state explicitly named means that I can reliably access them from just about anywhere. This reduces confusion and allows for less tangled, more self-contained and understandable code.
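For readers wondering how such a bag might be put together, here is a minimal sketch. It is a guess at the shape, not the code that ships with Sunflower: the names add-component and with-components come from the callback above, while the atom name components and everything inside the definitions are assumptions.

```clojure
;; Assumed central registry: keyword -> component (or any other bit of state).
(def components (atom {}))

(defn add-component
  "Registers value under the keyword k and returns the value."
  [k value]
  (swap! components assoc k value)
  value)

(defmacro with-components
  "Binds each symbol in names to the component stored under the keyword
  of the same name, then evaluates body."
  [names & body]
  (let [bindings (mapcat (fn [n]
                           [n `(get @components ~(keyword (name n)))])
                         names)]
    `(let [~@bindings]
       ~@body)))

(add-component :strings-model (javax.swing.DefaultListModel.))

(with-components [strings-model]
  (.addElement strings-model "hello")
  (.getSize strings-model)) ; => 1
```

Since the lookup happens each time the body runs, a callback written this way sees whatever is currently registered under the keyword rather than a value captured when the callback was created.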


Mon, 05 Apr 2010

A case for symbol capture

Clojure by default protects macro authors from accidentally capturing a local symbol. Stuart Halloway describes this in more detail, explaining why this is a Good Thing. However, sometimes this kind of symbol capture is called for. I’ve encountered one such case today while hacking a Swing application.

As I develop the app, I find new ways to express Swing concepts and interact with Swing objects in a more Clojuresque way, so a library of GUI macros and functions gets written. One of them is a wizard macro for easy creation of installer-like wizards, where there is a sequence of screens that can be navigated with Back and Next buttons at the bottom of the window.

The API (certainly not finished yet) currently looks like this:

(wizard & components)

where each Swing component corresponding to one wizard screen can be augmented by a supplementary map, which can contain, inter alia, a function to execute upon showing the screen in question.

Now, I want those functions to be able to access the Back and Next buttons in case they want to disable or enable them as needed. I thus want the API user to be able to use two symbols, back-button and next-button, in the macro body, and have them bound to the corresponding buttons.

It is crucial that these bindings be lexical and not dynamic. If they were dynamic, they would only be in effect during the definition of the wizard, but not when my closures are invoked later on. Thus, my implementation looks like this:

(defmacro wizard [& panels]
  `(let [~'back-button (button "< Back")
         ~'next-button (button "Next >")]
     (do-wizard ~'back-button ~'next-button ~(vec panels))))

where do-wizard is a private function implementing the actual wizard creation, and the ~'foo syntax forces symbol capture.
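The same mechanism can be tried out without any Swing code. The following is a stripped-down sketch with a made-up macro, with-answer, that captures a single symbol. It also shows why the lexical binding matters: a closure created in the body still sees the captured value when it is called later.

```clojure
;; ~'answer expands to the plain, unqualified symbol answer,
;; so code in the body can refer to it.
(defmacro with-answer [& body]
  `(let [~'answer 42]
     ~@body))

(with-answer (inc answer)) ; => 43

;; The binding is lexical, so a closure created inside the body keeps it
;; even after the let introduced by the macro has been left.
(def later (with-answer (fn [] (* 2 answer))))
(later) ; => 84

;; The expansion shows the captured symbol without a namespace:
(macroexpand-1 '(with-answer answer))
; => (clojure.core/let [answer 42] answer)
```

Had the macro used plain answer instead of ~'answer, syntax-quote would have resolved it to a namespace-qualified symbol such as user/answer, and let refuses to bind qualified names. That refusal is the protection mentioned at the top of this post.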

By the way, if all goes well, this blog post should be the first one syndicated to Planet Clojure. Hello, Planet Clojure readers!


Sun, 04 Apr 2010

Hiking in the Apennines

I’ve recently done a week-long hike in the Umbria-Marche region of the Italian Apennines (the vicinity of Monte Catria, near Cantiano, to be more precise), and here are some tips I’d like to share.

Rifugio Fonte del Faggio


Wed, 31 Mar 2010

The pitfalls of `lein swank`

A couple of weeks ago I finally got around to acquainting myself with Leiningen, one of the most popular build tools for Clojure. What held me back the most was that Leiningen uses Maven under the hood, which seemed a scary beast at first sight – but once I overcame the initial fear, it turned out to be quite a simple and useful tool.

One feature in particular is very useful for Emacs users like me: lein swank. You define all dependencies in project.clj as usual, add a magical line to :dev-dependencies, then say

$ lein swank

and lo and behold, you can M-x slime-connect from your Emacs and have all the code at your disposal.

There is, however, an issue that you must be aware of when using lein swank: Leiningen uses a custom class loader – AntClassLoader to be more precise – to load the Java classes referenced by the code. Despite being a seemingly irrelevant thing – an implementation detail – this can bite you in a number of most surprising and obscure ways. Try evaluating the following code in a Leiningen REPL:

(str (.decode
       (java.nio.charset.Charset/forName "ISO-8859-2")
       (java.nio.ByteBuffer/wrap
         (into-array Byte/TYPE (map byte [-79 -26 -22])))))
==> "???"

The same code evaluated in a plain Clojure REPL will give you "ąćę", which is a string represented in ISO-8859-2 by the three bytes from the above snippet.

Whence the difference? Internally, each charset is represented as a unique instance of its specific class. These are loaded lazily as needed by the Charset/forName method. Presumably, the system class loader is used for that, and somewhere along the way a SecurityException gets thrown and caught.

Note also that there are parts of the Java API which use the charset lookup under the hood and are thus vulnerable to the same problem, for example Reader constructors taking charset names. If you use clojure.contrib.duck-streams, then rebinding *default-encoding* will not work from a Leiningen REPL. Jars and überjars produced by Leiningen should be fine, though.
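To check quickly whether a given REPL is affected, the snippet above can be wrapped in a small self-test. This is only a sketch: the function names are made up, and the expected string is written with Unicode escapes so that the check does not depend on the encoding of the REPL connection itself.

```clojure
(defn decode-bytes
  "Decodes a sequence of byte values using the named charset."
  [charset-name byte-values]
  (str (.decode (java.nio.charset.Charset/forName charset-name)
                (java.nio.ByteBuffer/wrap
                  (byte-array (map byte byte-values))))))

(defn charset-decoding-ok?
  "True if the three ISO-8859-2 bytes from the example above decode to
  a-ogonek, c-acute, e-ogonek (U+0105, U+0107, U+0119)."
  []
  (= "\u0105\u0107\u0119"
     (decode-bytes "ISO-8859-2" [-79 -26 -22])))

(charset-decoding-ok?)
```

In a plain Clojure REPL this should return true. Going by the behaviour described above, a false result in a Leiningen REPL would point at the class loader issue rather than at the code under test.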


Tue, 16 Feb 2010

Downcasing strings

I just needed to convert a big (around 200 MB) text file, encoded in UTF-8 and containing Polish characters, all into lowercase. tr to the rescue, right? Well, not quite.

$ echo ŻŹŚÓŃŁĘĆĄ | tr A-ZĄĆĘŁŃÓŚŹŻ a-ząćęłńóśźż
żźśóńłęćą

Looks reasonable (apart from the fact that I need to specify an explicit character mapping — it would be handy to just have a lcase utility or suchlike); but here’s what happens on another random string:

$ echo abisyński | tr A-ZĄĆĘŁŃÓŚŹŻ a-ząćęłńóśźż
abisyŅski

I was just about to report this as a bug, when I spotted the following in the manual:

Currently tr fully supports only single-byte characters. Eventually it will support multibyte characters; when it does, the -C option will cause it to complement the set of characters, whereas -c will cause it to complement the set of values.

Turns out some of the basic tools don’t support multibyte encodings. dd conv=lcase, for instance, doesn’t even pretend to touch non-ASCII letters, and perl’s tr operator likewise fails miserably even when one specifies use utf8.

This is a sad, sad state of affairs. It’s 2010, UTF-8 has been around for seventeen years, and it’s still not supported by one of the core operating system components, even as other encodings are becoming more and more obsolete. I’m dreaming of the day my system uses it internally for everything.

Fortunately, not everything is broken. Gawk, for example, works:

$ echo koŃ i żÓłw | gawk '{ print tolower($0); }'
koń i żółw

and so does sed.
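For completeness, the JVM handles this too, so the job can also be done from a Clojure REPL. Below is a minimal sketch with a made-up function name. It reads and writes UTF-8 explicitly and lowercases with Locale/ROOT, so the result does not depend on the system locale. Note that it writes the platform line separator after every line.

```clojure
(defn downcase-file
  "Copies the UTF-8 text file at path in to path out, lowercasing each line."
  [in out]
  (with-open [r (java.io.BufferedReader.
                  (java.io.InputStreamReader.
                    (java.io.FileInputStream. in) "UTF-8"))
              w (java.io.BufferedWriter.
                  (java.io.OutputStreamWriter.
                    (java.io.FileOutputStream. out) "UTF-8"))]
    ;; line-seq is lazy, so the file is processed line by line
    ;; rather than loaded into memory at once
    (doseq [line (line-seq r)]
      (.write w (.toLowerCase ^String line java.util.Locale/ROOT))
      (.newLine w))))

;; Usage, with placeholder file names:
;; (downcase-file "corpus.txt" "corpus-lower.txt")
```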

Update 2010-04-04: I should have been more specific. The above rant applies to the GNU tools (tr and dd) as found in most Linux distributions; other versions can be more featureful. As Alex Ott points out in an email comment, tr on OS X works as expected for characters outside of ASCII, and also supports character classes as in tr '[:upper:]' '[:lower:]'. This is yet another testimony to the generally high quality of Apple software; in this particular case, though, it may well be a direct effect of OS X’s BSD heritage. Does it work on *BSD?


Wed, 10 Feb 2010

Clojure SET

I’ve just taken a short break from work to put some code on GitHub that I had written over one night some two months ago. It is an implementation of the Set game in Clojure, using Swing for the GUI.

I do not have time to clean up or comment the code, so I’m leaving it as is for now; however, I hope that even in its current state it can be of interest, especially for Clojure learners.

Some random notes on the code:

Comments?


Mon, 18 Jan 2010

Reactivation (and some ramblings on my blogging infrastructure)

This blog has not seen content updates in more than a year. Plenty of things can happen in such a long period, and in fact many aspects of my life have seen major changes over this time. I’m not, however, going to write a lengthy post about all that right now. Instead, I would just like to announce the reactivation of the blog.

You might have noticed that many things have changed. First, the blog has a new address: http://blog.danieljanus.pl; the address of the RSS feed has also changed and is now http://blog.danieljanus.pl/index.rss — please update your readers!

Probably the most important change is that you now may post comments under the entries, even though this blog continues to be just a bunch of static HTML pages. This is possible thanks to the Disqus service. I wonder whether it will encourage people to give feedback: I have received very few email comments since I started blogging. Also, the static calendar at the top of each page is gone, replaced by a bunch of links to archive posts.

I have long been considering changing Blosxom to something else. The main reason for such a step is that it’s written in Perl, which makes it particularly hard to debug upon encountering unexpected behaviour. The single most irritating thing was that Blosxom would unexpectedly change the date of a post that was edited (which did not let me fix typos and other glitches); I found a patch for this somewhere, but lost it.

On the other hand, I really liked, and still like, Blosxom’s minimalistic approach and the ease of adding posts. (The very idea of installing a monstrosity such as Wordpress, with its gazillion features I don’t need, posts kept in a database and what not, makes me feel dizzy.) I toyed for a while with the thought of reimplementing Blosxom in Common Lisp, but that turned out to be a more time-consuming project than it initially seemed. So when I found The Unofficial Blosxom User Group and learned that, contrary to my belief, Blosxom is still actively maintained and has a thriving community, I ended up staying with the original Perl version, refining my installation so that it no longer gets in the way (this FAQ entry did the trick). I also converted all my source text files to Markdown, which made them vastly more readable and easier to edit, updating links and adding short followup notes where appropriate, but otherwise leaving old entries as they were.

I’d like to thank Maciek Pasternacki for inspiring me to finally get around to this. While my plans are not as ambitious as his — I am not courageous enough to publicly prove my perseverance, so my blogging will likely continue to be irregular — I plan to write more (having accumulated many ideas for blog posts) and I hope the periods of silence will be much shorter than hitherto.

I would like to take this opportunity to wish my readers all the best in the New Year!
