Tue, 16 Feb 2010
Downcasing strings
I just needed to convert a big (around 200 MB) text file, encoded in
UTF-8 and containing Polish characters, all into lowercase. tr to
the rescue, right? Well, not quite.
$ echo ŻŹŚÓŃŁĘĆĄ | tr A-ZĄĆĘŁŃÓŚŹŻ a-ząćęłńóśźż
żźśóńłęćą
Looks reasonable (apart from the fact that I need to specify an
explicit character mapping — it would be handy to just have a
lcase utility or suchlike); but here’s what happens on another
random string:
$ echo abisyński | tr A-ZĄĆĘŁŃÓŚŹŻ a-ząćęłńóśźż
abisyŅski
I was just about to report this as a bug, when I spotted the following in the manual:
Currently tr fully supports only single-byte characters. Eventually it will support multibyte characters; when it does, the -C option will cause it to complement the set of characters, whereas -c will cause it to complement the set of values.
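The odd Ņ in the second example is consistent with tr mapping bytes rather than characters. Here is a minimal Python sketch of that behaviour, assuming tr simply pairs up the UTF-8 bytes of the two sets position by position (the function name bytewise_tr is made up for illustration, and this is an emulation, not tr's actual code):

```python
import string

# The two sets from the tr invocation, with A-Z and a-z expanded.
UPPER = string.ascii_uppercase + "ĄĆĘŁŃÓŚŹŻ"
LOWER = string.ascii_lowercase + "ąćęłńóśźż"


def bytewise_tr(text: str) -> str:
    """Translate byte by byte, the way a single-byte-only tr would (assumption)."""
    # Every Polish letter here is two bytes in UTF-8, so both sets have the
    # same byte length and maketrans pairs them up positionally.
    table = bytes.maketrans(UPPER.encode("utf-8"), LOWER.encode("utf-8"))
    translated = text.encode("utf-8").translate(table)
    return translated.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(bytewise_tr("ŻŹŚÓŃŁĘĆĄ"))
    print(bytewise_tr("abisyński"))
```

The first call happens to work because each upper-case letter's second byte maps to the right lower-case byte. In the second, ń is the bytes C5 84, and 84 is also the second byte of Ą, which the table maps to 85; C5 85 is Ņ.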
Turns out some of the basic tools don’t support multibyte encodings.
dd conv=lcase, for instance, doesn’t even pretend to touch non-ASCII
letters, and perl’s tr operator likewise fails miserably even when
one specifies use utf8.
This is a sad, sad state of affairs. It’s 2010, UTF-8 has been around for seventeen years, and it’s still not supported by one of the core operating system components, even as other encodings are becoming more and more obsolete. I’m dreaming of the day my system uses it internally for everything.
Fortunately, not everything is broken. Gawk, for example, works:
$ echo koŃ i żÓłw | gawk '{ print tolower($0); }'
koń i żółw
and so does sed.
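Any tool that decodes the input before changing case should do the job. As an alternative to gawk, here is a minimal Python sketch that streams a file line by line, so a 200 MB input never has to fit in memory at once. The function name lowercase_file and the file paths are hypothetical, and the input is assumed to be valid UTF-8:

```python
def lowercase_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, lowercasing every character."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            # str.lower() works on decoded characters, not bytes,
            # so non-ASCII letters such as Ń or Ż are handled too.
            dst.write(line.lower())
```

It could be called as lowercase_file("input.txt", "output.txt"), with both paths being placeholders.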
Update 2010-04-04: I should have been more specific. The above rant
applies to the GNU tools (tr and dd) as found in most Linux
distributions; other versions can be more featureful. As Alex Ott
points out in an email comment, tr on OS X works as expected for
characters outside of ASCII, and also supports character classes as in
tr '[:upper:]' '[:lower:]'. This is yet another testimony to the
generally high quality of Apple software; in this particular case,
though, it may well be a direct effect of OS X’s BSD heritage. Does
it work on *BSD?
Wed, 10 Feb 2010
Clojure SET
I’ve just taken a short break from work to put some code on GitHub that I had written in one night some two months ago. It is an implementation of the Set game in Clojure, using Swing for the GUI.
I do not have time to clean up or comment the code, so I’m leaving it as is for now; however, I hope that even in its current state it can be of interest, especially for Clojure learners.
Some random notes on the code:
- Clojure is concise! The whole thing is just under 250 lines of code, complete with game logic and the GUI. Of these, the logic is about 50 LOC. Despite this, it reads clearly and has been a pleasure to write, thanks to Clojure’s support for sets as a data structure (in the vein of the game’s title and theme).
- There are no graphics included. All the drawing is done in the GUI part of the code (I’ve replaced the canonical squiggle shape with a triangle and stripes with gradients, for the sake of easier drawing).
- I’ve toyed around with different Swing layout managers for this
game. Back in the days when I wrote in plain Java, I used to use
TableLayout, but it has a non-free license; JGoodies Forms is
also nice, but has a slightly more complicated API (and it’s an
additional dependency, after all). In the end I’ve settled on
the standard GridBagLayout, which is similar in spirit to those
two, but requires more boilerplate to set up. As it turned out,
simple macrology makes it quite pleasurable to use; see
add-gridbag in the code for details.
- Other things of interest might be my function to randomly shuffle seqs, which strikes a nice balance between simplicity/conciseness of implementation and randomness; and a useful debugging macro.
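To give a feel for why a built-in set type keeps the game logic short, here is a rough sketch of the core rule of Set, written in Python rather than Clojure. It is not the code from the repository: the card encoding (a tuple of four attribute values, each 0, 1 or 2) and the names is_set, find_sets and DECK are assumptions made for illustration.

```python
from itertools import combinations, product

# Hypothetical encoding: a card is a 4-tuple, one value in 0..2 per
# attribute (number, colour, shading, shape).
DECK = list(product(range(3), repeat=4))


def is_set(a, b, c):
    """Three cards form a set if every attribute is all-same or all-different."""
    # Putting the three values of one attribute into a set leaves 1 element
    # (all the same) or 3 (all different); exactly 2 breaks the rule.
    return all(len({x, y, z}) != 2 for x, y, z in zip(a, b, c))


def find_sets(cards):
    """Return every valid trio among the given cards."""
    return [trio for trio in combinations(cards, 3) if is_set(*trio)]
```

Checking the cards on the table is then a single call, for example find_sets(DECK[:12]).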
Comments?