musings of a Lispnik


Tue, 16 Feb 2010

Downcasing strings

I just needed to convert a big (around 200 MB) text file, encoded in UTF-8 and containing Polish characters, entirely to lowercase. tr to the rescue, right? Well, not quite.

$ echo ŻŹŚÓŃŁĘĆĄ | tr A-ZĄĆĘŁŃÓŚŹŻ a-ząćęłńóśźż
żźśóńłęćą

Looks reasonable (apart from the fact that I need to specify an explicit character mapping — it would be handy to just have a lcase utility or suchlike); but here’s what happens on another random string:

$ echo abisyński | tr A-ZĄĆĘŁŃÓŚŹŻ a-ząćęłńóśźż
abisyŅski

I was just about to report this as a bug, when I spotted the following in the manual:

Currently tr fully supports only single-byte characters. Eventually it will support multibyte characters; when it does, the -C option will cause it to complement the set of characters, whereas -c will cause it to complement the set of values.
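To see where that stray Ņ comes from, it helps to look at the raw bytes. The following is only a sketch, assuming a tr that works byte by byte (as GNU tr does) and an od that accepts -An -tx1; hexbytes is a helper name made up for this illustration. In UTF-8, Ą and ą are C4 84 and C4 85, so the two sets given to tr pair the byte 84 with the byte 85, and the second byte of ń (C5 84) gets caught by that pairing.

```shell
# Hypothetical helper: print stdin as a run of hex byte values.
hexbytes() {
    od -An -tx1 | tr -d ' \n'
}

# ń is U+0144, which UTF-8 encodes as the bytes C5 84 (octal 305 204).
printf '\305\204' | hexbytes; echo
# prints: c584

# Reproduce only the single byte mapping 0x84 -> 0x85 that the
# Ą/ą pair introduces when tr treats the sets as plain bytes.
printf '\305\204' | LC_ALL=C tr '\204' '\205' | hexbytes; echo
# prints: c585, which is the UTF-8 encoding of Ņ (U+0145)
```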

Turns out some of the basic tools don’t support multibyte encodings. dd conv=lcase, for instance, doesn’t even pretend to touch non-ASCII letters, and perl’s tr operator likewise fails miserably even when one specifies use utf8.

This is a sad, sad state of affairs. It’s 2010, UTF-8 has been around for seventeen years, and it’s still not supported by one of the core operating system components, even as other encodings are becoming more and more obsolete. I’m dreaming of the day my system uses it internally for everything.

Fortunately, not everything is broken. Gawk, for example, works:

$ echo koŃ i żÓłw | gawk '{ print tolower($0); }'
koń i żółw

and so does sed.
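For the record, one way to do it with sed is sketched here. It assumes GNU sed, since the \L case-conversion escape is a GNU extension rather than POSIX, and it assumes the shell runs in a UTF-8 locale; lcase_sed is just a name picked for this example.

```shell
# Hypothetical helper: lowercase stdin (or the named files) line by line.
# Relies on GNU sed's \L extension; non-ASCII letters are only
# converted when the current locale is a UTF-8 one.
lcase_sed() {
    sed 's/.*/\L&/' "$@"
}

echo 'koŃ i żÓłw' | lcase_sed
```

Wrapped up like this, it is about as close to the lcase utility wished for above as a one-liner gets.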


Wed, 10 Feb 2010

Clojure SET

I’ve just taken a short break from work to put some code on GitHub that I had written over one night some two months ago. It is an implementation of the Set game in Clojure, using Swing for the GUI.

I do not have time to clean up or comment the code, so I’m leaving it as is for now; however, I hope that even in its current state it can be of interest, especially for Clojure learners.

Some random notes on the code:

Comments?


Mon, 18 Jan 2010

Reactivation (and some ramblings on my blogging infrastructure)

This blog has not seen content updates in more than a year. Plenty of things can happen in such a long period, and in fact many aspects of my life have seen major changes over this time. I’m not, however, going to write a lengthy post about all that right now. Instead, I would just like to announce the reactivation of the blog.

You might have noticed that many things have changed. First, the blog has a new address: http://blog.danieljanus.pl; the address of the RSS feed has also changed and is now http://blog.danieljanus.pl/index.rss — please update your readers!

Probably the most important change is that you may now post comments under the entries, even though this blog continues to be just a bunch of static HTML pages. This is possible thanks to the Disqus service. I wonder whether it will encourage people to give feedback: I have received very few email comments since I started blogging. Also, the static calendar at the top of each page is gone, replaced by a bunch of links to archive posts.

I have long been considering changing Blosxom to something else. The main reason for such a step is that it’s written in Perl, which makes it particularly hard to debug when it behaves unexpectedly. The single most irritating thing was that Blosxom would change the date of a post whenever it was edited (which kept me from fixing typos and other glitches); I found a patch for this somewhere, but lost it.

On the other hand, I really liked, and still like, Blosxom’s minimalistic approach and the ease of adding posts. (The very idea of installing a monstrosity such as Wordpress, with its gazillion features I don’t need, posts kept in a database and whatnot, makes me feel dizzy.) I toyed for a while with the idea of reimplementing Blosxom in Common Lisp, but that turned out to be a more time-consuming project than it initially seemed. So when I found The Unofficial Blosxom User Group and learned that, contrary to my belief, Blosxom is still actively maintained and has a thriving community, I ended up staying with the original Perl version, refining my installation so that it no longer gets in the way (this FAQ entry did the trick). I also converted all my source text files to Markdown, which made them vastly more readable and easier to edit, updating links and adding short followup notes where appropriate, but otherwise leaving old entries as they were.

I’d like to thank Maciek Pasternacki for inspiring me to finally get around to this. While my plans are not as ambitious as his — I am not courageous enough to publicly prove my perseverance, so my blogging will likely continue to be irregular — I plan to write more (having accumulated many ideas for blog posts) and I hope the periods of silence will be much shorter than hitherto.

I would like to take this opportunity to wish my readers all the best in the New Year!
