Downcasing strings

2010-02-16T00:00:00Z

I just needed to convert a big (around 200 MB) text file, encoded in UTF-8 and containing Polish characters, all into lowercase. tr to the rescue, right? Well, not quite.

$ echo ŻŹŚÓŃŁĘĆĄ | tr A-ZĄĆĘŁŃÓŚŹŻ a-ząćęłńóśźż
żźśóńłęćą

Looks reasonable (apart from the fact that I need to specify an explicit character mapping; it would be handy to just have an lcase utility or suchlike), but here’s what happens on another random string:

$ echo abisyński | tr A-ZĄĆĘŁŃÓŚŹŻ a-ząćęłńóśźż
abisyŅski

I was just about to report this as a bug, when I spotted the following in the manual:

Currently tr fully supports only single-byte characters. Eventually it will support multibyte characters; when it does, the -C option will cause it to complement the set of characters, whereas -c will cause it to complement the set of values.
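To make the failure concrete, here is a minimal Python 3 sketch of what a byte-oriented tr effectively does with the two sets above. It is only a simulation under the assumption that each UTF-8 byte is mapped independently; the helper name bytewise_tr is made up for illustration and is not how GNU tr is actually implemented.

```python
import string

# The two sets passed to tr above, with the A-Z / a-z ranges written out.
upper = string.ascii_uppercase + "ĄĆĘŁŃÓŚŹŻ"
lower = string.ascii_lowercase + "ąćęłńóśźż"

# A single-byte tr pairs up the UTF-8 *bytes* of both sets, not the characters.
table = bytes.maketrans(upper.encode("utf-8"), lower.encode("utf-8"))

def bytewise_tr(text):
    """Hypothetical helper: translate text one byte at a time."""
    return text.encode("utf-8").translate(table).decode("utf-8")

print(bytewise_tr("ŻŹŚÓŃŁĘĆĄ"))  # żźśóńłęćą
print(bytewise_tr("abisyński"))  # abisyŅski
```

The second line goes wrong because ń is encoded as the bytes C5 84, and 84 also happens to be the second byte of Ą (C4 84), which the table maps to 85. The result, C5 85, is Ņ.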

Turns out some of the basic tools don’t support multibyte encodings. dd conv=lcase, for instance, doesn’t even pretend to touch non-ASCII letters, and perl’s tr operator likewise fails miserably even when one specifies use utf8.

This is a sad, sad state of affairs. It’s 2010, UTF-8 has been around for seventeen years, and it’s still not supported by one of the core operating system components, even as other encodings are becoming more and more obsolete. I’m dreaming of the day my system uses it internally for everything.

Fortunately, not everything is broken. Gawk, for example, works:

$ echo koŃ i żÓłw | gawk '{ print tolower($0); }'
koń i żółw

and so does sed.
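For comparison, a character-aware lowercasing pass is also easy to sketch in Python 3, where strings are sequences of Unicode characters rather than bytes. This is just an illustration, not what I actually ran; the function name downcase_stream is invented, and the in-memory streams stand in for real files.

```python
import io

def downcase_stream(src, dst):
    """Copy text from src to dst line by line, lowercasing each line."""
    for line in src:
        dst.write(line.lower())

# In-memory streams stand in for the input and output files here.
src = io.StringIO("koŃ i żÓłw\n")
dst = io.StringIO()
downcase_stream(src, dst)
print(dst.getvalue(), end="")  # koń i żółw
```

For a real 200 MB file, one would presumably pass two files opened with open(..., encoding="utf-8") instead; reading line by line keeps memory use small.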

Update 2010-04-04: I should have been more specific. The above rant applies to the GNU tools (tr and dd) as found in most Linux distributions; other versions can be more featureful. As Alex Ott points out in an email comment, tr on OS X works as expected for characters outside of ASCII, and also supports character classes as in tr '[:upper:]' '[:lower:]'. This is yet another testimony to the general high quality of Apple software; in this particular case, though, it may well be a direct effect of OS X’s BSD heritage. Does it work on *BSD?

Today’s lesson: Mind the symlinks

2008-06-11T00:00:00Z

Probably every day I keep learning new things, without even realizing it most of the time. The vast majority of them are minor or even tiny tidbits of knowledge; but even these might be worth noting down from time to time, especially when they are tiny pitfalls I’d fallen into and spent a couple of minutes getting out of. By sharing them, I might hopefully prevent someone else from slipping and falling in.

So here’s a simple Unix question: If you enter a subdirectory of the current directory and then go back to .., where will you end up? The most obvious answer is, of course, “in the original directory”, and is mostly correct. But is it always? Let’s see.

nathell@breeze:~$ pwd
/home/nathell
nathell@breeze:~$ cd foobar
nathell@breeze:~/foobar$ cd ..
nathell@breeze:~$ pwd
/home/nathell

So the hypothesis seems to be right. But let’s try doing this in Python, just for the heck of it:

nathell@breeze:~$ python
Python 2.5.2 (r252:60911, Apr 21 2008, 11:12:42)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> print os.getcwd()
/home/nathell
>>> os.chdir("foobar")
>>> os.chdir("..")
>>> print os.getcwd()
/var

Whoa, hang on! What’s that /var doing there? Of course the one thing I didn’t tell you is that foobar is not really a directory, but rather a symlink pointing to one (/var/log in this case).

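The corollary is that the shell builtin cd is not the same as Unix chdir() (it is easily checked that both Perl and C exhibit the same behaviour). The shell remembers the path by which you arrived and resolves .. against that, whereas chdir() resolves .. against the real directory the symlink points to. In fact, the shell builtin has an oft-forgotten command-line switch, -P, which causes it to follow the physical directory structure instead of the logical one.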
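The difference can be reproduced without touching /var/log. The following Python 3 sketch builds a throwaway directory tree with a symlink and compares the two notions of "parent". The directory names are made up for the demonstration, it assumes a system where os.symlink is permitted, and the normpath call is only an approximation of the shell's logical handling of "..".

```python
import os
import tempfile

start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.realpath(tmp)                # avoid symlinks in the temp path itself
    target = os.path.join(base, "real", "log")  # plays the role of /var/log
    os.makedirs(target)
    link = os.path.join(base, "foobar")         # plays the role of ~/foobar
    os.symlink(target, link)

    # What chdir() does: ".." is the parent of the real directory.
    os.chdir(link)
    os.chdir("..")
    physical_parent = os.getcwd()

    # Roughly what the shell's cd does: strip the last component of the
    # remembered (logical) path, without consulting the filesystem.
    logical_parent = os.path.normpath(os.path.join(link, ".."))

    os.chdir(start)  # step out before the temporary tree is deleted

print(physical_parent == os.path.join(base, "real"))
print(logical_parent == base)
```

Both comparisons should print True: going through chdir() lands in the parent of the symlink's target, while the logical computation lands back where the symlink lives.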

On a closing note: I have somewhat neglected the blog throughout the previous month, but I hope to revive it soon. It is not unlikely that such irregularities will recur.

Daniel Janus – Unix
