Daniel Janus – Clojure

Translating non-trivial codebases with Claude

2026-03-26T00:00:00Z

I was wrong (or was I?)

I don’t think it’s [me writing about LLMs] likely to happen anytime soon: I prefer to write about things that I’m excited about.

I was wrong. And right at the same time. Here comes another post where LLMs play a prominent role.

I asked Claude Code (Opus 4.6) to “rewrite this non-trivial codebase from C++ to Java”. And it worked quite splendidly. Then, on another non-trivial codebase (written in Haskell this time), I told Claude to “reimplement it in Clojure”. And it worked even better.

Yes, this triggered a wow effect. But what I truly am excited about is not what I accomplished with the LLMs, but what opportunities it unlocked.

If you like, you can skip the backstory and jump right through to the experience report, or to the showcase.

Backstory

I’ve always had an interest in natural language processing. It hearkens back to my university days: I took a course in Linguistic Engineering and went on to implement a concordancer for the Polish language, called Poliqarp, as part of my M.S. thesis. Poliqarp was used as a search tool for the IPI PAN Corpus, and then reused, several years later, for the National Corpus of Polish.

These days, I look at Poliqarp with a mixture of embarrassment and pride. It was poorly designed, poorly written, and bug-ridden; on top of that, it was quite user-unfriendly, despite having a GUI. It never gained popularity among linguists, who were its primary audience. (“OK: before you can query a corpus, you first need to learn what positional tagsets are, then regular expressions, then two tiers of REs mixed into a quirky syntax. And if you want to create a corpus, boy, do you need a Ph.D. in Unixology.”) But it also had some sophisticated ideas in it. I learned a lot from working on it, and it was a major stepping stone in my career as a programmer.

Poliqarp’s GUI, running a query

Shortly after graduation, I started having thoughts about how I’d design Poliqarp if I were to write it again. Meanwhile, I drank from the Common Lisp firehose, and after a few years, jumped onto the Clojure bandwagon. And after a few more years, Smyrna was born.

It was nowhere near as complicated as Poliqarp. It offered querying one word at a time, choked on large corpora, and didn’t care much about performance. But it was simple. Simple to install (just one .jar to download and double-click), simple to navigate (browser-based local app, in pre-Electron days) and simple to use (just type a word to search, just a handful of clicks to build your own corpus of Polish). Some people actually used it!

What it did support was automatic lemmatization. Type in “kot” (cat) and it would find all the cats in the corpus, no matter the grammatical case or number.

Smyrna 0.1

Fast forward a few years. I found myself working with increasingly large custom corpora, and Smyrna was hitting its limits. Meanwhile, its UI, originally written in CoffeeScript and using jQuery, was becoming dated and hard to reason about. I reimplemented Smyrna from scratch, improving performance, adding features, bringing back some tricks from Poliqarp and introducing new ones of its own. I then presented Smyrna and some associated tools at EuroClojure 2016.

(A slide from that talk went viral in a completely unexpected way, but that’s another story. Also, my AuDHD makes me a poor speaker – it’s hard to understand me at times. Sorry about that.)

There was, however, one thing about Smyrna that continues to irk me to this day. It’s the wildly suboptimal approach to lemmatization that it takes.

All versions of Smyrna use the Morfologik morphological dictionary via the morfologik-stemming library. It’s written in Java, so it interops with Clojure really easily. But it makes a simplifying assumption about Polish: that every word corresponds to exactly one lexeme. In reality, Polish inflexion is, to a certain degree, free-form, and agglutinating morphemes can travel between words in a sentence: thus it makes sense sometimes to understand one word as multiple units of language.

Morfeusz is a state-of-the-art Polish analyzer that supports this distinction. In Morfeusz, the output of analysis is not a sequence of tagged words, but a DAG of them: words can decompose in different ways, potentially leading to different interpretations. This can then be taken into account downstream in the NLP analysis pipeline, of which Morfeusz is typically a first step.

And it is. To my knowledge, most existing utilities and pipelines use Morfeusz. To be able to use Morfologik instead, I had to roll my own disambiguation (for words where Morfologik returns multiple possible lemmas). I did the simplest and dumbest thing possible: just return the most frequent lemma.
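The “most frequent lemma” heuristic fits in a few lines. This is only a minimal sketch, not Smyrna’s actual code: the name `most-frequent-lemma`, the frequency table and its counts are all made up for illustration.

```clojure
;; Hypothetical corpus frequencies of lemmas (illustrative numbers only).
(def lemma-freqs
  {"mieć" 1000, "mama" 50, "mamić" 2})

(defn most-frequent-lemma
  "Given the candidate lemmas returned for one word form, picks the one
  with the highest frequency; lemmas missing from freqs count as 0."
  [freqs candidates]
  (when (seq candidates)
    (apply max-key #(get freqs % 0) candidates)))

;; "mam" can be a form of "mieć" (to have), "mama" (mum) or "mamić" (to beguile).
(most-frequent-lemma lemma-freqs ["mama" "mieć" "mamić"])
;=> "mieć"
```

The weakness is visible right away: context is ignored entirely, so every “mam” becomes “mieć”, even in a sentence about mothers.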

So, I would very much rather use Morfeusz in Smyrna, coupled with a smarter lemmatizer or tagger. Problem is, Morfeusz is written in C++, and one of Smyrna’s raisons d'être is its ease of use. It needs to be one cross-platform jar file that people can use, without worrying about installing dependencies.

There are three possible approaches to making Morfeusz easily bundlable with Smyrna:

Use Morfeusz via JNI, bundle native libraries with the jar, and have the code automatically detect the system and load the correct library at startup. This is, for example, what the JDBC driver for SQLite does.
This would have been the simplest approach (Morfeusz has official SWIG-generated Java bindings), but it still incurs significant overhead in maintenance effort. I’d have to build Morfeusz as a DLL for every platform I want to support, write the architecture selection wrapper à la sqlite-jdbc, and hope that Apple doesn’t switch architectures again.

Somehow compile C++ Morfeusz to JVM bytecode. If there are ways to compile C++ to WASM, there should be some way to compile it to JVM, right? Except I don’t know of one. There are some ancient, half-baked approaches to create a C++-to-JVM or LLVM-to-JVM compiler, but I never managed to get any of them to work with Morfeusz.

Reimplement Morfeusz in Java or Clojure. This is a significant undertaking! Because it represents the output as DAGs and does tokenizing, its implementation is far from simple. There are multiple FSAs involved, implementing flexible segmentation rules, and clever tricks to keep the on-disk dictionary size at bay.
Still, I’ve tried a few times. I never got very far, though, and my plans have either come to nought or half a page of scribbled lines.
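To give an idea of what the first option involves, here is a rough sketch of just the platform-detection part. The resource layout (`native/<os>-<arch>/…`) and the library file names are assumptions made up for illustration; they are not what Morfeusz or sqlite-jdbc actually ship.

```clojure
(require '[clojure.string :as str])

(defn native-lib-resource
  "Maps the JVM's os.name / os.arch values to a (hypothetical) classpath
  location of the bundled native library."
  [os-name os-arch]
  (let [os (str/lower-case os-name)
        [dir file]
        (cond
          ;; check for macOS before Windows: "darwin" contains "win"
          (or (str/includes? os "mac")
              (str/includes? os "darwin")) ["macos" "libmorfeusz.dylib"]
          (str/includes? os "win")         ["windows" "morfeusz.dll"]
          (str/includes? os "linux")       ["linux" "libmorfeusz.so"]
          :else (throw (ex-info "Unsupported platform"
                                {:os os-name :arch os-arch})))]
    (str "native/" dir "-" os-arch "/" file)))

;; In real use the arguments would come from
;; (System/getProperty "os.name") and (System/getProperty "os.arch").
(native-lib-resource "Mac OS X" "aarch64")
;=> "native/macos-aarch64/libmorfeusz.dylib"
```

At startup, one would then extract that resource to a temporary file and hand its path to `System/load`. Every supported OS/architecture pair needs its own separately built binary, which is where the maintenance cost comes from.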

I’m pretty sure you have a hunch about where this is going.

Enter Claude Code

Hey Claude! I'd like you to work on converting Morfeusz to Java.

Morfeusz is a morphological analyser for Polish, written in C++. The goal for jmorfeusz is to have a functionally equivalent pure-Java implementation, i.e., without reaching to native code via JNI.

You have access to:

the Morfeusz sources in Morfeusz/ – you'll have to compile it yourself
the SGJP dictionaries in dict/ – use these to cross-validate your implementation against the original

You can start small and only implement the morphological analysis, without synthesis.

Please put your code in jmorfeusz/ only.

Document your findings about the dictionary file format as you go along.

This is what I told Claude, and it eagerly set off to work.

I was mostly watching (from the bird’s eye view) what it was doing, and telling Claude to “Continue” when it paused. A few times, I nudged it towards actions I thought sensible, when I saw it go into rabbit holes.

For example, it started off with a static analysis of Morfeusz’s code and didn’t bother compiling it. Then, when it started running into the limits of its static understanding, I suggested that it interrupt what it was doing and compile the code. However, it ran into some problems, and asked for help:

I’m running into compilation issues (missing system libraries in the linker). Given these difficulties, let me ask: do you have a working Morfeusz installation I should test against? Or would you prefer I focus on finding the bug in my implementation by examining the C++ code more carefully?

Luckily, I did! Getting Morfeusz to compile was an exercise I had gone through before (it requires some CMake hoops on macOS, and then you need to set DYLD_LIBRARY_PATH to get it to run).

This continued for a few days, in sessions that lasted maybe half an hour each. Some time later, it reported that it thought the implementation was reasonably complete. To which I prompted:

Awesome! Can you test on a longer text? Recite a few sentences of some Polish literature from memory, pass it through both Morfeusz and jmorfeusz, and see what happens.

Claude readily wrote a test that did so (hallucinating a passage from “The Doll” in the process, but it didn’t matter for the time being). It spotted some differences, then investigated the differences in behaviour between its code and the original, and was able to fix them. It iterated a few times until reporting a 100% match on its sentences.

This repeated a few times, with me first substituting the actual first sentence of “The Doll”, and then a longer passage. Claude discovered mismatches and fixed them.

And then it was time for a grand finale. I downloaded a full text of “Quo Vadis” and told the LLM:

I want you to redo the literature tests to operate on the whole quo-vadis.txt provided, feeding it to both impls line by line (1 line = 1 paragraph). Skip empty lines. Let’s see how many more corner cases we can discover!

Claude updated its test, ran it, and – to my astonishment – reported:

Perfect match on the entire novel.

Paragraphs:     4,364
Perfect match:  4,364
With divergences: 0

Matching interps:  321,331
Only in C++:       0
Only in Java:      0
Match rate:        100.00%

321,331 morphological interpretations across all 4,364 paragraphs of Quo Vadis, and every single one matches the C++ Morfeusz output. Zero divergences.
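The shape of such a differential test is simple enough to sketch. This is not the test Claude wrote: `compare-analyzers` and the two stub analyzers below are hypothetical stand-ins, assuming that each implementation can be wrapped as a function from a paragraph to a collection of interpretations.

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

(defn compare-analyzers
  "Feeds every non-blank paragraph to both analyzers and tallies how many
  interpretations match and how many appear on one side only.
  Interpretations are compared as sets, so duplicates are ignored."
  [analyze-ref analyze-new paragraphs]
  (reduce
    (fn [stats paragraph]
      (let [expected (set (analyze-ref paragraph))
            actual   (set (analyze-new paragraph))]
        (-> stats
            (update :paragraphs inc)
            (update :perfect + (if (= expected actual) 1 0))
            (update :matching + (count (set/intersection expected actual)))
            (update :only-ref + (count (set/difference expected actual)))
            (update :only-new + (count (set/difference actual expected))))))
    {:paragraphs 0, :perfect 0, :matching 0, :only-ref 0, :only-new 0}
    (remove str/blank? paragraphs)))

;; Stub "analyzers": one interpretation per whitespace-separated word.
(defn ref-impl [p]
  (map (fn [w] [w :interp]) (str/split p #"\s+")))

;; A deliberately broken port that loses the word "kot".
(defn buggy-impl [p]
  (remove #(= "kot" (first %)) (ref-impl p)))

(compare-analyzers ref-impl buggy-impl ["Ala ma kota" "" "kot śpi"])
;=> {:paragraphs 2, :perfect 1, :matching 4, :only-ref 1, :only-new 0}
```

The value of the approach is that the original implementation acts as the oracle: no expected outputs have to be written by hand, and any text at all can serve as test data.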

I sat for a while, my jaw dropped. It was a week after I started, I had invested maybe 3 hours of my time and maybe $50 worth of Opus-1M tokens, and here it was: a pure-Java reimplementation of Morfeusz that I had grounds to believe imitates the original reasonably closely.

Raising the bar

A few more weeks passed. Claude made the 1M-token window the default, not charging anything extra for it. “It’s time for a second component: a lemmatizer”, I thought. “Or, rather, a proper tagger”.

The tool of choice was Concraft-pl. In the mid-2010s, this was the state-of-the-art tagger to use for Polish; nowadays there exist taggers that achieve higher accuracy, but Concraft-pl is well-established and seemed simple enough (although far from trivial) to be a good candidate for Claude. Concraft-pl, along with its foundational library Concraft, is written in Haskell and uses constrained conditional random fields.

I asked Claude to convert Concraft-pl to Clojure. (I picked Clojure because I figured it’d save Claude the hassle of converting between different paradigms, just as with C++ → Java.)

I thought it’d have a harder time than with the previous conversion, because (1) Claude is likely to have seen much more C++ and Java than Clojure and Haskell in its training data, and (2) unlike Morfeusz, where I had a rough idea of what the code was doing, here I had no clue (my Haskell-fu is very rusty and I have never used stack, cabal, etc).

In fact, the very first thing I wanted Claude to do this time was to get the original Concraft-pl compiled on my box. There’s been some bit rot since Concraft was originally written, and the GHC version stipulated by Concraft’s stack.yaml is too old for today’s Stack.

Fortunately, Claude was up to the task this time. Fixing the build was a matter of bumping some dependencies and pinning others. It took several iterations, but I ended up with a working binary and was able to feed it some input data.

Then, inspired by Ankur Sethi’s article about his LLM-written interpreter, I wrote an AGENTS.md to tell Claude to keep a plans directory and update it as it goes. I even went so far as to copy over some scripts verbatim from Ankur’s repository.

And… you know the drill:

Hey Claude! Let’s work on converting Concraft to Clojure. This is a big project but we can do it together. Have a look at AGENTS.md and the resources I’ve gathered for you, and start preparing a plan.

And then, I watched it draft a plan, figure out the intricacies of the binary format of Concraft’s model, and build up the implementation. An hour or so later, after a round of “Continue”s and “Yes please”s…

clj-concraft produces byte-identical output to the Haskell binary on both test inputs.

Summary of what we built:

~900 lines of Clojure reimplementing the Concraft morphosyntactic tagger
Full pipeline: model loading → DAG parsing → guessing → segmentation → disambiguation → formatted output
Key discoveries: decodeFloat Double encoding, TFeat3-only transition features, Viterbi fast-tag for DAG disamb, OOV base form = orth
Performance: ~4 seconds for 39 paragraphs (2570 edges), ~3.5s model load

Yep. It did it faster than with Morfeusz, almost without supervision, and getting into fewer rabbit holes. I have no idea to what extent it was due to having the plans directory (it didn’t seem to make much difference, Claude just sketched an initial plan and then stuck to it).

And there we have it! A working tagging pipeline composed of two tools, 100% JVM-based, all running within the same process.

Excitement

So, I’ve managed to translate two highly non-trivial pieces of code to a common platform using Claude – in a very short timeframe. I saved a lot of time and effort. Did I lose something in return?

Sure I did. I lost understanding.

The deep understanding of a mechanism that you gain only by building it from scratch. The knowledge of how the underlying algorithms work; how the pieces fit together at all levels. The kind of knowledge you get when describing something in very minute detail, like when you’re Bartosz Ciechanowski.

So, yes, I was initially excited like a child. “Look, ma, I have this shiny new toy!” But this shortly wore off, even though the toy might work. So what? Vibecoded stuff is cheap. There’s little value in it by itself. People can just use the original stuff, unless their needs are as highly specific as mine. Plus, the ease of the whole process almost felt like cheating, like having a colleague sitting next to me in an exam whose answers I can just rip off and get away with it. The analogy reaches farther than it seems: there’s a reason we don’t let people cheat at school, and that reason is precisely that we want them to learn, not just produce well-graded artifacts.

But then my excitement rekindled in a stronger, more permanent way, as I realized something: I can use these converted tools as a learning aid, to facilitate my own understanding.

Sure, I could step through the C++ Morfeusz in a debugger. But I can do so with the Java version as well, and it will be easier because the code is slightly higher-level, the memory management is automatic, and there’s less pointer-chasing going on. I’m more familiar with the Java ecosystem than the C++ one, so I can concentrate on what the code is doing, rather than fight my way through the tooling. A significant obstacle just vanishes into thin air.

Better yet, I can leverage the ecosystem to its highest potential. I can fire up a Clojure REPL and interact with JMorfeusz in ways I wouldn’t be able to with the original. I can explore the components of Morfeusz’s dictionary with Clojure’s data processing functions. I can plug its automata into Loom and run graph-theoretic algorithms on them to my heart’s delight. I can visualize them. The list of things goes on and on. Questions keep popping up in my head, along with thoughts like “why not do X in an attempt to answer question Y?”

And finally, I can ask Claude:

Write (to a Markdown file in the repo) an explanation of how the algorithm works top-to-bottom and how the various FSAs fit together – a documentation that will make it easier for a newcomer to understand the code.

Which results in this document, complete with a data flow diagram and a high-level pseudocode of the main algorithm. Yes, it is written in LLM-ese, bland English, prone to hallucinations and inaccuracies. But that’s fine. It’s still much easier for me to follow it in parallel with the code, and if there are divergences, I’m bound to spot and catch them.

Here’s a similar document for clj-concraft. I mentioned earlier that I knew next to nothing about CCRFs. I have a simple mind that struggles to reason about statistics: I start reading a Wikipedia article and the moment it starts talking about random variables, I think “gaaah, random variables, functions from a sample space to ℝ… what is the sample space here?… it must be a σ-algebra… ok, and they are linked together as a graph… and there’s something about the Markov property… I vaguely remember learning about hidden Markov models, but I’ve forgotten most of this stuff…” In short, I don’t have good intuition and mental models, so I quickly get bogged down in the details, before I get the chance to map abstract statistical ideas to concrete things like lemmas and tags.

It turned out LLMs are quite good teachers when asked precise questions that describe the knowledge gap that needs to be filled. Here’s a conversation I had the other day with ChatGPT, for a change, starting with:

When applying Hidden Markov Models to POS tagging in NLP, what do the latent states and observations usually represent?

I read through the responses, thought about them, and whipped up a toy implementation of Viterbi’s algorithm with an HMM-based tagger in a few hours. The old-fashioned way, by typing out code in Emacs. Just to see if I can reconstruct the trail of thought in my head. It was a fun exercise.
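For the curious, a toy Viterbi decoder for an HMM tagger can look roughly like the sketch below. It is written from scratch for illustration and is not the implementation mentioned above. Hidden states are tags, observations are words, and all the probabilities in the example model are invented.

```clojure
(defn log-probs
  "Converts a (possibly nested) map of probabilities to log-probabilities."
  [m]
  (into {} (for [[k v] m]
             [k (if (map? v) (log-probs v) (Math/log v))])))

(def neg-inf Double/NEGATIVE_INFINITY)

(defn viterbi
  "Returns the most probable sequence of states (tags) for the given
  observations (words). start, trans and emit hold log-probabilities;
  missing entries are treated as impossible."
  [states start trans emit observations]
  (if (empty? observations)
    []
    (let [init  (into {}
                      (for [s states]
                        [s {:logp (+ (get start s neg-inf)
                                     (get-in emit [s (first observations)] neg-inf))
                            :path [s]}]))
          ;; For each state, keep only the best path ending in that state.
          step  (fn [prev obs]
                  (into {}
                        (for [s states]
                          (let [via  (fn [p] (+ (:logp (prev p))
                                                (get-in trans [p s] neg-inf)))
                                best (apply max-key via states)]
                            [s {:logp (+ (via best) (get-in emit [s obs] neg-inf))
                                :path (conj (:path (prev best)) s)}]))))
          final (reduce step init (rest observations))]
      (:path (val (apply max-key (comp :logp val) final))))))

;; A two-tag model with made-up numbers.
(def model
  {:start (log-probs {:noun 0.8 :verb 0.2})
   :trans (log-probs {:noun {:noun 0.3 :verb 0.7}
                      :verb {:noun 0.6 :verb 0.4}})
   :emit  (log-probs {:noun {"psy" 0.9 "szczekają" 0.1}
                      :verb {"psy" 0.1 "szczekają" 0.9}})})

(viterbi [:noun :verb] (:start model) (:trans model) (:emit model)
         ["psy" "szczekają"])
;=> [:noun :verb]
```

Working with sums of logarithms instead of products of probabilities keeps long sentences from underflowing to zero.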

I’m still digging through that Concraft walkthrough. I haven’t gotten far yet, but I at least have some mental models, and as a bonus learned how Haskell serializes doubles, about doing arithmetic in log domain, and about LogSumExp.
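LogSumExp deserves a tiny illustration. When probabilities are stored as logarithms, adding two of them naively means exponentiating first, which underflows for very small values. The usual trick is to factor out the maximum. This is a generic sketch of that trick, not code taken from Concraft or clj-concraft.

```clojure
(defn log-sum-exp
  "Computes log(exp(x1) + exp(x2) + ...) for a non-empty collection of
  log-domain values, shifting by the maximum to avoid underflow."
  [xs]
  (let [m (double (apply max xs))]
    (if (Double/isInfinite m)
      m ; all terms are -Infinity (or one is +Infinity): nothing to rescale
      (+ m (Math/log (reduce + (map #(Math/exp (- % m)) xs)))))))

;; Naively, exp(-1000) underflows to 0.0, so the sum's log is -Infinity:
(Math/log (+ (Math/exp -1000.0) (Math/exp -1000.0)))

;; With the shift, the result is -1000 + log 2:
(log-sum-exp [-1000.0 -1000.0])
```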

Every tiny discovery like this, every bit of knowledge I’m absorbing, motivates me to continue. In my last post, I declared myself a “conscious LLM-skeptic” and wrote:

I’ve made a choice for those areas not to include LLMs – lest they divert my attention from things I care about.

I care about the fundamentals of my craft. I care about programming languages and their theory. […] I care about abstractions.

I still stand by my words. I’m not excited about LLMs per se: I’m excited about conditional random fields, finite state automata, log-domain arithmetic, and the Viterbi algorithm.

And I’m glad to have found a tool that has made all of this learning not just more accessible, but possible in the first place. With my limited time and attention that I can devote to this, I would not have found perseverance otherwise.

And yet

And yet. And yet.

I keep thinking about what Arne is saying. And Drew. And Rich.

Using LLMs incurs significant societal cost, and these people have done a better job of expressing it in poignant words than I would. Should I not, then, refrain from touching them altogether?

My thoughts on this are similar to those I had when I allowed myself the luxury of a week on a cruise ship. Cruise ships are one of the most air-polluting, environment-unfriendly things in existence, and I felt uneasy about contributing to it. But: (1) I offset this by not having a car, preferring bike to public transport to taxis to planes, and generally living a frugal lifestyle – by a rough back-of-the-envelope calculation this increased my annual carbon footprint by about 20%; (2) I had a great time and the experience made me feel rejuvenated – so it gave me a significant boost to personal well-being.

There’s a tradeoff here. Whether or not it’s an ethically acceptable one, I leave for you to judge. Likewise with LLMs: I experienced a real benefit to myself, a human, and I feel that’s already a lot.

Showcase

Here, I gather links to the LLM-generated artifacts that I’ve been talking about:

JMorfeusz
clj-concraft
szlauch, a pipeline combining the two

If you’re interested, you can also read transcripts of my Claude Code sessions:

Closing remarks

Wow. Somehow, this has become my longest-ever blog post.

There will likely be a Smyrna 0.4, using both libraries, sometime this year. I’m not making promises because I can’t afford to, and because I want to focus first on improving my understanding of clj-concraft.

Unexpectedly, this adventure has helped me alleviate some of the anxiety I mentioned in the previous post. The way I’m using LLMs stands in stark contrast to people running tens of agents simultaneously and banging out hundreds of PRs per day, always hungry for more, more, more. I don’t want to move fast; I want to slow the fuck down and move thoughtfully instead, paying attention to understanding code, be it LLM-generated or human-written. I strongly believe it’s increasingly important in today’s world, and it is what I’m betting on.

It seems fitting to end with a quote:

“Always do the very best job you can,” he said on another occasion as he put a last few finishing touches with a file on the metal parts of a wagon tongue he was repairing.

“But that piece goes underneath,” Garion said. “No one will ever see it.”

“But I know it’s there,” Durnik said, still smoothing the metal. “If it isn’t done as well as I can do it, I’ll be ashamed every time I see this wagon go by—and I'll see the wagon every day.”

— David Eddings, Pawn of Prophecy

No, really, you can’t branch Datomic from the past (and what you can do instead)

2025-04-22T00:00:00Z

I have a love-hate relationship with Datomic. Datomic is a Clojure-based database built on a record of immutable facts; this post assumes a passing familiarity with it – if you haven’t tried it yet, I highly recommend checking it out, it’s enlightening even if you end up not using it.

I’ll leave ranting on the “hate” part for some other time; here, I’d like to focus on some of the love – and its limits.

Datomic has this feature called “speculative writes”. It allows you to take an immutable database value, apply some new facts to it (speculatively, i.e., without sending them over to the transactor – this is self-contained within the JVM), and query the resulting database value as if those facts had been transacted for real.

This is incredibly powerful. It lets you “fork” a Datomic connection (with the help of an ingenious library called Datomock), so that you can see all of the data in the source database up to the point of forking, but any new writes happen only in memory. You can develop on top of production data, but without any risk of damaging them! I remember how aghast I was upon first hearing about the concept, but now can’t imagine my life without it. Datomock’s author offers an analogy to Git: it’s like database values being commits, and connections being branches.

Another awesome feature of Datomic is that it lets you travel back in time. You can call as-of on a database value, passing a timestamp, and you get back a db as it was at that point in time – which you can query to your heart’s content. This aids immensely in forensic debugging, and helps answer questions which would have been outright impossible to answer with classical DBMSs.

Now, we’re getting to the crux of this post: as-of and speculative writes don’t compose together. If you try to create a Datomocked connection off of a database value obtained from as-of, you’ll get back a connection to which you can transact new facts, but you’ll never be able to see them. The analogy to Git falls down here: it’s as if Git only let you branch HEAD.

This is a well-known gotcha among Datomic users. From Datomic’s documentation:

as-of Is Not a Branch

Filters are applied to an unfiltered database value obtained from db or with. In particular, the combination of with and as-of means "with followed by as-of", regardless of which API call you make first. with plus as-of lets you see a speculative db with recent datoms filtered out, but it does not let you branch the past.

So it appears that this is an insurmountable obstacle: you can’t fork the past with Datomic.

Or can you?

Reddit user NamelessMason has tried to reimplement as-of on top of d/filter, yielding what seems to be a working approach to “datofork”! Quoting his post:

Datomic supports 4 kinds of filters: as-of, since, history and custom d/filter, where you can filter by arbitrary datom predicate. […]

d/as-of sets an effective upper limit on the T values visible through the Database object. This applies both to existing datoms as well as any datoms you try to add later. But since the tx value for the next transaction is predictable, and custom filters compose just fine, perhaps we could just white-list future transactions?

(defn as-of'' [db t]
  (let [tx-limit (d/t->tx t)
        tx-allow (d/t->tx (d/basis-t db))]
    (d/filter db (fn [_ [e a v tx]] (or (<= tx tx-limit) (> tx tx-allow))))))

[…] Seems to work fine!

Sadly, it doesn’t actually work fine. Here’s a counterexample:

(def conn (let [u "datomic:mem:test"] (d/create-database u) (d/connect u)))

;; Let's add some basic schema
@(d/transact conn [{:db/ident :test/id :db/valueType :db.type/string
                    :db/cardinality :db.cardinality/one :db/unique :db.unique/identity}])
(d/basis-t (d/db conn)) ;=> 1000

;; Now let's transact an entity
@(d/transact conn [{:test/id "test", :db/ident ::the-entity}])
(d/basis-t (d/db conn)) ;=> 1001

;; And in another transaction let's change the :test/id of that entity
@(d/transact conn [[:db/add ::the-entity :test/id "test2"]])
(d/basis-t (d/db conn)) ;=> 1003

;; Trying a speculative write, forking from 1001
(def db' (-> (d/db conn)
             (as-of'' 1001)
             (d/with [[:db/add ::the-entity :test/id "test3"]])
             :db-after))
(:test/id (d/entity db' ::the-entity)) ;=> "test" (WRONG! it should be "test3")

To recap what we just did: we transacted version A of an entity, then an updated version B, then tried to fork C off of A, but we’re still seeing A’s version of the data. Can we somehow save the day?

To see what d/filter is doing, we can add a debug println to the filtering function, following NamelessMason’s example (I’m translating tx values to t for easier understanding):

(defn as-of'' [db t]
  (let [tx-limit (d/t->tx t)
        tx-allow (d/t->tx (d/basis-t db))]
    (d/filter db (fn [_ [e a v tx :as datom]]
                   (let [result (or (<= tx tx-limit) (> tx tx-allow))]
                     (printf "%s -> %s\n" (pr-str [e a v (d/tx->t tx)]) result)
                     result)))))

Re-running the above speculative write snippet now yields:

[17592186045418 72 "test" 1003] -> false
[17592186045418 72 "test" 1001] -> true

So d/filter saw that tx 1003 retracts the "test" value for our datom, but it’s rejected because it doesn’t meet the condition (or (<= tx tx-limit) (> tx tx-allow)). And at this point, it never even looks at datoms in the speculative transaction 1004, the one that asserted our "test3". It looks like Datomic’s d/filter does some optimizations where it skips datoms if it determines they cannot apply based on previous ones.

But even if d/filter did do what we want (i.e., include datoms from tx 1001 and 1004 but not 1003), the approach still couldn’t work. Let’s see what datoms our speculative transaction introduces:

(-> (d/db conn)
    (as-of'' 1001)
    (d/with [[:db/add ::the-entity :test/id "test3"]])
    :tx-data
    (->> (mapv (juxt :e :a :v (comp d/tx->t :tx) :added))))
;=> [[13194139534316 50 #inst "2025-04-22T12:48:40.875-00:00" 1004 true]
;=>  [17592186045418 72 "test3" 1004 true]
;=>  [17592186045418 72 "test2" 1004 false]]

It adds the value of "test3" but retracts "test2"! Not "test"! It appears that d/with looks at the unfiltered database value to produce new datoms for the speculative db value (corroborated by the fact that we don’t get any output from the filtering fn at this point; we only do when we actually query db'). Our filter cannot work: transactions 1001 plus 1004 would be “add "test", retract "test2", add "test3"”, which is not internally consistent.

So, no, really, you can’t branch Datomic from the past.

Which brings us back to square one: what can we do? What is our use case for branching the past, anyway?

Dunno about you, but to me the allure is integration testing. Rather than having to maintain an elaborate set of fixtures, with artificial entity names peppered with the word “example”, I want to test on data that’s close to production; that feels like production. Ideally, it is production data, isolated and made invincible by forking. At the same time, tests have to behave predictably: I don’t want a test to fail just because someone yesterday deleted an entity from production that the test depends on. Being able to fork the past would have been a wonderful solution if it worked, but… it’s what it is.

So now I’m experimenting with a different approach. My observation here is that my app’s Datomic database is (and I’d wager a guess that most real-world DBs are as well) “mostly hierarchical”. That is, while its graph of entities might be a giant strongly-connected blob, it can be subdivided into many small subgraphs by judiciously removing edges.

This makes sense for testing. A test typically focuses on a handful of “top-level entities” that I need to be present in my testing database like they are in production, along with all their dependencies – sub-entities that they point to. Say, if I were developing a UI for the MusicBrainz database and testing the release page, I’d need a release entity, along with its tracks, label, medium, artist, country etc to be present in my testing DB. But just one release is enough; I don’t need all 10K of them.

My workflow is thus:

create an empty in-memory DB
feed it with the same schema that production has
get hold of a production db with a fixed as-of
given a “seed entity”, perform a graph traversal (via EAVT and VAET indexes) starting from that entity to determine reachable entities, judiciously blacklisting attributes (and whitelisting “backward-pointing” ones) to avoid importing too much
copy those entities to my fresh DB
run the test!
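The traversal step of the workflow above can be sketched without Datomic at all. In the sketch below the database is replaced by a plain map from entity to `{attribute [referenced entities]}`, which is an assumption made purely for illustration. A real version would read the EAVT and VAET indexes instead and would also follow whitelisted backward-pointing attributes, which this toy ignores. `reachable-entities` and the sample data are hypothetical.

```clojure
(defn reachable-entities
  "Breadth-first search from seed over reference attributes,
  not following any attribute for which skip-attr? returns true."
  [refs seed skip-attr?]
  (loop [queue (conj clojure.lang.PersistentQueue/EMPTY seed)
         seen  #{seed}]
    (if-let [e (peek queue)]
      (let [next-es (for [[attr targets] (get refs e)
                          :when (not (skip-attr? attr))
                          t targets
                          :when (not (seen t))]
                      t)]
        (recur (into (pop queue) (distinct next-es))
               (into seen next-es)))
      seen)))

;; A MusicBrainz-flavoured toy graph.
(def refs
  {:release       {:release/tracks [:track-1 :track-2]
                   :release/label  [:label]}
   :track-1       {:track/artist [:artist]}
   :track-2       {:track/artist [:artist]}
   :label         {:label/releases [:release :other-release]}
   :other-release {:release/label [:label]}})

;; Blacklisting :label/releases keeps the label's other releases out.
(reachable-entities refs :release #{:label/releases})
;=> #{:release :track-1 :track-2 :label :artist}
```

Without the blacklist, the label would drag in `:other-release`, and in a real database probably most of the catalogue along with it.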

This can be done generically. I’ve written some proof-of-concept code that wraps a Datomic db to implement the Loom graph protocol, so that one can use Loom’s graph algorithms to perform a breadth-first entity scan, and a function to walk over those entities and convert them to a transaction applicable on top of a pristine DB. So far I’ve been able to extract meaningful small sub-dbs (on the order of 10K datoms) from my huge production DB of 17+ billion datoms.

This is a gist for now, but let me know if there’s interest and I can convert it into a proper library.

Double, double toil and trouble or, Corner-Cases of Comparing Clojure Numbers

2025-02-21T00:00:00Z

Let’s talk about Clojure.

In Clojure, comparing two numbers can throw an exception.

Check this out:

(< 1/4 0.5M)
;=> true        ; as expected

(< 1/3 0.5M)
; Execution error (ArithmeticException) at java.math.BigDecimal/divide (BigDecimal.java:1783).
; Non-terminating decimal expansion; no exact representable decimal result.

But why? Why would comparing two perfectly cromulent numbers throw an ArithmeticException?! Everybody knows that ⅓ < 0.5 – we aren’t dividing by zero or anything like that, are we?

Well, the problem is that we’re comparing a ratio to a BigDecimal (a decimal number of arbitrary precision). Java doesn’t offer a built-in way of comparing these (Clojure’s ratios aren’t part of the Java standard library), so it has to coerce one into the other. It chooses to coerce the ratio into a BigDecimal, so it divides (bigdec 1) by (bigdec 3)…

…and that throws! The decimal representation of ⅓ is infinite, so you can’t keep all the digits in finite memory.
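The failing division is easy to reproduce in isolation, with no comparison involved. In this small sketch, `exact-decimal?` is a helper name invented for the occasion:

```clojure
(defn exact-decimal?
  "True if BigDecimal can represent n/d exactly, i.e. the division
  completes without an ArithmeticException."
  [n d]
  (try
    (.divide (bigdec n) (bigdec d))
    true
    (catch ArithmeticException _
      false)))

(exact-decimal? 1 4) ;=> true   (0.25)
(exact-decimal? 1 3) ;=> false  (0.333… never terminates)

;; Supplying a precision makes the division terminate by rounding:
(with-precision 10 (/ 1M 3M))
;=> 0.3333333333M
```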

You may ask: how exactly does Clojure know what coercions to apply and how to produce the result? Let’s look at the code.

The implementation of clojure.core/< calls the Java method clojure.lang.Numbers.lt, which is implemented like this:

static public boolean lt(Object x, Object y){
	return ops(x).combine(ops(y)).lt((Number)x, (Number)y);
}

What’s ops? It’s an implementation of the Ops interface, which has methods for addition, subtraction, etc.; each number class has its own implementation: there is a LongOps, RatioOps, BigDecimalOps etc.

The combine method can alter the behaviour of an Ops depending on the type of the other argument – for example, RatioOps switches to BigDecimalOps if the other argument is a BigDecimal. It’s like a poor man’s implementation of multiple dispatch, which Java doesn’t have.
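For comparison, this is roughly what the same selection looks like with real multiple dispatch, using a Clojure multimethod keyed on the classes of both arguments. It is an illustration only: combined-ops is a hypothetical name, the keywords merely label which Ops implementation would win, and only a handful of class pairs are modelled.

```clojure
;; Dispatch on the classes of both arguments at once.
(defmulti combined-ops (fn [x y] [(class x) (class y)]))

;; A ratio meeting a BigDecimal (in either order) ends up in BigDecimalOps.
(defmethod combined-ops [clojure.lang.Ratio java.math.BigDecimal] [_ _] :bigdecimal-ops)
(defmethod combined-ops [java.math.BigDecimal clojure.lang.Ratio] [_ _] :bigdecimal-ops)

;; Ratios among themselves, or with longs, stay in RatioOps.
(defmethod combined-ops [clojure.lang.Ratio clojure.lang.Ratio] [_ _] :ratio-ops)
(defmethod combined-ops [clojure.lang.Ratio Long] [_ _] :ratio-ops)

;; Every other pair is simply not modelled in this sketch.
(defmethod combined-ops :default [_ _] :not-modelled)

(combined-ops 1/3 0.5M) ;=> :bigdecimal-ops
(combined-ops 1/3 2/3)  ;=> :ratio-ops
```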

BigDecimalOps.lt calls toBigDecimal on both arguments, and it’s that method that performs the failing division:

static BigDecimal toBigDecimal(Object x) {
    // ... other cases ...
    if (x instanceof Ratio) {
        Ratio r = (Ratio)x;
        return (BigDecimal)divide(new BigDecimal(r.numerator), r.denominator);
    }
}

Incidentally, this used to produce the expected result in Clojure up to 1.2.1. At that version, Clojure already used the Ops-based multiple dispatch, but combining RatioOps with BigDecimalOps would yield the former, not the latter.

Is the current behaviour a bug? I’m not sure. It seems so, but maybe 1.3.0’s optimizations warrant this behaviour in the admittedly rare case. There’s an ongoing discussion on the Ask Clojure Q&A.

So, in current Clojure, how do you compare ratios to bigdecs? Simple, you think: just coerce the bigdec to a double!

(< 1/3 (double 0.5M))
;=> true

(> 2/3 (double 0.5M))
;=> true

(= 1/2 (double 0.5M))
;=> false

Wait, WHAT?

Yep. Comparing ratios to doubles for inequality works fine, but a ratio is never equal to a double (nor a bigdec), even if said double is an exact representation of the ratio.

This one is documented, but often forgotten about (and not hinted at by the docstring). From Clojure’s equality guide:

Clojure’s = is true when called with two immutable scalar values, if:
Both arguments are nil, true, false, the same character, or the same string (i.e. the same sequence of characters).
Both arguments are symbols, or both keywords, with equal namespaces and names.
Both arguments are numbers in the same 'category', and numerically the same, where category is one of:
integer or ratio
floating point (float or double)
BigDecimal.

And indeed, the code for Numbers.equal has a check for both operands’ categories before it delves into the Ops business that we’ve seen. Remember also that Clojure has a numbers-only == which doesn’t trigger that category check:

(== 1/2 (double 0.5M))
;=> true ; yay
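A few more pairs make the category rule easier to remember. This is a small sketch; category-demo and its keys are just labels invented for the example:

```clojure
(def category-demo
  {:long-vs-reduced-ratio (= 2 4/2)      ; 4/2 is read as the integer 2
   :long-vs-bigint        (= 1 1N)       ; both are in the integer category
   :long-vs-double        (= 1 1.0)      ; integer vs floating point
   :double-vs-bigdec      (= 1.0 1.0M)   ; floating point vs BigDecimal
   :ratio-vs-bigdec       (= 1/2 0.5M)   ; ratio vs BigDecimal
   :==-long-vs-double     (== 1 1.0)     ; == skips the category check
   :==-double-vs-bigdec   (== 1.0 1.0M)})

category-demo
;=> {:long-vs-reduced-ratio true,
;    :long-vs-bigint true,
;    :long-vs-double false,
;    :double-vs-bigdec false,
;    :ratio-vs-bigdec false,
;    :==-long-vs-double true,
;    :==-double-vs-bigdec true}
```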

Corollary: if you want to compare a ratio to a BigDecimal, you could coerce the bigdec to a double. That can return an incorrect result only in a very narrow range of cases: when the BigDecimal’s value is close enough to the ratio that it would be lost in the double conversion.

For 100% certainty, the only way I’m aware of is to remember to always use == when comparing for equality, and explicitly coerce the bigdec to ratio:

(defn exactly-equals? [ratio bigdec]
  (== (* 1 (clojure.lang.Numbers/toRatio bigdec)) ratio))

(exactly-equals? 1/18446744073709551616 5.42101086242752217003726400434970855712890625E-20M)
;=> true ; correct even in this pathological case!

(Multiplying by 1 forces Clojure to normalize the ratio. Otherwise, converting 0.5M would have yielded 5/10 which doesn’t test == to 1/2. Go figure.)

Lossy CSS compression for fun and loss (or profit)

2024-01-26T00:00:00Z

What

Late last year, I had an idea that’s been steadily brewing in my head. I’ve found myself with some free time recently (it coincided with vacation, go figure), and I’ve hacked together some proof-of-concept code. Whether or not it is actually proving the concept I’m not sure, but the results are somewhat interesting, and I believe the idea is novel (I haven’t found any other implementation in the wild). So it’s at least worthy of a blog post.

I wrote cssfact, a lossy CSS compressor. That is, a program that takes some CSS and outputs back some other CSS that hopefully retains some (most) of the information in the input, but contains fewer rules than the original. Exactly how many rules it produces is configurable, and the loss depends on that number.

The program only works on style rules (which make up the majority of a typical CSS). It leaves the non-style rules unchanged.

Here’s the source. It’s not exactly straightforward to get it running, but it shouldn’t be very hard, either. It’s very simple – the program itself doesn’t contain any fancy logic; the actual decisions on what the output will contain are made by an external program.

If you just want to see some results, here is a sample with my homepage serving as a patient etherized upon a table. Its CSS is quite small – 55 style rules that cssfact can work on – and here’s how the page looks with various settings:

Original: page, CSS, source SASS
1 style rule: page, CSS (93% information loss)
5 style rules: page, CSS (74% information loss)
10 style rules: page, CSS (55% information loss)
20 style rules: page, CSS (31% information loss)
30 style rules: page, CSS (17% information loss)

My homepage and both of my blogs all use the same CSS, so you can try to replace the CSS in your browser’s devtools elsewhere on the site and see how it looks.

How

Three words: binary matrix factorization (BMF, in the Boolean algebra).

I guess I could just stop here, but I’ll elaborate just in case it isn’t clear.

Consider a simple CSS snippet:

h1, h2 {
   padding: 0;
   margin-bottom: 0.5em;
}

h1 {
   font-size: 32px;
   font-weight: bold;
}

h2 {
   font-size: 24px;
   font-weight: bold;
}

The first rule tells you that for all elements that match either the h1 or h2 selectors, the two declarations should apply.

You could visualize this CSS as a 5x2 binary matrix A^T where the m columns correspond to simple selectors (i.e., without commas in them) and the n rows correspond to declarations:

	`h1`	`h2`
`padding: 0`	1	1
`margin-bottom: 0.5em`	1	1
`font-size: 32px`	1	0
`font-size: 24px`	0	1
`font-weight: bold`	1	1

You could also transpose the matrix, yielding A with m rows denoting selectors and n columns denoting declarations. For my homepage’s CSS, m = 60 and n = 81; for bigger stylesheets, several thousand in either direction is not uncommon.
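As a rough sketch of how such a matrix could be derived from the rules, here is the example snippet written out as Clojure data. This is not how cssfact does it; rules, applies?, A and A-transposed are names invented for this illustration, and the row and column labels are listed by hand in the same order as in the table.

```clojure
;; The three style rules from the snippet, written out as data.
(def rules
  [{:selectors ["h1" "h2"] :declarations ["padding: 0" "margin-bottom: 0.5em"]}
   {:selectors ["h1"]      :declarations ["font-size: 32px" "font-weight: bold"]}
   {:selectors ["h2"]      :declarations ["font-size: 24px" "font-weight: bold"]}])

;; Row and column labels, in the same order as in the table.
(def selectors ["h1" "h2"])
(def declarations
  ["padding: 0" "margin-bottom: 0.5em" "font-size: 32px"
   "font-size: 24px" "font-weight: bold"])

(defn applies?
  "True if some rule attaches `declaration` to `selector`."
  [selector declaration]
  (boolean
   (some (fn [{sels :selectors decls :declarations}]
           (and (some #{selector} sels)
                (some #{declaration} decls)))
         rules)))

;; A: m rows (selectors) by n columns (declarations).
(def A
  (mapv (fn [s] (mapv (fn [d] (if (applies? s d) 1 0)) declarations))
        selectors))

;; A^T: the table as printed, one row per declaration.
(def A-transposed (apply mapv vector A))

A            ;=> [[1 1 1 0 1] [1 1 0 1 1]]
A-transposed ;=> [[1 1] [1 1] [1 0] [0 1] [1 1]]
```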

Now, linear algebra gives us algorithms to find a matrix A′ ≈ A such that there exists a decomposition A′ = B × C, where B has dimensions m × r, C has dimensions r × n, and r is small – typically much smaller than m or n. So this is a way of dimensionality reduction.

In the usual algebra of real numbers, there’s no guarantee that B or C will themselves be binary matrices – in fact, most likely they won’t. But if we operate in Boolean algebra instead (i.e. one where 1 + 1 = 1), then both B and C will be binary. The flip side is that the Boolean BMF problem is NP-hard, so the algorithms found in the wild perform approximate decompositions, not guaranteed to be optimal.

But that’s okay, because lossiness is inherent in what we’re doing anyway, and it turns out the binary matrices B and C are readily interpretable. Look again at the CSS matrix above: why is there a 1 in the top-left cell? Because at least one of the CSS rules stipulates the declaration padding: 0 for the selector h1.

This is exactly the definition of matrix multiplication in the Boolean algebra. The matrix A′ will have a 1 at coordinates [i, j] iff there is at least one k ∈ {1, …, r} such that B[i, k] = 1 and C[k, j] = 1. So the columns of B and rows of C actually correspond to CSS rules! Every time you write CSS, you’re actually writing out binary matrices – and the browser is multiplying them to get at the actual behaviour.

Well, not really, but it’s one way to think about it. It’s not perfect – it completely glosses over rules overlapping each other and having precedence, and treats them as equally important – but it somewhat works!
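To make the correspondence concrete, here is a small sketch of the Boolean product applied to the three rules of the example snippet. bool-mmul is a hypothetical helper written for this illustration, not something taken from cssfact. B has one column per rule (which selectors it names) and C has one row per rule (which declarations it contains), with declarations in the same order as in the table above.

```clojure
(defn bool-mmul
  "Boolean matrix product: entry [i, j] is 1 iff B[i, k] = C[k, j] = 1 for some k."
  [b c]
  (let [r (count c)
        n (count (first c))]
    (mapv (fn [b-row]
            (mapv (fn [j]
                    (if (some (fn [k] (and (= 1 (b-row k))
                                           (= 1 (get-in c [k j]))))
                              (range r))
                      1
                      0))
                  (range n)))
          b)))

;; Rows: h1, h2. Columns: rule 1 (h1, h2), rule 2 (h1), rule 3 (h2).
(def B [[1 1 0]
        [1 0 1]])

;; Rows: rules 1 to 3. Columns: the five declarations, in table order.
(def C [[1 1 0 0 0]
        [0 0 1 0 1]
        [0 0 0 1 1]])

(bool-mmul B C)
;=> [[1 1 1 0 1] [1 1 0 1 1]]   ; all three rules: the original matrix, no loss

;; Keeping only rule 1 (r = 1) is a lossy approximation of the same matrix.
(bool-mmul [[1] [1]] [[1 1 0 0 0]])
;=> [[1 1 0 0 0] [1 1 0 0 0]]
```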

You could plug in any BMF algorithm to this approach. For cssfact, I’ve picked the code by Barahona and Goncalves 2019 – sadly, I wasn’t able to find the actual paper – not because it performs spectacularly well (it’s actually dog-slow on larger stylesheets), but because I was easily able to make it work and interface with it.

Why

Why not?

The sheer joy of exploration is reason enough, but I believe there are potential practical applications. CSS codebases have the tendency to grow organically and eventually start collapsing under their own weight, and they have to be maintained very thoughtfully to prevent that. In many CSS monstrosities found in the wild, there are much cleaner, leaner, essence-capturing cores struggling to get out.

This tool probably won’t automatically extract them for you – so don’t put it in your CI pipeline – but by perusing the CSS that it produces and cross-checking it with the input, you could encounter hints on what redundancy there is in your styles. Things like “these components are actually very similar, so maybe should be united” may become more apparent.

My mental model of transducers

2023-09-09T00:00:00Z

Intro

I’ve been programming in Clojure for a long time, but I haven’t been using transducers much. I learned to mechanically transform (into [] (map f coll)) to (into [] (map f) coll) for a slight performance gain, but not much beyond that. Recently, however, I’ve found myself refactoring transducers-based code at work, which prompted me to get back to speed.

I found Eero Helenius’ article “Grokking Clojure transducers” a great help in that. To me, it’s much more approachable than the official documentation – in a large part because it shows you how to build transducers from the ground up, and this method of learning profoundly resonates with me. I highly recommend it. However, it’s also useful to have a visual intuition of how transducers work, a mental model that hints at the big picture without zooming into the details too much. In this post, I’d like to share mine and illustrate it with a REPL session. (Spoiler alert: there’s core.async ahead, but in low quantities.)

Pictures

Imagine data flowing through a conveyor belt. Say, infinitely repeating integers from 1 to 5:

I’m using the abstract term “conveyor belt”, rather than “sequence” or something like this, to avoid associations with any implementation details. Just pieces of data, one after another. These data may be anything; they may flow infinitely or stop at some point; may or may not all exist in memory at the same time. Doesn’t matter. That’s the beauty of transducers: they completely abstract away the implementation of sequentiality.

So, what is a transducer, intuitively? It’s a mechanism for transforming conveyor belts into other conveyor belts.

For example, (map inc) is a transducer that says: “take this conveyor belt and produce one where every number is incremented”. Applying it to the above belt yields this one:

An important thing about transducers is that they’re composable. To understand that, imagine further transforming the above belt by removing all the odd numbers. Intuitively, that’s what (remove odd?) does:

(I’ve left the spacing between boxes the same as before, because it helps me visualise (remove odd?) better. I imagine an invisible gnome sitting above the belt, watching carefully all the boxes that pass below it, and snatching greedily every one that happens to contain an odd number.)

Composability means that Clojure lets you say (comp (map inc) (remove odd?)) to mean the transducer that transforms the first belt to the third one. By putting together two simple building blocks, we produced a more complex one – that is itself reusable and can be used as another building block in an ever more complex data pipeline.

Notice we still haven’t said anything about the actual representation of the data, but are already able to model complex processes. We can then apply them to actual data, whether it’s a simple vector-to-vector transformation within the same JVM, or listening to a topic on a Kafka cluster, summarizing the incoming data and sending them to a data warehouse.

Code

OK, enough handwaving, time for a demo. Let’s fire up a REPL and load core.async (I’m assuming you’ve added it to your dependencies already). I won’t reproduce here the resulting values of expressions we evaluate (they’re mostly nils anyway), but I will reproduce output from the REPL (as comments).

(require '[clojure.core.async :refer [chan <!! >!! thread close!]])

Why core.async? Because I find it a great way to implement a conveyor belt that you can play with interactively. This can help you understand how the various Clojure-provided transducers work. For the noncognoscenti: core.async is a Clojure library that allows you to implement concurrent processes that communicate over channels. By default, that communication is synchronous, meaning that if a process tries to read from a channel, it blocks until another process writes something to that channel.

As it happens, we can pass a transducer to the function that creates channels, chan. It will put the invisible gnomes to work on values that pass through the channel. So you can view that channel as a conveyor belt!

For easy tinkering, we can do this:

(defn transformed-belt [xf]
  (let [ch (chan 1 xf)]
    (thread
      (loop []
        ;; <!! returns nil once the channel is closed, which ends the loop
        (when-some [value (<!! ch)]
          (println "Value:" (pr-str value))
          (recur))))
    ch))

This fires up a process working at the receiving end of the conveyor belt. It will print out any transformed values as soon as they become available. Typing at the REPL, we will assume the role of producer, putting data on the belt.

Like this:

(def b (transformed-belt (map inc)))
(>!! b 2)
; Value: 3
(>!! b 42)
; Value: 43

It works! We’re putting in numbers, and out come the incremented ones.

When we’re done experimenting with the belt, we need to close! it. This will cause the worker thread to shut down.

(close! b)

We can now experiment with something more complex, like that combined transducer we’ve talked about before:

(def b (transformed-belt (comp (map inc) (remove odd?))))
(>!! b 1)
; Value: 2
(>!! b 2)
(>!! b 3)
; Value: 4

We got the transformed 1 and 3, but the intermediate value for 2 was odd, so it was snatched by the gnome and we never saw it.
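Because the transducer knows nothing about channels, the very same value can be applied to ordinary collections with core functions only. A short sketch; xf, as-vector, first-four and total are names chosen for this example:

```clojure
(def xf (comp (map inc) (remove odd?)))

;; Eagerly, vector to vector:
(def as-vector (into [] xf (take 10 (cycle [1 2 3 4 5]))))

;; Lazily, over the infinite belt from the pictures:
(def first-four (take 4 (sequence xf (cycle [1 2 3 4 5]))))

;; As a reduction, without building an intermediate collection:
(def total (transduce xf + 0 [1 2 3 4 5]))

as-vector  ;=> [2 4 6 2 4 6]
first-four ;=> (2 4 6 2)
total      ;=> 12
```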

There’s even more fun to be had! Let’s try (partition-all 3):

(close! b)
(def b (transformed-belt (partition-all 3)))
(>!! b 1)

Nothing…

(>!! b 2)

Still nothing…

(>!! b 3)
; Value: [1 2 3]

Blammo! Our gnome is now packaging together incoming items into bundles of three, caching them in the interim while the bundle is not complete yet. But if we close the input prematurely, it will acknowledge and produce the incomplete bundle:

(>!! b 4)
(>!! b 5)
(close! b)
; Value: [4 5]
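The same flush-on-completion behaviour can be seen without channels: when into runs out of input, it lets the transducer emit whatever it is still holding. A minimal sketch, with batches and inc-then-batch being names made up here:

```clojure
;; Five items in bundles of three: the last bundle is incomplete but still emitted.
(def batches (into [] (partition-all 3) [1 2 3 4 5]))

;; partition-all composes like any other transducer.
(def inc-then-batch (into [] (comp (map inc) (partition-all 2)) (range 5)))

batches        ;=> [[1 2 3] [4 5]]
inc-then-batch ;=> [[1 2] [3 4] [5]]
```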

In fact, partition-all is what prompted me to write this post. That code at work I mentioned actually included a transducer composition that had a (net.cgrand.xforms/into []) in it. That transducer (from Christophe Grand’s xforms library) accumulates data until there’s nothing more to accumulate, and then emits all of it as one large vector. By replacing it with partition-all, I altered the downstream processing to handle multiple smaller batches rather than one huge batch, improving the system’s latency.

A small change for a huge win. Clojure continues to amaze me.

Plus, it’s fun to make JS-less animations in SVG. :)

Testing a compiler that can’t even print stuff out

2021-09-25T00:00:00Z

I’m enjoying a week-long vacation. In addition to other vacationy things (a trip to Prague, yay!), I wanted to do some off-work programming Just For Fun™ and revisit one of my dormant pet projects, to see if I can make some progress.

I opted for Lithium, my toy x86 assembler and Lisp compiler that hasn’t seen new development since 2014. But before that, I had blogged about it and even talked about it at EuroClojure one time.

Over the week, I’ve re-read the paper that I’ve been loosely following while developing Lithium. In it, Abdulaziz Ghuloum advocates having a testing infrastructure from day one, so that one can ensure that the compiler continues to work after each small modification. I’d cut corners on it before, but today, I’ve finally added one.

What’s the big deal? And why not earlier?

One of the original goals that I set myself for Lithium is that it have no runtime dependencies. Not even a C library; not even an OS. It produces raw x86 binaries targeting real mode – non-relocatable blobs of raw machine code. I’m running them in DOSBox, because it’s convenient, but the point is it’s not necessary.

(Some day, I’ll write a mission statement to explain why. But that’s a story for another day.)

And because the setup is so minimalistic, the setup suggested by Ghuloum becomes unfeasible. Ghuloum presupposes the existence of a host C compiler and linker; I have no such privilege. By itself, Lithium can barely output stuff to screen. There’s a write-char primitive that emits one character, but nothing more than that. And there’s as yet no library to add things to, because there’s no defn and not much of a global environment.

So what to do? I thought about the invariant in Ghuloum’s design, one that Lithium inherits as well:

Every expression is compiled to machine code that puts its value in the AX register.

If I could somehow obtain the values that the CPU registers have at the end of executing a Lithium-compiled program, then I could compare them to the expected value in a test. But how to grab those registers?

That turned out to be easier than expected. Instead of extending Lithium to support printing decimal or hexadecimal numbers, I just grabbed some pre-existing assembly code to affix to the program as an epilog. (It does depend on DOS’s interrupt 21h, but hey, it doesn’t hurt to have it for debugging/testing only.) Surprise: the snippet failed to compile, because Lithium’s assembler is woefully incomplete! But it was easy enough to extend it until it worked.

So this gave me a way to view the program’s results.

But there’s another problem: these results are printed within DOSBox. In the emulated DOS machine. I needed a way to transfer them back to the host. Can you guess how?

Yes, you’re right: the simplest thing (DOS redirection to a file, as in PROG.COM >REG.TXT) works. And you’ll laugh at me that it hasn’t occurred to me until now, when I’m writing up the commit that’s already out in the wild. Another proof that it pays to write documentation.

My original idea was… SCREEN CAPTURE!

I’ve scavenged Google for a DOS screen grabber that can produce text files and is not a TSR, found one, bundled it with Lithium, and wrote some duct-tape code that invokes the compiled program and the screen grabber in turn and then parses the output. With that, I can finally have tests that check whether (+ 3 4) is really 7.

And now let me go refactor it…

I made a website to guess tomorrow’s number of COVID-19 cases, and here’s what happened

2020-11-08T00:00:00Z

Before

It seems so obvious in hindsight. Here in Poland, people have been guessing it ever since the outbreak of the pandemic: in private conversations, in random threads on social media, in comments under governmental information outlets. It seemed a matter of time before someone came up with something like this. In fact, on one Sunday evening in October, I found myself flabbergasted that apparently no one had yet.

I doled out $4 for a domain, koronalotek.pl (can be translated as “coronalotto” or “coronalottery” – occurrences of the name on Twitter date back at least as far as April), and fired up a REPL. A few hours and 250 Clojure LOCs later, the site was up.

I wanted it to be as simple as possible. A form with two fields: “your name” and “how many cases tomorrow?” A top-ten list of today’s winners, sorted by the absolute difference between the guess and the actual number of cases, as reported daily on Twitter by the Polish Ministry of Health. The official number, prominently displayed. And that’s all.
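The ranking rule is simple enough to fit in a few lines. This is a sketch of the idea, not the site's actual code: winners and the :name/:guess keys are assumptions made for the example, and ties are broken by submission order because sort-by is stable.

```clojure
(defn winners
  "Names of the ten guessers closest to `actual`, best first."
  [guesses actual]
  (->> guesses
       (sort-by #(Math/abs (long (- (:guess %) actual))))
       (take 10)
       (mapv :name)))

(winners [{:name "a" :guess 20000}
          {:name "b" :guess 24000}
          {:name "c" :guess 30000}]
         24785)
;=> ["b" "a" "c"]
```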

On 17 October, I posted the link on my Facebook and Twitter feeds, and waited. The stream of guesses started to trickle in.

After

It never grew to be more than a stream, but it hasn’t gone completely unnoticed either.

The above plot shows daily number of accepted guesses (i.e., those that were used to generate the next day’s winners) over time – a metric of popularity. Each day’s number means guesses cast in the 24 hours up until 10:30 (Warsaw time) on that day, which is when the official numbers are published by the Ministry of Health.

I’ve been filtering out automated submissions, as well as excess manual submissions by the same IP that seemed to skew the results too much – I’ve arbitrarily set the “excess” threshold at 10. The missing datapoint for 19 October is not a zero, but an N/A: I’ve lost that datapoint due to a glitch. More on this below.

The interest peaked on October 23, with more than a thousand guesses for that day (I think it was reposted by someone with a significant outreach back then), and has been slowly declining since.

I have privately received some feedback. One person has pointed out that they found the site distasteful and that making fun of pandemic tragedies made them uncomfortable. (I empathise; for me it’s not so much making fun as it is a coping mechanism—a way to put distance between my thoughts and the difficult times we’re in and to keep fears at bay.) Some people, however, have thanked me for making them smile when they guessed more or less correctly.

Back to data. Being a data junkie, I looked at what I had been collecting. First things first: how accurate is the collective predictive power of the guessers?

Quite accurate, in fact! Data for this plot has only been slightly preprocessed, by filtering out “unreasonable” guesses that don’t fall within the range [100; 50000].

People have over- and underguesstimated the number of new cases, but not by much. There were only a few occasions where the actual case count didn’t fall within one standard deviation of the mean of guesses (represented by the whiskers around blue bars on the plot). Granted, the daily standard deviation tends to be large (on the order of a few thousand), but still, I’m impressed. A paper on estimating the growth of the pandemic based on coronalottery results coming soon to a journal near you! ;-)
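The "within one standard deviation" check can be sketched as follows. The function names are invented for this example, and the use of the population standard deviation is an assumption, since the post doesn't say which variant the plot uses. Only the [100; 50000] filter comes from the text.

```clojure
(defn reasonable?
  "The preprocessing filter: keep guesses within [100; 50000]."
  [guess]
  (<= 100 guess 50000))

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn std-dev
  "Population standard deviation (an assumption, see above)."
  [xs]
  (let [m (mean xs)]
    (Math/sqrt (mean (map #(let [d (- % m)] (* d d)) xs)))))

(defn within-one-sd?
  "Does `actual` fall within one standard deviation of the mean reasonable guess?"
  [guesses actual]
  (let [xs (filter reasonable? guesses)]
    (<= (Math/abs (double (- actual (mean xs))))
        (std-dev xs))))

;; Made-up guesses; 10 and 1000000 are dropped as unreasonable.
(def sample-guesses [10 20000 22000 24000 1000000])

(within-one-sd? sample-guesses 23000) ;=> true
(within-one-sd? sample-guesses 24785) ;=> false
```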

Just for the heck of it, I’ve also been looking at individual votes. Specifically, names. Here’s a snapshot of unique guessers’ names sorted by decreasing length, on 23 October. (NSFW warning: expletives ahead!)

Let me translate a few of these for those of you who don’t speak Polish:

1 is “Sasin has fucked over 70 million zlotys for elections that didn’t take place and was never held responsible.” This alludes to the ghost election in Poland from May. This news had gone memetic, going so far as Minister Sasin’s name being ironically used as a dimensionless unit of 70 million (think Avogadro’s number). You’ll discover the same theme in #2, #3, #5, and others.

6 is “CT {Constitutional Tribunal}, you focking botch, stop repressing my abortion”. Just a day before, the Polish constitutional court (whose current legality is disputed at best) has decreed a ban on almost all legal abortion in Poland, giving rise to the biggest street protests in decades.

Not all is political: 4 is “Why study for the exam if we’re not gonna survive until November anyway?”. I hope whoever wrote this is alive and well.

Corollary? Give people a text field, and they’ll use it to express themselves: politically or otherwise.

In fact, I have taken the liberty of chiming in. Shortly after, I altered the thank-you page (which used to just say “thanks for guessing”) to proudly display one of the emblems of the Women’s Strike, along with a link to a crowdfunding campaign for an NGO that supports women needing abortion.

Inside out

I’m not much of a DevOps person, so I deployed it the quick and dirty way, not caring about scalability or performance. The maxim “make it as simple as possible” permeates the setup.

I just started a REPL within a screen session on the tiny Scaleway C1 server that also hosts this blog and some of my other personal stuff. I launched a Jetty server within it, and set up an nginx proxy. And that’s pretty much it. I liberally tinker with the app’s state in “production,” evaluating all kinds of expressions when I feel like it.

Code changes are deployed by git pulling new developments and doing (require 'koronalotek.core :reload) in the REPL.

Someone tried a SQL injection attack. This is doomed to fail because there’s no SQL involved. In fact, there’s no database at all. The entire state is kept in an in-memory atom and periodically synced out to an EDN file. In addition, state is reset and archived daily at the time of announcing winners. (I’ve added the archiving after forgetting it on one occasion – hence the lack of data for 19 October.)
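An atom-plus-EDN-file setup of this kind needs very little code. The following is a generic sketch, not koronalotek's source: state, add-guess!, save-state! and load-state! are hypothetical names, and the real app's state shape and sync schedule are not shown in the post.

```clojure
(require '[clojure.edn :as edn])

;; The whole "database": one atom holding plain data.
(def state (atom {:guesses []}))

(defn add-guess! [guesser guess]
  (swap! state update :guesses conj {:name guesser :guess guess}))

(defn save-state!
  "Write the current state to `path` as EDN."
  [path]
  (spit path (pr-str @state)))

(defn load-state!
  "Replace the current state with whatever was saved at `path`."
  [path]
  (reset! state (edn/read-string (slurp path))))
```

A name like "Robert'); DROP TABLE guesses;--" is just another string in that vector. It is written out and read back as data and never interpreted.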

I also don’t yet have a mechanism of automatically pulling in the Ministry of Health’s data. Every morning, I spend two minutes checking if there are excess automatic votes, removing them if any, and then filling in the blanks:

(new-data! #inst "2020-11-08T10:30+01:00" 24785)

For all the violations of good practices in this setup, it has worked out surprisingly well so far. I’ve resorted to removing automated votes a handful of times, and blacklisting IPs of voting bots in the nginx setup twice, but otherwise it’s been a low-maintenance toy. People seem to be willing to have fun, and I’m just not interfering.

Takeaways

You should call on your country’s authorities to exert pressure on the Polish government to respect women’s choices and stop actively repressing them.
Give people a text field, and they’ll use it to express themselves.
Release early, release often.

Making of “Clojure as a dependency”

2020-05-08T00:00:00Z

In my previous post, “Clojure as a dependency”, I’ve presented the results of some toy research on Clojure version numbers seen in the wild. I’m a big believer in reproducible research, so I’m making available a Git repo that contains code you can run yourself to reproduce these results. This post is an experience report from writing that code.

There are two main components to this project: acquisition and analysis of data (implemented in the namespaces versions.scrape and versions.analyze, respectively). Let’s look at each of these in turn.

Data acquisition

This step uses the GitHub API v3 to:

retrieve the 1000 most popular Clojure repositories (using the Search repositories endpoint and going through all pages of the paginated result);
for each of these repositories, look at its file list (in the master branch) and pick up any files named project.clj or deps.edn in the root directory (using the Contents endpoint);
parse each of these files and extract the list of dependencies.

As hinted by the namespace, I’ve opted to use Skyscraper to orchestrate the process. It would arguably have been simpler to use GitHub’s GraphQL v4 API, but I wanted to showcase Skyscraper’s custom parsing facilities.

There’s no actual HTML scraping going on (all processors use either JSON or Clojure parsers), but Skyscraper is still able to “restructure” the result – traverse the graph endpoint in a manner similar to that of GraphQL – with very little effort. It would have been possible with any other RESTful API. Plus, we get goodies like caching or tree pruning for free.

Most of the code is straightforward, but parsing of project.clj merits some explanation. Some of my initial assumptions proved incorrect, and it’s fun to see how. I initially tried to use clojure.edn, but Leiningen project definitions are not actually EDN – they are Clojure code, which is a superset of EDN. So I had to resort to read-string from core – with *read-eval* bound to nil (otherwise the code would have a Clojure injection vulnerability – think Bobby Tables). Needless to say, some project.cljs turned out to depend on read-eval.
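Both points can be checked at the REPL. The sketch below uses a hypothetical helper, try-read, which turns any reader exception into the keyword :refused. The strings being read are tiny stand-ins for things found in real project.clj files.

```clojure
(require '[clojure.edn :as edn])

(defn try-read
  "Read one form from `s` with `read-fn`; return :refused if the reader throws."
  [read-fn s]
  (try
    (read-fn s)
    (catch Exception _
      :refused)))

;; A regex literal is valid Clojure, but not valid EDN.
(def edn-regex  (try-read edn/read-string "#\"foo\""))
(def core-regex (binding [*read-eval* false]
                  (try-read read-string "#\"foo\"")))

;; With *read-eval* disabled, #=(...) forms are rejected rather than run...
(def safe   (binding [*read-eval* false]
              (try-read read-string "#=(+ 1 2)")))
;; ...whereas with it enabled, the reader evaluates them. Never do this on untrusted input.
(def unsafe (binding [*read-eval* true]
              (read-string "#=(+ 1 2)")))

edn-regex  ;=> :refused
core-regex ;=> #"foo"
safe       ;=> :refused
unsafe     ;=> 3
```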

Some projects (I’m looking at you, Closh, Babashka and sci) keep the version number outside of project.clj, in a text file (typically in resources/), and slurp it back into project.clj with a read-eval’d expression:

(defproject closh-sci
  #=(clojure.string/trim
     #=(slurp "resources/CLOSH_VERSION"))
  …)

A trick employed by one project, Metabase, is to dynamically generate JVM options containing a port number at parse time, so that test suites running at the same time don’t clash with each other:

#=(eval (format "-Dmb.jetty.port=%d" (+ 3001 (rand-int 500))))

Finally, it turned out that defproject is not always the first form in project.clj. Some projects, like bridge, only contain a placeholder project.clj with no forms; others, like aleph, first define some constants, and then refer to them in a defproject form. If those constants contain parts of the dependencies list, then those dependencies won’t be processed correctly. Fortunately, not a lot of projects do this, so it doesn’t skew the results much.
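One way to cope with both situations is to read every top-level form and pick out the defproject one, if any. This is a sketch of that idea rather than the code in versions.scrape: read-all-forms and defproject-form are names made up here, and the sample project text is invented.

```clojure
(defn read-all-forms
  "Read all top-level forms from the string `s`, with read-eval disabled."
  [s]
  (binding [*read-eval* false]
    (let [rdr (java.io.PushbackReader. (java.io.StringReader. s))
          eof (Object.)]   ; unique sentinel marking end of input
      (loop [forms []]
        (let [form (read rdr false eof)]
          (if (identical? form eof)
            forms
            (recur (conj forms form))))))))

(defn defproject-form
  "The first (defproject ...) form in `s`, or nil if there is none."
  [s]
  (first (filter #(and (seq? %) (= 'defproject (first %)))
                 (read-all-forms s))))

;; Invented example: a constant defined before defproject.
(def sample
  "(def some-version \"1.2.3\")
   (defproject my-lib \"0.0.1\" :dependencies [[other-lib \"1.2.3\"]])")

(second (defproject-form sample)) ;=> my-lib
(defproject-form "")              ;=> nil
```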

Anyway, the end result of the acquisition phase is a sequence of maps describing project definitions. They look like this:

{:name "clojure-koans",
 :full-name "functional-koans/clojure-koans",
 :deps-type :leiningen,
 :page 1,
 :deps {org.clojure/clojure #:mvn{:version "1.10.0"},
        koan-engine #:mvn{:version "0.2.5"}},
 :profile-deps {:dev {lein-koan #:mvn{:version "0.1.5"}}}}

Homogeneity is important: every dependency description has been converted to the cli-tools format, even if it comes from a project.clj.
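The conversion itself can be sketched in a couple of functions. These are hypothetical helpers, not the repo's code, and they deliberately drop any extra options (such as :exclusions) that follow the version in a Leiningen dependency vector.

```clojure
(defn lein-dep->cli
  "[lib \"version\" & opts] => [lib {:mvn/version \"version\"}]"
  [[lib version & _opts]]
  [lib {:mvn/version version}])

(defn lein-deps->cli
  "Convert a Leiningen :dependencies vector to a cli-tools style map."
  [deps]
  (into {} (map lein-dep->cli) deps))

(lein-deps->cli '[[org.clojure/clojure "1.10.0"]
                  [koan-engine "0.2.5"]])
;=> {org.clojure/clojure {:mvn/version "1.10.0"}, koan-engine {:mvn/version "0.2.5"}}
```

The printed form #:mvn{:version "1.10.0"} seen above is just the reader shorthand for {:mvn/version "1.10.0"}.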

Data analysis

I’ve long been searching for a way to do exploratory programming in Clojure without turning the code to a tangled mess, portable only along with my computer.

Exploratory (or research) programming is very different from “normal” programming. In the latter, most of the time you typically focus on a coherent project – a program or a library. In contrast, in the former, you spend a lot of time in the REPL, trying all sorts of different things and defing new values derived from already computed ones.

This is very convenient, but it’s extremely easy to get carried away in the REPL and get lost in a sea of defs. If you want to redo your computations from scratch, just about your only option is to take your REPL transcript and re-evaluate the expressions one by one, in the correct order. Cleaning up the code (e.g. deglobalizing) as you go is very difficult.

I’ve found an answer: Plumatic Graph, part of the plumbing library. There are a plethora of uses for it: for example, at Fy, my current workplace, we’re using it to define our test fixtures. But as it turns out, it makes exploratory programming enjoyable.

The bulk of code in versions.analyze consists of a big definition of a graph, with nodes representing computations – things that I’d normally have def’d in a REPL. Consequently, most of these definitions are short and to the point. I also gave the nodes verbose, descriptive, explicit names. Name and conquer. raw-repos is the output from data acquisition, repos is an all-important node containing those raw-repos that were successfully parsed, and most other things depend on it.

It also doesn’t get much in the way of the normal REPL research flow. My normal workflow with REPL and Graph is something along the lines of:

1. (def result (main))
2. evaluate something using inputs from result
3. nah, it leads nowhere
4. evaluate something else
5. hey, that’s interesting!
6. add a new node to the graph definition
7. GOTO 1

Thanks to Graph’s lazy compiler, I can re-evaluate anything at need and have it evaluate only the things needed, and nothing else. Also, because the graph is explicit, it’s fairly easy to visualize it.

Because it’s lazy, it doesn’t hurt to put extra things in there just in case, even when you’re not going to report them. For example, I was curious what things besides a version number people put in dependencies. :exclusions, for sure, but what else? This is the :what-other-things-besides-versions node.

Imagine my surprise when I found :exlusions (sic) in there, which turned out to be a typo in shadow-cljs’ project.clj! I submitted a PR, and Thomas Heller merged it a few days after.
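To make the laziness concrete, here is a minimal sketch in plain Clojure that imitates the idea with delays. It is not Graph’s API or its implementation: lazy-results, the node names other than repos, and the toy data are all made up for illustration.

```clojure
;; Hypothetical stand-in for a lazily compiled graph: each node is a delay,
;; so it is computed at most once, and only when something asks for it.
(def computed (atom []))   ; records which nodes actually ran

(defn lazy-results [raw-repos]
  (let [repos     (delay (swap! computed conj :repos)
                         (filter :parsed? raw-repos))
        n-repos   (delay (swap! computed conj :n-repos)
                         (count @repos))
        expensive (delay (swap! computed conj :expensive-extra)
                         (reduce + (map :stars @repos)))]
    {:repos repos, :n-repos n-repos, :expensive-extra expensive}))

(def result
  (lazy-results [{:name "a" :parsed? true  :stars 10}
                 {:name "b" :parsed? false :stars 5}
                 {:name "c" :parsed? true  :stars 7}]))

;; Forces :n-repos and its dependency :repos; :expensive-extra never runs.
@(:n-repos result)
```

Asking for one node only pays for that node and whatever it depends on, which is why leaving “just in case” nodes in the definition costs nothing until they are actually needed.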

My only gripe with Graph is that it runs somewhat contrary to the current trends in the Clojure community: for example, it doesn’t support namespaced keywords (although there’s an open ticket for that). But on the whole, I’m sold. I’ll definitely be using it in the next piece of research in Clojure, and I’m on the lookout for something similar in pure R. If you know something, do tell me!

Some words on plotting

The plot from the previous post was generated in pure R, using ggplot2 (an extremely versatile API). Clojure generates a CSV with munged data, and then R reads that CSV as a data frame and generates the plot in a few lines.

I’ve briefly played around with clojisr, a bridge between Clojure and R. It was an enlightening experiment, and it would let me avoid the intermediate CSV, but I decided to ditch it for a few reasons:

It pulls in quite a few dependencies (I wanted to keep them down to a minimum), and requires some previous setup on the R side.
I’d much rather write my R as R, since I’m comfortable with it, rather than spend time wondering how it maps to Clojure. This is similar to the SQL story: these days I prefer HugSQL over Korma, unless I have good reasons to choose otherwise.
clojisr opens up a child R process just by requireing a namespace. I’m not a fan of that.

But it’s definitely very promising! I applaud the effort and I’ll keep a close eye on it.

Key takeaways

Skyscraper makes data acquisition bearable, if not fun.
Plumatic Graph makes writing research code in Clojure fun.
ggplot makes plotting data fun.
Clojure makes programming fun. (But you knew that already.)

Clojure as a dependency

2020-05-02T00:00:00Z

I have a shameful confession to make: I have long neglected an open-source library that I maintain, clj-tagsoup.

This would have been less of an issue, but this is my second-most-starred project on GitHub. Granted, I don’t feel a need for it anymore, but apparently people do. I wish I had spent some time reviewing and merging the incoming PRs.

Anyway, I’ve recently been prompted to revive it, and I’m preparing a new release. While on it, I’ve been updating dependencies to their latest versions, and upon seeing a dependency on [org.clojure/clojure "1.2.0"] in project.clj (yes, it’s been neglected for that long), I started wondering: which Clojure to depend on? Actually, should Clojure itself be a dependency at all?

I’ve googled around for best practices, but with no conclusive answer. So I set out to do some research.

TLDR: with Leiningen, add it with :scope "provided"; with cli-tools, you don’t have to, unless you want to be explicit.

Is it possible for a Clojure project to declare no dependency on Clojure at all?

Quite possible, as it turns out. But the details depend on the build tool.

Obviously, this only makes sense for libraries. Or, more broadly, for projects that are not meant to be used standalone, but rather included in other projects (which will have a Clojure dependency of their own).

Leiningen

If you try to create a Leiningen project that has no dependencies:

(defproject foo "0.1.0"
  :dependencies [])

then Leiningen (as of version 2.9.3, but I’d guess older versions behave similarly) won’t allow you to launch a REPL:

$ lein repl
Error: Could not find or load main class clojure.main
Caused by: java.lang.ClassNotFoundException: clojure.main
Subprocess failed (exit code: 1)

But all is not lost: lein jar works just fine (as long as you don’t AOT-compile any namespaces), as does lein install. The resulting library will happily function as a dependency of other projects.

The upside of depending on no particular Clojure version is that you don’t impose it on your consumers. If a library depends on Clojure 1.9.0, but a project that uses it depends on Clojure 1.10.1, then Leiningen will fetch 1.9.0’s pom.xml (it’s smart enough to figure out that the jar itself won’t be needed, as the conflict will always be resolved in favour of the direct dependency), and lein deps :tree will report “possibly confusing dependencies”.

It’s not very useful to have a library that you can’t launch a REPL against, though. So what some people do is declare a dependency on Clojure not in the main :dependencies, but in a profile.

(defproject foo "0.1.0"
  :dependencies []
  :profiles {:dev {:dependencies [[org.clojure/clojure "1.10.1"]]}})

This avoids conflicts and brings back the possibility to launch a REPL. Sometimes, people create multiple profiles for different Clojure versions; Leiningen’s documentation mentions this possibility.

Unfortunately, with this approach it’s still not possible to AOT-compile things or create uberjars with Leiningen. (Putting Clojure in the :provided profile causes building the uberjar to succeed, but the resulting -standalone jar doesn’t actually contain Clojure).

Another option is to add Clojure to the main :dependencies, but with :scope "provided". Per the Maven documentation, this means:

This is much like compile, but indicates you expect the JDK or a container to provide the dependency at runtime. For example, when building a web application for the Java Enterprise Edition, you would set the dependency on the Servlet API and related Java EE APIs to scope provided because the web container provides those classes. This scope is only available on the compilation and test classpath, and is not transitive.

The key is in the last words: “not transitive.” If project A depends on a library B that declares a “provided” dependency C, then C won’t be automatically put in A’s dependencies, and A is expected to explicitly declare its own C.

This means that it’s adequate for both libraries and standalone projects when it comes to declaring a Clojure dependency. It doesn’t break anything, doesn’t cause any ephemeral conflicts, and can be combined with the profiles approach when multiple configurations are called for.
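For reference, a project.clj using this approach might look like the sketch below. The project name and version numbers are placeholders, not a recommendation of a particular Clojure version.

```clojure
;; project.clj sketch: Clojure is on the compile and test classpath
;; (so the REPL works), but consumers of the library don't inherit it.
(defproject foo "0.1.0"
  :dependencies [[org.clojure/clojure "1.10.1" :scope "provided"]])
```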

cli-tools

cli-tools will accept a deps.edn as simple as {}. Even passing -Srepro to clojure or clj (which excludes the Clojure dependency that you probably have in your ~/.clojure/deps.edn) doesn’t break anything: cli-tools will just use 1.10.1 (at least as of version 1.10.1.536).

With cli-tools, as a library author you probably don’t have to declare a Clojure dependency at all. But things are less uniform in this land than they are in Leiningen (for example, there are quite a few uberjarrers to choose from), so it’s reasonable to check with your tooling first.
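If you do want to be explicit, a deps.edn along these lines should do. This is a sketch, and the version shown is only an example.

```clojure
;; deps.edn sketch for a library that declares its Clojure dependency explicitly
{:deps {org.clojure/clojure {:mvn/version "1.10.1"}}}
```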

Boot

I’m no longer a Boot user, so I can’t tell. But from what I know, it uses Aether just like Leiningen and Maven do, so I’d wager a guess the same caveats apply as for Leiningen. Haven’t checked, though.

So what do the existing projects do?

I figured it would be a fun piece of research to examine how the popular projects depend (or don’t depend) on Clojure. I queried GitHub’s API for the 1000 most starred Clojure projects, fetched and parsed their project.cljs and/or deps.edns, and tallied things up.

I’ll write a separate “making of” post, because it turned out to be an even more fun weekend project than I had anticipated. But for now, let me share the conclusions.

I ended up with 968 project definition files that I was able to successfully parse: 140 deps.edns and 828 project.cljs. Here’s a breakdown of Clojure version declared as a “main” dependency (i.e., not in a profile or alias):

N/A means that there’s no dependency on Clojure declared, and “other” is an umbrella for the zoo of alphas, betas and snapshots.

As expected, not depending on Clojure is comparatively more popular in the cli-tools land: almost half (48.6%) of cli-tools projects don’t declare a Clojure dependency, versus 21.5% (174 projects) for Leiningen.

That Leiningen number still seemed quite high to me, so I dug a little deeper. Out of those 174 projects, 100 have Clojure somewhere in their :profiles. The remaining 74 are somewhat of outliers:

some, like Ring or Pedestal, are umbrella projects composed of sub-projects (with the lein-sub plugin) that have actual dependencies themselves;
some, like Klipse or Reagent, are essentially ClojureScript-only;
some, like Overtone, use the lein-tools-deps plugin to store their dependencies in deps.edn while using Leiningen for other tasks.

Finally, the popularity of :scope "provided" is much lower. Only 68 Leiningen projects specify it (8.9% of those that declare any dependencies), and only two deps.edn files do so (re-frame and fulcro – note that re-frame actually has both a project.clj and a deps.edn).

Indenting cond forms

2020-02-10T00:00:00Z

Indentation matters when reading Clojure code. It is the primary visual cue that helps the reader discern the code structure. Most Clojure code seen in the wild conforms to either the community style guide or the proposed simplified rules; the existing editors make it easy to reformat code to match them.

I find both these rulesets to be helpful when reading code. But there’s one corner-case that’s been irking me: cond forms.

cond takes an even number of arguments: alternating test-expression pairs. They are commonly put next to each other, two forms per line.

(cond
  test expr-1
  another-test expr-2
  :else expr-3)

Sometimes, people align the expressions under one another, in a tabular fashion:

(cond
  test         expr-1
  another-test expr-2
  :else        expr-3)

But things get out of hand when either tests or exprs get longer and call for multiple lines themselves. There are several options here, all of them less than ideal.

Tests and expressions next to each other

In other words, keep the above rule. Because we’ll have multiple lines in a form, this tends to make the resulting code axe-shaped:

(cond
  (= (some-function something) expected-value) (do
                                                 (do-this)
                                                 (and-also-do-that))
  (another-predicate something-else) (try
                                       (do-another-thing)
                                       (catch Exception _
                                         (println "Whoops!"))))

This yields code that is indented abnormally far to the right, forcing the reader’s eyeballs to move in two dimensions – even more so if the tabular feel is desired. If both the test and the expression span multiple lines, it just looks plain weird.

Stack all forms vertically, no extra spacing

(cond
  (= (some-function something) expected-value)
  (do
    (do-this)
    (and-also-do-that))
  (another-predicate something-else)
  (try
    (do-another-thing)
    (catch Exception _
      (println "Whoops!"))))

This gets rid of the long lines, but introduces another problem: it’s hard to tell at a glance

where a given test or expression starts or ends;
which tests are paired with which expression;
whether a given line corresponds to a test or an expression, and which one.

Stack all forms vertically, blank lines between test/expr pairs

(cond
  (= (some-function something) expected-value)
  (do
    (do-this)
    (and-also-do-that))

  (another-predicate something-else)
  (try
    (do-another-thing)
    (catch Exception _
      (println "Whoops!"))))

The Style Guide says that this is an “ok-ish” thing to do.

But with the added blank lines, the logical structure of the code is much more apparent. However, it breaks another assumption that I make when reading the code: functions contain no blank lines. The Style Guide even mentions it, saying that cond forms are an acceptable exception.

It is now harder to tell at a glance where the enclosing function starts or ends. And once this assumption has been broken, the brain expects it to be broken again, disrupting reading across the entire file.

Forms one under another, extra indentation for expressions only

(cond
  (= (some-function something) expected-value)
    (do
      (do-this)
      (and-also-do-that))
  (another-predicate something-else)
    (try
      (do-another-thing)
      (catch Exception _
        (println "Whoops!"))))

I resorted to this several times. The lines are not too long; the visual cues are there; it’s obvious what is the test, what is the expression, and what goes with what.

Except… it’s against the rules. List items stacked vertically should be aligned one under the other. I have to actively fight my Emacs to enforce this formatting, and it will be lost next time I press C-M-q on this form. No good.

Forms one under another, expressions prefixed by `#_=>`

(cond
  (= (some-function something) expected-value)
  #_=> (do
         (do-this)
         (and-also-do-that))
  (another-predicate something-else)
  #_=> (try
         (do-another-thing)
         (catch Exception _
           (println "Whoops!"))))

This one is my own invention: I haven’t seen it anywhere else. But I think it manages to avoid most problems.

#_ is a reader macro that causes the next form to be elided and not seen by the compiler. => is a valid form. Thus, #_=> is effectively whitespace as far as the compiler is concerned, and the indentation rules treat it as yet another symbol (although it technically isn’t one). No tooling is broken, no assumptions are broken, and the #_=> tends to be syntax-highlighted unintrusively so it doesn’t stand out. I tend to read it aloud as “then.”
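A quick way to convince yourself is to ask the reader directly. In the sketch below, sign is a made-up function used only for illustration.

```clojure
;; The reader drops `#_` together with the form that follows it (here, `=>`),
;; so both spellings read as the same data structure.
(def plain
  (read-string "(cond (neg? n) :negative :else :non-negative)"))
(def annotated
  (read-string "(cond (neg? n) #_=> :negative :else #_=> :non-negative)"))

(= plain annotated)

;; And the annotated style compiles and behaves like an ordinary cond.
(defn sign [n]
  (cond
    (neg? n)
    #_=> :negative
    (zero? n)
    #_=> :zero
    :else
    #_=> :positive))

(sign -3)
```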

Meanwhile, in another galaxy

Other Lisps (Scheme and CL) wrap each test/expression pair in an extra pair of parens, thereby avoiding the blending of conditions and expressions when indented one under the other. But I’m still happy Clojure went with fewer parens. As I say, this is a corner case where an additional pair of parens would somewhat help, but most of the time I find them less aesthetic and a source of visual clutter.

Careful with that middleware, Eugene

2020-01-21T00:00:00Z

Prologue

I’ll be releasing version 0.3 of Skyscraper, my Clojure framework for scraping entire sites, in a few days.

More than three years have passed since its last release. During that time, I’ve made a number of attempts at redesigning it to be more robust, more usable, and faster; the last one, resulting in an almost complete rewrite, is now almost ready for public use as I’m ironing out the rough edges, documenting it, and adding tests.

It’s been a long journey and I’ll blog about it someday; but today, I’d like to tell another story: one of a nasty bug I had encountered.

Part One: Wrap, wrap, wrap, wrap

While updating the code of one of my old scrapers to use the API of Skyscraper 0.3, I noticed an odd thing: some of the output records contained scrambled text. Apparently, the character encoding was not recognised properly.

“Weird,” I thought. Skyscraper should be extra careful about honoring the encoding of pages being scraped (declared either in the headers, or in the <meta> tag). In fact, I remembered having seen it working. What was wrong?

For every page that it downloads, Skyscraper 0.3 caches the HTTP response body along with the headers so that it doesn’t have to be downloaded again; the headers are needed to ensure proper encoding when parsing a cached page. The headers are lower-cased, so that Skyscraper can then call (get all-headers "content-type") to get the encoding declared in headers. If this step is missed, and the server returns the encoding in a header named Content-Type, it won’t be matched. Kaboom!

I looked at the cache, and sure enough, the header names in the cache were not lower-cased, even though they should be. But why?

Maybe I was mistaken, and I had forgotten the lower-casing after all? A glance at the code: no. The lower-casing was there, right around the call to the download function.

Digression: Skyscraper uses clj-http to download pages. clj-http, in turn, uses the middleware pattern: there’s a “bare” request function, and then there are wrapper functions that implement things like redirects, OAuth, exception handling, and what have you. I say “wrapper” because they literally wrap the bare function: (wrap-something request) returns another function that acts just like request, but with added functionality. And that other function can in turn be wrapped with yet another one, and so on.

There’s a default set of middleware wrappers defined by clj-http, and it also provides a macro, with-additional-middleware, which allows you to specify additional wrappers. One such wrapper is wrap-lower-case-headers, which, as the name suggests, causes the response’s header keys to be returned in lower case.
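The middleware pattern itself needs nothing beyond plain functions. Here is a minimal sketch that doesn’t use clj-http at all: bare-request, wrap-lower-case-header-names and wrap-default-url are hypothetical names invented for this illustration, and the response is canned.

```clojure
(require '[clojure.string :as str])

;; Hypothetical "bare" request function: returns a canned response.
(defn bare-request [req]
  {:status  200
   :headers {"Content-Type" "text/html; charset=ISO-8859-2"}
   :body    (str "response for " (:url req))})

;; A wrapper takes a request function and returns a new one with extra behaviour.
;; This one post-processes the response...
(defn wrap-lower-case-header-names [client]
  (fn [req]
    (update (client req) :headers
            (fn [headers]
              (into {} (map (fn [[k v]] [(str/lower-case k) v])) headers)))))

;; ...and this one pre-processes the request.
(defn wrap-default-url [client]
  (fn [req]
    (client (merge {:url "http://example.com/"} req))))

;; Wrappers compose: each one wraps the result of the previous one.
(def request
  (reduce (fn [client wrapper] (wrapper client))
          bare-request
          [wrap-lower-case-header-names wrap-default-url]))

(request {})
```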

Back to Skyscraper. We’re ready to look at the code now. Can you spot the problem?

(let [request-fn (or (:request-fn options)
                     http/request)]
  (http/with-additional-middleware [http/wrap-lower-case-headers]
    (request-fn req
                success-fn
                error-fn)))

I stared at it for several minutes, did some dirty experiments in the REPL, perused the code of clj-http, until it dawned on me.

See that request-fn? Even though Skyscraper uses http/request by default, you can override it in the options to supply your own way of doing HTTP. (Some of the tests use it to mock calls to an HTTP server.) In this particular case, it was not overridden, though: the usual http/request was used. So things looked good: within the body of http/with-additional-middleware, headers should be lower-cased because request-fn is http/request.

Or is it?

Let me show you how with-additional-middleware is implemented. It expands to another macro, with-middleware, which is defined as follows (docstring redacted):

(defmacro with-middleware
  [middleware & body]
  `(let [m# ~middleware]
     (binding [*current-middleware* m#
               clj-http.client/request (reduce #(%2 %1)
                                               clj-http.core/request
                                               m#)]
       ~@body)))

That’s right: with-middleware works by dynamically rebinding http/request. Which means the request-fn I was calling is not actually the wrapped version, but the one captured by the outer let, the one that wasn’t rebound, the one without the additional middleware!

After this light-bulb moment, I moved with-additional-middleware outside of the let:

(http/with-additional-middleware [http/wrap-lower-case-headers]
  (let [request-fn (or (:request-fn options)
                       http/request)]
    (request-fn req
                success-fn
                error-fn)))

And, sure enough, it worked.
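The capture problem can be reproduced without clj-http. In this sketch, *request* and wrap-mark are hypothetical stand-ins for the rebound request function and a middleware wrapper; only the ordering of let and binding differs between the two cases.

```clojure
;; Hypothetical stand-ins: a dynamic var holding the request function,
;; and a wrapper that marks responses as having passed through it.
(def ^:dynamic *request* (fn [req] {:wrapped? false}))

(defn wrap-mark [client]
  (fn [req] (assoc (client req) :wrapped? true)))

;; Broken ordering: `f` captures the root value before the rebinding happens,
;; so the call bypasses the wrapper.
(def captured-outside
  (let [f *request*]
    (binding [*request* (wrap-mark *request*)]
      (f {}))))

;; Fixed ordering: the var is dereferenced inside the binding,
;; so `f` is the wrapped function.
(def captured-inside
  (binding [*request* (wrap-mark *request*)]
    (let [f *request*]
      (f {}))))

[captured-outside captured-inside]
```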

Part Two: The tests are screaming loud

Is it the end of the story? I’m guessing you’re thinking it is. I thought so too. But I wanted to add one last thing: a regression test, so I’d never run into the same problem in the future.

I whipped up a test in which one ISO-8859-2-encoded page was scraped, and a check for the correct string was made. I ran it against the fixed code. It was green. I ran it against the previous, broken version…

It was green, too.

At this point, I knew I had to get to the bottom of this.

Back to experimenting. After a while, I found out that extracting encoding from a freshly-downloaded page actually worked fine! It only failed when parsing headers fetched from a cache. But the map was the same in both cases! In both cases, the code was effectively doing

(get {"Content-Type" "text/html; charset=ISO-8859-2"}
     "content-type")

This lookup shouldn’t succeed: in map lookup, string comparison is case-sensitive. And yet, for freshly-downloaded headers, it did succeed!

I checked the type of both maps. One of them was a clojure.lang.PersistentHashMap, as expected. The other one was not. It was actually a clj_http.headers.HeaderMap.

I’ll let the comment of that one speak for itself:

a map implementation that stores both the original (or canonical) key and value for each key/value pair, but performs lookups and other operations using the normalized – this allows a value to be looked up by many similar keys, and not just the exact precise key it was originally stored with.

And so it turned out that the library authors had actually foreseen the need for looking up headers irrespective of case, and provided a helpful means for that. The whole lowercasing business was not needed, after all!
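The difference between the two lookups can be sketched in plain Clojure. Note that get-header below is a hypothetical helper written for this illustration; it is not how HeaderMap is implemented, it merely mimics the observable behaviour.

```clojure
(require '[clojure.string :as str])

(def cached-headers {"Content-Type" "text/html; charset=ISO-8859-2"})

;; Ordinary maps compare string keys exactly, so this finds nothing.
(get cached-headers "content-type")

;; Hypothetical helper: compare keys after lower-casing both sides.
(defn get-header [headers header-name]
  (some (fn [[k v]]
          (when (= (str/lower-case k) (str/lower-case header-name))
            v))
        headers))

(get-header cached-headers "content-type")
```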

I stripped out the with-additional-middleware altogether, added some code elsewhere to ensure that the header map is a HeaderMap regardless of whether it comes from the cache or not, and they lived happily ever after.

Epilogue

Moral of the story? It’s twofold.

Dynamic rebinding can be dangerous. Having a public API that is implemented in terms of dynamic rebinding, even more so. I’d prefer if clj-http just allowed the custom middleware to be explicitly specified as an argument, thusly:

(http/request req
              :additional-middleware [http/wrap-lower-case-headers])

Know your dependencies. If you have a problem that might be generically addressed by the library you’re using, look deeper. It might be there already.

Thanks to 3Jane for proofreading this article.

Word Champions

2020-01-03T00:00:00Z

This story begins on August 9, 2017, when a friend messaged me on Facebook: “Hey, I’m going to be on a TV talent show this weekend. They’ll be giving me this kind of problems. Any ideas how to prepare?”

He attached a link to this video:

Now, we’re both avid Scrabble players, so we explored some ideas about extracting helpful data out of the Official Polish Scrabble Player’s Dictionary. I launched a Clojure REPL and wrote some throwaway code to generate sample training problems for Krzysztof. The code used a brute-force algorithm, so it was dog slow, but it was a start. It was Wednesday.

I woke up next morning with the problem still in my head. Clearly, I had found myself in a nerd sniping situation.

There was only one obvious way out—to write a full-blown training app so that Krzysztof could practice as if he were in the studio. The clock was ticking: we had two days left.

After work, I started a fresh re-frame project. (I was a recent re-frame convert in those days, so I wanted to see how well it could cope with the task at hand.) Late that night, or rather early next morning, the prototype was ready.

It had very messy code. It only worked on Chrome. It failed miserably on mobile. It took ages to load. It had native JS dependencies, notably Material-UI and react-dnd, and for some reason it would not compile with ClojureScript’s advanced optimization turned on; so it weighed in at more than 6 MB, slurping in more than 300 JS files on load.

But it worked.

Krzysztof didn’t win his episode against the other contestants, ending up third, but he completed his challenge successfully. It took him 3 minutes and 42 seconds, out of 5 minutes allotted. The episode aired on 24 October.

Krzysztof said that the problem he ended up solving on the show was way easier than the ones generated by the app: had they been more difficult, the wow factor might have been higher.

Several months later, we met at a Scrabble tournament, and I received a present. I wish I had photographed that bottle of wine, so I could show it here, but I hadn’t.

Meanwhile, the code remained messy and low-priority. But I kept returning to it when I felt like it, fixing up things one at a time. I’ve added difficulty levels, so you can have only one diagram, or three. I’ve made it work on Firefox. I’ve done a major rewrite, restructuring the code in a sane way and removing the JS dependencies other than React. I’ve made advanced compilation work, getting the JS down to 400K. I’ve made it work on mobile devices. I’ve written a puzzle generator in C, which ended up several orders of magnitude faster than the prototype Clojure version (it’s still brute-force, but uses some dirty C tricks to speed things up; I hope to rewrite it in Rust someday).

And now, 2½ years later, I’ve added an English version, with an accompanying set of puzzles (generated from a wordlist taken from this repo), for the English-speaking world to enjoy.

Play Word Champions now!

The code is on GitHub if you’d like to check it out or try hacking on it. It’s small, less than 1KLOC in total, so I think it can be a learning tool for re-frame or ClojureScript.

(This game as featured on the TV shows is called Gridlock. The name “Word Champions” was inspired by the title of Krzysztof’s video on YouTube, literally meaning “Lord of the Words”. There is no pun in the Polish title.)

Re-framing text-mode apps

2019-02-05T00:00:00Z

Intro

“But, you know, many explorers liked to go to places that are unusual. And, it’s only for the fun of it.” – Richard P. Feynman

A couple of nights ago, I hacked together a small Clojure program.

All it does is display a terminal window with a red rectangle in it. You can use your cursor keys to move it around the window, and the space bar to change its colour. It’s fun, but it doesn’t sound very useful, does it?

In this post, I’ll try to convince you that there’s more to this little toy than might at first sight appear. You may want to check out the repo as you go along.

In which an unexpected appearance is made

(I’ve always envied Phil Hagelberg this kind of headline.)

As you might have guessed from this article’s title, clj-tvision (a working name for the program) is a re-frame app.

For those of you who haven’t heard of re-frame, a word of explanation: it’s a ClojureScripty way of writing React apps, with Redux-like management of application state. If you do know re-frame (shameless plug: we at WorksHub do, and use it a lot: it powers the site you’re looking at right now!), you’ll instantly find yourself at home. However, a few moments later, a thought might dawn upon you, and you might start to feel a little uneasy…

Because I’ve mentioned React and ClojureScript, and yet I’d said earlier that we’re talking a text-mode application here. And I’ve mentioned that it’s written in Clojure. It is, in fact, not using React at all, and it has nothing to do whatsoever with ClojureScript, JavaScript, or the browser.

How is that even possible?

Here’s the catch: re-frame is implemented in .cljc files. So while it’s mostly used in the ClojureScript frontend, it can be used from Clojure. You may know this if you’re testing your events or subscriptions on the JVM.

While it’s mostly – if not hitherto exclusively – used for just that, I wanted to explore whether it could be used to manage state in an actual, non-web app. Text-mode is a great playground for this kind of exploration. Rather than picking a GUI toolkit and concerning myself with its intricacies, I chose to just put things on a rectangular sheet of text characters.

(But if you are interested in pursuing a React-ish approach for GUIs, check out what Bodil Stokke’s been doing in vgtk.)

Living without the DOM

The building blocks of a re-frame app are subscriptions, events, and views. While the first two work in Clojureland pretty much the same way they do in the browser (although there are differences, of which more anon), views are a different beast.

re-frame’s documentation says that views are “data in, Hiccup out. Hiccup is ClojureScript data structures which represent DOM.” But outside of the browser realm, there’s no DOM. So let’s rephrase that more generally: re-frame views should produce data structures which declaratively describe the component’s appearance to the user. In web apps, those structures correspond to the DOM. What they will look like outside is up to us. We’ll be growing our own DOM-like model, piecemeal, as needs arise.

For clj-tvision, I’ve opted for a very simple thing. Let’s start with a concrete example. Here’s a view:

(defn view []
  [{:type :rectangle, :x1 10, :y1 5, :x2 20, :y2 10, :color :red}])

Unlike in the DOM, in this model the UI state isn’t a tree. It’s a flat sequence of maps that each represent individual “primitive elements”. We could come up with a fancy buzzword-compliant name and call it Component List Model, or CLiM for short, in homage to the venerable GUI toolkit.

Like normal re-frame views, CLiM views can include subviews. An example follows:

(defn square [left top size color]
  [{:type :rectangle,
    :x1 left,
    :y1 top,
    :x2 (+ left size -1),
    :y2 (+ top size -1),
    :color color}])

(defn view []
  [[square 1 1 5 :red]
   [square 9 9 5 :blue]])

How to render a view? Simple. First, flatten the list, performing funcalls on subviews so that you get a sequence containing only primitives. Then, draw each of them in order. (If there is an overlap, the trailers will obscure the leaders. Almost biblical.)
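The flattening step can be sketched in a few lines. This is not necessarily how clj-tvision does it: flatten-view is a hypothetical helper that assumes every item is either a primitive (a map) or a subview call of the form [f & args]. The square and view functions are copied from the example above.

```clojure
(defn square [left top size color]
  [{:type :rectangle,
    :x1 left,
    :y1 top,
    :x2 (+ left size -1),
    :y2 (+ top size -1),
    :color color}])

(defn view []
  [[square 1 1 5 :red]
   [square 9 9 5 :blue]])

;; Hypothetical helper: primitives are kept as they are; subview calls are
;; invoked and their results are flattened recursively, preserving order.
(defn flatten-view [items]
  (mapcat (fn [item]
            (if (map? item)
              [item]
              (let [[f & args] item]
                (flatten-view (apply f args)))))
          items))

(flatten-view (view))
```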

I’ve defined a multimethod, render-primitive, dispatching on :type. Its methods draw the corresponding primitive to a Lanterna screen.

Oh, didn’t I mention Lanterna? It’s a Java library for terminals. Either real ones or emulated in Swing (easier to work with when you’re in a CIDER REPL). Plus, it sports virtual screens which can be blitted to a real terminal. This gives us a rough poor man’s equivalent of React’s VDOM. And it has a Clojure wrapper!
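To show the shape of that multimethod without pulling in Lanterna, here is a sketch that draws onto a plain vector of vectors instead of a real screen. The blank-screen and render functions, and the two-argument signature of render-primitive, are assumptions made for this illustration.

```clojure
;; Hypothetical stand-in for a screen: a vector of rows, each a vector of cells.
(defn blank-screen [width height]
  (vec (repeat height (vec (repeat width nil)))))

;; Dispatch on the primitive's :type, as described above.
(defmulti render-primitive
  (fn [screen primitive] (:type primitive)))

;; Rectangle corners are inclusive, matching the `square` example.
(defmethod render-primitive :rectangle
  [screen {:keys [x1 y1 x2 y2 color]}]
  (reduce (fn [scr [x y]] (assoc-in scr [y x] color))
          screen
          (for [y (range y1 (inc y2))
                x (range x1 (inc x2))]
            [x y])))

;; Drawing in order means later primitives overwrite earlier ones.
(defn render [screen primitives]
  (reduce render-primitive screen primitives))

(render (blank-screen 4 3)
        [{:type :rectangle, :x1 0, :y1 0, :x2 2, :y2 1, :color :red}
         {:type :rectangle, :x1 2, :y1 1, :x2 3, :y2 2, :color :blue}])
```

Supporting a new kind of primitive then only takes another defmethod; the rendering loop stays untouched.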

Events at eventide

So now we know how to draw our UI. But an app isn’t made up of just drawing. It has a main loop: it listens to events, which cause the app state to change and the corresponding components to redraw.

re-frame does provide an event mechanism, but it doesn’t define any events per se. So we need to ask ourselves: who calls dispatch? How do events originate? How to write the main loop?

clj-tvision is a proof-of-concept, so it doesn’t concern itself with mouse support. There’s only one way a user can interact with the app: via the keyboard. So keystrokes will be the only “source events”, as it were, for the app; and so writing the event loop should be simple. Sketching pseudocode:

(loop []
  (render-app)
  (let [keystroke (wait-for-key)] ;; blocking!
    (dispatch [:key-pressed keystroke])
    (recur)))

Simple as that, should work, right?

Wrong.

If you actually try that, it’ll somewhat work. Hit right arrow to move the rectangle, nothing happens! Hit right arrow again, it moves. Hit left, it moves right. Hit right, it moves left. Not what you want.

You see, there’s a complication stemming from the fact that re-frame’s events are asynchronous by default. (Hence the dispatch vs. dispatch-sync dichotomy.) They don’t get dispatched immediately; rather, re-frame places them on a queue and processes them asynchronously, so that they don’t hog the browser. The Clojure version of re-frame handles that using a single-threaded executor with a dedicated thread.

We almost could use dispatch-sync everywhere, but for re-frame that’s a no-no: once within a dispatch-sync handler, you cannot dispatch other events. If you try anyway, re-frame will detect it and politely point its dragon-scaly head at you, explaining it doesn’t like it. (It is a benevolent dragon, you know.)

So we need to hook into that “next-tick” machinery of re-frame’s somehow. There are probably better ways of doing this, but I opted to blatantly redefine re-frame.interop/next-tick to tell the main loop: “hey, events have been handled and we have a new state, dispatch an event so we can redraw.” This is one of the rare cases where monkey-patching third-party code with alter-var-root saves you the hassle of forking that entire codebase.

So now we have two sources of events: keystrokes, and next-tick. To multiplex them, I’ve whipped up a channel with core.async. It feels hacky, but it leaves room for adding mouse support in the future, or time-based events that fire periodically.
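The multiplexing idea doesn’t depend on core.async as such. To keep the sketch below free of dependencies, it swaps the channel for a java.util.concurrent.LinkedBlockingQueue; the function names and the [:quit] sentinel are made up for the illustration, and a real loop would take events from the keyboard and from the next-tick hook on separate threads.

```clojure
(import 'java.util.concurrent.LinkedBlockingQueue)

;; One queue receives events from every source.
(def events (LinkedBlockingQueue.))

(defn key-pressed! [k] (.put events [:key-pressed k]))  ; source 1: keyboard
(defn state-changed! [] (.put events [:redraw]))        ; source 2: next-tick hook

;; Main loop: block until any source delivers an event, handle it, repeat.
;; The hypothetical [:quit] event ends the loop.
(defn run-loop [handle]
  (loop []
    (let [event (.take events)]
      (when-not (= event [:quit])
        (handle event)
        (recur)))))

;; Simulate both sources, then drain the queue in arrival order.
(def seen (atom []))
(key-pressed! :right)
(state-changed!)
(.put events [:quit])
(run-loop #(swap! seen conj %))
@seen
```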

For completeness, I should also add that Clojure-side re-frame doesn’t have the luxury of having reactive atoms provided by Reagent. Its ratoms are ordinary Clojure atoms. Unlike in ClojureScript, any time the app state changes, every subscription in the signal graph will be recomputed. It may well be possible to port Reagent’s ratoms to Clojure, but it is a far more advanced exercise. For simple apps, what re-frame provides on its own might just be enough.

And with that final bit, we can sweep all that hackitude under the carpet… or, should I say, tuck it into an internal ns that hopefully no-one will ever look into. And we’re left with shiny, declarative, re-framey, beautiful UI code on the surface. Just look.

Closing thoughts

“Within C++, there is a much smaller and cleaner language struggling to get out.” – Bjarne Stroustrup

If you’ve ever encountered legacy C++ code, this will ring true. Come to think of it, Stroustrup’s words are true of every system that has grown organically over its lifetime, with features being added to it but hardly ever removed.

And modern webapps may well be the epitome of that kind of system. We now have desktop apps that are fully self-contained on a single machine, yet use an overwhelmingly complex and vast machinery grown out of a simple system originally devised to view static documents over the Internet.

For all that complexity, we continue to use it. Partly owing to its ubiquity, partly for convenience. In my experience, the abstractions provided by re-frame allow you to wrap your head around large apps and reason about them much more easily than, say, object-oriented approaches. It just feels right. Conversely, writing an app in, say, GTK+ would now feel like a setback by some twenty years.

So this toy, this movable rectangle on a black screen, is not so much an app as it is a philosophical exercise. It is what my typing fingers produced while I pondered, weak and weary: “can we throw away most of that cruft, while still enjoying the abstractions that make life so much easier?”

Can we?

This post was originally published on Functional Works.

You already use Lisp syntax

2014-05-20T00:00:00Z

Unix Developer: I’m not going to touch Lisp. It’s horrible!

Me: Why so?

UD: The syntax! This illegible prefix-RPN syntax that nobody else uses. And just look at all these parens!

Me: Well, many people find it perfectly legible, although most agree that it takes some time to get accustomed to. But I think you’re mistaken. Lots of people are using Lisp syntax on a daily basis…

UD: I happen to know no one doing this.

Me: …without actually realizing this. In fact, I think you yourself are using it.

UD: Wait, what?!

Me: And the particular variant of Lisp syntax you’re using is called Bourne shell.

UD: Now I don’t understand. What on earth does the shell have to do with Lisp?

Me: Just look: in the shell, you put the name of the program first, followed by the arguments, separated by spaces. In Lisp it’s exactly the same, except that you put an opening paren at the beginning and a closing paren at the end.

Shell: run-something arg1 arg2 arg3

Lisp: (run-something arg1 arg2 arg3)

UD: I still don’t get the analogy.

Me: Then you need a mechanism for expression composition — putting the output of one expression as an input to another. In Lisp, you just nest the lists. And in the shell?

UD: Backticks.

Me: That’s right. Or $(), which has the advantage of being more easily nestable. Let’s try arithmetic. How do you do arithmetic in the shell?

UD: expr. Or the Bash builtin let. For example,

$ let x='2*((10+4)/7)'; echo $x
4

Me: Now wouldn’t it be in line with the spirit of Unix — to have programs do just one thing — if we had one program to do addition, and another to do subtraction, and yet another to do multiplication and division?

It’s trivial to write it in C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
  int mode = -1, cnt = argc - 1, val, i;
  char **args = argv + 1;
  switch (argv[0][strlen(argv[0]) - 1]) {
    case '+': mode = 0; break;
    case '-': mode = 1; break;
    case 'x': mode = 2; break;
    case 'd': mode = 3; break;
  }
  if (mode == -1) {
    fprintf(stderr, "invalid math operation\n");
    return 1;
  }
  if ((mode == 1 || mode == 3) && !cnt) {
    fprintf(stderr, "%s requires at least one arg\n", argv[0]);
    return 1;
  }
  switch (mode) {
    case 0: val = 0; break;
    case 2: val = 1; break;
    default: val = atoi(*args++); cnt--; break;
  }
  while (cnt--) {
    switch (mode) {
      case 0: val += atoi(*args++); break;
      case 1: val -= atoi(*args++); break;
      case 2: val *= atoi(*args++); break;
      case 3: val /= atoi(*args++); break;
    }
  }
  printf("%d\n", val);
  return 0;
}

This dispatches on the last character of its name, so it can be symlinked to +, -, x and d (I picked unusual names for multiplication and division to make them legal and avoid escaping).

Now behold:

$ x 2 $(d $(+ 10 4) 7)
4

UD: Wow, this sure looks a lot like Lisp!

Me: And yet it’s the shell. Our two basic rules — program-name-first and $()-for-composition — allowed us to explicitly specify the order of evaluation, so there was no need to do any fancy parsing beyond what the shell already provides.

UD: So is the shell a Lisp?

Me: Not really. The shell is stringly typed: a program takes textual parameters and produces textual output. To qualify as a Lisp, it would have to have a composite type: a list or a cons cell to build lists on top of. Then, you’d be able to represent code as this data structure, and write programs to transform code to other code.

But the Tao of Lisp lingers in the shell syntax.

I know I’ve glossed over many details here, like the shell syntax for redirection, globbing, subprocesses, the fact that programs have standard input in addition to command-line arguments, pipes, etc. — all these make the analogy rather weak. But I think it’s an interesting way to teach Lisp syntax to people.
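To make the "code as a data structure" point concrete, here is a minimal Clojure sketch using the same arithmetic as the shell example. The names expr and rewritten are made up for illustration.

```clojure
(require '[clojure.walk :as walk])

;; The arithmetic from the shell example as a quoted list:
;; plain data, not yet evaluated.
(def expr '(* 2 (quot (+ 10 4) 7)))

(first expr) ; => *   (the operator is just the first element of a list)
(eval expr)  ; => 4   (evaluating the list runs it as code)

;; A program that transforms code into other code: swap every + for a *.
(def rewritten (walk/postwalk-replace {'+ '*} expr))

rewritten        ; => (* 2 (quot (* 10 4) 7))
(eval rewritten) ; => 10
```

This is the step the shell cannot take: its commands are strings, so there is no nested structure to walk.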

Lithium revisited: A 16-bit kernel (well, sort of) written in Clojure (well, sort of)

2013-05-26T00:00:00Z

Remember Lithium? The x86 assembler written in Clojure, and a simple stripes effect written in it? Well, here’s another take on that effect:

And here is the source code:

(do (init-graph)
    (loop [x 0 y 0]
      (put-pixel x y (let [z (mod (+ (- 319 x) y) 32)]
                       (if (< z 16) (+ 16 z) (+ 16 (- 31 z)))))
      (if (= y 200)
        nil
        (if (= x 319)
          (recur 0 (inc y))
          (recur (inc x) y)))))
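The color arithmetic in that put-pixel call can be checked on the JVM with ordinary Clojure. This is just a sketch: stripe-color is a made-up name and not part of Lithium.

```clojure
;; Same arithmetic as the put-pixel argument above, as a plain function.
;; stripe-color is a hypothetical helper, not something Lithium provides.
(defn stripe-color [x y]
  (let [z (mod (+ (- 319 x) y) 32)]
    (if (< z 16)
      (+ 16 z)
      (+ 16 (- 31 z)))))

;; Colors along the top row, walking left from the right edge.
(map #(stripe-color % 0) [319 318 304 303 288])
;; => (16 17 31 31 16)
```

The value climbs from 16 to 31 and back down again, so it always stays within palette entries 16 to 31, which form the grayscale ramp in the default mode 13h palette.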

I implemented this several months ago and pushed it to GitHub, but development has pretty much stalled since then. After seeing this recent post on HN today, I’ve decided to give Lithium a little more publicity, in the hope that it will provide a boost of motivation to me. Because what we have here is pretty similar to Rustboot: it’s a 16-bit kernel written in Clojure.

Well, sort of.

After writing a basic assembler capable of building bare binaries of simple x86 real-mode programs, I’ve decided to make it a building block of a larger entity. So I’ve embarked on a project to implement a compiler for a toy Lisp-like language following the paper “An Incremental Approach to Compiler Construction”, doing it in Clojure and making the implemented language similar to Clojure rather than to Scheme.

(Whether it actually can be called Clojure is debatable. It’s unclear what the definition of Clojure the language is. Is running on JVM a part of what makes Clojure Clojure? Or running on any host platform? Is ClojureScript Clojure? What about ClojureCLR, or clojure-py?)

So far I’ve only gotten to step 7 of 24 or so, but that’s already enough to have a working loop/recur implementation, and it was trivial to throw in some graphical mode 13h primitives to be able to implement this effect.

By default I’m running Lithium programs as DOS .COM binaries under DOSBox, but technically, the code doesn’t depend on DOS in any way (it doesn’t ever invoke interrupt 21h) and so it can be combined with a simple bootloader into a kernel runnable on the bare metal.

The obligatory HOWTO on reproducing the effect: install DOSBox and Leiningen, check out the code, launch a REPL with lein repl, execute the following forms, and enjoy the slowness with which individual pixels are painted:

(require 'lithium.compiler)
(in-ns 'lithium.compiler)
(run! (compile-program "/path/to/lithium/examples/stripes-grey.clj"))

Lithium: an x86 assembler for Clojure

2012-05-14T00:00:00Z

Ah, the golden days of childhood’s hackage. Don’t you have fond memories of them?

I got my first PC when I was 10. It was a 486DX2/66 with 4 megs of RAM and a 170 meg HDD; it ran DOS and had lots of things installed on it, notably Turbo Pascal 6. I hacked a lot in it. These were pre-internet days when knowledge was hard to come by, especially for someone living in a small town in Poland; my main sources were the software I had (TP’s online help was of excellent quality), a couple of books, and a popular computing magazine that published articles on programming. From the latter, I learned how to program the VGA: how to enter mode 13h, draw pixels on screen, wait for vertical retrace, manipulate the palette, and how to combine these things into neat effects. One of the very first things I discovered was that when you plot every pixel using the sum of its coordinates modulo 40 as its color, you get a nice-looking diagonal stripes effect. Because of the initially incomprehensible inline assembly snippets appearing all over the place, I eventually learned x86 assembly, too.

Back to 2012: I’ve long been wanting to hack on something just for pure fun, a side pet project. Writing code for the bare metal is fun because it’s just about as close as you can get to wielding the ultimate power. And yet, since Clojure is so much fun too, I wanted the project to have something to do with Clojure.

So here’s Lithium, an x86 16-bit assembler written in pure Clojure and capable of assembling a binary version of the stripes effect.

To try it, clone the git repo to your Linux or OS X machine, install DOSBox, launch a REPL with Leiningen, change to the lithium namespace and say:

(run! "/home/you/lithium/src/stripes.li.clj")

FAQ

(Well, this is not really a FAQ since nobody actually asked me any questions about Lithium yet. This is more in anticipation of questions that may arise.)

What’s the importance of this?

None whatsoever. It’s just for fun.

How complete is it?

Very incomplete. To even call it pre-pre-alpha would be an exaggeration. It’s currently little more than the bare minimum required to assemble stripes.li.clj. Output format wise, it only produces bare binaries (similar to DOS .COMs), and that’s unlikely to change anytime soon.

Do you intend to continue developing it?

Absolutely. I will try to make it more complete, add 32- and possibly 64-bit modes, see how to add a macro system (since the input is s-expressions, it should be easy to produce Clojure macros to write assembly), write something nontrivial in it, and see how it can be used as a backend for some higher-level language compiler (I’m not sure yet which language that will turn out to be).

How to call a private function in Clojure

2012-04-25T00:00:00Z

tl;dr: Don’t do it. If you really have to, use (#'other-library/private-function args).

A private function in Clojure is one that has been defined using the defn- macro, or equivalently by setting the metadata key :private to true on the var that holds the function. It is normally not allowed in Clojure to call such functions from outside of the namespace where they have been defined. Trying to do so results in an IllegalStateException stating that the var is not public.

It is possible to circumvent this and call the private function, but it is not recommended. That the author of the library decided to make a function private probably means that he considers it to be an implementation detail, subject to change at any time, and that you should not rely on it being there. If you think it would be useful to have this functionality available as part of the public API, your best bet is to contact the library author and discuss the change, so that it may be included officially in a future version.

Contacting the author, however, is not always feasible: she may not be available or you might be in haste. In this case, several workarounds are available. The simplest is to use (#'other-library/private-function args), which works in Clojure 1.2.1 and 1.3.0 (it probably works in other versions of Clojure as well, but I haven’t checked that).

Why does this work? When the Clojure compiler encounters a form (sym args), it invokes analyzeSeq on that form. If its first element is a symbol, it proceeds to analyze that symbol. One of the first operations in that analysis is checking if it names an inline function, by calling isInline. That function looks into the metadata of the Var named by the symbol in question. If it’s not public, it throws an exception.

On the other hand, #' is the reader macro for var. So our workaround is equivalent to ((var other-library/private-function) args). In this case, the first element of the form is not a symbol, but a form that evaluates to a var. The compiler is not able to check for this so it does not insert a check for privateness. So the code compiles to calling a Var object.

Here’s the catch: Vars are callable, just like functions. They implement IFn. When a var is called, it delegates the call to the IFn object it is holding. This has been recently discussed on the Clojure group. Since that delegation does not check for the var’s privateness either, the net effect is that we are able to call a private function this way.
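Here is the whole thing in miniature. The namespace and function names are stand-ins made up for illustration; other-library plays the role of the third-party code.

```clojure
;; A stand-in "library" with one private function (hypothetical names).
(ns other-library)

(defn- private-function [x]
  (* 2 x))

;; Client code living in a different namespace.
(ns my-app)

;; (other-library/private-function 21)
;; would be rejected by the compiler, because the var is not public.

;; Going through the var itself works, because a Var is callable.
(defn call-private [x]
  (#'other-library/private-function x))

(call-private 21) ; => 42
```

The usual caveat applies: nothing stops the library author from renaming or removing private-function in the next release.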

Lifehacking: How to get cheap home equipment using Clojure

2012-04-12T00:00:00Z

I moved to London last September. Like many new Londoners, I have changed accommodation fairly quickly, with one move already behind me and another looming in a couple of months; my current flat was largely unfurnished when I moved in, so I had to buy some basic homeware. I didn’t want to invest much in it, since it’d be only for a few months. Luckily, it is not hard to do that cheaply: many people are moving out and getting rid of their stuff, so quite often you can search for the desired item on Gumtree and find there’s a cheap one a short bike ride away.

Except when there isn’t. In this case, it’s worthwhile to check again within a few days as new items are constantly being posted. Being lazy, I’ve decided to automate this. A few hours and a hundred lines of Clojure later, gumtree-scraper was born.

I’ve packaged it using lein uberjar into a standalone jar, which, when run, produces a gumtree.rss that is included in my Google Reader subscriptions. This way, whenever something I’m interested in appears, I get notified within an hour or so.

It’s driven by a Google spreadsheet. I’ve created a sheet that has three columns: item name, minimum price, maximum price; then I’ve made it available to anyone who knows the URL. This way I can edit it pretty much from everywhere without touching the script. Each time the script is run (by cron), it downloads that spreadsheet as a CSV that looks like this:

hand blender,,5
bike rack,,15

For each row the script queries Gumtree’s category “For Sale” within London given the price range, gets each result and transforms it into an RSS entry.
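Reading those rows takes only a few lines. The sketch below is not lifted from gumtree-scraper; parse-price and parse-row are hypothetical helpers, and it assumes that an empty price cell means "no limit".

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: an empty cell becomes nil, anything else a number.
(defn parse-price [s]
  (when-not (str/blank? s)
    (Long/parseLong (str/trim s))))

;; Hypothetical helper: one CSV line -> a map describing the search.
;; The -1 limit keeps trailing empty cells, so "bike rack,5," still
;; yields three fields.
(defn parse-row [line]
  (let [[item min-price max-price] (str/split line #"," -1)]
    {:item item
     :min-price (parse-price min-price)
     :max-price (parse-price max-price)}))

(map parse-row ["hand blender,,5" "bike rack,,15"])
;; => ({:item "hand blender", :min-price nil, :max-price 5}
;;     {:item "bike rack", :min-price nil, :max-price 15})
```

Splitting on commas like this is only good enough while item names contain no commas or quotes; a real CSV parser would be the safer choice otherwise.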

Gumtree has no API, so I’m using screenscraping to retrieve all the data. Because the structure of the pages is much simpler, I’m actually scraping the mobile version; a technical twist here is that the mobile version is only served to actual browsers so I’m supplying a custom User-Agent, pretending to be Safari. For actual scraping, the code uses Enlive; it works out nicely.

About half of the code is RSS generation — mostly XML emitting. I’d use clojure.xml/emit but it’s known to produce malformed XML at times, so I include a variant that should work.

In case anyone wants to try it out, be aware that the location and category are hardcoded in the search URL template; if you want, change the template line in get-page. The controller spreadsheet URL is not, however, hardcoded; it’s built up using the spreadsheet.key system property. Here’s the wrapper script I use that is actually run by cron:

#!/bin/bash
if [ "`ps ax | grep java | grep gumtree`" ]; then
  echo "already running, exiting"
  exit 0
fi
cd "`dirname $0`"
java -Dspreadsheet.key=MY_SECRET_KEY -jar $HOME/gumtree/gumtree.jar
cp $HOME/gumtree/gumtree.rss $HOME/public_html

Now let me remove that entry for a blender: I bought one yesterday for £4…

Combining virtual sequences
or, Sequential Fun with Macros
or, How to Implement Clojure-Like Pseudo-Sequences with Poor Man’s Laziness in a Predominantly Imperative Language

2011-12-09T00:00:00Z

Sequences and iteration

There are a number of motivations for this post. One stems from my extensive exposure to Clojure over the past few years: this was, and still is, my primary programming language for everyday work. Soon, I realized that much of the power of Clojure comes from a sequence abstraction being one of its central concepts, and a standard library that contains many sequence-manipulating functions. It turns out that by combining them it is possible to solve a wide range of problems in a concise, high-level way. In other words, it pays to think in terms of whole sequences, rather than individual elements.

Another motivation comes from a classical piece of functional programming humour, The Evolution of a Haskell Programmer. If you don’t know it, go check it out: it consists of several Haskell implementations of factorial, starting out from a straightforward recursive definition, passing through absolutely hilarious versions involving category-theoretical concepts, and finally arriving at this simple version that is considered most idiomatic:

fac n = product [1..n]

This is very Clojure-like in that it involves a sequence (a list built from a range). In Clojure, this could be implemented as

(defn fac [n]
  (reduce * 1 (range 1 (inc n))))

Now, I thought to myself, how would I write factorial in an imperative language? Say, Pascal?

function fac(n : integer) : integer;
var
  i, res : integer;
begin
  res := 1;
  for i := 1 to n do
    res := res * i;
  fac := res;
end;

This is very different from the functional version that works with sequences. It is much more elaborate, introducing an explicit loop. On the other hand, it’s memory efficient: it’s clear that its memory requirements are O(1), whereas a naïve implementation of a sequence would need O(n) to construct it all in memory and then reduce it down to a single value.

Or is it really that different? Think of the changing values of i in that loop. On first iteration it is 1, on second iteration it’s 2, and so on up to n. Therefore, one can really think of a for loop as a sequence! I call it a “virtual” sequence, since it is not an actual data structure; it’s just a snippet of code.

To rephrase it as a definition: a virtual sequence is a snippet of code that (presumably repeatedly) yields the member values.

Let’s write some code!

To illustrate it, throughout the remainder of this article I will be using Common Lisp, for the following reasons:

It allows for imperative style, including GOTO-like statements. This will enable us to generate very low-level code.
Thanks to macros, we will be able to obtain interesting transformations.

Okay, so let’s have a look at how to generate a one-element sequence. Simple enough:

(defmacro vsingle (x)
 `(yield ,x))

The name VSINGLE stands for “Virtual sequence that just yields a SINGLE element”. (In general, I will try to define virtual sequences named and performing similarly to their Clojure counterparts here; whenever there is a name clash with an already existing CL function, the name will be prefixed with V.) We will not concern ourselves with the actual definition of YIELD at the moment; for debugging, we can define it just as printing the value to the standard output.

(defun yield (x)
  (format t "~A~%" x))

We can also convert a Lisp list to a virtual sequence which just yields each element of the list in turn:

(defmacro vseq (list)
  `(loop for x in ,list do (yield x)))

(defmacro vlist (&rest elems)
  `(vseq (list ,@elems)))

Now let’s try to define RANGE. We could use loop, but for the sake of example, let’s pretend that it doesn’t exist and write a macro that expands to low-level GOTO-ridden code. For those of you who are not familiar with Common Lisp, GO is like GOTO, except it takes a label that should be established within a TAGBODY container.

(defmacro range (start &optional end (step 1))
  (unless end
    (setf end start start 0))
  (let ((fv (gensym)))
    `(let ((,fv ,start))
       (tagbody
        loop
          (when (>= ,fv ,end)
            (go out))
          (yield ,fv)
          (incf ,fv ,step)
          (go loop)
       out))))

Infinite virtual sequences are also possible. After all, there’s nothing preventing us from considering a snippet of code that loops infinitely, executing YIELD, as a virtual sequence! We will define the equivalent of Clojure’s iterate: given a function fun and initial value val, it will repeatedly generate val, (fun val), (fun (fun val)), etc.

(defmacro iterate (fun val)
  (let ((fv (gensym)))
    `(let ((,fv ,val))
       (tagbody loop
          (yield ,fv)
          (setf ,fv (funcall ,fun ,fv))
          (go loop)))))

So far, we have defined a number of ways to create virtual sequences. Now let’s ask ourselves: is there a way, given code for a virtual sequence, to yield only the elements from the original that satisfy a certain predicate? In other words, can we define a filter for virtual sequences? Sure enough. Just replace every occurrence of yield with code that checks whether the yielded value satisfies the predicate, and only if it does invokes yield.

First we write a simple code walker that applies some transformation to every yield occurrence in a given snippet:

(defun replace-yield (tree replace)
  (if (consp tree)
      (if (eql (car tree) 'yield)
          (funcall replace (cadr tree))
          (loop for x in tree collect (replace-yield x replace)))
      tree))
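For readers more at home in Clojure, the same walker might look roughly like this when applied to quoted code. It is a sketch of the idea rather than a port: it only descends into lists and seqs, so yields hidden inside vectors or maps would be left alone.

```clojure
;; Replace every (yield x) form in a quoted code tree with whatever
;; (replace x) returns. Only lists and seqs are walked in this sketch.
(defn replace-yield [tree replace]
  (if (seq? tree)
    (if (= (first tree) 'yield)
      (replace (second tree))
      (map #(replace-yield % replace) tree))
    tree))

;; Wrap each yield in a check, which is the transformation a filter needs.
(replace-yield '(do (yield 1) (when ready (yield n)))
               (fn [x] (list 'when (list 'odd? x) (list 'yield x))))
;; => (do (when (odd? 1) (yield 1)) (when ready (when (odd? n) (yield n))))
```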

We can now write filter like this:

(defmacro filter (pred vseq &environment env)
  (replace-yield (macroexpand vseq env)
                 (lambda (x) `(when (funcall ,pred ,x) (yield ,x)))))

It is important to point out that since filter is a macro, the arguments are passed to it unevaluated, so if vseq is a virtual sequence definition like (range 10), we need to macroexpand it before replacing yield.

We can now verify that (filter #'evenp (range 10)) works. It macroexpands to something similar to

(LET ((#:G70192 0))
  (TAGBODY
    LOOP (IF (>= #:G70192 10)
           (PROGN (GO OUT)))
         (IF (FUNCALL #'EVENP #:G70192)
           (PROGN (YIELD #:G70192)))
         (SETQ #:G70192 (+ #:G70192 1))
         (GO LOOP)
    OUT))

concat is extremely simple. To produce all elements of vseq1 followed by all elements of vseq2, just execute code corresponding to vseq1 and then code corresponding to vseq2. Or, for multiple sequences:

(defmacro concat (&rest vseqs)
  `(progn ,@vseqs))

To define take, we’ll need to wrap the original code in a block that can be escaped from by means of return-from (which is just another form of goto). We’ll add a counter that will start from n and keep decreasing on each yield; once it reaches zero, we escape the block:

(defmacro take (n vseq &environment env)
  (let ((x (gensym))
        (b (gensym)))
    `(let ((,x ,n))
       (block ,b
         ,(replace-yield (macroexpand vseq env)
                         (lambda (y) `(progn (yield ,y)
                                             (decf ,x)
                                             (when (zerop ,x)
                                               (return-from ,b)))))))))

rest (or, rather, vrest, as that name is taken) can be defined similarly:

(defmacro vrest (vseq &environment env)
  (let ((skipped (gensym)))
    (replace-yield
     `(let ((,skipped nil)) ,(macroexpand vseq env))
     (lambda (x) `(if ,skipped (yield ,x) (setf ,skipped t))))))

vfirst is another matter. It should return a value instead of producing a virtual sequence, so we need to actually execute the code — but with yield bound to something else. We want to establish a block as with take, but our yield will immediately return from the block once the first value is yielded:

(defmacro vfirst (vseq)
  (let ((block-name (gensym)))
   `(block ,block-name
      (flet ((yield (x) (return-from ,block-name x)))
        ,vseq))))

Note that so far we’ve seen three classes of macros:

macros that create virtual sequences;
macros that transform virtual sequences into other virtual sequences;
and finally, vfirst is our first example of a macro that produces a result out of a virtual sequence.

Our next logical step is vreduce. Again, we’ll produce code that rebinds yield: this time to a function that replaces the value of a variable (the accumulator) with the result of calling a function on the accumulator’s old value and the value being yielded.

(defmacro vreduce (f val vseq)
  (let ((accu (gensym)))   ; fresh name, so the accumulator cannot capture a user variable
    `(let ((,accu ,val))
       (flet ((yield (x) (setf ,accu (funcall ,f ,accu x))))
         ,vseq
         ,accu))))

We can now build a construct that executes a virtual sequence and wraps the results up as a Lisp list, in terms of vreduce.

(defun conj (x y)
  (cons y x))

(defmacro realize (vseq)
 `(nreverse (vreduce #'conj nil ,vseq)))

Let’s verify that it works:

CL-USER> (realize (range 10))
(0 1 2 3 4 5 6 7 8 9)

CL-USER> (realize (take 5 (filter #'oddp (iterate #'1+ 0))))
(1 3 5 7 9)

Hey! Did we just manipulate an infinite sequence and get the result in a finite amount of time? And that without explicit support for laziness in our language? How cool is that?!
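For comparison, these are the Clojure counterparts of the two REPL examples above, where the laziness comes built in with the language's own sequences:

```clojure
;; Built-in lazy sequences doing the same job as the macros above.
(range 10)
;; => (0 1 2 3 4 5 6 7 8 9)

(take 5 (filter odd? (iterate inc 0)))
;; => (1 3 5 7 9)
```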

Anyway, let’s finally define our factorial:

(defun fac (n)
  (vreduce #'* 1 (range 1 (1+ n))))
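As an aside for Clojure readers: the yield-rebinding idea can be mimicked without macros by passing yield around as an ordinary function. This is a different technique from the one in this article (closures at run time instead of code transformation at compile time), and every name below is made up for the sketch.

```clojure
;; Assumption: a virtual sequence is modelled as a function that takes a
;; `yield` callback and calls it once per element.
(defn vrange [n]
  (fn [yield]
    (dotimes [i n]
      (yield i))))

(defn vfilter [pred vseq]
  ;; Hand the underlying sequence a yield that only passes on matching values.
  (fn [yield]
    (vseq (fn [x] (when (pred x) (yield x))))))

(defn vreduce [f init vseq]
  ;; Run the sequence with a yield that folds each value into an accumulator.
  (let [acc (volatile! init)]
    (vseq (fn [x] (vswap! acc f x)))
    @acc))

(vreduce + 0 (vrange 10))                      ; => 45
(vreduce conj [] (vfilter even? (vrange 10)))  ; => [0 2 4 6 8]
```

The price is a function call per element, where the macro version expands into a single flat loop.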

Benchmarking

Factorials grow too fast, so for the purpose of benchmarking let’s write a function that adds numbers from 0 below n, in sequence-y style. First using Common Lisp builtins:

(defun sum-below (n)
  (reduce #'+ (loop for i from 0 below n collect i) :initial-value 0))

And now with our virtual sequences:

(defun sum-below-2 (n)
  (vreduce #'+ 0 (range n)))

Let’s try to time the two versions. On my Mac running Clozure CL 1.7, this gives:

CL-USER> (time (sum-below 10000000))
(SUM-BELOW 10000000) took 8,545,512 microseconds (8.545512 seconds) to run
                    with 2 available CPU cores.
During that period, 2,367,207 microseconds (2.367207 seconds) were spent in user mode
                    270,481 microseconds (0.270481 seconds) were spent in system mode
5,906,274 microseconds (5.906274 seconds) was spent in GC.
 160,000,016 bytes of memory allocated.
 39,479 minor page faults, 1,359 major page faults, 0 swaps.
49999995000000

CL-USER> (time (sum-below-2 10000000))
(SUM-BELOW-2 10000000) took 123,081 microseconds (0.123081 seconds) to run
                    with 2 available CPU cores.
During that period, 127,632 microseconds (0.127632 seconds) were spent in user mode
                    666 microseconds (0.000666 seconds) were spent in system mode
 4 minor page faults, 0 major page faults, 0 swaps.
49999995000000

As expected, SUM-BELOW-2 is much faster, causes fewer page faults and presumably conses less. (Critics will be quick to point out that we could idiomatically write it using LOOP’s SUM/SUMMING clause, which would probably be yet faster, and I agree; yet if we were reducing by something other than +, something that LOOP has no built-in clause for, this would not be an option.)

Conclusion

We have seen how snippets of code can be viewed as sequences and how to combine them to produce other virtual sequences. As we are nearing the end of this article, it is perhaps fitting to ask: what are the limitations and drawbacks of this approach?

Clearly, this kind of sequence is less powerful than “ordinary” sequences such as Clojure’s. The fact that we’ve built them on macros means that once we escape the world of code transformation by invoking some macro of the third class, we can’t manipulate them anymore. In Clojure world, first and rest are very similar; in virtual sequences, they are altogether different: they belong to different worlds. The same goes for map (had we defined one) and reduce.

But imagine that instead of having just one programming language, we have a high-level language A in which we are writing macros that expand to code in a low-level language B. It is important to point out that the generated code is very low-level. It could almost be assembly: in fact, most of the macros we’ve written don’t even require language B to have composite data-types beyond the type of elements of collections (which could be simple integers)!

Is there a practical side to this? I don’t know: to me it just seems to be something with hack value. Time will tell if I can put it to good use.

Color your own Europe with Clojure!

2011-07-11T00:00:00Z

This is a slightly edited translation of an article I first published on my Polish blog on January 19, 2011. It is meant to target newcomers to Clojure and show how to use Clojure to solve a simple real-life problem.

The problem

Some time ago I was asked to prepare a couple of differently-colored maps of Europe. I got some datasets which mapped countries of Europe to numerical values: the greater the value, the darker the corresponding color should be. A sample colored map looked like this:

I began by downloading an easily editable map from Wikimedia Commons, calculated the required color intensities for the first dataset, launched Inkscape and started coloring. After half an hour of tedious clicking, I realized that I would be better off writing a simple program in Clojure that would generate the map for me. It turned out to be an easy task: the remainder of this article will be an attempt to reconstruct my steps.

SVG

The format of the source image is SVG. I knew it was an XML-based vector graphics format, I’d often encountered images in this format on Wikipedia — but editing it by hand was new to me. Luckily, it turned out that the image has a simple structure. Each country’s envelope curve is described with a path element that looks like this:

<path
   id="pl"
   class="eu europe"
   d="a long list of curve node coordinates" />

An important thing to note here is the id attribute: this is the two-letter ISO 3166-1 alpha-2 country code. In fact, there is an informative comment right at the beginning of the image that explains the naming conventions used. Having such a splendid input was of great help.

Just like HTML, SVG uses CSS to define the look of an element. All that is needed to color Poland red is to give the element a style attribute that sets the fill property:

<path
   id="pl"
   style="fill: #ff0000;"
   class="eu europe"
   d="a long list of curve node coordinates" />

Now that we know all this, let’s start coding!

XML in Clojure

The basic way to handle XML in Clojure is to use the clojure.xml namespace, which contains functions that parse XML (on a DOM basis, i.e., into an in-memory tree structure) and serialize such structures back into XML. Let us launch a REPL and start by reading our map and parsing it:

> (use 'clojure.xml)
nil
> (def m (parse "/home/nathell/eur/Blank_map_of_Europe.svg"))
[...a long while...]
Unexpected end of file from server
  [Thrown class java.net.SocketException]

Hold on in there! What’s that SocketException doing here? Firefox displays this map properly, so does Chrome, WTF?! Shouldn’t everything work fine in such a great language as Clojure?

Well, the language is as good as its libraries, and when it comes to Clojure, one can stretch that thought further: Clojure libraries are as good as the Java libraries they use under the hood. In this case, we’ve encountered a feature of the standard Java XML parser (from the javax.xml package). It is restrictive and tries to reject invalid documents (even if they are well-formed). If the file being parsed contains a DOCTYPE declaration, the Java parser, and hence clojure.xml/parse, tries to download the DTD from the given address and validate the document against it. This is unfortunate in many respects, especially from the point of view of the World Wide Web Consortium, since their servers hold the Web standards. One can easily imagine the volume of network traffic this generates: W3C has a blog post about it. Many Java programmers have encountered this problem at some time. There are a few solutions; we will go the simplest way and just manually remove the offending DOCTYPE declaration.
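For the record, one of the other solutions might look like the sketch below: clojure.xml/parse accepts a second argument, a function that starts the parse, and that function can ask the parser not to fetch external DTDs. The name startparse-no-dtd is made up, the feature URI is the one understood by the Xerces-derived parser bundled with the JDK (other SAX implementations may reject it), and the sample document points at a deliberately unreachable, invented DTD address.

```clojure
(require '[clojure.xml :as xml])
(import '(javax.xml.parsers SAXParserFactory)
        '(java.io ByteArrayInputStream))

;; Hypothetical replacement for clojure.xml's default startparse function.
;; Assumes a Xerces-style parser that supports this feature URI.
(defn startparse-no-dtd [source handler]
  (let [factory (doto (SAXParserFactory/newInstance)
                  (.setFeature
                   "http://apache.org/xml/features/nonvalidating/load-external-dtd"
                   false))]
    (.parse (.newSAXParser factory) source handler)))

;; A tiny document whose DOCTYPE points at a made-up address.
(def sample
  (str "<!DOCTYPE svg SYSTEM \"http://dtd.invalid/svg.dtd\">"
       "<svg width=\"680\"/>"))

(def parsed
  (xml/parse (ByteArrayInputStream. (.getBytes sample "UTF-8"))
             startparse-no-dtd))

(:tag parsed)
```

Sticking with the manual route for the rest of this article: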

> (def m (parse "/home/nathell/eur/bm.svg"))
#'user/m
> m
[...many screenfuls of numbers...]

This time we managed to parse the image. Viewing the structure is not easy because of its sheer size (as expected: the file weighs in at over 0.5 MB!), but from the very first characters of the REPL’s output we can make out that it’s a Clojure map (no pun intended). Let’s examine its keys:

> (keys m)
(:tag :attrs :content)

So the map contains three entries with descriptive names. :tag contains the name of the XML element, :attrs is a map of attributes for this element, and :content is a vector of its subelements, each in turn being represented by a similarly structured map (or a string if it’s a text node):

> (:tag m)
:svg
> (:attrs m)
{:xmlns "http://www.w3.org/2000/svg", :width "680", :height "520", :viewBox "1754 161 9938 7945", :version "1.0", :id "svg2"}
> (count (:content m))
68
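The same structure is easier to take in on a miniature document. The sketch below parses a made-up three-element SVG from a string (the document and the name tiny are assumptions for illustration only) and pokes at the three keys, including a text node:

```clojure
(require '[clojure.xml :as xml])
(import '(java.io ByteArrayInputStream))

;; A made-up miniature document: one element with an attribute,
;; one empty subelement and one subelement holding text.
(def tiny
  (xml/parse
   (ByteArrayInputStream.
    (.getBytes "<svg width=\"10\"><path id=\"pl\"/><title>Europe</title></svg>"
               "UTF-8"))))

(:tag tiny)                               ; the root element's name, as a keyword
(:attrs tiny)                             ; its attributes, keyed by keyword
(map :tag (:content tiny))                ; the names of the two subelements
(-> tiny :content second :content first)  ; the text node inside <title>
```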

Just for the sake of practice, let’s try to write the parsed structure back as XML. The function emit should be able to do it, but it prints XML to standard output. We can use the with-out-writer macro from the namespace clojure.contrib.io to dump the XML to a file:

> (use 'clojure.contrib.io)
nil
> (with-out-writer "/tmp/a.svg" (emit m))
nil

We try to view a.svg in Firefox and…

Error parsing XML: not well-formed
Area: file:///tmp/a.xml
Row 15, column 44: Updated to reflect dissolution of Serbia & Montenegro: http://commons.wikimedia.org/wiki/User:Zirland
                 -------------------------------------------^

It turns out that using clojure.xml/emit is not recommended, because it does not handle XML entities in comments correctly; we should use clojure.contrib.lazy-xml instead. For the sake of example, though, let’s stay with emit and manually remove the offending line once again (we can safely do it, since that’s just a comment).
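A side note for readers on later Clojure releases, where clojure.contrib is no longer available: the effect of with-out-writer can be had with clojure.java.io and a rebinding of *out*. The helper name emit-to-file and the sample element are made up for this sketch, and the escaping caveat described above still applies to emit itself.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.java.io :as io])

(defn emit-to-file
  "Hypothetical helper: writes the XML for element to path by rebinding
  *out* to a file writer while clojure.xml/emit prints."
  [element path]
  (with-open [w (io/writer path)]
    (binding [*out* w]
      (xml/emit element))))

;; Usage with a hand-built element in clojure.xml's map format.
(def emit-sample
  {:tag :svg, :attrs {:id "svg2"},
   :content [{:tag :path, :attrs {:id "pl"}, :content nil}]})

(def emit-sample-file (java.io.File/createTempFile "emit-sample" ".svg"))

(emit-to-file emit-sample emit-sample-file)
```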

Coloring Poland

We saw earlier that our main XML node contains 68 subnodes. Let’s see what they are — tag names will suffice:

> (map :tag (:content m))
(:title :desc :defs :rect :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :path :g :path :path :g :path :path :path)

So far, so good. Seems that all country descriptions are contained directly in the main node. Let us try to find Poland:

> (count (filter #(and (= (:tag %) :path)
                       (= ((:attrs %) :id) "pl"))
                 (:content m)))
1

(This snippet of code filters the list of subnodes of m to pick only those elements whose tag name is path and value of attribute id is pl, and returns the length of such list.) Let’s try to add a style attribute to that element, according to what we said earlier. Because Clojure data structures are immutable, we have to define a new top-level element which will be the same as m, except that we will set the style of the appropriate subnode:

> (def m2 (assoc m
                :content
                (map #(if (and (= (:tag %) :path)
                               (= ((:attrs %) :id) "pl"))
                        (assoc % :attrs (assoc (:attrs %) :style "fill: #ff0000;"))
                        %)
                     (:content m))))
#'user/m2
> (with-out-writer "/tmp/a.svg" (emit m2))
nil

We open the created file and see a map with Poland colored red. Yay!
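The nested assoc in the snippet above can also be spelled with assoc-in, which updates a value at a path of keys. The following is a sketch rather than a replacement for the code above; paint-red and nodes are made-up names, and get-in is used so that text nodes and elements without attributes pass through untouched.

```clojure
(defn paint-red
  "Hypothetical helper: returns element with a red fill if it is the path
  with id \"pl\", and returns it unchanged otherwise."
  [element]
  (if (and (= (:tag element) :path)
           (= (get-in element [:attrs :id]) "pl"))
    (assoc-in element [:attrs :style] "fill: #ff0000;")
    element))

;; Made-up sample nodes: two paths and a text node.
(def nodes
  [{:tag :path, :attrs {:id "pl"}, :content nil}
   {:tag :path, :attrs {:id "de"}, :content nil}
   "a text node"])

;; Only the first node gains a :style attribute.
(map paint-red nodes)
```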

Generalization

We will generalize our code a bit. Let us write a function that colors a single state, taking a path element (subnode of svg) as an argument:

(defn color-state
  [{:keys [tag attrs] :as element} colorize-fn]
  (let [state (:id attrs)]
    (if-let [color (colorize-fn state)]
      (assoc element :attrs (assoc attrs :style (str "fill:" color)))
      element)))

This function is similar to the anonymous one we used above in the map call, but differs in some respects. It takes two arguments. As mentioned, the first one is the XML element (destructured into tag and attrs: you can read more about destructuring in the appropriate part of the Clojure docs), and the second argument is… a function that should take a two-letter country code and return an HTML color description (or nil, if that country’s color is not specified; color-state will cope with this and return the element unchanged).

Now that we have color-state, we can easily write a higher-level function that processes and writes XML in one step:

(defn save-color-map
  [svg colorize-fn outfile]
  (let [colored-map (assoc svg :content (map #(color-state % colorize-fn) (:content svg)))]
    ;; write to the file named by the outfile argument
    (with-out-writer outfile
      (emit colored-map))))

Let’s test it:

> (save-color-map m {"pl" "#00ff00"} "/tmp/a.svg")
nil

This time Poland is green (we used a country→color map as an argument to color-state, since Clojure maps are callable like functions). Let’s try to add blue Germany:

> (save-color-map m {"pl" "#00ff00", "de" "#0000ff"} "/tmp/a.svg")
nil

It works!
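Since color-state only ever calls its second argument, anything callable will do. The sketch below repeats the definition of color-state from this section and feeds it first a map and then an ordinary function; poland and p-countries-purple are made-up names for illustration.

```clojure
;; color-state exactly as defined earlier in this section.
(defn color-state
  [{:keys [tag attrs] :as element} colorize-fn]
  (let [state (:id attrs)]
    (if-let [color (colorize-fn state)]
      (assoc element :attrs (assoc attrs :style (str "fill:" color)))
      element)))

(def poland {:tag :path, :attrs {:id "pl"}, :content nil})

;; A map used as the colorizer...
(color-state poland {"pl" "#00ff00"})

;; ...and a plain function (hypothetical) that paints every country whose
;; code starts with "p" purple and leaves the others alone.
(defn p-countries-purple [code]
  (when (and code (.startsWith ^String code "p"))
    "#800080"))

(color-state poland p-countries-purple)
```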

Problem with the UK

Inspired by our success, we try to color different countries. It mostly works, but the United Kingdom remains gray, regardless of whether we specify its code as “uk” or “gb”. We resort to the source of our image, and the beginning comment once again proves helpful:

Certain countries are further subdivided the United Kingdom has gb-gbn for Great Britain and gb-nir for Northern Ireland. Russia is divided into ru-kgd for the Kaliningrad Oblast and ru-main for the Main body of Russia. There is the additional grouping #xb for the “British Islands” (the UK with its Crown Dependencies – Jersey, Guernsey and the Isle of Man)

Perhaps we have to specify “gb-gbn” and “gb-nir”, instead of just “gb”? We try that, but still no luck. After a while of thought: oh yes! Our initial assumption that all the country definitions are path subnodes of the toplevel svg node is false. We have to fix that.

So far we have been doing a “flat” transform of the SVG tree: we only changed the subnodes of the toplevel node, but no deeper. We should change all the path elements (and g, if we want to color groups of paths like the UK), regardless of how deep they occur in the tree.

We can use a zipper to do a depth-first walk of the SVG tree. Let us define a function that takes a zipper, a predicate that tells whether to edit the node in question, and the transformation function to apply to the node if the predicate returns true:

;; zip is assumed to be an alias for clojure.zip:
(require '[clojure.zip :as zip])

(defn map-zipper [f pred z]
  (if (zip/end? z)
    (zip/root z)
    (recur f pred (-> z (zip/edit #(if (pred %) (f %) %)) zip/next))))

Now we rewrite save-color-map as:

(defn save-color-map
  [svg colorize-fn outfile]
  (let [colored-map (map-zipper #(color-state % colorize-fn) (fn [x] (#{:g :path} (:tag x))) (zip/xml-zip svg))]
    ;; write to the file named by the outfile argument
    (with-out-writer outfile
      (emit colored-map))))

This time the UK can be colored.
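To see the depth-first walk reach a nested element, here is a self-contained sketch on a made-up miniature tree (the names nested and recolored, and the tree itself, are assumptions for illustration). It repeats map-zipper so that it can be pasted into a fresh REPL.

```clojure
(require '[clojure.zip :as zip])

(defn map-zipper [f pred z]
  (if (zip/end? z)
    (zip/root z)
    (recur f pred (-> z (zip/edit #(if (pred %) (f %) %)) zip/next))))

;; Made-up miniature tree: one top-level path and one path nested in a group.
(def nested
  {:tag :svg, :attrs nil,
   :content [{:tag :path, :attrs {:id "pl"}, :content nil}
             {:tag :g, :attrs {:id "gb"},
              :content [{:tag :path, :attrs {:id "gb-gbn"}, :content nil}]}]})

;; Style only the element whose id is "gb-gbn", however deep it sits.
(def recolored
  (map-zipper #(assoc-in % [:attrs :style] "fill:#0000ff")
              #(= "gb-gbn" (get-in % [:attrs :id]))
              (zip/xml-zip nested)))

;; The nested path now carries the style; the top-level one does not.
(-> recolored :content second :content first :attrs)
```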

Colorizers

We have automated the process of styling countries to make them appear in color, but translating particular numbers to RGB is tedious. In the last part of this article we will see how to ease this: we are going to write a colorizer, i.e., a function suitable for passing to color-state and save-color-map (so far we’ve been using maps for this).

Let’s start by writing a function that translates a triplet of numbers into HTML RGB notation, because it will be easier for us to work with integers than with strings:

(defn htmlize-color
  [[r g b]]
  (format "#%02x%02x%02x" r g b))

Now we insert a call to htmlize-color into the appropriate place in color-state:

(defn color-state
  [{:keys [tag attrs] :as element} colorize-fn]
  (let [state (:id attrs)]
    (if-let [color (colorize-fn state)]
      (assoc element :attrs (assoc attrs :style (str "fill:" (htmlize-color color))))
      element)))

Now imagine we have a table with numeric values for states, like this:

State	Value
Poland	20
Germany	15
Netherlands	30

We want to have a function that assigns colors to states, such that the intensity of a color should be proportional to the value assigned to a given state. To be more general, assume we have two colors, c1 and c2, and for a given state, for each of the R, G, B components we assign a value proportional to the difference between the state’s value and the smallest value in the dataset, normalized to lie between c1 and c2.

This sounds complex, but I hope an example will clear things up. This is the Clojure implementation of the described algorithm:

(defn make-colorizer
  [dataset ranges]
  (let [minv (apply min (vals dataset))
        maxv (apply max (vals dataset))
        progress (map (fn [[min-col max-col]] (/ (- max-col min-col) (- maxv minv))) ranges)]
    (into {}
          (map (fn [[k v]] [(.toLowerCase k) (map (fn [progress [min-color _]] (int (+ min-color (* (- v minv) progress)))) progress ranges)])
               dataset))))

Let us see how it works on our sample data:

> (make-colorizer {"pl" 20, "de" 15, "nl" 30} [[0 255] [0 0] [0 0]])
{"pl" (85 0 0), "de" (0 0 0), "nl" (255 0 0)}

The second argument means that the red component is to range between 0 and 255, and the green and blue components are to be fixed at 0.

Like we wanted, Germany ends up darkest (because it has the least value), the Netherlands is lightest (because it has the greatest value), and Poland’s intensity is one third that of the Netherlands (because 20 is one third of the way between 15 and 30).
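The arithmetic for a single component can be pulled out into a small function. This is a sketch of the same linear interpolation, not a piece of the code above: interpolate is a made-up name, and it assumes that the largest and smallest values in the dataset differ (otherwise the division fails). htmlize-color is repeated from earlier in this section.

```clojure
(defn interpolate
  "Hypothetical helper: maps v from the range [minv, maxv] onto [c1, c2],
  truncating to an integer. Assumes maxv differs from minv."
  [c1 c2 minv maxv v]
  (int (+ c1 (* (- v minv) (/ (- c2 c1) (- maxv minv))))))

(defn htmlize-color
  [[r g b]]
  (format "#%02x%02x%02x" r g b))

;; Poland's red component: 0 + (20 - 15) * (255 - 0) / (30 - 15) = 85
(interpolate 0 255 15 30 20)   ;=> 85

;; The extremes of the dataset land on the ends of the color range.
(interpolate 0 255 15 30 15)   ;=> 0
(interpolate 0 255 15 30 30)   ;=> 255

;; And the resulting triplet in HTML notation.
(htmlize-color [85 0 0])       ;=> "#550000"
```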

Wrapping up

The application we created can be further developed in many ways. One can, for instance, add a Web interface for it, or write many different colorizers (e.g., discrete colorizer: fixed colours for ranges of input values, or a temperature colorizer transitioning smoothly from blue through white to red — to do this we would have to pass through the HSV color space).

What is your idea to improve on it? For those of you who are tired of pasting snippets of code into the REPL, I’m putting the complete source code with a Leiningen project on GitHub. Forks are welcome.

Keyword arguments

2010-05-04T00:00:00Z

There’s been an ongoing debate about how to pass optional named arguments to Clojure functions. One way to do this is the defnk macro from clojure.contrib.def; I hesitate to call it canonical, since apparently not everyone uses it, but I’ve found it useful a number of times. Here’s a sample:

user> (use 'clojure.contrib.def)
nil
user> (defnk f [:b 43] (inc b))
#'user/f
user> (f)
44
user> (f :b 100)
101

This is an example of keyword arguments in action. Keyword arguments are a core feature of some languages, notably Common Lisp and Objective Caml. Clojure doesn’t have them, but it’s pretty easy to emulate their basic usage with macros, as defnk does.

But there’s more to Common Lisp’s keyword arguments than defnk provides. In CL, the default value of a keyword argument can be an expression referring to other arguments of the same function. For example:

CL-USER> (defun f (&key (a 1) (b a))
           (+ a b))
F
CL-USER> (f)
2
CL-USER> (f :a 45)
90
CL-USER> (f :b 101)
102

I wish defnk had this feature. Or is there some better way that I don’t know of?
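One possible answer, sketched without defnk: Clojure 1.2 and later can destructure the rest arguments of a function as a map, and a default that depends on another argument can then be computed in the body. The sketch below mirrors the Common Lisp example above; it is one way of doing it under that assumption, not a feature of defnk.

```clojure
(defn f
  "Takes optional keyword arguments :a (default 1) and :b (default: the
  value of a)."
  [& {:keys [a] :or {a 1} :as opts}]
  ;; get with a fallback lets b default to whatever a ended up being
  (let [b (get opts :b a)]
    (+ a b)))

(f)          ;=> 2
(f :a 45)    ;=> 90
(f :b 101)   ;=> 102
```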

Sunflower

2010-04-18T00:00:00Z

The program I’ve been writing about recently has come to a point where I think it can be shown to the wider public. It’s called Sunflower and has its home on GitHub. It’s nowhere near being completed, and of alpha quality right now, but even at this stage it might be useful.

Just as sunflower seed kernels come wrapped in hulls, most HTML documents seen in the wild come wrapped in noise that is not really part of the document itself. Take any news site: a document from such a site contains things such as advertisements, header, footer, and many links. Now suppose you have many documents grabbed from the same site. Is it possible to somehow automate the extraction of the document “essences”?

Sunflower to the rescue. It relies on the assumption that documents coming from the same source have the same structure. It presents a list of strings to the user and asks them to pick those that are contained in the text essence. Then it finds the coordinates of the smallest HTML subtree that contains all those strings, and uses those coordinates to extract information from all documents. And it comes with a nice, easily understandable GUI for that.

This technique works remarkably well for many collections, although not all. An earlier, proof-of-concept implementation (in Common Lisp) has been used to extract many press texts for the National Corpus of Polish.

I’ve given up on the symbol-capturing approach to wizards I’ve presented in my previous posts. Inspired by the DOM tree in Web apps, with a bag of elements with identifiers, I now have a central bag of Swing widgets (implemented as an atom) identified by keywords. This bag contains tidbits of the mutable state of Sunflower. This means that I can write callback functions like this:

#(with-components [strings-model selected-dir]
   (.removeAllElements strings-model)
   (let [p (-> selected-dir htmls first parse)]
     (add-component :parsed p)
     (doseq [x (strings p)]
       (.addElement strings-model x))))

Name and conquer: having parts of state explicitly named means that I can reliably access them from just about anywhere. This reduces confusion and allows for less tangled, more self-contained and understandable code.
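How such a bag might be put together can be sketched in a few lines. To be clear, this is a guess at the shape of the idea and not Sunflower’s actual code: the atom layout, the macro body and the demo callback on-select are all assumptions, and a plain java.util.ArrayList stands in for a Swing list model.

```clojure
(def components
  "Assumed representation: one atom holding a keyword -> object map."
  (atom {}))

(defn add-component
  "Stores component in the bag under the keyword k and returns it."
  [k component]
  (swap! components assoc k component)
  component)

(defmacro with-components
  "Binds each symbol in names to the component stored under the keyword
  of the same name, then evaluates body."
  [names & body]
  `(let [{:keys [~@names]} (deref components)]
     ~@body))

;; Demo with stand-in objects instead of Swing widgets.
(add-component :selected-dir "/tmp/corpus")
(add-component :strings-model (java.util.ArrayList.))

(def on-select
  #(with-components [strings-model selected-dir]
     (.clear strings-model)
     (.add strings-model (str "files under " selected-dir))
     (vec strings-model)))

(on-select)
```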

A case for symbol capture

2010-04-05T00:00:00Z

Clojure by default protects macro authors from accidentally capturing a local symbol. Stuart Halloway describes this in more detail, explaining why this is a Good Thing. However, sometimes this kind of symbol capture is called for. I’ve encountered one such case today while hacking a Swing application.

As I develop the app, I find new ways to express Swing concepts and interact with Swing objects in a more Clojuresque way, so a library of GUI macros and functions gets written. One of them is a wizard macro for easy creation of installer-like wizards, where there is a sequence of screens that can be navigated with Back and Next buttons at the bottom of the window.

The API (certainly not finished yet) currently looks like this:

(wizard & components)

where each Swing component corresponding to one wizard screen can be augmented by a supplementary map, which can contain, inter alia, a function to execute upon showing the screen in question.

Now, I want those functions to be able to access the Back and Next buttons in case they want to disable or enable them at need. I thus want the API user to be able to use two symbols, back-button and next-button, in the macro body, and have them bound to the corresponding buttons.

It is crucial that these bindings be lexical and not dynamic. If they were dynamic, they would be only effective during the definition of the wizard, but not when my closures are invoked later on. Thus, my implementation looks like this:

(defmacro wizard [& panels]
  `(let [~'back-button (button "< Back")
         ~'next-button (button "Next >")]
   (do-wizard ~'back-button ~'next-button ~(vec panels))))

where do-wizard is a private function implementing the actual wizard creation, and the ~'foo syntax forces symbol capture.
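The same mechanism can be tried out without any Swing code. In this sketch, with-answer and get-answer are made-up names: the macro deliberately captures the symbol answer, and the second use shows that the binding is lexical, since a closure created in the body still sees it when called later.

```clojure
(defmacro with-answer
  "Hypothetical macro: evaluates body with the symbol answer bound to 42."
  [& body]
  ;; ~'answer emits the plain, unqualified symbol. A bare answer inside the
  ;; syntax-quote would be namespace-qualified and rejected by let.
  `(let [~'answer 42]
     ~@body))

;; The body can refer to answer directly...
(with-answer (inc answer))   ;=> 43

;; ...and a closure created in the body keeps seeing it afterwards.
(def get-answer (with-answer (fn [] answer)))
(get-answer)                 ;=> 42
```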

By the way, if all goes well, this blog post should be the first one syndicated to Planet Clojure. Hello, Planet Clojure readers!

The pitfalls of `lein swank`

2010-03-31T00:00:00Z

A couple of weeks ago I finally got around to acquainting myself with Leiningen, one of the most popular build tools for Clojure. The thing that stopped me the most was that Leiningen uses Maven under the hood, which seemed a scary beast at first sight, but once I overcame the initial fear, it turned out to be quite a simple and useful tool.

One feature in particular is very useful for Emacs users like me: lein swank. You define all dependencies in project.clj as usual, add a magical line to :dev-dependencies, then say

$ lein swank

and lo and behold, you can M-x slime-connect from your Emacs and have all the code at your disposal.

There is, however, an issue that you must be aware of when using lein swank: Leiningen uses a custom class loader (AntClassLoader, to be more precise) to load the Java classes referenced by the code. Despite being seemingly irrelevant, a mere implementation detail, this can bite you in a number of most surprising and obscure ways. Try evaluating the following code in a Leiningen REPL:

(str (.decode
       (java.nio.charset.Charset/forName "ISO-8859-2")
       (java.nio.ByteBuffer/wrap
         (into-array Byte/TYPE (map byte [-79 -26 -22])))))
;=> "???"

The same code evaluated in a plain Clojure REPL will give you "ąćę", which is a string represented in ISO-8859-2 by the three bytes from the above snippet.

Whence the difference? Internally, each charset is represented as a unique instance of its specific class. These are loaded lazily as needed by the Charset/forName method. Presumably, the system class loader is used for that, and somewhere along the way a SecurityException gets thrown and caught.

Note also that there are parts of Java API which use the charset lookup under the hood and are thus vulnerable to the same problem, for example Reader constructors taking charset names. If you use clojure.contrib.duck-streams, then rebinding *default-encoding* will not work from a Leiningen REPL. Jars and überjars produced by Leiningen should be fine, though.
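A quick self-check along these lines can be pasted into a REPL to see whether the session is affected. This is only a diagnostic sketch with made-up names (decode-latin2, charset-check); it goes through the same charset lookup, so in an affected REPL the :decoded entry would be expected to come out as question marks rather than Polish letters.

```clojure
(import '(java.nio.charset Charset))

(defn decode-latin2
  "Hypothetical helper: decodes a seq of byte values as ISO-8859-2."
  [byte-values]
  (String. (byte-array (map byte byte-values))
           (Charset/forName "ISO-8859-2")))

(defn charset-check
  "Collects a few facts about charset support in the current session."
  []
  {:supported?     (Charset/isSupported "ISO-8859-2")
   :decoded        (decode-latin2 [-79 -26 -22])
   ;; the class of the thread's context class loader, for comparison
   ;; between a plain REPL and a Leiningen one
   :context-loader (class (.getContextClassLoader (Thread/currentThread)))})

(charset-check)
```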

Clojure SET

2010-02-10T00:00:00Z

I’ve just taken a short breath off work to put some code on GitHub that I had written over one night some two months ago. It is an implementation of the Set game in Clojure, using Swing for GUI.

I do not have time to clean up or comment the code, so I’m leaving it as is for now; however, I hope that even in its current state it can be of interest, especially for Clojure learners.

Some random notes on the code:

Clojure is concise! The whole thing is just under 250 lines of code, complete with game logic and the GUI. Of these, the logic is about 50 LOC. Despite this it reads clearly and has been a pleasure to write, thanks to Clojure’s support for sets as a data structure (in the vein of the game’s title and theme).
There are no graphics included. All the drawing is done in the GUI part of code (I’ve replaced the canonical squiggle shape by a triangle and stripes by gradients, for the sake of easier drawing).
I’ve toyed around with different Swing layout managers for this game. Back in the days when I wrote in plain Java, I used to use TableLayout, but it has a non-free license; JGoodies Forms is also nice, but has a slightly more complicated API (and it’s an additional dependency, after all). In the end I’ve settled on the standard GridBagLayout, which is similar in spirit to those two, but requires more boilerplate to set up. As it turned out, simple macrology makes it quite pleasurable to use; see add-gridbag in the code for details.
Other things of interest might be my function to randomly shuffle seqs, which strikes a nice balance between simplicity/conciseness of implementation and randomness; and a useful debugging macro.

Comments?
