<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <id>tag:blog.danieljanus.pl,2019:category:skyscraper</id>
  <title>Daniel Janus – Skyscraper</title>
  <link href="http://blog.danieljanus.pl/category/skyscraper/"/>
  <updated>2020-01-21T00:00:00Z</updated>
  <author>
    <name>Daniel Janus</name>
    <uri>http://danieljanus.pl</uri>
    <email>dj@danieljanus.pl</email>
  </author>
  <entry>
    <id>tag:blog.danieljanus.pl,2020-01-21:post:middleware</id>
    <title>Careful with that middleware, Eugene</title>
    <link href="http://blog.danieljanus.pl/middleware/"/>
    <updated>2020-01-21T00:00:00Z</updated>
    <content type="html">&lt;div&gt;&lt;h2 id="prologue"&gt;Prologue&lt;/h2&gt;&lt;p&gt;I’ll be releasing version 0.3 of &lt;a href="https://github.com/nathell/skyscraper"&gt;Skyscraper&lt;/a&gt;, my Clojure framework for scraping entire sites, in a few days.&lt;/p&gt;&lt;p&gt;More than three years have passed since its last release. During that time, I’ve made a number of attempts at redesigning it to be more robust, more usable, and faster; the last one, resulting in an almost complete rewrite, is now almost ready for public use as I’m ironing out the rough edges, documenting it, and adding tests.&lt;/p&gt;&lt;p&gt;It’s been a long journey and I’ll blog about it someday; but today, I’d like to tell another story: one of a nasty bug I had encountered.&lt;/p&gt;&lt;h2 id="part-one:-wrap,-wrap,-wrap,-wrap"&gt;Part One: Wrap, wrap, wrap, wrap&lt;/h2&gt;&lt;p&gt;While updating the code of one of my old scrapers to use the API of Skyscraper 0.3, I noticed an odd thing: some of the output records contained scrambled text. Apparently, the character encoding was not recognised properly.&lt;/p&gt;&lt;p&gt;“Weird,” I thought. Skyscraper should be extra careful about honoring the encoding of pages being scraped (declared either in the headers, or the &lt;code&gt;&lt;meta http-equiv&gt;&lt;/code&gt; tag). In fact, I remembered having seen it working. What was wrong?&lt;/p&gt;&lt;p&gt;For every page that it downloads, Skyscraper 0.3 caches the HTTP response body along with the headers so that it doesn’t have to be downloaded again; the headers are needed to ensure proper encoding when parsing a cached page. The headers are lower-cased, so that Skyscraper can then call &lt;code&gt;(get all-headers "content-type")&lt;/code&gt; to get the encoding declared in headers. If this step is missed, and the server returns the encoding in a header named &lt;code&gt;Content-Type&lt;/code&gt;, it won’t be matched. 
Kaboom!&lt;/p&gt;&lt;p&gt;I looked at the cache, and sure enough, the header names in the cache were not lower-cased, even though they should be. But why?&lt;/p&gt;&lt;p&gt;Maybe I was mistaken, and I had forgotten the lower-casing after all? A glance at the code: no. The lower-casing was there, right around the call to the download function.&lt;/p&gt;&lt;p&gt;Digression: Skyscraper uses &lt;a href="https://github.com/dakrone/clj-http"&gt;clj-http&lt;/a&gt; to download pages. clj-http, in turn, uses the &lt;a href="http://clojure-doc.org/articles/cookbooks/middleware.html"&gt;middleware pattern&lt;/a&gt;: there’s a “bare” request function, and then there are wrapper functions that implement things like redirects, OAuth, exception handling, and what have you. I say “wrapper” because they literally wrap the bare function: &lt;code&gt;(wrap-something request)&lt;/code&gt; returns another function that acts just like &lt;code&gt;request&lt;/code&gt;, but with added functionality. And that other function can in turn be wrapped with yet another one, and so on.&lt;/p&gt;&lt;p&gt;There’s a default set of middleware wrappers defined by clj-http, and it also provides a macro, &lt;code&gt;with-additional-middleware&lt;/code&gt;, which allows you to specify additional wrappers. One such wrapper is &lt;code&gt;wrap-lower-case-headers&lt;/code&gt;, which, as the name suggests, causes the response’s header keys to be returned in lower case.&lt;/p&gt;&lt;p&gt;Back to Skyscraper. We’re ready to look at the code now. Can you spot the problem?&lt;/p&gt;&lt;pre&gt;&lt;code class="hljs clojure"&gt;(&lt;span class="hljs-name"&gt;&lt;span class="hljs-built_in"&gt;let&lt;/span&gt;&lt;/span&gt; [request-fn (&lt;span class="hljs-name"&gt;&lt;span class="hljs-built_in"&gt;or&lt;/span&gt;&lt;/span&gt; (&lt;span class="hljs-symbol"&gt;:request-fn&lt;/span&gt; options)
                     http/request)]
  (&lt;span class="hljs-name"&gt;http/with-additional-middleware&lt;/span&gt; [http/wrap-lower-case-headers]
    (&lt;span class="hljs-name"&gt;request-fn&lt;/span&gt; req
                success-fn
                error-fn)))
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I stared at it for several minutes, did some dirty experiments in the REPL, perused the code of clj-http, until it dawned on me.&lt;/p&gt;&lt;p&gt;See that &lt;code&gt;request-fn&lt;/code&gt;? Even though Skyscraper uses &lt;code&gt;http/request&lt;/code&gt; by default, you can override it in the options to supply your own way of doing HTTP. (Some of the tests use it to mock calls to an HTTP server.) In this particular case, it was not overridden, though: the usual &lt;code&gt;http/request&lt;/code&gt; was used. So things looked good: within the body of &lt;code&gt;http/with-additional-middleware&lt;/code&gt;, headers should be lower-cased because &lt;code&gt;request-fn&lt;/code&gt; is &lt;code&gt;http/request&lt;/code&gt;.&lt;/p&gt;&lt;p&gt;Or is it?&lt;/p&gt;&lt;p&gt;Let me show you how &lt;code&gt;with-additional-middleware&lt;/code&gt; is implemented. It expands to another macro, &lt;code&gt;with-middleware&lt;/code&gt;, which is defined as follows (docstring omitted):&lt;/p&gt;&lt;pre&gt;&lt;code class="hljs clojure"&gt;(&lt;span class="hljs-keyword"&gt;defmacro&lt;/span&gt; &lt;span class="hljs-title"&gt;with-middleware&lt;/span&gt;
  [middleware &amp;amp; body]
  `(&lt;span class="hljs-name"&gt;&lt;span class="hljs-built_in"&gt;let&lt;/span&gt;&lt;/span&gt; [m# ~middleware]
     (&lt;span class="hljs-name"&gt;binding&lt;/span&gt; [*current-middleware* m#
               clj-http.client/request (&lt;span class="hljs-name"&gt;&lt;span class="hljs-built_in"&gt;reduce&lt;/span&gt;&lt;/span&gt; #(%&lt;span class="hljs-number"&gt;2&lt;/span&gt; %&lt;span class="hljs-number"&gt;1&lt;/span&gt;)
                                               clj-http.core/request
                                               m#)]
       ~@body)))
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;That’s right: &lt;code&gt;with-middleware&lt;/code&gt; works by dynamically rebinding &lt;code&gt;http/request&lt;/code&gt;. Which means the &lt;code&gt;request-fn&lt;/code&gt; I was calling is not actually the wrapped version, but the one captured by the outer &lt;code&gt;let&lt;/code&gt;, the one that wasn’t rebound, the one without the additional middleware!&lt;/p&gt;&lt;p&gt;After this light-bulb moment, I moved &lt;code&gt;with-additional-middleware&lt;/code&gt; outside of the &lt;code&gt;let&lt;/code&gt;:&lt;/p&gt;&lt;pre&gt;&lt;code class="hljs clojure"&gt;(&lt;span class="hljs-name"&gt;http/with-additional-middleware&lt;/span&gt; [http/wrap-lower-case-headers]
  (&lt;span class="hljs-name"&gt;&lt;span class="hljs-built_in"&gt;let&lt;/span&gt;&lt;/span&gt; [request-fn (&lt;span class="hljs-name"&gt;&lt;span class="hljs-built_in"&gt;or&lt;/span&gt;&lt;/span&gt; (&lt;span class="hljs-symbol"&gt;:request-fn&lt;/span&gt; options)
                       http/request)]
    (&lt;span class="hljs-name"&gt;request-fn&lt;/span&gt; req
                success-fn
                error-fn)))
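
;; Why the order matters: a minimal sketch in plain Clojure that assumes
;; nothing about clj-http. *fetch* is a hypothetical dynamic var, used
;; here only for illustration.
(def ^:dynamic *fetch* (fn [] :plain))

;; let outside binding: the var is dereferenced before it is rebound,
;; so the captured function is the unwrapped one.
(let [captured *fetch*]
  (binding [*fetch* (fn [] :wrapped)]
    (captured)))
;; evaluates to :plain

;; binding outside let: the var is dereferenced after it is rebound.
(binding [*fetch* (fn [] :wrapped)]
  (let [captured *fetch*]
    (captured)))
;; evaluates to :wrapped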
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;And, sure enough, it worked.&lt;/p&gt;&lt;h2 id="part-two:-the-tests-are-screaming-loud"&gt;Part Two: The tests are screaming loud&lt;/h2&gt;&lt;p&gt;Is it the end of the story? I’m guessing you’re thinking it is. I thought so too. But I wanted to add one last thing: a regression test, so I’d never run into the same problem in the future.&lt;/p&gt;&lt;p&gt;I whipped up a test in which one ISO-8859-2-encoded page was scraped, and a check for the correct string was made. I ran it against the fixed code. It was green. I ran it against the previous, broken version…&lt;/p&gt;&lt;p&gt;It was &lt;em&gt;green&lt;/em&gt;, too.&lt;/p&gt;&lt;p&gt;At this point, I knew I had to get to the bottom of this.&lt;/p&gt;&lt;p&gt;Back to experimenting. After a while, I found out that extracting encoding from a freshly-downloaded page actually worked fine! It only failed when parsing headers fetched from a cache. But the map was the same in both cases! In both cases, the code was effectively doing&lt;/p&gt;&lt;pre&gt;&lt;code class="hljs clojure"&gt;(&lt;span class="hljs-name"&gt;&lt;span class="hljs-built_in"&gt;get&lt;/span&gt;&lt;/span&gt; {&lt;span class="hljs-string"&gt;&amp;quot;Content-Type&amp;quot;&lt;/span&gt; &lt;span class="hljs-string"&gt;&amp;quot;text/html; charset=ISO-8859-2&amp;quot;&lt;/span&gt;}
     &lt;span class="hljs-string"&gt;&amp;quot;content-type&amp;quot;&lt;/span&gt;)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This lookup &lt;em&gt;shouldn’t&lt;/em&gt; succeed: in map lookup, string comparison is case-sensitive. And yet, for freshly-downloaded headers, it &lt;em&gt;did&lt;/em&gt; succeed!&lt;/p&gt;&lt;p&gt;I checked the &lt;code&gt;type&lt;/code&gt; of both maps. One of them was a &lt;code&gt;clojure.lang.PersistentHashMap&lt;/code&gt;, as expected. The other one was not. It was actually a &lt;code&gt;clj_http.headers.HeaderMap&lt;/code&gt;.&lt;/p&gt;&lt;p&gt;I’ll let the comment of that one speak for itself:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;a map implementation that stores both the original (or canonical) key and value for each key/value pair, but performs lookups and other operations using the normalized – this allows a value to be looked up by many similar keys, and not just the exact precise key it was originally stored with.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;And so it turned out that the library authors had actually foreseen the need for looking up headers irrespective of case, and provided a helpful means for that. The whole lowercasing business was not needed, after all!&lt;/p&gt;&lt;p&gt;I stripped out the &lt;code&gt;with-additional-middleware&lt;/code&gt; altogether, added some code elsewhere to ensure that the header map is a &lt;code&gt;HeaderMap&lt;/code&gt; regardless of whether it comes from the cache or not, and they lived happily ever after.&lt;/p&gt;&lt;h2 id="epilogue"&gt;Epilogue&lt;/h2&gt;&lt;p&gt;Moral of the story? It’s twofold.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;span&gt;Dynamic rebinding can be dangerous. Having a public API that is implemented in terms of dynamic rebinding, even more so. I’d prefer if clj-http just allowed the custom middleware to be explicitly specified as an argument, thusly:&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;pre&gt;&lt;code class="hljs clojure"&gt;(&lt;span class="hljs-name"&gt;http/request&lt;/span&gt; req
              &lt;span class="hljs-symbol"&gt;:additional-middleware&lt;/span&gt; [http/wrap-lower-case-headers])
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;&lt;li&gt;&lt;span&gt;Know your dependencies. If you have a problem that might be generically addressed by the library you’re using, look deeper. It might be there already.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Thanks to &lt;a href="https://www.3jane.co.uk"&gt;3Jane&lt;/a&gt; for proofreading this article.&lt;/p&gt;&lt;/div&gt;</content>
  </entry>
</feed>
