enqueue: CLI utility to queue command execution

2011/11/19 § 3 Comments

Update: As you can see in the comments below, and as I feared, it turned out that LluĂ­s Batlle i Rossell already implemented something much like Enqueue, only better in many regards. I doubt I’ll keep maintaining enqueue, there’s no reason to. Oh well, it was a nice afternoon project.

Something that always bugged me with my shell workflow is the problem of queueing commands to run one after the other, while adding commands to the queue as they previous commands are being executed.

Take, for example, a simple usecase: we want to move three large files from diskA to diskB. The problem is that you don’t know the name of the files in advance, perhaps because you’re renaming them manually and it takes you time to type, or because you’re hunting for them in the directory tree, or whatever. Here are some solutions to this:

  • Start one command in the background, then do something like fg ; second-command. Then prepare the third command, but only run it after you saw the second finished. Meh. Or,
  • Just let the jobs run concurrently in the background as you run them (using the & control operator). But since each command is maxing out a resource (CPU, disk, etc), this becomes woefully inefficient really quickly. Or,
  • Use a mad concoction of Ctrl-Z, fg/bg, wait n or (if you left the shell and want to add something to the queue) use a madder concoction of while pgrep -f 'mv /disk/A/foo' > /dev/null; do ... (I’ll leave it as an exercise to the frustrated reader to finish that little one liner). But then again, you could also spend that time getting a paper-cut at the edge of your nostril, and it would probably be just as much fun and maybe even less error prone. Or,
  • Start a shell process reading from a named pipe (mkfifo(1)), and write the commands to the named pipe (credit to my friend and colleague m0she for this sneaky idea). In practice, I found it unwieldy at best, and impossible to extend with bells-and-whistle features if you need them, first and foremost, easily listing the queue and your position in it.

I reckon you could think of a few more ways, but I doubt (and hope! :) none would be more convenient than to simply use Enqueue, a simple and hopefully lightweight Python/twisted command line queuer (written today by yours truly). Usage looks a bit like so:

$ alias nq=enqueue
$ nq add mv /disk1/file1 /disk2/file1
$ nq add mv /disk1/file2 /disk2/file2
$ nq add beep
$ nq list
* mv /disk1/file1 /disk2/file1
  mv /disk1/file2 /disk2/file2
  beep
$

Nice and easy. Enqueue is still a bit rough around the edges and not very feature rich, but it does the job for me and I hope you’d like it too. Queues are managed by a twisted daemon that talks to the CLI client over UNIX domain sockets, and the whole thing fits in about 300 lines of Python. Feel free to open issues/send pull requests on GitHub if you find bugs or want to suggest something, I’ll try to keep up. Promise.


p.s.: Why is it that every time I dabble in Python packaging I end up (a) horribly frustrated and (b) feeling the result is awfully inadequate? Yes, I know I could probably package enqueue better, I know packaging isn’t Python’s strongest side and I know the future is better than the present, but the present sucks and I just had to say this. Yeah, I also walk around in the summertime sayin’ “how about this heat”.

pv: the pipe swiss army knife

2011/11/05 § Leave a comment

When using UNIX, every now and then you run into a relatively unknown command line application which, once you master it, becomes part of your “first class” commands along with cut or tr. You wince every time you work on a computer that doesn’t have it (and promptly wget-configure-make-install it) and you’re amazed your colleagues never heard of it. I often feel pv is such a command for me. Really, this command, much like netcat, should have been written in Berkley sometime circa 1985 and be in every /usr/bin today. Alas, somehow Hobbit only wrote netcat in 1996, and it took a long while for for it to reach /usr/bin ubiquity. Similarly, Andrew Wood only wrote pv in 2002, and I hope this post will convince you to place it in all your /usr/local/bins today and convince distribution makers to promote it to the status of a standard package as soon as possible.

The basic premise of pv is simple – it’s a program that copies stdin to stdout, while richly displaying progress using terminal graphics on stderr. If you use UNIX a lot and you never heard of pv before, I’m pretty sure the lightbulb is already lit above your head (if not, maybe pv isn’t for you after all or maybe it would help if you’d take a look this review of pv to help you see why it’s so great). pv has evolved rather nicely over the years, it’s available from Ubuntu universe for a while now (why only universe? why??), and it has a slew of useful features, like rate limiting, ETA prediction for an expected amount of data, on-the-fly parameter change (increase/decrease rate limit without breaking the pipe!), multiple invocation support (measure speed in two different points of the pipe!) and so on.

If you’re using pv, I hope you may want to see some of the recipes I use it in; if you don’t, maybe they’ll whet your appetite (I’m using long options for pv and short options for everything else):

  1. The basics: copy /tmp/src/ to /tmp/dst/, with progress
  2. $ src=/tmp/src ; tar -cC "$src" . |
      pv --size $(du -hsk "$src" | cut -f1)k |
      tar -xC /tmp/dst
     142MB 0:00:02 [43.4MB/s] [======>    ] 58% ETA 0:00:01
    $
    

    By the way, this works great if you add nc and compression, pv can even help you decide what level of compression to use to achieve the best throughput before the CPU becomes the bottleneck.

  3. Scale a bunch of images to a specific size, using multiple cores and with progress
  4. $ cd /tmp/src ; ls *.jpg |
      xargs -P 4 -I % -t convert -resize 1024 % /tmp/dst/% 2>&1 |
      pv --line-mode --size $(ls *.jpg | wc -l) > /dev/null
      96 0:00:16 [7.85/s] [===>       ] 36% ETA 0:00:28
    $
    
  5. Get a quick assessment of the traffic rate going through some interface
  6. $ sudo tcpdump -c 10000 -i eth1 -w - 2>/dev/null | pv > /dev/null
    35.4MB 0:00:07 [4.56MB/s] [ <=>          ]
    $
    

Nifty, eh? I find myself inserting pv in any pipe I expect to exist for more than a few moments. How do you use pv?

nginx+gzip module might silently corrupt data upon backend failure

2011/11/04 § 3 Comments

There are several elements that make absolutely certain the page you’re reading in your browser is an accurate representation of the resource the HTTP server meant to send you1. Disregarding caching for a minute, we have two elements making sure the representation you get is protected from errors. The first protecting element is, of course, TCP, making sure that if the server wrote two-hundred bytes in a particular order, either they’ll all arrive to your end (in order and without errors) or your TCP stack will realize something bad happened and give your user-agent (your browser) a chance to cope with the error. The need for the second protecting element is a bit more sneaky: TCP will guarantee everything the server wrote will arrive, i.e., bytes for which the server called write(2) or equivalent will arrive (or you’ll know something went wrong). But what about bytes the server should have written but didn’t write all – for example, because some component on the server’s side failed?

The original HTTP (HTTP 0.9, 1996 time) didn’t cope with this situation at all. The signal to the client that the server finished talking was to disconnect the TCP session, which, from the client’s side, is a vague signal. Did the TCP server disconnect because it finished or because it ran into trouble (software fault, sysadmin action, kernel behaviour due to memory pressure or even a bug, etc)? Thankfully, current HTTP kicks in to complement TCP, allowing the server to do one of several things in order to make sure you’ll at least know you didn’t receive the whole picture. By far the two most common thing the server will do are to specify a Content-Length in the response’s header or to use a Transfer-Encoding, most probably chunked transfer encoding.

Content length is simple to grasp. The server wishes to say 200 bytes. It explicitly says: “I will say 200 bytes” in the response header. If the user-agent didn’t receive 200 bytes of response, it knows something went wrong. Chunked transfer encoding is only slightly more complex – the server will send the response in chunks, each chunks prefixed by the length of the chunk. The end of the document is marked by a zero-length chunk. So if the user-agent saw a chunk cut in the middle, or didn’t receive a zero-length chunk, it also knows something went wrong and has a chance to decide what to do about it. For example, when faced with incorrect content length, Chrome displays an ERR_CONNECTION_CLOSED error, whereas Firefox would display the portion of the page it did receive. Different behaviour, yes, but at least both user-agents in this example had a chance realize the response they received is partial. Which is really, really important, you know why? I’ll tell you why.

Enter caching. HTTP caching is a non-trivial matter with many unexpected gotchas and pitfalls, and I can’t cover it all here (why the complexity? I think it’s because all caching is an intentional form of data/state repetition, and repetition is something that in my experience humans often have difficulty reasoning about). By far the best document I know about HTTP caching is this splendid guide, but if you’re in a hurry or impatient, let me summarize the points interesting for this particular post. First, caches might exist in many places, some of them might be surprising, some of them might be slightly broken or at least very aggressive (ISP transparent caches, mutter mutter cough cough). Second, among many other things, HTTP caching lets a server give a client a token together resource, telling the client “next time you request this resource, tell me you have this token; maybe I’ll just tell you that the representation you got with this token is still fresh, without transferring it all over again”. This is called an ETag, and the response that says “just use what you have in your cache” is called HTTP 304 NOT MODIFIED.

How is this relevant to HTTP responses cut in the middle? Well, if servers didn’t have a way of telling the user-agent how long is the document, and if the response was cut in the middle due to a server fault as described above, the user-agent/sneaky-caching-proxy might cache incorrect responses. If the server also sends an ETag along with the response, the caching entity will store this ETag along with the invalid cached representation, and even when it’s time to check the representation’s freshness with the origin server, the server will just take a look at the ETag, say “yep, this is fine”, tell the cache to keep using the bad representation and <sinister>never ever let it recover</sinister>. If this happens on a large ISP’s transparent cache, easily tens of thousand of your users could be affected. If the affected resource is a common element in many of your pages with strict syntax checking, like a javascript resource, you’re kinda screwed. The only hope in such a condition is that the client, for some reason, will specify Cache-Control: no-cache in the request, and that caching entities along the path to the server will honour this request. Browsers like caching, so they won’t usually request no-cache, although AFAIK, recently Chrome started sending no-cache when the user explicitly requests a force-reload (Cmd-R on a Mac). Other browsers don’t fare as well, and I think that hoping one of your Chrome users will force reload the bad resource in time to save the day is hardly a sturdy solution.

Bottom line is, it’s really important to know when a representation of a resource is broken. Which is why I was quite amazed to learn that my HTTP server of choice, nginx, doesn’t validate the Content-Length it receives from its upstreams and is simply unaware when the response it received from an upstream server is chopped off. If your response specifies a content length but closes the connection without delivering enough bytes, nginx will simply stall the request for a long time without closing the connection downstream, even though it has no hope of receiving additional data to push downstream. I tried this both with proxy_pass and uwsgi_pass, but I’m quite confident it’s true for other backends (fastcgi_pass, scgi_pass, etc). This is bad, but not as bad as the case where you want an nginx module to manipulate your content, removing existing content length/transfer encoding and applying its own (the gzip module indeed does that). If a backend error occurs while content-length-oblivious-nginx is altering the data, the content altering module will apply what it applies to the bytes it received, add new content-length/transfer-encoding, assuring everyone the response is OK, and entice user-agents or even proxies to enter the almost-never-recover bad cache scenario I described in the previews paragraph. Ouch!

The proper way to fix this, IMHO, is that nginx simply must start looking at the upstream’s content length (or transfer encoding, once nginx starts using chunked responses with its upstreams). Part of the reason I’m writing this post is that Maxim Dounin, venerable nginx comitter and an OK chap overall, told me he doesn’t consider this a top priority at the moment, but I humbly disagree with his assessment of how serious the issue is. Until such a time as nginx is fixed about this, I think you must disable all content-manipulating nginx modules and instead handle all message length affecting work in your upstream (compression, addition, etc). This is what I opted to do with my django based web app, I replaced nginx’s gzip module with Django’s GZipMiddleware. It’s a terrible shame though. It’s doing the job of nginx for it, probably in a lesser fashion than how nginx could, it violates a must not clause in Python’s WSGI PEP333, and I have empiric proof that Tim Berners-Lee chokes a kitten every time you do it.

But what’s the alternative? Risk invisibly cached corrupt data for an undetermined length of time? Ditch nginx, which I think is the best HTTP server on this planet despite this debacle? Nah. Both are unacceptable.


1 This post assumes convenient values of “absolutely certain”; also, everything related to security/content tampering is out of scope in this post. I’m talking about possibly misbehaving but certainly well-meaning components.

Where Am I?

You are currently viewing the archives for November, 2011 at NIL: .to write(1) ~ help:about.

Follow

Get every new post delivered to your Inbox.

Join 34 other followers