
This section:
Blog postings by Andrew Cowie about Open Source
and Software Development. This section is about the systems used to build
applications, run tests, and deploy to production, be they small standalone
programs or huge e-commerce platforms.
Fri, 25 Jan 2008
The arcane secrets of hash-bang
I’ve been working for a while now prototyping various domain specific approaches to modelling software configuration information. Most of these involve putting the configuration data in the body of an executable script. To that end, I’ve been digging in to how interpreted scripts actually work on Linux and other Unix-like operating systems.
#! interpreter
Anyone who has ever written a Shell script, Perl program, or Python program is familiar with #! lines:
#! /bin/sh
#
# A program to do something very special.
#
echo "Hello World"
and
#! /usr/bin/perl
#
# Another program to do something very special.
#
while (<>) {
print "Hello World\n";
}
etc. The program mentioned after the magic #! characters is the program that will interpret the script. There are many gotchas with that (notably portability concerns owing to the fact that some idiotic flavours of Unix don’t put Perl in /usr/bin, that sort of thing).
I always figured that the script file got piped by the OS into the interpreter on stdin. A reasonable guess given the way that most of the tools we use work, but it turns out it is nothing of the sort. Every time I tried to write an interpreter (in C) I got stuck.
What threw me off was that cat works as an “interpreter”:
#! /bin/cat
This is the script body
which will happily be sent
to stdout
If you put that in a file called script and run that from your terminal, then sure enough,
$ ./script
#! /bin/cat
This is the script body
which will happily be sent
to stdout
$
which of course is exactly what would happen if you did:
$ cat < script
#! /bin/cat
This is the script body
which will happily be sent
to stdout
$
and that’s what totally had me on the wrong track. I figured that the interpreter on the #! line was being fed the body of the executing file on stdin. Nope.
Seek and ye shall find, sort of
I thought that I might be able to find out what was going on by reading the code of an interpreter program. I started by looking at the sources for /sbin/runscript (which is on the #! line for all of Gentoo Linux’s RC scripts), expecting that to be quite simple. It was simple. Too simple. All it does is some environment filtering and then fires off bash to run /sbin/runscript.sh (in other words, it’s largely a workaround for the fact that you can’t actually make a shell script itself an interpreter). Nothing at all in there about reading stdin. So then I looked at the source code for Perl (Whoa, there’s a beast). Nothing obvious there either. Lots of stuff about reading from stdin but nothing about that being the origin of the script to be executed. A lot of messing around with argument signatures though.
#! is not exactly the easiest term to put into a search engine. I did, however, happen to know that one of the ways #! is pronounced is “hash bang” (being two common names for the respective characters, though lots of old suspender-snapping sandal-wearing bearded Unix freaks would, I’m sure, tell you with great passion that it has to be pronounced some other way). Searching on “hash bang” brought up lots of arcana, including something that led me to an obscure article by one Andries Brouwer on the parameter signature at invocation, wherein I discovered that there is a calling convention for how arguments are passed to the interpreter program being invoked.
It’s a bit complicated, since you can have command line arguments for both the interpreter and for the script being run. It goes something like this. Let’s say you have a script that begins with the following:
#! /path/to/program -v -d
(with -v perhaps meaning “verbose” and -d perhaps meaning “debug”) and you have it in a file called ./script, then running it will actually cause program to execute. The trick is, with what arguments? Check this out. If you do:
$ ./script -p -r
(with -p and -r, for the sake of illustration, perhaps having the same meanings as in cp, that is “preserve” and “recursive” respectively) then when our interpreter program is executed, it will be invoked with the following arguments:
/path/to/program -v -d ./script -p -r
the mapping is a bit obscure. It’s actually:
| argv0 | argi | argn | args… | |
|---|---|---|---|---|
| argv[0] | argv[1] | argv[2] | argv[3] | argv[4] |
| /path/to/program | -v -d | ./script | -p | -r |
(to use the terminology in the above link). This all shed a little light on what I’d seen in runscript.c and perl.c, but still not a single mention of the script being fed in on stdin. So I pondered that for a while longer, until finally the light bulb went off.
Eureka & Company
The reason I couldn’t find any mention of ./script being fed in on stdin is because it is not fed in on stdin. You don’t need it to be: you’ve got the name of the script file fed to you in your interpreter’s argument list (from the above example, it’s in argv[2], one happy looking string containing “./script”). So read it already!
FILE* body;
body = fopen(argv[2], "r");
...
and ta da, that’s where you get your script’s program body from. Now you can at last get on with parsing your script, and running it.
Most big programs spend lots of time munging the argument list, dealing with the fact that argv[1] could be full of all sorts of stuff jammed into it, or nothing, etc. The whole thing goes from elegant to clumsy when you discover that if there are no arguments to the interpreter on the #! line then the script file will be in argv[1], and it goes to nightmare level when you look at the list of variations in behaviour across different operating systems, compiled by one Sven Mascheck. Nonetheless, the interpreter is your program, and presumably you can recognize, parse, and skip over zero, one or more arguments to yourself before deciding you’ve reached the name of the script. Judicious use of argv++; argc--; is your friend here, apparently.
Anyway, this all explains why my cat example was working but my own efforts were not. cat is not reading data being fed to it on stdin (which is cat’s behaviour if you run it without any arguments); it’s being executed with an argument, namely ./script as argv[1], i.e. exactly the same as:
$ cat ./script
#! /bin/cat
This is the script body
which will happily be sent
to stdout
$
But now that I know what’s going on, I can write my own interpreter.c:
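#include <stdio.h>
#define LEN 128
int main(int argc, char** argv) {
    char buf[LEN];
    FILE* body;
    /* With no arguments on the #! line, the script's path is in argv[1]. */
    if (argc < 2) {
        fprintf(stderr, "usage: %s script\n", argv[0]);
        return 1;
    }
    body = fopen(argv[1], "r");
    if (body == NULL) {
        perror(argv[1]);
        return 1;
    }
    /* Echo the script, #! line included, to stdout. */
    while (fgets(buf, LEN, body) != NULL) {
        printf("%s", buf);
    }
    fclose(body);
    return 0;
}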
and if I compile that to interpreter, then I can write a domain specific language that is interpreted by this program, say:
#! ./interpreter
This is a test of the Emergency Broadcast System
in a file called script, then, at last,
$ ./script
#! ./interpreter
This is a test of the Emergency Broadcast System
$
Yeay!
Ok, so that’s cat, but cat is the Hello World of input/output. :) The real point is that running script caused interpreter to be executed, and interpreter got at the body of the script that was “run”, and was able to do something with it. Onwards at last.
AfC
Comments
Julio Merino Vidal wrote in suggesting:
Take a look at NetBSD’s script(7) manual page for some more details
about how that is supposed to work and some things you must consider
for portability (such as being able to feed a single argument to the
interpreter through the #! line, or the maximum length of it).
Updates
Quite by accident, I just came across the related information for Linux; see the execve(2) man page for a succinct treatment of both exec()’ing in general, and the topic of interpreting scripts.
Sat, 21 Jan 2006
Understanding Cargo
One of my clients has me working on revamping the infrastructure they use to build their products and run functional tests across them. They’re a Java shop, and so it’s no surprise that their product, a rather large web application, is built in Java Servlets and JSP; since they target a wide range of enterprise customers they need to test their app in as many application server “containers” as possible.
Not terribly unusual, but when you’re trying to run automated tests, it gets tricky. Although in theory one should be able to interchangeably use different app-servers, the different vendors (be they open source or commercial) who have implemented the Servlet, JSP, and J2EE specs all have their quirks. Even assuming the thing you are testing doesn’t use vendor specific extensions, you still have to deal with the problem of setting up, starting, and stopping the app-server containers themselves. And as you’d expect, each different app-server has a rather significantly different way of being configured and run.
Enter Cargo. It’s pretty slick! They have figured out how to configure, start, stop a wide range of different containers and in some cases can control them during runtime. This is all important if you’re trying to do automated testing of a Servlet based web application, because you need to have the container running and your app deployed into it before you can start doing functional tests against it. Their API is primarily meant to be used from within Java but they’ve also made ant tasks and maven plugins.
There are a few examples on the Cargo website, but figuring it out took some doing. Cargo has about three different ways to do any given task: you can set something up using the fully derived strongly typed implementing classes, or you can use one of two factory methods. Presenting all of this at the same time is confusing to say the least. Cargo consists of several very steep class hierarchies with parallel naming conventions. The terms “Local”, “Remote”, “Existing”, “Standalone” are used in permutation with “Container”, “Configuration” and “Deployer”; for instance you have a WebLogic8xLocalContainer and a Resin3xStandaloneLocalConfiguration. It gets confusing when you’re trying to learn the API for the first time, and makes code assist completion really hard (typing “Tomcat” and hitting assist, you have the joy of selecting from Tomcat3xLocalContainer, Tomcat3xStandaloneLocalConfiguration, Tomcat4xLocalContainer, Tomcat4xRemoteContainer, Tomcat4xStandaloneLocalConfiguration, Tomcat5xExistingLocalConfiguration, Tomcat5xLocalContainer, Tomcat5xStandaloneLocalConfiguration, TomcatCopyingLocalDeployer, TomcatLocalDeployer, TomcatRemoteDeployer, … you get the idea. Pretty crazy).
The point of me looking through all this was to be able to help my client make a decision about whether to use Cargo in their build & test architecture. I battled my way through their documentation and javadoc trying to duplicate a simple example. Once I started drawing some simple diagrams of the class hierarchies, though, it began to come together. I did up my notes in OpenOffice Draw to make them presentable. Here’s the result:
First, Cargo has the notion of a “configuration” being a particular setup for an instance of a running app-server container. This takes a brief moment to grok because normally one installs WAR/EAR files directly into the appropriate directory within (under) wherever the server is installed, and since one normally only ever has one instance of a server on a given box, you don’t tend to think about it. It turns out (always something to learn) that the Servlet and Enterprise Application Server specs state that the information and data to be used in a running instance can all be bundled together and put … wherever. Cargo calls it a Configuration. It’s a steep hierarchy, but the important bits seem to be as follows:

Once you’ve got a configuration, then you bring up the app-server container with it. Cargo’s Container hierarchy is like this:

As you can see, there are considerable variations on the core theme - did you set it up before you ran the app-server, or after; is the app-server Local where Cargo can get at it, or Remote where it can’t…
Once you get the sense of it, though, it’s a very powerful tool. To accommodate the wide range of containers that Cargo does is a remarkable achievement, and watching it Do The Right Thing (tm) in each different case is impressive.
AfC

