This section:
Blog postings by Operational Dynamics partners and staff
Fri, 25 Jan 2008
The arcane secrets of hash-bang
I’ve been working for a while now prototyping various domain specific approaches to modelling software configuration information. Most of these involve putting the configuration data in the body of an executable script. To that end, I’ve been digging in to how interpreted scripts actually work on Linux and other Unix-like operating systems.
#! interpreter
Anyone who has ever written a Shell script, Perl program, or Python program is familiar with #! lines:
#! /bin/sh
#
# A program to do something very special.
#
echo "Hello World"
and
#! /usr/bin/perl
#
# Another program to do something very special.
#
while (<>) {
print "Hello World\n";
}
etc. The program mentioned after the magic #! characters is the program that will interpret the script. There are many gotchas with that (notably portability concerns owing to the fact that some idiotic flavours of Unix don’t put Perl in /usr/bin, that sort of thing).
I always figured that the script file got piped by the OS into the interpreter on stdin. A reasonable guess given the way that most of the tools we use work, but it turns out it is nothing of the sort. Every time I tried to write an interpreter (in C) I got stuck.
What threw me off was that cat works as an “interpreter”:
#! /bin/cat
This is the script body
which will happily be sent
to stdout
If you put that in a file called script and run that from your terminal, then sure enough,
$ ./script
#! /bin/cat
This is the script body
which will happily be sent
to stdout
$
which of course is exactly what would happen if you did:
$ cat < script
#! /bin/cat
This is the script body
which will happily be sent
to stdout
$
and that’s what totally had me on the wrong track. I figured that the interpreter on the #! line was being fed the body of the executing file on stdin. Nope.
Seek and ye shall find, sort of
I thought that I might be able to find out what was going on by reading the code of an interpreter program. I started by looking at the sources for /sbin/runscript (which is on the #! line for all of Gentoo Linux’s RC scripts), expecting that to be quite simple. It was simple. Too simple. All it does is some environment filtering and then fires off bash to run /sbin/runscript.sh (in other words, it’s largely a workaround for the fact that you can’t actually make a shell script itself an interpreter). Nothing at all in there about reading stdin. So then I looked at the source code for Perl (Whoa, there’s a beast). Nothing obvious there either. Lots of stuff about reading from stdin but nothing about that being the origin of the script to be executed. A lot of messing around with argument signatures though.
#! is not exactly the easiest term to put into a search engine. I did, however, happen to know that one of the ways #! is pronounced is “hash bang” (being two common names for the respective characters, though lots of old suspender-snapping sandal-wearing bearded Unix freaks would, I’m sure, tell you with great passion that it has to be pronounced some other way). Searching on “hash bang” brought up lots of arcana, including something that led me to an obscure article by one Andries Brouwer on the parameter signature at invocation, wherein I discovered that there is a calling convention for how arguments are passed to the interpreter program being invoked.
It’s a bit complicated, since you can have command line arguments for both the interpreter and for the script being run. It goes something like this. Let’s say you have a script that begins with the following:
#! /path/to/program -v -d
(with -v perhaps meaning “verbose” and -d perhaps meaning “debug”) and you have it in a file called ./script, then running it will actually cause program to execute. The trick is, with what arguments? Check this out. If you do:
$ ./script -p -r
(with -p and -r, for the sake of illustration, perhaps having the same meanings as they do for cp, that is “preserve” and “recursive” respectively) then when our interpreter program is executed, it will be invoked with the following arguments:
/path/to/program -v -d ./script -p -r
The mapping is a bit obscure. It’s actually:
| argv0 | argi | argn | args… | |
|---|---|---|---|---|
| argv[0] | argv[1] | argv[2] | argv[3] | argv[4] |
| /path/to/program | -v -d | ./script | -p | -r |
(to use the terminology of the article linked above). This all shed a little light on what I’d seen in runscript.c and perl.c, but still not a single mention of the script being fed in on stdin. So I pondered that for a while longer, until finally the light bulb went on.
Eureka & Company
The reason I couldn’t find any mention of ./script being fed in on stdin is because it is not fed in on stdin. You don’t need it to be: you’ve got the name of the script file fed to you in your interpreter’s argument list (from the above example, it’s in argv[2], one happy looking string containing “./script”). So read it already!
FILE* body;
body = fopen(argv[2], "r");
...
and ta da, that’s where you get your script’s program body from. Now you can at last get on with parsing your script, and running it.
Most big programs spend lots of time munging the argument list, dealing with the fact that argv[1] could have all sorts of stuff jammed into it, or nothing at all, etc. The whole thing goes from elegant to clumsy when you discover that if there are no arguments to the interpreter on the #! line then the script file will be in argv[1], and it goes to nightmare level when you look at the list of variations in behaviour across different operating systems, compiled by one Sven Mascheck. Nonetheless, the interpreter is your program, and presumably you can recognize, parse, and skip over zero, one or more arguments to yourself before deciding you’ve reached the name of the script. Judicious use of argv++; argc--; is your friend here, apparently.
Anyway, this all explains why my cat example was working but my own efforts were not. cat is not reading data being fed to it on stdin (which is cat’s behaviour if you run it without any arguments); it’s being executed with an argument, namely ./script as argv[1], i.e. exactly the same as:
$ cat ./script
#! /bin/cat
This is the script body
which will happily be sent
to stdout
$
But now that I know what’s going on, I can write my own interpreter.c:
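#include <stdio.h>
#define LEN 128

int main(int argc, char** argv) {
    char buf[LEN];
    FILE* body;
    /* With no arguments on the #! line, the script's name is argv[1]. */
    if (argc < 2) {
        fprintf(stderr, "usage: interpreter script\n");
        return 1;
    }
    body = fopen(argv[1], "r");
    if (body == NULL) {
        perror(argv[1]);
        return 1;
    }
    /* Copy the script to stdout, one line at a time. */
    while (fgets(buf, LEN, body) != NULL) {
        printf("%s", buf);
    }
    fclose(body);
    return 0;
}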
and if I compile that to interpreter, then I can write a domain specific language that is interpreted by this program, say:
#! ./interpreter
This is a test of the Emergency Broadcast System
in a file called script, then, at last,
$ ./script
#! ./interpreter
This is a test of the Emergency Broadcast System
$
Yeay!
Ok, so that’s cat, but cat is the Hello World of input/output. :) The real point is that running script caused interpreter to be executed, and interpreter got at the body of the script that was “run”, and was able to do something with it. Onwards at last.
AfC
Comments
Julio Merino Vidal wrote in suggesting:
Take a look at NetBSD’s script(7) manual page for some more details about how that is supposed to work and some things you must consider for portability (such as being able to feed a single argument to the interpreter through the #! line, or the maximum length of it).
Updates
Quite by accident, I just came across the related information for Linux; see the execve(2) man page for a succinct treatment of both exec()’ing in general, and the topic of interpreting scripts.

