Understanding setjmp/longjmp

August 8, 2016

Categories: Technical Tags: Programming C

So I have slowly been learning more about relatively lower level stuffs that I glossed over or never even encountered before. There is something elegant about the way everything is built over a foundation that looks funky and fragile, but has a method in the proverbial madness. Today I have read more about the exotic pair of functions named setjmp and longjmp. It’s not a part of the language in the same way goto is, but it’s in <setjmp.h> which is part of the stdlib specified in the ANSI C standard.

Sometimes C is jokingly referred to as merely a syntactic sugar for assembly. It’s not hard to see where that comes from since a lot of C constructs and data structures cleanly map to a corresponding pattern in assembly. That being said, the said patterns are restrictive so as to be kept verifiably correct which isn’t a bad thing at all. But if you were to write assembly by hand, you could do far crazier things (with regards to rules of assembly, no one can claim that C lacks its own share of insanity). But the land off-trail is riddled with hidden mines.

The heritage of C leaks through in places. It has the aforementioned goto which enables unstructured transfer of control. It’s powerful yet dangerous, and quite often it’s used when not needed and this misuse makes it so much maligned.

Yet, its validity is only local. Then there is this weird trick which lets you one-up even that (with caveats of course). Clearly more obnoxious, it has got that counter-intuitive feel about it too. Let’s look at a code example first (taken from wiki):


#include <stdio.h>
#include <setjmp.h>

static jmp_buf buf;

void second(void) {
  printf("second\n");       // prints
  longjmp(buf, 1);          // jumps back to where setjmp was called - setjmp now returns 1
}

void first(void) {
  second();
  printf("first\n");        // does not print
}

int main() {   
  if (!setjmp(buf))
    first();                // when executed, setjmp returns 0
  else                      // when longjmp jumps back, setjmp returns 1
    printf("main\n");       // prints

  return 0;
}

When run, this prints out:

second
main

Few spooky things about this:

  • Let’s ignore everything we don’t recognize. We don’t know what the return value is. But the program printed “second”, which means the function second() was reached. But second() is only possible to be reached from first and first() was only invoked in the first if branch in main. Therefore, the whole condition !setjmp(buf) must have been true. That means setjmp returned 0.

  • Now we don’t know about what longjmp does, but after second ended, first should have resumed execution and the program should have printed “first” but it didn’t. Instead it printed “main” which is not only crazy but should be impossible, didn’t we just establish that the program took the first branch?

I guess I am going to sound a bit long winded here because this is simply my thought dump. Let’s make some restrictive assumptions to simplify the matter first. Imagine our program as a very simple blackbox which doesn’t interact with the outer world at all. If we could take a snapshot of our program state and save it somewhere else, then no matter what it’s doing at any arbitrary point in time, we can always revert it back to that state using that snapshot. From the perspective of that program it would be like going back in time to the point where the snapshot was taken.

Under the restrictive assumptions, the program state at any point in time in our Von Neumann architecture can also be defined, in fact quite succinctly. The processor registers are all we need, the instruction pointer points to which instruction is about to be executed. The stack and base pointer together defines where in the stack we are, and from here we get the stack trace of function calls along with all the local variables. The general purpose and other registers together contain every other necessary piece of information regarding the state of the computation.

The function setjmp, when called for the first time and passed the address of a struct of type jmp_buf, simply takes a snapshot of the processor registers and saves them in that struct, and then it returns 0. Thus the condition in the first if branch becomes true, hence first() gets called.

This in turn calls second() in the first line. Second prints out “second”, and then calls longjmp with the same jmp_buf that was defined earlier. What longjmp does is that it resets the program state using information saved in jmp_buf which means we are back to the line where setjmp was first called, but with the added twist that longjmp now makes it so that instead of 0, setjmp now returns the value passed to the longjmp which in this case is 1. To extend our silly analogy, longjmp allows us to go back to a precise point in the past but with some new information from future. This makes the condition !setjmp(buf) false this time, and thus now the else branch is taken and “main” is printed. First never got to print “first” because second never passed control to it, but instead second went over the head of first and gave control to main directly.

Now the first implication that springs to mind, we can now do Exception Handling in C! Imagine encountering an error several levels below the call chain. Ordinarily we need to propagate the error up through the chain which means error handling at every intermediate level, this is exactly as unsavoury as it sounds! With setjmp/longjmp we can magically climb up the ladder just like an exception being raised somewhere deep and caught several levels above in a try/catch block! That’s neat so what’s the caveat? Let’s see both the advantage and disadvantage in a single (contrived) example:

#include <stdio.h>
#include <stdlib.h>

void fn(int i) {
  int *level = malloc(sizeof(int));
  *level = i;
  printf("Current depth = %d\n", *level);
  if (i == 5) {
    printf("Found the Amulet of Yendor!\n");
    free(level);
    return;
  }
  fn(i + 1);
  printf("Back to depth = %d\n", *level);
  free(level);
  return;
}

int main(int argc, char **argv) {
  printf("Descending into the Dungeon...\n");
  fn(1);
  printf("Safely came back to the surface!\n");
}

And when run this program outputs:

Descending into the Dungeon...
Current depth = 1
Current depth = 2
Current depth = 3
Current depth = 4
Current depth = 5
Found the Amulet of Yendor!
Back to depth = 4
Back to depth = 3
Back to depth = 2
Back to depth = 1
Safely came back to the surface!

As expected. We keep descending down until the base case is met, so the recursion stops at level 5 where we find the Amulet. Then we have to come back through the way we went by unwinding the stack. But wouldn’t it be awesome if we could teleport back to the surface from level 5? A clear solution forms, let’s call setjmp on a buffer in the main before we go down. And then let’s call longjmp from level 5.

#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

jmp_buf buf;

void fn(int i) {
  int *level = malloc(sizeof(int));
  *level = i;
  printf("Current depth = %d\n", *level);
  if (i == 5) {
    printf("Found the Amulet of Yendor!\n");
    free(level);
    longjmp(buf, 1);        // longjmp instead of return
  }
  fn(i + 1);
  printf("Back to depth = %d\n", *level);
  free(level);
  return;
}

int main(int argc, char **argv) {
  printf("Descending into the Dungeon...\n");
  if (!setjmp(buf)) {
    fn(1);
  }
  printf("Safely came back to the surface!\n");
}

And when run, this prints out:

Descending into the Dungeon...
Current depth = 1
Current depth = 2
Current depth = 3
Current depth = 4
Current depth = 5
Found the Amulet of Yendor!
Safely came back to the surface!

So it worked! We have successfully performed teleportation and it’s awesome!

Except, well, the program now leaks memory. We dynamically allocated memory using malloc which we reclaimed just before returning in the first version. But now since we bypassed all intermediate returns in-between, that means now there are as many instance of memory unreclaimed! And it doesn’t have to be memory alone, any resource that isn’t properly managed would also get leaked, for example files or sockets. Any intermediate code that alters any necessary global variable or structure also wouldn’t get run which would make the program incorrect. And if the function where setjmp is defined gets returned before longjmp is called, if you are sucked into some Lovecraftian horror it would be completely deserved! Because then the stack frame is already mangled and the stack and base pointers don’t really mean what you think they do!

So, that’s about the gist of it. As with goto there is nothing inherently good or bad about it. Just that caution must be exercised in proportion. Anyway, that’s way too many words to explain something that’s only meant to be used as a vehicle to understand the much broader concept of continuation.

Quoting wiki again, here is how it is defined:

In computer science and computer programming, a continuation is an abstract representation of the control state of a computer program. A continuation reifies the program control state, i.e. the continuation is a data structure that represents the computational process at a given point in the process’s execution; the created data structure can be accessed by the programming language, instead of being hidden in the runtime environment.

I guess, in high-level terms, the continuation of an expression is simply everything that happens afterwards. So if we have an expression, (+ 2 (* 4 2)) then the continuation of the inner expression is simply (+ 2 x) where x is the result of the inner expression.

But the striking thing is that continuations in Scheme has first class privilege like functions or variables. Which means they can be captured, passed around and so on. We can see the fabled call/cc in action here:

(display
  (+ 2
    (call/cc
      (lambda (c)
        (set! k c)
        (* 4 2)))))

What call/cc is doing here is that, it captured its continuation and passed it to a function as parameter. Now, if you do nothing with it, then the continuation is continued with the return value of the function, just like normal. So the first expression does what we think it does, it prints 10.

But we did something, we captured the continuation in a variable k. And now we can call k from anywhere, which will discard the continuation of the context it’s called from, and continue with k instead.

The implications? I don’t really grasp that yet, though I have heard many lofty claims. I have just read a couple of articles [1, 2] with some really cool examples. For examples, here is how you create a python style generator:

(define (iterator lst)

  (define (gen return)
    (for-each
      (lambda (item)
        (call/cc
          (lambda (y)
            (set! gen y)
            (return item))))
      lst)
    (return 'done))

  (define (generator)
    (call/cc gen))

  generator)

(define next (iterator '(1 2 3 4 5 6)))
> (next) ;; Prints 1
> (next) ;; Prints 2
> (next) ;; Prints 3

And that’s just amazing though I don’t see how I can possibly apply this myself.