A C++ program crashes on Solaris trying to throw

This is a story about a problem people bump into from time to time on Solaris: a C++ program crashes trying to throw an exception, even though the exception is caught. The problem usually manifests itself when the throw happens in a shared library.

Crash instead of catch on Solaris x64

A C++ shared library that uses exceptions drops core on Solaris x64 even though the exception seems to be caught in the source code. The problem appears to be common enough that I decided to prepare this writeup.

The stack of the crash might look like this:

$ pstack core
core 'core' of 126:     ./a.out
 0000000000010e1d ???????? ()
 ffff80ffbf59fc05 _SUNW_Unwind_RaiseException () + 55
 ffff80ffa78e1d79 __cxa_throw () + 59
 ffff80ffa79410e3 foo () + 3e
 ...

What happened? And why does it crash only on Solaris?

Let’s explore.

The Problem

Let’s build a sample program to reproduce and debug the problem. We’ll need a library and a main program that calls into that library:

$ cat mylib.cc
extern "C" int foo()
{
   try
   {
      throw new int(42);
   }
   catch(...)
   {
      return 0;
   }

   return 1;
}

Let’s build the library:

$ g++ mylib.cc -fPIC -shared -m64 -o libA.so
$ file libA.so
libA.so:        ELF 64-bit LSB dynamic lib AMD64 Version 1, ...

Now, let’s build an a.out that uses the library:

$ cat main.c
int foo();
int main()
{
   foo();
}
$ gcc -R. -L. -lA -m64 main.c
$ file a.out
a.out:          ELF 64-bit LSB executable AMD64 Version 1, ...

Note: the main executable is a C program, compiled and linked with gcc rather than g++. This is important.

Running the program, you can witness the crash:

$ ./a.out
Segmentation Fault (core dumped)

Debugging

If you have seen enough uncaught exceptions, you’ll notice that this is the wrong kind of crash; the one caused by an uncaught exception looks like this:

terminate called after throwing an instance of 'int*'
Abort (core dumped)

In our case, we clearly have something of a different nature. Running the program under dbx, we can see that it died trying to throw:

$ dbx a.out
...
(dbx:main) run
(dbx:__cxa_throw) where -h -l
  [1] 0x10c75(0x1, 0x1, 0x474e5543432b2b00, 0x411140, 0xfffffd7fffdff5f0, 0x10c75), at 0x10c75
  [2] libc.so.1:_Unwind_RaiseException_Body(), at 0xfffffd7fff2d1d1c
  [3] libc.so.1:_SUNW_Unwind_RaiseException(0x411140), at 0xfffffd7fff2d1f09
=>[4] libstdc++.so.6.0.18:__cxa_throw(obj = <value unavailable>, tinfo = <value unavailable>, dest = <value unavailable>), line 79 in "eh_throw.cc"
  [5] libA.so:foo(), at 0xfffffd7fff340eb3
  [6] a.out:main(), at 0x400cc3

The -h option to the where command is essential; without it, dbx would have hidden the unwinder frames. The -l option shows the libraries the functions belong to; this is almost always useful, so I have an alias for it in ~/.dbxrc.

The topmost frame is obviously bad: the address 0x10c75 is far away from any user code in an x64 process (check with the dbx command proc -map to make sure). So we seem to have an indirect call that used a bad value as the callee address. Looking at the code, it is almost certainly this line:

res = (*ctx_who(ctx))(1, phase,
    exception_object->exception_class,
    exception_object, ctx);

So what went wrong? To get an idea, let’s compare the stack with that of a working program. For instance, compile the main executable with g++ instead of gcc. This combination works:

$ g++ -R. -L. -lA main.c -m64
$ ./a.out
# no crash
$ dbx a.out
# a bit of debugging involved, skipped for brevity
...
(dbx) where -l -h
=>[1] libgcc_s.so.1:_Unwind_RaiseException(exc = 0x4111c0), line 88 in "unwind.inc"
  [2] libstdc++.so.6.0.18:__cxa_throw(obj = <value unavailable>, tinfo = <value unavailable>, dest = <value unavailable>), line 79 in "eh_throw.cc"
  [3] libA.so:foo(), at 0xfffffd7fff340eb3
  [4] a.out:main(), at 0x400d0e

Aha, so the variant that works uses the unwinder provided by libgcc_s instead of the one in libc! The Solaris libc does implement the same unwinding interface, but it does so for the Sun ABI.
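To check which unwinder a given process will actually pick, one option is to ask the runtime linker directly. The following is a minimal sketch, not something from the original investigation: the helper name provider_of is hypothetical, and it assumes that dlsym with RTLD_DEFAULT and dladdr behave on your system the way they do on common ELF platforms.

```cpp
#include <dlfcn.h>
#include <string>

// Hypothetical helper: returns the path of the loaded object that provides
// `symbol` under the default symbol search order, "<not found>" if the
// lookup fails, or "<unknown object>" if the address cannot be mapped back.
std::string provider_of(const char* symbol)
{
    // Resolve the symbol the way an ordinary call would resolve it.
    void* addr = dlsym(RTLD_DEFAULT, symbol);
    if (addr == nullptr)
        return "<not found>";

    // Map the resolved address back to the object that contains it.
    Dl_info info;
    if (dladdr(addr, &info) == 0 || info.dli_fname == nullptr)
        return "<unknown object>";

    return info.dli_fname;
}
```

Calling provider_of("_Unwind_RaiseException") from inside foo(), just before the throw, should name libc.so.1 in the crashing configuration and libgcc_s.so.1 in the working one, assuming the lookup follows the same search order as the call made by __cxa_throw.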

The Sun ABI and the Itanium C++ ABI differ in the contents of the opaque structure called _Unwind_Context. Apparently, where one expects the address of a personality routine, the other stores something else, which is why our program ends up jumping to something that is not code.
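The effect of such a mismatch can be illustrated with a toy model. The two structures below are invented for illustration only; they are not the real _Unwind_Context layouts of either ABI, and the field names and values are assumptions. The sketch only shows how a value written under one layout turns into a bogus "function address" when read under another.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical layout used by the code that fills in the context.
struct ProducerContext {
    std::uint64_t flags;        // some bookkeeping value
    std::uint64_t personality;  // address of the personality routine
};

// Hypothetical layout assumed by the code that reads the context.
struct ConsumerContext {
    std::uint64_t personality;  // the reader expects the address first
    std::uint64_t flags;
};

// Returns what the reader believes is the personality routine address.
std::uint64_t personality_as_seen_by_consumer(const ProducerContext& produced)
{
    static_assert(sizeof(ProducerContext) == sizeof(ConsumerContext),
                  "same size, different meaning of the fields");
    ConsumerContext view;
    std::memcpy(&view, &produced, sizeof view);  // reinterpret the same bytes
    return view.personality;
}
```

With a producer that stores {0x10c75, 0x400cc3}, the reader gets 0x10c75 as the routine address. An indirect call through that value would land far away from any code, much like the topmost frame in the dbx output.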

So why doesn’t the problem appear on other architectures, such as x86 or SPARC? The answer is very simple: libc has no unwinder implementation there. Compare

$ elfdump -s /lib/64/libc.so.1 | grep Unwind | wc -l
      46

with the same for x86

$ elfdump -s /lib/libc.so.1 | grep Unwind | wc -l
       0

Solutions

The ideal solution is to avoid having a second implementation of the stack unwinder in the first place, that is, to get rid of one of them completely. That is beyond the user’s control, as it essentially requires rebuilding the /lib/64/libc.so.1 library.

The second best idea is to hide one unwinder behind the other. There are several ways to achieve that.

A temporary solution

Use LD_PRELOAD so that libgcc_s.so is already loaded by the time the application starts. As a result, libgcc_s comes first in the list of libraries searched for symbols, so all unwinding goes through it.

Let’s verify using the original a.out:

$ LD_PRELOAD=/.../lib/amd64/libgcc_s.so.1 ./a.out

No crash!

If you link the main executable with libgcc_s explicitly, the loader is forced to load it before libc, which effectively leaves only one unwinder implementation in play. Then things just work.

$ gcc -R. -L. -lA main.c -m64 -l gcc_s
$ ./a.out

Again, no crash.

The same effect can be achieved by linking with g++, which adds libgcc_s behind the scenes. This may be why most people never run into this problem: the main program for a C++ library is usually a C++ program too. It doesn’t have to be, though.

Use a different version of the GCC runtime

I noticed that different versions of gcc have libstdc++.so with a different order of recorded dependencies on other libraries. For instance, here we have libc preceding libgcc_s:

$ elfdump -d .../gcc/4.8.2/intel-S2/lib/amd64/libstdc++.so | grep NEED
       [0]  NEEDED            0x26e36             libm.so.2
       [1]  NEEDED            0x26e52             libc.so.1
       [2]  NEEDED            0x26e92             libgcc_s.so.1

And here, the order is reversed:

$ elfdump -d .../amd64/libstdc++.so | grep NEED
       [0]  NEEDED            0x270d7             libm.so.2
       [1]  NEEDED            0x270f3             librt.so.1
       [2]  NEEDED            0x27107             libgcc_s.so.1
       [3]  NEEDED            0x2712f             libc.so.1

So in this case, the runtime linker loads libgcc_s before libc and we’re fine. This can be observed by running ldd on a.out:

$ LD_LIBRARY_PATH=.../4.9.0/intel-S2/lib/amd64/ ldd a.out
        libA.so =>       ./libA.so
        libstdc++.so.6 =>.../amd64//libstdc++.so.6
        libm.so.2 =>     /lib/64/libm.so.2
        libgcc_s.so.1 => .../amd64//libgcc_s.so.1
        libc.so.1 =>     /lib/64/libc.so.1
	...
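The same information can be obtained from inside a running process. The sketch below is an illustration under assumptions: it relies on dl_iterate_phdr from <link.h>, which exists on Linux with glibc, while its availability on a particular Solaris release is something to verify. The function name loaded_objects is hypothetical, and the reported order is assumed, not guaranteed, to mirror the load order.

```cpp
#include <cstddef>
#include <link.h>
#include <string>
#include <vector>

namespace {
// Callback invoked once per loaded object; appends its path to the vector
// passed through `data`. The main executable may report an empty name.
int collect_name(struct dl_phdr_info* info, std::size_t /*size*/, void* data)
{
    auto* names = static_cast<std::vector<std::string>*>(data);
    names->push_back(info->dlpi_name ? info->dlpi_name : "");
    return 0;  // zero tells dl_iterate_phdr to keep iterating
}
}  // namespace

// Hypothetical helper: returns the paths of all objects currently loaded
// into the process, in the order the runtime linker reports them.
std::vector<std::string> loaded_objects()
{
    std::vector<std::string> names;
    dl_iterate_phdr(collect_name, &names);
    return names;
}
```

Printing the result from foo() and comparing the positions of the libgcc_s and libc entries gives a quick check of which one the process picked up first, without leaving the program.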

Useful tools

elfdump(1):

  • -s option shows symbols (and doesn’t truncate the names as readelf does! This difference caused me a lot of grief)
  • -d option shows, among other things, the libraries the loadobject depends on

LD_DEBUG environment variable; it is described in ld.so.1(1) and is incredibly useful:

  • LD_DEBUG=help /bin/true to get usage
  • LD_DEBUG=symbols,detail ./a.out 2>ld.log.txt to learn how and where symbols are found by the loader, ld.so

Maxim Kartashev

Pragmatic, software engineer. Working for Altium Tasking on compilers and tools for embedded systems.