A C++ program crashes on Solaris trying to throw
This is a story about a problem people bump into from time to time on Solaris: a C++ program crashes trying to throw an exception, even though the exception is caught. The problem usually manifests itself when the throw happens in a shared library.
Crash instead of catch on Solaris x64
A C++ shared library that uses exceptions drops core on Solaris x64 even though the exception seems to be caught in the source code. The problem appears to be common enough that I decided to prepare this writeup.
The stack of the crash might look like this:
$ pstack core
core 'core' of 126: ./a.out
0000000000010e1d ???????? ()
ffff80ffbf59fc05 _SUNW_Unwind_RaiseException () + 55
ffff80ffa78e1d79 __cxa_throw () + 59
ffff80ffa79410e3 foo () + 3e
...
What happened? Why does it crash only on Solaris?
Let’s explore.
The Problem
Let’s build a sample program to reproduce and debug the problem. We’ll need a library and a main program that calls into that library:
$ cat mylib.cc
extern "C" int foo()
{
    try
    {
        throw new int(42);
    }
    catch(...)
    {
        return 0;
    }
    return 1;
}
Let’s build the library:
$ g++ mylib.cc -fPIC -shared -m64 -o libA.so
$ file libA.so
libA.so: ELF 64-bit LSB dynamic lib AMD64 Version 1, ...
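Before a second object enters the picture, it can help to confirm that the throw/catch pair behaves as expected when the function is linked straight into a C++ executable, so that only the C++ toolchain's own runtime is involved. The sketch below is a variation on mylib.cc, not the library source itself: catching by type (int*) instead of catch(...) is my assumption, made only so the allocated int can be freed.

```cpp
// Variation on mylib.cc intended to be linked directly into a C++ program.
// Assumption: the handler catches int* (rather than ...) so the object
// created by `new int(42)` can be released.
extern "C" int foo()
{
    try
    {
        throw new int(42);   // throws a pointer, so the exception type is int*
    }
    catch (int* p)
    {
        delete p;            // release the thrown object
        return 0;            // the exception was caught
    }
    return 1;                // only reached if nothing was thrown
}
```

Calling foo() from a main() built with g++ should return 0; if it does, the library code itself is sound and any crash comes from how the objects are combined at run time.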
Now, let’s build an a.out that uses the library:
$ cat main.c
int foo();
int main()
{
    foo();
}
$ gcc -R. -L. -lA -m64 main.c
$ file a.out
a.out: ELF 64-bit LSB executable AMD64 Version 1, ...
Note: the main executable is a C program, compiled and linked with gcc rather than g++. This is important.
Running the program, you can witness the crash:
$ ./a.out
Segmentation Fault (core dumped)
Debugging
If you have seen enough uncaught exceptions, you’ll notice that this is the wrong kind of crash; one caused by an uncaught exception looks like this:
terminate called after throwing an instance of 'int*'
Abort (core dumped)
In our case, we clearly have something of a different nature. Running the program under dbx, we can see that it died trying to throw:
$ dbx a.out
...
(dbx:main) run
(dbx:__cxa_throw) where -h -l
[1] 0x10c75(0x1, 0x1, 0x474e5543432b2b00, 0x411140, 0xfffffd7fffdff5f0, 0x10c75), at 0x10c75
[2] libc.so.1:_Unwind_RaiseException_Body(), at 0xfffffd7fff2d1d1c
[3] libc.so.1:_SUNW_Unwind_RaiseException(0x411140), at 0xfffffd7fff2d1f09
=>[4] libstdc++.so.6.0.18:__cxa_throw(obj = <value unavailable>, tinfo = <value unavailable>, dest = <value unavailable>), line 79 in "eh_throw.cc"
[5] libA.so:foo(), at 0xfffffd7fff340eb3
[6] a.out:main(), at 0x400cc3
That -h option to the where command is essential. Otherwise, dbx would’ve hidden the unwinder frames. The -l option shows the libraries the functions belong to; this is almost always useful, so I have an alias in ~/.dbxrc.
The topmost frame is obviously bad: the address 0x10c75 is far away from any user code in an x64 process (consult the proc -map dbx command to make sure).
So here we seem to have an indirect call that used something bad as the callee address. Looking at the code, it is almost certainly this line:
res = (*ctx_who(ctx))(1, phase,
                      exception_object->exception_class,
                      exception_object, ctx);
So what went wrong? To get an idea, let’s compare the stack with that of a working program. For instance, compile the main executable with g++ instead of gcc. This combination works:
$ g++ -R. -L. -lA main.c -m64
$ ./a.out
# no crash
$ dbx a.out
# a bit of debugging involved, skipped for brevity
...
(dbx) where -l -h
=>[1] libgcc_s.so.1:_Unwind_RaiseException(exc = 0x4111c0), line 88 in "unwind.inc"
[2] libstdc++.so.6.0.18:__cxa_throw(obj = <value unavailable>, tinfo = <value unavailable>, dest = <value unavailable>), line 79 in "eh_throw.cc"
[3] libA.so:foo(), at 0xfffffd7fff340eb3
[4] a.out:main(), at 0x400d0e
Aha, so the variant that works uses the unwinder provided by libgcc_s instead of libc! Yes, the Solaris libc does implement the same unwinding interface, but it does so for the Sun ABI.
The Sun ABI and the Itanium C++ ABI differ in the contents of the opaque structure called _Unwind_Context. Apparently, where one expects a personality routine address, the other has something else, which is why our program ends up executing something that is not code.
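To make the mismatch concrete, here is a toy model in C++. The two struct layouts and all names below are invented for illustration; they are not the real libc or libgcc_s definitions of _Unwind_Context. The sketch only shows that code built against one layout picks up the wrong word when it is handed an object filled in according to the other.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical layout used by the code that fills in the context.
struct ProducerContext {
    std::uintptr_t flags;        // some unrelated bookkeeping value
    std::uintptr_t personality;  // address of the personality routine
};

// Hypothetical layout assumed by the code that reads the context.
struct ConsumerContext {
    std::uintptr_t personality;  // expected at offset 0 here
    std::uintptr_t flags;
};

static_assert(sizeof(ProducerContext) == sizeof(ConsumerContext),
              "the toy layouts must have the same size");

// Interprets an opaque context the way the consumer's layout dictates and
// returns whatever sits in the slot it believes holds the personality routine.
std::uintptr_t personality_as_seen_by_consumer(const void* opaque)
{
    ConsumerContext view;
    std::memcpy(&view, opaque, sizeof view);
    return view.personality;
}
```

With a ProducerContext whose flags field is 0x1, the consumer's view of the "personality routine" is 0x1, a value that was never meant to be called. Making an indirect call through such a value is the kind of failure the bad top frame in the dbx stack suggests.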
So why doesn’t the problem appear on other architectures, like x86 or sparc? The answer is very simple: libc has no unwinder implementation there. Compare
$ elfdump -s /lib/64/libc.so.1 | grep Unwind | wc -l
46
with the same for 32-bit x86:
$ elfdump -s /lib/libc.so.1 | grep Unwind | wc -l
0
Solutions
The general idea of the solution is to avoid having a second implementation of the stack unwinder in the first place, that is, to get rid of the second one completely. That’s beyond the user’s control, as it essentially requires rebuilding the /lib/64/libc.so.1 library.
The second best idea is to hide one unwinder behind the other. There are several ways to achieve that.
A temporary solution
Use LD_PRELOAD so that libgcc_s.so is already loaded by the time the application starts. The effect is that libgcc_s will be first in the list of libraries searched; therefore, all unwinding will go through it. Let’s verify using the original a.out:
$ LD_PRELOAD=/.../lib/amd64/libgcc_s.so.1 ./a.out
No crash!
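Whether a workaround took effect can also be checked from inside the process. The sketch below asks the dynamic linker which loaded object supplies a given symbol under the default search order, using dlsym with RTLD_DEFAULT followed by dladdr. The helper name provider_of is hypothetical, and older glibc versions may need -ldl on the link line. It also assumes the default lookup matches what libstdc++ binds to, which does not account for Solaris direct bindings.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // glibc exposes RTLD_DEFAULT and dladdr only with this
#endif
#include <dlfcn.h>
#include <string>

// Returns the path of the loaded object that defines `symbol` according to
// the default symbol search order, or a placeholder if it cannot be found.
// `provider_of` is a hypothetical helper name.
std::string provider_of(const char* symbol)
{
    void* addr = dlsym(RTLD_DEFAULT, symbol);
    if (addr == nullptr)
        return "<not found>";

    Dl_info info;
    if (dladdr(addr, &info) == 0 || info.dli_fname == nullptr)
        return "<unknown object>";

    return info.dli_fname;
}
```

Printing provider_of("_Unwind_RaiseException") from the main program, with and without LD_PRELOAD set, should show which library's unwinder comes first in the search order.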
Link a.out with libgcc_s
If you link the main executable with libgcc_s explicitly, it is easy to force the loader to load it before libc and, therefore, effectively leave only one version of unwinding. Then things will just work.
$ gcc -R. -L. -lA main.c -m64 -l gcc_s
$ ./a.out
Again, no crash.
The same effect can be achieved by linking with g++, which adds libgcc_s behind the scenes. This may be the reason most people never run into this problem: the main program for a C++ library is usually a C++ program too. It doesn’t have to be, though.
Use a different version of the GCC runtime
I noticed that different versions of gcc have libstdc++.so with a different order of recorded dependencies on other libraries. For instance, here we have libc preceding libgcc_s:
$ elfdump -d .../gcc/4.8.2/intel-S2/lib/amd64/libstdc++.so | grep NEED
[0] NEEDED 0x26e36 libm.so.2
[1] NEEDED 0x26e52 libc.so.1
[2] NEEDED 0x26e92 libgcc_s.so.1
And here, the order is reversed:
$ elfdump -d .../amd64/libstdc++.so | grep NEED
[0] NEEDED 0x270d7 libm.so.2
[1] NEEDED 0x270f3 librt.so.1
[2] NEEDED 0x27107 libgcc_s.so.1
[3] NEEDED 0x2712f libc.so.1
So in this case, the loader would load libgcc_s before libc and we’re fine. This can be observed by running ldd on a.out:
$ LD_LIBRARY_PATH=.../4.9.0/intel-S2/lib/amd64/ ldd a.out
libA.so => ./libA.so
libstdc++.so.6 => .../amd64//libstdc++.so.6
libm.so.2 => /lib/64/libm.so.2
libgcc_s.so.1 => .../amd64//libgcc_s.so.1
libc.so.1 => /lib/64/libc.so.1
...
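The ldd listing has an in-process counterpart on systems that provide dl_iterate_phdr in <link.h>. The sketch below collects the names of the objects mapped into the running process; the helper names are hypothetical. The order in which objects are reported is an assumption on my part (in practice it tends to follow load order), so treat it as a hint and not as a statement of the symbol search order.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // glibc declares dl_iterate_phdr only with this
#endif
#include <link.h>
#include <cstddef>
#include <string>
#include <vector>

// Callback for dl_iterate_phdr: records the name of each loaded object.
// An empty name usually denotes the main program.
static int collect_name(struct dl_phdr_info* info, std::size_t, void* data)
{
    auto* names = static_cast<std::vector<std::string>*>(data);
    const char* name = info->dlpi_name;
    names->push_back((name != nullptr && *name != '\0') ? name : "<unnamed>");
    return 0;  // a zero return value tells dl_iterate_phdr to keep going
}

// Returns the names of all objects currently loaded into the process.
// `loaded_objects` is a hypothetical helper name.
std::vector<std::string> loaded_objects()
{
    std::vector<std::string> names;
    dl_iterate_phdr(collect_name, &names);
    return names;
}
```

Printing the returned names one per line gives a list that can be compared with the ldd output, for example to see where libgcc_s.so.1 and libc.so.1 appear relative to each other.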
Useful tools
- elfdump(1):
  - the -s option shows symbols (and doesn’t truncate the names as readelf does! This difference caused me a lot of grief)
  - the -d option shows, among other things, the libraries the loadobject depends on
- The LD_DEBUG environment variable; it is described in ld.so.1(1) and is incredibly useful:
  - LD_DEBUG=help /bin/true to get usage
  - LD_DEBUG=symbols,detail ./a.out 2>ld.log.txt to learn how and where symbols are found by the loader, ld.so
References
- Another investigation of the same problem
- Mixing libc and libgcc_s unwinders on 64-bit Solaris 10+/x86 breaks EH - a related gcc bug
- Itanium C++ ABI, the exception unwinder part
- libunwind - a standalone unwinder
- clang’s libunwind - yet another unwinder library