pstack, coreadm and symbol tables

When investigating a crash, you sometimes see question marks (????????) in the stack trace instead of a function name. Why is that? Can it be fixed somehow? This article explores the problem and suggests some solutions.

Where do the symbol names come from?

In ELF files, symbols reside in two sections: .symtab and .dynsym.

On recent versions of Solaris, there is a new section, .SUNW_ldynsym, but for the purpose of this article it is identical to .dynsym, so in the interest of keeping things simple, I’m not going to mention it again.

Both sections are essentially tables that map a name to a value; here, we are interested in the function names, so that value would be the function’s address.

When pstack unwinds the stack (starting from the values of the $pc and $fp/$sp registers, which come from a special NOTE segment of the core file), it goes through the symbol tables of all the files involved and finds the symbol with the closest value.
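To make the "closest value" idea concrete, here is a minimal sketch in Python. It is not how pstack is implemented; the function name closest_symbol and the three-entry table are hypothetical and exist only for illustration.

```python
from bisect import bisect_right

def closest_symbol(symbols, addr):
    """Return (name, offset) for the symbol with the greatest value <= addr, or None."""
    table = sorted(symbols)                    # (value, name) pairs ordered by address
    values = [value for value, _ in table]
    i = bisect_right(values, addr) - 1         # last symbol that starts at or before addr
    if i < 0:
        return None                            # addr lies below every known symbol
    value, name = table[i]
    return name, addr - value

# Hypothetical symbol table, not taken from any real file.
symbols = [(0x1000, "alpha"), (0x1040, "beta"), (0x10C0, "gamma")]

print(closest_symbol(symbols, 0x1052))   # ('beta', 18), i.e. beta+0x12
print(closest_symbol(symbols, 0x0800))   # None
```

The offset returned alongside the name corresponds to the "+ c" or "+ a8" suffix that pstack prints after each frame.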

For example, suppose we have this core file:

$ pstack core
core 'core' of 7719:    ./a.out
 fece586c strlen   (8050ada, 8047a38, fed91c20, 0) + c
 fed40814 printf   (8050ad8, 0) + a8
 08050969 ???????? (0, 8047b30, 8047a84, 80508bd, 1, 8047a90)
 080509a2 main     (1, 8047a90, 8047a98, fed93e40) + 12
 080508bd _start   (1, 8047b98, 0, 8047ba0, 8047bdc, 8047be7) + 7d

The address fece586c belongs to libc.so.1, as can be seen from the pmap(1) output:

$ pmap core
core 'core' of 7719:    ./a.out
08046000       8K rwx--    [ stack ]
08050000       4K r-x--
08060000       4K rwx--
08061000     128K rwx--    [ heap ]
FECC0000     760K r-x--  /lib/libc.so.1
FED8E000      32K rw---  /lib/libc.so.1
FED96000       8K rw---  /lib/libc.so.1
...

It is in the code segment (the r-x-- permissions gave that away) of /lib/libc.so.1. Looking at libc.so.1 with elfdump, we can see that the global function strlen starts at offset 0x25860.

$ elfdump -s /usr/lib/libc.so.1 | grep strlen
    [2603]  0x00025860 0x00000045  FUNC GLOB  D   37 .text          strlen

So in our late process, strlen would have resided at 0xFECC0000 (the base address of libc.so.1 in memory) + 0x25860 = 0xFECE5860. Hence, 0xfece586c is 0xFECE5860 + 0xc, which is strlen+0xc.
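The same arithmetic can be written down as a short Python sketch. The numbers are copied from the pmap and elfdump output above; the helper find_segment is hypothetical, and the a.out label on the first row is my own addition, since pmap leaves that column empty.

```python
# (start address, size in bytes, permissions, object) rows from the pmap output.
SEGMENTS = [
    (0x08050000, 4 * 1024, "r-x--", "a.out"),            # name column is empty in pmap; label added here
    (0xFECC0000, 760 * 1024, "r-x--", "/lib/libc.so.1"),
    (0xFED8E000, 32 * 1024, "rw---", "/lib/libc.so.1"),
]

def find_segment(addr):
    """Return (start, permissions, object) of the mapping that contains addr, or None."""
    for start, size, perms, name in SEGMENTS:
        if start <= addr < start + size:
            return start, perms, name
    return None

STRLEN_VALUE = 0x25860   # symbol value reported by elfdump -s
STRLEN_SIZE = 0x45       # symbol size reported by elfdump -s

pc = 0xFECE586C                          # top frame in the pstack output
base, perms, name = find_segment(pc)
offset_in_object = pc - base             # 0x2586c, relative to the libc.so.1 base
offset_in_strlen = offset_in_object - STRLEN_VALUE

assert 0 <= offset_in_strlen < STRLEN_SIZE   # the address really is inside strlen
print(f"{name}: strlen+{offset_in_strlen:#x}")   # /lib/libc.so.1: strlen+0xc
```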

Symbol tables

As you can see from the above example, not all symbols have been found. In this case, the address 0x08050969 hasn’t been mapped to any symbol. That address belongs to the code segment of a.out, which starts at 0x08050000, and that’s all we can tell. Yet another symbol from the same segment is visible: main, at 0x080509a2.

The difference arises because the two symbols come from different symbol tables, and an executable file is required to carry only one of them: .dynsym (strictly speaking, that probably applies to dynamic executables only, but since Solaris 10 strongly discourages static linking, we almost always deal with dynamic executables and shared libraries).

This .dynsym section is used by the run-time linker (ld.so.1(1)) and contains global names that the program “exports” or “imports” from libraries; a call to “main” is resolved at run time by looking up the name “main” in the .dynsym section and jumping to the address associated with the symbol found. Since this information is absolutely necessary at run time, the .dynsym section always resides in a loadable segment and is always a part of the process’ memory image (and, therefore, the core file).

On the other hand, the .symtab section, which contains all symbols - including the local ones - is useful mostly when linking relocatable object files (*.o). References inside one file can be resolved at compile time using offsets, so static functions do not need a name at run time; they are called directly using an offset from the current position. This is why the .symtab section does not belong to a loadable segment and does not contribute to the process’ memory image in any way. And this is why it used to be customary to remove the symbol table from the final executables (using strip(1), for example) to save space and make the life of support engineers harder.

In our case, ./a.out has indeed been stripped:

$ elfdump -c a.out | grep symtab
$ elfdump -c a.out | grep dynsym
Section Header[4]:  sh_name: .dynsym

It does have .dynsym, but no .symtab. By the way, the main symbol indeed is present in .dynsym and has the address 0x08050990:

$ elfdump -s -N .dynsym a.out | grep main
      [28]  0x08050990 0x0000001a  FUNC GLOB  D    0 .text          main
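Here is a small Python sketch of what a lookup against the stripped a.out might look like. It assumes that the lookup only accepts an address that falls inside a symbol's [value, value + size) range, which would explain why 0x08050969 is printed as question marks rather than as an offset from some earlier symbol. The value and size of main come from the elfdump output above; the value of _start is derived from the "_start + 7d" frame (0x080508bd - 0x7d), and its size is an assumption made for this sketch.

```python
# (value, size, name) entries standing in for .dynsym of the stripped a.out.
DYNSYM = [
    (0x08050840, 0x80, "_start"),   # size is assumed, not taken from elfdump
    (0x08050990, 0x1A, "main"),     # value and size from elfdump -s -N .dynsym
]

def symbolize(table, addr):
    """Return 'name+0xoff' if addr lies inside a known symbol, else question marks."""
    for value, size, name in table:
        if value <= addr < value + size:
            return f"{name}+{addr - value:#x}"
    return "????????"

print(symbolize(DYNSYM, 0x080509A2))   # main+0x12
print(symbolize(DYNSYM, 0x080508BD))   # _start+0x7d
print(symbolize(DYNSYM, 0x08050969))   # ????????
```

The third address sits between the end of _start and the beginning of main, so no entry in this table covers it.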

Loadable objects (executables and shared libraries)

Let’s recompile a.out and see if it helps:

$ CC a.cc
$ ./a.out
Segmentation Fault (core dumped)
$ pstack core
core 'core' of 11761:   ./a.out
 fece586c strlen   (8050ada, 8047a38, fed91c20, 0) + c
 fed40814 printf   (8050ad8, 0) + a8
 08050969 __1cDfoo6F_i_ (0, 8047b30, 8047a84, 80508bd, 1, 8047a90) + 19
 080509a2 main     (1, 8047a90, 8047a98, fed93e40) + 12
 080508bd _start   (1, 8047b98, 0, 8047ba0, 8047bdc, 8047be7) + 7d

We can now see the name __1cDfoo6F_i_ (the mangled name of int foo()) instead of ???, but where does pstack get this information? __1cDfoo6F_i_ is not present in .dynsym, so there was no information about this name in the memory image of the process when it died:

$ strings core | grep __1cDfoo6F_i_

pstack(1) is smarter than that: it finds out which program generated the core file, locates the program and uses its .symtab (if present, of course) to map the symbols. Here’s an excerpt from proc(1):

     Some of the proc tools need to derive the  name  of  the
     executable corresponding to the process which dumped core or
     the names of shared libraries associated with  the  process.
     These files are needed, for example, to provide symbol table
     information for pstack(1). If the proc tool in  question  is
     unable  to  locate  the needed executable or shared library,
     some symbol information is unavailable  for  display.

Let’s delete a.out and see what happens:

$ rm a.out
$ pstack core
core 'core' of 11761:   ./a.out
 fece586c strlen   (8050ada, 8047a38, fed91c20, 0) + c
 fed40814 printf   (8050ad8, 0) + a8
 08050969 ???????? (0, 8047b30, 8047a84, 80508bd, 1, 8047a90)
 080509a2 main     (1, 8047a90, 8047a98, fed93e40) + 12
 080508bd _start   (1, 8047b98, 0, 8047ba0, 8047bdc, 8047be7) + 7d

We immediately get our ???’s back.

So pstack uses the core file and the executable/libraries in order to print readable names in the stack trace.

Core file contents

If you have to send your core file to another person for inspection, you put them at a disadvantage: that person might not have your executable, and even the system libraries might be slightly different. If pstack went looking for the address-to-symbol mapping there, it might end up printing wrong symbol names or question marks, making the core file more harmful than helpful.

There is a way to embed the symbol tables into the core file: use the coreadm(1M) command. It lets you specify what kind of content the system puts into the core file, and it can even tell the system to pull .symtab from the executable and shared libraries:

# as root:
# coreadm -I default+symtab

See the coreadm(1M) man page for more information.

Side note: in fact, the symbol tables of libc.so.1 and ld.so.1 were present in my core file even without the “symtab” content requested, as can be seen with elfdump -c core; this seems to be an undocumented but useful feature.

Let’s turn .symtab inclusion on and see if it helps:

$ su -
# coreadm -I default+symtab
# exit
$ ./a.out   
Segmentation Fault (core dumped)
$ rm a.out
$ pstack core
core 'core' of 13604:   ./a.out
 fece586c strlen   (8050ada, 8047a38, fed91c20, 0) + c
 fed40814 printf   (8050ad8, 0) + a8
 08050969 __1cDfoo6F_i_ (0, 8047b30, 8047a84, 80508bd, 1, 8047a90) + 19
 080509a2 main     (1, 8047a90, 8047a98, fed93e40) + 12
 080508bd _start   (1, 8047b98, 0, 8047ba0, 8047bdc, 8047be7) + 7d

The core file now contains several symbol tables, one per load object:

$ elfdump -c core | grep symtab
Section Header[1]:  sh_name: .symtab
Section Header[3]:  sh_name: .symtab
Section Header[6]:  sh_name: .symtab
Section Header[8]:  sh_name: .symtab
Section Header[10]:  sh_name: .symtab
Section Header[12]:  sh_name: .symtab

and one of them has the definition of our int foo() function that starts at 0x08050950:

$ elfdump -s core | grep foo
      [56]  0x08050950 0x00000034  FUNC LOCL  D    0      __1cDfoo6F_i_
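The three experiments so far (binary present, binary deleted, .symtab embedded in the core file) can be summarized in a Python sketch. This is a simplified model, not the actual logic of pstack: the helper names available_tables and symbolize are hypothetical, and the tables hold only the two symbols whose values and sizes appear in the elfdump output in this article.

```python
# (value, size, name) entries; values and sizes are taken from the elfdump output.
DYNSYM_IN_CORE = [(0x08050990, 0x1A, "main")]
SYMTAB_OF_A_OUT = [
    (0x08050950, 0x34, "__1cDfoo6F_i_"),   # local symbol, present only in .symtab
    (0x08050990, 0x1A, "main"),
]

def available_tables(binary_on_disk, symtab_in_core):
    """Model which symbol tables a tool could consult in each scenario."""
    tables = [DYNSYM_IN_CORE]              # .dynsym is part of the memory image
    if binary_on_disk or symtab_in_core:
        tables.append(SYMTAB_OF_A_OUT)     # read from a.out or from the core file
    return tables

def symbolize(tables, addr):
    """Return 'name+0xoff' from the first table that covers addr, else question marks."""
    for table in tables:
        for value, size, name in table:
            if value <= addr < value + size:
                return f"{name}+{addr - value:#x}"
    return "????????"

frame = 0x08050969
print(symbolize(available_tables(True, False), frame))    # __1cDfoo6F_i_+0x19
print(symbolize(available_tables(False, False), frame))   # ????????
print(symbolize(available_tables(False, True), frame))    # __1cDfoo6F_i_+0x19
```

The offset 0x19 is simply 0x08050969 - 0x08050950, which matches the "+ 19" in the pstack output.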

How to prevent ??? from appearing in the stack trace?

Use pstack on the same machine

First and foremost, you can avoid many problems by using pstack on the same machine where the core file was generated. This ensures that pstack uses the same binary and libraries as the process that generated the core. Otherwise, you might end up looking at the wrong symbols or (best case scenario, really) a lot of question marks.

Don’t strip binaries

On Solaris, it is no longer customary to strip binaries. The space savings are questionable and the performance of an unstripped binary does not suffer, so why make life difficult for those who will debug it?

Don’t delete binaries

By default, Solaris does not include .symtab in core files (except for libc.so.1 and ld.so.1, as I mentioned earlier, but that is not relevant here when we talk about user executables and libraries). So if you delete or move an executable or library after the core file has been generated, pstack won’t be able to find its .symtab and therefore won’t be able to map addresses to local function names.

In other words, unless you’ve changed core file contents with coreadm(1M), don’t delete your binaries before you have a chance to inspect the core file. They are still useful.

Use coreadm

Most of the problems above can be eliminated at a single blow:

# coreadm -I default+symtab

This tells the system to pull the .symtab sections from every binary involved in the process and put them into the core file. You no longer need the binaries to see names instead of numbers in the stack trace.

Maxim Kartashev

Pragmatic, software engineer. Working for Altium Tasking on compilers and tools for embedded systems.