pstack, coreadm and symbol tables
When investigating a crash, you sometimes see question marks (????????) in the stack trace instead of a function name. Why is that, and can it be fixed? This article explores the problem and suggests some solutions.
Where do the symbol names come from?
In ELF files, symbols reside in two sections: .symtab and .dynsym. On recent versions of Solaris there is a new section, .SUNW_ldynsym, but for the purposes of this article it is identical to .dynsym, so in the interest of keeping things simple, I’m not going to mention it again.
Both sections are essentially tables that map a name to a value; here, we are interested in function names, so the value is the function’s address.
When pstack unwinds the stack (starting from the values of the $pc and $fp/$sp registers, which come from a special NOTE segment of the core file), it goes through the symbol tables of all the files involved and, for each address in the trace, finds the symbol whose value is closest to it without exceeding it.
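The lookup described above can be sketched in a few lines of Python. This is only an illustration, not pstack's actual code: the symbol table is hypothetical, and the extra check that the address falls inside the symbol's size is an assumption of the sketch (it would explain why an unmatched address is printed as question marks rather than as a large offset from some earlier symbol).

```python
import bisect

# Hypothetical symbol table: (start address, size in bytes, name).
SYMBOLS = sorted([
    (0x1000, 0x40, "alpha"),
    (0x1040, 0x20, "beta"),
    (0x1100, 0x80, "gamma"),
])
STARTS = [start for start, _size, _name in SYMBOLS]


def lookup(addr):
    """Return 'name+0xoffset' for addr, or None if no symbol covers it."""
    # Index of the last symbol whose start address is <= addr.
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        return None
    start, size, name = SYMBOLS[i]
    # Assumption: an address only counts as a hit if it lies inside the symbol.
    if addr >= start + size:
        return None
    return f"{name}+{addr - start:#x}"


print(lookup(0x1044))  # beta+0x4
print(lookup(0x1070))  # None: falls in the gap between beta and gamma
```

Keeping the table sorted by start address makes each lookup a binary search, which matters once a table holds thousands of entries.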
For example, suppose we have this core file:
$ pstack core
core 'core' of 7719: ./a.out
fece586c strlen (8050ada, 8047a38, fed91c20, 0) + c
fed40814 printf (8050ad8, 0) + a8
08050969 ???????? (0, 8047b30, 8047a84, 80508bd, 1, 8047a90)
080509a2 main (1, 8047a90, 8047a98, fed93e40) + 12
080508bd _start (1, 8047b98, 0, 8047ba0, 8047bdc, 8047be7) + 7d
The address fece586c belongs to libc.so.1, as can be seen from the pmap(1) output:
$ pmap core
core 'core' of 7719: ./a.out
08046000 8K rwx-- [ stack ]
08050000 4K r-x--
08060000 4K rwx--
08061000 128K rwx-- [ heap ]
FECC0000 760K r-x-- /lib/libc.so.1
FED8E000 32K rw--- /lib/libc.so.1
FED96000 8K rw--- /lib/libc.so.1
...
It is in the code segment (the r-x-- permissions give that away) of /lib/libc.so.1.
Looking at libc.so.1 with elfdump, we can see that the global function strlen starts at offset 0x25860:
$ elfdump -s /usr/lib/libc.so.1 | grep strlen
[2603] 0x00025860 0x00000045 FUNC GLOB D 37 .text strlen
So in our process, which has since passed away, it would have resided at 0xFECC0000 (the base address of libc.so.1 in memory) + 0x25860 = 0xFECE5860. Hence, 0xfece586c is 0xFECE5860 + 0xc, which is strlen+0xc.
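The same arithmetic as a worked check in Python. The constants are copied from the pmap and elfdump output above; the helper name runtime_address is made up for this illustration. The sketch assumes the symbol value in libc.so.1 is an offset from the start of the library's first mapping, which is the case here.

```python
# Values copied from the pmap and elfdump output above.
LIBC_BASE = 0xFECC0000      # start of the r-x mapping of /lib/libc.so.1
STRLEN_OFFSET = 0x25860     # value of the strlen symbol in libc.so.1
PC = 0xFECE586C             # top frame reported by pstack


def runtime_address(base, offset):
    """Address of a symbol once its object is mapped at `base`."""
    return base + offset


strlen_addr = runtime_address(LIBC_BASE, STRLEN_OFFSET)
print(hex(strlen_addr))        # 0xfece5860
print(hex(PC - strlen_addr))   # 0xc
```

The resulting offset 0xc is smaller than the symbol's size (0x45 in the elfdump row), so the address does fall inside strlen.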
Symbol tables
As you can see from the above example, not all symbols have been found. In this case, the address 0x08050969 hasn’t been mapped to any symbol. That address belongs to the a.out code segment starting at 0x08050000, and that’s all we can tell.
Yet another address from the same segment, 0x080509a2, is resolved to a symbol: main.
The difference arises because those two symbols come from different symbol tables, and an executable file is allowed to carry only one of them: .dynsym (strictly speaking, that probably applies to dynamic executables only, but since Solaris 10 strongly discourages static linking, we almost always have to deal with dynamic executables and shared libraries).
This .dynsym section is used by the run-time linker (ld.so.1(1)) and contains the global names that the program “exports” to or “imports” from libraries; a call to “main” is resolved at run time by looking up the name “main” in the .dynsym section and jumping to the address associated with the symbol found. Since this information is absolutely necessary at run time, the .dynsym section always resides in a loadable segment and is always part of the process’ memory image (and, therefore, of the core file).
On the other hand, the .symtab section, which contains all symbols, including the local ones, is useful mostly when linking relocatable object files (*.o). References inside one file can be resolved at compile time using offsets, so static functions do not need a name at run time: they are called directly using an offset from the current position. This is why the .symtab section does not belong to a loadable segment and does not contribute to the process’ memory image in any way. And this is why it used to be customary to remove the symbol table from final executables (using strip(1), for example) to save space and make the lives of support engineers harder.
In our case, ./a.out has indeed been stripped:
$ elfdump -c a.out | grep symtab
$ elfdump -c a.out | grep dynsym
Section Header[4]: sh_name: .dynsym
It does have .dynsym, but no .symtab. By the way, the main symbol is indeed present in .dynsym and has the address 0x08050990:
$ elfdump -s -N .dynsym a.out | grep main
[28] 0x08050990 0x0000001a FUNC GLOB D 0 .text main
Loadable objects (executables and shared libraries)
Let’s recompile a.out, this time without stripping it, and see if it helps:
$ CC a.cc
$ ./a.out
Segmentation Fault (core dumped)
$ pstack core
core 'core' of 11761: ./a.out
fece586c strlen (8050ada, 8047a38, fed91c20, 0) + c
fed40814 printf (8050ad8, 0) + a8
08050969 __1cDfoo6F_i_ (0, 8047b30, 8047a84, 80508bd, 1, 8047a90) + 19
080509a2 main (1, 8047a90, 8047a98, fed93e40) + 12
080508bd _start (1, 8047b98, 0, 8047ba0, 8047bdc, 8047be7) + 7d
We can now see the name __1cDfoo6F_i_ (the mangled name of int foo()) instead of ???, but where does pstack get this information? __1cDfoo6F_i_ is not present in .dynsym, so there was no information about this name in the memory image of the process when it died:
$ strings core | grep __1cDfoo6F_i_
pstack(1) is smarter than that: it finds out which program generated this core file, locates the program and uses its .symtab (if present, of course) to map the symbols. Here’s an excerpt from proc(1):
Some of the proc tools need to derive the name of the
executable corresponding to the process which dumped core or
the names of shared libraries associated with the process.
These files are needed, for example, to provide symbol table
information for pstack(1). If the proc tool in question is
unable to locate the needed executable or shared library,
some symbol information is unavailable for display.
Let’s delete a.out and see what happens:
$ rm a.out
$ pstack core
core 'core' of 11761: ./a.out
fece586c strlen (8050ada, 8047a38, fed91c20, 0) + c
fed40814 printf (8050ad8, 0) + a8
08050969 ???????? (0, 8047b30, 8047a84, 80508bd, 1, 8047a90)
080509a2 main (1, 8047a90, 8047a98, fed93e40) + 12
080508bd _start (1, 8047b98, 0, 8047ba0, 8047bdc, 8047be7) + 7d
We immediately get our ???’s back.
So pstack uses both the core file and the executable/libraries in order to print readable names in the stack trace.
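To make the effect of a missing .symtab concrete, here is a small Python model of the experiment. It is a sketch under stated assumptions, not a description of how pstack is implemented: the resolve function and the order in which it tries tables are invented for illustration. The symbol values come from the elfdump listings in this article (the size 0x34 of foo is the one elfdump reports for it).

```python
# Symbols taken from the elfdump output in this article: (start, size, name).
DYNSYM = [(0x08050990, 0x1A, "main")]
SYMTAB = DYNSYM + [(0x08050950, 0x34, "__1cDfoo6F_i_")]


def resolve(addr, tables):
    """Try each available table in order; return the first covering symbol."""
    for table in tables:
        for start, size, name in table:
            if start <= addr < start + size:
                return f"{name} + {addr - start:x}"
    # Nothing covers the address: print it the way pstack does.
    return "????????"


# Executable deleted or stripped: only .dynsym (part of the core) is available.
print(resolve(0x08050969, [DYNSYM]))          # ????????
# Unstripped executable still on disk: its .symtab can be consulted too.
print(resolve(0x08050969, [SYMTAB, DYNSYM]))  # __1cDfoo6F_i_ + 19
print(resolve(0x080509A2, [DYNSYM]))          # main + 12
```

The global symbol main resolves either way because it lives in .dynsym; the local foo needs a table that only the on-disk file provides.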
Core file contents
If you have to send your core file to another person for inspection, you put him/her at a disadvantage: that person might not have your executable, and even the system libraries might be slightly different. If pstack looks for the address-to-symbol mapping there, it might end up printing wrong symbol names and question marks, making the core file more harmful than helpful.
There is a way to embed the symbol tables into the core file: use the coreadm(1M) command. It lets you specify what kind of content you want the system to put into the core file, and it can even instruct the system to pull .symtab from the executable and shared libraries:
# as root:
# coreadm -I default+symtab
More information on coreadm can be found in its man page: coreadm(1M).
Side note: in fact, the symbol tables of libc.so.1 and ld.so.1 were present in my core file even without the “symtab” content requested, as can be seen with elfdump -c core; this seems to be an undocumented but useful feature.
Let’s turn .symtab inclusion on and see if it helps:
$ su -
# coreadm -I default+symtab
# exit
$ ./a.out
Segmentation Fault (core dumped)
$ rm a.out
$ pstack core
core 'core' of 13604: ./a.out
fece586c strlen (8050ada, 8047a38, fed91c20, 0) + c
fed40814 printf (8050ad8, 0) + a8
08050969 __1cDfoo6F_i_ (0, 8047b30, 8047a84, 80508bd, 1, 8047a90) + 19
080509a2 main (1, 8047a90, 8047a98, fed93e40) + 12
080508bd _start (1, 8047b98, 0, 8047ba0, 8047bdc, 8047be7) + 7d
The core file now contains many symbol tables, one per load object:
$ elfdump -c core | grep symtab
Section Header[1]: sh_name: .symtab
Section Header[3]: sh_name: .symtab
Section Header[6]: sh_name: .symtab
Section Header[8]: sh_name: .symtab
Section Header[10]: sh_name: .symtab
Section Header[12]: sh_name: .symtab
and one of them has the definition of our int foo() function, which starts at 0x08050950:
$ elfdump -s core | grep foo
[56] 0x08050950 0x00000034 FUNC LOCL D 0 __1cDfoo6F_i_
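If you want to check such a row against a frame address by hand, a few lines of Python are enough. This is a rough sketch: parse_row and covers are hypothetical helpers, and the parser only assumes what the listings in this article show, namely that the second and third columns are the hexadecimal value and size and that the name is the last column.

```python
def parse_row(line):
    """Parse one `elfdump -s` row into (start, size, name)."""
    fields = line.split()
    # fields[0] is the index such as "[56]"; the name is always the last column.
    return int(fields[1], 16), int(fields[2], 16), fields[-1]


def covers(row, addr):
    """True if addr lies inside the symbol described by row."""
    start, size, _name = row
    return start <= addr < start + size


row = parse_row("[56] 0x08050950 0x00000034 FUNC LOCL D 0 __1cDfoo6F_i_")
print(row[2], covers(row, 0x08050969))  # __1cDfoo6F_i_ True
```

The frame address 0x08050969 lies 0x19 bytes past the start of the symbol, matching the "+ 19" that pstack prints.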
How to prevent ??? from appearing in the stack trace?
Use pstack on the same machine
First and foremost, you can avoid many problems by using pstack on the same machine where the core file was generated. This ensures that pstack uses the same binary and libraries as the process that generated the core. Otherwise, you might end up looking at the wrong symbols or (best case scenario, really) a lot of question marks.
Don’t strip binaries
On Solaris, it is no longer customary to strip binaries. The space savings are questionable and the performance of an unstripped binary does not suffer, so why make life difficult for those who will have to debug it?
Don’t delete binaries
By default, Solaris does not include .symtab in core files (except for libc.so.1 and ld.so.1, as I mentioned earlier, but that is not relevant here when we talk about user executables and libraries). So if you delete or move an executable/library after the core file has been generated, pstack won’t be able to find its .symtab and thus map addresses to local function names.
In other words, unless you’ve changed the core file contents with coreadm(1M), don’t delete your binaries before you have a chance to inspect the core file. They are still useful.
Use coreadm
Most of the problems above can be eliminated at a single blow:
# coreadm -I default+symtab
This tells the system to pull the .symtab sections from every binary involved in the process and put them into the core file. You no longer need the binaries to see names instead of numbers in the stack trace.
References
- pstack(1)
- coreadm(1M)
- Inside ELF Symbol Tables
- Understanding ELF using readelf and objdump
- Linker and Libraries Guide, the source of linker/ELF information
- An article about Core Dump Management on the Solaris OS