(Not) Managing More Than One Of The Same Object In A Process

Ali Bahrami — Wednesday January 06, 2016

Surfing with the Linker-Aliens

I had a conversation with a coworker this week about one of those recurring questions that come up from time to time. There is an existing and widely used system library, and there's a desire to provide a better variant of it, using the same library name, with the same SONAME. The two objects offer the same interfaces, but they cannot coexist within a given process. The question was about whether the linkers can prevent both from being loaded into a single process. I had to deliver the unwelcome news that the linkers cannot do that with 100% reliability, and that they aren't intended to support that sort of design. This sort of discussion comes up frequently enough that I think it would be useful to explore the underlying issues.

For the purposes of this discussion, let's call that library libbar. The 2 copies of libbar live in different locations on the system, but are otherwise the same from a linking point of view. The question was: If the main a.out program uses one libbar, and other dependencies of a.out use the other libbar, is there any way to guarantee that only the good libbar gets loaded, and is used by both?

The short answer is no. The only completely safe way to manage this is to only have one libbar. If you need a variant, give it a different name, and SONAME, and possibly different function names too, or at least use direct bindings. The only really simple thing is one library.

Note that this is not the same question as "How can I design libbar such that having multiple copies loaded and used simultaneously is safe?". That's an easier question to answer: You do it by making the library completely reentrant, and by ensuring that all new APIs are backward compatible with old versions. That is, of course, easier said than done.

To demonstrate how things can go wrong with 2 copies of 1 library, I wrote a small test program. Before we dive into this, you might find it useful to review how library names and SONAMEs work: Please see How To Name A Solaris Shared Object.

It helps to first understand how the runtime linker finds dependencies for an object. It's pretty simple:

If the ld -R option was used to record a runpath for the object when it was built, the runtime linker looks in each directory specified by the runpath, stopping at the first directory that supplies the needed library.
If the runpath does not lead to the dependency, the default system library directories are examined. The default is subject to change, but currently is /lib and /usr/lib for 32-bit processes, and /lib/64 and /usr/lib/64 for 64-bit processes.
If a dependency is found, the dev/inode of the file is compared to the dev/inode for any objects already loaded in the process. If there is a match, these are the same file in the same filesystem, and so, the already in memory copy is used. Note that if the dev/inode does not match, ld.so.1 is perfectly willing to load a library with the same name and SONAME as an already loaded object. These are physically different files.
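
The dev/inode comparison in the last step can be sketched in a few lines of C. This is only an illustration of the idea, not the code ld.so.1 actually uses; the helper name same_file() is made up for this example.

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of ld.so.1: returns true when both
 * paths resolve to the same file on the same filesystem. stat()
 * follows symlinks, so a symlink and its target compare equal.
 */
bool
same_file(const char *path_a, const char *path_b)
{
        struct stat a, b;

        /* A path that cannot be examined is treated as a different file. */
        if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
                return (false);

        return (a.st_dev == b.st_dev && a.st_ino == b.st_ino);
}
```

Under this test, a library and a symlink pointing to it are one object, while two separate files that merely share a name and SONAME are two objects.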

Users often assume that the runtime linker does more to prevent multiple instances of one library from entering the process than it really does. The dev/inode check is mainly there to catch cases where a process tries to load the same object under its real name as well as through its compilation symlink. Unix allows a given file to have multiple names, and it's a simple matter to catch those aliases and collapse them to a single loaded object. If, however, the two libraries are physically different files, then determining that they are the same library is not possible:

If the two objects have the same name, or SONAME, then one might expect that they are identical. That is expected, but it is not always true. The name is not a sufficient test for ensuring that two libraries are interchangeable.
If the two objects are byte for byte identical, then one might conclude that only one should be loaded. However, that's an unacceptably expensive check to make at runtime, and a fairly unlikely scenario.
If the two objects differ, but offer exactly the same symbols, then one might decide to treat them as the same library. However, this can easily be wrong. Consider the case where both offer a symbol for a function foo(), but the two functions accept different arguments. The linker cannot detect that. Even if it could, it would not be possible to know which library was the "better" one.
The two objects might differ, and offer different symbols, but provide a common subset that is sufficient for the program. This is also extremely costly to determine, and might prove to be wrong when a later call to dlopen() changes the set of required interfaces.
I am told that in the very early days of Solaris, before my time, any dependency defined as a simple filename was pattern matched against any existing objects. This pattern matching resulted in some expense, but more importantly there were legitimate requirements that one object load 'error.so' and another object load a different 'error.so'. Thus the model became that each object searched for its dependencies using its own runpath.

You can see that it's a hard problem, in the computer science sense of that term, and would be intractable for the runtime linker to implement. If the files are different (different dev/inode), then the runtime linker must assume that they are different libraries. Responsibility for making libraries unique and compatible with each other has always fallen to the system and library designers, and not the linkers. The way to keep things consistent and deterministic is to avoid the situation where more than one library with a given name exists, and to avoid the temptation of believing that it can be made to work, or that the system was intended to support it.

Sometimes, people believe that the runpath of the main program somehow controls how dependencies of dependencies are found. They'll say something like:

My program calls libfoo, and libbar. libfoo is itself linked to a different copy of libbar. I want only one libbar to be loaded, but I've been told by the linker experts that these two copies will both be loaded. And yet, I only see the first one being used, which is what I was hoping for, but which seems to contradict the experts. What is really going on, and why can't I just do this?

The confusion stems from the fact that there's more involved than merely finding and loading libraries in this particular game of pachinko. After finding and loading libraries, the runtime linker carries out the process of symbol resolution, the process of determining how symbols are bound between objects at runtime. If direct bindings are in play, then symbols are bound as the direct bindings dictate. Otherwise, it's done by interposition: The objects in the process are examined in the order that they were loaded, and the first object to provide the desired symbol wins. You can therefore see that it's possible in our question above for 2 copies of libbar to be loaded, while only one is used. It's more complicated than that, however. It's easy to imagine scenarios in which the one used changes, as well as scenarios where both are used.
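
The "first object in load order wins" rule can be modeled with a toy lookup. This is a simplified sketch, not the runtime linker's real data structures: struct object, resolve(), and the two load orders are all invented for illustration, and the model ignores direct bindings entirely.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SYMS 4

/* Toy model of a loaded object: a name plus the symbols it exports. */
struct object {
        const char *name;
        const char *symbols[MAX_SYMS];  /* NULL-terminated when shorter */
};

/* Four hypothetical objects; two of them are copies of libbar. */
const struct object prog   = { "a.out",           { "main", NULL } };
const struct object libfoo = { "libfoo",          { "foo", NULL } };
const struct object bar_a  = { "libbar (copy A)", { "bar", NULL } };
const struct object bar_b  = { "libbar (copy B)", { "bar", NULL } };

/* Two possible load orders that differ only in which libbar came first. */
const struct object *const order_a_first[] = { &prog, &libfoo, &bar_a, &bar_b };
const struct object *const order_b_first[] = { &prog, &libfoo, &bar_b, &bar_a };

/*
 * Walk the objects in load order and return the first one that
 * exports sym, or NULL if no loaded object provides it.
 */
const struct object *
resolve(const struct object *const *load_order, size_t count, const char *sym)
{
        for (size_t i = 0; i < count; i++) {
                const struct object *obj = load_order[i];

                for (size_t j = 0; j < MAX_SYMS && obj->symbols[j] != NULL; j++) {
                        if (strcmp(obj->symbols[j], sym) == 0)
                                return (obj);
                }
        }
        return (NULL);
}
```

In this model, resolve(order_a_first, 4, "bar") yields copy A and resolve(order_b_first, 4, "bar") yields copy B: both copies are "loaded" in each case, and the one that gets used depends only on load order.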

Let's make this concrete with an example. I have a main program that calls functions foo() and bar(), each of which is in a library (libfoo, and libbar respectively):

% cat main.c
#include <stdio.h>

extern void foo(void);
extern void bar(void);

int
main(int argc, char **argv)
{
        (void) printf("main calls foo\n");
        foo();

        (void) printf("main calls bar\n");
        bar();
}

foo() also calls bar():

% cat foo.c
#include <stdio.h>

extern void bar(void);

void foo(void)
{
        (void) printf("    foo calls bar\n");
        bar();
}

Now the twist: There are actually 2 libraries named bar. foo() is linked to lib1/libbar.so.1, while main is linked to lib2/libbar.so.1. Both libbar's have the same object name, and the same SONAME.

% cat bar.c
#include <stdio.h>

void bar(void)
{
        printf("        bar is in library %s\n", BAR_STR);
}

BAR_STR is set via -D on the cc command line when the 2 libbar directories are built.

I have provided a tarball with these files, and a Makefile, that you can download and use to reproduce these experiments. Unpack it in an empty directory, and follow along below:

First, let's build it without any special options. I'll show the output from make for this first experiment to give you a sense of what it does, but will elide it from following ones in the interest of brevity:

% make
mkdir lib1
cc -G -Kpic -DBAR_STR=\"bar_lib1\" bar.c -hlibbar.so.1 \
        -o lib1/libbar.so.1 -zdefs -lc
rm -f lib1/libbar.so
ln -s libbar.so.1 lib1/libbar.so
cc -G -Kpic foo.c -hlibfoo.so.1 \
        -o libfoo.so.1 -L lib1 -R lib1 -zdefs -lbar -lc
rm -f libfoo.so
ln -s libfoo.so.1 libfoo.so
mkdir lib2
cc -G -Kpic -DBAR_STR=\"bar_lib2\" bar.c -hlibbar.so.1 \
        -o lib2/libbar.so.1 -zdefs -lc
rm -f lib2/libbar.so
ln -s libbar.so.1 lib2/libbar.so
cc main.c -o main -L. -Llib2  -R. -Rlib2 -zdefs -lfoo -lbar

ldd shows that there will be 2 libbar objects in the process:

% ldd main
        libfoo.so.1 =>   ./libfoo.so.1
        libbar.so.1 =>   lib2/libbar.so.1
        libc.so.1 =>     /lib/libc.so.1
        libbar.so.1 =>   lib1/libbar.so.1

and debug output shows that both are actually pulled into the process:

% LD_DEBUG=all ./main 2>&1 | grep 'link map' | grep libbar.so
04689: file=lib2/libbar.so.1  [ ELF ]; generating link map
04689: file=lib1/libbar.so.1  [ ELF ]; generating link map

However, only one is actually used, the one "controlled" by the a.out:

% ./main
main calls foo
    foo calls bar
        bar is in library bar_lib2
main calls bar
        bar is in library bar_lib2

This didn't happen because the a.out controlled the loading of objects though. It happened because the a.out's libbar was already in memory, and symbol binding is being done via the traditional interposition rules. The symbol bar() could have come from any object in the process, not necessarily from libbar.

It is not safe to assume that the copy of libbar tied to the a.out will always be the one that "wins". One way to change that is to enable lazy loading, which defers object loading until the first access to the object is made:

% make clean
rm -rf lib? libfoo.so* main
% LD_OPTIONS=-zlazyload make
<...make output elided...>
% ./main
main calls foo
    foo calls bar
        bar is in library bar_lib1
main calls bar
        bar is in library bar_lib1

Now, lib1/libbar wins, rather than lib2/libbar as before. Thanks to lazy loading, libfoo pulled in lib1/libbar before main got around to pulling in lib2/libbar.

Direct bindings offer another way to perturb the results, and can lead to both libraries being called.

% make clean
rm -rf lib? libfoo.so* main
% LD_OPTIONS=-Bdirect make
<...make output elided...>
% ./main
main calls foo
    foo calls bar
        bar is in library bar_lib1
main calls bar
        bar is in library bar_lib2

Preloading is yet another way to change the outcome:

% LD_PRELOAD=lib1/libbar.so.1 ./main
main calls foo
    foo calls bar
        bar is in library bar_lib1
main calls bar
        bar is in library bar_lib1

% LD_PRELOAD=lib2/libbar.so.1 ./main
main calls foo
    foo calls bar
        bar is in library bar_lib2
main calls bar
        bar is in library bar_lib2

There are probably other ways too. For instance, we haven't even discussed the use of dlopen().
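
As one illustration of how dlopen() adds to the picture, here is a minimal sketch that loads a library by explicit path and calls its bar() through the returned handle. The function name call_bar_from() is hypothetical, and the paths you would pass it (such as lib1/libbar.so.1 or lib2/libbar.so.1 from the experiments above) are assumptions about where the test libraries were built. A dlsym() lookup on a specific handle searches that object and its dependencies instead of walking the whole process in load order, so a caller can reach either copy regardless of which one interposition would have picked.

```c
#include <dlfcn.h>
#include <stdio.h>

/*
 * Hypothetical helper: load the object at path, look up bar() in it,
 * and call it. Returns 0 on success, -1 if the object or the symbol
 * cannot be found.
 */
int
call_bar_from(const char *path)
{
        void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);

        if (handle == NULL) {
                (void) fprintf(stderr, "dlopen: %s\n", dlerror());
                return (-1);
        }

        /* Look up bar() relative to this handle, not the whole process. */
        void (*bar_fn)(void) = (void (*)(void))dlsym(handle, "bar");

        if (bar_fn == NULL) {
                (void) fprintf(stderr, "dlsym: %s\n", dlerror());
                (void) dlclose(handle);
                return (-1);
        }

        bar_fn();
        (void) dlclose(handle);
        return (0);
}
```

Calling this helper once with each libbar path would be expected to run both copies in one process, which is exactly the situation the rest of this discussion warns against. On some systems the program must also be linked with -ldl.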

It is indeed true that normal small programs can manage these pitfalls without much issue. But consider the complexity of a situation like that in firefox, where multiple dependencies have dependencies on each other:

% ldd /usr/bin/firefox | wc -l 
90

At some point, the interdependencies will overwhelm your ability to predict, or to manage. At the limit, the only 100% safe and predictable way to manage this issue is to ensure that there is never more than one instance of a given library on the system.

Published Elsewhere

https://blogs.oracle.com/ali/entry/not_managing_more_than_one/
https://blogs.oracle.com/ali/not-managing-more-than-one-of-the-same-object-in-a-process/