Regular Expression and Glob Matching for Mapfiles

Ali Bahrami — Wednesday December 23, 2015

Surfing with the Linker-Aliens

In Solaris 11 Update 4, the ld mapfile language has gained MATCH and MATCHREF expressions, which provide the ability to match sections and symbols using regular expressions or glob patterns. In the case of regular expressions, you can also use back substitution to generate replacement names, using substrings from the original matched name. Support for the GNU --version-script option is also added.

History

These abilities were a long time coming. Around the 2007-2008 time frame, we undertook to add support for a large number of GNU link-editor command line options, as a way to lower the barrier to FOSS code moving from Linux to Solaris. Many of their options are simple aliases for our options, differing only in the name of the option itself. Not understanding GNU version scripts sufficiently, we added --version-script to ld as an alias for our -M (mapfile) option. We quickly discovered that to be a bad idea when we received a report that the new ld was breaking the ability to build software that had previously worked, and we yanked it back out, having gained a bit of hard-won knowledge through the experience. The problem was wildcard support. GNU version scripts allow the use of glob wildcards to match symbol names, which is something that Solaris mapfiles did not support. An implementation of --version-script that does not support wildcards sets a trap for FOSS software, since any package that uses wildcards will try to use the feature, and then fail to build.

Mapfiles are often used to establish stable interfaces, sometimes called an ABI (Application Binary Interface). In such cases, using wildcards to match symbols is a bad idea, since a small programming error can alter the ABI without drawing any error from the link-editor. Other times, mapfiles are used to specify symbols for non-ABI related purposes, and in that case, wildcard matching can be useful. We were interested in providing the option back in 2008, but as I've written previously, we were in no position to provide it, as we were saddled with a deeply inadequate mapfile language that had already been pushed past its limits:

    The Problem(s) With Solaris SVR4 Link-Editor Mapfiles

Before we could entertain adding a feature like wildcard matching, it was necessary to do something about that:

    A New Mapfile Syntax for Solaris

And so, here we are 7 years later. It seems appropriate at this point to take a moment to contemplate the notion of technical debt. The folks who designed the original mapfile language back in the 1980's are surely not to blame for our difficulties decades later, but decisions made early in any software project have long term impacts that can be hard to foresee.

The desire to someday support wildcard matching was discussed at the time that we designed the replacement version 2 mapfile language. However, that was already a very large and complicated project, so we made the choice to defer these new abilities to a later time, knowing that the new version 2 syntax would establish the necessary foundation.

MATCH and MATCHREF Expressions

[Much of this section is lifted from the Solaris Linker and Libraries Guide]

A MATCH expression allows strings to be matched against a pattern, delimited by slash (/) characters.

MATCH(g/match-pattern/[i])
MATCH(r/match-pattern/[i])
MATCH(t/match-pattern/[i])
A single character code preceding the pattern selects the type of matching to be done, which in turn defines the syntax of the match-pattern.

g
Glob pattern matching. The match-pattern is specified using the glob syntax described by the fnmatch(5) manpage.
r
Regular Expression matching. The match-pattern is specified using the extended regular expression (ERE) syntax described by the regex(5) manpage.
t
Plain text matching. The match-pattern follows the standard mapfile syntax for double quoted strings, where the slash (/) character is used in place of the usual (") quote character. The rules for double quoted strings are described in the Linker and Libraries Guide.
Pattern matching is case sensitive by default. Case insensitive matching is selected by placing the character i immediately after the closing slash (/) character.
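
The three match types and the i modifier can be modeled outside of ld. The following Python sketch is only an approximation for experimenting with patterns: mapfile_match is a hypothetical helper, Python's fnmatch and re modules stand in for fnmatch(5) and ERE syntax, and it assumes that regular expressions are unanchored unless the pattern anchors them and that text matching compares the entire name.

```python
import fnmatch
import re


def mapfile_match(kind, pattern, name, ignore_case=False):
    """Rough analogue of MATCH(<kind>/<pattern>/[i]) applied to one name."""
    if kind == "g":
        # Glob matching; fnmatchcase never folds case on its own.
        if ignore_case:
            return fnmatch.fnmatchcase(name.lower(), pattern.lower())
        return fnmatch.fnmatchcase(name, pattern)
    if kind == "r":
        # Regular expression matching; Python's re stands in for ERE here.
        flags = re.IGNORECASE if ignore_case else 0
        return re.search(pattern, name, flags) is not None
    if kind == "t":
        # Plain text matching (assumed to compare the whole name).
        if ignore_case:
            return name.lower() == pattern.lower()
        return name == pattern
    raise ValueError("unknown match type: " + kind)
```

For instance, mapfile_match("g", ".TEXT*", ".text.main") is false, while the same call with ignore_case=True is true, mirroring the effect of the trailing i.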

The MATCHREF expression is used to generate a new string, based on a template string, which can incorporate substrings matched by a previous MATCH. MATCHREF, particularly in conjunction with a regular expression MATCH, offers a powerful mechanism for renaming.

MATCHREF(/template-string/)
The template-string follows the standard mapfile syntax for double quoted strings, where the slash (/) character is used in place of the usual (") quote character. The rules for double quoted strings are described in the Linker and Libraries Guide.

Within template-string, substrings to be copied from the related MATCH expression are indicated by tokens of the form ${cN}, where c is a single character that identifies a MATCH directive, and N is an integer that identifies a substring within that MATCH. The identifier characters allowed with MATCHREF depend on the mapfile directive that employs them. The documentation for each mapfile directive that supports the use of MATCH and MATCHREF defines the set of MATCHREF identifier characters that are allowed by that directive.

The string that results from a MATCHREF expression consists of the template-string, with all ${cN} tokens replaced by the MATCH substrings that they refer to. The zeroth token, ${c0}, represents the full string matched by the MATCH expression, and is supported with all MATCH expressions. Tokens specifying a value of N larger than 0 are only supported with regular expression MATCH expressions. When used with a regular expression, a value of N larger than 0 corresponds to the Nth open parenthesis found in the MATCH pattern, and represents the substring matched by that subpart of the regular expression.

If a given ${cN} does not correspond to any MATCH substring, an empty ("") string is substituted. This occurs for any non-zero value of N with glob or text matching, or for a value of N greater than the number of open parentheses within a regular expression pattern.
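
The substitution rules above can be tried out with a small Python sketch. This is an illustration rather than the link-editor's implementation: expand_matchref is a hypothetical helper, Python's re module stands in for ERE matching, and treating an unknown identifier character as an empty string is my assumption (ld may well reject it instead).

```python
import re

# Matches tokens of the form ${cN}: one identifier character, then an integer.
TOKEN = re.compile(r"\$\{([A-Za-z])([0-9]+)\}")


def expand_matchref(template, matches):
    """Expand ${cN} tokens; `matches` maps identifier char -> re.Match."""
    def substitute(token):
        ident, index = token.group(1), int(token.group(2))
        match = matches.get(ident)
        if match is None or index > match.re.groups:
            # Unknown identifier (assumption) or no such substring: use "".
            return ""
        # group() returns None for a subexpression that did not take part.
        return match.group(index) or ""
    return TOKEN.sub(substitute, template)


m = re.search(r"^\.appXtext\.(.*)$", ".appXtext.init")
renamed = expand_matchref(".text.${n1}", {"n": m})
```

Here ${n1} expands to the text captured by the first parenthesized subexpression, and a reference such as ${n2} expands to nothing, because the pattern contains only one.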

Redirecting Sections

Normally, the link-editor copies input sections to the output object, creating output sections with the same names as the input. The LOAD_SEGMENT directive allows the use of MATCH and MATCHREF to match sections by name, and optionally to redirect them to differently named output sections. The following mapfile matches all read-only allocable sections with a name starting with the string ".appXtext.", and redirects them to an output section named by replacing this prefix with ".text.".
$mapfile_version 2

LOAD_SEGMENT text {
	ASSIGN_SECTION apptext {
		IS_NAME = MATCH(r/^\.appXtext\.(.*)$/);
		FLAGS = ALLOC !WRITE;
		OUTPUT_SECTION {
			NAME = MATCHREF(/.text.${n1}/);
		};
	};
};
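
The combined effect of IS_NAME and FLAGS in this mapfile can be approximated in Python. The sketch below is only a model: redirect is a hypothetical helper, the flag constants are the standard ELF SHF_WRITE and SHF_ALLOC values, and it assumes that a section failing either test simply keeps its input name, ignoring whatever other mapfile rules or defaults might apply.

```python
import re

SHF_WRITE = 0x1  # standard ELF section flag values
SHF_ALLOC = 0x2

NAME_RE = re.compile(r"^\.appXtext\.(.*)$")


def redirect(name, flags):
    """Return the output section name for one input section."""
    match = NAME_RE.search(name)
    # FLAGS = ALLOC !WRITE: allocable, and not writable.
    is_readonly_alloc = bool(flags & SHF_ALLOC) and not (flags & SHF_WRITE)
    if match and is_readonly_alloc:
        return ".text." + match.group(1)
    # Anything else keeps its input name (assumed default behavior).
    return name
```

A read-only allocable section named .appXtext.init would land in .text.init, while a writable section with the same name would be left alone.
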
A more powerful example comes from KSplice, the mechanism by which a running operating system kernel can be patched without requiring a reboot. As mentioned above, the ideas behind MATCH and MATCHREF had been germinating for years. The impetus to finally add them to the mapfile language came from the KSplice project. KSplice originates on Linux, and makes extensive use of GNU linker script matching. Note that I have numbered each rule, to aid the discussion, and that lines have been wrapped for readability:
  SECTIONS {
1   .text : { *(.text .text* .exit.text .sched.text) }
2   .ksplice_relocs : { ksplice_relocs = .; KEEP(*(.ksplice_relocs*))
          ksplice_relocs_end = .; }
3   .ksplice_sections : { ksplice_sections = .; KEEP(*(.ksplice_sections*))
          ksplice_sections_end = .; }
4   .ksplice_patches : { ksplice_patches = .; KEEP(*(.ksplice_patches*))
          ksplice_patches_end = .; }
5   .ksplice_symbols : { ksplice_symbols = .; KEEP(*(.ksplice_symbols))
          ksplice_symbols_end = .; }
6   .ksplice_call_pre_apply : { ksplice_call_pre_apply = .;
          KEEP(*(.data*_KHOOK_ksplice_call_pre_apply*))
          ksplice_call_pre_apply_end = .; }
7   .ksplice_call_check_apply : { ksplice_call_check_apply = .;
          KEEP(*(.data*_KHOOK_ksplice_call_check_apply*))
          ksplice_call_check_apply_end = .; }
8   .ksplice_call_apply : { ksplice_call_apply = .;
          KEEP(*(.data*_KHOOK_ksplice_call_apply*))
           ksplice_call_apply_end = .; }
9   .ksplice_call_post_apply : { ksplice_call_post_apply = .;
          KEEP(*(.data*_KHOOK_ksplice_call_post_apply*))
          ksplice_call_post_apply_end = .; }
10  .ksplice_call_fail_apply : { ksplice_call_fail_apply = .;
          KEEP(*(.data*_KHOOK_ksplice_call_fail_apply*))
          ksplice_call_fail_apply_end = .; }
11  .ksplice_call_pre_reverse : { ksplice_call_pre_reverse = .;
          KEEP(*(.data*_KHOOK_ksplice_call_pre_reverse*))
          ksplice_call_pre_reverse_end = .; }
12  .ksplice_call_check_reverse : { ksplice_call_check_reverse = .;
          KEEP(*(.data*_KHOOK_ksplice_call_check_reverse*))
          ksplice_call_check_reverse_end = .; }
13  .ksplice_call_reverse : { ksplice_call_reverse = .;
          KEEP(*(.data*_KHOOK_ksplice_call_reverse*))
          ksplice_call_reverse_end = .; }
14  .ksplice_call_post_reverse : { ksplice_call_post_reverse = .;
          KEEP(*(.data*_KHOOK_ksplice_call_post_reverse*))
          ksplice_call_post_reverse_end = .; }
15  .ksplice_call_fail_reverse : { ksplice_call_fail_reverse = .;
          KEEP(*(.data*_KHOOK_ksplice_call_fail_reverse*))
          ksplice_call_fail_reverse_end = .; }
  }
Each of these rules does three things:

  1. Establish an output section with a specified name.

  2. Assign input sections to these output sections, using patterns containing glob wildcards to specify the section names.

  3. Generate a pair of "begin" and "end" symbols for each output section.
For example, rule #2:
.ksplice_relocs : { ksplice_relocs = .; KEEP(*(.ksplice_relocs*))
         ksplice_relocs_end = .; }

  1. Creates output section .ksplice_relocs

  2. Causes any input section starting with a .ksplice_relocs prefix to be assigned to that section.

  3. Creates begin symbol ksplice_relocs, and end symbol ksplice_relocs_end.

The begin/end symbols are easily created in other ways, and are not interesting to this example. This example focuses on (1) and (2).

As shown in the above linker script, GNU ld allows glob style pattern matching. Although this is more powerful than what we've offered to date, there is still a lot of repetition, because each unique output section name suffix requires a unique rule. Our new MATCH expressions also offer glob style matching, and can be used to produce a 1:1 mapping from the above linker script rules to a Solaris mapfile that does the same thing. However, MATCH expressions have additional power, in the form of regular expressions, which can be used with back substitution of matched substrings to shrink the number of required rules from 15 down to 3, with a corresponding improvement in readability:

$mapfile_version 2

LOAD_SEGMENT text {
    # Corresponds to GNU linker script rule 1
    ASSIGN_SECTION {
        IS_NAME = MATCH(r/^\.((text.*)|(exit.text)|(sched.text))$/);
        OUTPUT_SECTION { NAME = .text };
    };
};

LOAD_SEGMENT data {
    # Corresponds to GNU linker script rules 2-5
    ASSIGN_SECTION {
        IS_NAME = MATCH(r/^.ksplice_(relocs|sections|patches|symbols).*/);
        OUTPUT_SECTION { NAME = MATCHREF(/.ksplice_${n1}/) };
    };

    # Corresponds to GNU linker script rules 6-15
    ASSIGN_SECTION {
        IS_NAME = MATCH(r/^.data.*_KHOOK_ksplice_call_(.*_)?(apply|reverse).*/);
        OUTPUT_SECTION { NAME = MATCHREF(/.ksplice_call_${n1}${n2}/) };
    };
};
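
One way to gain confidence that the two data segment rules cover GNU rules 2-15 is to run sample section names through the same regular expressions. The Python sketch below does that under some assumptions: output_section is a hypothetical helper, Python's re module stands in for ERE, the templates use Python's {0} and {1} placeholders in place of ${n1} and ${n2}, and the sample names are invented for illustration.

```python
import re

# (pattern, template) pairs transcribed from the data segment rules.
RULES = [
    (r"^.ksplice_(relocs|sections|patches|symbols).*",
     ".ksplice_{0}"),
    (r"^.data.*_KHOOK_ksplice_call_(.*_)?(apply|reverse).*",
     ".ksplice_call_{0}{1}"),
]


def output_section(name):
    """Return the output section name for an input section, or None."""
    for pattern, template in RULES:
        match = re.search(pattern, name)
        if match:
            # A subexpression that did not take part becomes "", as with MATCHREF.
            groups = [g or "" for g in match.groups()]
            return template.format(*groups)
    return None
```

Note how the optional (.*_)? subexpression lets a single rule handle both the plain apply and reverse hooks and the prefixed pre_, check_, post_, and fail_ variants.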

Matching and Renaming Symbols

The SYMBOL_SCOPE and SYMBOL_VERSION directives allow the use of MATCH and MATCHREF to match symbols by name, and optionally to rename them. The following mapfile matches all symbols starting with the letter "i" followed by an integer value, and gives them protected scope.
$mapfile_version 2

SYMBOL_SCOPE {
    protected:
        MATCH(r/^i[0-9]+$/);
};
The following mapfile renames all symbols starting with the letter "i" followed by an integer value to have the prefix "interface_", and gives them protected scope.
$mapfile_version 2

SYMBOL_SCOPE {
    protected:
        MATCH(r/^i([0-9]+)$/)
                { RENAME = MATCHREF(/interface_${n1}/) };
};
Symbol renaming is usually better done at the source code level. I have to confess that I had, and have, doubts that allowing symbol renaming is a good idea. In the end, the symmetry between MATCH and MATCHREF seemed more compelling than the fact that the feature can be misused. Please try to reserve your use of MATCHREF on symbols for those rare unicorn moments when it really is a good idea.

And Finally: --version-script

In Solaris 11 Update 4, the link-editor (ld) does accept the GNU --version-script option. A file specified with --version-script, rather than the native -M, receives a bit of extra processing — any symbol containing a glob meta character is treated as if it has an implicit MATCH(g/.../) expression wrapped around it.
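
The implicit wrapping can be pictured as a simple rewrite of each symbol name. The Python sketch below is a guess at the idea, not the actual ld logic: as_mapfile_symbol is a hypothetical helper, and the set of characters treated as glob meta characters is an assumption.

```python
GLOB_META = set("*?[")  # assumed set of glob meta characters


def as_mapfile_symbol(symbol):
    """Wrap a version script symbol in MATCH(g/.../) if it uses wildcards."""
    if any(ch in GLOB_META for ch in symbol):
        return "MATCH(g/" + symbol + "/)"
    # Names without wildcards are passed through unchanged.
    return symbol
```

Under these assumptions, a version script entry such as foo_* behaves like MATCH(g/foo_*/) in a native mapfile, while a plain name such as foo is an ordinary symbol reference.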

Our version script implementation is still a subset of the GNU version. GNU version scripts support an extern "lang" syntax, derived from C++, allowing symbols from languages like C++ and Java to be specified in their natural, non-mangled forms, as per this example I found online:

{
    global:
        extern "C++" {
            *scifi::Spaceship;
            scifi::Spaceship::Spaceship*;
            scifi::Spaceship::?Spaceship*;
            scifi::Spaceship::stabliseIonFluxers*;
            scifi::Spaceship::initiateHyperwarp*;
        };
    local:
        *;
};
This is an obviously useful ability, but perhaps one that's more difficult for Solaris to provide, as we have multiple C++ compilers to support that follow different demangling rules. We may, or may not, ever add this. Fortunately, the vast majority of version scripts in the FOSS software we've seen don't use the feature, so an implementation of --version-script that only understands the GNU wildcard feature remains useful in most cases. We've therefore provided --version-script. If ld detects the use of extern "lang" in a version script, an error to that effect is issued.

To prevent breaking software that requires the extern "lang" syntax, the Solaris ld requires the -z gnu-version-script-compat option to also be specified. From the ld(1) manpage:


-z gnu-version-script=mapfile
-z gnu-version-script-compat
--version-script mapfile

    Provides partial support for the GNU version script style
    of mapfile. Version scripts are based on the original Solaris
    version 1 symbol definition syntax, with some extensions. ld
    supports the most common such extension, the use of wildcard
    characters in the specified symbol names. Other GNU-specific
    extensions may not be supported. ld will issue an appropriate
    error if an unsupported extension is encountered.

    For convenience in building software developed with GNU
    version scripts, the native GNU --version-script option
    is accepted as an alias for -z gnu-version-script. Due
    to the partial nature of the support for GNU version scripts,
    the use of --version-script must be explicitly enabled by
    specifying -z gnu-version-script-compat.
It's a little clunky, but I don't want a repeat of 2008. This time, --version-script is going to stick. Should we ever add extern "lang" support, we'll also phase out the need for -z gnu-version-script-compat.

Published Elsewhere

https://blogs.oracle.com/ali/regex_and_glob_for_mapfiles/
