Adventures in Open-Source: the lit profraw race
Intro
This is the start of an ongoing mini-series where I document contributions I make to open source projects. These aren't tutorials; I won't be teaching background theory. Instead, I want to walk through the process: how I identified the issue, designed a fix, and went through code review. Contributing to a project can feel intimidating the first time, so hopefully this makes it feel more approachable.
The Crash
If you read the ariadne post, you know I've been on the CI grind recently. One of the optimizations I shipped was switching our code coverage pipeline from lcov to llvm-cov. Coverage time went from 24 minutes down to 8. The speedup comes from llvm-cov's ability to run tests and merge profraw files in parallel; lcov processes everything serially, which is safe but slow. That parallelism is also what introduced the bug: with lcov, there's no race to have. But every few runs with llvm-cov, we saw something like:
#3 __memchr_evex (libc.so.6+0x16b220)
#4 llvm::StringRef::find (llvm-profdata+0x6bfb3b)
#5 llvm::StringRef::split (llvm-profdata+0x6c00d6)
#6 llvm::readAndDecodeStrings (llvm-profdata+0x5714cb)
#7 llvm::InstrProfSymtab::create (llvm-profdata+0x571d21)
A hard segfault in llvm-profdata during the merge step. The --failure-mode=warn flag didn't help because the crash happened before any Error was returned. The frame signature (readAndDecodeStrings -> StringRef::split -> memchr) was consistent across builds, which made it relatively straightforward to pin down.
Tracing the Malformed Input
The same run that produced the segfault also emitted these warnings:
warning: .../test/Tools/XXX/trace_runtime0.profraw: failed to uncompress data (zlib)
warning: .../test/Tools/XXX/LTO/lto2.profraw: failed to uncompress data (zlib)
warning: .../test/Tools/XXX/disable-opt-flags0.profraw: failed to uncompress data (zlib)
warning: .../test/Tools/XXX/override0.profraw: malformed instrumentation profile data: symbol name is empty
The clue wasn’t the warning text but rather the filenames. All four of these profraws share a basename with a sibling file in the same directory:
trace_runtime → trace_runtime.{c,cpp}
lto → lto.{mlir,s}
disable-opt-flags → disable-opt-flags.{c,mlir}
override → override.{c,cpp}
The crash was also non-deterministic. A shared basename plus a non-deterministic failure pointed to a race on the profraw files themselves.
The Root Cause: A Filename Collision in lit
Here's how lit names profraws when running with --per-test-coverage:
if litConfig.per_test_coverage:
test_case_name = test.path_in_suite[-1]
test_case_name = test_case_name.rsplit(".", 1)[0]
coverage_index = 0
for i, ln in enumerate(scriptCopy):
...
profile = f"{test_case_name}{coverage_index}.profraw"
coverage_index += 1
command = f"export LLVM_PROFILE_FILE={profile}; {command}"
It takes the test's basename, strips the extension, and appends a counter. That means foo.c and foo.cpp in the same directory both export LLVM_PROFILE_FILE=foo0.profraw. Two parallel test processes race to write the same file: one wins, the other produces a truncated or corrupted symtab, and that corrupted profraw is what llvm-profdata later tries to merge.
Fix 1: Full Path + Runtime Placeholders (PR #203998)
The fix uses the full path_in_suite instead of just the basename, which keeps directory and extension information distinct. It also adds two runtime placeholders from LLVM's profile format: %p (process ID) and %m (binary signature). These are expanded by compiler-rt inside the instrumented binary at write time. %p alone isn't sufficient: when a process calls execve(), the PID is preserved but the binary changes. A compiler driver that spawns a backend via exec would produce the same PID for two different instrumented binaries, so they'd still collide. %m is a hash of the binary itself, so it disambiguates them even when the PID is identical. PID reuse across a long test run (process A exits, process B gets the same PID) is a secondary concern; exec chains are the main one.
if litConfig.per_test_coverage:
# Use full path_in_suite plus %p (pid) and %m (binary signature) so
# same-basename siblings and fork/exec chains don't collide.
test_case_name = "_".join(test.path_in_suite)
coverage_index = 0
for i, ln in enumerate(scriptCopy):
...
profile = f"{test_case_name}-%p-%m{coverage_index}.profraw"
coverage_index += 1
command = f"export LLVM_PROFILE_FILE={profile}; {command}"
foo.c now exports foo_c-%p-%m0.profraw and foo.cpp exports foo_cpp-%p-%m0.profraw. No collision.
Fix 2: Bounds Checking in readAndDecodeStrings (PR #203999)
Fix 1 stops the malformed input from being produced. Fix 2 makes the parser fail gracefully if malformed input shows up anyway, regardless of origin.
The problem in readAndDecodeStrings was that it used the unchecked decodeULEB128(P, &N) overload and constructed a StringRef without verifying that P + UncompressedSize <= EndP. On a truncated symtab, the resulting StringRef length runs past the buffer, and the subsequent split() call memchrs off the end of mapped memory:
// Before: no bounds checking
uint64_t UncompressedSize = decodeULEB128(P, &N);
P += N;
uint64_t CompressedSize = decodeULEB128(P, &N);
P += N;
// ...
NameStrings = StringRef(reinterpret_cast<const char *>(P), UncompressedSize);
The fix switches to the bounds-checked overload for both length fields and explicitly rejects sizes larger than the remaining buffer:
// After: bounds-checked overloads + explicit size validation
const char *ULEBError = nullptr;
uint64_t UncompressedSize = decodeULEB128(P, &N, EndP, &ULEBError);
if (ULEBError)
return make_error<InstrProfError>(instrprof_error::truncated);
P += N;
uint64_t CompressedSize = decodeULEB128(P, &N, EndP, &ULEBError);
if (ULEBError)
return make_error<InstrProfError>(instrprof_error::truncated);
P += N;
// ...
if (UncompressedSize > (uint64_t)(EndP - P))
return make_error<InstrProfError>(instrprof_error::truncated);
NameStrings = StringRef(reinterpret_cast<const char *>(P), UncompressedSize);
Now, on a truncated or malformed symtab, the parser returns instrprof_error::truncated instead of walking off the end of the buffer.
Code Review and Testing
I CCed @boomanaiden154 and @ilovepi (they were some of the most active approvers for this directory) for review on Fix 1. @ilovepi approved quickly and suggested giving the other reviewer a chance to weigh in first. @boomanaiden154's question was the interesting one: if the test checks for a LLVM_PROFILE_FILE value that still contains %p-%m literally, is that actually correct? Shouldn't those get substituted?
The answer is no. The substitution is not lit's job. %p and %m are format specifiers defined by compiler-rt's profiling runtime, documented as part of LLVM_PROFILE_FILE's contract. They get expanded inside the instrumented binary when it calls __llvm_profile_write_file() at exit. lit's only job is to set the environment variable correctly before the test runs. So the CHECK lines in the test are asserting the right thing: the literal string that lit writes into the environment, which is supposed to still contain %p-%m. The expansion is a runtime concern, and compiler-rt's own test suite covers it separately (e.g. compiler-rt/test/profile/ContinuousSyncMode/pid-substitution.c).
Once that was clear, @boomanaiden154 approved and the PR merged.
The test itself is fairly minimal. Two sibling files, collision.c and collision.cpp, each containing a single // RUN: true. A lit.cfg enables per_test_coverage for the suite. The test driver uses CHECK-DAG to assert that both get distinct names:
# CHECK-DAG: LLVM_PROFILE_FILE=collision.c-%p-%m0.profraw
# CHECK-DAG: LLVM_PROFILE_FILE=collision.cpp-%p-%m0.profraw
CHECK-DAG is the right tool here because lit may run the two tests in either order. The test doesn't try to simulate an actual race; it just verifies that the names lit generates are distinct. Whether those names prevent a collision in practice is a property of the format string, not something a lit-level test can exercise.
Fix 2 required more care on the testing side. When you're adding bounds checks, you need to prove that each check actually fires on the right input, and that incorrect inputs don't slip through as false negatives. The regression test, SymtabTest.instr_prof_malformed_name_strings, covers three distinct corruption cases:
- Truncated ULEB: the length field is cut off mid-encoding
- Oversized uncompressed size: the declared uncompressed size exceeds the remaining buffer
- Oversized compressed size: the declared compressed size exceeds the remaining buffer
Each case maps to one of the three bounds checks added in the fix. If any check is missing or wrong, its corresponding test case fails. The PR is currently in review with @bogner (listed as a maintainer of this tool).
All in all, this was a fairly straightforward fix, only the diagnosis was particularly tricky. The bounds-checking fix was immediately obvious, but tracing the malformed profraw back to a filename race took some sleuthing.
Takeaway
I want to end each post in this series with one tip for contributing. This one I learned pretty early on: the best way to find something worth fixing in a large project is to actually use it, and ideally to use it in ways the original authors didn't anticipate. Running lit with sibling tests that share a basename isn't the common case, which is exactly why the edge condition slipped through. When you hit something like that, let the authors know. Raise an issue. And if you already understand the fix, contribute it!