Debugging native memory issues in a C# application

time to read 5 min | 916 words

I’m working on improving the performance of Corax, RavenDB’s new search engine. Along the way, I introduced a bug, a fairly nasty one. At a random location, while indexing a ~50 million documents corpus, we are getting an access violation exception. That means that I messed something up.

That makes sense, given that my changes were mostly about making things lower-level. Working directly with pointers and avoiding length checks. At our speed, even the use of Span can be a killer for performance, and we want to be as close to the raw metal as possible. The particular changeset that I was working on was able to improve the indexing speed from 90,000 per second to 120,000 per second. That is a change that I absolutely want to keep, so I started investigating it.

I mentioned that it is a fairly nasty problem. A truly nasty problem would be heap corruption that is discovered after the fact and is very hard to trace. In this case, it was not consistent, which is really strange. One of the important aspects of Corax is that it is single-threaded, which means that a lot of complexity is out the window. It means that for the same input, we always have the same behavior. If there is any variance, such as not crashing all the time, it means that there are external factors involved.

At any rate, given that it happened at least half the time, I was able to attach WinDBG to the process and wait for the exception to happen, this is what I got:

(5e20.1468): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
Corax!Corax.IndexWriter.AddEntriesToTermResultViaSmallPostingList+0x953:
00007ffa`24dcea53 c4e261902411    vpgatherdd xmm4,dword ptr [rcx+xmm2],xmm3 ds:0000026d`516514e7=????????

Now, look at the last line, that is an interesting one, we use the VPGATHERDD assembly instruction. It is gathering packed DWORD values, in C#, this is generated using the Avx2.GatherVector128() method. We are using that to do some bit packing in this case, so this makes a lot of sense.

Next, let’s see what we get from the exception:

0:074> .exr -1
ExceptionAddress: 00007ffafc2bfe7c (KERNELBASE!RaiseException+0x000000000000006c)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000080
NumberParameters: 2
   Parameter[0]: 0000000000000000
   Parameter[1]: 0000026d51650000
Attempt to read from address 0000026d51650000

All of this points to an out-of-bounds read, but why is that? The call we have for GatherVector128() is used inside a method named: ReadAvx2(). And this method is called like this:

private unsafe static ulong Read(int stateBitPos, byte* inputBufferPtr, int bitsToRead, int inputBufferSize, out int outputStateBit)
{
    if ((stateBitPos + bitsToRead) / 8 >= inputBufferSize)
        throw new ArgumentOutOfRangeException();
    if ( Avx2.IsSupported)
    {
        return ReadAvx2(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
    }
    return ReadScalar(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
}

It is an optimized approach to read some bits from a buffer, I’ll skip the details on exactly how this works. As you can see, we have a proper bounds check here, ensuring that we aren’t reading past the end of the buffer.

Except…

That we aren’t actually checking this. What we are doing is checking that we can access the bytes range, but consider the following scenario:

image

We have a memory page and a buffer that is located toward the end of it.  We are now trying to access the last bit in the buffer, using ReadAvx2(). If we’ll check the actual bytes range, it will pass, we are trying to access the last byte.

However, we are going to call GatherVector128(), which means that we’ll actually access 16 bytes(!), and only the first byte is in the valid memory range, the rest is going to be read from the next page, which isn’t mapped.

This also explains why we are not always failing. If the next page is valid (which is subject to the decisions of the operating system allocator), it will pass. So that is why we didn’t have 100% reproduction. In fact, this is the sort of bug that is very easy to hide for a very long time in the system, given that it is dependent on the actual memory structure of the application.

Once we figured out what was going on, it was pretty easy to understand, but the fact that the AVX instructions will read after the end of the buffer was really confusing. Because even when we used Span, and its range checks, it would be completely ignored. Makes total sense, given that those aren’t really methods, but compiler intrinsics that are translated to direct machine instructions.

Amusingly enough, now that we found the problem, we ran into something very similar a long while ago. Then it was the wrong instruction being used (loading a word, instead of a byte), that would fail, but the same overal setup. It will sometimes fail, depending on the state of the next page in the memory.

We actually built some tooling around managing that, we call that electric fence memory. We allocate memory so any out-of-band access would always hit invalid memory, stopping us in our tracks. That means that I can get easy reproduction of those sorts of issues, and once we have that, the rest isn’t really that interesting, to be honest. It’s just a normal bug fix. It’s the hunt for the root cause that is both incredibly frustrating and quite rewarding.