Are You Sure You Want to Use MMAP in Your Database Management System?

A summary of the ‘Are You Sure You Want to Use MMAP in Your Database Management System?’ paper

This year I’ve decided to start writing a quick summary of some of the papers and books I read. I don’t know how long this will last, but I am going to give it a try (I haven’t been reading that much lately anyway).

So, to get myself started I am going to use the first paper I have read this year: Are You Sure You Want to Use MMAP in Your Database Management System?

This paper doesn’t introduce any particular innovation, but it does a really great job of putting together the problems that various database systems have run into while using MMAP as an alternative to a buffer pool implementation.

MMAP Overview

The paper starts by providing a short introduction to MMAP. MMAP is an abstraction provided by the underlying OS that maps the contents of a file residing on secondary storage into a program’s address space, transparently loading and unloading pages as the program references them. You can imagine how attractive and “simple” this looks to developers …

The general workflow to access a file using mmap is as follows:

  1. Call mmap and get back a pointer to the memory-mapped file.
  2. The OS reserves a portion of the program’s virtual address space, but no contents are loaded so far.
  3. Use the original pointer to access the contents of the file.
  4. The OS looks for the corresponding page and, since no contents have been loaded, triggers a page fault in order to load the referenced portion of the file into memory.
  5. The page table is modified accordingly to point to the new physical address.
  6. The CPU core where the access was initiated caches this entry in its translation lookaside buffer (TLB).

Most programming languages allow you to use the mmap abstraction in your programs. For example, in Rust you can do something like:

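use memmap::MmapOptions;
use std::fs::File;
use std::io::Result;

fn main() -> Result<()> {
    let file = File::open("/tmp/mmap-example.db")?;
    // Mapping is `unsafe` because the file can change underneath us while it is mapped.
    let mmap = unsafe { MmapOptions::new().map(&file)? };

    // Reading the slice is what triggers the page faults.
    // Note that slicing panics if the file is shorter than 80 bytes.
    println!("{:?}", &mmap[10..80]);

    Ok(())
}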

You don’t need to worry about how big /tmp/mmap-example.db is: the OS will “transparently” load and unload the pages as needed.

There are a few system calls that database developers can use to perform memory-mapped file I/O:

  • mmap: We’ve already covered this; the OS maps the file into a program’s virtual address space. We can choose to write any change back to the backing file or keep our changes private to us.
  • madvise: We can provide different hints to the OS about our expected data access patterns. When a page fault happens, the OS will perform different actions depending on the provided hint (MADV_NORMAL, MADV_RANDOM, MADV_SEQUENTIAL).
  • mlock: We can pin pages in memory, making sure that the OS will never evict them (dirty pages can still be flushed at any time, though).
  • msync: We can explicitly flush a memory range to the underlying storage.

The problems

The paper mentions a few databases that have tried to use mmap somehow: MongoDB, InfluxDB (here I have experienced the problems myself), SingleStore, LevelDB, … and presents the most common problems people have run into while using this technique.

Transactional safety

Since the OS is transparently handling the load/unload of the pages, a particular page can be flushed to the underlying storage at any point in time, no matter what the status of the current transaction is.

Different, and usually complicated, protocols are used to work around this problem:

  • OS Copy-On-Write: This technique creates two copies of the database file using mmap. One of them is the primary copy while the other is used as a private workspace (opened with the MAP_PRIVATE flag). With this approach, the database needs to make sure that the updates produced by committed transactions have propagated to the primary copy before letting conflicting transactions move forward, and it also has to deal with the growth of the private workspace.

  • User Space Copy-On-Write: This technique involves a manual process where the modified pages are copied to a separate buffer residing in user space. SQLite, MonetDB, and RavenDB use some variant of this technique.

  • Shadow Paging: This is used by LMDB. It maintains separate primary and shadow copies of the database. The modified pages are copied from the primary to the shadow copy, the changes are flushed to secondary storage, and a pointer is flipped so that the shadow copy becomes the primary and vice versa.
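The shadow paging bullet above is easier to follow with a toy model. The following is a minimal in-memory sketch, not LMDB's actual implementation: `ShadowDb`, its fields and the `commit` signature are all hypothetical names made up for illustration, and the flush to storage is only marked with a comment.

```rust
/// Page size for the toy; real systems typically use 4 KiB or more.
pub const PAGE_SIZE: usize = 4096;

/// Hypothetical in-memory model of shadow paging (an illustration only).
pub struct ShadowDb {
    /// Two full copies of the database; `primary` indexes the one readers see.
    copies: [Vec<Vec<u8>>; 2],
    primary: usize,
    /// Pages changed by the last commit, which the current shadow still lacks.
    stale: Vec<usize>,
}

impl ShadowDb {
    pub fn new(num_pages: usize) -> Self {
        let blank = vec![vec![0u8; PAGE_SIZE]; num_pages];
        ShadowDb { copies: [blank.clone(), blank], primary: 0, stale: Vec::new() }
    }

    /// Readers only ever look at the primary copy.
    pub fn read(&self, page_no: usize) -> &[u8] {
        &self.copies[self.primary][page_no]
    }

    /// Applies `writes` (page number, offset, bytes) as one atomic step.
    /// Returns `false` and leaves the primary untouched if a write is out of bounds.
    pub fn commit(&mut self, writes: &[(usize, usize, &[u8])]) -> bool {
        let num_pages = self.copies[0].len();
        let valid = writes.iter().all(|&(page_no, offset, bytes)| {
            page_no < num_pages && offset + bytes.len() <= PAGE_SIZE
        });
        if !valid {
            return false;
        }

        let shadow = 1 - self.primary;
        // 1. Copy the affected pages (plus the ones the shadow missed during
        //    the previous commit) from the primary to the shadow.
        let touched: Vec<usize> = writes.iter().map(|w| w.0).collect();
        for &page_no in self.stale.iter().chain(touched.iter()) {
            let fresh = self.copies[self.primary][page_no].clone();
            self.copies[shadow][page_no] = fresh;
        }
        // 2. Apply the changes to the shadow only; readers still see the primary.
        for &(page_no, offset, bytes) in writes {
            self.copies[shadow][page_no][offset..offset + bytes.len()].copy_from_slice(bytes);
        }
        // 3. A real system would flush the shadow pages to storage here.
        // 4. Flip the pointer: the shadow becomes the primary and vice versa.
        self.primary = shadow;
        self.stale = touched;
        true
    }
}
```

Calling `commit` with a batch of writes either makes all of them visible through `read` or none of them, because readers only follow the `primary` index and that index changes in a single assignment.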

I/O Stalls

Accessing any page could result in an unexpected I/O stall because the database cannot really know whether the page is in memory or not (a blocking page fault is triggered in case it isn’t).

There are a few potential ways to deal with this problem:

  • Pin pages (mlock) which are going to be used in the near future. Sadly there’s a limit on the total memory that a process can pin.
  • Use madvise to hint the OS about the potential access patterns. This is much simpler than the previous alternative, but it offers less control to the developer and the OS is free to ignore the hints.

Error handling

Using mmap makes ensuring data integrity a complicated task: page-level checksums have to be validated on every page access (the OS may have evicted and reloaded the page in the meantime), pointer errors in memory-unsafe languages can corrupt pages that mmap then silently writes back to storage, …
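To give an idea of what a page-level checksum involves, here is a small sketch. The page layout (an 8-byte checksum header followed by the payload), the function names `seal_page` and `open_page`, and the choice of FNV-1a are all assumptions made for brevity; they are not taken from the paper, and real systems tend to use something like CRC32C.

```rust
pub const PAGE_SIZE: usize = 4096;
const CHECKSUM_LEN: usize = 8;

/// FNV-1a (64-bit), picked only because it fits in a few lines.
fn fnv1a(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Builds a page: checksum header first, then the zero-padded payload.
/// Returns `None` if the payload does not fit in one page.
pub fn seal_page(payload: &[u8]) -> Option<Vec<u8>> {
    if payload.len() > PAGE_SIZE - CHECKSUM_LEN {
        return None;
    }
    let mut page = vec![0u8; PAGE_SIZE];
    page[CHECKSUM_LEN..CHECKSUM_LEN + payload.len()].copy_from_slice(payload);
    let sum = fnv1a(&page[CHECKSUM_LEN..]);
    page[..CHECKSUM_LEN].copy_from_slice(&sum.to_le_bytes());
    Some(page)
}

/// Validates the checksum and returns the page body.
/// With mmap this check would have to run on every access to the page.
pub fn open_page(page: &[u8]) -> Result<&[u8], String> {
    if page.len() != PAGE_SIZE {
        return Err(format!("expected {} bytes, got {}", PAGE_SIZE, page.len()));
    }
    let (header, body) = page.split_at(CHECKSUM_LEN);
    let mut stored = [0u8; CHECKSUM_LEN];
    stored.copy_from_slice(header);
    if u64::from_le_bytes(stored) != fnv1a(body) {
        return Err("checksum mismatch: page is corrupted".to_string());
    }
    Ok(body)
}
```

With a buffer pool the database can call something like `open_page` once, right after reading the page from storage. With mmap it cannot tell when the OS has reloaded a page, which is why the check ends up on every access.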

Performance issues

Last but not least, the paper introduces performance as the most significant drawback of mmap’s transparent paging management. All the issues described before could be overcome through careful implementations, but mmap’s bottlenecks cannot be avoided without an OS-level redesign.

In theory, the benefits that mmap brings to the table are:

  • The removal of explicit read/write system calls.
  • The ability to return pointers.
  • Lower memory consumption as the data does not need to be replicated in user space.
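For contrast, this is roughly what the explicit alternative looks like: a tiny user-space buffer pool that issues positional reads and keeps its own copy of each page. It is only a sketch under several assumptions: `BufferPool` and its fields are hypothetical names, it is read-only, it relies on the Unix-only `read_exact_at`, and it evicts in FIFO order as a stand-in for the smarter policies (LRU, clock, …) a real system would use.

```rust
use std::collections::{HashMap, VecDeque};
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // Unix-only: provides `read_exact_at` (pread)

pub const PAGE_SIZE: usize = 4096;

/// Hypothetical read-only buffer pool with FIFO eviction.
pub struct BufferPool {
    file: File,
    capacity: usize,
    /// Pages currently held in user-space memory, keyed by page number.
    frames: HashMap<u64, Vec<u8>>,
    /// Page numbers in load order; the front is evicted first.
    order: VecDeque<u64>,
    pub hits: u64,
    pub misses: u64,
}

impl BufferPool {
    pub fn new(file: File, capacity: usize) -> Self {
        assert!(capacity > 0, "the pool needs at least one frame");
        BufferPool {
            file,
            capacity,
            frames: HashMap::new(),
            order: VecDeque::new(),
            hits: 0,
            misses: 0,
        }
    }

    /// Returns the requested page, reading it from the file on a miss.
    /// Unlike with mmap, the pool knows exactly when it is about to block on I/O.
    pub fn get(&mut self, page_no: u64) -> io::Result<&[u8]> {
        if self.frames.contains_key(&page_no) {
            self.hits += 1;
        } else {
            self.misses += 1;
            // Explicit read system call into a buffer owned by the pool.
            let mut buf = vec![0u8; PAGE_SIZE];
            self.file.read_exact_at(&mut buf, page_no * PAGE_SIZE as u64)?;
            // Make room only after the read succeeded.
            if self.frames.len() >= self.capacity {
                if let Some(victim) = self.order.pop_front() {
                    self.frames.remove(&victim);
                }
            }
            self.frames.insert(page_no, buf);
            self.order.push_back(page_no);
        }
        Ok(self.frames[&page_no].as_slice())
    }
}
```

Every item in the list above shows up here as a cost: an explicit system call per miss, a copy of the page in user space, and a lookup instead of a plain pointer dereference. In exchange, the database decides what gets evicted and when.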

The three main bottlenecks the paper identifies are page table contention, single-threaded page eviction, and TLB shootdowns, the latter being the trickiest problem.

Experimental analysis

To back up the previous claims, the paper presents an experimental analysis that tries to demonstrate the aforementioned issues empirically. The authors used fio with O_DIRECT (to bypass the OS page cache) as their baseline and focused on read-only scenarios with two common access patterns: random reads and sequential scans.

I am just summarizing their results here but if you want to get deeper into the numbers, the paper includes a bunch of different charts.

Random reads

In this scenario their baseline showed that fio could fully saturate the NVMe SSD they were using. mmap, on the other hand, performed significantly worse, even in those scenarios where the hint matched the workload’s access pattern.

Sequential scans

Again, in this scenario fio showed stable performance, while mmap’s performance started to drop once the page cache filled up. A slightly different scenario, where they used 10 SSDs with RAID 0, showed roughly a 20x performance difference between fio and mmap.

Conclusion

The paper makes the case against the use of mmap for file I/O in a DB and presents it in a really accessible way. Even if you don’t particularly enjoy reading papers, I think this one is really comprehensible and easy to read.

I personally think that the paper shows a real and valid set of problems happening in mmap-based systems, but I don’t think that writing your own buffer pool is as easy as presented here.

Here you can find a web page with the link to the paper, the corresponding video and the source code for the benchmarks.

I think this is the first time I have done something like this since I dropped my Ph.D. studies and I am not sure how long this is going to last, but, in case I keep doing it, I hope I can get better at it. The idea is not only to review papers but anything I read and find interesting.

Miguel Ángel Pastor Olivar
Software Architect

I am a proud dad and husband, software architect, speaker, and writer. Passionate reader, chef aficionado, former surf player and current cyclist and runner. I am unsuccessfully pursuing to move my Phd research forward.
