NVM

libpmem library

Ray+hue 2016. 8. 9. 06:41

Original Source: http://pmem.io/nvml/libpmem/


The libpmem library

libpmem provides low level persistent memory support. In particular, support for the persistent memory instructions for flushing changes to pmem is provided.

This library is provided for software which tracks every store to pmem and needs to flush those changes to durability. Most developers will find higher level libraries like libpmemobj to be much more convenient.

The libpmem man page contains a list of the interfaces provided.

libpmem Examples

The Basics

If you’ve decided to handle persistent memory allocation and consistency across program interruption yourself, you will find the functions in libpmem useful. It is important to understand that programming to raw pmem means you must create your own transactions or convince yourself you don’t care if a system or program crash leaves your pmem files in an inconsistent state. Libraries like libpmemobj provide transactional interfaces by building on these libpmem functions, but the interfaces in libpmem are non-transactional.

For this simple example, we’re just going to hard code a pmem file size of 4 kilobytes. The lines above create the file, make sure 4k is allocated, and map the file into memory. This illustrates one of the helper functions in libpmem: pmem_map() which takes a file descriptor and calls mmap(2) to memory map the entire file. Calling mmap() directly will work just fine – the main advantage of pmem_map() is that it tries to find an address where mapping is likely to use large page mappings, for better performance when using large ranges of pmem.

Since the system calls for memory mapping persistent memory are the same as the POSIX calls for memory mapping any file, you may want to write your code to run correctly when given either a pmem file or a file on a traditional file system. For many decades it has been the case that changes written to a memory mapped range of a file may not be persistent until flushed to the media. One common way to do this is using the POSIX call msync(2). If you write your program to memory map a file and use msync() every time you want to flush the changes media, it will work correctly for pmem as well as files on a traditional file system. However, you may find your program performs better if you detect pmem explicitly and use libpmem to flush changes in that case.

The libpmem function pmem_is_pmem() can be used to determine if the memory in the given range is really persistent memory or if it is just a memory mapped file on a traditional file system. Using this call in your program will allow you to decide what to do when given a non-pmem file. Your program could decide to print an error message and exit (for example: “ERROR: This program only works on pmem”). But it seems more likely you will want to save the result of pmem_is_pmem() as shown above, and then use that flag to decide what to do when flushing changes to persistence as later in this example program.

The novel thing about pmem is you can copy to it directly, like any memory. The strcpy()call shown on line 80 above is just the usual libc function that stores a string to memory. If this example program were to be interrupted either during or just after the strcpy() call, you can’t be sure which parts of the string made it all the way to the media. It might be none of the string, all of the string, or somewhere in-between. In addition, there’s no guarantee the string will make it to the media in the order it was stored! For longer ranges, it is just as likely that portions copied later make it to the media before earlier portions. (So don’t write code like the example above and then expect to check for zeros to see how much of the string was written.)

How can a string get stored in seemingly random order? The reason is that until a flush function like msync() has returned successfully, the normal cache pressure that happens on an active system can push changes out to the media at any time in any order. Most processors have barrier instructions (like SFENCE on the Intel platform) but those instructions deal with ordering in the visibility of stores to other threads, not with the order that changes reach persistence. The only barriers for flushing to persistence are functions like msync() or pmem_persist().

this example uses the is_pmem flag it saved from the previous call topmem_is_pmem(). This is the recommended way to use this information rather than callingpmem_is_pmem() each time you want to make changes durable. That’s becausepmem_is_pmem() can have a high overhead, having to search through data structures to ensure the entire range is really persistent memory.

For true pmem, the highlighted line 84 above is the most optimal way to flush changes to persistence. pmem_persist() will, if possible, perform the flush directly from user space, without calling into the OS. This is made possible on the Intel platform using instructions like CLWB and CLFLUSHOPT which are described in Intel’s manuals. Of course you are free to use these instructions directly in your program, but the program will crash with an undefined opcode if you try to use the instructions on a platform that doesn’t support them. This is where libpmem helps you out, by checking the platform capabilities on start-up and choosing the best instructions for each operation it supports.

The above example also uses pmem_msync() for the non-pmem case instead of callingmsync(2) directly. For convenience, the pmem_msync() call is a small wrapper aroundmsync() that ensures the arguments are aligned, as requirement of POSIX.

Buildable source for the libpmem manpage.c example above is available in the NVML repository.

Copying to Persistent Memory

Another feature of libpmem is a set of routines for optimally copying to persistent memory. These functions perform the same functions as the libc functions memcpy()memset(), andmemmove(), but they are optimized for copying to pmem. On the Intel platform, this is done using the non-temporal store instructions which bypass the processor caches (eliminating the need to flush that portion of the data path).

The first copy example, called simple_copy, illustrates how pmem_memcpy() is used.

The highlighted line, line 105 above, shows how pmem_memcpy() is used just like memcpy(3)except that when the destination is pmem, libpmem handles flushing the data to persistence as part of the copy.

Buildable source for the libpmem simple_copy.c example above is available in the NVML repository.

Separating the Flush Steps

There are two steps in flushing to persistence. The first step is to flush the processor caches, or bypass them entirely as explained in the previous example. The second step is to wait for any hardware buffers to drain, to ensure writes have reached the media. These steps are performed together when pmem_persist() is called, or they can be called individually by calling pmem_flush() for the first step and pmem_drain() for the second. Note that either of these steps may be unnecessary on a given platform, and the library knows how to check for that and do the right thing. For example, on Intel platforms, pmem_drain()is an empty function.

When does it make sense to break flushing into steps? This example, called full_copyillustrates one reason you might do this. Since the example copies data using multiple calls to memcpy(), it uses the version of libpmem copy that only performs the flush, postponing the final drain step to the end. This works because unlike the flush step, the drain step does not take an address range – it is a system-wide drain operation so can happen at the end of the loop that copies individual blocks of data.


Original Source: https://software.intel.com/sites/default/files/managed/b4/3a/319433-024.pdf

Efficient cache flushing

CLFLUSHOPT is defined to provide efficient cache flushing. 

CLWB instruction (Cache Line Write Back) writes back modified data of a cacheline similar to CLFLUSHOPT, but avoids invalidating the line from the cache (and instead transitions the line to non-modified state). CLWB attempts to minimize the compulsory cache miss if the same data is accessed temporally after the line is flushed.

Writes back to memory the cache line (if dirty) that contains the linear address specified with the memory operand from any level of the cache hierarchy in the cache coherence domain. The line may be retained in the cache hierarchy in non-modified state. Retaining the line in the cache hierarchy is a performance optimization (treated as a hint by hardware) to reduce the possibility of cache miss on a subsequent access. Hardware may choose to retain the line at any of the levels in the cache hierarchy, and in some cases, may invalidate the line from the cache hierarchy. The source operand is a byte memory location. 

The availability of CLWB instruction is indicated by the presence of the CPUID feature flag CLWB (bit 24 of the EBX register, see “CPUID — CPU Identification” in this chapter). The aligned cache line size affected is also indicated with the CPUID instruction (bits 8 through 15 of the EBX register when the initial value in the EAX register is 1). The memory attribute of the page containing the affected line has no effect on the behavior of this instruction. It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type allowing for speculative reads (such as, the WB, WC, and WT memory types). PREFETCHh instructions can be used to provide the processor with hints for this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, the CLWB instruction is not ordered with respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can be speculatively loaded into a cache line just before, during, or after the execution of a CLWB instruction that references the cache line). CLWB instruction is ordered only by store-fencing operations. For example, software can use an SFENCE, MFENCE, XCHG, or LOCK-prefixed instructions to ensure that previous stores are included in the write-back. CLWB instruction need not be ordered by another CLWB or CLFLUSHOPT instruction. CLWB is implicitly ordered with older stores executed by the logical processor to the same address. For usages that require only writing back modified data from cache lines to memory (do not require the line to be invalidated), and expect to subsequently access the data, software is recommended to use CLWB (with appropriate fencing) instead of CLFLUSH or CLFLUSHOPT for improved performance. Executions of CLWB interact with executions of PCOMMIT. The PCOMMIT instruction operates on certain store-to memory operations that have been accepted to memory. CLWB executed for the same cache line as an older store causes the store to become accepted to memory when the CLWB execution becomes globally visible. The CLWB instruction can be used at all privilege levels and is subject to all permission checking and faults associated with a byte load. Like a load, the CLWB instruction sets the A bit but not the D bit in the page tables. In some implementations, the CLWB instruction may always cause transactional abort with Transactional Synchronization Extensions (TSX). CLWB instruction is not expected to be commonly used inside typical transactional regions. However, programmers must not rely on CLWB instruction to force a transactional abort, since whether they cause transactional abort is implementation dependent.