

libvmmalloc initialization analysis


INTRO.

libvmmalloc makes it possible to allocate and free memory in a PM (persistent memory) region without modifying the application's code. The hook is installed via LD_PRELOAD, and every allocation and free is ultimately handled by jemalloc. A file system path inside the PM region and the size of the pool (arena) must be specified, and the path/pool is initialized the first time the library is hooked in. Because the page cache must be bypassed to access the pool directly, using DAX is strongly recommended.

The initialization has two main parts: (i) setting up the memory mapping for the PM region, and (ii) creating the jemalloc pool. Every memory allocator, jemalloc included, normally builds its heap from memory obtained from the kernel via (s)brk/mmap, that is, from the buddy allocator's free page lists. In pmem.io, however, the PM region is not managed by the buddy allocator; it is accessed through file-mapped I/O, so jemalloc (or any other malloc) cannot set up its initial memory region on its own. The init function therefore has to do everything required to create the memory pool: it prepares the parameters that je_pool_create needs (address and size), giving jemalloc a foundation to work on.

For brevity, error checking and all other ancillary code is omitted.


When the library is first loaded via LD_PRELOAD, libvmmalloc_init is called automatically. All of the code below belongs to libvmmalloc_init.

The first step is common_init, which aligns the log-related memory size to the page size and sets the Valgrind-related options.

common_init(VMMALLOC_LOG_PREFIX, VMMALLOC_LOG_LEVEL_VAR, VMMALLOC_LOG_FILE_VAR, VMMALLOC_MAJOR_VERSION, VMMALLOC_MINOR_VERSION);

Next, the header size placed in front of the jemalloc pool is set.

Header_size = roundup(sizeof(VMEM), Pagesize); 

Then the directory to be used as the PM region and the pool size are checked. getenv reads the directory string and the size from environment variables; if the pool size is smaller than the minimum, the process is aborted. The minimum pool size is 14MB.

if ((env_str = getenv(VMMALLOC_POOL_SIZE_VAR)) == NULL) {
	abort();
} else {
	long long v = atoll(env_str);
	if (v < 0) {
		abort();
	}
	size = (size_t)v;
}

if (size < VMMALLOC_MIN_POOL) {
	abort();
}

#define VMMALLOC_MIN_POOL ((size_t)(1024 * 1024 * 14)) /* min pool size: 14MB */

With the path configured and the pool size validated, the next step is to create a file in that directory and perform the actual mapping. All of this happens in libvmmalloc_create, which creates the pool and returns a pointer to a vmp structure filled with valid values.

Vmp = libvmmalloc_create(Dir, size); 

Now for the most important function, libvmmalloc_create. It first rounds the requested size up to whole pages (1), creates a temporary file (vmem.XXXXXX) in the given directory (2), and preallocates blocks for the required size with posix_fallocate (3). posix_fallocate does not actually write any blocks: the file system searches its bitmap for free blocks, assigns them to the file's inode, and marks them as unwritten. util_map then performs the actual mapping (4). The mapping only reserves virtual addresses for the pool; no access to that memory takes place yet. Next, the virtual memory pool (vmp) structure is filled in with the header, base address, size, and so on (5). At this point everything needed to create the jemalloc pool is ready: je_vmem_pool_create, an alias of je_pool_create provided by the jemalloc library, is called with the base address of the usable memory (excluding the header) and the pool size (6). Finally, the header region is protected so that it cannot be accessed by accident (7).

static VMEM *
libvmmalloc_create(const char *dir, size_t size)
{
1.	size = roundup(size, Pagesize);
2.	Fd = util_tmpfile(dir, "/vmem.XXXXXX");
3.	if ((errno = posix_fallocate(Fd, 0, (off_t)size)) != 0)
4.	if ((addr = util_map(Fd, size, 0, 4 << 20)) == NULL)
5.1	struct vmem *vmp = addr;
5.2	memset(&vmp->hdr, '\0', sizeof(vmp->hdr));
5.3	memcpy(vmp->hdr.signature, VMEM_HDR_SIG, POOL_HDR_SIG_LEN);
5.4	vmp->addr = addr;
5.5	vmp->size = size;
5.6	vmp->caller_mapped = 0;
6.	if (je_vmem_pool_create((void *)((uintptr_t)addr + Header_size), ...
7.	util_range_none(addr, sizeof(struct pool_hdr));

	return vmp;
}


The key point is that neither jemalloc nor any other allocation library can use the PM region on its own; this groundwork is what lets an existing allocator treat it as ordinary memory. From then on, the hooked malloc hands out memory from that pool (arena).
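To make the hand-off concrete, below is a rough sketch of how the interposed malloc/free could forward into the jemalloc pool once Vmp is set. The pool-aware entry points je_vmem_pool_malloc/je_vmem_pool_free, and the absence of locking, fork handling, and error paths, make this an illustration rather than libvmmalloc's actual code.

/* Illustrative sketch only: routing LD_PRELOAD'ed malloc/free into the
 * jemalloc pool created above. The je_vmem_pool_* entry points are assumed;
 * real libvmmalloc also handles locking, fork(), and error cases. */
#include <stddef.h>
#include <stdint.h>

typedef struct pool_s pool_t;                 /* opaque jemalloc pool handle */
extern void *je_vmem_pool_malloc(pool_t *pool, size_t size);
extern void je_vmem_pool_free(pool_t *pool, void *ptr);

extern VMEM *Vmp;                             /* set by libvmmalloc_init() */
extern size_t Header_size;                    /* roundup(sizeof(VMEM), Pagesize) */

static inline pool_t *vmp_pool(void)
{
	/* usable pool memory starts right after the VMEM header */
	return (pool_t *)((uintptr_t)Vmp + Header_size);
}

void *malloc(size_t size)
{
	return je_vmem_pool_malloc(vmp_pool(), size);
}

void free(void *ptr)
{
	je_vmem_pool_free(vmp_pool(), ptr);
}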


LD_PRELOAD rootkits

NVM · 2016. 8. 26. 08:49


original source: http://hyunmini.tistory.com/55

original source: https://blog.gopheracademy.com/advent-2015/libc-hooking-go-shared-libraries/


How LD_PRELOAD rootkits work

An LD_PRELOAD rootkit works by implementing alternative versions of functions provided by the libc library that many Unix binaries link to dynamically. Using these ‘hooks’, the rootkit can evade detection by modifying the behaviour of the functions and bypassing authentication mechanisms, e.g. PAM, to provide an attacker with a backdoor such as an SSH login using credentials configured by the rootkit.

For example, the Azazel rootkit hooks into the fopen function and conceals evidence of network activity or files related to the rootkit. If there is nothing to hide, Azazel invokes the original libc function so that the application behaves as normal from the user’s perspective.

Using LD_PRELOAD to hook into other libraries is an old trick and can usefully be used for debugging applications, especially when you don’t have access to an application’s source code.
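As a concrete, deliberately simplified illustration of the mechanism (not Azazel's code), the sketch below shows the usual pattern: a preloaded library overrides fopen, filters what it wants to hide, and falls through to the real libc implementation obtained via dlsym(RTLD_NEXT, ...). The "rootkit" substring used for filtering is just a placeholder.

/* hook.c - illustrative fopen hook
 * build: gcc -shared -fPIC -o hook.so hook.c -ldl
 * inject: LD_PRELOAD=./hook.so some_command */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

FILE *fopen(const char *path, const char *mode)
{
	/* look up the next (real) fopen in the library search order */
	FILE *(*real_fopen)(const char *, const char *) =
		(FILE *(*)(const char *, const char *))dlsym(RTLD_NEXT, "fopen");

	/* pretend files whose names contain "rootkit" do not exist */
	if (strstr(path, "rootkit") != NULL)
		return NULL;

	/* otherwise behave exactly like libc */
	return real_fopen(path, mode);
}

Once injected with LD_PRELOAD, every fopen in the target process passes through this filter first, which is exactly the hiding technique described above.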

LD_PRELOAD

When a process loads its shared libraries, if the LD_PRELOAD variable is set, the libraries listed in it are loaded first; if one of them defines a function with the same name as a libc function, that version is called instead of the libc one. In other words, it performs the hooking for you automatically.

Note: preload environment variables by OS


Linux : LD_PRELOAD 

AIX : LDR_PRELOAD

Solaris : LD_PRELOAD

FreeBSD : LD_PRELOAD

Let's go through a simple example to make the concept concrete.

$ ls

secuholic  test1  test2  test3

Running ls lists the files in the current directory. Let's make the file secuholic disappear from that listing.

$ ltrace ls
         ...
strcpy(0x08058758, "test1")                                  = 0x08058758
readdir64(0x08057720, 0x08057700, 0xbffffb84, 1, 0x0805777c) = 0x08057794
malloc(6)                                                    = 0x08058768
strcpy(0x08058768, "test2")                                  = 0x08058768
readdir64(0x08057720, 0x08057700, 0xbffffb84, 1, 0x08057794) = 0x080577ac
malloc(6)                                                    = 0x08058778
strcpy(0x08058778, "test3")                                  = 0x08058778
readdir64(0x08057720, 0x08057700, 0xbffffb84, 1, 0x080577ac) = 0x080577c4
malloc(10)                                                   = 0x08058788
strcpy(0x08058788, "secuholic")                              = 0x08058788
readdir64(0x08057720, 0x08057700, 0xbffffb84, 1, 0x080577c4) = 0x080577e0
malloc(7)                                                    = 0x08058798
         ...
secuholic  test1  test2  test3

During the string handling in the middle, strcpy is called; if we hook it, compare src against "secuholic", and tamper with the copy when it matches, the file disappears from the listing.

$ cat test.c
#include <stdio.h>
#include <string.h>

char *strcpy(char *dest, const char *src)
{
	int i = 0;

	while (src[i] != '\0') {
		dest[i] = src[i];
		i++;
	}
	dest[i] = '\0';
	printf("[hooked] : strcpy(%x,\"%s\")\n", dest, src);
	return &dest[0];
}

$ LD_PRELOAD=./hook.so ls
[hooked] : strcpy(8054a48,"xterm-redhat")
[hooked] : strcpy(8054c18,"xterm-xfree86")
[hooked] : strcpy(bffffa87,"46")
[hooked] : strcpy(8054a4c,"li#46")
[hooked] : strcpy(bffffa87,"98")
[hooked] : strcpy(8054c1c,"co#98")
[hooked] : strcpy(8054fa0,"no=00:fi=00:di=01;34:ln=")
[hooked] : strcpy(80549b8,".")
[hooked] : strcpy(80549c8,"test1")
[hooked] : strcpy(80549d8,"test2")
[hooked] : strcpy(80549e8,"test3")
[hooked] : strcpy(80549f8,"secuholic")   // the "secuholic" string being copied
[hooked] : strcpy(8054b28,"test.c")
[hooked] : strcpy(8054b38,"hook.so")
hook.so  secuholic  test.c  test1  test2  test3


Now let's modify that part.


$ cat test.c
#include <stdio.h>
#include <string.h>

char *strcpy(char *dest, const char *src)
{
	int i = 0;

	if (strcmp(src, "secuholic") == 0) {
		dest[i] = '\0';
		return &dest[0];	/* return immediately when src is "secuholic" */
	}

	while (src[i] != '\0') {
		dest[i] = src[i];
		i++;
	}
	dest[i] = '\0';
	// printf("[hooked] : strcpy(%x,\"%s\")\n", dest, src);
	return &dest[0];
}

gcc -shared -fPIC -o hook.so test.c


$ ls
 hook.so secuholic test.c test1 test2 test3   // secuholic is listed

$ LD_PRELOAD=./hook.so ls
 hook.so test.c test1 test2 test3   // secuholic is hidden

This shows how easily hooking can be done. LD_PRELOAD has a few limitations: it does not work on setuid binaries, and it cannot affect other users' processes. Still, it is a useful technique precisely because hooking is this simple. :)


libvmmalloc

NVM · 2016. 8. 24. 08:11

original source: http://pmem.io/nvml/libvmmalloc/libvmmalloc.3.html

libvmmalloc

INTRO.

libvmmalloc, when injected into an unmodified program, lets the allocation functions listed in the SYNOPSIS allocate memory from an NVM region. Whereas ordinary allocations come from the heap, libvmmalloc serves them from an NVM region through memory-mapped file I/O. To do this, environment variables must specify a directory in the NVM region and the size of the file that will back the memory pool. The file region is mapped and initialized when the library is loaded, and allocations are served from the addresses of that mapping; the allocator itself is based on jemalloc.

EXAMPLE: LD_PRELOAD=./libvmmalloc.so VMMALLOC_POOL_DIR=/mnt/mem VMMALLOC_POOL_SIZE=1073741824 ./ma

NAME

libvmmalloc − general purpose volatile memory allocation library

SYNOPSIS

$ LD_PRELOAD=libvmmalloc.so command [ args... ]

or

#include <stdlib.h>
#include <malloc.h>
#include <libvmmalloc.h>

$ cc [ flag... ] file... -lvmmalloc [ library... ]

void *malloc(size_t size);
void free(void *ptr);
void *calloc(size_t number, size_t size);
void *realloc(void *ptr, size_t size);

int posix_memalign(void **memptr, size_t alignment, size_t size);
void *aligned_alloc(size_t alignment, size_t size);
void *memalign(size_t alignment, size_t size);
void *valloc(size_t size);
void *pvalloc(size_t size);

size_t malloc_usable_size(const void *ptr);
void cfree(void *ptr);

DESCRIPTION

libvmmalloc transparently converts all the dynamic memory allocations into Persistent Memory allocations. The typical usage of libvmmalloc does not require any modification of the target program. It is enough to load libvmmalloc before all other libraries by setting the environment variable LD_PRELOAD. When used in that way, libvmmalloc interposes the standard system memory allocation routines, as defined in malloc(3), posix_memalign(3) and malloc_usable_size(3), and provides that all dynamic memory allocations are made from a memory pool built on a memory-mapped file, instead of a system heap. The memory managed by libvmmalloc may have different attributes, depending on the file system containing the memory-mapped file. In particular, libvmmalloc is part of the Non-Volatile Memory Library because it is sometimes useful to use non-volatile memory as a volatile memory pool, leveraging its capacity, cost, or performance characteristics.

libvmmalloc transparently turns dynamic memory allocations into persistent-memory allocations. Using it does not require modifying the program: setting LD_PRELOAD loads libvmmalloc ahead of all other libraries, which interposes it on the standard system memory allocation routines.

malloc(3), posix_memalign(3), malloc_usable_size(3) and friends then allocate from a memory pool built on a memory-mapped file instead of the system heap. The attributes of the memory libvmmalloc manages depend on the file system holding that file; in particular, NVM can be used as an ordinary volatile memory pool.

libvmmalloc may be also linked to the program, by providing the vmmalloc argument to the linker. Then it becomes the default memory allocator for given program.

NOTE: Due to the fact the library operates on a memory-mapped file, it may not work properly with the programs that perform fork(3) not followed by exec(3).

There are two variants of experimental fork() support available in libvmmalloc. The desired library behavior may be selected by setting VMMALLOC_FORK environment variable. By default variant #1 is enabled. See ENVIRONMENT section for more details.

libvmmalloc uses the mmap(2) system call to create a pool of volatile memory. The library is most useful when used with Direct Access storage (DAX), which is memory-addressable persistent storage that supports load/store access without being paged via the system page cache. A Persistent Memory-aware file system is typically used to provide this type of access. Memory-mapping a file from a Persistent Memory-aware file system provides the raw memory pools, and this library supplies the traditional malloc interfaces on top of those pools.

libvmmalloc creates the volatile memory pool with the mmap(2) system call. The library works best with DAX, where load/store access bypasses the page cache. A persistent-memory-aware file system typically provides this kind of access: memory-mapping a file on such a file system yields the raw memory pool, and this library supplies the traditional malloc interfaces on top of it.

The memory pool acting as a system heap replacement is created automatically at the library initialization time. User may control its location and size by setting the environment variables described in ENVIRONMENT section. The allocated file space is reclaimed when process terminates or in case of system crash.

The memory pool that replaces the system heap is created automatically when the library initializes. The user controls its location and size through the environment variables described in the ENVIRONMENT section.

Under normal usage, libvmmalloc will never print messages or intentionally cause the process to exit. The library uses pthreads(7) to be fully MT-safe, but never creates or destroys threads itself. The library does not make use of any signals, networking, and never calls select() or poll().

ENVIRONMENT

There are two configuration variables that must be set to make libvmmalloc work properly. If any of them is not specified, or if their values are not valid, the library prints the appropriate error message and terminates the process.

The following two environment variables are required by libvmmalloc; if either is missing or invalid, the library prints an error message and terminates the process.


VMMALLOC_POOL_DIR

Specifies a path to directory where the memory pool file should be created. The directory must exist and be writable.

Specifies the path to the directory where the memory pool file is created; the directory must exist and be writable.


VMMALLOC_POOL_SIZE

Defines the desired size (in bytes) of the memory pool file. It must be not less than the minimum allowed size VMMALLOC_MIN_POOL as defined in <libvmmalloc.h>. Note that due to the fact the library adds some metadata to the memory pool, the amount of actual usable space is typically less than the size of the memory pool file.

Specifies the size of the memory pool file; it must be at least VMMALLOC_MIN_POOL. Because the library stores some metadata in the pool, the usable space is somewhat less than the file size.





libpmem library

NVM · 2016. 8. 9. 06:41

Original Source: http://pmem.io/nvml/libpmem/


The libpmem library

libpmem provides low level persistent memory support. In particular, support for the persistent memory instructions for flushing changes to pmem is provided.

This library is provided for software which tracks every store to pmem and needs to flush those changes to durability. Most developers will find higher level libraries like libpmemobj to be much more convenient.

The libpmem man page contains a list of the interfaces provided.

libpmem Examples

The Basics

If you’ve decided to handle persistent memory allocation and consistency across program interruption yourself, you will find the functions in libpmem useful. It is important to understand that programming to raw pmem means you must create your own transactions or convince yourself you don’t care if a system or program crash leaves your pmem files in an inconsistent state. Libraries like libpmemobj provide transactional interfaces by building on these libpmem functions, but the interfaces in libpmem are non-transactional.

For this simple example, we just hard-code a pmem file size of 4 kilobytes. The first lines of the example create the file, make sure 4KB is allocated, and map the file into memory. This illustrates one of the helper functions in libpmem: pmem_map(), which takes a file descriptor and calls mmap(2) to memory map the entire file. Calling mmap() directly will work just fine – the main advantage of pmem_map() is that it tries to find an address where the mapping is likely to use large page mappings, for better performance when using large ranges of pmem.

Since the system calls for memory mapping persistent memory are the same as the POSIX calls for memory mapping any file, you may want to write your code to run correctly when given either a pmem file or a file on a traditional file system. For many decades it has been the case that changes written to a memory mapped range of a file may not be persistent until flushed to the media. One common way to do this is using the POSIX call msync(2). If you write your program to memory map a file and use msync() every time you want to flush the changes to the media, it will work correctly for pmem as well as files on a traditional file system. However, you may find your program performs better if you detect pmem explicitly and use libpmem to flush changes in that case.

The libpmem function pmem_is_pmem() can be used to determine if the memory in the given range is really persistent memory or if it is just a memory mapped file on a traditional file system. Using this call in your program will allow you to decide what to do when given a non-pmem file. Your program could decide to print an error message and exit (for example: “ERROR: This program only works on pmem”). But it seems more likely you will want to save the result of pmem_is_pmem(), and then use that flag to decide what to do when flushing changes to persistence later in the program.

The novel thing about pmem is you can copy to it directly, like any memory. The strcpy() call in the example is just the usual libc function that stores a string to memory. If this example program were to be interrupted either during or just after the strcpy() call, you can't be sure which parts of the string made it all the way to the media. It might be none of the string, all of the string, or somewhere in-between. In addition, there's no guarantee the string will make it to the media in the order it was stored! For longer ranges, it is just as likely that portions copied later make it to the media before earlier portions. (So don't write code like the example above and then expect to check for zeros to see how much of the string was written.)

How can a string get stored in seemingly random order? The reason is that until a flush function like msync() has returned successfully, the normal cache pressure that happens on an active system can push changes out to the media at any time in any order. Most processors have barrier instructions (like SFENCE on the Intel platform) but those instructions deal with ordering in the visibility of stores to other threads, not with the order that changes reach persistence. The only barriers for flushing to persistence are functions like msync() or pmem_persist().

The example uses the is_pmem flag it saved from the earlier call to pmem_is_pmem(). This is the recommended way to use this information rather than calling pmem_is_pmem() each time you want to make changes durable. That's because pmem_is_pmem() can have a high overhead, having to search through data structures to ensure the entire range is really persistent memory.

For true pmem, calling pmem_persist() is the most optimal way to flush changes to persistence. It will, if possible, perform the flush directly from user space, without calling into the OS. This is made possible on the Intel platform using instructions like CLWB and CLFLUSHOPT which are described in Intel's manuals. Of course you are free to use these instructions directly in your program, but the program will crash with an undefined opcode if you try to use the instructions on a platform that doesn't support them. This is where libpmem helps you out, by checking the platform capabilities on start-up and choosing the best instructions for each operation it supports.

The above example also uses pmem_msync() for the non-pmem case instead of calling msync(2) directly. For convenience, the pmem_msync() call is a small wrapper around msync() that ensures the arguments are aligned, as required by POSIX.

Buildable source for the libpmem manpage.c example above is available in the NVML repository.
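As a reference point, here is a compact sketch of that flow using the current libpmem API (pmem_map_file supersedes the pmem_map(fd) helper mentioned above); the path and the 4KB size are arbitrary choices for illustration.

/* Minimal sketch of the manpage.c flow described above.
 * Build: cc basic.c -lpmem */
#include <stdio.h>
#include <string.h>
#include <libpmem.h>

#define PMEM_LEN 4096

int main(void)
{
	char *pmemaddr;
	size_t mapped_len;
	int is_pmem;

	/* create the file, allocate 4 KB, and memory map it */
	pmemaddr = pmem_map_file("/pmem-fs/myfile", PMEM_LEN,
			PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
	if (pmemaddr == NULL)
		return 1;

	/* store a string directly into the mapped persistent memory */
	strcpy(pmemaddr, "hello, persistent memory");

	/* flush to the media: user-space flush for real pmem,
	 * the msync(2) wrapper otherwise */
	if (is_pmem)
		pmem_persist(pmemaddr, PMEM_LEN);
	else
		pmem_msync(pmemaddr, PMEM_LEN);

	pmem_unmap(pmemaddr, mapped_len);
	return 0;
}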

Copying to Persistent Memory

Another feature of libpmem is a set of routines for optimally copying to persistent memory. These functions perform the same functions as the libc functions memcpy(), memset(), and memmove(), but they are optimized for copying to pmem. On the Intel platform, this is done using the non-temporal store instructions which bypass the processor caches (eliminating the need to flush that portion of the data path).

The first copy example, called simple_copy, illustrates how pmem_memcpy() is used.

The key line of the example shows how pmem_memcpy() is used just like memcpy(3), except that when the destination is pmem, libpmem handles flushing the data to persistence as part of the copy.

Buildable source for the libpmem simple_copy.c example above is available in the NVML repository.
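The essence of that example, under the same assumptions as the sketch above (a mapped pmem destination, the saved is_pmem flag, and a DRAM buffer), looks roughly like this; pmem_memcpy_persist is the current name of the copy-and-flush routine.

/* Sketch of the simple_copy idea: when the destination is real pmem, the
 * copy routine flushes as part of the copy; otherwise fall back to memcpy(3)
 * plus the msync(2) wrapper. */
#include <string.h>
#include <libpmem.h>

static void copy_block(char *pmemaddr, int is_pmem, const char *buf, size_t len)
{
	if (is_pmem) {
		pmem_memcpy_persist(pmemaddr, buf, len);  /* copy + flush + drain */
	} else {
		memcpy(pmemaddr, buf, len);
		pmem_msync(pmemaddr, len);
	}
}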

Separating the Flush Steps

There are two steps in flushing to persistence. The first step is to flush the processor caches, or bypass them entirely as explained in the previous example. The second step is to wait for any hardware buffers to drain, to ensure writes have reached the media. These steps are performed together when pmem_persist() is called, or they can be called individually by calling pmem_flush() for the first step and pmem_drain() for the second. Note that either of these steps may be unnecessary on a given platform, and the library knows how to check for that and do the right thing. For example, on Intel platforms, pmem_drain() is an empty function.

When does it make sense to break flushing into steps? This example, called full_copy, illustrates one reason you might do this. Since the example copies data using multiple calls to memcpy(), it uses the version of the libpmem copy that only performs the flush, postponing the final drain step to the end. This works because unlike the flush step, the drain step does not take an address range – it is a system-wide drain operation so can happen at the end of the loop that copies individual blocks of data.
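Under the same assumptions, the loop structure of full_copy can be sketched as below: each block is copied with the no-drain variant, and a single pmem_drain() closes the whole transfer.

/* Sketch of the full_copy idea: flush per block during the copy, drain once
 * at the end. srcfd is an ordinary file descriptor; pmemaddr points into a
 * mapped pmem range large enough for the data (see the first sketch). */
#include <unistd.h>
#include <libpmem.h>

static void copy_file_to_pmem(int srcfd, char *pmemaddr)
{
	char buf[4096];
	ssize_t cc;

	while ((cc = read(srcfd, buf, sizeof(buf))) > 0) {
		/* copy and flush this block, but postpone the drain */
		pmem_memcpy_nodrain(pmemaddr, buf, (size_t)cc);
		pmemaddr += cc;
	}

	/* one system-wide drain after the loop */
	pmem_drain();
}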


Original Source: https://software.intel.com/sites/default/files/managed/b4/3a/319433-024.pdf

Efficient cache flushing

CLFLUSHOPT is defined to provide efficient cache flushing. 

CLWB instruction (Cache Line Write Back) writes back modified data of a cacheline similar to CLFLUSHOPT, but avoids invalidating the line from the cache (and instead transitions the line to non-modified state). CLWB attempts to minimize the compulsory cache miss if the same data is accessed temporally after the line is flushed.

Writes back to memory the cache line (if dirty) that contains the linear address specified with the memory operand from any level of the cache hierarchy in the cache coherence domain. The line may be retained in the cache hierarchy in non-modified state. Retaining the line in the cache hierarchy is a performance optimization (treated as a hint by hardware) to reduce the possibility of cache miss on a subsequent access. Hardware may choose to retain the line at any of the levels in the cache hierarchy, and in some cases, may invalidate the line from the cache hierarchy. The source operand is a byte memory location. 

The availability of CLWB instruction is indicated by the presence of the CPUID feature flag CLWB (bit 24 of the EBX register, see “CPUID — CPU Identification” in this chapter). The aligned cache line size affected is also indicated with the CPUID instruction (bits 8 through 15 of the EBX register when the initial value in the EAX register is 1). The memory attribute of the page containing the affected line has no effect on the behavior of this instruction. It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type allowing for speculative reads (such as, the WB, WC, and WT memory types). PREFETCHh instructions can be used to provide the processor with hints for this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, the CLWB instruction is not ordered with respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can be speculatively loaded into a cache line just before, during, or after the execution of a CLWB instruction that references the cache line). CLWB instruction is ordered only by store-fencing operations. For example, software can use an SFENCE, MFENCE, XCHG, or LOCK-prefixed instructions to ensure that previous stores are included in the write-back. CLWB instruction need not be ordered by another CLWB or CLFLUSHOPT instruction. CLWB is implicitly ordered with older stores executed by the logical processor to the same address. For usages that require only writing back modified data from cache lines to memory (do not require the line to be invalidated), and expect to subsequently access the data, software is recommended to use CLWB (with appropriate fencing) instead of CLFLUSH or CLFLUSHOPT for improved performance. Executions of CLWB interact with executions of PCOMMIT. The PCOMMIT instruction operates on certain store-to memory operations that have been accepted to memory. CLWB executed for the same cache line as an older store causes the store to become accepted to memory when the CLWB execution becomes globally visible. The CLWB instruction can be used at all privilege levels and is subject to all permission checking and faults associated with a byte load. Like a load, the CLWB instruction sets the A bit but not the D bit in the page tables. In some implementations, the CLWB instruction may always cause transactional abort with Transactional Synchronization Extensions (TSX). CLWB instruction is not expected to be commonly used inside typical transactional regions. However, programmers must not rely on CLWB instruction to force a transactional abort, since whether they cause transactional abort is implementation dependent.
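For illustration only, flushing a range with the CLWB intrinsic followed by SFENCE might look like the sketch below (assuming 64-byte cache lines, a CPU that reports the CLWB feature flag, and compilation with -mclwb); in practice libpmem's pmem_flush()/pmem_persist() perform this capability check and instruction selection for you.

/* Hand-rolled cache-line write-back for a range, per the description above. */
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

#define CACHELINE 64UL

static inline void flush_clwb(const void *addr, size_t len)
{
	uintptr_t p;

	/* write back every cache line touched by [addr, addr+len) */
	for (p = (uintptr_t)addr & ~(CACHELINE - 1);
			p < (uintptr_t)addr + len; p += CACHELINE)
		_mm_clwb((void *)p);

	/* fence so the write-backs are ordered with respect to later stores */
	_mm_sfence();
}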




libnvdimm

NVM · 2016. 8. 5. 10:46

Original Source: 

https://github.com/torvalds/linux/blob/master/Documentation/nvdimm/nvdimm.txt

 LIBNVDIMM: Non-Volatile Devices

     libnvdimm - kernel / libndctl - userspace helper library

  • Glossary
  • Overview
  •    Supporting Documents
  • LIBNVDIMM PMEM and BLK
  • Why BLK?
  •    PMEM vs BLK
  •        BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
  • Example NVDIMM Platform
  • LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
  •    LIBNDCTL: Context
  •        libndctl: instantiate a new library context example
  •    LIBNVDIMM/LIBNDCTL: Bus
  •        libnvdimm: control class device in /sys/class
  •        libnvdimm: bus
  •        libndctl: bus enumeration example
  •    LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
  •        libnvdimm: DIMM (NMEM)
  •        libndctl: DIMM enumeration example
  •    LIBNVDIMM/LIBNDCTL: Region
  •        libnvdimm: region
  •        libndctl: region enumeration example
  •        Why Not Encode the Region Type into the Region Name?
  •        How Do I Determine the Major Type of a Region?
  •    LIBNVDIMM/LIBNDCTL: Namespace
  •        libnvdimm: namespace
  •        libndctl: namespace enumeration example
  •        libndctl: namespace creation example
  •        Why the Term "namespace"?
  •    LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
  •        libnvdimm: btt layout
  •        libndctl: btt creation example
  • Summary LIBNDCTL Diagram


Glossary

  • PMEM: A system-physical-address range where writes are persistent. A block device composed of PMEM is capable of DAX. A PMEM address range may span an interleave of several DIMMs.
  • BLK: A set of one or more programmable memory mapped apertures provided by a DIMM to access its media. This indirection precludes the performance benefit of interleaving, but enables DIMM-bounded failure modes.
  • DPA: DIMM Physical Address, is a DIMM-relative offset.  With one DIMM in the system there would be a 1:1 system-physical-address: DPA association. Once more DIMMs are added a memory controller interleave must be decoded to determine the DPA associated with a given system-physical-address.  BLK capacity always has a 1:1 relationship with a single-DIMM's DPA range.
  • DAX: File system extensions to bypass the page cache and block layer to mmap persistent memory, from a PMEM block device, directly into a process address space.
  • DSM: Device Specific Method: ACPI method to control a specific device - in this case the firmware.
  • DCR: NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5. It defines a vendor-id, device-id, and interface format for a given DIMM.
  • BTT: Block Translation Table: Persistent memory is byte addressable. Existing software may have an expectation that the power-fail-atomicity of writes is at least one sector, 512 bytes.  The BTT is an indirection table with atomic update semantics to front a PMEM/BLK block device driver and present arbitrary atomic sector sizes.
  • LABEL: Metadata stored on a DIMM device that partitions and identifies (persistently names) storage between PMEM and BLK.  It also partitions BLK storage to host BTTs with different parameters per BLK-partition.
  • Note that traditional partition tables, GPT/MBR, are layered on top of a BLK or PMEM device.

Overview

The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely, PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM and BLK mode access.  These three modes of operation are described by the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6.  While the LIBNVDIMM implementation is generic and supports pre-NFIT platforms, it was guided by the superset of capabilities needed to support this ACPI 6 definition for NVDIMM resources. The bulk of the kernel implementation is in place to handle the case where DPA accessible via PMEM is aliased with DPA accessible via BLK.  When that occurs, a LABEL is needed to reserve DPA for exclusive access via one mode at a time.

Supporting Documents

ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf

NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf

DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf


Git Trees

LIBNVDIMM: https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git

LIBNDCTL: https://github.com/pmem/ndctl.git

PMEM: https://github.com/01org/prd


LIBNVDIMM PMEM and BLK

Prior to the arrival of the NFIT, non-volatile memory was described to a system in various ad-hoc ways.  Usually only the bare minimum was provided, namely, a single system-physical-address range where writes are expected to be durable after a system power loss.  Now, the NFIT specification standardizes not only the description of PMEM, but also BLK and platform message-passing entry points for control and configuration.

For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block device driver:

1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This range is contiguous in system memory and may be interleaved (hardware memory controller striped) across multiple DIMMs.  When interleaved the platform may optionally provide details of which DIMMs are participating in the interleave.

Note that while LIBNVDIMM describes system-physical-address ranges that may alias with BLK access as ND_NAMESPACE_PMEM ranges and those without alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no distinction.  The different device-types are an implementation detail that userspace can exploit to implement policies like "only interface with address ranges from certain DIMMs".  It is worth noting that when aliasing is present and a DIMM lacks a label, then no block device can be created by default as userspace needs to do at least one allocation of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once registered, can be immediately attached to nd_pmem.

2. BLK (nd_blk.ko): This driver performs I/O using a set of platform defined apertures.  A set of apertures will access just one DIMM. Multiple windows (apertures) allow multiple concurrent accesses, much like tagged-command-queuing, and would likely be used by different threads or different CPUs.

The NFIT specification defines a standard format for a BLK-aperture, but the spec also allows for vendor specific layouts, and non-NFIT BLK implementations may have other designs for BLK I/O.  For this reason "nd_blk" calls back into platform-specific code to perform the I/O. One such implementation is defined in the "Driver Writer's Guide" and "DSM Interface Example".

Why BLK?

While PMEM provides direct byte-addressable CPU-load/store access to NVDIMM storage, it does not provide the best system RAS (recovery, availability, and serviceability) model.  An access to a corrupted system-physical-address causes a CPU exception, while an access to a corrupted address through a BLK-aperture causes that block window to raise an error status in a register. The latter is more aligned with the standard error model that host-bus-adapter attached disks present. Also, if an administrator ever wants to replace a memory module, it is easier to service a system at DIMM module boundaries.  Compare this to PMEM, where data could be interleaved in an opaque hardware-specific manner across several DIMMs.

PMEM vs BLK

BLK-apertures solve these RAS problems, but their presence is also the major contributing factor to the complexity of the ND subsystem. They complicate the implementation because PMEM and BLK alias in DPA space. Any given DIMM's DPA-range may contribute to one or more system-physical-address sets of interleaved DIMMs, *and* may also be accessed in its entirety through its BLK-aperture. Accessing a DPA through a system-physical-address while simultaneously accessing the same DPA through a BLK-aperture has undefined results. For this reason, DIMMs with this dual interface configuration include a DSM function to store/retrieve a LABEL.  The LABEL effectively partitions the DPA-space into exclusive system-physical-address and BLK-aperture accessible regions.  For simplicity a DIMM is allowed a PMEM "region" per each interleave set in which it is a member. The remaining DPA space can be carved into an arbitrary number of BLK devices with discontiguous extents.


BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX

One of the few reasons to allow multiple BLK namespaces per REGION is so that each BLK-namespace can be configured with a BTT with unique atomic sector sizes.  While a PMEM device can host a BTT, the LABEL specification does not provide for a sector size to be specified for a PMEM namespace. This is due to the expectation that the primary usage model for PMEM is via DAX, and the BTT is incompatible with DAX.  However, for the cases where an application or filesystem still needs atomic sector update guarantees it can register a BTT on a PMEM device or partition. See LIBNVDIMM/NDCTL: Block Translation Table "btt"


LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API

What follows is a description of the LIBNVDIMM sysfs layout and a corresponding object hierarchy diagram as viewed through the LIBNDCTL API.  The example sysfs paths and diagrams are relative to the Example NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit test.

LIBNDCTL: Context

Every API call in the LIBNDCTL library requires a context that holds the logging parameters and other library instance state.  The library is based on the libabc template:

https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git

LIBNDCTL: instantiate a new library context example

struct ndctl_ctx *ctx;

if (ndctl_new(&ctx) == 0)
	return ctx;
else
	return NULL;


LIBNVDIMM/LIBNDCTL: Bus

A bus has a 1:1 relationship with an NFIT.  The current expectation for ACPI based systems is that there is only ever one platform-global NFIT. That said, it is trivial to register multiple NFITs, the specification does not preclude it.  The infrastructure supports multiple buses and we use this capability to test multiple NFIT configurations in the unit test.

LIBNVDIMM: control class device in /sys/class

This character device accepts DSM messages to be passed to DIMM identified by its NFIT handle.

/sys/class/nd/ndctl0

|-- dev

|-- device -> ../../../ndbus0

|-- subsystem -> ../../../../../../../class/nd

LIBNVDIMM: bus

struct nvdimm_bus *nvdimm_bus_register(struct device *parent,

      struct nvdimm_bus_descriptor *nfit_desc);


LIBNDCTL: bus enumeration example

Find the bus handle that describes the bus from the Example NVDIMM Platform.

static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
		const char *provider)
{
	struct ndctl_bus *bus;

	ndctl_bus_foreach(ctx, bus)
		if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
			return bus;

	return NULL;
}

bus = get_bus_by_provider(ctx, "nfit_test.0");


LIBNVDIMM/LIBNDCTL: DIMM (NMEM)

The DIMM device provides a character device for sending commands to hardware, and it is a container for LABELs.  If the DIMM is defined by NFIT then an optional 'nfit' attribute sub-directory is available to add NFIT-specifics.

Note that the kernel device name for "DIMMs" is "nmemX".  The NFIT describes these devices via "Memory Device to System Physical Address Range Mapping Structure", and there is no requirement that they actually be physical DIMMs, so we use a more generic name.

LIBNVDIMM: DIMM (NMEM)

struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
		const struct attribute_group **groups, unsigned long flags,
		unsigned long *dsm_mask);


LIBNDCTL: DIMM enumeration example

Note, in this example we are assuming NFIT-defined DIMMs which are identified by an "nfit_handle" a 32-bit value where:

  • Bit 3:0 DIMM number within the memory channel
  • Bit 7:4 memory channel number
  • Bit 11:8 memory controller ID
  • Bit 15:12 socket ID (within scope of a Node controller if node controller is present)
  • Bit 27:16 Node Controller ID
  • Bit 31:28 Reserved

static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
		unsigned int handle)
{
	struct ndctl_dimm *dimm;

	ndctl_dimm_foreach(bus, dimm)
		if (ndctl_dimm_get_handle(dimm) == handle)
			return dimm;

	return NULL;
}

#define DIMM_HANDLE(n, s, i, c, d) \
	(((n & 0xfff) << 16) | ((s & 0xf) << 12) | ((i & 0xf) << 8) \
	 | ((c & 0xf) << 4) | (d & 0xf))

dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));


LIBNVDIMM/LIBNDCTL: Region

A generic REGION device is registered for each PMEM range or BLK-aperture set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture sets on the "nfit_test.0" bus.  The primary role of a region is to be a container of "mappings".  A mapping is a tuple of <DIMM, DPA-start-offset, length>.

LIBNVDIMM provides a built-in driver for these REGION devices.  This driver is responsible for reconciling the aliased DPA mappings across all regions, parsing the LABEL, if present, and then emitting NAMESPACE devices with the resolved/exclusive DPA-boundaries for the nd_pmem or nd_blk device driver to consume.

In addition to the generic attributes of "mappings", "interleave_ways", and "size", the REGION device also exports some convenience attributes. "nstype" indicates the integer type of namespace-device this region emits, "devtype" duplicates the DEVTYPE variable stored by udev at the 'add' event, "modalias" duplicates the MODALIAS variable stored by udev at the 'add' event, and finally, the optional "spa_index" is provided in the case where the region is defined by a SPA.

LIBNVDIMM: region

struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
		struct nd_region_desc *ndr_desc);
struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
		struct nd_region_desc *ndr_desc);


LIBNDCTL: region enumeration example

Sample region retrieval routines based on NFIT-unique data like "spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for BLK.


static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
		unsigned int spa_index)
{
	struct ndctl_region *region;

	ndctl_region_foreach(bus, region) {
		if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
			continue;
		if (ndctl_region_get_spa_index(region) == spa_index)
			return region;
	}
	return NULL;
}

static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
		unsigned int handle)
{
	struct ndctl_region *region;

	ndctl_region_foreach(bus, region) {
		struct ndctl_mapping *map;

		if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
			continue;
		ndctl_mapping_foreach(region, map) {
			struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);

			if (ndctl_dimm_get_handle(dimm) == handle)
				return region;
		}
	}
	return NULL;
}


Why Not Encode the Region Type into the Region Name?

At first glance it seems since NFIT defines just PMEM and BLK interface types that we should simply name REGION devices with something derived from those type names.  However, the ND subsystem explicitly keeps the REGION name generic and expects userspace to always consider the region-attributes for four reasons:

1. There are already more than two REGION and "namespace" types.  For PMEM there are two subtypes.  As mentioned previously we have PMEM where the constituent DIMM devices are known and anonymous PMEM. For BLK regions the NFIT specification already anticipates vendor specific implementations.  The exact distinction of what a region contains is in the region-attributes not the region-name or the region-devtype.

2. A region with zero child-namespaces is a possible configuration.  For example, the NFIT allows for a DCR to be published without a corresponding BLK-aperture.  This equates to a DIMM that can only accept control/configuration messages, but no i/o through a descendant block device.  Again, this "type" is advertised in the attributes ('mappings'== 0) and the name does not tell you much.

3. What if a third major interface type arises in the future?  Outside of vendor specific implementations, it's not difficult to envision a third class of interface type beyond BLK and PMEM.  With a generic name for the REGION level of the device-hierarchy old userspace    implementations can still make sense of new kernel advertised region-types.  Userspace can always rely on the generic region attributes like "mappings", "size", etc and the expected child devices named "namespace".  This generic format of the device-model hierarchy allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and future-proof.

4. There are more robust mechanisms for determining the major type of a region than a device name.  See the next section, How Do I Determine the Major Type of a Region?


How Do I Determine the Major Type of a Region?

Outside of the blanket recommendation of "use libndctl", or simply looking at the kernel header (/usr/include/linux/ndctl.h) to decode the "nstype" integer attribute, here are some other options.

1. module alias lookup: The whole point of region/namespace device type differentiation is to     decide which block-device driver will attach to a given LIBNVDIMM namespace. One can simply use the modalias to lookup the resulting module.  It's important to note that this method is robust in the presence of a vendor-specific driver down the road.  If a vendor-specific    implementation wants to supplant the standard nd_blk driver it can with minimal impact to the rest of LIBNVDIMM.

In fact, a vendor may also want to have a vendor-specific region-driver (outside of nd_region).  For example, if a vendor defined its own LABEL format it would need its own region driver to parse that LABEL and emit the resulting namespaces.  The output from module resolution is more accurate than a region-name or region-devtype.

2. udev: The kernel "devtype" is registered in the udev database

    # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0

    P: /devices/platform/nfit_test.0/ndbus0/region0

    E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0

    E: DEVTYPE=nd_pmem

    E: MODALIAS=nd:t2

    E: SUBSYSTEM=nd


    # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4

    P: /devices/platform/nfit_test.0/ndbus0/region4

    E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4

    E: DEVTYPE=nd_blk

    E: MODALIAS=nd:t3

    E: SUBSYSTEM=nd


...and is available as a region attribute, but keep in mind that the "devtype" does not indicate sub-type variations and scripts should really be understanding the other attributes.

3. type specific attributes: As it currently stands a BLK-aperture region will never have a    "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region.  A BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM that does not allow I/O.  A PMEM region with a "mappings" value of zero is a simple system-physical-address range.


LIBNVDIMM/LIBNDCTL: Namespace

A REGION, after resolving DPA aliasing and LABEL specified boundaries, surfaces one or more "namespace" devices.  The arrival of a "namespace" device currently triggers either the nd_blk or nd_pmem driver to load and register a disk/block device.

LIBNVDIMM: namespace

Here is a sample layout of the three major types of NAMESPACE, where namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid' attribute), namespace2.0 represents a BLK namespace (note that it has a 'sector_size' attribute), and namespace6.0 represents an anonymous PMEM namespace (note that it has no 'uuid' attribute because it does not support a LABEL).

(See the Original Source)

LIBNDCTL: namespace enumeration example

Namespaces are indexed relative to their parent region, example below. These indexes are mostly static from boot to boot, but the subsystem makes no guarantees in this regard.  For a static namespace identifier use its 'uuid' attribute.

static struct ndctl_namespace *get_namespace_by_id(struct ndctl_region *region,
		unsigned int id)
{
	struct ndctl_namespace *ndns;

	ndctl_namespace_foreach(region, ndns)
		if (ndctl_namespace_get_id(ndns) == id)
			return ndns;

	return NULL;
}


LIBNDCTL: namespace creation example

Idle namespaces are automatically created by the kernel if a given region has enough available capacity to create a new namespace. Namespace instantiation involves finding an idle namespace and configuring it.  For the most part the setting of namespace attributes can occur in any order, the only constraint is that 'uuid' must be set before 'size'.  This enables the kernel to track DPA allocations internally with a static identifier.

static int configure_namespace(struct ndctl_region *region,
		struct ndctl_namespace *ndns,
		struct namespace_parameters *parameters)
{
	char devname[50];

	snprintf(devname, sizeof(devname), "namespace%d.%d",
			ndctl_region_get_id(region), parameters->id);
	ndctl_namespace_set_alt_name(ndns, devname);
	/* 'uuid' must be set prior to setting size! */
	ndctl_namespace_set_uuid(ndns, parameters->uuid);
	ndctl_namespace_set_size(ndns, parameters->size);
	/* unlike pmem namespaces, blk namespaces have a sector size */
	if (parameters->lbasize)
		ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
	ndctl_namespace_enable(ndns);

	return 0;
}


Why the Term "namespace"?

1. Why not "volume" for instance?  "volume" ran the risk of confusing ND (the libnvdimm subsystem) with a volume manager like device-mapper.

2. The term originated to describe the sub-devices that can be created within a NVME controller (see the nvme specification: http://www.nvmexpress.org/specifications/), and NFIT namespaces are meant to parallel the capabilities and configurability of NVME-namespaces.


LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"

A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked block device driver that fronts either the whole block device or a partition of a block device emitted by either a PMEM or BLK NAMESPACE.

LIBNVDIMM: btt layout

Every region will start out with at least one BTT device, which is the seed device. To activate it, set the "namespace", "uuid", and "sector_size" attributes and then bind the device to the nd_pmem or nd_blk driver depending on the region type.

LIBNDCTL: btt creation example

Similar to namespaces, an idle BTT device is automatically created per region.  Each time this "seed" BTT device is configured and enabled, a new seed is created.  Creating a BTT configuration involves two steps: finding an idle BTT and assigning it to consume a PMEM or BLK namespace.

static struct ndctl_btt *get_idle_btt(struct ndctl_region *region)
{
	struct ndctl_btt *btt;

	ndctl_btt_foreach(region, btt)
		if (!ndctl_btt_is_enabled(btt)
				&& !ndctl_btt_is_configured(btt))
			return btt;

	return NULL;
}

static int configure_btt(struct ndctl_region *region,
		struct btt_parameters *parameters)
{
	btt = get_idle_btt(region);

	ndctl_btt_set_uuid(btt, parameters->uuid);
	ndctl_btt_set_sector_size(btt, parameters->sector_size);
	ndctl_btt_set_namespace(btt, parameters->ndns);
	/* turn off raw mode device */
	ndctl_namespace_disable(parameters->ndns);
	/* turn on btt access */
	ndctl_btt_enable(btt);
}


Once instantiated, a new inactive BTT seed device will appear underneath the region. Once a "namespace" is removed from a BTT, that instance of the BTT device will be deleted or otherwise reset to default values.  This deletion is only at the device model level.  In order to destroy a BTT the "info block" needs to be destroyed.  Note that to destroy a BTT the media needs to be written in raw mode.  By default, the kernel will autodetect the presence of a BTT and disable raw mode.  This autodetect behavior can be suppressed by enabling raw mode for the namespace via the ndctl_namespace_set_raw_mode() API.
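A minimal sketch of that last step with libndctl (assuming the raw-mode setter takes a boolean flag; error handling omitted) might look like:

/* put the namespace in raw mode so the BTT info block can be overwritten */
ndctl_namespace_disable(ndns);
ndctl_namespace_set_raw_mode(ndns, 1);   /* suppress BTT autodetection */
ndctl_namespace_enable(ndns);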



PCMSIM

NVM · 2016. 8. 5. 09:57

A simple PCM block device simulator for Linux

PCMSIM

Original Source: https://code.google.com/archive/p/pcmsim/


A block device driver for Linux that simulates the presence of a Phase Change Memory (PCM), a type of non-volatile (persistent) byte-addressable memory (NVBM), installed in one of the DIMM slots on the motherboard. The simulator is implemented as a kernel module for Linux that creates /dev/pcm0 when it is loaded – a ramdisk-backed block device with the latencies of PCM.

We have designed pcmsim to have minimal overhead, so that the users can run benchmarks in real time. pcmsim accounts for the differences between read and write latencies, and the effects of CPU caches, but it is by no means a complete system simulation. The current implementation only partially accounts for prefetching in the CPU and for memory-level parallelism.

Please beware that a bunch of configuration is currently hard-coded in memory.c and module.c, separately for 32-bit and 64-bit systems (separated using the LP64 macro). Please refer to README.txt about what you need to change, so that you get meaningful results. I would also recommend that you run your benchmarks with disabled swap, so that your simulated PCM device would not be swapped out while you are using it.

Lastly, the pcmsim kernel module has been tested only on selected 2.6.x kernels; it still needs to be ported to 3.x kernels. If you fix any of these issues, please send me your patches, and I'll be happy to merge them!




Persistent memory and page structures


Original Source: http://lwn.net/Articles/644079/

Abstraction

This is a discussion of whether managing PM through paging is desirable. If PM is handed to the memory manager, every physical page is described by a page structure and the page frames are managed by the buddy allocator. That approach makes it very easy to use PM as ordinary memory, but it requires one page structure per PFN, so for terabyte-scale PM a large amount of memory goes into the bookkeeping structures alone. That raises the question of where the page structures for the PM region should live: RAM or PRAM. For PM built from ordinary DRAM, as in NVDIMM-N, it does not matter much, but for media such as PCM, wear leveling has to be considered, so the prevailing view is that placing the frequently updated page structures in PM is unwise. Placing those huge page structures in DRAM, on the other hand, carries a substantial space/cost overhead. Also, since one buddy allocator exists per memory zone, an application cannot tell, without kernel modifications, whether a given virtual address maps to a PFN in the PM region, nor can it access a specific PM region. Finally, the buddy allocator has never been validated for managing terabytes of memory, so its efficiency at that scale remains an open question.

In conclusion, paging PM does not look like the right approach. Personally, I think pmem.io's MMIO + DAX based design is a good choice. In particular, when the PM region is dedicated to a specific purpose, mapping the raw device instead of going through a file system interface, and managing the device itself as several pools, is also a viable option. Intel's NVDIMM work (libnvdimm, pmem.io, and so on) is all based on memory-mapped file I/O, which clearly carries some OS call-path overhead, so I still wonder why that choice was made.


By Jonathan Corbet
May 13, 2015
As is suggested by its name, persistent memory (or non-volatile memory) is characterized by the persistence of the data stored in it. But that term could just as well be applied to the discussions surrounding it; persistent memory raises a number of interesting development problems that will take a while to work out yet. One of the key points of discussion at the moment is whether persistent memory should, like ordinary RAM, be represented by page structures and, if so, how those structures should be managed.

One page structure exists for each page of (non-persistent) physical memory in the system. It tracks how the page is used and, among other things, contains a reference count describing how many users the page has. A pointer to a page structure is an unambiguous way to refer to a specific physical page independent of any address space, so it is perhaps unsurprising that this structure is used with many APIs in the kernel. Should a range of memory exist that lacks corresponding page structures, that memory cannot be used with any API expecting a struct page pointer; among other things, that rules out DMA and direct I/O.

Persistent memory looks like ordinary memory to the CPU in a number of ways. In particular, it is directly addressable at the byte level. It differs, though, in its persistence, its performance characteristics (writes, in particular, can be slow), and its size — persistent memory arrays are expected to be measured in terabytes. At a 4KB page size, billions of page structures would be needed to represent this kind of memory array — too many to manage efficiently. As a result, currently, persistent memory is treated like a device, rather than like memory; among other things, that means that the kernel does not need to maintain page structures for persistent memory. Many things can be made to work without them, but this aspect of persistent memory does bring some limitations; one of those is that it is not currently possible to perform I/O directly between persistent memory and another device. That, in turn, thwarts use cases like using persistent memory as a cache between the system and a large, slow storage array.

Page-frame numbers

One approach to the problem, posted by Dan Williams, is to change the relevant APIs to do away with the need for page structures. This patch set creates a new type called __pfn_t:

    typedef struct {
	union {
	    unsigned long data;
	    struct page *page;
	};
    } __pfn_t;

As is suggested by the use of a union type, this structure leads a sort of double life. It can contain a page pointer as usual, but it can also be used to hold an integer page frame number (PFN). The two cases are distinguished by setting one of the low bits in the data field; the alignment requirements for page structures guarantee that those bits will be clear for an actual struct page pointer.

A small set of helper functions has been provided to obtain the information from this structure. A call to __pfn_t_to_pfn() will obtain the associated PFN (regardless of which type of data the structure holds), while __pfn_t_to_page() will return a struct page pointer, but only if a page structure exists. These helpers support the main goal for the __pfn_t type: to allow the lower levels of the I/O stack to be converted to use PFNs as the primary way to describe memory while avoiding massive changes to the upper layers where page structures are used.
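As a self-contained illustration of the low-bit tagging trick these helpers rely on (this is a simplified userspace model, not the kernel patch; the flag and shift values are assumptions), consider:

/* Userspace illustration of the low-bit tagging described above.
 * PFN_SHIFT of 12 assumes 4KB pages. */
#include <assert.h>
#include <stdio.h>

struct page { int dummy; };     /* stand-in for the kernel's struct page */

typedef struct {
	union {
		unsigned long data;
		struct page *page;
	};
} __pfn_t;

#define PFN_FLAG  1UL           /* low bit set => raw PFN, not a pointer */
#define PFN_SHIFT 12

static __pfn_t pfn_from_page(struct page *p)
{
	__pfn_t v;
	v.page = p;                              /* aligned pointer: low bit clear */
	return v;
}

static __pfn_t pfn_from_number(unsigned long pfn)
{
	__pfn_t v;
	v.data = (pfn << PFN_SHIFT) | PFN_FLAG;  /* tag the integer form */
	return v;
}

static struct page *pfn_to_page(__pfn_t v)
{
	return (v.data & PFN_FLAG) ? NULL : v.page;   /* NULL: no struct page */
}

int main(void)
{
	static struct page pg;

	assert(pfn_to_page(pfn_from_page(&pg)) == &pg);
	assert(pfn_to_page(pfn_from_number(0x1234)) == NULL);
	puts("low-bit tagging distinguishes the two encodings");
	return 0;
}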

With that infrastructure in place, the block layer is changed to use __pfn_t instead of struct page; in particular, the bio_vec structure, which describes a segment of I/O, becomes:

    struct bio_vec {
        __pfn_t         bv_pfn;
        unsigned short  bv_len;
        unsigned short  bv_offset;
    };

The ripple effects from this change end up touching nearly 80 files in the filesystem and block subtrees. At a lower level, there are changes to the scatter/gather DMA API to allow buffers to be specified using PFNs rather than page structures; this change has architecture-specific components to enable the mapping of buffers by PFN.

Finally, there is the problem of enabling kmap_atomic() on PFN-specified pages. kmap_atomic() maps a page into the kernel's address space; it is only really needed on 32-bit systems where there is not room to map all of main memory into that space. On 64-bit systems it is essentially a no-op, turning a page structure into its associated kernel virtual address. That problem gets a little trickier when persistent memory is involved; the only code that really knows where that memory is mapped is the low-level device driver. Dan's patch set adds a function by which the driver can inform the rest of the kernel of the mapping between a range of PFNs and kernel space; kmap_atomic() is then adapted to use that information.

All together, this patch set is enough to enable direct block I/O to persistent memory. Linus's initial response was on the negative side, though; he said "I detest this approach." Instead, he argued in favor of a solution where special page structures are created for ranges of persistent memory when they are needed. As the discussion went on, though, he moderated his position, saying: "So while I (very obviously) have some doubts about this approach, it may be that the most convincing argument is just in the code." That code has since been reposted with some changes, but the discussion is not yet finished.

Back to page structures

Various alternatives have been suggested, but the most attention was probably drawn by Ingo Molnar's "Directly mapped pmem integrated into the page cache" proposal. The core of Ingo's idea is that all persistent memory would have page structures, but those structures would be stored in the persistent memory itself. The kernel would carve out a piece of each persistent memory array for these structures; that memory would be hidden from filesystem code.

Despite being stored in persistent memory, the page structures themselves would not be persistent — a point that a number of commenters seemed to miss. Instead, they would be initialized at boot time, using a lazy technique so that this work would not overly slow the boot process as a whole. All filesystem I/O would be direct I/O; in this picture, the kernel's page cache has little involvement. The potential benefits are huge: vast amounts of memory would be available for fast I/O without many of the memory-management issues that make life difficult for developers today.

It is an interesting vision, and it may yet bear fruit, but various developers were quick to point out that things are not quite as simple as Ingo would like them to be. Matthew Wilcox, who has done much of the work to make filesystems work properly with persistent memory, noted that there is an interesting disconnect between the lifecycle of a page-cache page and that of a block on disk. Filesystems have the ability to reassign blocks independently of any memory that might represent the content of those blocks at any given time. But in this directly mapped view of the world, filesystem blocks and pages of memory are the same thing; synchronizing changes to the two could be an interesting challenge.

Dave Chinner pointed out that the directly mapped approach makes any sort of data transformation by the filesystem (such as compression or encryption) impossible. In Dave's view, the filesystem needs to have a stronger role in how persistent memory is managed in general. The idea of just using existing filesystems (as Ingo had suggested) to get the best performance out of persistent memory is, in his view, not sustainable. Ingo, instead, seems to feel that management of persistent memory could be mostly hidden from filesystems, just like the management of ordinary memory is.

In any case, the proof of this idea would be in the code that implements it, and, currently, no such code exists. About the only thing that can be concluded from this discussion is that the kernel community still has not figured out the best ways of dealing with large persistent-memory arrays. Likely as not, it will take some years of experience with the actual hardware to figure that out. Approaches like Dan's might just be merged as a way to make things work for now. The best way to make use of such memory in the long term remains undetermined, though.





ND: NFIT-Defined / NVDIMM Subsystem


Since 2010 Intel has included non-volatile memory support on a few
storage-focused platforms with a feature named ADR (Asynchronous DRAM
Refresh).  These platforms were mostly targeted at custom applications
and never enjoyed standard discovery mechanisms for platform firmware
to advertise non-volatile memory capabilities. This now changes with
the publication of version 6 of the ACPI specification [1] and its
inclusion of a new table for describing platform memory capabilities.
The NVDIMM Firmware Interface Table (NFIT), along with new EFI and E820
memory types, enumerates persistent memory ranges, memory-mapped-I/O
apertures, physical memory devices (DIMMs), and their associated
properties.

The ND-subsystem wraps a Linux device driver model around the objects
and address boundaries defined in the specification and introduces 3 new
drivers.

  nd_pmem: NFIT enabled version of the existing 'pmem' driver [2]
  nd_blk: mmio aperture method for accessing persistent storage
  nd_btt: give persistent memory disk semantics (atomic sector update)

See the documentation in patch2 for more details, and there is
supplemental documentation on pmem.io [4].  Please review, and
patches welcome...

For kicking the tires, this release is accompanied by a userspace
management library 'ndctl' that includes unit tests (make check) for all
of the kernel ABIs.  The nfit_test.ko module can be used to explore a
sample NFIT topology.

[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6....
[2]: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/...
[3]: https://github.com/pmem/ndctl
[4]: http://pmem.io/documents/

--
Dan for the NFIT driver development team Andy Rudoff, Matthew Wilcox, Ross
Zwisler, and Vishal Verma


Original Source: https://lwn.net/Articles/640891/


Original Source: http://www.eiric.or.kr/util/pdsFileDownload.php?db=TB_PostConference2&fileName=FN_1606287078993.pdf&seq=4957

The Linux platform recently gained DAX (Direct Access), which lets data be moved to and from a PM storage device directly. It operates through the POSIX I/O system calls: inside the filesystem layer of existing filesystems such as ext4 and XFS, byte-granularity I/O is handed to the PM-based storage device through the DAX interface (direct_access()) without being converted into block units, eliminating the unnecessary overhead of that conversion. However, because DAX is built on a direct-I/O path that bypasses the page cache, the mature features the operating system provides through the page cache become unavailable. Those benefits go beyond fast response times for read() system calls and include copying between files, compression, encryption and more, so adopting DAX means giving them up.

Direct Access for files

Original Source: https://www.kernel.org/doc/Documentation/filesystems/dax.txt

Motivation

The page cache is usually used to buffer reads and writes to files. It is also used to provide the pages which are mapped into userspace by a call to mmap. For block devices that are memory-like, the page cache pages would be unnecessary copies of the original storage. The DAX code removes the extra copy by performing reads and writes directly to the storage device. For file mappings, the storage device is mapped directly into userspace.

Usage

If you have a block device which supports DAX, you can make a filesystem on it as usual.  The DAX code currently only supports files with a block size equal to your kernel's PAGE_SIZE, so you may need to specify a block size when creating the filesystem.  When mounting it, use the "-o dax" option on the command line or add 'dax' to the options in /etc/fstab.

Implementation Tips for Block Driver Writers

To support DAX in your block driver, implement the 'direct_access' block device operation.  It is used to translate the sector number (expressed in units of 512-byte sectors) to a page frame number (pfn) that identifies the physical page for the memory.  It also returns a kernel virtual address that can be used to access the memory.

The direct_access method takes a 'size' parameter that indicates the number of bytes being requested. The function should return the number of bytes that can be contiguously accessed at that offset. It may also return a negative errno if an error occurs.

In order to support this method, the storage must be byte-accessible by the CPU at all times.  If your device uses paging techniques to expose a large amount of memory through a smaller window, then you cannot implement direct_access. Equally, if your device can occasionally stall the CPU for an extended period, you should also not attempt to implement direct_access.
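
As a concrete illustration, a memory-like device whose whole capacity is already mapped into kernel space could implement the operation roughly as below. The my_pmem_dev structure and its fields are invented for this sketch, and the exact direct_access prototype has varied between kernel versions.

    struct my_pmem_dev {
	    void *virt_addr;		/* kernel mapping of the whole device */
	    unsigned long start_pfn;	/* first page frame of the device memory */
	    size_t size;		/* device size in bytes */
    };

    static long my_pmem_direct_access(struct block_device *bdev, sector_t sector,
				      void **addr, unsigned long *pfn, long size)
    {
	    struct my_pmem_dev *dev = bdev->bd_disk->private_data;
	    loff_t offset = (loff_t)sector << 9;	/* sectors are 512 bytes */

	    if (offset >= dev->size)
		    return -ERANGE;

	    *addr = dev->virt_addr + offset;
	    *pfn = dev->start_pfn + (offset >> PAGE_SHIFT);

	    /* Report how many bytes are contiguously accessible at this offset. */
	    return dev->size - offset;
    }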

These block devices may be used for inspiration:

- axonram: Axon DDR2 device driver

- brd: RAM backed block device driver

- dcssblk: s390 dcss block device driver

- pmem: NVDIMM persistent memory driver

Implementation Tips for Filesystem Writers

Filesystem support consists of

  • Adding support to mark inodes as being DAX by setting the S_DAX flag in i_flags
  • Implementing the direct_IO address space operation, and calling dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
  • Implementing an mmap file operation for DAX files which sets the VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to include handlers for fault, pmd_fault and page_mkwrite (which should probably call dax_fault(), dax_pmd_fault() and dax_mkwrite(), passing the appropriate get_block() callback)
  • Calling dax_truncate_page() instead of block_truncate_page() for DAX files
  • Calling dax_zero_page_range() instead of zero_user() for DAX files
  • Ensuring that there is sufficient locking between reads, writes, truncates and page faults 

The get_block() callback passed to the DAX functions may return uninitialised extents.  If it does, it must ensure that simultaneous calls to get_block() (for example by a page-fault racing with a read() or a write()) work correctly.
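
A schematic example of the wiring in the list above, loosely modeled on what ext2/ext4 do: the myfs_* names are hypothetical, and the argument lists of dax_do_io(), dax_fault() and blockdev_direct_IO() have changed between kernel versions, so the calls are shown in simplified form.

    /* The filesystem's usual block-mapping callback (implementation not shown). */
    static int myfs_get_block(struct inode *inode, sector_t iblock,
			      struct buffer_head *bh, int create);

    static int myfs_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
	    /* Forward to the generic DAX fault helper with our block mapper. */
	    return dax_fault(vma, vmf, myfs_get_block);
    }

    static const struct vm_operations_struct myfs_dax_vm_ops = {
	    .fault		= myfs_dax_fault,
	    .page_mkwrite	= myfs_dax_fault,	/* writes fault in the same way */
    };

    static int myfs_file_mmap(struct file *file, struct vm_area_struct *vma)
    {
	    if (!IS_DAX(file_inode(file)))
		    return generic_file_mmap(file, vma);

	    vma->vm_ops = &myfs_dax_vm_ops;
	    vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
	    return 0;
    }

    static ssize_t myfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t pos)
    {
	    struct inode *inode = file_inode(iocb->ki_filp);

	    if (IS_DAX(inode))	/* S_DAX was set on the inode when it was set up */
		    return dax_do_io(iocb, inode, iter, pos, myfs_get_block, NULL, 0);

	    return blockdev_direct_IO(iocb, inode, iter, pos, myfs_get_block);
    }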

These filesystems may be used for inspiration:

- ext2: see Documentation/filesystems/ext2.txt

- ext4: see Documentation/filesystems/ext4.txt

- xfs:  see Documentation/filesystems/xfs.txt

Handling Media Errors

The libnvdimm subsystem stores a record of known media error locations for each pmem block device (in gendisk->badblocks). If we fault at such location, or one with a latent error not yet discovered, the application can expect to receive a SIGBUS. Libnvdimm also allows clearing of these errors by simply writing the affected sectors (through the pmem driver, and if the underlying NVDIMM supports the clear_poison DSM defined by ACPI).

Since DAX IO normally doesn't go through the driver/bio path, applications or sysadmins have an option to restore the lost data from a prior backup/inbuilt redundancy in the following ways:

1. Delete the affected file, and restore from a backup (sysadmin route): This will free the file system blocks that were being used by the file, and the next time they're allocated, they will be zeroed first, which happens through the driver, and will clear bad sectors.

2. Truncate or hole-punch the part of the file that has a bad-block (at least an entire aligned sector has to be hole-punched, but not necessarily an entire filesystem block).

These are the two basic paths that allow DAX filesystems to continue operating in the presence of media errors. More robust error recovery mechanisms can be built on top of this in the future, for example, involving redundancy/mirroring provided at the block layer through DM, or  additionally, at the filesystem level. These would have to rely on the above two tenets, that error clearing can happen either by sending an IO through the driver, or zeroing (also through the driver).
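
For the hole-punch route in option 2 above, a userspace sketch using fallocate(2) is shown below; the file path, offset, and granularity are illustrative only.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
	    const off_t bad_offset = 4096;	/* assumed byte offset of the bad block */
	    const off_t sector = 512;		/* at least one aligned sector must be punched */

	    int fd = open("/mnt/mem/data.bin", O_RDWR);
	    if (fd < 0) {
		    perror("open");
		    return EXIT_FAILURE;
	    }

	    /* KEEP_SIZE leaves the file length alone; only the blocks are released. */
	    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			  bad_offset & ~(sector - 1), sector) < 0)
		    perror("fallocate");

	    close(fd);
	    return 0;
    }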

Shortcomings

Even if the kernel or its modules are stored on a filesystem that supports DAX on a block device that supports DAX, they will still be copied into RAM.

The DAX code does not work correctly on architectures which have virtually mapped caches such as ARM, MIPS and SPARC.

Calling get_user_pages() on a range of user memory that has been mmaped from a DAX file will fail as there are no 'struct page' to describe those pages. This problem is being worked on. That means that O_DIRECT reads/writes to those memory ranges from a non-DAX file will fail (note that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory that is being accessed that is key here). Other things that will not work include RDMA, sendfile() and splice().




How to emulate Persistent Memory


    Data allocated with NVML is placed in the virtual address space, and the concrete ranges depend on the result of an mmap(2) operation performed on user-defined files. Such files can live on any storage media; however, the data-consistency guarantees built into NVML require frequent synchronization of the data being modified. Depending on the platform capabilities and on the device underlying the files, a different set of commands is used to perform that synchronization: msync(2) for regular hard drives, or a combination of cache-flush instructions followed by a memory fence for real persistent memory.

    Although an application can be adapted to NVML and made to operate on persistent memory while relying on a regular hard drive, this is not recommended because of the performance hit of the msync(2) operation. That is the reason to work either with real hardware or with an emulated environment. Since persistent memory is not yet commonly available, we recommend setting up an emulation system, which will speed up development and testing of the application you are converting. The following steps cover how to set up such a system.

    Hardware and system requirements

    The emulation environment is currently available only for Linux systems, and should work on any hardware or virtualized environment. Persistent memory is emulated with DRAM that the OS sees as a persistent-memory region. Because the emulation is DRAM-based it is very fast, but it will likely lose all data when the machine is powered down. It should work with any distribution able to run an official kernel.

    Linux Kernel

    Download the kernel sources from the official kernel pages. Support for persistent-memory devices and emulation has been present in the kernel since version 4.0, but a kernel newer than 4.2 is recommended because it is easier to configure; the following instructions rely on 4.2 or newer. Using an older kernel requires a bit more setup work and is not described here. Note that features and bug fixes around DAX support are still being implemented, so it is recommended to use the newest stable kernel if possible. To configure the required drivers, run nconfig and enable them:

    $ make nconfig
    	Device Drivers ---> 
    		{*} NVDIMM (Non-Volatile Memory Device) Support --->
    			<M>   PMEM: Persistent memory block device support
    			<M>   BLK: Block data window (aperture) device support
    			[*]   BTT: Block Translation Table (atomic sector updates)
    			[*]   PFN: Map persistent (device) memory 

    Additionally, you need to enable treatment of memory marked with the non-standard e820 type 12 (used by the Intel Sandy Bridge-EP reference BIOS) as protected memory. The kernel will offer these regions to the ‘pmem’ driver so they can be used for persistent storage.

    $ make nconfig
    	Processor type and features --->
    		[*] Support non-standard NVDIMMs and ADR protected memory
    		[*] Device memory (pmem, etc...) hotplug support
    	File systems --->
    		[*] Direct Access (DAX) support 

    You are now ready to build the kernel:

    $ make -jX
    	where X is the number of cores on the machine

    Install the kernel

    # sudo make modules_install install

    Reserve a memory region so that it appears as persistent memory by modifying the kernel command-line parameters. The reserved region spans from ss to ss+nn; [KMG] stands for kilo, mega, giga.

    memmap=nn[KMG]!ss[KMG]

    For example, memmap=4G!12G reserves 4GB of memory between the 12th and 16th GB. The configuration is done in GRUB and varies between Linux distributions. Here are two examples of GRUB configuration.

    Ubuntu Server 15.04

    # sudo vi /etc/default/grub
    GRUB_CMDLINE_LINUX="memmap=nn[KMG]!ss[KMG]"
    # sudo update-grub2
    

    CentOS 7.0

    # sudo vi /etc/default/grub
    GRUB_CMDLINE_LINUX="memmap=nn[KMG]!ss[KMG]"
    On BIOS-based machines:
    # sudo grub2-mkconfig -o /boot/grub2/grub.cfg
    On UEFI-based machines:
    # sudo grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
    

    After rebooting the machine you should see the emulated device as /dev/pmem0. Be aware of the memory ranges already used by your OS and try not to overlap with them; reserving memory that the OS already uses will result in split memory ranges when the persistent (type 12) regions are defined. The general recommendation is either to use memory above the 4GB mark (memmap=nnG!4G) or to check the e820 memory map up front and fit within it. If you do not see the device, verify that the memmap setting is correct, then analyze the dmesg(1) output; the reserved ranges should be visible there.

    Multiple non-overlapping regions can be reserved as persistent memory. Putting multiple memmap="...!..." entries on the command line will result in multiple devices exposed by the kernel, visible as /dev/pmem0, /dev/pmem1, /dev/pmem2, …

    DAX - Direct Access

    The DAX (direct access) extensions to a filesystem create a PM-aware environment. Having a filesystem brings easy and reliable permission management, and with the DAX add-on any file that is memory-mapped with mmap(2) is mapped directly from the physical address range into the process's virtual address space. For those files there is no paging, and load/store operations provide direct access to persistent memory.

    Create and mount a filesystem with DAX (available today for ext4 and xfs):

    # sudo mkdir /mnt/mem
    # sudo mkfs.ext4 /dev/pmem0    OR    # sudo mkfs.xfs /dev/pmem0
    # sudo mount -o dax /dev/pmem0 /mnt/mem
    

    Now files can be created on the freshly mounted partition, and given as an input to NVML pools.
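
    For instance, a small program can map such a file and store to it directly. This sketch assumes the /mnt/mem mount point created above and uses msync(2) for durability, which is also what NVML falls back to when the mapping is not real persistent memory.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define POOL_SIZE (16 * 1024 * 1024)	/* 16 MiB file, chosen arbitrarily */

    int main(void)
    {
	    int fd = open("/mnt/mem/example", O_CREAT | O_RDWR, 0644);
	    if (fd < 0 || ftruncate(fd, POOL_SIZE) != 0) {
		    perror("open/ftruncate");
		    return 1;
	    }

	    char *base = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
			      MAP_SHARED, fd, 0);
	    if (base == MAP_FAILED) {
		    perror("mmap");
		    return 1;
	    }

	    /* Ordinary stores hit the device directly; there is no page-cache copy. */
	    strcpy(base, "hello, persistent world");

	    /* Flush the modified range so it is durable on the backing store. */
	    msync(base, 4096, MS_SYNC);

	    munmap(base, POOL_SIZE);
	    close(fd);
	    return 0;
    }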

    It is additionally worth mentioning that you can emulate persistent memory with a ramdisk (e.g. /dev/shm), or force pmem-like behavior by setting the environment variable PMEM_IS_PMEM_FORCE=1, which eliminates the performance hit caused by msync(2).


    Original Source: http://pmem.io/2016/02/22/pm-emulation.html





    Supporting filesystems in persistent memory

    Original Source: https://lwn.net/Articles/610174/


    Abstract

    This article discusses what is required to design a filesystem for NVM. It briefly traces Matthew Wilcox's design changes from the ext2 XIP code to ext4 DAX and outlines how ext4 DAX bypasses the page cache. Andrew Morton asks why an existing in-memory filesystem such as tmpfs is not used instead, and Dave Chinner explains, from a robustness standpoint, why a DRAM-based in-memory filesystem is not suitable for NVM.
    By Jonathan Corbet
    September 2, 2014



    For a few years now, we have been told that upcoming non-volatile memory (NVM) devices are going to change how we use our systems. These devices provide large amounts (possibly terabytes) of memory that is persistent and that can be accessed at RAM speeds. Just what we will do with so much persistent memory is not entirely clear, but it is starting to come into focus. It seems that we'll run ordinary filesystems on it — but those filesystems will have to be tweaked to allow users to get full performance from NVM.


    It is easy enough to wrap a block device driver around an NVM device and make it look like any other storage device. Doing so, though, forces all data on that device to be copied to and from the kernel's page cache. Given that the data could be accessed directly, this copying is inefficient at best. Performance-conscious users would rather avoid use of the page cache whenever possible so that they can get full-speed performance out of NVM devices.

    The kernel has actually had some support for direct access to non-volatile memory since 2005, when execute-in-place (XIP) support was added to the ext2 filesystem. This code allows files from a directly-addressable device to be mapped into user space, allowing file data to be accessed without going through the page cache. The XIP code has apparently seen little use, though, and has not been improved in some years; it does not work with current filesystems.

    Last year, Matthew Wilcox began work on improving the XIP code and integrating it into the ext4 filesystem. Along the way, he found that it was not well suited to the needs of contemporary filesystems; there are a number of unpleasant race conditions in the code as well. So over time, his work shifted from enhancing XIP to replacing it. That work, currently a 21-part patch set, is getting closer to being ready for merging into the mainline, so it is beginning to get a bit more attention.

    Those patches replace the XIP code with a new subsystem called DAX (for "direct access," apparently). At the block device level, it replaces the existing direct_access() function in struct block_device_operations with one that looks like this:

    long (*direct_access)(struct block_device *dev, sector_t sector, void **addr, unsigned long *pfn, long size);

    This function accepts a sector number and a size value saying how many bytes the caller wishes to access. If the given space is directly addressable, the base (kernel) address should be returned in addr and the appropriate page frame number goes into pfn. The page frame number is meant to be used in page tables when arranging direct user-space access to the memory.

    Although block devices are sector-based, for a directly mappable device the requested sectors are translated into addresses, and the corresponding physical page number is returned through pfn.
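
    Conceptually, a DAX fault path uses those two outputs roughly as follows. This sketch carries an invented name and omits locking, full error handling, and the get_block() step that finds the sector; it is only meant to show where the returned address and PFN go.

    static int sketch_dax_insert(struct vm_area_struct *vma, unsigned long vaddr,
				 struct block_device *bdev, sector_t sector)
    {
	    void *kaddr;
	    unsigned long pfn;
	    long avail;

	    /* Ask the device for the kernel address and PFN backing this sector. */
	    avail = bdev->bd_disk->fops->direct_access(bdev, sector, &kaddr, &pfn,
						       PAGE_SIZE);
	    if (avail < PAGE_SIZE)
		    return VM_FAULT_SIGBUS;

	    /* No struct page exists, so the PFN goes into the page tables directly;
	     * VM_MIXEDMAP mappings allow this. */
	    if (vm_insert_mixed(vma, vaddr, pfn))
		    return VM_FAULT_SIGBUS;

	    return VM_FAULT_NOPAGE;
    }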

    The use of page frame numbers and addresses may seem a bit strange; most of the kernel deals with memory at this level via struct page. That cannot be done here, though, for one simple reason: non-volatile memory is not ordinary RAM and has no page structures associated with it. Those missing page structures have a number of consequences; perhaps most significant is the fact that NVM cannot be passed to other devices for DMA operations. That rules out, for example, zero-copy network I/O to or from a file stored on an NVM device. Boaz Harrosh is working on a patch set allowing page structures to be used with NVM, but that work is in a relatively early state.

    Moving up the stack, quite a bit of effort has gone into pushing NVM support into the virtual filesystem layer so that it can be used by all filesystems. Various generic helpers have been set up for common operations (reading, writing, truncating, memory-mapping, etc.). For the most part, the filesystem need only mark DAX-capable inodes with the new S_DAX flag and call the helper functions in the right places; see the documentation in the patch set for (a little) more information. The patch set finishes by adding the requisite support to ext4.

    Andrew Morton expressed some skepticism about this work, though. At the top of his list of questions was: why not use a "suitably modified" version of an in-memory filesystem (ramfs or tmpfs, for example) instead? It seems like a reasonable question; those filesystems are already designed for directly-addressable memory and have the necessary optimizations. But RAM-based filesystems are designed for RAM; it turns out that they are not all that well suited to the NVM case.

    For the details of why that is, this message from Dave Chinner is well worth reading in its entirety. To a great extent, it comes down to this: the RAM-based filesystems have not been designed to deal with persistence. They start fresh at each boot and need never cope with something left over from a previous run of the system. Data stored in NVM, instead, is expected to persist over reboots, be robust in the face of crashes, not go away when the kernel is upgraded, etc. That adds a whole set of requirements that RAM-based filesystems do not have to satisfy.

    So, for example, NVM filesystems need all the tools that traditional filesystems have to recognize filesystems on disk, check them, deal with corruption, etc. They need all of the techniques used by filesystems to ensure that the filesystem image in persistent storage is in a consistent state at all times; metadata operations must be carefully ordered and protected with barriers, for example. Since compatibility with different kernels is important, no in-kernel data structures can be directly stored in the filesystem; they must be translated to and from an on-disk format. Ordinary filesystems do these things; RAM-based filesystems do not.

    Then, as Dave explained, there is the little issue of scalability:

    Further, it's going to need to scale to very large amounts of storage. We're talking about machines with *tens of TB* of NVDIMM capacity in the immediate future and so free space management and concurrency of allocation and freeing of used space is going to be fundamental to the performance of the persistent NVRAM filesystem. So, you end up with block/allocation groups to subdivide the space. Looking a lot like ext4 or XFS at this point.

    And now you have to scale to indexing tens of millions of everything. At least tens of millions - hundreds of millions to billions is more likely, because storing tens of terabytes of small files is going to require indexing billions of files. And because there is no performance penalty for doing this, people will use the filesystem as a great big database. So now you have to have a scalable posix compatible directory structures, scalable freespace indexation, dynamic, scalable inode allocation, freeing, etc. Oh, and it also needs to be highly concurrent to handle machines with hundreds of CPU cores.

    Dave concluded by pointing out that the kernel already has a couple of "persistent storage implementations" that can handle those needs: the XFS and ext4 filesystems (though he couldn't resist poking at the scalability of ext4). Both of them will work now on a block device based on persistent memory. The biggest thing that is missing is a way to allow users to directly address all of that data without copying it through the page cache; that is what the DAX code is meant to provide.

    There are groups working on filesystems designed for NVM from the beginning. But most of that work is in an early stage; none has been posted to the kernel mailing lists, much less proposed for merging. So users wanting to get full performance out of NVM will find little help in that direction for some years yet. It is thus not unreasonable to conclude that there will be some real demand for the ability to use today's filesystems with NVM systems.

    The path toward that capability would appear to be DAX. All that is needed is to get the patch set reviewed to the point that the relevant subsystem maintainers are comfortable merging it. That review has been somewhat slow in coming; the patch set is complex and touches a number of different subsystems. Still, the code has changed considerably in response to the reviews that have come in and appears to be getting close to a final state. Perhaps this functionality will find its way into the mainline in a near-future development cycle.




    Persistent memory support progress 

    https://lwn.net/Articles/640113/

    By Jonathan Corbet

    April 15, 2015

    Persistent memory (or non-volatile memory) has a number of nice features: it doesn't lose its contents when power is cycled, it is fast, and it is expected to be available in large quantities. Enabling proper support for this memory in the kernel has been a topic of discussion and development for some years; it was, predictably, an important topic at this year's Linux Storage, Filesystem, and Memory Management Summit. The 4.1 kernel will contain a new driver intended to improve support for persistent memory, but there is still a fair amount of work to be done.


    At a first glance, persistent memory looks like normal RAM to the processor, so it might be tempting to simply use it that way. There are, though, some good reasons for not doing that. The performance characteristics of persistent memory are still not quite the same as RAM; in particular, write operations can be slower. Persistent memory may not wear out as quickly as older flash arrays did, but it is still best to avoid rewriting it many times per second, as could happen if it were used as regular memory. And the persistence of persistent memory is a valuable feature to take advantage of in its own right — but, to do so, the relevant software must know which memory ranges in the system are persistent. So persistent memory needs to be treated a bit differently.


    The usual approach, at least for a first step, is to separate persistent memory from normal RAM and treat it as if it were a block device. Various drivers implementing this type of access have been circulating for a while now. It appears that this driver from Ross Zwisler will be merged for the 4.1 release. It makes useful reading as it is something close to the simplest possible example of a working block device driver. It takes a region of memory, registers a block device to represent that memory, and implements block read and write operations with memcpy() calls.
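
    The per-segment work of such a driver reduces to something like the sketch below. This is a simplification for illustration (the sketch_ prefix marks invented names), not the actual driver code.

    static void sketch_pmem_do_bvec(void *pmem_virt, struct page *page,
				    unsigned int len, unsigned int off,
				    int rw, sector_t sector)
    {
	    void *mem = kmap_atomic(page);
	    void *pmem = pmem_virt + ((size_t)sector << 9);

	    if (rw == READ)
		    memcpy(mem + off, pmem, len);	/* device -> page supplied with the bio */
	    else
		    memcpy(pmem, mem + off, len);	/* page supplied with the bio -> device */

	    kunmap_atomic(mem);
    }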


    In his pull request to merge this driver, Ingo Molnar noted that a number of features that one might expect, including mmap() and execute-in-place, are not supported yet, and that persistent-memory contents would be copied in the page cache. What Ingo had missed is that the DAX patch set providing direct filesystem access to persistent memory was merged for the 4.0 release. If a DAX-supporting filesystem (ext4 now, XFS soon) is built in a persistent memory region, file I/O will avoid the page cache and operations like mmap() will be properly supported.


    That said, there are a few things that still will not work quite as expected. One of those is mlock(), which, as Yigal Korman pointed out, may seem a bit strange: data stored in persistent memory is almost by definition locked in memory. As noted by Kirill Shutemov, though, supporting mlock() is not a simple no-op; the required behavior depends on how the memory mapping was set up in the first place. Private mappings still need copy-on-write semantics, for example. A perhaps weirder case is direct I/O: if a region of persistent memory is mapped into a process's address space, the process cannot perform direct I/O between that region and an ordinary file. There may also be problems with direct memory access (DMA) I/O operations, some network transfers, and the vmsplice() system call, among others.




    Whither struct page?

    In almost all cases, the restrictions with persistent memory come down to the lack of page structures for that memory. A page structure represents a page of physical memory in the system memory map; it contains just about everything the kernel knows about that page and how it is being used. See this article for the gory details of what can be found there. These structures are used with many internal kernel APIs that deal with memory. Persistent memory, lacking corresponding page structures, cannot be used with those APIs; as a result, various things don't work with persistent memory.


    Kernel developers have hesitated to add persistent memory to the system memory map because persistent-memory arrays are expected to be large — in the terabyte range. With the usual 4KB page size, 1TB of persistent memory would need 256 million page structures which would occupy several gigabytes of RAM. And they do need to be stored in RAM, rather than in the persistent memory itself; page structures can change frequently, so storing them in memory that is subject to wear is not advisable. Rather than dedicate a large chunk of RAM to the tracking of persistent memory, the development community has, so far, chosen to treat that memory as a separate type of device.
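
    To put numbers on that, here is a back-of-the-envelope calculation, assuming the usual 64-byte struct page on 64-bit kernels (the exact size varies with configuration):

    #include <stdio.h>

    int main(void)
    {
	    unsigned long long pmem_bytes  = 1ULL << 40;	/* 1 TB array */
	    unsigned long long page_size   = 4096;		/* 4 KB pages */
	    unsigned long long struct_page = 64;		/* assumed sizeof(struct page) */

	    unsigned long long npages = pmem_bytes / page_size;	/* 268,435,456 */
	    printf("page structs needed:  %llu\n", npages);
	    printf("RAM for page structs: %llu MB\n",
		   npages * struct_page / (1024 * 1024));		/* roughly 16 GB */
	    return 0;
    }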


    At some point, though, a way to lift the limitations around persistent memory will need to be found. There appear to be two points of view on how that might be done. One says that page structures should never be used with persistent memory. The logical consequence of this view is that the kernel interfaces that currently use page structures need to be changed to use something else — page-frame numbers, for example — that works with both RAM and persistent memory. Dan Williams posted a patch removing struct page usage from the block layer in March. It is not for the faint of heart: just over 100 files are touched to make this change. That led to complaints from some developers that getting rid of struct page usage in APIs would involve a lot of high-risk code churn and remove a useful abstraction while not necessarily providing a lot of benefit.


    The alternative would be to bite the bullet and add struct page entries for persistent memory regions. Boaz Harrosh posted a patch to that end in August 2014; it works by treating persistent memory as a range of hot-pluggable memory and allocating the memory-map entries at initialization time. The patch is relatively simple, but it does nothing to address the memory-consumption issue.


    In the long run, the solution may take the form of something like a page structure that represents a larger chunk of memory. One obvious possibility is to make a version of struct page that refers to a huge page; that has the advantage of using a size that is understood by the processor's memory-management unit and would integrate well with the transparent huge page mechanism. An alternative would be a variable-size extent structure as is used by more recent filesystems. Either way, the changes required would be huge, so this is not something that is going to happen in the near future.


    What will happen is that persistent memory devices will work on Linux as a storage medium for the major filesystems, providing good performance. There will be some rough edges with specific features that do not work, but most users are unlikely to run into them. With 4.1, the kernel will have a level of support for persistent-memory devices to allow that hardware to be put to good use, and to allow users to start figuring out what they actually want to do with that much fast, persistent storage.



