
PCI configuration space

Linux Kernel, 2016. 9. 3. 01:37

original source: http://egloos.zum.com/nimhaplz/v/5314763

Many devices sit on a PCI bus, and before any of them can be used the system has to know what each device is (identification) and how to talk to it (protocol). PCI provides the configuration space for this: it is how a device is recognized and how its basic information is obtained.

The PCI configuration space is a data structure holding roughly the following information:
Device ID, Vendor ID, Status, Class code, and so on.

Reading a device's PCI configuration space gives the basic information needed to communicate with it. So how do we read the configuration space of, say, the network card (NIC) in my PC? It cannot be looked up directly; all devices have to be searched. Every PCI device is physically assigned a bus, device, and function number. Because these numbers follow from the PCI slot, scanning every bus, every device, and every function reveals every device attached to the computer.

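Bus, device, and function together take 16 bits (8 for the bus, 5 for the device, 3 for the function), so a full scan covers 65,536 addresses. Linux performs this scan during boot to discover devices, a process called enumeration. Let's look at a real Linux device driver: r8169, the driver for Realtek network cards, found in drivers/net/r8169.c in the Linux source.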
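The 16-bit bus/device/function address can be sketched in a few lines of C. This is only an illustration of the bit layout (8 + 5 + 3 bits); the bdf_* helper names are made up for this example and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Pack bus (8 bits), device (5 bits) and function (3 bits) into 16 bits.
 * Hypothetical helper for illustration only. */
static uint16_t bdf_pack(unsigned bus, unsigned dev, unsigned fn)
{
    assert(bus < 256 && dev < 32 && fn < 8);
    return (uint16_t)((bus << 8) | (dev << 3) | fn);
}

static unsigned bdf_bus(uint16_t bdf) { return bdf >> 8; }
static unsigned bdf_dev(uint16_t bdf) { return (bdf >> 3) & 0x1f; }
static unsigned bdf_fn(uint16_t bdf)  { return bdf & 0x07; }

/* Count the addresses a brute-force scan of one PCI domain would visit. */
static unsigned long bdf_count_all(void)
{
    unsigned long n = 0;
    for (unsigned bus = 0; bus < 256; bus++)
        for (unsigned dev = 0; dev < 32; dev++)
            for (unsigned fn = 0; fn < 8; fn++)
                n++;
    return n;
}
```

bdf_count_all() visits 256 x 32 x 8 addresses, which is where the figure of 65,536 comes from.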

(Searching for r8169.c on Google brings it up right away.)
At the very end of this file is the code that registers the device driver:

static struct pci_driver rtl8169_pci_driver = {
    .name       = MODULENAME,
    .id_table   = rtl8169_pci_tbl,
    .probe      = rtl8169_init_one,
    .remove     = __devexit_p(rtl8169_remove_one),
#ifdef CONFIG_PM
    .suspend    = rtl8169_suspend,
    .resume     = rtl8169_resume,
#endif
};

The IDs this driver can handle are stored in id_table. Following that leads here:

static struct pci_device_id rtl8169_pci_tbl[] = {
    { PCI_DEVICE(PCI_VENDOR_ID_REALTEK, 0x8129), 0, 0, RTL_CFG_0 },
    { PCI_DEVICE(PCI_VENDOR_ID_REALTEK, 0x8136), 0, 0, RTL_CFG_2 },
    { PCI_DEVICE(PCI_VENDOR_ID_REALTEK, 0x8167), 0, 0, RTL_CFG_0 },

Here PCI_VENDOR_ID_REALTEK is Realtek's vendor ID (0x10ec), and the device IDs are 0x8129, 0x8136, and 0x8167. The vendor ID and device ID are stored in the PCI configuration space, so when hardware with a matching VID (vendor ID) and DID (device ID) is found, r8169.c can drive it. Now go back to where the driver is registered and look at the probe function, which does the actual preparation needed to use the hardware.

The probe function is rtl8169_init_one. Partway through, it calls pci_set_master, which enables bus-master DMA and is defined as follows:

void pci_set_master(struct pci_dev *dev){
    u16 cmd;
    pci_read_config_word(dev, PCI_COMMAND, &cmd);
    if (! (cmd & PCI_COMMAND_MASTER)) {
        pr_debug("PCI: Enabling bus mastering for device %s\n", pci_name(dev));
        cmd |= PCI_COMMAND_MASTER;
        pci_write_config_word(dev, PCI_COMMAND, cmd);
    }
    dev->is_busmaster = 1;
    pcibios_set_master(dev);
}

The pci_read_config_word and pci_write_config_word calls are where the PCI configuration space is read and written.

* pci_read_config_word: reads the PCI_COMMAND field from dev's PCI configuration space and stores it in cmd.
* pci_write_config_word: writes the value of cmd to the PCI_COMMAND field of dev's PCI configuration space.

So the code above sets the PCI_COMMAND_MASTER bit in the PCI_COMMAND field. That covers what the PCI configuration space is and how it is used. Some of the other fields are described in more detail below.
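The read-modify-write pattern in pci_set_master can be tried in user space against a fake configuration space. The sketch below is a model, not kernel code: struct fake_cfg and the cfg_*/fake_* helpers are invented for this example, while the Command register offset (0x04) and the Bus Master bit (0x0004) follow the standard PCI header layout.

```c
#include <assert.h>
#include <stdint.h>

#define CFG_COMMAND        0x04   /* offset of the Command register */
#define CFG_COMMAND_MASTER 0x0004 /* Bus Master Enable bit */

/* Hypothetical in-memory model of one device's 256-byte configuration space. */
struct fake_cfg { uint8_t bytes[256]; };

/* Configuration space is little-endian: low byte first. */
static uint16_t cfg_read_word(const struct fake_cfg *c, unsigned off)
{
    assert(off + 1 < sizeof c->bytes);
    return (uint16_t)(c->bytes[off] | (c->bytes[off + 1] << 8));
}

static void cfg_write_word(struct fake_cfg *c, unsigned off, uint16_t v)
{
    assert(off + 1 < sizeof c->bytes);
    c->bytes[off]     = (uint8_t)(v & 0xff);
    c->bytes[off + 1] = (uint8_t)(v >> 8);
}

/* Same shape as pci_set_master: write only if the bit is not already set.
 * Returns 1 if a write was needed, 0 otherwise. */
static int fake_set_master(struct fake_cfg *c)
{
    uint16_t cmd = cfg_read_word(c, CFG_COMMAND);
    if (cmd & CFG_COMMAND_MASTER)
        return 0;
    cfg_write_word(c, CFG_COMMAND, (uint16_t)(cmd | CFG_COMMAND_MASTER));
    return 1;
}
```

The other bits of the Command register are preserved, which is why the value is read first rather than written blindly.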

* Class Code
The class code classifies the PCI device; for example, it distinguishes a network card from a USB controller. Class codes are listed in include/linux/pci_ids.h. That file defines the upper 16 bits (base class and subclass), for example:

#define PCI_CLASS_STORAGE_IDE 0x0101 // Legacy IDE controller
#define PCI_CLASS_NETWORK_ETHERNET 0x0200 // Ethernet Interface

So the class code alone tells you what kind of device it is. Some classes have a standardized programming interface, and for those there is one driver per class. IDE (class code 0x0101) is an example: a driver written against the PCI IDE Controller Specification works regardless of device ID or vendor ID. Devices without such a standard need a driver matched to their device ID and vendor ID.
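As a rough sketch of how the class code is laid out: the full value is 24 bits (base class, subclass, programming interface), and the 16-bit constants above match its upper two bytes. The struct and function names below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The three bytes of a 24-bit PCI class code. */
struct class_code {
    uint8_t base;     /* bits 23:16, e.g. 0x02 = network controller */
    uint8_t sub;      /* bits 15:8,  e.g. 0x00 = Ethernet */
    uint8_t prog_if;  /* bits 7:0,   programming interface */
};

static struct class_code class_decode(uint32_t class24)
{
    struct class_code c;
    assert(class24 <= 0xFFFFFF);
    c.base    = (uint8_t)(class24 >> 16);
    c.sub     = (uint8_t)(class24 >> 8);
    c.prog_if = (uint8_t)class24;
    return c;
}

/* Compare against a 16-bit constant such as PCI_CLASS_NETWORK_ETHERNET (0x0200). */
static int class_is(uint32_t class24, uint16_t class16)
{
    return (class24 >> 8) == class16;
}
```

A class-based driver would match on the upper 16 bits in this way and ignore the vendor and device IDs.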

* Base Address Register (BAR)
The PCI configuration space is a very small area, useful mainly for finding devices. Actually driving a device needs much more room. A video card doing memory-mapped I/O (MMIO), for instance, needs a large region (2000x1000 pixels at 32 bits per pixel takes about 8 MB). That region has to be set aside separately, and the device and the OS have to agree on where it is; the Base Address Registers (BARs) record that agreement. During boot the BIOS assigns each device the address space it will use. Reading a BAR returns that assignment, and writing a BAR changes it.
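A raw 32-bit BAR value encodes more than an address: the low bits say whether the region is I/O or memory, whether a memory region is 32-bit or 64-bit, and whether it is prefetchable. The decoder below is a minimal sketch of that encoding; the type and function names are invented for this example. bar_mem_size() models the usual sizing trick (write all ones to the BAR, read it back), taking the read-back value as input.

```c
#include <assert.h>
#include <stdint.h>

enum bar_kind { BAR_IO, BAR_MEM32, BAR_MEM64, BAR_MEM_OTHER };

struct bar_info {
    enum bar_kind kind;
    int prefetchable;   /* meaningful for memory BARs only */
    uint32_t base;      /* low 32 bits of the base address */
};

static struct bar_info bar_decode(uint32_t raw)
{
    struct bar_info b = { BAR_MEM32, 0, 0 };
    if (raw & 0x1u) {                     /* bit 0 set: I/O space */
        b.kind = BAR_IO;
        b.base = raw & ~0x3u;
        return b;
    }
    switch ((raw >> 1) & 0x3u) {          /* bits 2:1: memory type */
    case 0:  b.kind = BAR_MEM32; break;
    case 2:  b.kind = BAR_MEM64; break;   /* upper half is in the next BAR */
    default: b.kind = BAR_MEM_OTHER; break;
    }
    b.prefetchable = (int)((raw >> 3) & 0x1u);
    b.base = raw & ~0xFu;
    return b;
}

/* Size of a 32-bit memory BAR from the value read back after writing ~0. */
static uint32_t bar_mem_size(uint32_t readback)
{
    assert((readback & 0x1u) == 0);       /* memory BARs only */
    return (uint32_t)(~(readback & ~0xFu) + 1u);
}
```

For example, a read-back of 0xFFFE0000 means the low 17 address bits are not decoded by the device, so the region is 128 KB.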

* Decoding PCI data and lspci output on Linux Hosts

$ ls -la /sys/bus/pci/devices

Running the command above produces output such as:

lrwxrwxrwx 1 root root 0 2009-08-03 10:38 0000:04:00.0 -> ../../../devices/pci0000:00/0000:00:0b.0/0000:04:00.0

The device string "0000:04:00.0" breaks down as follows:

0000 : PCI domain (each domain can hold up to 256 PCI buses)
04 : the bus number the device is attached to
00 : the device number
.0 : PCI device function

For more detail on the device, enter the 0000:04:00.0 directory and look at the files there.
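The sysfs name can be split into its four fields with sscanf. This is a small sketch; struct pci_addr and pci_addr_parse are hypothetical names used only here.

```c
#include <assert.h>
#include <stdio.h>

struct pci_addr { unsigned domain, bus, dev, fn; };

/* Parse "DDDD:BB:dd.f" (all fields hexadecimal).
 * Returns 0 on success, -1 if the string is malformed or out of range. */
static int pci_addr_parse(const char *s, struct pci_addr *out)
{
    char extra;
    assert(s != NULL && out != NULL);
    /* %c catches trailing characters: a well-formed name converts exactly 4 items. */
    if (sscanf(s, "%x:%x:%x.%x%c",
               &out->domain, &out->bus, &out->dev, &out->fn, &extra) != 4)
        return -1;
    if (out->bus > 0xff || out->dev > 0x1f || out->fn > 0x7)
        return -1;
    return 0;
}
```

Applied to the symlink target above, the parent 0000:00:0b.0 is device 11 on bus 0, presumably the bridge behind which bus 04 sits.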

Useful sites
http://en.wikipedia.org/wiki/PCI_configuration_space
Wikipedia pages tend to be detailed and approachable, and this one is no exception. I mostly use it to check the layout of the configuration space.

http://wiki.osdev.org/PCI
Explains in detail how to use the PCI configuration space. Good as a reference manual.

http://lwn.net/Kernel/LDD3/
The site for the classic Linux Device Drivers, 3rd edition. It is sold as a book and is also available online. It covers Linux device drivers in general in detail and (mostly) gently, though it can be hard going on a first read.

License: Attribution, NonCommercial

Block Device Open

Linux Kernel, 2016. 9. 1. 06:19

original source: Understanding Linux Kernel 3rd Edition

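Opening a block device file

The kernel opens a block device file when a filesystem is mounted on a disk or partition, when a swap partition is activated, and when a user process issues an open() system call on a block device file. In every case the kernel does the same two things: (i) it looks up the block device descriptor, and (ii) it sets up the file operation methods. When a device file is opened, dentry_open() installs device-specific methods in the file object by pointing the file object's f_op field at the def_blk_fops table. Conceptually, the open is complete once the descriptor has been found (or created if it did not exist) and the methods used from then on, open included, have been mapped to the block device functions. The objects involved are the inode, filp, bdev, and gendisk.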

Method	Function
open	blkdev_open
release	blkdev_close
llseek	block_llseek
read	generic_file_read
write	blkdev_file_write
mmap	generic_file_mmap
fsync	block_fsync
ioctl	block_ioctl
aio_read	generic_file_aio_read

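1. blkdev_open() receives an inode and a filp (file object) as parameters and performs the following steps:

    1. Checks the inode->i_bdev field of the inode object. If it is not NULL, it holds the address of the block device descriptor, which is returned.

    2. If it is NULL, looks up the descriptor with bdget(inode->i_rdev), allocating a new one if none exists.

    3. Stores the descriptor address in inode->i_bdev so that later opens can reuse it.

    4. Sets inode->i_mapping to the value of the same field in the bdev inode.

    5. Adds the inode to the descriptor's list of opened inodes.

2. Sets filp->f_mapping to the value of inode->i_mapping.

3. Gets the address of the gendisk descriptor.

4. Checks bdev->bd_openers (0: the block device is not open yet; otherwise it has already been opened).

5. Initializes bdev->bd_disk with disk, the address of the gendisk descriptor.

6. If the block device is a whole disk (partition 0), performs the following:

    1. Executes disk->fops->open, if it is defined.

    2. Sets the related fields from disk->queue.

7. Returns 0 (some other steps are omitted here).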
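The descriptor lookup in the first step (check the cached pointer, otherwise find or allocate, then cache it) can be modeled in user space. The sketch below is a heavily simplified stand-in, not kernel code: the structs carry only the fields needed here, and every model_* name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for the kernel structures (assumption: only the
 * fields used in this sketch are modeled). */
struct block_device { unsigned dev; int openers; };
struct inode { unsigned i_rdev; struct block_device *i_bdev; };

#define MAX_BDEVS 8
static struct block_device *bdev_table[MAX_BDEVS];

/* Model of bdget(): find the descriptor for a device number, or allocate one. */
static struct block_device *model_bdget(unsigned dev)
{
    int free_slot = -1;
    for (int i = 0; i < MAX_BDEVS; i++) {
        if (bdev_table[i] && bdev_table[i]->dev == dev)
            return bdev_table[i];
        if (!bdev_table[i] && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return NULL;
    struct block_device *b = calloc(1, sizeof(*b));
    if (!b)
        return NULL;
    b->dev = dev;
    bdev_table[free_slot] = b;
    return b;
}

/* Steps 1.1 to 1.3: use the cached descriptor, else look it up and cache it. */
static struct block_device *model_acquire(struct inode *inode)
{
    assert(inode != NULL);
    if (inode->i_bdev)
        return inode->i_bdev;
    inode->i_bdev = model_bdget(inode->i_rdev);
    return inode->i_bdev;
}

/* Open: acquire the descriptor and count the opener. Returns 0 on success. */
static int model_blkdev_open(struct inode *inode)
{
    struct block_device *b = model_acquire(inode);
    if (!b)
        return -1;
    b->openers++;
    return 0;
}
```

Two device files with the same device number end up sharing one descriptor, which is why the opener count reflects every open of the device.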


volatile keyword

Linux Kernel, 2016. 8. 30. 01:51

original source:https://ko.wikipedia.org/wiki/Volatile_%EB%B3%80%EC%88%98

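In C and C++ the volatile keyword limits the compiler's freedom, most notably its freedom to optimize: accesses to a variable declared volatile are excluded from optimization, so the program is compiled as the developer wrote it. It is typically used when a driver accesses device registers at fixed addresses. Note that volatile constrains only the compiler: it does not bypass the MMU's translation between logical and physical addresses in an OS such as the Linux kernel, and it is not a substitute for memory barriers or atomic operations.

Use with MMIO in C

volatile is used mainly when controlling memory-mapped I/O (MMIO): declaring the variable volatile keeps the compiler from optimizing away accesses to it.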

static int foo;
 
void bar(void)
{
    foo = 0;
 
    while (foo != 255);
}

foo is set to 0 and nothing inside the while loop changes it, so the loop condition is always true. The compiler may therefore optimize the function as follows:

void bar_optimized(void)
{
    foo = 0;

    while (true);
}

The result is an infinite loop that never rereads foo. To prevent this optimization, declare foo volatile:

static volatile int foo;

void bar (void)
{
    foo = 0;

    while (foo != 255);
}

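Now the generated machine code does what the developer wrote and intended: foo is reread on every iteration. Taken by itself the program still looks like an infinite loop, but if foo is a hardware device register, the hardware can change its value, so the loop can be used to poll the hardware. volatile is used for variables that must be shielded from compiler optimization, such as registers and memory regions that are accessed by devices other than the CPU.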


glibc Malloc

Linux Kernel, 2016. 8. 27. 09:32

original source: http://studyfoss.egloos.com/5206979

The service routine for malloc() is public_mALLOc(), which the preprocessor renames to __libc_malloc() (__malloc and malloc are aliases for this function). public_mALLOc() does the following:

1. If __malloc_hook is defined, calls the hook and returns.

2. Otherwise finds the heap region (arena) that will serve the request; usually this is main_arena.

3. Locks the arena and calls _int_malloc(), the internal function that does the real work.

4. If _int_malloc() returned NULL, calls _int_malloc() once more on a different arena.

5. Releases the lock on the arena.

6. Returns the value returned by _int_malloc().

_int_malloc() does the following:

Fast Bin Search

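The requested size is first converted to a chunk size: 4 bytes are added for the header and the result is aligned to 8 bytes (these figures correspond to a 32-bit build). All later calculations use the chunk size. If the size falls in the fast bin range (<= 72), a free chunk is looked for in the fast bins: the fast bin index for the size is computed, and the chunk pointed to by that index is stored in the local variable victim. If victim is not NULL, the chunk that victim->fd points to is stored at that fast bin index, and a pointer to victim's data area is returned. (Done)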
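The size conversion described above can be written out directly. This sketch uses the constants from the text (4-byte header, 8-byte alignment, minimum chunk of 16 bytes, fast bin limit of 72, small bin limit of 512), which correspond to a 32-bit build; the function names are made up here and only mirror the shape of glibc's request-to-chunk-size rounding.

```c
#include <assert.h>
#include <stddef.h>

#define HDR_SZ     4u   /* header bytes added to each request (32-bit build) */
#define ALIGN_MASK 7u   /* 8-byte alignment */
#define MIN_CHUNK  16u  /* smallest possible chunk */

/* Convert a request size to a chunk size: add the header, round up to a
 * multiple of 8, and never go below the minimum chunk size. */
static size_t request_to_chunk(size_t req)
{
    size_t n = req + HDR_SZ + ALIGN_MASK;
    assert(n > req);                      /* reject overflowing requests */
    return n < MIN_CHUNK ? MIN_CHUNK : (n & ~(size_t)ALIGN_MASK);
}

/* Thresholds quoted in the text. */
static int in_fast_range(size_t chunk)  { return chunk <= 72; }
static int in_small_range(size_t chunk) { return chunk < 512; }
```

A 13-byte request therefore occupies a 24-byte chunk, while requests of 1 to 12 bytes all land in the minimum 16-byte chunk.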

Small Bin Search

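If the size falls in the small bin range (< 512), a free chunk is looked for in the small bins. The small bin index for the size is computed and stored in the local variable idx, and the oldest chunk at that index is stored in victim. If victim points to a valid chunk, it is removed from the list at that index, the P (PREV_INUSE) flag is set in the header of the chunk located right after victim, and a pointer to victim's data area is returned. (Done)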

Large Bin Search

Large bins are not searched right away; some preparation comes first. The large bin index for the size is computed and stored in idx. If the arena holds any fast bin chunks, they are all consolidated into larger chunks, on the assumption that a large request means small requests will stop coming, at least for a while. (This reduces the fragmentation that fast bins would otherwise cause.) Next the unsorted bin is searched for a free chunk of exactly the requested size: the oldest chunk in the unsorted bin is stored in victim and unlinked from the unsorted list, and if victim's size equals the requested size, victim is returned. (Done)

Bin is not Found

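idx is incremented and the larger bins are checked for a free chunk. (A bitmap makes this check fast.) The bitmap bit for the current index is tested to see whether the bin holds a free chunk; if the bin is empty, the index is incremented and the test is repeated. When a bin with its bit set is found, the oldest chunk (of the smallest size) in that bin is stored in victim and unlinked from the list. If victim is large enough to satisfy the request and still leave room for another chunk, it is split and the remainder is made into a chunk and added to the unsorted bin. If the remainder's size falls in the small bin range, last_remainder is set to point to it. victim is returned. (Done)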

Heap Increase

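If there is still no chunk, the system heap has to grow. sYSMALLOc() handles this, and its return value is returned to the caller. sYSMALLOc() does the following:

First, (1) if the requested size is in the range configured to use the mmap() system call (>= 128K) and the limit on the number of mmap() uses has not been reached (< 65536), mmap() is called. On success the M (IS_MMAPPED) flag is set on the chunk and a pointer to the data area is returned. A chunk allocated with mmap() cannot be split, so it is used as a single chunk even when it has room to spare.

If the size is smaller than that, or the mmap() call failed, the heap has to grow. The amount to grow by is the requested size, minus the size of the current top chunk, plus the slack the top chunk should normally keep (pad), plus the minimum size needed to form a chunk from what is left after the allocation (16). This amount is then adjusted to the system page size. (2) sbrk() (referred to by the name MORECORE) is called with the size computed above. If the call succeeds and __after_morecore_hook is defined, the hook is called. (3) If the call fails, mmap() is called regardless of the size and count limits to try to allocate the memory. If that succeeds, the arena no longer lies in a contiguous address range, so NONCONTIGUOUS_BIT is set. If it fails, errno is set to ENOMEM and NULL is returned. (Done)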
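The two decisions described above, whether to try mmap() first and how far to extend the heap, reduce to a little arithmetic. The sketch below follows the description in the text; the function names and parameters are assumptions made for this example, not glibc internals, and the thresholds are passed in rather than hard-coded.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_CHUNK 16u  /* minimum size needed to form the leftover chunk */

/* Step (1): mmap() is tried when the request is at least the threshold
 * and the mmap count limit has not been reached. */
static int would_try_mmap(size_t nb, size_t threshold, long n_mmaps, long n_mmaps_max)
{
    return nb >= threshold && n_mmaps < n_mmaps_max;
}

/* Step (2): bytes to request from sbrk(): request minus the current top
 * chunk, plus the pad and the minimum chunk size, rounded up to whole
 * pages. page_size must be a power of two. */
static size_t heap_growth(size_t nb, size_t top_size, size_t top_pad, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    assert(nb + top_pad + MIN_CHUNK > top_size);  /* otherwise top would suffice */
    size_t size = nb + top_pad + MIN_CHUNK - top_size;
    return (size + page_size - 1) & ~(page_size - 1);
}
```

With a 5000-byte chunk request, a 1000-byte top chunk, no pad, and 4 KB pages, the heap grows by exactly one page.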


LD_PRELOAD example

Linux Kernel, 2016. 8. 27. 05:53

LD_PRELOAD rootkits Part2

Original Source: http://www.catonmat.net/blog/simple-ld-preload-tutorial-part-2/

This time, take prog.c, a simple program that calls fopen, as the example.

#include <stdio.h>

int main(void) {
    printf("Calling the fopen() function...\n");

    FILE *fd = fopen("test.txt", "r");
    if (!fd) {
        printf("fopen() returned NULL\n");
        return 1;
    }

    printf("fopen() succeeded\n");

    return 0;
}

Next, write the shared library myfopen.c. It overrides the fopen used by prog.c and then calls the original fopen from the C standard library.

#define _GNU_SOURCE

#include <stdio.h>
#include <dlfcn.h>

FILE *fopen(const char *path, const char *mode) {
    printf("In our own fopen, opening %s\n", path);

    FILE *(*original_fopen)(const char*, const char*);
    original_fopen = dlsym(RTLD_NEXT, "fopen");
    return (*original_fopen)(path, mode);
}

The shared library exports a function named fopen that prints the path and then uses dlsym with the RTLD_NEXT pseudo-handle to find the original fopen. _GNU_SOURCE must be defined to get RTLD_NEXT from <dlfcn.h>. RTLD_NEXT finds the next occurrence of the function in the search order after the current library.

Compile the shared library as follows:

gcc -Wall -fPIC -shared -o myfopen.so myfopen.c -ldl

Now when we preload it and run prog, the output shows that test.txt was opened successfully:

$ LD_PRELOAD=./myfopen.so ./prog
Calling the fopen() function...
In our own fopen, opening test.txt
fopen() succeeded


MMIO in PCIe

Linux Kernel, 2016. 8. 24. 06:11

MMIO in PCIe: Device has CPU accessible memory

Abstract

A device driver module's init function first registers the driver, supplying its name, ID table (vendor and device IDs), probe, remove, and so on. The vendor ID and device ID in the ID table are fixed for each product, so the driver must be written to match them (see http://pcidatabase.com/). The IDs can also be checked with lspci. Once the OS has recognized the device, .probe is called first; if the .probe callback is never called, the device was not recognized. The main job of the .probe callback is to make the device's memory ready for use before the driver proper starts running, which is completed by ioremap.

static struct pci_device_id test_ids[] = {
    /* TEST_VENDOR_ID and TEST_DEVICE_ID are placeholders for the device's real IDs */
    { PCI_DEVICE(TEST_VENDOR_ID, TEST_DEVICE_ID) },
    { 0 }
};

static struct pci_driver test_driver = {
    .name     = "test",
    .id_table = test_ids,
    .probe    = test_probe,
    .remove   = test_remove
};

static int __init test_init(void)
{
    /* Register the driver; .probe runs for each matching device */
    return pci_register_driver(&test_driver);
}

static void __exit test_exit(void)
{
    pci_unregister_driver(&test_driver);
}

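.probe first enables the device. Once the device is operational, the driver selects the regions (BARs) it wants to map and decides the type of mapping (IORESOURCE_MEM). After obtaining the start address and size of the device's region, it calls ioremap to get the starting virtual address (VA) of the mapped region, through which the CPU can access the device memory directly. Unlike the nopage method, this approach works only for physically contiguous addresses and, naturally, raises no page faults, because the kernel page tables already hold every PFN for the mapped range.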

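static int test_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    resource_size_t start, size;
    void __iomem *io_va;
    int bars, rc;

    rc = pci_enable_device_mem(pdev);              /* enable memory decoding */
    if (rc)
        return rc;

    bars  = pci_select_bars(pdev, IORESOURCE_MEM); /* bitmask of memory BARs */
    start = pci_resource_start(pdev, 0);           /* start of BAR 0 */
    size  = pci_resource_len(pdev, 0);             /* length of BAR 0 */

    io_va = ioremap(start, size);  /* or ioremap_wt() / ioremap_nocache() */
    if (!io_va)
        return -ENOMEM;

    ...

    return 0;
}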

Accessing the I/O and Memory Spaces

Original Source: http://www.oreilly.com/openbook/linuxdrive3/book/ch12.pdf

A PCI device implements up to six I/O address regions. Each region consists of either memory or I/O locations. Most devices implement their I/O registers in memory regions, because it’s generally a saner approach (as explained in the section “I/O Ports and I/O Memory,” in Chapter 9). However, unlike normal memory, I/O registers should not be cached by the CPU because each access can have side effects. The PCI device that implements I/O registers as a memory region marks the difference by setting a “memory-is-prefetchable” bit in its configuration register.* If the memory region is marked as prefetchable, the CPU can cache its contents and do all sorts of optimization with it; nonprefetchable memory access, on the other hand, can’t be optimized because each access can have side effects, just as with I/O ports. Peripherals that map their control registers to a memory address range declare that range as nonprefetchable, whereas something like video memory on PCI boards is prefetchable.

In this section, we use the word region to refer to a generic I/O address space that is memory-mapped or port-mapped. An interface board reports the size and current location of its regions using configuration registers—the six 32-bit registers shown in Figure 12-2, whose symbolic names are PCI_BASE_ADDRESS_0 through PCI_BASE_ADDRESS_5. Since the I/O space defined by PCI is a 32-bit address space, it makes sense to use the same configuration interface for memory and I/O. If the device uses a 64-bit address bus, it can declare regions in the 64-bit memory space by using two consecutive PCI_BASE_ADDRESS registers for each region, low bits first. It is possible for one device to offer both 32-bit regions and 64-bit regions.

In the kernel, the I/O regions of PCI devices have been integrated into the generic resource management. For this reason, you don’t need to access the configuration variables in order to know where your device is mapped in memory or I/O space. The preferred interface for getting region information consists of the following functions:

(In other words, once the mapping has been set up, a driver only needs the start address and size of the contiguous region in order to access it.)

unsigned long pci_resource_start(struct pci_dev *dev, int bar);

The function returns the first address (memory address or I/O port number) associated with one of the six PCI I/O regions. The region is selected by the integer bar (the base address register), ranging from 0–5 (inclusive).

unsigned long pci_resource_end(struct pci_dev *dev, int bar);

The function returns the last address that is part of the I/O region number bar. Note that this is the last usable address, not the first address after the region.

unsigned long pci_resource_flags(struct pci_dev *dev, int bar);

This function returns the flags associated with this resource. Resource flags are used to define some features of the individual resource. For PCI resources associated with PCI I/O regions, the information is extracted from the base address registers, but can come from elsewhere for resources not associated with PCI devices.

All resource flags are defined in <linux/ioport.h>; the most important are:

IORESOURCE_IO

IORESOURCE_MEM

If the associated I/O region exists, one and only one of these flags is set.

IORESOURCE_PREFETCH

IORESOURCE_READONLY

These flags tell whether a memory region is prefetchable and/or write protected. The latter flag is never set for PCI resources. By making use of the pci_resource_ functions, a device driver can completely ignore the underlying PCI registers, since the system already used them to structure resource information.
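The same start/end/flags triple is visible from user space: each line of a device's sysfs resource file holds three hexadecimal numbers. The sketch below parses one such line and applies the flag bits named above; struct region and the region_* helpers are invented for this example, the flag values are those defined in <linux/ioport.h>, and the sample line in the usage note is made up.

```c
#include <assert.h>
#include <stdio.h>

/* Flag bits as defined in <linux/ioport.h>. */
#define RES_IO       0x00000100ULL
#define RES_MEM      0x00000200ULL
#define RES_PREFETCH 0x00002000ULL

struct region { unsigned long long start, end, flags; };

/* Parse one "start end flags" line. Returns 0 on success, -1 otherwise. */
static int region_parse(const char *line, struct region *r)
{
    assert(line != NULL && r != NULL);
    return sscanf(line, "%llx %llx %llx", &r->start, &r->end, &r->flags) == 3 ? 0 : -1;
}

/* Length of the region. end is the last usable address, hence the +1.
 * An unused BAR reads as all zeros and has length 0. */
static unsigned long long region_len(const struct region *r)
{
    if (r->start == 0 && r->end == 0)
        return 0;
    return r->end - r->start + 1;
}
```

For a line such as "0x00000000fe000000 0x00000000fe01ffff 0x0000000000040200", this yields a 128 KB non-prefetchable memory region.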



JEMALLOC: A Scalable Concurrent malloc(3) Implementation for FreeBSD

Linux Kernel, 2016. 8. 18. 10:00

Original Source: https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf

Abstract

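This post introduces jemalloc, the allocator used at Facebook. ptmalloc, the allocator in glibc, performs reasonably in general, but in multi-threaded environments where false cache line sharing occurs its allocation performance can degrade severely. That is especially costly in enterprise settings, and jemalloc, along with Google's tcmalloc, was written as an alternative that addresses the problem. The basic design creates four arenas per CPU (when there are two or more CPUs) and assigns threads to arenas round-robin, which reduces false cache line sharing. Allocations within an arena fall into three main size categories (small, large, huge) and are handled in a way that closely resembles buddy and slab allocation.

Problem Issue: False Cache Line Sharing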

Modern multi-processor systems preserve a coherent view of memory on a per-cache-line basis. If two threads are simultaneously running on separate processors and manipulating separate objects that are in the same cache line, then the processors must arbitrate ownership of the cache line. This false cache line sharing can cause serious performance degradation.

In other words, when threads A and B run simultaneously on CPU 0 and CPU 1 and work on separate objects that happen to sit in the same cache line, the processors must arbitrate ownership of that line. The same happens when two allocations used by different threads share one cache line and both threads modify them at once. As the number of threads grows, the probability of false cache line sharing grows with it, and so does the contention overhead of keeping the caches coherent.
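One common way to see, and avoid, false cache line sharing in application code is to compare field offsets against the cache line size. The sketch below assumes 64-byte cache lines, which is typical but not universal, and uses C11 alignas; the struct names are made up for this example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Two counters updated by different threads, packed next to each other. */
struct counters_shared {
    long a;
    long b;
};

/* The same counters, each forced onto its own cache line. */
struct counters_padded {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};

/* Do two offsets fall in the same line, for a struct that itself starts
 * on a cache line boundary? */
static int same_line(size_t off1, size_t off2)
{
    return off1 / CACHE_LINE == off2 / CACHE_LINE;
}
```

The padded layout costs memory (128 bytes instead of 16 or so), which is the trade-off an allocator has to weigh when it tries to keep different threads' allocations apart.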

Related works

One of the main goals for this allocator was to reduce lock contention for multi-threaded applications running on multi-processor systems. Larson and Krishnan (1998) did an excellent job of presenting and testing strategies. They tried pushing locks down in their allocator, so that rather than using a single allocator lock, each free list had its own lock. This helped some, but did not scale adequately, despite minimal lock contention. They attributed this to “cache sloshing” – the quick migration of cached data among processors during the manipulation of allocator data structures. Their solution was to use multiple arenas for allocation, and assign threads to arenas via hashing of the thread identifiers (Figure 2). This works quite well, and has since been used by other implementations (Berger et al., 2000; Bonwick and Adams, 2001). jemalloc uses multiple arenas, but uses a more reliable mechanism than hashing for assignment of threads to arenas.

In short: Larson and Krishnan (1998) first gave each free list its own lock, but that did not scale, which they attributed to cache sloshing, the rapid migration of cached data between processors while allocator data structures are manipulated. Their solution was multiple arenas, with threads assigned to arenas by hashing the thread identifier. jemalloc also uses multiple arenas but assigns threads to them with a more reliable mechanism than hashing.

Algorithms and Data Structure

Each application is configured at run-time to have a fixed number of arenas. By default, the number of arenas depends on the number of processors:

- Single processor: Use one arena for all allocations. There is no point in using multiple arenas, since contention within the allocator can only occur if a thread is preempted during allocation.

- Multiple processors: Use four times as many arenas as there are processors. By assigning threads to a set of arenas, the probability of a single arena being used concurrently decreases.

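Like Larson and Krishnan (1998), jemalloc keeps several arenas and assigns threads to them, but it assigns them sequentially, round-robin, rather than by hashing the thread identifier. Each application is configured at run time with a fixed number of arenas, which by default depends on the number of processors. The alternatives to round-robin compare as follows:

- Reliable pseudo-random hashing: assign a thread to an arena with Hash(thread identifier). It shows no advantage over round-robin in fairness or contention.

- Dynamic re-balancing: can clearly reduce contention, but it is costly to maintain, and it is hard to guarantee that the gain outweighs the overhead.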

All memory that is requested from the kernel via sbrk(2) or mmap(2) is managed in multiples of the “chunk” size, such that the base addresses of the chunks are always multiples of the chunk size. This chunk alignment of chunks allows constant-time calculation of the chunk that is associated with an allocation. Chunks are usually managed by particular arenas, and observing those associations is critical to correct function of the allocator. The chunk size is 2 MB by default. Chunks are always the same size, and start at chunk-aligned addresses. Arenas carve chunks into smaller allocations, but huge allocations are directly backed by one or more contiguous chunks.

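In short: all memory is requested from the kernel through sbrk/mmap and managed in multiples of the chunk size (2 MB by default). Because chunks are always the same size and start at chunk-aligned addresses, the chunk that an allocation belongs to can be computed in constant time. Chunks are usually managed by a particular arena, and honoring that association is essential for the allocator to work correctly. Arenas carve chunks into smaller allocations, while huge allocations are backed directly by one or more contiguous chunks.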
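The constant-time lookup follows from the alignment: masking off the low bits of any address inside a chunk gives the chunk's base address. A minimal sketch, assuming the default 2 MB chunk size (the names are chosen for this example):

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE ((uintptr_t)2 << 20)  /* 2 MB, the default chunk size */

/* Base address of the chunk containing addr. Works because chunk sizes are
 * powers of two and chunks start at chunk-aligned addresses. */
static uintptr_t chunk_base(uintptr_t addr)
{
    return addr & ~(CHUNK_SIZE - 1);
}

/* Offset of addr within its chunk. */
static uintptr_t chunk_offset(uintptr_t addr)
{
    assert(chunk_base(addr) <= addr);
    return addr & (CHUNK_SIZE - 1);
}
```

No search structure is needed to go from a pointer passed to free() to the chunk header that describes it.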

Allocation size classes fall into three major categories: small, large, and huge. All allocation requests are rounded up to the nearest size class boundary. Huge allocations are larger than half of a chunk, and are directly backed by dedicated chunks. Metadata about huge allocations are stored in a single red-black tree. Since most applications create few if any huge allocations, using a single tree is not a scalability issue.

An arena essentially represents a range of memory addresses of some size, and assigning a thread to an arena means that the thread's allocations are served from that arena's address space. Allocation sizes fall into three categories: small, large, and huge. Every request is rounded up to the nearest size class boundary. Huge allocations are larger than half a chunk and are backed directly by dedicated chunks; their metadata is kept in a single red-black tree, which is not a scalability problem because most applications make few huge allocations, if any.

For small and large allocations, chunks are carved into page runs using the binary buddy algorithm. Runs can be repeatedly split in half to as small as one page, but can only be coalesced in ways that reverse the splitting process. Information about the states of the runs is stored as a page map at the beginning of each chunk. By storing this information separately from the runs, pages are only ever touched if they are used. This also enables the dedication of runs to large allocations, which are larger than half of a page, but no larger than half of a chunk.

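In short: for small and large allocations, chunks are carved into page runs with the binary buddy algorithm. A run can be halved repeatedly down to a single page, and runs can be merged again only by reversing the splits that produced them. The state of the runs is stored in a page map at the start of each chunk; keeping it separate from the runs means pages are touched only when they are actually used. It also lets whole runs be dedicated to large allocations, which are larger than half a page but no larger than half a chunk.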
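The three categories are defined purely by two thresholds, half a page and half a chunk, so the classification can be sketched in a few lines. The enum and function names are made up for this example; the page and chunk sizes are passed in as parameters rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

enum size_cat { CAT_SMALL, CAT_LARGE, CAT_HUGE };

/* Classify a request: huge if larger than half a chunk, large if larger
 * than half a page (but not huge), small otherwise. */
static enum size_cat categorize(size_t size, size_t page_size, size_t chunk_size)
{
    assert(page_size > 0 && chunk_size >= page_size);
    if (size > chunk_size / 2)
        return CAT_HUGE;
    if (size > page_size / 2)
        return CAT_LARGE;
    return CAT_SMALL;
}
```

With 4 KB pages and 2 MB chunks, 2048 bytes is the largest small request and 1 MB is the largest large request.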

Small allocations fall into three subcategories: tiny, quantum-spaced, and sub-page. Modern architectures impose alignment constraints on pointers, depending on data type. malloc(3) is required to return memory that is suitably aligned for any purpose. This worst case alignment requirement is referred to as the quantum size here (typically 16 bytes). In practice, power-of-two alignment works for tiny allocations since they are incapable of containing objects that are large enough to require quantum alignment. Figure 4 shows the size classes for all allocation sizes.

In short: small allocations are divided into tiny, quantum-spaced, and sub-page classes. malloc must return memory aligned suitably for any data type, and that worst-case alignment is the quantum (typically 16 bytes). Tiny allocations get by with power-of-two alignment, because they are too small to hold an object that would require quantum alignment.


Introduction to malloc

Linux Kernel, 2016. 8. 18. 04:54

Original Source:http://webtn.tistory.com/entry/Facebook%EC%9D%98-%EB%A9%94%EB%AA%A8%EB%A6%AC-%ED%95%A0%EB%8B%B9%EC%9E%90-jemalloc

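Key points for busy readers

jemalloc, introduced by Facebook, and Google's tcmalloc are the memory allocators (malloc implementations) getting the most attention these days. Both can deliver performance gains of tens of percent without touching the existing binary, just by adding one line before running it. Be sure to test them and try them out.

Introduction

Early this year a post by Jason Evans at Facebook, saying that Facebook had gained speed by using jemalloc as its memory allocator, made the rounds on developers' Twitter timelines. It felt as though the title of the company that makes news whatever it does had passed from Google to Facebook. So what is this jemalloc they use?

Malloc

Whatever a program sets out to do, getting memory from the system comes first; you need paper before you can draw. That makes malloc, which allocates memory, one of the calls C and C++ programmers use most. Experts are still working on ever more efficient malloc implementations, because malloc is called hundreds of thousands or millions of times over a program's run, so simply switching to a faster malloc speeds up the whole program without any source changes.

Why malloc matters more than ever

malloc has always been important, but for server programs running in today's multi-core, multi-threaded environments it has become even more so, for the following reasons.

Speed: Programs now use many threads, and those threads are spread across several CPUs. Distributing memory efficiently in that situation is not easy. Once many threads are involved, the performance of traditional malloc libraries starts to sag. For example, glibc malloc, the default on Linux, drops to about 60% of its peak performance once 8 or more threads are running. Naturally the performance of any program using it drops too. Something has to be done.

Space efficiency: malloc is like staking out areas of a sheet of paper to draw in. If the areas are taken haphazardly, the sheet ends up riddled with holes. After a long time, plenty of total area is still free, but large unbroken blank areas gradually disappear, so the space is there but cannot be used (fragmentation). This is especially damaging for long-running server programs. You may have seen a system die for lack of memory even though plenty of memory remained; this is that case. So a malloc that preserves large areas even after long periods of allocating and freeing has become more important.

Speed and space efficiency are like two hares, hard to catch at the same time, but thanks to the hard work of software engineers, malloc implementations that chase both are starting to appear. jemalloc, introduced here, is one of them.

Representative malloc implementations

jemalloc did not fall out of the sky. It follows in the lineage of earlier malloc development. Leaving aside ancient history, the most widely used malloc implementations of recent years are the following.

dlmalloc: the malloc written by Doug Lea. It is not fast, and because it was written long ago it does not take multi-core or multi-threading into account. It nevertheless became the base for many later malloc implementations. Incidentally, Doug Lea is an authority on Java concurrency. Java Concurrency in Practice (2006), which he co-authored, is still very strongly recommended to anyone doing server-side programming in Java. (A Korean translation is available.)

ptmalloc: the malloc included in glibc, which makes it the de facto standard on Linux. It is based on dlmalloc with multi-core and multi-threading taken into account. The arena concept of jemalloc, described later, was introduced first in ptmalloc2. It is not the fastest malloc, but it delivers average performance across general-purpose use, which is why it remains the glibc default on Linux. Think of it as the strait-laced model student.

tcmalloc: the malloc written by Sanjay Ghemawat of Google. It won over many people by proclaiming, in effect, that even malloc is different when Google makes it. As the name, thread-caching malloc, suggests, it was designed with threads very much in mind, and it is far faster than ptmalloc. As a bonus, using it gives you access to several of Google's program analysis and tuning tools, which are excellent. Incidentally, he also built Google File System, MapReduce, and BigTable; he is one of the people who built Google's infrastructure.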


posix_fallocate

Linux Kernel, 2016. 8. 16. 09:55

original Source: http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html

posix_fallocate

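Summary

posix_fallocate preallocates blocks for a file, as many as the requested size needs. It helps considerably in avoiding block fragmentation in the filesystem and suits workloads that need preallocation, such as torrent clients. On filesystems that do not support the operation natively, the allocation is emulated instead (glibc, for example, falls back to writing into each block of the range), which can be much slower.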

NAME

posix_fallocate - file space control (ADVANCED REALTIME)

SYNOPSIS

#include <fcntl.h>

int posix_fallocate(int fd, off_t offset, off_t len);

DESCRIPTION

The posix_fallocate() function shall ensure that any required storage for regular file data starting at offset and continuing for len bytes is allocated on the file system storage media. If posix_fallocate() returns successfully, subsequent writes to the specified file data shall not fail due to the lack of free space on the file system storage media.
Note that this guarantees that the space is allocated, not that the blocks are physically contiguous on the storage device. For example, to reserve the first 4 KB of the file referred to by fd, call posix_fallocate(fd, 0, 4096).
If the offset+ len is beyond the current file size, then posix_fallocate() shall adjust the file size to offset+ len. Otherwise, the file size shall not be changed.
만일 offset+ len 가 현재 파일 크기보다 크면, posix_fallocate() 는 파일 크기를 offset+ len 로 조정해야 한다. 아닐경우, 파일 크기는 변할 수 없다.
It is implementation-defined whether a previous posix_fadvise() call influences allocation strategy.
Space allocated via posix_fallocate() shall be freed by a successful call to creat() or open() that truncates the size of the file. Space allocated via posix_fallocate() may be freed by a successful call to ftruncate() that reduces the file size to a size smaller than offset+len.
posix_fallocate() 에 의해 할당된 공간은 파일의 크기를 절단하는 create나 open에 해제 될 수 있다. posix_fallocate() 에 의해 할당된 공간은 파일 크기를 offset+len 보다 작은 파일 크기로 줄일 수 있는 ftruncate() 에 의해 해제될 수 있다.

RETURN VALUE

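On success, posix_fallocate() returns zero.
Otherwise it returns one of the error numbers below (the error number is the return value itself; errno is not set):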
[EBADF] The fd argument is not a valid file descriptor.
[EBADF] The fd argument references a file that was opened without write permission.
[EFBIG] The value of offset+ len is greater than the maximum file size.
[EINTR] A signal was caught during execution.
[EINVAL] The len argument was zero or the offset argument was less than zero.
[EIO] An I/O error occurred while reading from or writing to a file system.
[ENODEV] The fd argument does not refer to a regular file.
[ENOSPC] There is insufficient free space remaining on the file system storage media.
[ESPIPE] The fd argument is associated with a pipe or FIFO.
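
A minimal sketch of the call described above. The helper names preallocate_file and file_size and the path argument are illustrative assumptions, not part of the specification; only open(), posix_fallocate(), stat(), and close() are standard calls.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose posix_fallocate() under strict -std= modes */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open (or create) `path` and make sure that bytes
 * [0, len) are backed by allocated storage.
 * Returns 0 on success or an error number on failure.
 */
int preallocate_file(const char *path, off_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return errno;               /* open() reports errors through errno */

    /* posix_fallocate() returns the error number directly; errno is not set. */
    int err = posix_fallocate(fd, 0, len);

    close(fd);
    return err;
}

/* Hypothetical helper: file size in bytes, or -1 if stat() fails. */
off_t file_size(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 ? st.st_size : (off_t)-1;
}
```

Calling preallocate_file(path, 4096) on a new file should leave it 4096 bytes long; a later call with a smaller len leaves the size unchanged, matching the offset+len rule in the description.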


System Memory

Linux Kernel, 2016. 8. 13. 10:44

original source: http://egloos.zum.com/studyfoss/v/5020843

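[Linux] Setting up the x86 system memory map

One of the most important tasks at boot time is to determine the size and location of the memory available to the system and to configure it accordingly. Cores used mainly in embedded systems, such as ARM and MIPS, have no standardized hardware configuration, so this is usually done either by using settings fixed for a specific board at compile time or by having the boot loader pass the settings as command-line options.

On x86/PC, however, the BIOS provides standard services for this. The most widely used is the so-called 'e820' method, which uses BIOS interrupt 15h. (The name comes from the fact that the AX register must hold hexadecimal E820 when the interrupt is invoked.) Ralf Brown's Interrupt List describes it as follows.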

INT 15 - newer BIOSes - GET SYSTEM MEMORY MAP
    AX = E820h
    EAX = 0000E820h
    EDX = 534D4150h ('SMAP')
    EBX = continuation value or 00000000h to start at beginning of map
    ECX = size of buffer for result, in bytes (should be >= 20 bytes)
    ES:DI -> buffer for result (see #00581)
Return: CF clear if successful
        EAX = 534D4150h ('SMAP')
        ES:DI buffer filled
        EBX = next offset from which to copy or 00000000h if all done
        ECX = actual length returned in bytes
    CF set on error
        AH = error code (86h) (see #00496 at INT 15/AH=80h)
Notes:    originally introduced with the Phoenix BIOS v4.0, this function is
    now supported by most newer BIOSes, since various versions of Windows
    call it to find out about the system memory
    a maximum of 20 bytes will be transferred at one time, even if ECX is
    higher; some BIOSes (e.g. Award Modular BIOS v4.50PG) ignore the
    value of ECX on entry, and always copy 20 bytes
    some BIOSes expect the high word of EAX to be clear on entry, i.e.
    EAX=0000E820h
    if this function is not supported, an application should fall back
    to AX=E802h, AX=E801h, and then AH=88h
    the BIOS is permitted to return a nonzero continuation value in EBX
    and indicate that the end of the list has already been reached by
    returning with CF set on the next iteration
    this function will return base memory and ISA/PCI memory contiguous
    with base memory as normal memory ranges; it will indicate
    chipset-defined address holes which are not in use and motherboard
    memory-mapped devices, and all occurrences of the system BIOS as
    reserved; standard PC address ranges will not be reported
SeeAlso: AH=C7h,AX=E801h"Phoenix",AX=E881h,MEM xxxxh:xxx0h"ACPI"

Format of Phoenix BIOS system memory map address range descriptor:
Offset    Size    Description    (Table 00580)
00h    QWORD    base address
08h    QWORD    length in bytes
10h    DWORD    type of address range (see #00581)

(Table 00581)
Values for System Memory Map address type:
01h    memory, available to OS
02h    reserved, not available (e.g. system ROM, memory-mapped device)
03h    ACPI Reclaim Memory (usable by OS after reading ACPI tables)
04h    ACPI NVS Memory (OS is required to save this memory between NVS
    sessions)
other    not defined yet -- treat as Reserved
SeeAlso: #00580
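
The descriptor layout in Table 00580 maps naturally onto a C structure. The following is a sketch only: the names e820_entry, E820_TYPE_*, e820_type_name, and e820_usable_bytes are illustrative and are not the kernel's own definitions, and the packed attribute is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Address range types from Table 00581 (names are illustrative). */
enum {
    E820_TYPE_RAM      = 1,   /* memory, available to OS */
    E820_TYPE_RESERVED = 2,   /* reserved, not available */
    E820_TYPE_ACPI     = 3,   /* ACPI Reclaim Memory */
    E820_TYPE_NVS      = 4    /* ACPI NVS Memory */
};

/* One address range descriptor (Table 00580): 8 + 8 + 4 = 20 bytes.
 * The packed attribute keeps the compiler from adding tail padding. */
struct e820_entry {
    uint64_t base;     /* offset 00h: base address */
    uint64_t length;   /* offset 08h: length in bytes */
    uint32_t type;     /* offset 10h: type of address range */
} __attribute__((packed));

/* Human-readable label; undefined types are treated as reserved. */
const char *e820_type_name(uint32_t type)
{
    switch (type) {
    case E820_TYPE_RAM:  return "usable";
    case E820_TYPE_ACPI: return "ACPI data";
    case E820_TYPE_NVS:  return "ACPI NVS";
    default:             return "reserved";
    }
}

/* Sum the lengths of all ranges the OS may use as ordinary RAM. */
uint64_t e820_usable_bytes(const struct e820_entry *map, size_t count)
{
    uint64_t total = 0;
    for (size_t i = 0; i < count; i++)
        if (map[i].type == E820_TYPE_RAM)
            total += map[i].length;
    return total;
}
```

The 20-byte size matches the note that at most 20 bytes are transferred per call.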

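The code that builds the memory map this way is the detect_memory_e820 function in arch/x86/boot/memory.c (part of the setup program). The memory map obtained via e820 is stored in boot_params by the boot loader or the setup program and passed to the kernel. The kernel examines all e820 entries, checks for overlapping information, and puts them in order (sanitize_e820_map). Based on this information it then sets variables such as max_pfn and max_low_pfn and calls init_memory_mapping to initialize the page tables for the kernel region. Both the memory map obtained via e820 and the map as modified by the kernel can be inspected with the dmesg command and under the /sys/firmware/memmap directory.

The dmesg output on the machine this article is being written on is as follows.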

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
[    0.000000] BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
[    0.000000] BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
[    0.000000] BIOS-e820: 0000000000100000 - 000000007f6e0000 (usable)
[    0.000000] BIOS-e820: 000000007f6e0000 - 000000007f700000 (ACPI NVS)
[    0.000000] BIOS-e820: 000000007f700000 - 0000000080000000 (reserved)
[    0.000000] BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)

...

The sysfs information looks like this.

namhyung@NHK-XNOTE:/sys/firmware/memmap/0$ cat start end type
0x0
0x9f7ff
System RAM


Radix Tree

Linux Kernel, 2016. 8. 12. 06:35

Original Source: http://timewizhan.tistory.com/entry/%EB%9D%BC%EB%94%95%EC%8A%A4-%ED%8A%B8%EB%A6%ACRadix-Tree

Radix Tree

What is a radix tree? Before looking at what it is, why is it used? In short, in the Linux kernel it is the data structure used to index the page cache.

Let's look more closely. To access pages faster, the kernel caches them. Given a page index on disk, the kernel uses the radix tree to look up the page cache, because the tree records where each cached page is.
If the page is present, the kernel obtains its page descriptor from the radix tree.
The descriptor then tells the kernel what kind of page it is.

That is why the page cache uses a radix tree. Now for the structure of the tree; see the figure in 'Understanding the Linux Kernel'.

It has the shape of an ordinary linked structure(?): root -> node -> node ... Next, the data types.

radix_tree_root:
height : current height of the tree
gfp_mask : flags used when memory is requested for a new node
rnode : points to the node at level 1 (each level leads one step further down)

radix_tree_node:
height : height of this node
count : number of non-NULL pointers in the node
slots : array of 64 pointers
RADIX_TREE_MAP_SIZE = 1UL << RADIX_TREE_MAP_SHIFT (6), that is, 2^6 = 64
tags : covered later, since it needs a longer explanation

Each node holds 64 pointers, so a tree of height 2 can reference 2^6 * 2^6 = 2^12 = 4096 pages.
The page indices therefore run from 0 to 4095 (2^12 - 1).

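How does the radix tree make it easy to find a page?
It reuses the idea behind Linux paging, where a 32-bit or 48-bit address is split into groups of bits.

When a page index arrives, it is split into groups of bits.
If the tree height is 1, the low 6 bits of the index are used as the index into the slots array.
If the height is 2, the low 12 bits are used: the upper 6 of them select the slot at level 1 and the lower 6 select the slot at level 2.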
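
The bit slicing can be sketched as follows. This is an illustration under the assumptions stated in the text (6 bits per level); the function names radix_max_index and radix_slot are made up for this example and are not kernel functions.

```c
#include <assert.h>
#include <limits.h>

#define RADIX_TREE_MAP_SHIFT 6
#define RADIX_TREE_MAP_SIZE  (1UL << RADIX_TREE_MAP_SHIFT)   /* 64 slots per node */
#define RADIX_TREE_MAP_MASK  (RADIX_TREE_MAP_SIZE - 1)
#define ULONG_BITS           (sizeof(unsigned long) * CHAR_BIT)

/* Largest page index a tree of the given height can hold: 2^(6*height) - 1. */
unsigned long radix_max_index(unsigned int height)
{
    unsigned int bits = height * RADIX_TREE_MAP_SHIFT;

    assert(height >= 1);
    if (bits >= ULONG_BITS)
        return ULONG_MAX;           /* the tree already covers every index */
    return (1UL << bits) - 1;
}

/* Slot used at `level` (1 = node below the root) when looking up `index`
 * in a tree of the given height. */
unsigned int radix_slot(unsigned long index, unsigned int height,
                        unsigned int level)
{
    unsigned int shift;

    assert(level >= 1 && level <= height);
    shift = (height - level) * RADIX_TREE_MAP_SHIFT;
    if (shift >= ULONG_BITS)
        return 0;                   /* those bits do not exist in the index */
    return (unsigned int)((index >> shift) & RADIX_TREE_MAP_MASK);
}
```

For example, with height 2 the index 131 (= 2 * 64 + 3) selects slot 2 at level 1 and slot 3 at level 2.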


Linear VS Physical Address

Linux Kernel, 2016. 8. 10. 07:36

Original Source: http://www.on-time.com/rtos-32-docs/rttarget-32/programming-manual/x86-cpu/protected-mode/virtual-linear-and-physical-addresses.htm

Virtual, Linear, and Physical Addresses

The 386 memory management can become quite confusing. Here is a summary of the different types of addresses and how one type is translated to another:

Virtual addresses are used by an application program. They consist of a 16-bit selector and a 32-bit offset. In the flat memory model, the selectors are preloaded into segment registers CS, DS, SS, and ES, which all refer to the same linear address. They need not be considered by the application. Addresses are simply 32-bit near pointers.

Linear addresses are calculated from virtual addresses by segment translation. The base of the segment referred to by the selector is added to the virtual offset, giving a 32-bit linear address. Under RTTarget-32, virtual offsets are equal to linear addresses since the base of all code and data segments is 0.

Physical addresses are calculated from linear addresses through paging. The linear address is used as an index into the Page Table where the CPU locates the corresponding physical address. If paging is not enabled, linear addresses are always equal to physical addresses. Under RTTarget-32, linear addresses are equal to physical addresses except for remapped RAM regions (see section RTLoc: Locating a Program, sections Virtual Command and FillRAM Command) and for memory allocated using the virtual memory manager.

A linear address is generated before page table mapping (by segmentation). A physical address is generated after page table mapping (i.e. paging).

A linear address is created by adding the logical address (the offset) to the base of the segment selected by CS, DS, ES, SS, FS, or GS. When paging is enabled, the page tables are used to translate the linear address to a physical address.

A physical address, on the other hand, is simply the address value that appears on the pins of the processor during a memory read or write operation.

In short, if paging is disabled, linear address = physical address.
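
The two translation steps can be sketched in C. This assumes classic 32-bit two-level paging with 4 KB pages (no PAE, no large pages); the names linear_address, split_linear, and physical_address are illustrative, and the page frame base is passed in by the caller instead of being read from a real page table.

```c
#include <stdint.h>

/* Segment translation: linear = segment base + virtual offset.
 * In the flat model the base is 0, so linear == offset. */
uint32_t linear_address(uint32_t segment_base, uint32_t offset)
{
    return segment_base + offset;   /* wraps modulo 2^32, like the hardware */
}

/* The three fields the paging unit extracts from a linear address. */
struct linear_parts {
    uint32_t dir_index;     /* bits 31..22: index into the page directory */
    uint32_t table_index;   /* bits 21..12: index into the page table */
    uint32_t page_offset;   /* bits 11..0 : offset within the 4 KB page */
};

struct linear_parts split_linear(uint32_t linear)
{
    struct linear_parts p;

    p.dir_index   = linear >> 22;
    p.table_index = (linear >> 12) & 0x3FFu;
    p.page_offset = linear & 0xFFFu;
    return p;
}

/* Final step: combine the page frame base found in the page table entry
 * with the page offset of the linear address. */
uint32_t physical_address(uint32_t frame_base, uint32_t linear)
{
    return (frame_base & ~0xFFFu) | (linear & 0xFFFu);
}
```

With paging disabled, the last two steps are skipped and the linear address is used as the physical address directly.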


Block I/O Operation

Linux Kernel, 2016. 8. 6. 10:14

Original Source: http://studyfoss.egloos.com/5575220, http://studyfoss.egloos.com/5576850, http://studyfoss.egloos.com/5583458, http://studyfoss.egloos.com/5585801

Block I/O Operation

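A block device is a device that is accessed in units of a fixed size (a block) rather than individual bytes; put simply, a mass-storage device such as a hard disk. Traditionally, block devices are not handled directly like other (character) devices but are accessed indirectly through an abstraction layer called the file system. The programmer therefore uses them in a uniform way (as files and directories) regardless of what kind of storage device it is (and, thanks to the VFS, regardless of which file system is in use).

To minimize disk accesses, the Linux VFS layer uses the page cache to keep the contents of disk regions it has already accessed. This article skips the page cache and looks at what happens when an I/O operation is actually performed against the block device.

The entry point for block I/O is the submit_bio() function. (Typical file systems manage disk buffers through the buffer_head (bh) structure and use submit_bh() in that case, but submit_bh() just allocates a bio from the information in the bh, initializes it, and calls submit_bio().)

This function takes the type of I/O operation (READ or WRITE in the simplest case) and a bio structure that holds all the information about the operation. A bio basically contains information about the disk region on which to perform the I/O and about the memory regions that hold the data for it. Several easily confused terms are used here, so they are summarized briefly first.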

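A sector is the smallest unit in which the device can be accessed (a hardware property). It is 512 bytes on most devices, so the Linux kernel always assumes 512-byte sectors, and the sector_t type expresses sizes and positions in units of (512-byte) sectors. (If a device uses larger sectors, the device driver must convert appropriately.)

A block is the unit in which software manages (that is, accesses) the device, and it is a multiple of the sector size. The block size is normally chosen when the file system is created (mkfs), and for ease of management it currently cannot be larger than the page size. In a typical environment with 4 KB pages, the block size is therefore one of 512 B(?), 1 KB, 2 KB, or 4 KB. A block consists of contiguous sectors on disk.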
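
The sector and block arithmetic can be sketched as below. The names SECTOR_SHIFT, block_to_sector, sectors_to_bytes, and block_size_is_valid are chosen for this example, and sector_t here is a plain typedef standing in for the kernel type.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9                      /* 2^9 = 512 bytes per sector */
#define SECTOR_SIZE  (1u << SECTOR_SHIFT)

typedef uint64_t sector_t;                  /* stand-in for the kernel's sector_t */

/* First 512-byte sector of file system block `block`. */
sector_t block_to_sector(uint64_t block, unsigned int block_size)
{
    return block * (block_size >> SECTOR_SHIFT);
}

/* Length in bytes of `nr_sectors` sectors. */
uint64_t sectors_to_bytes(sector_t nr_sectors)
{
    return nr_sectors << SECTOR_SHIFT;
}

/* A block size must be a power of two, at least one sector,
 * and no larger than the page size. */
int block_size_is_valid(unsigned int block_size, unsigned int page_size)
{
    return block_size >= SECTOR_SIZE &&
           block_size <= page_size &&
           (block_size & (block_size - 1)) == 0;
}
```

With 4 KB blocks, block 10 starts at sector 80, and a run of 8 sectors is exactly one 4 KB block.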

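A segment is a "memory" region that holds the data for an I/O operation with the device; it usually corresponds to part of the page cache. A single block is stored within one page in memory, but one I/O operation may consist of several blocks, so a segment can (conceptually) span several pages.

A block I/O operation basically moves data stored on disk into memory (READ) or data stored in memory to disk (WRITE). (Depending on the device, additional operations such as FLUSH, FUA, and DISCARD may also occur.) Things get slightly more complicated when an I/O operation covers several blocks, because the block data may not be contiguous on disk or in memory.

Consider reading a file through the file system. Even if the file is read sequentially, it only appears contiguous at the VFS level; the actual data may be scattered across the disk. (Many file systems try to place contiguous file data at contiguous disk locations for performance, but fragmentation builds up over time, so this eventually becomes unavoidable.)

In addition, data read from disk is stored in the page cache, and page-cache memory is always allocated one page at a time, so there is no guarantee that the data is contiguous in memory either.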

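The bio structure organizes the information needed for an I/O operation with all of this in mind. First, one bio can only describe a region that is contiguous on disk. If the contiguous file data being accessed is split into three parts on disk, three bios are allocated and each is passed separately through submit_bio().

During block I/O the data is mostly copied by DMA, and the device (which performs the DMA) accesses memory by physical address. So even if the pages holding file data are contiguous in the process's virtual memory (through a file mapping), they are treated as separate segments if they reside in page frames that are not adjacent.

Older devices supported only a single segment per DMA transfer, contiguous in memory as well as on disk. Data stored contiguously on disk therefore could not be handled in a single I/O operation if it was not contiguous in memory, and had to be split into several operations. Things are different if the device supports scatter-gather DMA or the machine has an IO-MMU. A bio currently stores its segments in bio_vec structures. A segment is basically stored in terms of a page, so the bio_vec holds all the information for it (page, offset, and length), and because a device may support several segments per I/O, the segments are kept in an array (vector). It may also happen that data contiguous on disk ends up in contiguous pages in memory; in that case the pages are treated as one physical segment even though they are separate pages. Alternatively, an IO-MMU can map scattered pages into a single segment (a contiguous range of addresses).

The figure in the original post shows the layout of a bio as described so far. (For simplicity, block size and page size are assumed to be equal, and the device is assumed to handle several segments at once, for example through scatter-gather DMA.) An I/O request for a contiguous file address range is split into three bios according to position on disk, and each bio can contain several segments holding the data for its region.

Next, let's look at how a bio travels through submit_bio().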

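submit_bio() stores the given type of I/O operation in the bio structure and then calls generic_make_request(). To express the type and properties of an I/O operation, the bio and request structures share flags of the form REQ_*, defined through the rq_flag_bits enumeration, and the I/O operation macros are built by combining these flags.

generic_make_request() creates a request for the given bio using the method supplied by the device driver (the make_request_fn callback).

Here a bio, as seen above, holds information about a block I/O operation requested by the upper layer (VFS), while a request is the structure that holds the information the device driver needs to carry out the actual I/O with the device.

As mentioned in the previous article, block devices are relatively very slow, so work requested by the upper layer is not carried out immediately; its order is adjusted (by the I/O scheduler), and in the process bios submitted at different times may be combined into a single request.

The function that handles all of this is generic_make_request(), which performs the various preparations the device driver needs for the I/O operation. For some special devices this process can recurse, so to cope with that the actual work was split out into __generic_make_request().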

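Software RAID (called MD, Multiple Disks, in the Linux kernel) and DM (Device Mapper) are special devices provided by the kernel that combine several physical disks and manage them as if they were one device. An I/O operation on such a device may be converted (cloned) into I/O operations on several underlying real devices, and handling this recursively can exhaust the kernel stack. (In cases such as writeback during direct reclaim, a large amount of kernel stack is already in use.)

Note that memory allocation in the block layer is done very carefully(?). When the system is already short of memory, pages used as cache often have to be written to disk before they can be reused for something else, and that is exactly when disk I/O is processed. In other words, a task that is supposed to reclaim memory under memory pressure requests new memory (needed on the I/O path); since memory is already short the allocation cannot succeed, the task goes to sleep, and a deadlock can result.

For this reason, memory allocation on the block I/O path uses GFP_NOIO (which does not trigger I/O) instead of the usual GFP_KERNEL, and in many cases techniques such as memory pools are used to preallocate the required objects so that they are available even in the worst case.

To check whether the current task has called it recursively, generic_make_request() first examines the bio_list field of the task_struct. If the field is not NULL, the call is recursive, so the function adds the current bio to the list and returns. Otherwise this is the outermost call: it points the bio_list field at a bio_list structure allocated on the stack and calls __generic_make_request() to process the request. When that call returns, it checks whether any bios were added recursively in the meantime and, if so, processes them in turn. When the list is empty, it sets the bio_list field back to NULL and returns.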
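
The recursion-to-iteration trick can be modelled in user space. This is a simplified sketch, not kernel code: struct bio is cut down to two fields, current_bio_list stands in for the bio_list field of the current task_struct, make_request() stands in for __generic_make_request() plus the driver callback, and the device numbers (0 as a stacked device that forwards to 1 and 2) are hypothetical.

```c
#include <stddef.h>

struct bio {
    struct bio *bi_next;   /* link used while the bio waits on a list */
    int dev;               /* target device (hypothetical small integer id) */
};

struct bio_list {
    struct bio *head;
    struct bio *tail;
};

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
    bio->bi_next = NULL;
    if (bl->tail)
        bl->tail->bi_next = bio;
    else
        bl->head = bio;
    bl->tail = bio;
}

static struct bio *bio_list_pop(struct bio_list *bl)
{
    struct bio *bio = bl->head;

    if (bio) {
        bl->head = bio->bi_next;
        if (!bl->head)
            bl->tail = NULL;
        bio->bi_next = NULL;
    }
    return bio;
}

/* Stand-in for the bio_list field of the current task. */
static struct bio_list *current_bio_list;

/* Bookkeeping for the demonstration only. */
static int depth, max_depth, nr_handled, handled[8];
static struct bio child[2];

void generic_make_request(struct bio *bio);

/* Device 0 is a stacked device that forwards to devices 1 and 2;
 * every other device is treated as a leaf. */
static void make_request(struct bio *bio)
{
    if (++depth > max_depth)
        max_depth = depth;
    if (nr_handled < 8)
        handled[nr_handled++] = bio->dev;
    if (bio->dev == 0) {
        child[0].dev = 1;
        child[1].dev = 2;
        generic_make_request(&child[0]);   /* nested call: only queues */
        generic_make_request(&child[1]);
    }
    depth--;
}

void generic_make_request(struct bio *bio)
{
    struct bio_list on_stack = { NULL, NULL };

    if (current_bio_list) {                /* called from inside make_request() */
        bio_list_add(current_bio_list, bio);
        return;
    }
    current_bio_list = &on_stack;          /* outermost call */
    do {
        make_request(bio);
        bio = bio_list_pop(current_bio_list);
    } while (bio);
    current_bio_list = NULL;
}
```

Submitting one bio for device 0 handles devices 0, 1, and 2 in that order, while make_request() is never nested more than one level deep, which is the point of the list.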

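__generic_make_request() is also implemented as a loop, again because devices such as MD and DM may remap an I/O request for the device to an I/O request for the real device beneath it. The device driver provides the make_request_fn callback to turn a bio into a request that the real device can process. Normally the callback returns 0, so the loop body runs once and exits. The special devices mentioned above, however, return a non-zero value to signal that the bio has been remapped to another device; the loop body then runs again to perform the necessary checks for the new device.

Inside the loop, the function checks whether the device the bio targets is currently usable, whether the requested blocks lie beyond the end of the device, and whether the device supports special features such as FLUSH, FUA, and DISCARD. If the target is a disk partition, the position is adjusted to be relative to the whole disk. It also checks for I/O failures introduced by fault injection and decides, according to the block throttling policy, whether the current I/O should be held back for a while.

Once all these steps complete successfully, the make_request_fn callback provided by the driver is called.
Ordinary disk devices register the default implementation, __make_request(), as the callback; it finds or creates the request needed to deliver the current bio to the device.

However, the complex devices mentioned above (MD and DM), as well as the loop device, which treats a regular file as a disk, and the RAM disk device (the brd module), which works on memory, do not create a request; they perform the I/O directly from the bio.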

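For example, one MD configuration is linear mode (a "personality" in MD terms), which concatenates several disks so that they look like a single disk. Requests arriving at the MD device are handled by md_make_request(), registered as the make_request_fn callback, which in turn calls the make_request callback provided by the device's personality, so linear_make_request() ends up being called.

linear_make_request() finds the real device that corresponds to the block number on the MD device, updates the device and sector information in the bio accordingly, and returns 1. The loop in __generic_make_request() then runs again for the new device, and the I/O request is delivered to the real disk. If a request to the MD device straddles the boundary between real devices joined in linear mode, it is split internally into two bios (see bio_split()), and generic_make_request() is called again for each device, so the bios are linked onto the task_struct's bio_list and processed in turn.

I/O requests delivered as bios are handled by the make_request_fn callback of each block device driver. For ordinary disk devices, __make_request() allocates a request, performs common work such as buffer bouncing, plugging, and merging, and hands the request to the I/O scheduler (also called the elevator). This part looks at __make_request().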

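First, blk_queue_bounce() allocates additional pages where the characteristics of the disk device require it. With an old ISA disk, for example, the device can only reach 16 MB (24 bits) of address space via DMA. (The same problem can occur on 64-bit systems with PCI devices limited to 32-bit DMA addresses when more than 4 GB of memory is present.)

In that case the device cannot access the pages backing the bio's segments, so pages are newly allocated from a region it can reach (ZONE_DMA/DMA32); these are called bounce buffers, and the actual I/O is carried out through them instead.

Once this is done, a request structure is allocated to pass the I/O request to the driver, but first the kernel checks whether the current bio can be merged into an existing request. A brief look at the request structure is needed first.

A request basically holds information about an I/O operation on a region that is contiguous on disk (just like a bio), and additionally contains or references several low-level data structures used by the driver. The segment information is already stored in the bio structures, so it is used as is; if several bios arrive for a contiguous disk region, they are linked into one list.

Below is the part of the request structure that is relevant here.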

include/linux/blkdev.h:

struct request {
    struct list_head queuelist;

    ...

    struct request_queue *q;

    unsigned int cmd_flags;
    enum rq_cmd_type_bits cmd_type;

    ...

    /* the following two fields are internal, NEVER access directly */
    unsigned int __data_len;    /* total data len */
    sector_t __sector;          /* sector cursor */

    struct bio *bio;
    struct bio *biotail;

    struct hlist_node hash;      /* merge hash */

    /*
     * The rb_node is only used inside the io scheduler, requests
     * are pruned when moved to the dispatch queue. So let the
     * completion_data share space with the rb_node.
     */
    union {
        struct rb_node rb_node;    /* sort/lookup */
        void *completion_data;
    };

    ...

    /* Number of scatter-gather DMA addr+len pairs after
     * physical address coalescing is performed.
     */
    unsigned short nr_phys_segments;

    ...
};

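A request is ultimately passed to the request_queue provided by the device and processed there (the actual order is determined by the I/O scheduler). The q field points to the queue the request will go to, and the queuelist field is the pointer needed to manage the list inside the request_queue. cmd_flags holds the REQ_* flags describing the properties of the I/O operation, as seen for the bio, and cmd_type is normally set to REQ_TYPE_FS (a file system operation).

__sector stores the on-disk position the request accesses, in sectors, and __data_len stores the length of the data the request handles, in bytes. (These fields may be updated while the driver is processing the request, so they must not be accessed directly from outside.)

bio and biotail form the list of bios contained in the request, which can grow when a merge happens, and hash is needed to build the hash table used to find a request to merge with quickly. (The merge process is covered shortly.)

The rb_node field is used by the I/O scheduler to sort requests by position on disk, and nr_phys_segments stores the total number of memory segments in the request.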

Now for the merge process. A (first) bio submitted via submit_bio() is turned into a request. Immediately afterwards, submit_bio() may be called again (probably by the file system layer) with a bio for the region that directly follows on disk.

In that case the second bio is added to the request created first; __sector and __data_len are updated as needed, and the bio and biotail fields point to the first and second bio respectively. (The bios are linked through their bi_next field.)

The question is how to find a request to merge a given bio into. (In the very simple case above it was the most recently created request, but with complex disk access patterns several requests have to be searched.) For this, the I/O scheduler of each disk maintains the hash table mentioned above.

The hash table is keyed on the boundary of the last sector a request accesses, presumably because disk accesses mostly proceed in order of increasing sector number. In that case the new bio is attached at the end of the existing request, which is called a back merge. Conversely, when the bio lies immediately before the existing request, it is a front merge. A back merge is always possible, but a front merge may be disallowed depending on the I/O scheduler. And of course, for any merge the request and the bio must have compatible attributes.
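
The adjacency test behind back and front merges can be sketched as follows. The types and names (struct extent, enum merge_type, try_merge) are invented for this illustration; the kernel works on struct request and struct bio and also checks attribute compatibility, which is left out here.

```c
#include <stdint.h>

typedef uint64_t sector_t;          /* stand-in for the kernel's sector_t */

enum merge_type { NO_MERGE, FRONT_MERGE, BACK_MERGE };

/* A contiguous run of sectors on disk; used for both the request and the bio. */
struct extent {
    sector_t sector;                /* first sector */
    unsigned int nr_sectors;        /* length in sectors */
};

/*
 * Try to merge `bio` into `rq`. On success `rq` is grown to cover both.
 * Front merges are attempted only if `allow_front_merge` is non-zero,
 * mirroring the fact that an I/O scheduler may refuse them.
 */
enum merge_type try_merge(struct extent *rq, const struct extent *bio,
                          int allow_front_merge)
{
    if (rq->sector + rq->nr_sectors == bio->sector) {
        rq->nr_sectors += bio->nr_sectors;       /* bio follows the request */
        return BACK_MERGE;
    }
    if (allow_front_merge &&
        bio->sector + bio->nr_sectors == rq->sector) {
        rq->sector = bio->sector;                /* bio precedes the request */
        rq->nr_sectors += bio->nr_sectors;
        return FRONT_MERGE;
    }
    return NO_MERGE;
}
```

Keying the hash table on the end of each request corresponds to the first test: a bio's start sector is looked up directly against rq->sector + rq->nr_sectors.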

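Whether the I/O scheduler attempts merges can also be controlled through sysfs. For a disk named sda, for example, the value written to /sys/block/sda/queue/nomerges has the following effect:

0: always allow merges where possible (searching the hash table),
1: only allow merging with the most recently created or merged request,
2: do not allow merges at all.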

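However, this I/O scheduler hash table is maintained per disk, so tasks accessing the disk need a lock to synchronize. On systems with heavy disk I/O this can hurt performance. To reduce the cost, before touching the shared hash table the kernel first checks the plugged list kept by each task to see whether it contains a request that can be merged.

The plugged list implements a feature called 'block device plugging', a technique for improving disk efficiency: if the disk was idle, a request is not processed the moment it arrives but held a little longer, which buys time for several requests to accumulate and be processed together or merged.

That is, when the disk was accessed it entered the plugged state and the I/O scheduler held requests for a while; later, when certain conditions were met (a certain time had passed or enough I/O requests had arrived), the device was unplugged (automatically) and began actually processing the pending requests.

Starting with version 2.6.39, however, plugging changed so that the task itself unplugs, and each task keeps the requests it has created on a list before passing them to the I/O scheduler. Because this list is not shared, unnecessary lock contention is avoided.

Per-task plugging is optional, though, so the task may not be using it when __make_request() runs.

If no suitable request is found after searching the plugged list and the I/O scheduler (elevator), a new request is created for the bio. As before, the request structure is allocated with the GFP_NOIO flag, and the mempool interface is used so that spare structures are kept ready for emergencies.

Each disk (request_queue) also has a limit on the number of requests it can hold, and creating more is prevented. The default is BLKDEV_MAX_RQ (128), and the thresholds used to judge whether the disk is congested are derived from it. With the default, the disk is considered congested once 113 or more requests are pending, and it is considered back to normal once the number of pending requests drops to 103 or fewer.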
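
The two thresholds follow from the queue depth by simple arithmetic. The sketch below reproduces the figures quoted in the text (113 and 103 for 128 requests); the formulas are, to the best of my knowledge, the ones the kernel of that era used in blk_queue_congestion_threshold(), but treat that as an assumption, and the function names here are made up.

```c
/* Number of pending requests at which the queue is marked congested:
 * seven eighths of the queue depth, plus one. */
int congestion_on_threshold(int nr_requests)
{
    int nr = nr_requests - (nr_requests / 8) + 1;

    if (nr > nr_requests)
        nr = nr_requests;
    return nr;
}

/* Number of pending requests at which the congested mark is cleared again.
 * It is lower than the "on" value so the state does not flap. */
int congestion_off_threshold(int nr_requests)
{
    int nr = nr_requests - (nr_requests / 8) - (nr_requests / 16) - 1;

    if (nr < 1)
        nr = 1;
    return nr;
}
```

For 128 requests this gives 128 - 16 + 1 = 113 and 128 - 16 - 8 - 1 = 103; the gap between the two values provides hysteresis.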

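When allocating a request, the disk state is set according to these thresholds so that upper layers can regulate the rate at which they generate I/O requests.

If I/O requests keep arriving while the disk is congested and the number of allocated requests reaches the maximum, a flag is set to indicate that the disk (request_queue) is full and no further requests may be created. The current task, however, is marked as a batcher task so that it can still create a few more requests (those submitted together). A task that had to sleep briefly during request allocation because of a memory shortage is also marked as a batcher task.

After the request has been allocated, it is linked onto the plugged list if per-task plugging is in use; otherwise it is handed to the I/O scheduler and the disk driver is asked to perform the I/O right away.

So far we have seen how an I/O operation requested by the upper (file system) layer becomes a request by way of a bio. Next is how the request is handled in the I/O scheduler.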

As seen above, most requests are created with (per-task) plugging in effect (this applies to direct reads and writes as well as read-ahead and writeback), so they are held on the plugged list for a while before being passed to the I/O scheduler.

To use plugging, allocate a blk_plug structure on the stack of the function concerned, call blk_start_plug(), issue the I/O operations, and finally call blk_finish_plug().

blk_start_plug() initializes the given blk_plug structure and stores it in the plug field of the current task. If blk_start_plug() is called several times along a nested execution path, only the first call stores the plug field. This ensures that plugging is handled at the outermost level.

blk_finish_plug() acts only if the task's plug field matches the blk_plug structure passed as its argument. It passes (inserts) all I/O operations (requests) issued between the matching start call and this finish call to the I/O scheduler and has the driver actually run the I/O. How a request is passed to the I/O scheduler depends on the type of request and the situation; several policies apply.

If the current task can no longer run and has to sleep (voluntarily!) for some reason while requests remain on its plugged list, the kblockd thread takes over the plugged list, passes the requests to the I/O scheduler, and runs the I/O.

At the moment requests on the plugged list are passed to the I/O scheduler, the kernel checks once more whether they can be merged. When several tasks access nearby disk locations at the same time, each task's requests sit on its own plugged list, out of reach of the other tasks; once they become shared, new merges may be possible. This policy is called ELEVATOR_INSERT_SORT_MERGE. When plugging is not used, no such merge attempt is needed, so the ELEVATOR_INSERT_SORT policy is used.

The I/O scheduler arranges requests by position on disk to minimize seek time, basically keeping the disk head moving steadily in one direction, which is why it is also called an elevator. (The details differ for each I/O scheduler.)

This requires a data structure inside the I/O scheduler in which requests are kept (well sorted). An rb tree (red-black tree) is used for this, and, as seen earlier, a separate hash table makes it possible to find a particular request in the sorted rb tree quickly (for merging). Requests stored in the rb tree are marked with the REQ_SORTED flag.

FLUSH/FUA requests, however, follow a slightly different policy, ELEVATOR_INSERT_FLUSH, because they may be handled differently depending on the disk and because, instead of ordinary merging, overlapping flush requests are processed together.

FLUSH means writing the contents of the disk's internal write-back cache to the actual medium, and FUA means writing the current data directly to the medium as if there were no write-back cache.
If the disk has no internal cache, FLUSH/FUA is therefore meaningless. Even for disks with a cache, FUA support is optional, so on a disk without it an incoming FUA request is converted into a FLUSH.

A FUA request arrives together with the data to write, so in the worst(?) case a single (FLUSH & FUA) request has to be processed in three steps:

 (pre) FLUSH + WRITE + (post) FLUSH

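FLUSH/FUA requests are therefore marked with the REQ_FLUSH_SEQ flag to show that they are going through this sequence, and the additional information is kept in the flush field (a structure) of the request. When several tasks issue such requests at the same time, the FLUSH operation could run several times, but if no data was written in between, the repeats achieve nothing (everything in the cache has already been written), so performing the overlapping FLUSH operations just once has the same effect.

To process FLUSH/FUA requests efficiently, a separate queue is maintained with two lists: while one FLUSH is running, FLUSH requests that arrive are parked on the other list, a double-buffering scheme that lets overlapping requests complete together.

A request handed to the I/O scheduler is finally moved to the dispatch queue. At that point it can no longer be merged, so it is removed from the hash table, and within the dispatch queue it is sorted by disk sector number (requests that are already being processed or carry the REQ_SOFTBARRIER flag can no longer be reordered, so only the requests after them are considered).

Requests in the dispatch queue are processed by the driver in order; actually starting a request is called dispatch or issue. A dispatched request gets the REQ_STARTED flag and is removed from the queue, and a timer is armed to guard against requests that never complete because of a disk error. When the dispatch queue becomes empty, the driver asks the I/O scheduler to add new requests to it. Processing ends when no requests remain or the I/O scheduler passes none to the dispatch queue.

That was a brief tour of how block device I/O travels through the kernel. The blktrace tool, written by Jens Axboe, who is also the maintainer of the Linux kernel block subsystem, makes it possible to follow the I/O path of the disks in a running system at a glance.

To see the default output on a terminal, simply use the btrace script. For other options, see the man pages for blktrace and blkparse. Below is part of the output on my system.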

# btrace /dev/sda
  ...
  8,0    0       60    10.168088873   178  A  WS 353925552 + 8 <- (8,5) 46516656
  8,0    0       61    10.168089576   178  Q  WS 353925552 + 8 [jbd2/sda5-8]
  8,0    0       62    10.168097323   178  G  WS 353925552 + 8 [jbd2/sda5-8]
  8,0    0       63    10.168098432   178  P   N [jbd2/sda5-8]
  8,0    0       64    10.168100785   178  A  WS 353925560 + 8 <- (8,5) 46516664
  8,0    0       65    10.168101033   178  Q  WS 353925560 + 8 [jbd2/sda5-8]
  8,0    0       66    10.168102298   178  M  WS 353925560 + 8 [jbd2/sda5-8]
  8,0    0       67    10.168104627   178  A  WS 353925568 + 8 <- (8,5) 46516672
  8,0    0       68    10.168104843   178  Q  WS 353925568 + 8 [jbd2/sda5-8]
  8,0    0       69    10.168105513   178  M  WS 353925568 + 8 [jbd2/sda5-8]
  8,0    0       70    10.168106517   178  A  WS 353925576 + 8 <- (8,5) 46516680
  8,0    0       71    10.168106744   178  Q  WS 353925576 + 8 [jbd2/sda5-8]
  8,0    0       72    10.168107411   178  M  WS 353925576 + 8 [jbd2/sda5-8]
  8,0    0       73    10.168109205   178  A  WS 353925584 + 8 <- (8,5) 46516688
  8,0    0       74    10.168109435   178  Q  WS 353925584 + 8 [jbd2/sda5-8]
  8,0    0       75    10.168110081   178  M  WS 353925584 + 8 [jbd2/sda5-8]
  8,0    0       76    10.168111110   178  A  WS 353925592 + 8 <- (8,5) 46516696
  8,0    0       77    10.168111328   178  Q  WS 353925592 + 8 [jbd2/sda5-8]
  8,0    0       78    10.168111953   178  M  WS 353925592 + 8 [jbd2/sda5-8]
  8,0    0       79    10.168112970   178  A  WS 353925600 + 8 <- (8,5) 46516704
  8,0    0       80    10.168113266   178  Q  WS 353925600 + 8 [jbd2/sda5-8]
  8,0    0       81    10.168113923   178  M  WS 353925600 + 8 [jbd2/sda5-8]
  8,0    0       82    10.168115804   178  A  WS 353925608 + 8 <- (8,5) 46516712
  8,0    0       83    10.168116019   178  Q  WS 353925608 + 8 [jbd2/sda5-8]
  8,0    0       84    10.168116656   178  M  WS 353925608 + 8 [jbd2/sda5-8]
  8,0    0       85    10.168118495   178  A  WS 353925616 + 8 <- (8,5) 46516720
  8,0    0       86    10.168118722   178  Q  WS 353925616 + 8 [jbd2/sda5-8]
  8,0    0       87    10.168119371   178  M  WS 353925616 + 8 [jbd2/sda5-8]
  8,0    0       88    10.168121449   178  A  WS 353925624 + 8 <- (8,5) 46516728
  8,0    0       89    10.168121665   178  Q  WS 353925624 + 8 [jbd2/sda5-8]
  8,0    0       90    10.168122304   178  M  WS 353925624 + 8 [jbd2/sda5-8]
  8,0    0       91    10.168123327   178  A  WS 353925632 + 8 <- (8,5) 46516736
  8,0    0       92    10.168123554   178  Q  WS 353925632 + 8 [jbd2/sda5-8]
  8,0    0       93    10.168124212   178  M  WS 353925632 + 8 [jbd2/sda5-8]
  8,0    0       94    10.168125241   178  A  WS 353925640 + 8 <- (8,5) 46516744
  8,0    0       95    10.168125462   178  Q  WS 353925640 + 8 [jbd2/sda5-8]
  8,0    0       96    10.168126087   178  M  WS 353925640 + 8 [jbd2/sda5-8]
  8,0    0       97    10.168128954   178  I  WS 353925552 + 96 [jbd2/sda5-8]
  8,0    0        0    10.168131125     0  m   N cfq178 insert_request
  8,0    0        0    10.168131926     0  m   N cfq178 add_to_rr
  8,0    0       98    10.168133660   178  U   N [jbd2/sda5-8] 1
  8,0    0        0    10.168135051     0  m   N cfq workload slice:100
  8,0    0        0    10.168136148     0  m   N cfq178 set_active wl_prio:0 wl_type:1
  8,0    0        0    10.168136908     0  m   N cfq178 Not idling. st->count:1
  8,0    0        0    10.168138014     0  m   N cfq178 fifo=          (null)
  8,0    0        0    10.168138615     0  m   N cfq178 dispatch_insert
  8,0    0        0    10.168139739     0  m   N cfq178 dispatched a request
  8,0    0        0    10.168140355     0  m   N cfq178 activate rq, drv=1
  8,0    0       99    10.168140588   178  D  WS 353925552 + 96 [jbd2/sda5-8]
  8,0    0      100    10.168534375     0  C  WS 353925552 + 96 [0]
  8,0    0        0    10.168554570     0  m   N cfq178 complete rqnoidle 1
  8,0    0        0    10.168555455     0  m   N cfq178 set_slice=120
  8,0    0        0    10.168556271     0  m   N cfq178 Not idling. st->count:1
  8,0    0        0    10.168556774     0  m   N cfq schedule dispatch
  ...

The part to look at closely is the 6th and 7th columns, which consist of letter codes. The 6th column shows the stage of processing the request is in (explained below), and the 7th shows the type of request: here WS is a sync write and N is none. Looking closely at the 6th column reveals a pattern (apart from the first request):
A (remap): I/O on (8,5), the /dev/sda5 partition, has been translated to a position on the whole /dev/sda disk.
Q (queue): make_request_fn has been called and processing of the bio has begun.
G (get request): a request structure has been allocated.
P (plug): the request has been placed on the plugged list.

The subsequent bios all go through A -> Q -> M. A and Q are as above; M stands for merge and means the bio was merged into the (preceding) request. Since the 8th column gives the start sector and size of each bio, it is easy to see that the requests are contiguous. The remaining codes are:
I (insert): the request that was created (and merged into) has been passed to the I/O scheduler.
m (message): not a real request but a message, here printed by the CFQ scheduler; it can be ignored for now.
U (unplug): all requests on the plugged list have been passed to the I/O scheduler.
D (dispatch): the driver has been asked to start executing the I/O.
C (complete): processing of the dispatched request has finished.

In the trace above, 12 bios of 8 sectors (= 4 KB) each were submitted in sequence, merged into a single request of 96 sectors (= 48 KB), and processed in one go.

The original post closes with a figure summarizing this whole flow.


What is wmb() in linux driver

Linux Kernel, 2016. 8. 5. 10:03

http://stackoverflow.com/questions/30236620/what-is-wmb-in-linux-driver

WMB macro: Write Memory Barrier

Let's assume a serial port, where you have to write bytes to a certain address. The serial chip will then send these bytes over the wires. It is then important that you do not mess up the writes - they have to stay in order or everything is garbled.

But the following is not enough:

   *serial = 'h';
   *serial = 'e';
   *serial = 'l';
   *serial = 'l';
   *serial = 'o';

This is because the compiler, the processor, the memory subsystem, and the buses in between are all allowed to reorder your stores as an optimization.

So you have to add code that ensures the stores do not get tangled up. That is what the wmb() macro does, among other things: it prevents reordering of the stores.

Note that just making the serial pointer volatile is not enough: while it ensures the compiler will not reorder, the other mechanisms mentioned can still wreak havoc. This is documented in another kernel doc piece.
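wmb() itself is only available inside the kernel, so the sketch below uses a user-space stand-in to show where the barriers would go. It assumes a C11 compiler and uses `atomic_thread_fence(memory_order_release)` in place of wmb(); the function name `send_bytes` and the "register" pointer are made up for illustration, and real driver code would use the kernel's own barrier and MMIO accessors instead.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: push each byte of msg into the register at reg,
 * keeping the stores in program order. */
static void send_bytes(volatile unsigned char *reg, const char *msg, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* volatile keeps the compiler from dropping or reordering the store */
        *reg = (unsigned char)msg[i];
        /* user-space stand-in for wmb(): earlier stores may not be
         * reordered after stores that follow the fence */
        atomic_thread_fence(memory_order_release);
    }
}
```

Calling `send_bytes(serial, "hello", 5)` would then issue the five stores from the example above with a barrier after each one.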

What is the return address of kmalloc() ? Physical or Virtual?

Linux Kernel | 2016. 7. 29. 03:16

Generally, at the assembly level, every address you store to or load from memory, whether given directly or through a register, is a virtual address. This is because every address the CPU core issues is virtual.

[Figures: two diagrams of virtual-to-physical address translation, the second from Wikipedia]

The virtual address is then translated into a physical address by the MMU; code running on the CPU never sees or uses the physical value directly.

Memory from vmalloc() is physically non-contiguous, while memory from kmalloc() is physically contiguous. How can kmalloc() guarantee that? Because it allocates from the kernel's linear mapping, also called the one-to-one or direct mapping, in which each virtual address differs from its physical address by a constant offset.

Sometimes DMA devices require physical addresses. Because of this mapping, the physical address can be derived from the known virtual address. Remember, though, that the CPU cannot dereference it; for example,

load (eax), ebx

is wrong if eax contains the physical address.
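The linear mapping amounts to adding or subtracting a constant. The sketch below is a toy model, not kernel code: the offset 0xC0000000 is assumed here as the classic 32-bit x86 split, and the `toy_` names are made up. In the kernel the real helpers are `virt_to_phys()` and `__pa()`.

```c
#include <stdint.h>

/* Assumption for illustration: direct mapping starts at 0xC0000000,
 * as on classic 32-bit x86 with a 3G/1G split. */
#define TOY_PAGE_OFFSET UINT32_C(0xC0000000)

/* Toy equivalent of the kernel's virt_to_phys() for direct-mapped memory. */
static uint32_t toy_virt_to_phys(uint32_t vaddr)
{
    return vaddr - TOY_PAGE_OFFSET;
}

/* Toy equivalent of the kernel's phys_to_virt(). */
static uint32_t toy_phys_to_virt(uint32_t paddr)
{
    return paddr + TOY_PAGE_OFFSET;
}
```

Under this assumption, a kmalloc()-style address such as 0xC0100000 corresponds to physical address 0x00100000, which is the value a DMA engine would need. The same arithmetic does not work for vmalloc() addresses, since those are not part of the direct mapping.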

Source from: https://www.quora.com/What-is-the-return-address-of-kmalloc-Physical-or-Virtual

Memory Mapping

Linux Kernel | 2016. 7. 28. 09:04

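(Original source: "리눅스 디바이스 드라이버" (Linux Device Drivers) by 유영창)

The MMU has no dedicated memory of its own for holding the MMU tables. It is usually built into the processor and shares system memory. So when the processor first boots, Linux allocates part of system memory for the MMU tables and writes the management information into them. The MMU is not active during this phase. Only after the kernel has written all the table information to memory does it tell the MMU where the tables are and turn the MMU on.

Memory-mapped I/O: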

char *videoptr;

videoptr = (char *) 0x000B0000;

*videoptr = 'A';

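Accessing a device with ordinary memory access instructions in this way is called memory-mapped I/O. In kernel mode with the MMU enabled, however, accessing the video device's physical address 0x000B0000 directly, as in the example above, raises a page fault and the access fails. A device driver therefore has to map the physical address to a virtual address in order to control the hardware. (Physical addresses can still be used for operations on memory that do not go through the MMU, e.g. DMA.) To map a memory-mapped I/O physical address range to addresses the kernel can use, or to release such a mapping, use ioremap(), ioremap_nocache(), and iounmap().

kmalloc: returns a virtual address; the memory is physically contiguous
vmalloc: returns a virtual address; the memory is not physically contiguous
kzalloc: like kmalloc (physically contiguous), but zero-initialized; use it for memory that will be exposed to user space

Applications usually control hardware through a device file with read(), write(), and ioctl(). These functions hide the internal structure of the hardware from the application. However, they involve transferring data between process memory and kernel memory, which makes them inefficient. In particular, for sound or video devices that must move large amounts of data quickly, using read(), write(), or ioctl() with their memory copies degrades system performance. (The copies are done with put_user(), copy_to_user(), get_user(), and copy_from_user().)

To overcome this inefficiency, Linux provides mmap(), which lets an application use the hardware's I/O address space directly, without memory copies. mmap() was originally a function for accessing files through memory addresses. When applied to a device file, however, it makes a physical address provided by the device (an I/O memory address or the address of allocated memory) usable by the application: the physical address provided by the device driver is mapped into the memory region of the process in which the application runs.

There are two ways to map the same physical address to a virtual address. One is to use mmap() to map it into the application's process virtual address space; the other is to use ioremap() to map it into the kernel's virtual address space.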
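The user-space side of mmap() can be tried without any driver. The sketch below is an assumption-laden stand-in: it maps an ordinary temporary file rather than a device file (the path template and the function name `mmap_roundtrip` are made up), stores through the mapping with no write() call, and then reads the file back to confirm the data arrived. Mapping a real device works the same way from the application's point of view, provided the driver implements mmap.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if data stored through the mapping is visible in the file,
 * -1 on any failure. */
static int mmap_roundtrip(void)
{
    char path[] = "/tmp/mmap_demo_XXXXXX";  /* hypothetical temp file */
    const char msg[] = "hello";
    char buf[sizeof msg] = {0};
    int ok = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* file goes away when fd closes */

    if (ftruncate(fd, 4096) == 0) {
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            memcpy(p, msg, sizeof msg);     /* plain store, no write() */
            msync(p, 4096, MS_SYNC);        /* flush the mapping to the file */
            if (pread(fd, buf, sizeof buf, 0) == (ssize_t)sizeof buf &&
                memcmp(buf, msg, sizeof msg) == 0)
                ok = 0;
            munmap(p, 4096);
        }
    }
    close(fd);
    return ok;
}
```

The point of the design is that after mmap() returns, the data path is an ordinary memory store; the copy_to_user()/copy_from_user() step that read() and write() need is gone.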

nopage mapping: A device driver can handle mmap in two ways. Its mmap() handler can set up the required mapping directly with remap_pfn_range(), or it can use the nopage method to map one page at a time. Whereas the approach described above maps the whole target region at once with remap_pfn_range(), the nopage approach maps in units of PAGE_SIZE.

With nopage, the application first calls mmap() to obtain an address range that the process can use. As with the remap_pfn_range() approach, the application's mmap() call invokes the mmap() handler defined in the driver's file operations structure. Unlike that approach, however, the handler does not call remap_pfn_range(). Because the kernel has not mapped the region, the first time the application touches an address obtained from mmap(), the address is treated as invalid and a page fault occurs. On a page fault, the kernel checks whether the faulting address lies in a region reserved for mapping by the device driver and, if it does, calls the function registered in vma->vm_ops->nopage.

The nopage approach is used mainly to share memory allocated by the device driver, rather than to map physical I/O memory into the application's address space. To implement mmap with nopage, the driver must first provide the nopage() function that will be called when a page fault occurs.

Linear MMAP

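1. Check whether the mmap file operation is defined for the file to be mapped.

2. Call the file object's get_unmapped_area method to allocate a linear address range suitable for the memory mapping.

3. Initialize the vm_file field with the address of the file object and call the file system's mmap method (generic_file_mmap).

4. Done. Various minor checks are omitted here.

The mapping has now been established, but no page frames (PFNs) have been assigned to it yet, so demand paging is needed. Page frame allocation is deferred until the process references one of the pages and a page fault exception occurs. The kernel examines the page table entry for the faulting address and, if the entry is absent, calls the do_no_page function.

do_no_page performs the operations common to all demand paging, such as allocating a page frame and updating the page table. It also checks whether the memory region defines a nopage method. If a nopage method is defined:

1. Call the nopage method, which returns the PFN containing the requested page.

2. If the memory mapping is private and the process is attempting to write to the page, make a copy of the page just read and insert it into the inactive page list, so that a later 'COW' fault does not occur.

3. Set the page table entry for the faulting address to the PFN and the page access rights contained in the memory region's vm_page_prot field.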

4. ...

nopage Method

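In the end, the key to how the mapping actually works is the nopage method, and nopage must return a PFN. nopage looks for the requested page in the page cache. If the page is not found, the method has to read it from disk. Most file systems implement the nopage method with the filemap_nopage function. This function takes the following three parameters.

(area, address, type)

(To be filled in later.)

In short, the nopage method checks whether the file's block is in the page cache; if not, it allocates a new frame in the page cache and reads the block in. It then returns the PFN of that page-cache page.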
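The demand-paging flow described above can be modeled in a few lines of user-space C. This is a toy simulation only: the `toy_` types and functions, the four-page region, and the frame numbers starting at 100 are all made up for illustration and do not correspond to kernel structures.

```c
#include <stddef.h>

#define TOY_PAGES    4     /* pages in the toy mapping (assumed) */
#define TOY_NO_FRAME (-1)  /* page table entry not present */

struct toy_vma {
    int page_table[TOY_PAGES];  /* page index -> frame number */
    int next_free_frame;        /* trivial frame allocator */
    int faults;                 /* page faults taken so far */
    int (*nopage)(struct toy_vma *vma, size_t page);
};

/* Toy nopage handler: hands out the next free frame. */
static int toy_nopage(struct toy_vma *vma, size_t page)
{
    (void)page;
    return vma->next_free_frame++;
}

/* Toy mmap: sets up the region but assigns no frames yet. */
static void toy_mmap(struct toy_vma *vma)
{
    for (size_t i = 0; i < TOY_PAGES; i++)
        vma->page_table[i] = TOY_NO_FRAME;
    vma->next_free_frame = 100;
    vma->faults = 0;
    vma->nopage = toy_nopage;
}

/* Toy memory access: on a missing entry, "fault" and call nopage. */
static int toy_access(struct toy_vma *vma, size_t page)
{
    if (page >= TOY_PAGES)
        return TOY_NO_FRAME;            /* outside the mapped region */
    if (vma->page_table[page] == TOY_NO_FRAME) {
        vma->faults++;
        vma->page_table[page] = vma->nopage(vma, page);
    }
    return vma->page_table[page];
}
```

After `toy_mmap()`, no frames are assigned. The first `toy_access()` to a page takes a fault and fills in the entry through the nopage callback; a second access to the same page finds the entry present and takes no fault.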

Sources

http://linuxphil.blogspot.com/2011/12/blog-post.html

Understanding the Linux Kernel, 3rd Edition

Creating a fixed-size ramdisk and using it as swap

Linux Kernel | 2016. 3. 22. 09:34

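1. Increase the ramdisk size to 24 GB

Edit /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="memmap=224G\\\$42G ramdisk_size=25165824"

2. The ramdisk does not allocate its memory up front; pages are allocated only when they are first written. To fix the size, trigger a page fault on every page first by writing to the whole device:

dd if=/dev/zero of=/dev/ram1 bs=4K count=6291456

3. Create the swap area

mkswap /dev/ram1

4. Enable the swap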

swapon /dev/ram1
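The two numbers in the steps above describe the same 24 GB. The sketch below checks that, assuming ramdisk_size is given in KiB and that dd's bs=4K means 4096 bytes; the helper names are made up for illustration.

```c
#include <stdint.h>

/* Assumption: the ramdisk_size= kernel parameter is in KiB. */
static uint64_t ramdisk_kib_to_bytes(uint64_t kib)
{
    return kib * 1024u;
}

/* Total bytes written by dd with the given block size and count. */
static uint64_t dd_total_bytes(uint64_t block_bytes, uint64_t count)
{
    return block_bytes * count;
}
```

Both `ramdisk_kib_to_bytes(25165824)` and `dd_total_bytes(4096, 6291456)` come to 25769803776 bytes, i.e. 24 GiB, so the dd command touches exactly the whole ramdisk.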

About sysinfo

Linux Kernel | 2016. 2. 26. 08:27

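None of the sysinfo fields explicitly reports the memory available to user space, so I do not quite understand which value you saw printed as 0. Sorry about that.

The freeram field is the amount of system memory still held by the buddy allocator, that is, memory that has not been allocated to anyone (neither the kernel nor user space).

The freehigh field is roughly the amount of free memory in the HIGHMEM zone. The HIGHMEM zone is generally used for user memory allocations, but the kernel can use it as well.

Conversely, the NORMAL zone is generally preferred for the kernel's own use, but it is configured as the fallback zone for HIGHMEM, so when the HIGHMEM zone runs short of memory, applications can use it too.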
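For reference, the fields discussed above can be read from user space with the sysinfo(2) call. The sketch below is Linux-specific; the `mem_snapshot` struct and `read_mem_snapshot` function are made-up names. It scales the raw fields by `mem_unit`, since the values are reported in units of that size.

```c
#include <stdint.h>
#include <sys/sysinfo.h>

/* Hypothetical container for the values of interest, in bytes. */
struct mem_snapshot {
    uint64_t total_bytes;      /* totalram * mem_unit */
    uint64_t free_bytes;       /* freeram  * mem_unit */
    uint64_t free_high_bytes;  /* freehigh * mem_unit */
};

/* Fills *out from sysinfo(2); returns 0 on success, -1 on failure. */
static int read_mem_snapshot(struct mem_snapshot *out)
{
    struct sysinfo si;

    if (sysinfo(&si) != 0)
        return -1;
    out->total_bytes     = (uint64_t)si.totalram * si.mem_unit;
    out->free_bytes      = (uint64_t)si.freeram  * si.mem_unit;
    out->free_high_bytes = (uint64_t)si.freehigh * si.mem_unit;
    return 0;
}
```

On a 64-bit system there is no HIGHMEM zone, so `free_high_bytes` is normally 0 there; that may be one way to end up with a field that prints as 0.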

ubuntu 12.04 kernel compile

Linux Kernel | 2016. 2. 25. 11:01

Source: http://mitchtech.net/compile-linux-kernel-on-ubuntu-12-04-lts-detailed/

Compile Linux Kernel on Ubuntu 12.04 LTS (Detailed)

Posted by Michael on May 19, 2012 in Tutorials, Ubuntu

This tutorial will outline the process to compile your own kernel for Ubuntu. It will demonstrate both the traditional process using ‘make’ and ‘make install’ as well as the Debian method, using ‘make-kpkg’. This is the detailed version of this tutorial; see Compile Linux Kernel on Ubuntu 12.04 LTS for the quick overview. In any case, we begin by installing some dependencies:

sudo apt-get install git-core libncurses5 libncurses5-dev libelf-dev asciidoc binutils-dev linux-source qt3-dev-tools libqt3-mt-dev fakeroot build-essential crash kexec-tools makedumpfile kernel-wedge kernel-package

Note: qt3-dev-tools and libqt3-mt-dev are necessary if you plan to use ‘make xconfig’, and libncurses5 and libncurses5-dev if you plan to use ‘make menuconfig’. Next, download the kernel sources with wget:

wget http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.2.17.tar.bz2

Extract the archive and change into the kernel directory:

tar -xjvf linux-3.2.17.tar.bz2
cd linux-3.2.17/

Now you are in the top directory of a kernel source tree. The kernel comes in a default configuration, determined by the people who put together the kernel source code distribution. It will include support for nearly everything, since it is intended for general use, and is huge. In this form it will take a very long time to compile and a long time to load. So, before building the kernel, you must configure it. If you wish to re-use the configuration of your currently-running kernel, start by copying the current config contained in /boot:

cp -vi /boot/config-`uname -r` .config

Parse the .config file using make with the oldconfig flag. If there are new options available in the downloaded kernel tree, you may be prompted to make a selection to include them or not. If unsure, press enter to accept the defaults.

make oldconfig

Since the 2.6.32 kernel, a new feature allows you to update the configuration to only compile modules that are actually used in your system. As above, make selections if prompted, otherwise hit enter for the defaults.

make localmodconfig

The next step is to configure the kernel to your needs. You can configure the build with ncurses using the ‘menuconfig’ flag:

make menuconfig

or, using a GUI with the ‘xconfig’ flag:

make xconfig

In either case, you will be presented with a series of menus, from which you will choose the options you want to include. For most options you have three choices: (blank) leave it out; (M) compile it as a module, which will only be loaded if the feature is needed; (*) compile it monolithically into the kernel, so it will always be there from the time the kernel first loads.

There are several things you might want to accomplish with your reconfiguration:

Reduce the size of the kernel, by leaving out unnecessary components. This is helpful for kernel development. A small kernel will take a lot less time to compile and less time to load. It will also leave more memory for you to use, resulting in less page swapping and faster compilations.
Retain the modules necessary to use the hardware installed on your system. To do this without including just about everything conceivable, you need to figure out what hardware is installed on your system. You can find out about that in several ways.

Before you go too far, use the “General Setup” menu and the “Local version” and “Automatically append version info” options to add a suffix to the name of your kernel, so that you can distinguish it from the “vanilla” one. You may want to vary the local version string, for different configurations that you try, to distinguish them also.

Assuming you have a running Linux system with a working kernel, there are several places you can look for information about what devices you have, and what drivers are running.

Look at the system log file, /var/log/messages or use the command dmesg to see the messages printed out by the device drivers as they came up.
Use the command lspci -vv to list out the hardware devices that use the PCI bus.
Use the command lsusb -v to list out the hardware devices that use the USB.
Use the command lsmod to see which kernel modules are in use.
Look at /proc/modules to see another view of the modules that are in use.
Look at /proc/devices to see devices the system has recognized.
Look at /proc/cpuinfo to see what kind of CPU you have.
Open up the computer’s case and read the labels on the components.
Check the hardware documentation for your system. If you know the motherboard, you should be able to look up the manual, which will tell you about the on-board devices.

Using the available information and common sense, select a reasonable set of kernel configuration options. Along the way, read through the on-line help descriptions (for at least all the top-level menu options) so that you become familiar with the range of drivers and software components in the Linux kernel.

Before exiting the final menu level and saving the configuration, it is a good idea to save it to a named file, using the “Save Configuration to an Alternate File” option. By saving different configurations under different names you can reload a configuration without going through all the menu options again. Alternatively, you can back up the file (which is named .config) manually, by making a copy with an appropriate name.

One way to reduce frustration in the kernel trimming process (which involves quite a bit of guesswork, trial, and error) is to start with a kernel that works, trim just a little at a time, and test at each stage, saving copies of the .config file along the way so that you can back up when you run into trouble. However, the first few steps of this process will take a long time since you will be compiling a kernel with a huge number of modules, nearly all of which you do not need. So, you may be tempted to try eliminating a large number of options from the start.

Now we are ready to start the build. You can speed up the compilation process by enabling parallel make with the -j flag. The recommended use is ‘processor cores + 1’, e.g. 5 if you have a quad core processor:

make -j5
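The ‘processor cores + 1’ rule of thumb can be computed rather than hard-coded. This is a small sketch, assuming a Linux/glibc system where `sysconf(_SC_NPROCESSORS_ONLN)` is available; the function name is made up for illustration.

```c
#include <unistd.h>

/* Suggested value for make's -j flag: online cores + 1.
 * Falls back to 1 core if the count cannot be determined. */
static long suggested_make_jobs(void)
{
    long cores = sysconf(_SC_NPROCESSORS_ONLN);

    if (cores < 1)
        cores = 1;
    return cores + 1;
}
```

On a quad-core machine this returns 5, matching the make -j5 example above.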

This will compile the kernel and create a compressed binary image of the kernel. After this first step, the kernel image can be found at arch/x86/boot/bzImage (for an x86-based processor). Once the initial compilation has completed, install the dynamically loadable kernel modules:

sudo make modules_install

The modules are installed in a subdirectory of “/lib/modules”, named after the kernel version. The resulting modules have the suffix “.ko”. For example, if you chose to compile the network device driver for the Realtek 8139 card as a module, there will be a kernel module named 8139too.ko. The third command is OS specific and will copy the new kernel into the directory “/boot” and update the Grub bootstrap loader configuration file “/boot/grub/grub.cfg” to include a line for the new kernel.

Finally, install the kernel:

sudo make install

This command performs many operations behind the scenes. Examine the /etc/grub.d/ directory structure before and after you run the above commands to see the changes. Also look in the /boot/grub/grub.cfg file for your kernel entry.

The OS-specific make install, Ubuntu's in this case, also creates an initrd image in the /boot directory. If you compiled the needed drivers into the kernel, you will not need this ramdisk file to aid in booting. For extra credit, remove the created initrd from the /boot/ directory as well as the references in /etc/grub.d/*.

If there are error messages from any of the make stages, you may be able to solve them by going back and playing with the configuration options. Some options require other options or cannot be used in conjunction with some other options. These dependencies and conflicts may not all be accounted for in the configuration script. If you run into this sort of problem, you are reduced to guesswork based on the compilation or linkage error messages. For example, if the linker complains about a missing definition of some symbol in some module, you might either turn on an option that seems likely to provide a definition for the missing symbol, or turn off the option that made reference to the symbol.

Reboot the system, selecting your new kernel from the boot loader menu. Watch the messages. See if it works. If it does not, reboot with the old kernel, try to fix what went wrong, and repeat until you have a working new kernel.

Debian Method:

Instead of the compilation process above, you can alternatively compile the kernel as installable .deb packages. This improves the portability of the kernel, since installation on a different machine is as simple as installing the packages. Rather than using ‘make’ and ‘make install’, we use ‘make-kpkg’:

fakeroot make-kpkg --initrd --append-to-version=-some-string-here kernel-image kernel-headers

Unlike above, you cannot enable parallel compilation with make-kpkg using the -j flag. Instead, define the CONCURRENCY_LEVEL environment variable.

export CONCURRENCY_LEVEL=3

Once the compilation has completed, you can install the kernel and kernel headers using dpkg:

sudo dpkg -i linux-image-3.2.14-mm_3.2.14-mm-10.00.Custom_amd64.deb
sudo dpkg -i linux-headers-3.2.14-mm_3.2.14-mm-10.00.Custom_amd64.deb

How to force an umount

Linux Kernel | 2015. 10. 22. 08:39

Force unmount (kills every process using the mount point, so that umount can then succeed)

fuser -ck mountdir

Find the users of the mount point

fuser -cu mountdir

[ubuntu 12.04] Changing the memory size with grub

Linux Kernel | 2015. 10. 22. 04:29

Based on Ubuntu 12.04 LTS

1. Open /etc/default/grub

2. Edit it as follows

GRUB_DEFAULT=2

GRUB_HIDDEN_TIMEOUT=0

GRUB_HIDDEN_TIMEOUT_QUIET=true

GRUB_TIMEOUT=10

GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`

GRUB_CMDLINE_LINUX_DEFAULT="memmap=224G\\\$34G"

GRUB_CMDLINE_LINUX=""

This reserves 224 GB of memory starting at the 34 GB mark, reducing the memory the system can use to 34 GB.

3. Run sudo update-grub

Done.
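To make the `memmap=nn$ss` syntax concrete, here is a simplified parser sketch. It is not the kernel's parser: it only accepts decimal numbers with an upper-case K, M, or G suffix, and the struct and function names are made up. It reads the argument as "reserve `size` bytes starting at `start`". The backslashes in /etc/default/grub appear to be there only so that a literal `$` survives the shell processing on the way to the kernel command line.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result type: a reserved region [start, start + size). */
struct memmap_region {
    uint64_t start;
    uint64_t size;
};

/* Parses a decimal number with an optional K/M/G suffix and
 * advances *s past it. */
static uint64_t parse_size(const char **s)
{
    char *end;
    uint64_t v = strtoull(*s, &end, 10);

    switch (*end) {
    case 'G': v <<= 10; /* fall through */
    case 'M': v <<= 10; /* fall through */
    case 'K': v <<= 10; end++; break;
    default: break;
    }
    *s = end;
    return v;
}

/* Parses "nn$ss" (size, then start). Returns 0 on success, -1 otherwise. */
static int parse_memmap_reserved(const char *arg, struct memmap_region *out)
{
    const char *p = arg;

    out->size = parse_size(&p);
    if (*p != '$')
        return -1;
    p++;
    out->start = parse_size(&p);
    return *p == '\0' ? 0 : -1;
}
```

For the string `224G$34G` this yields a region that starts at 34 GiB and ends at 258 GiB, leaving the first 34 GiB to the system.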

LXC references

Linux Kernel | 2015. 10. 17. 06:41

internals

http://www.slideshare.net/BodenRussell/realizing-linux-containerslxc?related=1

performance evaluation

http://www.slideshare.net/BodenRussell/kvm-and-docker-lxc-benchmarking-with-openstack

basics

http://www.slideshare.net/Flux7Labs/performance-of-docker-vs-vms?related=1

security

http://www.slideshare.net/jpetazzo/linux-containers-lxc-docker-and-security?related=2

How to use ZEST [thezest]

Linux Kernel | 2015. 10. 16. 04:16

1. Download zest 2.0

https://code.google.com/p/thezest/downloads/list

2. Extract and install

3. Run

In zest2.0/scripts:

sudo ./capture.py # creates a full memory dump image file in the temp folder

./analyze.py # analyze

Installing and using qemu

Linux Kernel | 2015. 10. 16. 02:15

Download the source --> ./configure --> make --> sudo make install

If yacc, bison, flex, etc. are missing, install them with apt-get.

1. Create an image

qemu-img create -f vdi userver.img 4G

2. Install Ubuntu

qemu-system-i386 -cdrom ubuntu-12.04.5-server-i386.iso -k en-us userver.img -boot d

3. Run Ubuntu

qemu-system-i386 -k en-us userver.img

4. FYI

If the installation fails with "Loading apt-cdrom-setup failed for unknown reasons", the most likely cause is that the installer ran out of memory. Reboot and add the -m 1024 option to the command in step 2.

Setting permanent environment variables on ubuntu 12.04

Linux Kernel | 2015. 10. 14. 07:30

Check the current variables with export.

If an environment variable is not set correctly:

1. Open /etc/environment

2. Add the environment variable

3. source /etc/environment

Done.

etags

Linux Kernel | 2015. 7. 21. 08:53

find `pwd` \( -name "*.cc" -o -name "*.[cChH]" -o -name "*.sh" -o -name "*.cpp" -o -name "*.md" \) -print | xargs etags -a

ldconfig deferred processing now taking place?

Linux Kernel | 2015. 7. 18. 09:37

This is normal and has to do with triggers: if a package requires ldconfig to be run after installing a library, the trigger causes the command to run only once, at the end of the installation, rather than after every library is installed. ldconfig creates the necessary links and cache for shared libraries.
