barrios kernel story: 2008

EOF of 2008 Dec 31, 2008

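Year after year, every year is simply busy.

That a year can be this busy even though I only ever shuttle between home and the office
is probably because the things I am trying to protect keep growing. The things I have to protect, and a foolish possessiveness, brought me various gains and losses over the year.

There were many memorable events this year, but the one that left the deepest impression is that I picked up the guitar again. It had been almost five years since I put it down. Facing the pieces and scores I used to play, I would get sentimental over the memories all by myself. This time I don't think I will put the guitar down again. I ought to have a glass of soju with the guys... but the scales just won't tip.

It was also a year in which many people left and new ones came in. Some have drifted out of sight, others out of mind... At some point I began to draw lines decisively. Once words start to go crooked, what lies underneath is bound to show eventually. It is better to stay barely noticeable, to lower yourself more than you need to, or to play the clown. Of course there are sometimes misunderstandings, but a relationship that has once grown distant has its limits. Even if all of this ends up confining me more and more.

It was also the year I laid the cornerstone for what I want to do. Rather than fearing that work, I simply enjoy the process of doing it. There is still a long way to go and, as always, time is short. This period, with my curiosity at its peak, matters a great deal. Next year will be even more important than this one.

As I always think, it is a zero-sum game. If you gain something, you naturally lose something too. Thoughts like that, riding on laziness, sometimes make life richer.

Finally, I also want to thank this friend who, even though I am clumsy at expressing my feelings and cannot always treat them well, has put up with me and trusted me to the end.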

Adios 2008

P.S.)

Greenmail always complains about the piece I play.
It makes people gloomy, supposedly. Still, seeing Greenmail hum along now and then, it can't be disliked all that much.

[PATCH] cpuset,mm: fix allocating page cache/slab object on the unallowed node when memory spread is set

http://lists-archives.org/linux-kernel/19778435-cpuset-mm-fix-allocating-page-cache-slab-object-on-the-unallowed-node-when-memory-spread-is-set.html

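This bug was reported just before the long Christmas weekend.
The reporter is Miao, a developer at Fujitsu in China.
The problem is that when memory_spread_page is set and the cpuset's mems are changed, slab does not pick up the change right away and keeps allocating memory from the old mems.

That is a problem in itself, but the bigger problem is that the patch puts a polling routine into the kernel's hot paths. I was watching to see what the other developers would say about it, and sure enough Andrew raised the following points.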

c) These are two of the kernel's hottest code paths. We really
really really really don't want to be polling for some dopey
userspace admin change on each call to __cache_alloc()!

d) How does slub handle this problem?

Point (c) matches barrios's own opinion, and I look forward to seeing how Christoph answers (d). Let's watch how it develops.

SLQB - and then there were four Dec 28, 2008

http://lwn.net/Articles/311502/

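What I have been interested in lately is SLQB. As I recall, an RFC for SLQB was already posted about five months ago, but the mm guys showed little reaction at the time. I remember SLUB was in its final polishing back then, so perhaps everyone was focused on SLUB. In any case, Nick has now made a second proposal.

SLQB's distinguishing feature is that its structure is very simple while it does not fall far behind the other existing allocators. The reason is that it is organized per CPU, which removes a lot of locking. SLQB also tries to avoid high-order allocations wherever possible. High-order allocations are one of the main culprits behind memory pressure, so order-0 (single-page) allocations are preferable, if only to reduce fragmentation.

SLQB manages objects on separate lists: freelist, rlist, and the remote_free list. The purpose is to reduce cache line bouncing. Long-lived objects are likely to be freed by a CPU other than the one that allocated them. So, to reduce cache line bouncing, freed objects are handed back to the CPU that originally allocated them, with the aim of maximizing cache hits. But will a long-lived object really still be in the allocating CPU's cache? That needs more thought.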
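
The list handling described above can be sketched as a single-threaded toy model. This is not SLQB's code: `toy_object`, `toy_cpu_cache`, `home_cpu` and the `toy_*` functions are hypothetical names, the intermediate rlist batching is left out, and a real allocator would need locking on the remote list.

```c
#include <stddef.h>

#define NR_CPUS 2

/* The list link lives inside the free object itself. */
struct toy_object {
    struct toy_object *next;
    int home_cpu;                     /* CPU that allocated the object (assumption) */
};

struct toy_cpu_cache {
    struct toy_object *freelist;      /* touched only by the owning CPU */
    struct toy_object *remote_free;   /* objects freed by other CPUs */
};

static struct toy_cpu_cache caches[NR_CPUS];

/* Free an object: it always goes back to the cache of its home CPU. */
void toy_free(struct toy_object *obj, int this_cpu)
{
    struct toy_cpu_cache *home = &caches[obj->home_cpu];
    struct toy_object **list =
        (obj->home_cpu == this_cpu) ? &home->freelist : &home->remote_free;

    obj->next = *list;
    *list = obj;
}

/* Move everything other CPUs freed onto the local freelist. */
static void toy_claim_remote(struct toy_cpu_cache *c)
{
    while (c->remote_free) {
        struct toy_object *obj = c->remote_free;

        c->remote_free = obj->next;
        obj->next = c->freelist;
        c->freelist = obj;
    }
}

/* Allocate from the local freelist, claiming remote frees only when empty. */
struct toy_object *toy_alloc(int this_cpu)
{
    struct toy_cpu_cache *c = &caches[this_cpu];
    struct toy_object *obj;

    if (!c->freelist)
        toy_claim_remote(c);
    obj = c->freelist;
    if (obj) {
        c->freelist = obj->next;
        obj->home_cpu = this_cpu;
    }
    return obj;
}
```

The fast paths only touch the local freelist; the remote list is consulted when the local one runs dry, which is where the question about cold cache lines comes in.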

SLQB's overall performance is somewhat below slab's. The reason is not clear. It may come from the change in object cacheline layout that results from keeping objects on lists rather than in arrays. In any case, Nick will keep improving the allocator's performance, and barrios will keep reviewing the code. There are a few minor bugs and some naive code, but since my review is rather late I will not comment this round.

While I was testing SLQB, a bug turned up in the existing mainline kernel. It showed up under SLQB because SLQB keeps its metadata inside the object itself. So if something writes past the end of an object, the object list is corrupted. The problem can be detected with POISON. I meant to report it to mainline right away, but it was an rc version and the SLQB review was not finished either, so I did not. Now that 2.6.28 is out, I will test SLQB again and report the problem if it recurs.
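
As a rough illustration of how poisoning exposes a write past the end of an object, here is a minimal user-space sketch. The byte values 0x6b and 0xa5 are the kernel's POISON_FREE and POISON_END; the `toy_*` helpers are hypothetical and much simpler than the real slab debugging code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define POISON_FREE 0x6b   /* fill pattern for a free object */
#define POISON_END  0xa5   /* marker in the last byte of a free object */

/* Fill a free object (size >= 1) with the poison pattern. */
void toy_poison(unsigned char *obj, size_t size)
{
    memset(obj, POISON_FREE, size - 1);
    obj[size - 1] = POISON_END;
}

/* Check, typically at allocation time, that nobody wrote to the free object. */
bool toy_poison_intact(const unsigned char *obj, size_t size)
{
    for (size_t i = 0; i + 1 < size; i++)
        if (obj[i] != POISON_FREE)
            return false;
    return obj[size - 1] == POISON_END;
}
```

If an object sits directly in front of a poisoned free object and its user writes a few bytes too many, `toy_poison_intact()` on the neighbour returns false the next time it is checked.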

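P.S.)
If I have time I would like to add POISON and RED_ZONE to the document too, since they are features most people don't know well... but I'm not sure I will have the time. Too many other things to do.

Finally, I attach a document with a brief analysis.

SLQB document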

The 2.6.28 kernel is out Dec 25, 2008

http://lwn.net/Articles/312786/

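2.6.28 has been released. There are a few major patches worth looking at.
I will come back to them when I have time and have sorted them out.

Linus's announcement is amusing. Fittingly for Christmas, he likens the 2.6.28 release to a present from Santa.
Will the kids actually like the present? ^^

That aside, this 2.6.28 release means more than that to me.
The split LRU, which I worked on for a while to improve memory reclaim performance, has been officially merged into mainline.
As a result, a few more mainline contributions have been added to my name.

I would like to thank Rik, who added a Signed-off-by even for my small fixes.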

Vals No. 4 Op. 8 Dec 22, 2008

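This is the piece I have been practicing lately.
Mangoré's Waltz No. 4.

I remember it was 2003. A junior of mine was playing this piece in the club room,
which I was visiting for the first time in a long while. Until then, frankly, it was not a piece I particularly wanted to play.
That junior was not exactly the reason I started practicing it,
but it is true that it gave me the motivation of thinking that I, too, could play this piece.

The highlight of the piece is the run of continuous arpeggios in the middle together with the ascending bass
and then the descending bass. It is not Bach's counterpoint, but it brings out the lyrical yet splendid
mood of the piece.

In particular, the final section, which closes the piece with a sense of rising tension, has to be brought out well.

The video below feels a little slower than the actual CD performance.
I hope you can drop by when you have the leisure to enjoy the playing of Russell, a master of Mangoré's music.

- On the day of the second snowfall of 2008 -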

all about documentation

http://lwn.net/Articles/310569/

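Kernel developers are very skeptical about documentation.
Andrew's remark is striking.

"Rather than making the code clear, people just mindlessly think:
a comment is supposed to go here, so...."

Methodologies such as TDD already treat comments as taboo for the same reason Andrew gives.
But that is, after all, the view of kernel developers who have already reached guru status.

In practice, when exploring kernel code, there are many parts that are very hard to understand without comments along the way. Of course there are plenty of obsolete comments too. But in my experience, more often than not those comments keep you from losing your bearings on a long voyage. The kernel core in particular contains not just simple logic but odd pieces of magic code that went in because of regressions seen on particular machines. Unless you have been following the kernel patch history, such code is impossible to understand from the logic alone.

These days many kernel developers, Andrew included, stress the importance of the git log and try to record problems there more clearly, and the result is generally satisfactory.

Perhaps kernel newbies like me should not be grateful for such comments, but should instead criticize the code that stays unclear because of them...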

2002 kernel trap Ingo Molnar Interview Dec 11, 2008

Only the parts of the 2002 interview that interest me...

http://kerneltrap.org/node/517

Jeremy Andrews: When did you get started with Linux?

Ingo Molnar: i think i first heard about Linux around 1993, but i truly got hooked on kernel development in 1995 when i bought the german edition of the 'Linux Kernel Internals' book. It might sound a bit strange but i installed my first Linux box for the sole purpose of looking at the kernel source - which i found (and still find) fascinating. So i guess i'm one of the few people who started out as a kernel developer, later on learned their way to be a Linux admin and then finally learned their way around as a Linux user ;-)

JA: What was your first contribution to the kernel?

Ingo Molnar: my very first contribution was a trivial #ifdef bugfix to the networking code, which was reviewed and merged by Alan Cox. At that point i've been lurking on the kernel mailing list for a couple of months already. My first bigger patch was to arch/i386/kernel/time.c, i implemented timestamp-counter based gettimeofday() on Pentiums (which sped up the gettimeofday() syscall by a factor of ~4) - that code is still alive in current kernels. This patch too was first reviewed by Alan Cox.

I strongly believe that a positive 'first contact' between kernel newbies and kernel oldbies is perhaps the single most important factor in attracting new developers to Linux. Besides having the ability to code, kernel developers also need the ability to talk and listen to other developers.

JA: Did you base the design on any existing scheduler implementations or research papers?

Ingo Molnar: this might sound a bit arrogant, but i have only read (most of the) research papers after writing the scheduler. This i found to be a good approach in the area of Linux - knowing about too many well-researched details can often confuse the real direction we have to take. I like writing new code, and i prefer to approach things from the physics side: take a few elementary rules and build up the 'one correct' solution, no compromises. This might not be as effective as first reading all the available material and then cherry-picking a few ideas and thinking up the remaining things, but it sure gives me lots of fun :-)

[ One thing i always try to ensure: i take a look at all existing kernel patches that were announced on the linux-kernel mailing list in the same area, to make sure there's no duplication of effort or NIH syndrome. Since such kernel mailing-list postings are progress reports of active research, it can be said that i read alot of bleeding-edge research. ]

JA: How do JVMs trigger an inefficiency in the old scheduler?

Ingo Molnar: the Java programming model prefers the use of many 'threads' - which is a valid and popular application programming model. So JVMs under Linux tend to be amongst the applications that use the most processes/threads, which are interacting in complex ways. Schedulers usually have the most work to do when there are more tasks in the systems, so JVMs tend to trigger scheduler inefficiencies sooner than perhaps any other Linux application.

JA: You're also the author of the original kernel preemption patch. How did your patch differ from the more recent work Robert Love has done in this area?

Ingo Molnar: it was a small concept-patch from early 2000 that just showed that a preemptible kernel can indeed be done by using SMP spinlocks. The patch, while it booted and appeared to work to a certain degree, had bugs and did not handle the many cases that need special care, which Robert's patches and the current 2.5 kernel handles correctly.

otherwise the base approach is IMO very similar, it has things like:

+               preempt_on();
            clear_highpage(page);

+               preempt_off();

and:

+               atomic_inc_local(&current->may_preempt);        \

which is quite similar to what we have 2.5 today, with the difference that
Robert and the kernel developer community actually did the other 95% of the work :-)

JA: Are you also actively working on 2.5 preemptible kernel development?

Ingo Molnar: The maintainer is Robert - i do tend to send smaller preempt related patches (and even a larger one, the 'IRQ lock removal' patch centered around the use of the preemption count). I'm obviously interested in the topic, and i'm happy that all the seemingly conflicting concepts as lowlatency and preemption are now properly merged into 2.5 and that we have really good kernel latencies. Other pressing topics like the scheduler and the threading code still keep me busy most of the time.

JA: Your IRQ rewrite and Robert's preemptible kernel work have resulted in a unified per-task atomic count (the preempt_count) and a lot of code being cleaned up. Do you have plans to do more work in this area?

Ingo Molnar: not at the moment - right now i think that the IRQ code could hardly be any cleaner than it is today :-)

JA: What other kernel projects are you currently working on?

Ingo Molnar: mainly the scheduler, plus these days i'm working on enhancing the handling of 'threads' under Linux, utilized by the NPTL project done by glibc maintainer Ulrich Drepper. This has a high number of components that are in the 2.5 kernel already.

JA: Can you further describe the components that have already been merged into the 2.5 kernel?

Ingo Molnar: TLS stands for 'Thread Local Storage'. You can find the first announcement of the patch at:

http://lwn.net/Articles/5851/

a number of followup patches were posted, and it all got eventually merged
into 2.5.31.

Plus there were a few other things related to threading:

http://lwn.net/Articles/8131/

http://lwn.net/Articles/8034/

http://lwn.net/Articles/7618/

http://lwn.net/Articles/7617/

http://lwn.net/Articles/7603/

http://lwn.net/Articles/7411/

http://lwn.net/Articles/7408/

(note that most of the above patches got reworked significantly before they
got into the 2.5 kernel, but the concepts were all preserved.)

JA: What other Linux kernel related projects have you worked on in the past?

Ingo Molnar: here's a probably incomplete list of the bigger pieces that made it into the kernel: software-RAID support, 3-level paging on x86 (and highmem), the recent IRQ handling rewrite in 2.5 (which also removed the 'big IRQ lock'), the timer scalability patch, kernel workqueues, the CPU affinity syscalls, the initial SMP pagecache scalability code in 2.3, and i also wrote the original 'writeback pagecache' patch for 2.3, wrote various fixes and enhancements to the 'old' scheduler, wrote the 'wake one' support patch for 2.4, wrote the original zoned allocator, bootmem and mempool subsystems. Ie. all across the spectrum.

One project that is not in the 2.5 kernel is the Tux webserver (and now FTP server as well). If you want to see a Tux/FTP server that can serve 10,000 users then do:

ftp ftp.rpmfind.net

some smaller but interesting patches: the NMI watchdog, the ability of the 2.4 kernel to create more than ~4000 processes on x86 (ie. the removal of per-thread TSS), netconsole/netdump, 'big reader locks', and one older patch from 2.2 times i'm particularly proud of: i wrote the original 'current task pointer' implementation, which uses the stack pointer to get to the 'current task pointer' on SMP systems. I also wrote the 'memleak' and 'ktrace' debugging helper tools, which have been picked up by other projects.

JA: Your list of contributions is staggering!

Ingo Molnar: well, it's just that i've been around long enough, and that i'm interested in many different areas. So a colorful mix of contributions piled up.

JA: Are you still working on the Tux webserver?

Ingo Molnar: occasionally yes, but other things take precedence currently. But life has not stopped, eg. Anton Blanchard has ported Tux to 2.5, and Arjan van de Ven keeps the 2.4 patch uptodate.

JA: What still needs to be modified in the generic kernel?

Ingo Molnar: it's mainly two VFS changes, an exit()-time cleanup function and one new TCP event callback. All the 'big' features that were induced by TUX are in the 2.5 kernel already, zerocopy and the scalability work, so TUX for 2.5 is a really unintrusive patch.

JA: Of all these many impressive accomplishments, which are you the most proud?

Ingo Molnar: well, perhaps the scheduler, it manages to solve a few really hard conceptual problems in a pretty critical piece of code that already got called a couple of thousand times while eg. reading this article on a Linux box! :-)

JA: What is your background in programming prior to getting involved with Linux?

Ingo Molnar: well, like many others, i grew up on programming all possible (and even some impossible) aspects of Commodore micro-computers, since age 11. Completely knowing a greatly simplified but fully functional computer architecture helped alot in kernel development.

I think kids today have a harder time, since hardware vendors are much more tightlipped about computer internals, and the complexity of computer systems skyrocketed as well. Linux perhaps helps here too, as a central 'documentation' and reference implementation for "all computer internals that matter".

JA: Much of your work seems to be focused on improving the performance and scalability of the 2.5 kernel. Is this the result of RedHat's product requirements, or your own interests?

Ingo Molnar: well, i'm in the fortunate position that the two are a perfect match.

JA: Can you describe your development environment, including the hardware and software tools you typically use?

Ingo Molnar: i use all the normal text based kernel development tools: vim, gcc/make/etc., i use a serial line to a test-system to debug kernels, and that's all. I like it simple when reading kernel code: i use text consoles (on an LCD screen) to do most of my development work. Occasionally i drop into X for tools that make sense only there, such as ethereal or some of the BK tools.

JA: Have you worked with any other open source kernels?

Ingo Molnar: not really. I occasionally take a look at FreeBSD - some things they do right, some things they dont, in the areas i'm most interested in the Linux kernel is currently ahead both design-wise and implementation-wise. Finally we caught up in the VM subsystem as well, with Andrea's big and important 2.4 rewrite, Rik's great rmap code and Andrew's fantastic integration work. But what other answer would one expect from a Linux kernel developer? :-)

JA: FreeBSD 5.0 is due to be released around December of this year, with some significant changes to the kernel. Have you followed this development?

Ingo Molnar: not really. The things i sometimes do is to look at their code. Also, when i search for past discussions regarding some specific topic, sometimes there's a FreeBSD hit and then i read it. That's all what i can tell. But i do wish their kernel gets better just as much as the Linux kernel gets better, there needs to be competition to drive both projects forwards. (the Windows kernel is closed up enough so that it does not create any development stimulus for Linux (and vice versa). Rarely do any Windows features get discussed.)

JA: What areas of the Linux kernel do you think still lags behind FreeBSD?

Ingo Molnar: there were two areas where i think we used to lag, the VM and the block IO subsystem - both have been significantly reworked in 2.5. Whether the VM got better than FreeBSD's remains to be seen (via actual use), but the Linux VM already has features that FreeBSD does not have, eg. support for more than 4 GB RAM on x86 (here i guess i'm biased, i wrote much of that code). But FreeBSD's core VM logic itself, ie. the state machine that decides what to throw out under memory pressure, how to swap and how to do IO, is top-notch. I think with Andrew Morton's and Jens Axboe's latest VM and IO work we are top-notch as well (with a few extras perhaps).

There's also an interesting VM project in the making, Arjan van de Ven's O(1) VM code. [without doubt i do appear to have a sweet spot for O(1) code :-) ] Rik van Riel has merged Arjan's code a couple of days ago. The code converts every important VM algorithm (laundering, aging) to a O(1) algorithm while still keeping the fundamentals - this is quite nontrivial for things like page aging. It's in essence the VM overhead reduction work that Andrea Arcangeli has started in 2.4.10, brought to the extreme. I have run Arjan's O(1) VM under high memory pressure, and it's really impressive - kswapd (the central VM housekeeping kernel thread), which used to eat up lots of CPU time under VM load, has almost vanished from the CPU usage chart.

I do have the impression that the Linux VM is close to a conceptual breakthrough - with all the dots connected we now have something that is the next level of quality. The 2.5 VM has merged all the seemingly conflicting VM branches that fought it out in 2.4, and the many complex subsystems involved suddenly started playing in concert and produce something really nice.

JA: A much earlier version of the rmap code was originally in the 2.4 kernel, but got ripped out. Do you feel it has improved enough that this won't happen again?

Ingo Molnar: this most definitely wont happen. We already rely on rmap for some other features, so it's not just a matter of undoing one patch. Rmap is essential to the new VM, without rmap the VM would be like a ferrari with an old diesel motor - looks good but is pretty unusable.

the problem of rmap in 2.4 was simply its complexity, relative youth as a project and the relative low number of people that tested it. So in 2.4 it would have been quite a stretch to keep it in. But it was a fair game for 2.5, and with Andrew's simplification/robustization/speedup of Rik's rmap code it was very manageable.

JA: What other major improvements have gone into 2.5, beyond the scheduler and VM rewrites?

Ingo Molnar: the block IO rewrite, lots of VFS changes, a rework of the module code and (plug) the new threading implementation. The block IO rewrite was long overdue and that's the one i'm most happy about.

JA: Do you feel the changes are significant enough to call the next major kernel 3.0 instead of 2.6?

Ingo Molnar: well, i do think they are significant enough to be called 3.0 - on the other hand it might not matter much whether it's called 2.6 or 3.0, after all what ordinarily people know about is this new shiny Linux 9.0 release, right? ;)

JA: Looking into the future, what do you see in store for the next development kernel, version 2.7?

Ingo Molnar: no idea, really, i dont think trying to look into the future brings many fruits, the kernel needs to handle what is available here and today. Sometimes we are lucky and create stuff that happens to work for years :-) Perhaps something like OpenMosix would be nice to have in the kernel. Plus even better (native) support for User Mode Linux. Things like this.

JA: Do you have any advice to offer those aspiring to become productive kernel developers?

Ingo Molnar: only the old mantra: to read the source and the mailing lists. And take it easy - do what you like doing most.

[PATCH] fix mapping_writably_mapped()

http://lkml.org/lkml/2008/12/10/344

This problem was reported by Lee Schermerhorn.
i_mmap_writable, which counts shared vmas, was not being counted correctly on fork. It can even go negative.

The cause was in __vma_link_file, which Hugh had patched in 2.6.7.
The problem occurred because __vma_link_file did not do the counting.

How did this go unnoticed until now? The problem is that there was nothing like an assert.
The system simply kept running even when the count went negative.
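
To illustrate the kind of check that was missing, here is a small sketch of a shared-mapping counter that refuses to go below zero. The names `toy_mapping`, `toy_link_vma` and `toy_unlink_vma` are hypothetical; in kernel code, the spot that returns false here is where something like a BUG_ON() or WARN_ON() would sit.

```c
#include <stdbool.h>

/* Stand-in for the one field of struct address_space that matters here. */
struct toy_mapping {
    int i_mmap_writable;   /* number of shared (VM_SHARED) mappings */
};

/* Link a vma to the file mapping; only shared vmas are counted. */
void toy_link_vma(struct toy_mapping *m, bool shared)
{
    if (shared)
        m->i_mmap_writable++;
}

/*
 * Unlink a vma. Returns false when the counter would go negative,
 * which means some earlier link (for example at fork) was never counted.
 */
bool toy_unlink_vma(struct toy_mapping *m, bool shared)
{
    if (!shared)
        return true;
    if (m->i_mmap_writable <= 0)
        return false;
    m->i_mmap_writable--;
    return true;
}
```

Replaying the reported sequence (map shared, fork without counting, unmap in the parent, unmap in the child) makes the last `toy_unlink_vma()` call fail instead of silently leaving -1 behind.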

Below is how Lee tested it.

root@dropout(root):memtoy
memtoy pid: 3301
memtoy>file /tmp/zf1
memtoy>map zf1 shared

console:__vma_link_file: vma: ffff8803fdc090b8 - mapping->i_mmap_writable: 0 -> 1

memtoy>child c1
memtoy: child c1 - pid 3302

me: I would have expected to see i_mmap_writable incremented again here, but
me: saw no console output from my instrumentation.

memtoy>unmap zf1 # unmap in parent

console:__remove_shared_vm_struct: vma: ffff8803fdc090b8 - mapping->i_mmap_writable: 1 -> 0
console:__remove_shared_vm_struct: vma: ffff8803fdc090b8 - removed last shared mapping

memtoy>/c1 show
_____address______ ____length____ ____offset____ prot share name
f 0x00007f000ae68000 0x000001000000 0x000000000000 rw- shared /tmp/zf1

me: child still has zf1 mapped

memtoy>/c1 unmap zf1 # unmap in child

console:__remove_shared_vm_struct: vma: ffff8803fe5d3170 - mapping->i_mmap_writable: 0 -> -1

--------

So, the file's i_mmap_writable goes negative. Is this expected?

If I remap the file, whether or not I restart memtoy, I see that it's
i_mmap_writable has remained negative:

-------
memtoy>map zf1 # map private [!shared] - no change in i_mmap_writable

console:__vma_link_file: vma: ffff8805fd0590b8 - mapping->i_mmap_writable: -1 -> -1

memtoy>unmap zf1 # unmap: no change in i_mmap_writable

console:__remove_shared_vm_struct: vma: ffff8805fd0590b8 - mapping->i_mmap_writable: -1 -> -1

memtoy>map zf1 shared # mmap shared, again

console:__vma_link_file: vma: ffff8805fd0590b8 - mapping->i_mmap_writable: -1 -> 0

Corruption with O_DIRECT and unaligned user buffers Dec 4, 2008

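This is slightly old news, but..
There is currently a problem with using O_DIRECT from multiple threads.
More precisely, the problem arises when multiple threads use user buffers that are not page-aligned
and this gets tangled up with fork.

For example, suppose there are two or more threads: the first reads 4096 bytes of data into an unaligned buffer starting at offset 512, the second reads the next 4096 bytes, the next thread the 4096 bytes after that, and so on in order.

All the buffers are shifted by 512 bytes.

Consider the following sequence.

1. Thread 1 calls get_user_pages and issues its I/O.
2. A fork occurs and marks the page copy-on-write.
3. Thread 2 calls get_user_pages and issues its I/O. As a result, this mapping gets a copy of the physical page.
4. The I/O issued by Thread 2 completes and the new data is copied into the new physical page obtained in step 3.
5. The I/O issued by Thread 1 completes and its data is copied into the old physical page.

So in the end the first 512 bytes of the page hold old data, and the new data is lost. -_-;
To be safe, it is best to choose the user buffer address and size so that the buffer does not cross a page boundary.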
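
A caller can check its buffers before issuing O_DIRECT I/O. The helpers below are a minimal sketch with hypothetical names (`toy_is_page_aligned`, `toy_buffers_share_page`); they assume non-zero lengths and take the page size as a parameter instead of querying the system.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the buffer starts on a page boundary and covers whole pages only. */
bool toy_is_page_aligned(uintptr_t addr, size_t len, size_t page_size)
{
    return addr % page_size == 0 && len % page_size == 0;
}

/* True if two non-empty buffers touch at least one common page. */
bool toy_buffers_share_page(uintptr_t a, size_t a_len,
                            uintptr_t b, size_t b_len, size_t page_size)
{
    uintptr_t a_first = a / page_size;
    uintptr_t a_last = (a + a_len - 1) / page_size;
    uintptr_t b_first = b / page_size;
    uintptr_t b_last = (b + b_len - 1) / page_size;

    return a_first <= b_last && b_first <= a_last;
}
```

With 4096-byte pages, the buffers of the example above (offset 512 and offset 4608, 4096 bytes each) share the second page, which is exactly the page that gets split by the COW copy. Buffers obtained from posix_memalign() with page-size alignment and page-multiple lengths do not share pages.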

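What we call "us" Dec 3, 2008

I can't quite remember when it started.
When I realized that the thing I used to do best had become the thing I do worst.

Thinking it over now, it probably began around then.
But what is worse is that where I am heading is not where I want to be.

I could simply stop walking, and partly I just don't care to turn back,
but it is also out of fear of the things I came to know long ago.

Whatever the time, whatever the place, with the feeling of a walk along a cliff...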

vmscan: bail out of page reclaim after swap_cluster_max pages

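This is a patch from Rik, posted on 11/14.

During the first few priority levels the VM sometimes only rotates inactive pages back in order to promote them, or submits I/O to sync dirty pages. This can be seen in do_try_to_free_pages. Then, by the time it gets to a lower priority, it ends up reclaiming far too many pages.
So the idea is to bail out of reclaim once the required number of pages has been reclaimed.
Many people besides me had already had this idea. The problem is the balancing of the second-chance algorithm. Linux's page reclaim policy uses an LRU approximation,
so page references have to be checked effectively to keep the reference information accurate.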
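
The over-reclaim and the proposed bail-out can be illustrated with a toy model. Everything here is an assumption made for illustration: the `toy_*` names, the rule that priorities above `FIRST_USEFUL_PRIORITY` reclaim nothing, and the rule that every scanned page below it is reclaimed. Only the overall loop shape (priority loop, per-zone scan of `lru_pages >> priority`, batches of SWAP_CLUSTER_MAX) follows the kernel.

```c
#include <stddef.h>

#define DEF_PRIORITY          12
#define SWAP_CLUSTER_MAX      32UL
#define FIRST_USEFUL_PRIORITY 9    /* assumption: earlier passes only rotate pages */

struct toy_zone {
    unsigned long lru_pages;
};

/* Scan one zone in batches; optionally stop as soon as the goal is met. */
static unsigned long toy_shrink_zone(const struct toy_zone *zone, int priority,
                                     unsigned long already, unsigned long goal,
                                     int bail_out)
{
    unsigned long nr_to_scan = zone->lru_pages >> priority;
    unsigned long reclaimed = 0;

    while (nr_to_scan > 0) {
        unsigned long batch =
            nr_to_scan < SWAP_CLUSTER_MAX ? nr_to_scan : SWAP_CLUSTER_MAX;

        nr_to_scan -= batch;
        if (priority <= FIRST_USEFUL_PRIORITY)
            reclaimed += batch;            /* toy rule: scanned == reclaimed */
        if (bail_out && already + reclaimed >= goal)
            break;                         /* the proposed early exit */
    }
    return reclaimed;
}

/* Walk priorities from DEF_PRIORITY down to 0 over all zones. */
unsigned long toy_try_to_free_pages(const struct toy_zone *zones, size_t nr_zones,
                                    unsigned long goal, int bail_out)
{
    unsigned long reclaimed = 0;

    for (int priority = DEF_PRIORITY; priority >= 0; priority--) {
        for (size_t i = 0; i < nr_zones; i++) {
            reclaimed += toy_shrink_zone(&zones[i], priority,
                                         reclaimed, goal, bail_out);
            if (bail_out && reclaimed >= goal)
                return reclaimed;
        }
        if (reclaimed >= goal)             /* coarser check after a full pass */
            return reclaimed;
    }
    return reclaimed;
}
```

With two zones of 2^20 pages each and a goal of 32 pages, the model frees 4096 pages when it only checks after a full pass, and 32 pages with the early exit. The price, as the following discussion shows, is that the second zone is never scanned.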

But a patch like this causes an imbalance in reference checking between zones.
Andrew actually tried this in the past and ran into problems, so that patch was eventually reverted. Mel Gorman also commented with huge pages in mind. Mel's concern is that the patch could affect lumpy reclaim and therefore the reclaim of high-order blocks. Lumpy reclaim expects to reclaim at least as many pages as the base size of the high-order block, and Rik's patch makes it more likely that this will not happen. According to Mel's test results, on every test machine a one-shot attempt to resize the hugepage pool had a much lower success rate. That was the expected result. But multiple attempts did succeed in the end, and aggressive resizing of the hugepage pool showed a higher success rate on all machines but one. Mel asked a few questions about Rik's patch.

Is the existing bail-out routine too late? If so, should it be removed? That routine is also in do_try_to_free_pages.
Reclaiming less ultimately means that page aging information gets older.

Skip freeing memory from zones with lots free Dec 1, 2008

Rik submitted a new patch, but Andrew is not very happy about it.
The gist of the patch is that when free memory is hard to find in one memory zone, excessive freeing of memory from the other zones should be avoided, because pageout I/O from the other zones slows down the freeing of pages in the zone that has the problem.

This is similar to what kswapd's balance_pgdat already does.

    note_zone_scanning_priority(zone, priority);
    /*
     * We put equal pressure on every zone, unless one
     * zone has way too many pages free already.
     */
    if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
                           end_zone, 0))
        nr_reclaimed += shrink_zone(priority, zone, &sc);
    reclaim_state->reclaimed_slab = 0;
    nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
                          lru_pages);
    nr_reclaimed += reclaim_state->reclaimed_slab;
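
Taken in isolation, the "skip zones that already have plenty free" rule can be sketched like this. The watermark test is a heavily simplified, hypothetical stand-in for zone_watermark_ok() (it ignores the allocation order, lowmem reserves and allocation flags), and the multiplier is a parameter because the exact factor used by Rik's patch is not given here.

```c
#include <stddef.h>

struct toy_zone {
    unsigned long free_pages;
    unsigned long pages_high;
};

/* Simplified stand-in for zone_watermark_ok(): free pages above the mark? */
static int toy_watermark_ok(const struct toy_zone *z, unsigned long mark)
{
    return z->free_pages > mark;
}

/* Count how many zones would still be shrunk when well-stocked zones are skipped. */
size_t toy_count_zones_to_shrink(const struct toy_zone *zones, size_t nr,
                                 unsigned long factor)
{
    size_t n = 0;

    for (size_t i = 0; i < nr; i++) {
        if (toy_watermark_ok(&zones[i], factor * zones[i].pages_high))
            continue;   /* way too many pages free already: skip this zone */
        n++;
    }
    return n;
}
```

With a factor of 8 and pages_high of 50, a zone with 100 free pages is still shrunk while a zone with 1000 free pages is skipped.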

Peter Zijlstra and Johannes Weiner have acked the patch,
but the one holding it back at the last moment is, as expected, Andrew.

Andrew had already made a similar attempt back in 2002.

commit 26e4931632352e3c95a61edac22d12ebb72038fe
Author: akpm
Date: Sun Sep 8 19:21:55 2002 +0000

[PATCH] refill the inactive list more quickly

Fix a problem noticed by Ed Tomlinson: under shifting workloads the
shrink_zone() logic will refill the inactive load too slowly.

Bale out of the zone scan when we've reclaimed enough pages. Fixes a
rarely-occurring problem wherein refill_inactive_zone() ends up
shuffling 100,000 pages and generally goes silly.

This needs to be revisited - we should go on and rebalance the lower
zones even if we reclaimed enough pages from highmem.

Then it was reverted a year or two later:

commit 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3
Author: akpm
Date: Fri Mar 12 16:23:50 2004 +0000

[PATCH] vmscan: zone balancing fix

We currently have a problem with the balancing of reclaim between zones: much
more reclaim happens against highmem than against lowmem.

This patch partially fixes this by changing the direct reclaim path so it
does not bale out of the zone walk after having reclaimed sufficient pages
from highmem: go on to reclaim from lowmem regardless of how many pages we
reclaimed from lowmem.

As the text above shows, that patch was reverted, because highmem came to be scanned much more than lowmem and page reclaim ended up scanning the zones unevenly.

In response, Rik pointed out that balance_pgdat already does something similar and has worked without side effects so far, and stressed that, unlike Andrew's patch, this is not a bail-out but a matter of "skipping zones that already have plenty of free pages".

Andrew replied that Rik's patch can nevertheless affect direct reclaim as well as kswapd, and that bailing out and skipping would have similar effects. But this time Andrew is wrong. kswapd calls shrink_zone directly and does not go through shrink_zones, so Rik's patch does not affect kswapd.

Either way, it is true that Rik's patch affects the zone scanning ratio.
Rik counters that because allocation pressure differs from zone to zone, unequal reclaim pressure is sometimes desirable. He has a point.
When allocation demand on lowmem is heavy, swapping out highmem pages is not desirable. Nor is it desirable for an application pinned with numactl to cause pages on other NUMA nodes to be swapped out.

Besides, balance_pgdat already contains code that creates this kind of scanning imbalance.

    if (!zone_watermark_ok(zone, order, zone->pages_high,
                           0, 0)) {
        end_zone = i;
        break;
    }

But the two have not reached agreement yet.
Unless Rik can convince Andrew, this patch will not go in.

Note:

Direct reclaim is not entered unless the free pages of every zone in the zonelist have dropped to zone->pages_low or below.
old kernel git tree: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/old-2.6-bkcvs.git

Lockless page cache Jun 22, 2008

speculative page references, lockless pagecache, lockless gup

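A lot changed as the Linux kernel moved to 2.6, but most people do not point to anything in particular beyond the O(1) scheduler. Personally I think there are many things that matter more, and one of them is always RCU.

As new locking mechanisms such as RCU were introduced into the Linux kernel (in fact they have been around since the 2.4 days), many parts of the kernel have used RCU to improve scalability and performance at the same time. (I would like to cover RCU some time, but it is hard to do in writing. Not long ago I explained RCU in great detail to a junior colleague using the articles LWN ran on it, and he still found it hard to follow, so I am even less confident that I could explain it well in writing. -_-)

Understanding the patches above, however, requires a fundamental understanding of RCU. (Then again, you do not need to understand it exactly, since the patches both use RCU and do not use it. That sounds vague, but if you read Nick Piggin's paper you will see why RCU was not used (of course, without understanding RCU you will not see why, however often you read the paper) and also why it was used.)

The point of the patches is to get rid of the read lock in page cache lookup. Page cache lookup is one of the most frequent operations in the Linux kernel, so removing this lock would be a big improvement in overall performance. More importantly, the rwlock and spinlock currently used are a major overhead for scalability across CPUs, because of cache snooping and cache line bouncing. So removing the lock can improve performance considerably.

So let's think it through. To remove the page cache lock, the lock on the radix tree used by the page cache has to go first. To that end Nick first wrote the lockless radix tree and got it into mainline. Since it was written for the lockless pagecache, it had no users for quite a while after it was merged. The lockless radix tree has somewhat looser semantics than the previous interface. Dealing with this looseness required a new algorithm for the page cache, and that is the speculative page reference.

The speculative page reference is almost the same as what Nick first presented in his paper, with some changes. To block new page references, it was decided to use the page_freeze_refs interface rather than a new page flag. This exploits a property of the interface (get_page_unless_zero) that the algorithm already uses to solve the race with page freeing. So there is no need to define a new page flag.

What we need to watch out for from now on because of this patch is, first, that no assumptions may be made about the reference count of a newly allocated page. The page may previously have been in the page cache, then reclaimed and reallocated, and such a page may still carry a speculative page reference. Second, the page has to be frozen before the radix tree is updated, to solve the race with lookup. Readers such as lookup no longer rely on the rwlock, so races between readers and writers can occur.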
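
The interplay between get_page_unless_zero and page_freeze_refs can be sketched in user space with C11 atomics. The `toy_*` names are hypothetical and this is only a model of the idea: a reader takes a reference only if the count is non-zero, and a writer "freezes" the page by swapping the expected count for zero, which makes every concurrent speculative reference attempt fail.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_page {
    atomic_int refcount;
};

/* Reader side: take a reference unless the count is zero (freed or frozen). */
bool toy_get_page_unless_zero(struct toy_page *page)
{
    int old = atomic_load(&page->refcount);

    while (old != 0) {
        /* On failure the CAS reloads 'old' and the loop re-checks for zero. */
        if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Writer side: freeze only if exactly 'expected' references exist. */
bool toy_page_freeze_refs(struct toy_page *page, int expected)
{
    return atomic_compare_exchange_strong(&page->refcount, &expected, 0);
}

/* Writer side: make the page visible to readers again. */
void toy_page_unfreeze_refs(struct toy_page *page, int count)
{
    atomic_store(&page->refcount, count);
}
```

A freeze fails when a reader holds an extra speculative reference, so the writer knows it must back off. While the page is frozen, readers see a zero count and retry their lookup, so no separate "no new references" page flag is needed.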

Ramdisk vs Ramfs May 30, 2008

http://www.linuxdevices.com/articles/AT4017834659.html

A ramdisk (like initrd) is a ram based block device, which means it's a fixed size chunk of memory that can be formatted and mounted like a disk. This means the contents of the ramdisk have to be formatted and prepared with special tools (such as mke2fs and losetup), and like all block devices it requires a filesystem driver to interpret the data at runtime. This also imposes an artificial size limit that either wastes space (if the ramdisk isn't full, the extra memory it takes up still can't be used for anything else) or limits capacity (if the ramdisk fills up but other memory is still free, you can't expand it without reformatting it).

But ramdisks actually waste even more memory due to caching. Linux is designed to cache all files and directory entries read from or written to block devices, so Linux copies data to and from the ramdisk into the "page cache" (for file data), and the "dentry cache" (for directory entries). The downside of the ramdisk pretending to be a block device is it gets treated like a block device.

A few years ago, Linus Torvalds had a neat idea: what if Linux's cache could be mounted like a filesystem? Just keep the files in cache and never get rid of them until they're deleted or the system reboots? Linus wrote a tiny wrapper around the cache called "ramfs", and other kernel developers created an improved version called "tmpfs" (which can write the data to swap space, and limit the size of a given mount point so it fills up before consuming all available memory). Initramfs is an instance of tmpfs.

These ram based filesystems automatically grow or shrink to fit the size of the data they contain. Adding files to a ramfs (or extending existing files) automatically allocates more memory, and deleting or truncating files frees that memory. There's no duplication between block device and cache, because there's no block device. The copy in the cache is the only copy of the data. Best of all, this isn't new code but a new application for the existing Linux caching code, which means it adds almost no size, is very simple, and is based on extremely well tested infrastructure.

A system using initramfs as its root filesystem doesn't even need a single filesystem driver built into the kernel, because there are no block devices to interpret as filesystems. Just files living in memory.

The big kernel lock strikes again May 26, 2008

http://lwn.net/Articles/281938/

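Yanmin Zhang recently reported that system performance had dropped by about 40% as of the 2.6.26-rc1 kernel. He tested with AIM, a benchmark that spawns many tasks, each of which exercises various kernel subsystems.

The culprit he found was the generic semaphore. It did not take long to narrow the problem down to the BKL, which had been converted to a semaphore a few years earlier.

Ingo Molnar proposed fixing it by swapping in new semaphore code.
The problem is that the existing semaphore implementation is too fair, and that fairness is quite expensive. The thread that is handed the semaphore may be sitting on another processor's run queue, and since it has not run for a long time it is probably quite cache cold. It may even have a very low priority and not get to run for a long while. In the meantime, the threads further back in the wait queue (that is, the ones that asked later and so, in fairness order, hang at the tail of q->list) cannot run either.

The result is a considerable increase in dead time, during which none of the threads waiting on the semaphore runs at all.

The fix borrows from the old semaphore implementation, which means handling waiters unfairly.
They race for the semaphore, and there is no telling who will get it.

Interestingly, this patch was merged into mainline before 2.6.26-rc2 and then reverted, because it broke semaphores in some situations: in some cases the semaphore failed to wake a thread up.

Linus solved the problem with a different patch, one that turns the BKL back into a spinlock. The slow but fair generic semaphore was left as it was. This patch lengthens non-preemptible sections, though, so the rt guys did not welcome it at all.

Linus's reasoning was that the BKL and semaphores are rather old mechanisms that should be removed or minimized. From that point of view he judged that the effort of bringing back the more complex semaphore code was not worth it.

So this outcome may once again encourage the people who want to remove the BKL from the kernel.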

Generic Semaphore

http://lwn.net/Articles/273731/

Until now, semaphores in the kernel were heavily optimized and implemented separately for each arch. Matthew removed the arch-dependent semaphores and replaced them with a single semaphore implementation.

The implementation is quite intuitive and simple.

    struct semaphore {
        spinlock_t        lock;
        int               count;
        struct list_head  wait_list;
    };

People may then wonder: why wasn't it done this way in the first place?
The answer is that before 2.6.16 the semaphore was one of the kernel's main mutual exclusion mechanisms, which made it a very performance-critical primitive. Since then most users have been converted to mutexes, so semaphores carry much less weight in the kernel than they used to.

The other question is why so many semaphores are still in use.

It is true that many semaphores have been converted to mutexes since 2.6.16, but plenty remain. Converting them takes a lot of auditing. Once enough auditing has been done, the semaphores that do not need the counting feature will keep being converted to mutexes, and only the ones that really need it will remain semaphores.

Toward better direct I/O scalability Apr 23, 2008

http://lwn.net/Articles/275808/

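This is a topic that has been under ongoing discussion with Nick on linux-mm lately, and lwn has a nice summary of it.

The gist:

It starts from the observation that while Linux keeps scaling better on CPU-intensive work, the same cannot be said for database-heavy workloads.

Apps like databases use direct I/O to get around Linux's page cache policy. The kernel then has to write the data it reads straight into the user's memory, and to do that it first has to pin the page frames behind the process's page tables. The get_user_pages API has been used for that.

But using this function means taking a semaphore in the mm, and once two or more tasks sharing that mm start contending for it, scalability drops sharply.

So Nick thought it over, found a few points where locking could be reduced, and by storing the information he needs in the low 12 bits of the page table entry and working from that, he reportedly got an overall performance improvement of 10%.

Anyway, working on the server side you run into problems with things like databases. It may look trivial (cutting down on locks), and it may be a small patch that does not even help in some situations, but it shows the mark of a guru who improves performance bit by bit.

P.S.)

From the patch mail alone I honestly had no idea what it was about, but after reading lwn it all clicked. Jonathan Corbet really is the father of summaries~~

His explanation of ticket spinlocks (http://lwn.net/Articles/267968/) a while back was better than that of the original author, Nick, and Nick even praised him for it..

He shows how a summary should be done.. heh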

Adding a system call to the arm linux kernel Mar 18, 2008

The skeleton is basically the same as on x86; only a few macros differ. The important part comes last..

First, modify the following three files in the kernel to add the new system call.

1. arch/arm/kernel/calls.S
2. include/asm-arm/unistd.h
3. Wherever the system call is implemented (a separate file or any existing source file will do, as long as the makefile is updated to match)

Finally, write an application that uses the system call.

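#include "/home/barrios/work/xxx/linux/include/asm-arm/unistd.h"
#include <errno.h>    /* assumed: the header name was lost; the _syscallN macros set errno */

/* Generate user-space stubs for the two new one-argument system calls. */
_syscall1(int,barrios_write,int,arg);
_syscall1(int,barrios_t,int,arg);

int main()
{
    int i;

    i = barrios_write(2);
    i = barrios_t(2);

    return 0;
}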

Now, compile it.

arm_unknown-linux-gnu-gcc -o barrios_syscall barrios_syscall.c

Do you get errors??
barrios_syscall.c:5: error: expected declaration specifiers or ‘...’ before ‘barrios_write’
barrios_syscall.c:5: error: expected declaration specifiers or ‘...’ before ‘arg’
barrios_syscall.c:5: warning: data definition has no type or storage class
barrios_syscall.c:6: error: expected declaration specifiers or ‘...’ before ‘barrios_t’
barrios_syscall.c:6: error: expected declaration specifiers or ‘...’ before ‘arg’
barrios_syscall.c:6: warning: data definition has no type or storage class

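Then build it like this... (The _syscallN macros in that unistd.h appear to be defined only when __KERNEL__ is set, which would explain why adding the define makes the errors go away.)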

arm_unknown-linux-gnu-gcc -o barrios_syscall barrios_syscall.c -D__KERNEL__

On cache line bouncing Mar 10, 2008

Books and articles about systems use a lot of cache-related terms:
cache thrashing, cache line bouncing, cache snooping, cache invalidation, and so on.

Without precise terms, both the speaker and the listener wear themselves out.
The most commonly used term is cache line bouncing.
To make the most of spatial locality, a cache normally pulls memory in a whole cache line at a time. On something like Intel's current Core 2 Duo, that size is 64 bytes.

For example, even if you read a single byte at 0xC0120800, with a 64-byte cache line you end up reading 0xc0120800 through 0xc012083f. On SMP, when several CPUs read that address, each CPU's local cache ends up holding a copy of the same memory.
If one of those CPUs then modifies even a single byte in the range 0xc0120800~0xc012083f, the MESI protocol invalidates that cache line in the other CPUs. So when another CPU reads a byte in that range, it takes a cache miss and has to read the data from memory again. Hardware guarantees all of this, and we call it cache coherency.

Cache line bouncing is the situation just described: one CPU modifies memory that is shared in the local caches of several CPUs, so the other CPUs' local caches have to be updated as well. Some people say false sharing is cache line bouncing, but it is not. False sharing is just one of the things that cause cache line bouncing. Plenty has been written about false sharing, so I will not explain it here. One thing I will add: for data structures used in the Linux kernel's hot paths, the data and the lock protecting it are placed on different cache lines as far as possible, to prevent false sharing.

What causes the most cache line bouncing is the lock on heavily contended data. The test_and_set op in spin_lock modifies the cache line containing the lock's address in the cache of the CPU that issued the op, and the other processors watch whether their own caches hold the line for the address that was just modified. That is cache snooping. It is through cache snooping that the line in the other CPUs' local caches moves to the invalid state.

In general, a very high rate of cache line bouncing is called cache thrashing. Cache thrashing can seriously hurt performance on multiprocessor designs such as SMP and CMP, and to avoid the spin_locks that are the main culprit, the Linux kernel is gradually moving to lock-free algorithms such as RCU.
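
To make the address arithmetic concrete, here is a small userspace sketch. The 64-byte line size is the Core 2 Duo figure from above, and the helper names are my own for illustration, not kernel APIs.

    ```c
    #include <stdint.h>

    #define CACHE_LINE_SIZE 64u  /* assumed line size; must be a power of two */

    /* Address of the first byte of the cache line that contains addr. */
    static inline uint32_t cache_line_base(uint32_t addr)
    {
        return addr & ~(uint32_t)(CACHE_LINE_SIZE - 1);
    }

    /* Address of the last byte of that same cache line. */
    static inline uint32_t cache_line_last(uint32_t addr)
    {
        return cache_line_base(addr) + (CACHE_LINE_SIZE - 1);
    }

    /*
     * Nonzero if a and b live on the same line. A write to one of them then
     * invalidates cached copies of the other on every other CPU, which is
     * exactly the precondition for false sharing.
     */
    static inline int same_cache_line(uint32_t a, uint32_t b)
    {
        return cache_line_base(a) == cache_line_base(b);
    }
    ```

With these, 0xC0120800 and 0xC012083F fall on the same line while 0xC0120840 starts the next one, which is why a lock and the data it protects are best kept at least one line apart.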

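On kernel synchronization Mar 9, 2008

The kernel has quite a few synchronization functions.
Which of them to use at which point is still confusing for anyone who is not a seasoned developer,
and getting it wrong produces bugs that are very hard to track down.

http://www.kernel.org/pub/linux/kernel/people/rusty/kernel-locking/c214.html#MINIMUM-LOCK-REQIREMENTS

The URL above is a document written by Rusty Russell, and you can also find it under /Documentation/ in your own kernel tree.
It covers kernel locking with examples and is very well organized.
But without a basic grasp of the synchronization facilities, the table above will not make much sense.

Any developer with an interest in the kernel will already understand softirqs and tasklets.
Plenty of Linux kernel books cover them, so I will not explain them separately.

There are 3 important things we need to know to understand the table above.

1. Other interrupt handlers are not automatically blocked while an interrupt handler is running.
(Many developers get this wrong. True, if you register a handler with the IRQF_DISABLED flag, the kernel disables interrupts entirely while running your ISR, so that no interrupt is delivered on that CPU. This is something to avoid as much as possible, but it is available. The point is that it is not the kernel's default behavior.)

2. Disabling interrupts on one CPU does not disable them on the other CPUs. In other words, disabling interrupts on CPU A to keep interrupt handler A from running does no good, because interrupt handler A may run on CPU B. Note that the kernel does guarantee that a single interrupt handler never runs on several CPUs at the same time.

3. A softirq does not preempt another softirq. The same holds for tasklets, which are built on softirqs.

4. One bonus item. What is user context? Books and other material on the kernel occasionally talk about Interrupt Context and User Context (Process Context). User context does not mean a process running in user space, not at all. Use current to tell the two apart. (If you do not know what current is, I suggest not reading this post yet.) Ask whether current is the subject of what the kernel is doing right now. Suppose an interrupt fires and the kernel is busy handling it. At that moment there is no telling who current is, because an interrupt is an asynchronous event. Even if process B, or some kernel subsystem, caused the interrupt, it may well arrive and be handled while process C is running. That is Interrupt Context.
User Context, by contrast, is when the kernel is running on behalf of a process. System calls are the usual case (exceptions such as page faults can also be seen as user context, because the kernel accesses current there). The process may be a user-mode process or a kernel thread. Anyway, the point is that current refers to the process that triggered the work. I hope that makes sense.

Only with these points in mind can you reason your way through the table above. (Just as you have to memorize the basic formulas when learning math, treat these as the basics you simply have to know.)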

Once you have memorized the 3 points above, working through the table takes just two questions.

1. Can A be preempted by B?
=> If yes, you should use "[irq/bh] disable"
2. Can A's critical section be accessed by another CPU?
=> If yes, you should use "spin_lock".

Now let's apply the rules. (When solving, always take the weaker(?) one as A, and always assume your system is SMP. The SMP assumption matters. The Linux kernel is a general-purpose OS. There is no guarantee that your code will only ever run on a NON-PREEMPTIBLE UP kernel, so always consider SMP.)

Example 1) User Context A, Tasklet B

Rule 1) yes
Rule 2) yes

Naturally. User context can be preempted by an interrupt.
When the interrupt finishes, softirqs run and eventually the tasklet runs. The tasklet therefore runs in interrupt context,
and from user context A's point of view this is a kind of preemption by interrupt. So A can be preempted by B,
and the answer to rule 1 is yes. That means we must disable bh. (There is no need to go as far as disabling irqs here; I will not go into why.)

The answer to rule 2 is also obvious.
On SMP, A's code can be reached by another processor at any time. So a spin lock is needed.

Putting it together, we have to use spin_lock_bh.

Example 2) Tasklet A, Tasklet B

Rule 1) No
Rule 2) yes

Why rule 2 is yes takes only a little thought. A tasklet is run by a softirq. A softirq is run from an interrupt. An interrupt can be handled by any CPU. So while CPU A is running tasklet A, CPU B can run tasklet B. If the two tasklets share a critical section, it has to be protected. If they share no variables, no lock is needed at all, so the answer could also be NO.

So why is rule 1 NO? As already stated, a softirq is not preempted by another softirq.
Putting it together, spin_lock alone is enough.

Example 3) softirq A, Interrupt B

Rule 1) yes
Rule 2) yes

You may wonder why the answer to rule 2 is yes here. That part is left to you.

Example 4) Interrupt A, Interrupt B

Rule 1) yes
Rule 2) yes

This one needs care, though: here you must use spin_lock_irqsave.
In the previous 3 examples the kernel's design already guarantees that irqs are enabled, so whenever irqs had to be disabled we could use spin_lock_irq without a second thought. That is, the kernel is designed so that irqs are always enabled in softirq, tasklet, and user context, so spin_lock_irq was safe there.

An interrupt handler, however, may be entered with irqs either enabled or disabled.
(This depends partly on parameters such as SA_INTERRUPT given when the handler is registered, and partly on the fact that kernel interrupt handlers can nest. Before interrupt handler A runs, there is no knowing which handlers did what, and in which order.) So we cannot use spin_lock_irq blindly; instead we must use spin_lock_irqsave, which remembers the interrupt on/off state.
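
The two questions and four examples can be boiled down to a tiny decision function. This is only a userspace sketch of the reasoning in this post, assuming SMP throughout so that rule 2 is always "yes"; the enum and function names are made up for illustration, and only the returned names correspond to real kernel primitives.

    ```c
    /* What can preempt context A while it is inside the critical section. */
    enum preemptor {
        PREEMPT_NONE,  /* nothing that touches the shared data */
        PREEMPT_BH,    /* a softirq or tasklet */
        PREEMPT_IRQ    /* a hardware interrupt handler */
    };

    enum lock_kind { SPIN_LOCK, SPIN_LOCK_BH, SPIN_LOCK_IRQ, SPIN_LOCK_IRQSAVE };

    /*
     * Rule 2 is taken as "yes" (SMP assumed), so a spinlock is always involved;
     * rule 1 decides which variant. irqs_known_enabled is nonzero when the
     * kernel guarantees irqs are enabled on entry to A (user context, softirq,
     * tasklet) and zero when A is itself an interrupt handler.
     */
    static enum lock_kind pick_lock(enum preemptor by, int irqs_known_enabled)
    {
        switch (by) {
        case PREEMPT_BH:
            return SPIN_LOCK_BH;
        case PREEMPT_IRQ:
            return irqs_known_enabled ? SPIN_LOCK_IRQ : SPIN_LOCK_IRQSAVE;
        case PREEMPT_NONE:
        default:
            return SPIN_LOCK;
        }
    }

    /* Name of the kernel primitive for a given choice, for printing. */
    static const char *lock_name(enum lock_kind k)
    {
        static const char *const names[] = {
            "spin_lock", "spin_lock_bh", "spin_lock_irq", "spin_lock_irqsave"
        };
        return names[k];
    }
    ```

Example 1 is pick_lock(PREEMPT_BH, 1), example 2 is pick_lock(PREEMPT_NONE, 1), example 3 is pick_lock(PREEMPT_IRQ, 1), and example 4 is pick_lock(PREEMPT_IRQ, 0).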

On wordwrap Mar 5, 2008

Sometimes when I send a patch to the kernel, it fails to apply for other people because of wordwrap.

For what wordwrap is, see:

http://mwultong.blogspot.com/2007/10/vim-vi-word-wrap.html

To prevent this, disable wordwrap in your mail client.
Also watch your editor. I use vim, and wordwrap has to be disabled there as well. One more thing: copying and pasting straight from the vim window can leave the text wrapped, so copy with vim's yy command instead. If you would rather not, copy from something like gedit.

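kernel stack size expansion Mar 3, 2008

Mainline has been steadily shrinking the kernel stack. And at companies doing real product work??
They are probably clamoring for a bigger stack.

The problem is in their own drivers, yet they try to fix it by enlarging the kernel stack.
You run into this now and then(?) during development.

In fact, though, nowhere is it explained why you should not enlarge the kernel stack. Everyone just says not to..

Think about it and you can see that this, too, is closely tied to kernel memory.
Memory used by user processes generally comes in 4K pages.
The 4K page is also the most heavily used page in the Linux kernel.
So the kernel takes care over caching 4K pages, and its allocation strategies put a lot of effort into 4K page allocation. The problem is the kernel stack.

On every architecture I know, except x86 (4K), the kernel stack is 8K. An 8K allocation is no problem while memory is plentiful, but with Linux's page cache policy of using up(?) as much memory as it can, reclaim will start sooner or later.

That is when the trouble begins. The higher the order of the request, the worse the reclaim algorithm does.
Lumpy reclaim has improved things somewhat recently, but it is still not as good as with 4K.

So shrink the kernel stack to 4K if anything; extending it beyond 8K is not a good idea. Unless you want to meet the OOM killer~~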
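
For anyone unsure what "order" means here: the buddy allocator hands out blocks of 2^order physically contiguous pages. A minimal sketch of that mapping, assuming 4K pages; alloc_order is a made-up helper for illustration, not the kernel's own function.

    ```c
    #define PAGE_SIZE_BYTES 4096ul  /* assumed 4K pages */

    /* Smallest order such that (PAGE_SIZE_BYTES << order) >= size. */
    static unsigned int alloc_order(unsigned long size)
    {
        unsigned int order = 0;
        unsigned long span = PAGE_SIZE_BYTES;

        while (span < size) {
            span <<= 1;   /* each order doubles the contiguous block */
            order++;
        }
        return order;
    }
    ```

A 4K stack is order 0, an 8K stack is order 1, and a 16K stack would be order 2, so every doubling asks reclaim to find a bigger run of contiguous free pages.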

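KGDB looks like it will be merged

http://lwn.net/Articles/270089/

KGDB, which Linus disliked so much, looks set to be merged around 2.6.26 thanks to Ingo Molnar's painstaking efforts.
Linus does not actually like developers using debugging tools, because they keep people from thinking. But he seems to have changed his mind a little lately.
In any case, many developers seem pleased.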

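Linux kernel translation project Feb 27, 2008

I have set up a page for the Linux kernel translation project.

I started right after seeing the Japanese and Chinese translated documents begin to be discussed on mainline last July. But between work and my limited translation skills, getting even one document submitted is taking a long time. More importantly, Japan and China already have very active local Linux kernel communities.
I believe Korea has many kernel developers too. But look at mainline and only a handful of them are active; that is the state of Linux kernel development, and of open source, in Korea. Japan, meanwhile, has had more and more people active on mainline recently, and China has been quite visible for a while already. I think the biggest force behind this is each country's local community. Korea has many kernel developers, and I know there are regional communities and student study groups, but there is no Linux kernel community that could represent the country.
I hope the Linux kernel translation project takes off, so that many Korean Linux kernel developers exchange opinions and share new knowledge, and go on to raise Korea's standing through their activity on the open source Linux kernel mainline.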

stack back trace for arm Feb 22, 2008

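This is a short document on arm backtrace that I wrote a while ago.

Let's write the following test program.

#include <execinfo.h>   /* backtrace() */
#include <stdio.h>
#include <stdlib.h>

/* Ask glibc for up to two return addresses from the current call chain. */
void trace (void)
{
    void *array[2]; int size;
    size = backtrace (array, 2);
}

void dummy_function (void)
{
    trace ();
}

int main (void)
{
    dummy_function ();
    return 0;
}

Compile the program above and disassemble it. (It is compiled with static so that I can skip explaining how calls into dynamic libraries work.)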

arm_xxx-gcc -o trace main.c -static
arm_xxx-objdump -D trace > trace.dis

In the trace.dis file you can see the disassembly of the trace function, as follows.

119 00008218 <trace>:
120 8218: e1a0c00d mov ip, sp
121 821c: e92dd800 stmdb sp!, {fp, ip, lr, pc}
122 8220: e24cb004 sub fp, ip, #4 ; 0x4
123 8224: e24dd00c sub sp, sp, #12 ; 0xc
124 8228: e24b3018 sub r3, fp, #24 ; 0x18
125 822c: e1a00003 mov r0, r3
126 8230: e3a01002 mov r1, #2 ; 0x2
127 8234: eb001d28 bl f6dc <__backtrace>
128 8238: e1a03000 mov r3, r0
129 823c: e50b3010 str r3, [fp, #-16]
130 8240: e24bd00c sub sp, fp, #12 ; 0xc
131 8244: e89da800 ldmia sp, {fp, sp, pc}

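The code that sets up the stack frame is the same as in any other function. - 120 ~ 122
12 bytes are allocated for local variables. - 123
r0 is set to the start address of the array. - 124 ~ 125
r1 is set to 2, which completes the arguments before __backtrace is called. - 126
Branch to __backtrace. - 127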

7899 0000f6dc <__backtrace>:
7900 f6dc: e1a0c00d mov ip, sp
7901 f6e0: e92dd810 stmdb sp!, {r4, fp, ip, lr, pc}
7902 f6e4: e24cb004 sub fp, ip, #4 ; 0x4
7903 f6e8: e24dd004 sub sp, sp, #4 ; 0x4
7904 f6ec: e1a0c000 mov ip, r0
7905 f6f0: e3a00000 mov r0, #0 ; 0x0
7906 f6f4: e24b4011 sub r4, fp, #17 ; 0x11
7907 f6f8: e24b200c sub r2, fp, #12 ; 0xc
7908 f6fc: e1500001 cmp r0, r1
7909 f700: a91ba810 ldmgedb fp, {r4, fp, sp, pc}
7910 f704: e59fe030 ldr lr, [pc, #48] ; f73c <.text+0x766c>
7911 f708: e1520004 cmp r2, r4
7912 f70c: 391ba810 ldmccdb fp, {r4, fp, sp, pc}
7913 f710: e59e3000 ldr r3, [lr]
7914 f714: e1520003 cmp r2, r3
7915 f718: 291ba810 ldmcsdb fp, {r4, fp, sp, pc}
7916 f71c: e5923008 ldr r3, [r2, #8]
7917 f720: e78c3100 str r3, [ip, r0, lsl #2]
7918 f724: e2800001 add r0, r0, #1 ; 0x1
7919 f728: e5923000 ldr r3, [r2]
7920 f72c: e243200c sub r2, r3, #12 ; 0xc
7921 f730: e1500001 cmp r0, r1
7922 f734: a91ba810 ldmgedb fp, {r4, fp, sp, pc}
7923 f738: eafffff2 b f708 <__backtrace+0x2c>
7924 f73c: 00069828 andeq r9, r6, r8, lsr #16

Unlike an ordinary function, backtrace also saves the r4 register on the stack. - 7901
4 bytes are allocated for local variables. - 7903
The array address passed in by trace is saved in the ip register. - 7904
r0 is set to 0. It will be used as the loop counter. - 7905
r4 is set to fp - 17. This is one byte below the current stack frame and is used for a comparison. - 7906
r2 is set to the address of the saved fp. - 7907
r0 is compared with r1: if the counter is greater than or equal to the count passed in by trace, clean up and pop out of the function. - 7908 ~ 7909
lr is loaded with the value 0x69828. The pc currently reads as 0xf70c; adding 48 gives 0xf73c, which holds 0x69828. - 7910
r2 is compared with r4, and if r2 is less than r4, pop. - 7911 ~ 7912
The value at the address held in lr is loaded into r3. That address is 0x69828, which carries the label __libc_stack_end. The image holds 0x0 there, but at run time it will be filled in with some value (by whom??), and that value is __libc_stack_end. - 7913
If the saved fp address in r2 is greater than or equal to __libc_stack_end, pop. - 7914 ~ 7915
The saved lr is loaded into r3. - 7916
r3 is stored at the array address + count * 4, i.e. the return address is stored in the caller's array. - 7917
The counter is incremented by 1. - 7918
The value in the saved fp slot is loaded into r3, i.e. r3 now holds the address of the previous stack frame. - 7919
Likewise, 12 is subtracted from r3 so that r2 holds the address of the previous frame's saved fp. - 7920
If the counter is still less than the count passed in, go back to step 9 (line 7911); otherwise pop. - 7921 ~ 7923
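
The loop that the listing implements can be written out in C against a simulated stack. This is only a sketch of the walk described above, not glibc's source: memory is modeled as an array of words, addresses are byte offsets into it, and stack_low and stack_end stand in for the fp - 17 bound and __libc_stack_end. All names are made up for illustration.

    ```c
    #include <stdint.h>

    /* Simulated stack memory: word i lives at byte address 4 * i. */
    static uint32_t load_word(const uint32_t *mem, uint32_t addr)
    {
        return mem[addr / 4];
    }

    /*
     * Walk APCS-style frames, where fp points at the saved pc slot:
     *   [fp - 4]   saved lr (the return address)
     *   [fp - 12]  saved fp (the caller's frame pointer)
     * Stops after max entries, or as soon as the saved-fp slot falls
     * outside [stack_low, stack_end). Returns the number of entries stored.
     */
    static int fp_backtrace(const uint32_t *mem, uint32_t fp,
                            uint32_t stack_low, uint32_t stack_end,
                            uint32_t *out, int max)
    {
        int n = 0;
        uint32_t slot = fp - 12;  /* address of saved fp; r2 in the listing */

        while (n < max) {
            if (slot < stack_low || slot >= stack_end)
                break;                            /* left the stack: stop */
            out[n++] = load_word(mem, slot + 8);  /* saved lr */
            slot = load_word(mem, slot) - 12;     /* caller's saved-fp slot */
        }
        return n;
    }
    ```

A saved fp of 0 ends the walk naturally: 0 - 12 wraps around to a huge unsigned value, which fails the stack_end check, much like the unsigned comparison against __libc_stack_end in the listing.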

nopfn

http://lwn.net/Articles/242625/

populate, nopage, and nopfn were merged into the fault method long ago.
Still, I had long wondered when nopfn was supposed to be used. A good article has a precise explanation, so I am attaching it.

http://lwn.net/Articles/200213/

Meanwhile, one of the longstanding limitations of nopage() is that it can only handle situations where the relevant physical memory has a corresponding struct page. Those structures exist for main memory, but they do not exist when the memory is, for example, on a peripheral device and mapped into a PCI I/O memory region. Some architectures also do very strange things with special memory and multiple views of the same memory. In such cases, drivers must explicitly map the memory into user space with remap_pfn_range() instead of using nopage().

Memory Controller background reclaim Feb 18, 2008

https://lists.linux-foundation.org/pipermail/containers/2007-October/008122.html

This is a patch by YAMAMOTO Takashi of valinux.

You can see that adding a background reclaim policy gave a great performance improvement. The test above was presumably run on SMP, but I would expect some improvement on UP as well.

For more on the Memory Controller, see:
http://lwn.net/Articles/243795/

How much memory are applications really using Feb 17, 2008

http://lwn.net/Articles/230975/

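This is one of the features added in 2.6.25.

On embedded systems it matters to measure exactly how much physical memory an application uses. Memory translates directly into money, and, more to the point for me personally, apps from the RTOS days have been ported as is(?) with no notion of VMAs and pages.

Anyway,
the rss that the Linux kernel currently reports is not very useful, because it does not account for shared pages properly. /proc/pid/smaps gives more information, but it is still not enough.

Matt Mackall implemented pagemap and kpagemap to solve this problem.
They expose the actual physical memory each process is using and, together with PSS and USS, give more accurate information.

PSS (proportional set size) divides each shared page by the number of processes sharing it, so you can tell precisely how much memory is attributable to a process. USS (unique set size), by contrast, is the sum of the pages that are not shared. In addition, a clear_refs file is created for each process so that the reference bits in the page tables can be cleared, which lets you see which pages a process touches while it runs. I have my doubts about this feature, though. Is it really needed? The referenced bit can also be reset by reclaim, kswapd for instance, so at a given moment even a page that was referenced may have its bit cleared again. PSS can now also be seen through smaps, thanks to Fengguang Wu.
http://lkml.org/lkml/2007/8/13/1224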
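
The arithmetic behind PSS and USS is easy to sketch. The following assumes 4K pages and takes, for each resident page of one process, the number of processes mapping it; the function names and the input format are my own for illustration and have nothing to do with the actual pagemap interface.

    ```c
    #include <stddef.h>

    #define PAGE_KB 4u  /* assumed 4K pages */

    /* RSS: every resident page counts in full, shared or not. */
    static unsigned int rss_kb(size_t npages)
    {
        return (unsigned int)npages * PAGE_KB;
    }

    /* PSS: each page is divided by the number of processes sharing it. */
    static double pss_kb(const unsigned int *sharers, size_t npages)
    {
        double kb = 0.0;
        size_t i;

        for (i = 0; i < npages; i++)
            kb += (double)PAGE_KB / sharers[i];  /* sharers[i] >= 1 */
        return kb;
    }

    /* USS: only pages mapped by this process alone. */
    static unsigned int uss_kb(const unsigned int *sharers, size_t npages)
    {
        unsigned int kb = 0;
        size_t i;

        for (i = 0; i < npages; i++)
            if (sharers[i] == 1)
                kb += PAGE_KB;
        return kb;
    }
    ```

For a process with four resident pages shared by 1, 1, 2 and 4 processes, RSS is 16 KB, PSS is 4 + 4 + 2 + 1 = 11 KB, and USS is 8 KB.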

Page reclaim issues on embedded systems

So far, embedded Linux products run only a handful of processes. That is because programs built around tasks on an RTOS were ported in an unnatural(?) way: tasks that ran sharing one address space under the RTOS are ported to Linux using the thread model. It depends on the application, but unless it is something like a PVR, the system will generally hold more anonymous pages than page cache pages.

Yet the Linux kernel's page reclaim policy so far does not seem very effective for !SMP && !SWAP devices. To fix this, there should be separate lru lists per page type. The list with the highest priority would of course be the pages that are in the page cache and not mapped.

First, may_swap in scan_control needs to change. And does pagevec do any good when running on !SMP?

On embedded-like systems that rarely ask for pages above order 0, the main source of memory demand is the application, which means heavy demand for order 0 pages. This is all the more likely when many threads are running.

So performance should improve if the zone's cache of order 0 pages were changed to build up more bulk pages dynamically, depending on the situation.

Finally, we need a profiling tool that records which pages the system uses, when, and how many, and shows it all at a glance as a graph.

PAGE_OWNER Feb 13, 2008

For the original of this patch, see:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc5/2.6.24-rc5-mm1/broken-out/page-owner-tracking-leak-detector.patch

What this patch does is provide an indirect way to spot kernel memory leaks in the system.
I have now ported it to arm.

The mechanism is simple.

Every physical memory allocation in the Linux kernel has to go through the alloc_pages function. So at the point where alloc_pages gets physical memory from the buddy system, a backtrace is taken and the call path into alloc_pages is recorded in the page descriptor.
Later, when the physical memory is freed, the backtrace information is dropped. More precisely, the page is just marked as no longer being tracked.

At any point the user can run

cat /proc/page_owner > page_owner_full.txt

to dump the call paths of all allocated pages in the system. The application that comes with the patch then sorts the call paths by number of allocations, so you can see which functions allocated the most.

For example:

> [sorted_page_owner.txt text/plain (100.2KB)]
> 51957 times:
> Page allocated via order 0, mask 0x80d0
> [0xc015b9aa] __alloc_pages+706
> [0xc015b9f0] __get_free_pages+60
> [0xc011b7c9] pgd_alloc+60
> [0xc0122b9e] mm_init+196
> [0xc0122e06] dup_mm+101
> [0xc0122eda] copy_mm+104
> [0xc0123b8c] copy_process+1149
> [0xc0124229] do_fork+141
>
> 12335 times:
> Page allocated via order 0, mask 0x84d0
> [0xc015b9aa] __alloc_pages+706
> [0xc011b6ca] pte_alloc_one+21
> [0xc01632ac] __pte_alloc+21
> [0xc01634bb] copy_pte_range+67
> [0xc0163827] copy_page_range+284
> [0xc0122a79] dup_mmap+427
> [0xc0122e22] dup_mm+129
> [0xc0122eda] copy_mm+104

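Looking at output like the above, we can see that 51957 order 0 pages of the system's physical memory were allocated through pgd_alloc, and that pgd_alloc was reached from do_fork.

From that we can tell, indirectly, that the free function paired with pgd_alloc is not being called properly on this system.

The arm backtrace procedure is simple.

(In the figure above, sb is currently unused.)

As the figure above shows, the current fp register leads to the frame pointer of the next function, so as long as the stack is not corrupted you simply follow the chain back.

For this to work, however, CONFIG_FRAME_POINTER must be enabled in the Linux kernel.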

symbol_get/symbol_put

The inter_module_xxx functions were removed after 2.6.10, replaced by the symbol_get/symbol_put functions. A useful reference is the following (it covers the problems with those functions):

http://lwn.net/Articles/119013/

These functions are needed for the following reason.
Suppose module A uses module B's function BFunction() in some particular situation.
To use BFunction, module B must declare EXPORT_SYMBOL(BFunction); and module A must declare BFunction as extern before using it.

The trouble comes when module A is loaded while module B is not yet in the kernel: the kernel tries to resolve BFunction in module A's symbol table, hits a linking error, and module A fails to load altogether. Even though module A only calls that function once, in one very particular situation..

Hence symbol_get and symbol_put, which look the symbol up at runtime and then call it. Think of it as the same idea as glibc's dlopen and dlsym.
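
The idea of "look it up by name when you need it, and cope if it is not there" can be modeled in plain userspace C. To be clear, this is a toy model and not the kernel's implementation: model_export, model_symbol_get, model_symbol_put and b_function are all hypothetical names standing in for EXPORT_SYMBOL, symbol_get, symbol_put and module B's BFunction.

    ```c
    #include <stddef.h>
    #include <string.h>

    typedef int (*bfunc_t)(int);

    /* One exported symbol in the toy registry. */
    struct exported_sym {
        const char *name;
        bfunc_t fn;
        int refs;   /* users currently holding the symbol */
    };

    #define MAX_SYMS 8
    static struct exported_sym sym_table[MAX_SYMS];
    static size_t sym_count;

    static struct exported_sym *model_find(const char *name)
    {
        size_t i;

        for (i = 0; i < sym_count; i++)
            if (strcmp(sym_table[i].name, name) == 0)
                return &sym_table[i];
        return NULL;
    }

    /* Models module B being loaded and exporting a function. 0 on success. */
    static int model_export(const char *name, bfunc_t fn)
    {
        if (sym_count == MAX_SYMS)
            return -1;
        sym_table[sym_count].name = name;
        sym_table[sym_count].fn = fn;
        sym_table[sym_count].refs = 0;
        sym_count++;
        return 0;
    }

    /* Models symbol_get(): NULL if nobody exports it, else pin and return it. */
    static bfunc_t model_symbol_get(const char *name)
    {
        struct exported_sym *s = model_find(name);

        if (s == NULL)
            return NULL;
        s->refs++;
        return s->fn;
    }

    /* Models symbol_put(): drop the reference taken by model_symbol_get(). */
    static void model_symbol_put(const char *name)
    {
        struct exported_sym *s = model_find(name);

        if (s != NULL && s->refs > 0)
            s->refs--;
    }

    /* Stand-in for module B's BFunction(). */
    static int b_function(int x)
    {
        return x + 1;
    }
    ```

Module A's side then becomes: call model_symbol_get("BFunction"), skip the feature if it returns NULL, otherwise call through the pointer and finish with model_symbol_put("BFunction"). Module A loads fine either way, which is the whole point.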
