Some science applications experience unacceptably high soft page fault rates. This shows up as excessive kernel time and reduces the CPU time available for computation.
What is a page fault?
A page fault is not an error. A page is a fixed-size block of memory, the smallest unit that the memory manager will handle. A page fault happens when a program accesses a memory address whose page isn't currently mapped to physical memory.
A soft (or minor) page fault can be satisfied by the operating system from a page that is already in physical memory. A hard (or major) page fault requires the operating system to read the page in from disk (for example, from the page file), and is much more expensive than a soft fault because of the disk access time.
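To make the soft/hard distinction concrete, here is a minimal Python sketch that reads both counters for its own process and then touches freshly allocated memory. It assumes a Unix-like system (the standard `resource` module is not available on Windows), and the function names `fault_counts` and `faults_from_touching` are hypothetical, chosen only for this illustration.

```python
import resource


def fault_counts():
    """Return (soft, hard) page fault totals for the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # ru_minflt: faults served without disk I/O (soft/minor).
    # ru_majflt: faults that required disk I/O (hard/major).
    return usage.ru_minflt, usage.ru_majflt


def faults_from_touching(n_bytes):
    """Allocate n_bytes of fresh memory, write to every page, return the fault deltas."""
    soft_before, hard_before = fault_counts()
    buf = bytearray(n_bytes)  # fresh, zero-filled buffer
    page = resource.getpagesize()
    for offset in range(0, n_bytes, page):
        buf[offset] = 1  # one write per page
    soft_after, hard_after = fault_counts()
    return soft_after - soft_before, hard_after - hard_before


if __name__ == "__main__":
    soft, hard = faults_from_touching(64 * 1024 * 1024)
    print(f"soft faults: {soft}, hard faults: {hard}")
```

On a machine with enough free RAM, the soft count would be expected to rise while the hard count stays at or near zero, because the new pages are supplied from physical memory rather than from disk. The exact numbers depend on the page size and on the operating system.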
How can I check the page fault rate?
- On Windows, open Task Manager.
- In the Processes tab, click View → Select Columns...
- Add the Page Faults Delta column.
- Click OK.
- Identify the process belonging to the science application. You may need to select Show processes from all users first.
- Watch the value in the Page Faults Delta column (shown in Task Manager as PF Delta).
This method shows combined hard and soft faults.
On Linux, first run the problem project exclusively, with no other projects active. Then gather some statistics with sar from the sysstat package.
    # /etc/init.d/sysstat start
    # sar -B 3 4
    Linux 2.6.23-gentoo-r8 (hostname)

    11:11:27  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
    11:11:30      0.00     56.00 185922.33      0.00 185991.67      0.00      0.00      0.00      0.00
    11:11:33      0.00      0.00 183003.00      0.00 183063.00      0.00      0.00      0.00      0.00
    11:11:36      0.00     18.67 185375.67      0.00 185465.00      0.00      0.00      0.00      0.00
    11:11:39      0.00      0.00 185054.00      0.00 185180.33      0.00      0.00      0.00      0.00
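Do this both with and without BOINC running. If fault/s is very high while BOINC is running but not when it is stopped, then the page fault issue is confirmed. Note that fault/s counts hard and soft faults together, while majflt/s counts hard faults only, so a high fault/s alongside a majflt/s near zero points to soft faults.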
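Comparing the two runs by eye can be tedious, so the following Python sketch averages the fault/s and majflt/s columns from captured `sar -B` text. It assumes the column layout shown in the sample above, with the header row containing a `fault/s` column. The names `parse_sar_paging`, `mean_rate`, and `SAMPLE` are hypothetical and exist only for this illustration.

```python
SAMPLE = """\
Linux 2.6.23-gentoo-r8 (hostname)

11:11:27 pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
11:11:30 0.00 56.00 185922.33 0.00 185991.67 0.00 0.00 0.00 0.00
11:11:33 0.00 0.00 183003.00 0.00 183063.00 0.00 0.00 0.00 0.00
11:11:36 0.00 18.67 185375.67 0.00 185465.00 0.00 0.00 0.00 0.00
11:11:39 0.00 0.00 185054.00 0.00 185180.33 0.00 0.00 0.00 0.00
"""


def parse_sar_paging(text):
    """Parse `sar -B` text into a list of {column: value} dicts, one per sample."""
    header = None
    skip = 0  # number of leading timestamp fields on each row
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if "fault/s" in fields:
            # Header row: column names start at the first field containing "/".
            skip = next(i for i, f in enumerate(fields) if "/" in f)
            header = fields[skip:]
            continue
        if header is None or len(fields) != skip + len(header):
            continue  # banner, blank line, or unrelated text
        if fields[0].startswith("Average"):
            continue  # summary row, not an individual sample
        try:
            values = [float(f) for f in fields[skip:]]
        except ValueError:
            continue
        samples.append(dict(zip(header, values)))
    return samples


def mean_rate(samples, column):
    """Average one column (for example "fault/s") over all samples."""
    if not samples:
        raise ValueError("no sar samples found")
    return sum(s[column] for s in samples) / len(samples)


if __name__ == "__main__":
    samples = parse_sar_paging(SAMPLE)
    print(f"mean fault/s:  {mean_rate(samples, 'fault/s'):.2f}")
    print(f"mean majflt/s: {mean_rate(samples, 'majflt/s'):.2f}")
```

Running the same function over output captured with BOINC stopped gives a baseline, and the difference between the two fault/s means is a rough measure of the faults caused by the science application.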
The root cause is thought to be inefficient memory allocation in the research software. Code originally written for supercomputers often has memory allocation patterns that perform poorly when moved to a desktop architecture.
The extent of the problem varies depending on processor architecture, operating system, number of CPUs/cores, and especially the amount of L2 cache present. This explains why the problem is sometimes worse when several tasks from the same project are running simultaneously.
If the problem is severe, the only solution is to opt out of the problem project. Report the problem on the forums; if it is widespread, the technicians will try to provide a permanent fix.