Linux kernel performance and reliability monitoring and tuning

Summary

I created this section as a compendium of what I have learned about Linux performance monitoring and tuning. I am not an expert, but I decided to collect these notes in one place, with original references or citations where possible for non-original work. I started this in my private wikis because I grew tired of searching through the same articles every time a new subject came up; there are hundreds of thousands of articles out there on topics related to Linux performance. I then decided to make my notes public to benefit others, so this collection is an outgrowth of that and will take time to migrate over from my private archives. In some cases the original attribution is now lost. If you see something here that you would like to take credit for, please contact me by email.

Commands to monitor overall resource usage

top
dstat - http://dag.wiee.rs/home-made/dstat/
nmon - http://nmon.sourceforge.net/
htop - http://hisham.hm/htop/
Collectl - http://collectl.sourceforge.net/
Glances - https://nicolargo.github.io/glances/
saidar - See also http://www.binarytides.com/saidar-linux-system-monitor/
atop - http://www.atoptool.nl/
iftop - http://www.ex-parrot.com/pdw/iftop/

CPU monitoring tools

mpstat (try command: mpstat -P ALL)
sar -u
ps command
iostat
vmstat
directory /sys/devices/system/cpu or file /proc/cpuinfo to identify cpus present - see https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-system-cpu

$ sar -u 12 5

sar -u reports CPU utilization (here, 5 samples at 12-second intervals). The following values are displayed:

%user: Percentage of CPU utilization that occurred while executing at the user level (application).
%nice: Percentage of CPU utilization that occurred while executing at the user level with nice priority.
%system: Percentage of CPU utilization that occurred while executing at the system level (kernel).
%iowait: Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
%idle: Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.

Finally, you need to determine which process is monopolizing the CPUs. The following command displays the top CPU consumers on the system, sorted numerically by ps itself:

$ sudo ps -eo pcpu,pid,user,args --sort=-pcpu | head -10

OR, to page through the full list:

$ sudo ps -eo pcpu,pid,user,args --sort=-pcpu | less

%CPU   PID USER     COMMAND
  96  2148 vivek    /usr/lib/vmware/bin/vmware-vmx -C /var/lib/vmware/Virtual Machines/Ubuntu 64-bit/Ubuntu 64-bit.vmx -@ ""
 0.7  3358 mysql    /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/var/run/mysqld/mysqld.pid --skip-locking --socket=/var/lib/mysql/mysql.sock
 0.4 29129 lighttpd /usr/bin/php
 0.4 29128 lighttpd /usr/bin/php
 0.4 29127 lighttpd /usr/bin/php
 0.4 29126 lighttpd /usr/bin/php
 0.2  2177 vivek    [vmware-rtc]
 0.0     9 root     [kacpid]
 0.0     8 root     [khelper]

To see who’s using CPU:

$ sudo ps -e -o pcpu,cpu,nice,state,cputime,args --sort=-pcpu | head -n 10
%CPU CPU  NI S     TIME COMMAND
 100   -   0 R 00:01:31 dd if=/dev/zero of=/dev/null
 0.0   -   0 S 00:00:01 /sbin/init
 0.0   -   0 S 00:00:00 [kthreadd]
 0.0   -   - S 00:00:00 [migration/0]
 0.0   -   0 S 00:00:00 [ksoftirqd/0]
 0.0   -   - S 00:00:00 [stopper/0]
 0.0   -   - S 00:00:00 [watchdog/0]
 0.0   -   - S 00:00:00 [migration/1]
 0.0   -   - S 00:00:00 [stopper/1]

To look at overall CPU usage:

$ vmstat 20 3
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 7173628  98296 436428    0    0     2     1   26    8  0  1 99  0  0
 1  0      0 7173612  98296 436428    0    0     0     0 1010    9 13 37 50  0  0
 1  0      0 7173116  98312 436432    0    0     0     3 1016   13 12 38 50  0  0

Field descriptions for VM mode

(a) procs: the process-related fields are:

r: The number of processes waiting for run time.
b: The number of processes in uninterruptible sleep.

(b) memory: the memory-related fields are:

swpd: The amount of virtual memory used.
free: The amount of idle memory.
buff: The amount of memory used as buffers.
cache: The amount of memory used as cache.

(c) swap: the swap-related fields are:

si: Amount of memory swapped in from disk (/s).
so: Amount of memory swapped to disk (/s).

(d) io: the I/O-related fields are:

bi: Blocks received from a block device (blocks/s).
bo: Blocks sent to a block device (blocks/s).

(e) system: the system-related fields are:

in: The number of interrupts per second, including the clock.
cs: The number of context switches per second.

(f) cpu: the CPU-related fields are:

These are percentages of total CPU time.

us: Time spent running non-kernel code (user time, including nice time).
sy: Time spent running kernel code (system time).
id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
wa: Time spent waiting for IO. Prior to Linux 2.5.41, shown as zero.
st: Time stolen from a virtual machine. Prior to Linux 2.6.11, unknown.

To introduce cpu load

Perhaps you want to introduce a cpu load for tuning or analysis.

You can do this with this command:

$ dd if=/dev/zero of=/dev/null

This will run forever until you enter Control-C.

A note on threading

Non-threaded applications tend to consume mostly one CPU, whereas a well-balanced threaded application should divide its time across all available CPUs.

For stats on how well you are threading across multiple CPUs, look at the file /proc/stat.

Here is a very unbalanced application:

$ cat /proc/stat
cpu  101506 126 285419 21111496 1760 19 11 1297 0
cpu0 5408 51 3912 10740445 410 18 6 596 0
cpu1 96097 75 281507 10371050 1349 0 5 700 0
...

The very first "cpu" line aggregates the numbers in all of the other "cpuN" lines.

These numbers identify the amount of time the CPU has spent performing different kinds of work. Time units are in USER_HZ or Jiffies (typically hundredths of a second).

The meanings of the columns are as follows, from left to right:

user: normal processes executing in user mode
nice: niced processes executing in user mode
system: processes executing in kernel mode
idle: twiddling thumbs
iowait: waiting for I/O to complete
irq: servicing interrupts
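softirq: servicing softirqs
steal: time spent in other operating systems when running in a virtualized environment
guest: running a virtual CPU for guest operating systems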
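As a rough sketch of how these counters can be turned into a per-CPU picture, the shell function below prints, for each cpuN line, the share of time since boot that was neither idle nor iowait. The function name cpu_busy_pct and its optional file argument are my own, added so it can be tried against a saved copy of /proc/stat. Because the counters accumulate from boot, the result is a lifetime average, not the current load.

```shell
# cpu_busy_pct [FILE]
# FILE defaults to /proc/stat (the argument is an illustrative convenience).
cpu_busy_pct() {
    awk '/^cpu[0-9]+ / {
        total = 0
        # Fields 2..9 are user through steal. Guest columns beyond that
        # are already included in user/nice, so they are not added again.
        for (i = 2; i <= NF && i <= 9; i++) total += $i
        idle = $5 + $6                      # idle + iowait
        if (total > 0)
            printf "%s %.1f\n", $1, 100 * (total - idle) / total
    }' "${1:-/proc/stat}"
}
```

Run cpu_busy_pct with no arguments to read the live file. A large gap between the per-CPU figures suggests work that is not spreading across CPUs.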

mpstat shows one processor favored on this instance:

$ mpstat -P ALL 20 3
Linux 2.6.32-642.3.1.el6.x86_64 (typhon)        07/27/2016      _x86_64_        (2 CPU)

...

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all   12.47    0.00   37.54    0.00    0.00    0.00    0.00    0.00   49.99
Average:       0    0.00    0.00    0.02    0.00    0.00    0.00    0.00    0.00   99.98
Average:       1   24.95    0.00   75.05    0.00    0.00    0.00    0.00    0.00    0.00

You can show threads with the ps command (or lsof). Here you can see the single process that is consuming all of one CPU:

$ ps -eLf
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
root         1     0     1  0    1 Jul26 ?        00:00:01 /sbin/init
root         2     0     2  0    1 Jul26 ?        00:00:00 [kthreadd]
root         3     2     3  0    1 Jul26 ?        00:00:00 [migration/0]
root         4     2     4  0    1 Jul26 ?        00:00:00 [ksoftirqd/0]
root         5     2     5  0    1 Jul26 ?        00:00:00 [stopper/0]
...
root     29210  1457 29210  0    1 14:30 ?        00:00:00 sshd: sysoper [priv]
sysoper  29212 29210 29212  0    1 14:30 ?        00:00:00 sshd: sysoper@pts/1
sysoper  29213 29212 29213  0    1 14:30 pts/1    00:00:00 -bash
sysoper  29236 29213 29236 99    1 14:31 pts/1    01:21:17 dd if=/dev/zero of=/dev/null
postfix  29569  1547 29569  0    1 15:38 ?        00:00:00 pickup -l -t fifo -u

NLWP is the "number of light weight processes", that is, the number of threads in the process. LWP (alias SPID or TID) is the ID of the light weight process, or thread, being reported. In other words, LWP identifies one thread, while NLWP counts the threads that share a PID. Since Linux 2.4.19 (or so) threads can share the PID of the parent process and have a separate thread ID (TID). Most processes have just the one thread, so their TID is the same as their PID. C represents processor utilization; currently, this is the integer value of the percent usage over the lifetime of the process. In the above case we can see that the dd process has a single thread (NLWP of 1) with C at 99.

For more info see http://yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html for a tutorial on POSIX threads usage under Linux. See also https://www.akkadia.org/drepper/nptl-design.pdf

ps command options:

H Show threads as if they were processes (BSD-style option, written without a dash; -H shows the process hierarchy instead)
-L Show threads, possibly with LWP and NLWP columns
-T Show threads, possibly with SPID column
-m Show threads after processes

You can list threads for a given process several ways:

ps --pid <pid> -Lf for one process, ps -eLf for all processes, or ps -T -p <pid> to show one line per thread
top -H -p <pid>
Each thread in a process creates a directory under /proc/<pid>/task. Count the number of directories, and you have the number of threads.
The Threads: line in file /proc/<pid>/status

Examples:

$ sudo ps --pid 2702  -Lf
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
tomcat    2702     1  2702  0  209 04:39 ?        00:00:00 /usr/lib/jvm/java-1.7.0/bin/java -Xmx3072M -Xm
tomcat    2702     1  2704  0  209 04:39 ?        00:00:03 /usr/lib/jvm/java-1.7.0/bin/java -Xmx3072M -Xm
tomcat    2702     1  2705  0  209 04:39 ?        00:00:21 /usr/lib/jvm/java-1.7.0/bin/java -Xmx3072M -Xm
tomcat    2702     1  2706  0  209 04:39 ?        00:00:22 /usr/lib/jvm/java-1.7.0/bin/java -Xmx3072M -Xm
tomcat    2702     1  2707  0  209 04:39 ?        00:00:39 /usr/lib/jvm/java-1.7.0/bin/java -Xmx3072M -Xm
tomcat    2702     1  2708  0  209 04:39 ?        00:00:00 /usr/lib/jvm/java-1.7.0/bin/java -Xmx3072M -Xm
tomcat    2702     1  2709  0  209 04:39 ?        00:00:01 /usr/lib/jvm/java-1.7.0/bin/java -Xmx3072M -Xm
tomcat    2702     1  2710  0  209 04:39 ?        00:00:00 /usr/lib/jvm/java-1.7.0/bin/java -Xmx3072M -Xm
tomcat    2702     1  2711  0  209 04:39 ?        00:03:03 /usr/lib/jvm/java-1.7.0/bin/java -Xmx3072M -Xm
tomcat    2702     1  2712  0  209 04:39 ?        00:03:03 /usr/lib/jvm/java-1.7.0/bin/java -Xmx3072M -Xm
...

$ sudo ps -T -p 2702 
  PID  SPID TTY          TIME CMD
 2702  2702 ?        00:00:00 java
 2702  2704 ?        00:00:03 java
 2702  2705 ?        00:00:21 java
 2702  2706 ?        00:00:22 java
 2702  2707 ?        00:00:39 java
 2702  2708 ?        00:00:00 java
 2702  2709 ?        00:00:01 java
 2702  2710 ?        00:00:00 java

$ top -H -p 2702
top - 18:30:32 up 105 days,  8:33,  1 user,  load average: 0.02, 0.05, 0.05
Tasks: 209 total,   0 running, 209 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.0%us,  0.1%sy,  0.0%ni, 98.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4049992k total,  3621592k used,   428400k free,   186140k buffers
Swap:  8388604k total,    14848k used,  8373756k free,   968856k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                    
 2916 tomcat    20   0 5286m 2.3g  27m S  2.0 58.7   1:46.19 java                                        
 2920 tomcat    20   0 5286m 2.3g  27m S  2.0 58.7   0:00.96 java                                        
 2702 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:00.01 java                                        
 2704 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:03.04 java                                        
 2705 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:21.74 java                                        
 2706 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:22.02 java                                        
 2707 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:39.16 java                                        
 2708 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:00.51 java                                        
 2709 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:01.10 java                                        
 2710 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:00.00 java                                        
 2711 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   3:03.78 java                                        
 2712 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   3:04.78 java                                        
 2713 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:00.00 java                                        
 2714 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:00.00 java                                        
 2715 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:27.43 java                                        
 2716 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:00.00 java                                        
 2717 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:00.00 java                                        
 2718 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:00.00 java                                        
 2721 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:00.02 java                                        
 2727 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:00.11 java                                        
 2728 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:00.00 java                                        
 2731 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:00.04 java                                        
 2732 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:10.43 java                                        
 2734 tomcat    20   0 5286m 2.3g  27m S  0.0 58.7   0:00.42 java

$ sudo ls -l /proc/2702/task
total 0
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:46 11238
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:46 11998
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:46 12400
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:17 12962
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:17 12964
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:17 12965
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:17 13097
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:46 13385
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:46 13388
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:46 13389
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:46 13390
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:46 13391
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:46 13392
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:17 14928
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:17 14963
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:17 15629
dr-xr-xr-x 7 tomcat tomcat 0 Jul 27 18:17 15720
...

$ cat /proc/2702/status | grep Thread
Threads:        207
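The two /proc approaches above can be wrapped in small helpers. This is a minimal sketch: the function names and the optional second argument (an alternative proc root, useful only for trying the functions against a copy) are my own inventions.

```shell
# thread_count PID [PROC_ROOT]
# Counts the per-thread directories under /proc/<pid>/task.
thread_count() {
    task_dir="${2:-/proc}/$1/task"
    [ -d "$task_dir" ] || { echo "no such process: $1" >&2; return 1; }
    ls "$task_dir" | wc -l | tr -d ' '
}

# thread_count_status PID [PROC_ROOT]
# Reads the Threads: field from /proc/<pid>/status instead.
thread_count_status() {
    awk '/^Threads:/ { print $2 }' "${2:-/proc}/$1/status"
}
```

For example, thread_count 2702 and thread_count_status 2702 should normally agree, although a busy process can create or end threads between the two calls.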

Run queues

Use sar -q. It reports the run queue length (tasks waiting for run time) under the column runq-sz and the number of tasks in the task list under the column plist-sz.

$ sar -q 3 5
Linux 4.1.13-18.26.amzn1.x86_64 (ip-10-32-8-250)        07/27/2016      _x86_64_        (2 CPU)

06:32:39 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
06:32:42 PM         0       292      0.15      0.07      0.06
06:32:45 PM         0       293      0.14      0.06      0.06
06:32:48 PM         0       293      0.14      0.06      0.06
06:32:51 PM         0       293      0.13      0.06      0.05
06:32:54 PM         0       293      0.13      0.06      0.05
Average:            0       293      0.14      0.06      0.06
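When sar is not installed, similar numbers can be read from /proc/loadavg, whose fourth field is "runnable/total" scheduling entities. The sketch below only reformats that line. The function name load_summary and its optional file argument are hypothetical, and the runnable count corresponds only roughly to runq-sz since it is an instantaneous value.

```shell
# load_summary [FILE]
# FILE defaults to /proc/loadavg, e.g. "0.15 0.07 0.06 1/293 12345".
load_summary() {
    awk '{
        split($4, tasks, "/")       # tasks[1] = runnable, tasks[2] = total
        printf "ldavg-1=%s ldavg-5=%s ldavg-15=%s runnable=%s total=%s\n",
               $1, $2, $3, tasks[1], tasks[2]
    }' "${1:-/proc/loadavg}"
}
```

A load average that stays well above the number of CPUs is usually the first hint that tasks are queuing for run time.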

Memory and swap

Common memory and swap commands

free
cat /proc/meminfo
vmstat
top
htop
pmap
dmidecode -t 17 - Shows installed ram
File /proc/<pid>/status - Virtual memory statistics for a process

Linux memory management

An overview of Linux memory management, dated but still largely relevant: http://www.linuxhowtos.org/System/Linux%20Memory%20Management.htm

It is important to note that Linux will always try to cache the most recently used (MRU) files if memory is available. To help performance, Linux will try to fill otherwise unused memory with cache, up to a certain limit.

For a look at the kernel's own slab caches, see the file /proc/slabinfo. For more info on slabinfo use the command "$ man slabinfo".

Looking at the output of the free or top commands alone, or at how much memory is used, is not enough to paint a clear picture of memory allocation.

For example, to show active and inactive memory:

$ vmstat -a
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 7173612 181684 521708    0    0     2     1   21    8  0  1 99  0  0

Active memory is memory that has been used recently and is usually not reclaimed unless absolutely necessary. Inactive memory is memory that has been used less recently and is more eligible to be reclaimed for other purposes. For an even more detailed distribution of memory, look at the special file /proc/meminfo:

$ cat /proc/meminfo  
MemTotal:        8057792 kB
MemFree:         7173628 kB
Buffers:           98160 kB
Cached:           436424 kB
SwapCached:            0 kB
Active:           521712 kB
Inactive:         181732 kB
Active(anon):     169004 kB
Inactive(anon):       12 kB
Active(file):     352708 kB
Inactive(file):   181720 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        168856 kB
Mapped:            14644 kB
Shmem:               160 kB
Slab:             102796 kB
SReclaimable:      80104 kB
SUnreclaim:        22692 kB
KernelStack:        2000 kB
PageTables:         4440 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4028896 kB
Committed_AS:     190744 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       27420 kB
VmallocChunk:   34359705844 kB
HardwareCorrupted:     0 kB
AnonHugePages:    135168 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        6144 kB
DirectMap2M:     8382464 kB
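Since much of "used" memory is cache that the kernel can give back, a common rough estimate of what is really obtainable is MemFree + Buffers + Cached. The sketch below computes that from a meminfo-style file. The function name mem_summary and its optional file argument are my own, and the sum is only an approximation; newer kernels also export a MemAvailable line, which is a better estimate where present.

```shell
# mem_summary [FILE]
# FILE defaults to /proc/meminfo. All values are in kB, as in the file.
mem_summary() {
    awk '
        $1 == "MemTotal:" { total = $2 }
        $1 == "MemFree:"  { free = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached = $2 }
        END {
            printf "total_kb=%.0f free_kb=%.0f free_plus_cache_kb=%.0f\n",
                   total, free, free + buffers + cached
        }' "${1:-/proc/meminfo}"
}
```

On the system shown above this reports about 7.7 GB obtainable out of 8 GB, slightly more than MemFree alone suggests.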

For a better picture of active memory you need the ps command or other such tools (pmap) that look at process allocations. For Java you need to look at the heap size (-Xmx on the JVM). Also, if you are swapping, you very likely need more memory.

Virtual memory size (VSZ) vs resident set size (RSS) vs shared memory

The ps command can output various pieces of information about a process, such as its process ID, current running state, and resource utilization. Two of the possible outputs are VSZ and RSS, which stand for "virtual memory size" and "resident set size".

$ sudo ps ux -q 1457                     
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1457  0.0  0.0  66240  1200 ?        Ss   09:41   0:00 /usr/sbin/sshd

In the above example, process PID 1457 has a virtual size of about 66 MB (ps reports in kilobytes) and a resident size of 1.2 MB. ps is not reporting the real memory usage of processes. What it is really doing is showing how much real memory this process would take up if it were the only process running. A Linux machine has several dozen processes running at any given time, which means that the VSZ and RSS numbers reported by ps almost certainly overstate what each process really costs.

To understand why, it is necessary to learn how Linux handles shared libraries in programs. Shared libraries, especially commonly referenced ones like libc, are used by many of the programs running on a Linux system. Because of this sharing, Linux is able to load a single copy of a shared library into memory and use that one copy for every program that references it. Most tools don't care very much about sharing; they simply report how much memory a process uses, regardless of whether that memory is shared with other processes as well. Two programs could therefore use a large shared library and yet have its size count towards both of their memory usage totals; the library is being double-counted, which can be very misleading if you don't know what is going on.

Unfortunately, a perfect representation of process memory usage isn't easy to obtain.

Seeing a process's memory map

Let's see what the situation is with that "huge" SSH process. To see what PID 1457's memory looks like, we'll use the pmap program (with the -d flag):

$ sudo pmap -d 1457
1457:   /usr/sbin/sshd
Address           Kbytes Mode  Offset           Device    Mapping
00007f1cb346e000      52 r-x-- 0000000000000000 0ca:00001 libnss_files-2.12.so
00007f1cb347b000    2044 ----- 000000000000d000 0ca:00001 libnss_files-2.12.so
00007f1cb367a000       4 r---- 000000000000c000 0ca:00001 libnss_files-2.12.so
00007f1cb367b000       4 rw--- 000000000000d000 0ca:00001 libnss_files-2.12.so
00007f1cb367c000      28 r-x-- 0000000000000000 0ca:00001 librt-2.12.so
00007f1cb3683000    2044 ----- 0000000000007000 0ca:00001 librt-2.12.so
00007f1cb3882000       4 r---- 0000000000006000 0ca:00001 librt-2.12.so
00007f1cb3883000       4 rw--- 0000000000007000 0ca:00001 librt-2.12.so
00007f1cb3884000     228 r-x-- 0000000000000000 0ca:00001 libnspr4.so
00007f1cb38bd000    2048 ----- 0000000000039000 0ca:00001 libnspr4.so
00007f1cb3abd000       4 r---- 0000000000039000 0ca:00001 libnspr4.so
00007f1cb3abe000       8 rw--- 000000000003a000 0ca:00001 libnspr4.so
00007f1cb3ac0000       8 rw--- 0000000000000000 000:00000   [ anon ]
00007f1cb3ac2000      12 r-x-- 0000000000000000 0ca:00001 libplds4.so
00007f1cb3ac5000    2044 ----- 0000000000003000 0ca:00001 libplds4.so
00007f1cb3cc4000       4 r---- 0000000000002000 0ca:00001 libplds4.so
00007f1cb3cc5000       4 rw--- 0000000000003000 0ca:00001 libplds4.so
00007f1cb3cc6000      16 r-x-- 0000000000000000 0ca:00001 libplc4.so
00007f1cb3cca000    2044 ----- 0000000000004000 0ca:00001 libplc4.so
00007f1cb3ec9000       4 r---- 0000000000003000 0ca:00001 libplc4.so
00007f1cb3eca000       4 rw--- 0000000000004000 0ca:00001 libplc4.so
00007f1cb3ecb000     152 r-x-- 0000000000000000 0ca:00001 libnssutil3.so
00007f1cb3ef1000    2044 ----- 0000000000026000 0ca:00001 libnssutil3.so
00007f1cb40f0000      24 r---- 0000000000025000 0ca:00001 libnssutil3.so
00007f1cb40f6000       4 rw--- 000000000002b000 0ca:00001 libnssutil3.so
...
00007f1cb763d000       8 r---- 000000000001f000 0ca:00001 ld-2.12.so
00007f1cb763f000       4 rw--- 0000000000021000 0ca:00001 ld-2.12.so
00007f1cb7640000       4 rw--- 0000000000000000 000:00000   [ anon ]
00007f1cb7641000     544 r-x-- 0000000000000000 0ca:00001 sshd
00007f1cb78c8000      12 r---- 0000000000087000 0ca:00001 sshd
00007f1cb78cb000       4 rw--- 000000000008a000 0ca:00001 sshd
00007f1cb78cc000      36 rw--- 0000000000000000 000:00000   [ anon ]
00007f1cb8f60000     132 rw--- 0000000000000000 000:00000   [ anon ]
00007ffc200b3000      84 rw--- 0000000000000000 000:00000   [ stack ]
00007ffc2010f000       4 r-x-- 0000000000000000 000:00000   [ anon ]
ffffffffff600000       4 r-x-- 0000000000000000 000:00000   [ anon ]
mapped: 66240K    writeable/private: 816K    shared: 0K

I have trimmed a lot of the output; the rest is similar to what is shown. Even without the complete output, we can see some very interesting things. One important thing to note is that each shared library is listed several times, including once for its code segment and once for its data segment. The code segment (also known as "text") can be shared between processes; the writable data segment cannot, so each process that loads the library gets its own private copy. The code segments have a mode of "r-x--", while the data is set to "rw---". The Kbytes, Mode, and Mapping columns are the only ones we will care about, as the rest are unimportant to our analysis.

If you go through the output, you will find that the lines with the largest Kbytes number are usually the code segments of the included shared libraries (the ones that start with "lib" are the shared libraries). What is great about that is that they are the ones that can be shared between processes. If you factor out all of the parts that are shared between processes, you end up with the "writeable/private" total, which is shown at the bottom of the output. This is what can be considered the incremental cost of this process, factoring out the shared libraries. Therefore, the cost to run this instance of SSH (assuming that all of the shared libraries were already loaded) is around 800 kilobytes. That is quite a different story from the 66 or 1.2 megabytes that ps reported.
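To make the "writeable/private" figure less of a black box, the sketch below recomputes a similar total from pmap -d output by summing the Kbytes of every mapping that is writable and not marked shared. The function name private_writable_kb is my own, and the mode test assumes the five-character mode layout shown above, with "s" in the fourth position for shared mappings. It is an approximation for illustration, not a replacement for pmap's own footer.

```shell
# private_writable_kb: reads `pmap -d PID` output on standard input.
private_writable_kb() {
    awk '
        # Mapping rows start with a hex address followed by a numeric size;
        # this skips the "PID: command" line, the header and the footer.
        # Mode must have "w" in position 2 and "-" (not "s") in position 4.
        $1 ~ /^[0-9a-f]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^.w.-/ { sum += $2 }
        END { printf "%.0f\n", sum }'
}
```

Usage would look like: sudo pmap -d 1457 | private_writable_kb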

Keep that in mind when sizing applications. The moral of this story is that process memory usage on Linux is a complex matter; you can't just run ps and know what is going on. This is especially true when you deal with programs that create a lot of identical child processes, like Apache. ps might report that each Apache process uses 10 megabytes of memory, when the reality might be that the marginal cost of each Apache process is 1 megabyte of memory. This information becomes critical when tuning Apache's MaxClients setting, which determines how many simultaneous requests your server can handle.

Swap

Linux divides its physical RAM (random access memory) into chunks of memory called pages. Swapping is the process whereby a page of memory is copied to the preconfigured space on the hard disk, called swap space, to free up that page of memory. The combined size of the physical memory and the swap space is the amount of virtual memory available.

Swapping is necessary for two important reasons. First, when the system requires more memory than is physically available, the kernel swaps out less used pages and gives memory to the current application (process) that needs the memory immediately. Second, a significant number of the pages used by an application during its startup phase may only be used for initialization and then never used again. The system can swap out those pages and free the memory for other applications or even for the disk cache.

However, swapping does have a downside. Compared to memory, disks are very slow. Memory speeds can be measured in nanoseconds, while disks are measured in milliseconds, so accessing the disk can be tens of thousands of times slower than accessing physical memory. The more swapping that occurs, the slower your system will be. Sometimes excessive swapping, or thrashing, occurs, where a page is swapped out, then very soon swapped in, then swapped out again, and so on. In such situations the system is struggling to find free memory and keep applications running at the same time. In this case only adding more RAM will help.

Linux has two forms of swap space: the swap partition and the swap file. The swap partition is an independent section of the hard disk used solely for swapping; no other files can reside there. The swap file is a special file in the filesystem that resides amongst your system and data files.

To see what swap space you have, use the command swapon -s. The output will look something like this:

Filename    Type        Size      Used    Priority
/dev/sda5   partition   859436    0       -1

Tuning swap

It is possible to run a Linux system without swap space, and the system will run well if you have a large amount of memory. But if you run out of physical memory, the kernel has nothing to fall back on except killing processes (see the OOM killer section), so it is advisable to have swap space, especially since disk space is relatively cheap.

The Linux 2.6 kernel added a new kernel parameter called swappiness to let administrators tweak the way Linux swaps. It is a number from 0 to 100. In essence, higher values lead to more pages being swapped, and lower values lead to more applications being kept in memory, even if they are idle.

The default value for swappiness is 60. You can alter it temporarily (until you next reboot) by typing as root:

echo 50 > /proc/sys/vm/swappiness

If you want to alter it permanently then you need to change the vm.swappiness parameter in the /etc/sysctl.conf file.
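For the permanent setting, a small helper can guard against typos before anything is written to /etc/sysctl.conf. This is only a sketch: the function name swappiness_conf_line is my own, and the 0 to 100 check simply mirrors the range described above.

```shell
# swappiness_conf_line VALUE
# Prints a sysctl.conf line for VALUE, or fails if VALUE is not an
# integer in the 0..100 range described in the text.
swappiness_conf_line() {
    case "$1" in
        ''|*[!0-9]*) echo "swappiness must be a non-negative integer" >&2; return 1 ;;
    esac
    if [ "$1" -gt 100 ]; then
        echo "swappiness must be between 0 and 100" >&2
        return 1
    fi
    printf 'vm.swappiness = %s\n' "$1"
}
```

The printed line can then be added to /etc/sysctl.conf by hand or with something like swappiness_conf_line 50 | sudo tee -a /etc/sysctl.conf, followed by sudo sysctl -p to load it. The current value can be checked with sysctl vm.swappiness.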

Measuring swap activity

Paging is not the same as swapping. You might have paging activity when executables read portions of their binary code off disk or when working with memory-mapped files. Essentially, paging is loading code and data into memory as part of the RSS. Swapping is moving your process entirely out to swap space. Swapping is a hugely expensive operation compared to demand loading (paging).

A fault is loading a portion of memory from disk (as part of the VSZ) into RSS and making it resident in memory for use.

For the above reasons I do not rely on the sar -W command. To get a better idea of swap activity, take a look at the si / so counters of vmstat.

$ sudo vmstat 5 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 7350088  50412 365992    0    0     7     2   23   17  0  0 100  0  0
 0  0      0 7350088  50412 365992    0    0     0     0   17   10  0  0 100  0  0
 0  0      0 7350088  50420 365992    0    0     0     3   18   12  0  0 100  0  0
 0  0      0 7350088  50420 365992    0    0     0     0   13    9  0  0 100  0  0
 0  0      0 7350088  50420 365992    0    0     0     0   21   11  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     2   15   11  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     2   21   11  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     0   15   10  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     0   15    9  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     0   18   12  0  0 100  0  0

The above system is not swapping as si and so are zero.
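Checking the si and so columns by eye gets tedious on long captures. A minimal sketch that counts how many vmstat samples show any swap traffic follows; the function name swap_activity is my own, and it assumes the default vmstat column order shown above (si in column 7, so in column 8). Keep in mind that the first row vmstat prints is an average since boot, not a live sample.

```shell
# swap_activity: reads vmstat output on standard input.
swap_activity() {
    awk '
        $1 ~ /^[0-9]+$/ {               # data rows; both header rows start with text
            samples++
            if ($7 + $8 > 0) swapping++
        }
        END {
            printf "%d of %d samples show swap-in/swap-out activity\n",
                   swapping, samples
        }'
}
```

Usage would look like: vmstat 5 10 | swap_activity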

Java processes and faulting

$ sudo sar -B 3 5
Linux 2.6.32-642.3.1.el6.x86_64 (typhon)        07/26/2016      _x86_64_        (2 CPU)

05:40:16 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
05:40:19 PM      0.00      6.67     12.33      0.00     22.67      0.00      0.00      0.00      0.00
05:40:22 PM      0.00      0.00     14.00      0.00     25.00      0.00      0.00      0.00      0.00
05:40:25 PM      0.00      0.00     16.33      0.00     28.67      0.00      0.00      0.00      0.00
05:40:28 PM      0.00      0.00     14.33      0.00     26.33      0.00      0.00      0.00      0.00
05:40:31 PM      0.00      0.00     11.00      0.00     22.67      0.00      0.00      0.00      0.00
Average:         0.00      1.33     13.60      0.00     25.07      0.00      0.00      0.00      0.00

What is the difference between a minor fault, sometimes known as a "soft fault", and a "major fault" (aka "hard fault")? Note that the fault/s column counts both kinds, while majflt/s counts only major faults. A soft fault happens when the process needs a page that is already in memory but is not currently mapped into the process, for example one that was freed by the page replacement process and not yet reused. A major or "hard" fault happens when the page needs to be brought into memory from disk. Major faults are, of course, much more expensive and take much longer to complete than soft ones. A large number of major page faults can slow the system down to a crawl. On a system that is short of memory, handling major page faults can account for a large share of the time spent in kernel mode.
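The same minor/major split is kept per process in /proc/<pid>/stat. The sketch below pulls out both counters, assuming the field layout documented in proc(5), where minflt is field 10 and majflt is field 12. The function name fault_counts and the optional proc-root argument are my own.

```shell
# fault_counts PID [PROC_ROOT]
fault_counts() {
    # Field 2 (the command name in parentheses) may contain spaces, so
    # drop everything through the last ") " before splitting on blanks.
    # After that, state is $1, so minflt is $8 and majflt is $10.
    sed 's/^.*) //' "${2:-/proc}/$1/stat" |
        awk '{ printf "minflt=%s majflt=%s\n", $8, $10 }'
}
```

Calling fault_counts 3275 twice a few seconds apart and comparing the majflt values shows whether the process is currently taking major faults.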

Also look at major faults on the Java process with ps:

$ sudo ps -o pid,ppid,flags,rss,resident,size,min_flt,maj_flt,share,vsize 3275
   PID  PPID F   RSS   RES    SZ  MINFL  MAJFL -    VSZ
  3275  3239 0 1555480   - 2871068 443437    0 - 2972676

In steady state "MAJFL" should stay at or near zero. VSZ (virtual size) is the size of the whole process address space, whether resident in core or not; for a JVM this should be roughly the maximum heap size plus some overhead. RSS (resident set size) is how much of the VSZ is resident in core at this moment; this number will fluctuate as garbage collection (GC) runs. SZ (size) is, per the ps man page, a very rough estimate of the swap space the process would need if it dirtied all of its writable pages and was then swapped out.

OOM killer

OOM or Out of memory errors and adjustment.

https://linux-mm.org/OOM_Killer - "It is the job of the linux 'oom killer' to sacrifice one or more processes in order to free up memory for the system when all else fails. It will also kill any process sharing the same mm_struct as the selected process, for obvious reasons. Any particular process leader may be immunized against the oom killer if the value of its /proc/<pid>/oomadj is set to the constant OOM_DISABLE (currently defined as -17)."

An example of an OOM kill in the system log (usually /var/log/messages):

 Mar 13 15:40:33 web1 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0,
   oom_adj=0, oom_score_adj=0
 Mar 13 15:40:33 web1 kernel: mysqld cpuset=/ mems_allowed=0
 Mar 13 15:40:33 web1 kernel: Pid: 18355, comm: mysqld Not tainted 3.2.1-linode40 #1
 Mar 13 15:40:33 web1 kernel: Call Trace:
 Mar 13 15:40:33 web1 kernel: [<c01958fd>] ? T.662+0x7d/0x1b0
 Mar 13 15:40:33 web1 kernel: [<c01067fb>] ? xen_restore_fl_direct_reloc+0x4/0x4
 Mar 13 15:40:33 web1 kernel: [<c06e1431>] ? _raw_spin_unlock_irqrestore+0x11/0x20
 Mar 13 15:40:33 web1 kernel: [<c04707a7>] ? ___ratelimit+0x97/0x110
 Mar 13 15:40:33 web1 kernel: [<c0158fb1>] ? get_task_cred+0x11/0x50
 Mar 13 15:40:33 web1 kernel: [<c0195a8e>] ? T.661+0x5e/0x150

From the kernel documentation for "oom_adj" (filesystems/proc.txt):

 2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score
 ------------------------------------------------------
 This file can be used to adjust the score used to select which processes
 should be killed in an  out-of-memory  situation.  Giving it a high score will
 increase the likelihood of this process being killed by the oom-killer.  Valid
 values are in the range -16 to +15, plus the special value -17, which disables
 oom-killing altogether for this process.
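To see where a process currently stands, the kernel exposes its badness score in /proc/<pid>/oom_score. The log excerpt above also shows oom_score_adj, the newer adjustment knob found alongside oom_adj on later kernels. The sketch below prints both; the function name oom_info and the optional proc-root argument are my own, and the fallback to "n/a" is an assumption for kernels that lack oom_score_adj.

```shell
# oom_info PID [PROC_ROOT]
oom_info() {
    dir="${2:-/proc}/$1"
    [ -r "$dir/oom_score" ] || { echo "cannot read $dir/oom_score" >&2; return 1; }
    score=$(cat "$dir/oom_score")
    # oom_score_adj may be missing on old kernels; report n/a in that case.
    adj=$(cat "$dir/oom_score_adj" 2>/dev/null || echo "n/a")
    printf 'pid=%s oom_score=%s oom_score_adj=%s\n' "$1" "$score" "$adj"
}
```

A higher oom_score means the process is a more likely victim. Running oom_info over the PIDs of the largest processes gives a quick idea of what the OOM killer would pick first.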