Linux Memory and Swap Monitoring and Tuning

Common memory and swap commands

free
cat /proc/meminfo
vmstat
top
htop
pmap
dmidecode -t 17 - shows installed RAM
/proc/<pid>/status - virtual memory statistics for a process

Linux memory management

For a dated but still largely relevant overview of Linux memory management, see http://www.linuxhowtos.org/System/Linux%20Memory%20Management.htm.

It is important to note that Linux will always try to cache the most recently used (MRU) files if memory is available. To help performance, Linux fills otherwise unused memory with cache, up to a certain limit, and gives that memory back when applications need it.

For a look at the kernel's slab caches (dentries, inodes and other kernel objects), look at the file /proc/slabinfo. For more information on slabinfo, use the command "$ man slabinfo".

Looking at the output of the free or top commands alone, or at how much memory is used, is not enough to paint a clear picture of memory allocation.

For example, to show active and inactive memory:

$ vmstat -a
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 7173612 181684 521708    0    0     2     1   21    8  0  1 99  0  0

Active memory is memory that has been used recently and is usually not reclaimed unless absolutely necessary. Inactive memory is memory that has not been used recently and is the first candidate to be reclaimed for other purposes. For an even more detailed breakdown of memory, look at the special file /proc/meminfo:

$ cat /proc/meminfo  
MemTotal:        8057792 kB
MemFree:         7173628 kB
Buffers:           98160 kB
Cached:           436424 kB
SwapCached:            0 kB
Active:           521712 kB
Inactive:         181732 kB
Active(anon):     169004 kB
Inactive(anon):       12 kB
Active(file):     352708 kB
Inactive(file):   181720 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        168856 kB
Mapped:            14644 kB
Shmem:               160 kB
Slab:             102796 kB
SReclaimable:      80104 kB
SUnreclaim:        22692 kB
KernelStack:        2000 kB
PageTables:         4440 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4028896 kB
Committed_AS:     190744 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       27420 kB
VmallocChunk:   34359705844 kB
HardwareCorrupted:     0 kB
AnonHugePages:    135168 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        6144 kB
DirectMap2M:     8382464 kB
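
As a rough illustration of how these fields combine, here is a minimal Python sketch that parses meminfo-style text and estimates how much memory could be made available without swapping. The formula (MemFree + Buffers + Cached + SReclaimable) is a simplifying assumption, not the kernel's own calculation, and it overcounts slightly because Cached also includes Shmem. The function names are hypothetical, and SAMPLE is an excerpt of the output above.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip anything that is not "Name: <number> [kB]".
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def estimate_reclaimable_kb(info):
    """Rough estimate (an assumption): free memory plus caches the kernel can drop."""
    return (info["MemFree"] + info.get("Buffers", 0)
            + info.get("Cached", 0) + info.get("SReclaimable", 0))


# Excerpt of the /proc/meminfo output shown above.
SAMPLE = """\
MemTotal:        8057792 kB
MemFree:         7173628 kB
Buffers:           98160 kB
Cached:           436424 kB
SReclaimable:      80104 kB
HugePages_Total:       0
"""

info = parse_meminfo(SAMPLE)
print(estimate_reclaimable_kb(info), "kB of", info["MemTotal"], "kB")
```

On a live system, pass open("/proc/meminfo").read() instead of SAMPLE. Newer kernels also export a MemAvailable field, which is the kernel's own estimate and is preferable where it is present.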

For a better picture of active memory you need the ps command or other such tools (pmap) that look at process allocations. For Java you need to look at the heap size (-Xmx on the JVM). Also, if you are swapping, you very likely need more memory.

Virtual memory size (VSZ) vs Resident Set Size (RSS) vs shared memory

The ps command can output various pieces of information about a process, such as its process id, current running state, and resource utilization. Two of the possible outputs are VSZ and RSS, which stand for "virtual memory size" and "resident set size".

$ sudo ps ux -q 1457                     
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1457  0.0  0.0  66240  1200 ?        Ss   09:41   0:00 /usr/sbin/sshd

In the above example, PID 1457 has a virtual size of about 66 MB (ps reports in kilobytes) and a resident size of about 1.2 MB. ps is not reporting the real memory usage of the process. What it is really showing is how much real memory the process would take up if it were the only process running. A Linux machine has several dozen processes running at any given time, which means that the VSZ and RSS numbers reported by ps are almost certainly misleading.

To understand why, it is necessary to learn how Linux handles shared libraries. Shared libraries, especially commonly referenced ones like libc, are used by many of the programs running on a Linux system. Because of this sharing, Linux is able to load a single copy of a shared library into memory and use that one copy for every program that references it. Most tools don't care very much about sharing; they simply report how much memory a process uses, regardless of whether that memory is shared with other processes as well. Two programs could therefore use a large shared library and yet have its size count towards both of their memory usage totals; the library is being double-counted, which can be very misleading if you don't know what is going on.
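
One way to see the double counting is to compare Rss with Pss (proportional set size) in /proc/<pid>/smaps, where each shared page is divided by the number of processes sharing it. The sketch below is not from the original text: the function name is hypothetical and the sample excerpt is made up for illustration (1000 kB of a library shared evenly by four processes, plus a private heap).

```python
def sum_smaps_field(smaps_text, field):
    """Sum one per-mapping field (for example "Rss" or "Pss") in kB."""
    prefix = field + ":"
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith(prefix):
            total += int(line.split()[1])
    return total


# Hypothetical smaps excerpt, for illustration only.
SAMPLE_SMAPS = """\
7f1cb5a00000-7f1cb5b8a000 r-xp 00000000 ca:01 1234   /lib64/libc-2.12.so
Size:               1576 kB
Rss:                1000 kB
Pss:                 250 kB
7f1cb8f60000-7f1cb8f81000 rw-p 00000000 00:00 0      [heap]
Size:                132 kB
Rss:                 100 kB
Pss:                 100 kB
"""

rss_kb = sum_smaps_field(SAMPLE_SMAPS, "Rss")
pss_kb = sum_smaps_field(SAMPLE_SMAPS, "Pss")
print("Rss:", rss_kb, "kB  Pss:", pss_kb, "kB")
```

Adding up Rss across all four processes would count the library four times. Adding up Pss counts it once, which is why Pss totals are the safer number to sum across processes.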

Unfortunately, a perfect representation of process memory usage isn't easy to obtain, but looking at a process's memory map helps. Let's see what the situation is with that "huge" SSH process. To see what PID 1457's memory looks like, we'll use the pmap program (with the -d flag):

$ sudo pmap -d 1457
1457:   /usr/sbin/sshd
Address           Kbytes Mode  Offset           Device    Mapping
00007f1cb346e000      52 r-x-- 0000000000000000 0ca:00001 libnss_files-2.12.so
00007f1cb347b000    2044 ----- 000000000000d000 0ca:00001 libnss_files-2.12.so
00007f1cb367a000       4 r---- 000000000000c000 0ca:00001 libnss_files-2.12.so
00007f1cb367b000       4 rw--- 000000000000d000 0ca:00001 libnss_files-2.12.so
00007f1cb367c000      28 r-x-- 0000000000000000 0ca:00001 librt-2.12.so
00007f1cb3683000    2044 ----- 0000000000007000 0ca:00001 librt-2.12.so
00007f1cb3882000       4 r---- 0000000000006000 0ca:00001 librt-2.12.so
00007f1cb3883000       4 rw--- 0000000000007000 0ca:00001 librt-2.12.so
00007f1cb3884000     228 r-x-- 0000000000000000 0ca:00001 libnspr4.so
00007f1cb38bd000    2048 ----- 0000000000039000 0ca:00001 libnspr4.so
00007f1cb3abd000       4 r---- 0000000000039000 0ca:00001 libnspr4.so
00007f1cb3abe000       8 rw--- 000000000003a000 0ca:00001 libnspr4.so
00007f1cb3ac0000       8 rw--- 0000000000000000 000:00000   [ anon ]
00007f1cb3ac2000      12 r-x-- 0000000000000000 0ca:00001 libplds4.so
00007f1cb3ac5000    2044 ----- 0000000000003000 0ca:00001 libplds4.so
00007f1cb3cc4000       4 r---- 0000000000002000 0ca:00001 libplds4.so
00007f1cb3cc5000       4 rw--- 0000000000003000 0ca:00001 libplds4.so
00007f1cb3cc6000      16 r-x-- 0000000000000000 0ca:00001 libplc4.so
00007f1cb3cca000    2044 ----- 0000000000004000 0ca:00001 libplc4.so
00007f1cb3ec9000       4 r---- 0000000000003000 0ca:00001 libplc4.so
00007f1cb3eca000       4 rw--- 0000000000004000 0ca:00001 libplc4.so
00007f1cb3ecb000     152 r-x-- 0000000000000000 0ca:00001 libnssutil3.so
00007f1cb3ef1000    2044 ----- 0000000000026000 0ca:00001 libnssutil3.so
00007f1cb40f0000      24 r---- 0000000000025000 0ca:00001 libnssutil3.so
00007f1cb40f6000       4 rw--- 000000000002b000 0ca:00001 libnssutil3.so
...
00007f1cb763d000       8 r---- 000000000001f000 0ca:00001 ld-2.12.so
00007f1cb763f000       4 rw--- 0000000000021000 0ca:00001 ld-2.12.so
00007f1cb7640000       4 rw--- 0000000000000000 000:00000   [ anon ]
00007f1cb7641000     544 r-x-- 0000000000000000 0ca:00001 sshd
00007f1cb78c8000      12 r---- 0000000000087000 0ca:00001 sshd
00007f1cb78cb000       4 rw--- 000000000008a000 0ca:00001 sshd
00007f1cb78cc000      36 rw--- 0000000000000000 000:00000   [ anon ]
00007f1cb8f60000     132 rw--- 0000000000000000 000:00000   [ anon ]
00007ffc200b3000      84 rw--- 0000000000000000 000:00000   [ stack ]
00007ffc2010f000       4 r-x-- 0000000000000000 000:00000   [ anon ]
ffffffffff600000       4 r-x-- 0000000000000000 000:00000   [ anon ]
mapped: 66240K    writeable/private: 816K    shared: 0K

Much of the output has been omitted; the rest is similar to what is shown. Even without the complete output, we can see some very interesting things. One important thing to note is that each shared library is listed more than once, including once for its code segment and once for its data segment. The code (also known as "text") segment can be shared between processes. The data segment cannot be shared: each process gets its own private copy of any data pages it writes to, so the same data can exist in memory many times over. The code segments have a mode of "r-x--", while the data is set to "rw---". The Kbytes, Mode, and Mapping columns are the only ones we care about; the rest are unimportant to our analysis.

If you go through the output, you will find that the lines with the largest Kbytes number are usually the code segments of the included shared libraries (the ones that start with "lib" are the shared libraries). What is great about that is that they are the ones that can be shared between processes. If you factor out all of the parts that are shared between processes, you end up with the "writeable/private" total, which is shown at the bottom of the output. This is what can be considered the incremental cost of this process, factoring out the shared libraries. Therefore, the cost to run this instance of SSH (assuming that all of the shared libraries were already loaded) is around 800 kilobytes. That is quite a different story from the 66 or 1.2 megabytes that ps reported.
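
To make the "writeable/private" figure less of a black box, here is a small Python sketch that recomputes it from pmap -d style rows. It assumes the column layout shown above and assumes that a shared mapping is marked with "s" as the fourth character of the Mode column; the function name is hypothetical, and the sample rows are a subset of the output above.

```python
def private_writable_kb(pmap_text):
    """Sum the Kbytes of mappings that are writable and not shared."""
    total = 0
    for line in pmap_text.splitlines():
        # Data rows: address, Kbytes, mode, offset, device, mapping name.
        fields = line.split(None, 5)
        if len(fields) < 6 or not fields[1].isdigit():
            continue  # header, title and summary lines
        kbytes, mode = int(fields[1]), fields[2]
        if len(mode) >= 4 and mode[1] == "w" and mode[3] != "s":
            total += kbytes
    return total


# A few rows copied from the pmap output above.
SAMPLE_PMAP = """\
Address           Kbytes Mode  Offset           Device    Mapping
00007f1cb7641000     544 r-x-- 0000000000000000 0ca:00001 sshd
00007f1cb78c8000      12 r---- 0000000000087000 0ca:00001 sshd
00007f1cb78cb000       4 rw--- 000000000008a000 0ca:00001 sshd
00007f1cb78cc000      36 rw--- 0000000000000000 000:00000   [ anon ]
00007ffc200b3000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 66240K    writeable/private: 816K    shared: 0K
"""

print(private_writable_kb(SAMPLE_PMAP), "kB writable and private")
```

Only the three "rw---" rows are counted here. Run over the complete pmap output, the same sum should come close to the 816K that pmap itself reports.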

Keep that in mind when sizing applications. The moral of this story is that process memory usage on Linux is a complex matter; you can't just run ps and know what is going on. This is especially true when you deal with programs that create a lot of identical child processes, like Apache. ps might report that each Apache process uses 10 megabytes of memory, when the reality might be that the marginal cost of each Apache process is 1 megabyte of memory. This information becomes critical when tuning Apache's MaxClients setting, which determines how many simultaneous requests your server can handle.
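
The arithmetic behind that MaxClients point can be sketched as follows. The 10 MB and 1 MB figures come from the example above; the 512 MB memory budget and the function names are hypothetical, and a real sizing exercise would also leave headroom for the kernel and other services.

```python
def naive_max_clients(budget_mb, reported_mb):
    """Sizing from the per-process figure ps reports (double-counts shared pages)."""
    return budget_mb // reported_mb


def marginal_max_clients(budget_mb, reported_mb, marginal_mb):
    """Sizing from marginal cost: pay for the shared part once, then per process."""
    shared_mb = reported_mb - marginal_mb
    return (budget_mb - shared_mb) // marginal_mb


BUDGET_MB = 512  # hypothetical amount of RAM set aside for Apache

print("naive:", naive_max_clients(BUDGET_MB, 10))
print("marginal:", marginal_max_clients(BUDGET_MB, 10, 1))
```

The gap between the two results is the point: sizing from the ps figure alone would leave most of the memory budget unused.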

Swap

Linux divides its physical RAM (random access memory) into chunks of memory called pages. Swapping is the process whereby a page of memory is copied to the preconfigured space on the hard disk, called swap space, to free up that page of memory. The combined size of the physical memory and the swap space is the amount of virtual memory available.

Swapping is necessary for two important reasons. First, when the system requires more memory than is physically available, the kernel swaps out less used pages and gives memory to the current application (process) that needs the memory immediately. Second, a significant number of the pages used by an application during its startup phase may only be used for initialization and then never used again. The system can swap out those pages and free the memory for other applications or even for the disk cache.

However, swapping does have a downside. Compared to memory, disks are very slow. Memory access times are measured in nanoseconds, while disk access times are measured in milliseconds, so accessing the disk can be tens of thousands of times slower than accessing physical memory. The more swapping that occurs, the slower your system will be. Sometimes excessive swapping, or thrashing, occurs: a page is swapped out, very soon swapped back in, then swapped out again, and so on. In such situations the system is struggling to find free memory and keep applications running at the same time. In this case only adding more RAM will help.

Linux has two forms of swap space: the swap partition and the swap file. The swap partition is an independent section of the hard disk used solely for swapping; no other files can reside there. The swap file is a special file in the filesystem that resides amongst your system and data files.

To see what swap space you have, use the command swapon -s. The output will look something like this:

Filename   Type       Size    Used  Priority
/dev/sda5  partition  859436  0     -1
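
If you want to track swap usage over time, that table is easy to parse. The Python sketch below assumes the five-column layout shown above; the function name is hypothetical, and the /swapfile row in the sample is invented to show a partly used swap area.

```python
def swap_usage(swapon_text):
    """Return {filename: (size_kb, used_kb, percent_used)} from `swapon -s` style output."""
    usage = {}
    for line in swapon_text.splitlines():
        fields = line.split()
        # Data rows have five columns with numeric Size and Used; this skips the header.
        if len(fields) == 5 and fields[2].isdigit() and fields[3].isdigit():
            size, used = int(fields[2]), int(fields[3])
            percent = 100.0 * used / size if size else 0.0
            usage[fields[0]] = (size, used, percent)
    return usage


SAMPLE_SWAPON = """\
Filename  Type       Size       Used Priority
/dev/sda5 partition  859436  0       -1
/swapfile file       1048576 262144  -2
"""

for name, (size, used, percent) in swap_usage(SAMPLE_SWAPON).items():
    print(f"{name}: {used} of {size} kB used ({percent:.1f}%)")
```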

Tuning swap

It is possible to run a Linux system without swap space, and the system will run well if you have a large amount of memory. But if you run out of physical memory, the kernel has nothing else it can do except kill processes (see the OOM killer section) or, in the worst case, crash. It is therefore advisable to have swap space, especially since disk space is relatively cheap.

The Linux 2.6 kernel added a new kernel parameter called swappiness to let administrators tweak the way Linux swaps. It is a number from 0 to 100. In essence, higher values lead to more pages being swapped, and lower values lead to more applications being kept in memory, even if they are idle.

The default value for swappiness is 60. You can alter it temporarily (until you next reboot) by typing as root:

echo 50 > /proc/sys/vm/swappiness

If you want to alter it permanently, change the vm.swappiness parameter in the /etc/sysctl.conf file.
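
The permanent change amounts to a single "vm.swappiness = 50" line in /etc/sysctl.conf. As a hedged sketch of how a provisioning script might make that edit, the Python function below works on the file's text rather than the file itself; the function name and the sample contents are hypothetical.

```python
def set_sysctl(conf_text, key, value):
    """Return conf_text with `key = value` set, replacing an existing entry or appending one."""
    new_line = f"{key} = {value}"
    lines = conf_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave comments and lines without an assignment untouched.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        if stripped.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Hypothetical existing file contents.
SAMPLE_CONF = "# Kernel sysctl configuration\nvm.swappiness = 60\n"

print(set_sysctl(SAMPLE_CONF, "vm.swappiness", 50), end="")
```

After writing the result back to /etc/sysctl.conf, running "sysctl -p" as root loads the file without a reboot.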

Measuring swap activity

Paging is not the same as swapping. You might see paging activity when running executables, as portions of their binary code are read off disk, or when working with memory-mapped files. Essentially, paging is loading code and text data into memory as part of the RSS. Swapping is moving your process's memory out of RAM into swap space. Swapping is a hugely expensive operation compared to demand loading (paging).

A fault is loading a portion of memory from disk (as part of the VSZ) into RSS and making it resident in memory for use.

For the above reasons I do not rely on the sar -W command. To get a better idea of swap activity, take a look at the si / so counters of vmstat.

$ sudo vmstat 5 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 7350088  50412 365992    0    0     7     2   23   17  0  0 100  0  0
 0  0      0 7350088  50412 365992    0    0     0     0   17   10  0  0 100  0  0
 0  0      0 7350088  50420 365992    0    0     0     3   18   12  0  0 100  0  0
 0  0      0 7350088  50420 365992    0    0     0     0   13    9  0  0 100  0  0
 0  0      0 7350088  50420 365992    0    0     0     0   21   11  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     2   15   11  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     2   21   11  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     0   15   10  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     0   15    9  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     0   18   12  0  0 100  0  0

The above system is not swapping as si and so are zero.
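
The same check can be scripted. This Python sketch finds the si and so columns by name in vmstat-style output and reports whether any sample shows swap traffic. The function names are hypothetical; the first two data rows come from the output above, and the last row is invented to show a swapping system.

```python
def swap_samples(vmstat_text):
    """Return a list of (si, so) pairs, one per data row of vmstat output."""
    header = None
    samples = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "si" in fields and "so" in fields:
            header = fields  # the column-name row
        elif header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            row = dict(zip(header, map(int, fields)))
            samples.append((row["si"], row["so"]))
    return samples


def is_swapping(samples):
    """True if any sample shows pages swapped in or out."""
    return any(si > 0 or so > 0 for si, so in samples)


SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 7350088  50412 365992    0    0     7     2   23   17  0  0 100  0  0
 0  0      0 7350088  50412 365992    0    0     0     0   17   10  0  0 100  0  0
"""

# Invented row: 120 kB/s swapped in, 340 kB/s swapped out.
BUSY_ROW = " 3  1  81920  10240   1024  20480  120  340   900   700  400  800 10 20 30 40  0\n"

print(is_swapping(swap_samples(SAMPLE_VMSTAT)))
print(is_swapping(swap_samples(SAMPLE_VMSTAT + BUSY_ROW)))
```

Keep in mind that the first row vmstat prints is an average since boot, so when judging current activity it is the later rows that matter.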

Java processes and faulting

$ sudo sar -B 3 5
Linux 2.6.32-642.3.1.el6.x86_64 (typhon)        07/26/2016      _x86_64_        (2 CPU)

05:40:16 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
05:40:19 PM      0.00      6.67     12.33      0.00     22.67      0.00      0.00      0.00      0.00
05:40:22 PM      0.00      0.00     14.00      0.00     25.00      0.00      0.00      0.00      0.00
05:40:25 PM      0.00      0.00     16.33      0.00     28.67      0.00      0.00      0.00      0.00
05:40:28 PM      0.00      0.00     14.33      0.00     26.33      0.00      0.00      0.00      0.00
05:40:31 PM      0.00      0.00     11.00      0.00     22.67      0.00      0.00      0.00      0.00
Average:         0.00      1.33     13.60      0.00     25.07      0.00      0.00      0.00      0.00

What is the difference between a "fault", sometimes known as a "soft fault", and a "major fault" (aka "hard fault")? A soft fault happens when the process needs a page that is already in memory but is not currently mapped into the process, for example a page that was freed by the page replacement process but has not yet been reused. A major or "hard" fault happens when the page needs to be brought into memory from disk. Major faults are, of course, much more expensive and take much longer to complete than soft ones. A large number of major page faults can slow the system down to a crawl. On a system under memory pressure, major page faults can account for a large share of the time spent waiting in the kernel.

Also look at major faults on the Java process with ps:

$ sudo ps -o pid,ppid,flags,rss,resident,size,min_flt,maj_flt,share,vsize 3275
   PID  PPID F   RSS   RES    SZ  MINFL  MAJFL -    VSZ
  3275  3239 0 1555480   - 2871068 443437    0 - 2972676

"MAJFL" should ideally stay at zero. VSZ (virtual size) is the virtual size of the whole process, in core and out. This number should equal your JVM maximum heap size plus some overhead. RSS is the resident set size, which is how much of the VSZ is resident in core at this moment. This number will fluctuate as garbage collection (GC) runs. SZ is size: how much RAM is mapped in physical pages but not necessarily in core right now. VSZ covers memory mapped in both physical pages and virtual ones (virtual meaning swapped to disk as well).

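The same fault counters that ps prints as MINFL and MAJFL can be read from /proc/<pid>/stat, where minflt is field 10 and majflt is field 12. The Python sketch below shows one way to extract them; the function name is hypothetical, and the sample line is a shortened, made-up stat line built from the ps output above.

```python
def fault_counts(stat_text):
    """Return (minflt, majflt) from the contents of /proc/<pid>/stat."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split after the last ")".
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field 10 is rest[7] and field 12 is rest[9].
    return int(rest[7]), int(rest[9])


# Made-up, truncated stat line for PID 3275 (a real one has many more fields).
SAMPLE_STAT = "3275 (java) S 3239 3275 3239 0 -1 4202496 443437 0 0 0 1200 300"

minor, major = fault_counts(SAMPLE_STAT)
print("minor faults:", minor, "major faults:", major)
```

Sampling these two numbers a few seconds apart and taking the difference gives a per-process fault rate, which is more telling than the lifetime totals.
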
OOM killer

OOM or Out of memory errors and adjustment.

https://linux-mm.org/OOM_Killer - "It is the job of the linux 'oom killer' to sacrifice one or more processes in order to free up memory for the system when all else fails. It will also kill any process sharing the same mm_struct as the selected process, for obvious reasons. Any particular process leader may be immunized against the oom killer if the value of its /proc/<pid>/oomadj is set to the constant OOM_DISABLE (currently defined as -17)."

An example of an OOM in the system log (usually /var/log/messages):

 Mar 13 15:40:33 web1 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0,
   oom_adj=0, oom_score_adj=0
 Mar 13 15:40:33 web1 kernel: mysqld cpuset=/ mems_allowed=0
 Mar 13 15:40:33 web1 kernel: Pid: 18355, comm: mysqld Not tainted 3.2.1-linode40 #1
 Mar 13 15:40:33 web1 kernel: Call Trace:
 Mar 13 15:40:33 web1 kernel: [<c01958fd>] ? T.662+0x7d/0x1b0
 Mar 13 15:40:33 web1 kernel: [<c01067fb>] ? xen_restore_fl_direct_reloc+0x4/0x4
 Mar 13 15:40:33 web1 kernel: [<c06e1431>] ? _raw_spin_unlock_irqrestore+0x11/0x20
 Mar 13 15:40:33 web1 kernel: [<c04707a7>] ? ___ratelimit+0x97/0x110
 Mar 13 15:40:33 web1 kernel: [<c0158fb1>] ? get_task_cred+0x11/0x50
 Mar 13 15:40:33 web1 kernel: [<c0195a8e>] ? T.661+0x5e/0x150

I've just read the kernel documentation for "oom_adj" (filesystems/proc.txt):

 2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score
 ------------------------------------------------------
 This file can be used to adjust the score used to select which processes
 should be killed in an  out-of-memory  situation.  Giving it a high score will
 increase the likelihood of this process being killed by the oom-killer.  Valid
 values are in the range -16 to +15, plus the special value -17, which disables
 oom-killing altogether for this process.
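
The value rules quoted above are simple enough to encode. This Python sketch checks a candidate value before it is written to the per-process file; the function names are hypothetical, and only the ranges stated in the quoted documentation are assumed.

```python
OOM_DISABLE = -17            # special value: never kill this process
OOM_ADJ_MIN, OOM_ADJ_MAX = -16, 15


def check_oom_adj(value):
    """Classify an oom_adj value, raising ValueError if it is out of range."""
    if value == OOM_DISABLE:
        return "disabled"
    if OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        return "adjusted"
    raise ValueError(f"oom_adj must be -17 or between -16 and 15, got {value}")


def oom_adj_path(pid):
    """Path of the per-process file described in the documentation above."""
    return f"/proc/{pid}/oom_adj"


print(oom_adj_path(18355), check_oom_adj(-17))
```

Writing the value requires root. Newer kernels deprecate oom_adj in favor of /proc/<pid>/oom_score_adj, which takes values from -1000 to 1000, so check which file your kernel expects.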

TLB, Huge Pages and transparent Huge Pages

TBA

A translation lookaside buffer (TLB) is a memory cache that stores recent virtual-to-physical address translations and is used to reduce the time taken to access a user memory location.

NUMA and memory fragmentation

Work in progress.

As processor workloads and memory requirements increased, they outpaced physical hardware configurations, requiring multiprocessor systems with multiple memory backplanes that are no longer directly addressable from a single CPU, or even from a multi-core one. This is not a new concept; I remember working on these types of systems back in the mid to late 90s, with machines such as the Sun E10K and IBM RS6000/SP or "Silver nodes".

From the numa man page:

"Non-Uniform Memory Access (NUMA) refers to multiprocessor systems whose memory is divided into multiple memory nodes. The access time of a memory node depends on the relative locations of the accessing CPU and the accessed node. (This contrasts with a symmetric multiprocessor system, where the access time for all of the memory is the same for all CPUs.) Normally, each CPU on a NUMA system has a local memory node whose contents can be accessed faster than the memory in the node local to another CPU or the memory on a bus shared by all CPUs."

http://rhelblog.redhat.com/2015/01/12/mysteries-of-numa-memory-management-revealed/

http://andorian.blogspot.com/2014/03/making-sense-of-procbuddyinfo.html

This is coming back into focus for me again, as AWS now has instances as large as the p2.16xlarge, with 16 GPUs, 64 cores, and 732 GiB of physical memory. These kinds of configurations can only be obtained through NUMA.