Linux Memory and Swap Monitoring and Tuning
Common memory and swap commands
- cat /proc/meminfo
- dmidecode -t 17 - Shows installed RAM
- cat /proc/<pid>/status - Shows virtual memory statistics for a process
Linux memory management
For an overview of Linux memory management that is dated but still largely relevant, see http://www.linuxhowtos.org/System/Linux%20Memory%20Management.htm.
It is important to note that Linux will always try to cache the most recently used (MRU) files if memory is available. To help performance, it fills otherwise unused memory with cache, up to a certain limit.
For a look at the kernel's slab caches, look at the file /proc/slabinfo. For more info on slabinfo use the command "$ man slabinfo".
Looking only at the output of the free or top commands, or at how much memory is used, is not enough to paint a clear picture of memory allocation.
For example, showing active and inactive memory:
$ vmstat -a
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd    free  inact active   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 7173612 181684 521708    0    0     2     1   21    8  0  1 99  0  0
Active memory is memory that has been used recently and is usually not reclaimed unless necessary. Inactive memory is memory that has not been used recently and is the first candidate to be reclaimed for other purposes. For an even more detailed breakdown of memory, look at the special file /proc/meminfo:
$ cat /proc/meminfo
MemTotal:        8057792 kB
MemFree:         7173628 kB
Buffers:           98160 kB
Cached:           436424 kB
SwapCached:            0 kB
Active:           521712 kB
Inactive:         181732 kB
Active(anon):     169004 kB
Inactive(anon):       12 kB
Active(file):     352708 kB
Inactive(file):   181720 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        168856 kB
Mapped:            14644 kB
Shmem:               160 kB
Slab:             102796 kB
SReclaimable:      80104 kB
SUnreclaim:        22692 kB
KernelStack:        2000 kB
PageTables:         4440 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4028896 kB
Committed_AS:     190744 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       27420 kB
VmallocChunk:   34359705844 kB
HardwareCorrupted:     0 kB
AnonHugePages:    135168 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        6144 kB
DirectMap2M:     8382464 kB
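As a rough sketch of how these fields can be read programmatically, the Python below parses /proc/meminfo-style text and adds up free memory, buffers and page cache. The function names are made up for this page, and treating MemFree + Buffers + Cached as "free or easily reclaimable" is a simplifying assumption, not an exact accounting.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}.

    Values are in kB where the file gives a unit; counters such as
    HugePages_Total are plain numbers.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # skip anything that is not "Name: value"
        fields[name.strip()] = int(rest.split()[0])
    return fields


def rough_reclaimable_kb(fields):
    """Rough estimate (an assumption): free memory plus buffers and page cache."""
    return fields["MemFree"] + fields["Buffers"] + fields["Cached"]


# A few lines copied from the sample /proc/meminfo output on this page.
SAMPLE = """\
MemTotal: 8057792 kB
MemFree: 7173628 kB
Buffers: 98160 kB
Cached: 436424 kB
HugePages_Total: 0
"""

info = parse_meminfo(SAMPLE)
print(rough_reclaimable_kb(info), "kB free or easily reclaimable")
```

On a live system the same function can be fed the file contents, for example parse_meminfo(open("/proc/meminfo").read()).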
For a better picture of active memory you need the ps command or other such tools (pmap) that look at process allocations. For Java you need to look at the heap size (-Xmx on the JVM). Also, if you are swapping you very likely need more memory.
The ps command can output various pieces of information about a process, such as its process id, current running state, and resource utilization. Two of the possible outputs are VSZ and RSS, which stand for "virtual memory size" and "resident set size".
$ sudo ps ux -q 1457
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1457  0.0  0.0  66240  1200 ?        Ss   09:41   0:00 /usr/sbin/sshd
In the above example, process PID 1457 has a virtual size of about 66 MB (ps reports in kilobytes) and a resident size of 1.2 MB. ps is not reporting the real memory usage of processes. What it is really doing is showing how much real memory this process would take up if it were the only process running. A Linux machine has several dozen processes running at any given time, which means that the VSZ and RSS numbers reported by ps almost certainly overstate each process's real cost.
To understand why, it is necessary to learn how Linux handles shared libraries in programs. Shared libraries, especially commonly referenced ones like libc, are used by many of the programs running on a Linux system. Because of this sharing, Linux is able to load a single copy of a shared library into memory and use that one copy for every program that references it. Most tools don't care very much about sharing; they simply report how much memory a process uses, regardless of whether that memory is shared with other processes as well. Two programs could therefore use a large shared library and yet have its size count towards both of their memory usage totals; the library is being double-counted, which can be very misleading if you don't know what is going on.
Unfortunately, a perfect representation of process memory usage isn't easy to obtain.
Seeing a process's memory map
Let's see what the situation is with that "huge" SSH process. To see what PID 1457's memory looks like, we'll use the pmap program (with the -d flag):
$ sudo pmap -d 1457
1457:   /usr/sbin/sshd
Address           Kbytes Mode  Offset           Device    Mapping
00007f1cb346e000      52 r-x-- 0000000000000000 0ca:00001 libnss_files-2.12.so
00007f1cb347b000    2044 ----- 000000000000d000 0ca:00001 libnss_files-2.12.so
00007f1cb367a000       4 r---- 000000000000c000 0ca:00001 libnss_files-2.12.so
00007f1cb367b000       4 rw--- 000000000000d000 0ca:00001 libnss_files-2.12.so
00007f1cb367c000      28 r-x-- 0000000000000000 0ca:00001 librt-2.12.so
00007f1cb3683000    2044 ----- 0000000000007000 0ca:00001 librt-2.12.so
00007f1cb3882000       4 r---- 0000000000006000 0ca:00001 librt-2.12.so
00007f1cb3883000       4 rw--- 0000000000007000 0ca:00001 librt-2.12.so
00007f1cb3884000     228 r-x-- 0000000000000000 0ca:00001 libnspr4.so
00007f1cb38bd000    2048 ----- 0000000000039000 0ca:00001 libnspr4.so
00007f1cb3abd000       4 r---- 0000000000039000 0ca:00001 libnspr4.so
00007f1cb3abe000       8 rw--- 000000000003a000 0ca:00001 libnspr4.so
00007f1cb3ac0000       8 rw--- 0000000000000000 000:00000   [ anon ]
00007f1cb3ac2000      12 r-x-- 0000000000000000 0ca:00001 libplds4.so
00007f1cb3ac5000    2044 ----- 0000000000003000 0ca:00001 libplds4.so
00007f1cb3cc4000       4 r---- 0000000000002000 0ca:00001 libplds4.so
00007f1cb3cc5000       4 rw--- 0000000000003000 0ca:00001 libplds4.so
00007f1cb3cc6000      16 r-x-- 0000000000000000 0ca:00001 libplc4.so
00007f1cb3cca000    2044 ----- 0000000000004000 0ca:00001 libplc4.so
00007f1cb3ec9000       4 r---- 0000000000003000 0ca:00001 libplc4.so
00007f1cb3eca000       4 rw--- 0000000000004000 0ca:00001 libplc4.so
00007f1cb3ecb000     152 r-x-- 0000000000000000 0ca:00001 libnssutil3.so
00007f1cb3ef1000    2044 ----- 0000000000026000 0ca:00001 libnssutil3.so
00007f1cb40f0000      24 r---- 0000000000025000 0ca:00001 libnssutil3.so
00007f1cb40f6000       4 rw--- 000000000002b000 0ca:00001 libnssutil3.so
...
00007f1cb763d000       8 r---- 000000000001f000 0ca:00001 ld-2.12.so
00007f1cb763f000       4 rw--- 0000000000021000 0ca:00001 ld-2.12.so
00007f1cb7640000       4 rw--- 0000000000000000 000:00000   [ anon ]
00007f1cb7641000     544 r-x-- 0000000000000000 0ca:00001 sshd
00007f1cb78c8000      12 r---- 0000000000087000 0ca:00001 sshd
00007f1cb78cb000       4 rw--- 000000000008a000 0ca:00001 sshd
00007f1cb78cc000      36 rw--- 0000000000000000 000:00000   [ anon ]
00007f1cb8f60000     132 rw--- 0000000000000000 000:00000   [ anon ]
00007ffc200b3000      84 rw--- 0000000000000000 000:00000   [ stack ]
00007ffc2010f000       4 r-x-- 0000000000000000 000:00000   [ anon ]
ffffffffff600000       4 r-x-- 0000000000000000 000:00000   [ anon ]
mapped: 66240K    writeable/private: 816K    shared: 0K
A lot of the output has been cut; the rest is similar to what is shown. Even without the complete output, we can see some very interesting things. One important thing to note is that each shared library is listed more than once: once for its code segment and again for its data segments. The code (also known as "text") segment can be shared, but the writable data segment cannot; each process gets its own private copy, so multiple copies of the same data can exist in memory. The code segments have a mode of "r-x--", while the writable data is set to "rw---". The Kbytes, Mode, and Mapping columns are the only ones we care about; the rest are unimportant to our analysis.
If you go through the output, you will find that the lines with the largest Kbytes number are usually the code segments of the included shared libraries (the ones that start with "lib" are the shared libraries). What is great about that is that they are the ones that can be shared between processes. If you factor out all of the parts that are shared between processes, you end up with the "writeable/private" total, which is shown at the bottom of the output. This is what can be considered the incremental cost of this process, factoring out the shared libraries. Therefore, the cost to run this instance of SSH (assuming that all of the shared libraries were already loaded) is around 800 kilobytes. That is quite a different story from the 66 or 1.2 megabytes that ps reported.
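The "writeable/private" figure can be approximated by hand from the mapping lines. The sketch below is only an illustration: the function name is made up, the sample is the tail of the sshd listing on this page, and it assumes that in the pmap -d Mode column the second character marks a writable mapping and the fourth marks a shared one.

```python
def summarize_pmap(text):
    """Return (mapped_kb, writable_private_kb) from `pmap -d` mapping lines."""
    mapped = writable_private = 0
    for line in text.splitlines():
        parts = line.split()
        # Mapping lines look like: address, Kbytes, mode, offset, device, name.
        # Header and summary lines do not have a plain number in column 2.
        if len(parts) < 5 or not parts[1].isdigit() or len(parts[2]) < 4:
            continue
        kbytes, mode = int(parts[1]), parts[2]
        mapped += kbytes
        # Assumption: mode[1] == "w" means writable, mode[3] == "s" means shared.
        if mode[1] == "w" and mode[3] != "s":
            writable_private += kbytes
    return mapped, writable_private


# The last few mapping lines of the sshd example (not the full listing).
SAMPLE = """\
Address           Kbytes Mode  Offset           Device    Mapping
00007f1cb763d000       8 r---- 000000000001f000 0ca:00001 ld-2.12.so
00007f1cb763f000       4 rw--- 0000000000021000 0ca:00001 ld-2.12.so
00007f1cb7640000       4 rw--- 0000000000000000 000:00000   [ anon ]
00007f1cb7641000     544 r-x-- 0000000000000000 0ca:00001 sshd
00007f1cb78c8000      12 r---- 0000000000087000 0ca:00001 sshd
00007f1cb78cb000       4 rw--- 000000000008a000 0ca:00001 sshd
00007f1cb78cc000      36 rw--- 0000000000000000 000:00000   [ anon ]
00007f1cb8f60000     132 rw--- 0000000000000000 000:00000   [ anon ]
00007ffc200b3000      84 rw--- 0000000000000000 000:00000   [ stack ]
"""

mapped_kb, private_kb = summarize_pmap(SAMPLE)
print(f"mapped: {mapped_kb}K  writeable/private: {private_kb}K")
```

Because only part of the listing is used, the totals here are smaller than the 66240K and 816K that pmap reports for the whole process.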
Keep that in mind when sizing applications. The moral of this story is that process memory usage on Linux is a complex matter; you can't just run ps and know what is going on. This is especially true when you deal with programs that create a lot of identical child processes, like Apache. ps might report that each Apache process uses 10 megabytes of memory, when the reality might be that the marginal cost of each Apache process is 1 megabyte of memory. This information becomes critical when tuning Apache's MaxClients setting, which determines how many simultaneous requests your server can handle.
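To make the difference concrete, here is a small back-of-the-envelope calculation. The 10 MB and 1 MB figures come from the Apache example above; the 1024 MB memory budget, the 9 MB of shared pages, and the function name are assumptions chosen purely for illustration.

```python
def estimate_max_workers(budget_mb, shared_mb, marginal_mb):
    """Workers that fit when shared pages are paid for once
    and each worker adds only its private (marginal) memory."""
    if marginal_mb <= 0:
        raise ValueError("marginal_mb must be positive")
    return max(0, (budget_mb - shared_mb) // marginal_mb)


BUDGET_MB = 1024      # hypothetical memory set aside for Apache
REPORTED_MB = 10      # per-process size as reported by ps
MARGINAL_MB = 1       # private cost of each extra process
SHARED_MB = REPORTED_MB - MARGINAL_MB  # assumed to be shared by all workers

naive = BUDGET_MB // REPORTED_MB
realistic = estimate_max_workers(BUDGET_MB, SHARED_MB, MARGINAL_MB)
print("sizing from ps alone:     ", naive)
print("sizing from marginal cost:", realistic)
```

Treat the second number as an upper bound rather than a recommended setting; real workers grow under load.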
Linux divides its physical RAM (random access memory) into chunks of memory called pages. Swapping is the process whereby a page of memory is copied to the preconfigured space on the hard disk, called swap space, to free up that page of memory. The combined sizes of the physical memory and the swap space are the amount of virtual memory available.
Swapping is necessary for two important reasons. First, when the system requires more memory than is physically available, the kernel swaps out less used pages and gives memory to the current application (process) that needs the memory immediately. Second, a significant number of the pages used by an application during its startup phase may only be used for initialization and then never used again. The system can swap out those pages and free the memory for other applications or even for the disk cache.
However, swapping does have a downside. Compared to memory, disks are very slow. Memory speeds can be measured in nanoseconds, while disks are measured in milliseconds, so accessing the disk can be tens of thousands of times slower than accessing physical memory. The more swapping that occurs, the slower your system will be. Sometimes excessive swapping, or thrashing, occurs, where a page is swapped out, very soon swapped in, then swapped out again, and so on. In such situations the system is struggling to find free memory and keep applications running at the same time. In this case only adding more RAM will help.
Linux has two forms of swap space: the swap partition and the swap file. The swap partition is an independent section of the hard disk used solely for swapping; no other files can reside there. The swap file is a special file in the filesystem that resides amongst your system and data files.
To see what swap space you have, use the command swapon -s. The output will look something like this:
Filename                Type            Size    Used    Priority
/dev/sda5               partition       859436  0       -1
It is possible to run a Linux system without swap space, and the system will run well if you have a large amount of memory. However, if you run out of physical memory the kernel has nothing left to fall back on: it must kill processes to free memory, or the system may hang. It is therefore advisable to have swap space, especially since disk space is relatively cheap.
See also Why isn't swap in TGIE supplied AMIs?.
The Linux 2.6 kernel added a new kernel parameter called swappiness to let administrators tweak the way Linux swaps. It is a number from 0 to 100. In essence, higher values lead to more pages being swapped, and lower values lead to more applications being kept in memory, even if they are idle.
The default value for swappiness is 60. You can alter it temporarily (until you next reboot) by typing as root:
echo 50 > /proc/sys/vm/swappiness
If you want to alter it permanently then you need to change the vm.swappiness parameter in the /etc/sysctl.conf file.
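As a small sketch, a helper like the one below can sanity-check a value before it goes into /etc/sysctl.conf. The function name is made up for this page, and the 0 to 100 range simply follows the description above.

```python
def sysctl_swappiness_line(value):
    """Return the /etc/sysctl.conf line for a swappiness value.

    Assumes the 0-100 range described on this page.
    """
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("swappiness must be an integer")
    if not 0 <= value <= 100:
        raise ValueError("swappiness must be between 0 and 100")
    return f"vm.swappiness = {value}"


print(sysctl_swappiness_line(50))
```

The current value can be read back from /proc/sys/vm/swappiness to confirm that a change took effect.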
Measuring swap activity
Paging is not the same as swapping. You might see paging activity when executables are started and portions of their binary code are read off disk, or when working with memory-mapped files. Essentially, paging is loading code (text) and data into memory as part of the RSS. Swapping is moving your process out of memory entirely into swap space. Swapping is a hugely expensive operation compared to demand loading (paging).
A fault is loading a portion of memory from disk (as part of the VSZ) into RSS and making it resident in memory for use.
For the above reasons I do not rely on the sar -W command. To get a better idea of swap activity, take a look at the si / so counters of vmstat.
$ sudo vmstat 5 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st
 0  0      0 7350088  50412 365992    0    0     7     2   23   17  0  0 100  0  0
 0  0      0 7350088  50412 365992    0    0     0     0   17   10  0  0 100  0  0
 0  0      0 7350088  50420 365992    0    0     0     3   18   12  0  0 100  0  0
 0  0      0 7350088  50420 365992    0    0     0     0   13    9  0  0 100  0  0
 0  0      0 7350088  50420 365992    0    0     0     0   21   11  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     2   15   11  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     2   21   11  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     0   15   10  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     0   15    9  0  0 100  0  0
 0  0      0 7350088  50428 365992    0    0     0     0   18   12  0  0 100  0  0
The above system is not swapping as si and so are zero.
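The same check can be scripted. The sketch below (function names are made up for this page) finds the si and so columns from the vmstat header line and reports whether any sample shows swap-in or swap-out activity. The sample text is the header plus the first two rows of the output above.

```python
def swap_activity(vmstat_text):
    """Return a list of (si, so) pairs, one per sample row of vmstat output."""
    si_idx = so_idx = None
    samples = []
    for line in vmstat_text.splitlines():
        cols = line.split()
        if "si" in cols and "so" in cols:
            # Column header line: remember where si and so are.
            si_idx, so_idx = cols.index("si"), cols.index("so")
        elif si_idx is not None and len(cols) > max(si_idx, so_idx) \
                and all(c.isdigit() for c in cols):
            samples.append((int(cols[si_idx]), int(cols[so_idx])))
    return samples


def is_swapping(vmstat_text):
    """True if any sample has a non-zero si or so value."""
    return any(si or so for si, so in swap_activity(vmstat_text))


SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st
 0  0      0 7350088  50412 365992    0    0     7     2   23   17  0  0 100  0  0
 0  0      0 7350088  50412 365992    0    0     0     0   17   10  0  0 100  0  0
"""

print("swapping" if is_swapping(SAMPLE) else "not swapping")
```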
Java processes and faulting
$ sudo sar -B 3 5
Linux 2.6.32-642.3.1.el6.x86_64 (typhon)    07/26/2016    _x86_64_    (2 CPU)

05:40:16 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
05:40:19 PM      0.00      6.67     12.33      0.00     22.67      0.00      0.00      0.00      0.00
05:40:22 PM      0.00      0.00     14.00      0.00     25.00      0.00      0.00      0.00      0.00
05:40:25 PM      0.00      0.00     16.33      0.00     28.67      0.00      0.00      0.00      0.00
05:40:28 PM      0.00      0.00     14.33      0.00     26.33      0.00      0.00      0.00      0.00
05:40:31 PM      0.00      0.00     11.00      0.00     22.67      0.00      0.00      0.00      0.00
Average:         0.00      1.33     13.60      0.00     25.07      0.00      0.00      0.00      0.00
What is the difference between a "fault", sometimes known as a "soft fault", and a "major fault" (aka "hard fault")? A soft fault happens when the process needs a page that is already in memory but is not currently mapped for it, for example because it was freed by the page replacement process. A major or "hard" fault happens when the page needs to be brought into memory from disk. Major faults are, of course, much more expensive and take much longer to complete than the soft ones. A large number of major page faults can slow the system down to a crawl. On a system under memory pressure, major page faults can be responsible for much of the time spent in kernel mode.
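The kernel also keeps both counters per process in /proc/<pid>/stat, where minflt is field 10 and majflt is field 12. The sketch below pulls them out of one such line. The function name is made up, and the sample line is synthetic, built only to show the field layout.

```python
def fault_counts(stat_line):
    """Return (minflt, majflt) from the contents of /proc/<pid>/stat."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ")", so split on the last ")" first.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 10 is rest[7] and field 12 is rest[9].
    return int(rest[7]), int(rest[9])


# Synthetic example line (truncated after the fault counters).
SAMPLE = "3275 (java) S 3239 3275 3239 0 -1 4202496 443437 0 0 0"

minor, major = fault_counts(SAMPLE)
print(f"minor faults: {minor}, major faults: {major}")
```

Both counters are cumulative since the process started, so it is the rate of change between two readings that matters.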
Also look at major faults on the Java process with ps:
$ sudo ps -o pid,ppid,flags,rss,resident,size,min_flt,maj_flt,share,vsize 3275
  PID  PPID F     RSS RES      SZ  MINFL  MAJFL -     VSZ
 3275  3239 0 1555480   - 2871068 443437      0 - 2972676
"MAJFL" should stay at or near zero. VSZ (virtual size) is the virtual size of the whole process, in core and out. This number should roughly equal your JVM maximum heap size plus some overhead. RSS is the resident set size: how much of the VSZ is resident in core at this moment. This number will fluctuate as garbage collection (GC) runs. SZ is size: how much RAM is mapped in physical pages but not necessarily in core right now. VSZ is the number of pages mapped in both physical pages and virtual ones (virtual meaning swapped to disk also).
OOM (out of memory) errors and adjustment
https://linux-mm.org/OOM_Killer - "It is the job of the linux 'oom killer' to sacrifice one or more processes in order to free up memory for the system when all else fails. It will also kill any process sharing the same mm_struct as the selected process, for obvious reasons. Any particular process leader may be immunized against the oom killer if the value of its /proc/<pid>/oomadj is set to the constant OOM_DISABLE (currently defined as -17)."
An example of an OOM in the system log (usually /var/log/messages):
Mar 13 15:40:33 web1 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Mar 13 15:40:33 web1 kernel: mysqld cpuset=/ mems_allowed=0
Mar 13 15:40:33 web1 kernel: Pid: 18355, comm: mysqld Not tainted 3.2.1-linode40 #1
Mar 13 15:40:33 web1 kernel: Call Trace:
Mar 13 15:40:33 web1 kernel: [<c01958fd>] ? T.662+0x7d/0x1b0
Mar 13 15:40:33 web1 kernel: [<c01067fb>] ? xen_restore_fl_direct_reloc+0x4/0x4
Mar 13 15:40:33 web1 kernel: [<c06e1431>] ? _raw_spin_unlock_irqrestore+0x11/0x20
Mar 13 15:40:33 web1 kernel: [<c04707a7>] ? ___ratelimit+0x97/0x110
Mar 13 15:40:33 web1 kernel: [<c0158fb1>] ? get_task_cred+0x11/0x50
Mar 13 15:40:33 web1 kernel: [<c0195a8e>] ? T.661+0x5e/0x150
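When scanning logs for these events, a small parser helps. The sketch below assumes the "invoked oom-killer" line has the same layout as the first log line above; other kernel versions may format it differently. The function name and the regular expression are made up for this page.

```python
import re

# Matches e.g. "... kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, ..."
OOM_RE = re.compile(r"kernel: (?P<comm>\S+) invoked oom-killer: (?P<params>.*)$")


def parse_oom_line(line):
    """Return a dict describing an 'invoked oom-killer' log line, or None."""
    match = OOM_RE.search(line.strip())
    if match is None:
        return None
    result = {"comm": match.group("comm")}
    for item in match.group("params").split(","):
        if "=" in item:
            key, value = item.strip().split("=", 1)
            result[key] = value  # values are kept as strings
    return result


LINE = ("Mar 13 15:40:33 web1 kernel: mysqld invoked oom-killer: "
        "gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0")

print(parse_oom_line(LINE))
```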
I've just read the kernel documentation for "oom_adj" (filesystems/proc.txt):
2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score
------------------------------------------------------
This file can be used to adjust the score used to select which processes should be killed in an out-of-memory situation. Giving it a high score will increase the likelihood of this process being killed by the oom-killer. Valid values are in the range -16 to +15, plus the special value -17, which disables oom-killing altogether for this process.
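A minimal sketch of the rule quoted above, useful for checking a value before writing it to the file. The function names are made up for this page; the range and the special value -17 come straight from the quoted documentation.

```python
OOM_DISABLE = -17  # special value from the quoted kernel documentation


def describe_oom_adj(value):
    """Classify an oom_adj value according to the documented range."""
    if value == OOM_DISABLE:
        return "oom-killing disabled for this process"
    if -16 <= value <= 15:
        return "score adjusted"
    raise ValueError("oom_adj must be -17 or between -16 and +15")


def oom_adj_path(pid):
    """Path of the oom_adj file for a given process id."""
    return f"/proc/{pid}/oom_adj"


print(oom_adj_path(18355), "->", describe_oom_adj(-17))
```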
TLB, Huge Pages and transparent Huge Pages
A translation lookaside buffer (TLB) is a memory cache that stores recent virtual-to-physical address translations and is used to reduce the time taken to access a user memory location.
NUMA and memory fragmentation
Work in progress.
Processor workloads and memory requirements have increased to the point where they outpace simple hardware configurations. They now require multiprocessor systems with multiple memory backplanes that are no longer directly addressable from a single CPU, or even from one multi-core CPU. This is not a new concept; I remember working on these types of systems back in the mid to late 1990s with machines such as the Sun E10K and IBM RS6000/SP or "Silver nodes".
From the numa man page:
"Non-Uniform Memory Access (NUMA) refers to multiprocessor systems whose memory is divided into multiple memory nodes. The access time of a memory node depends on the relative locations of the accessing CPU and the accessed node. (This contrasts with a symmetric multiprocessor system, where the access time for all of the memory is the same for all CPUs.) Normally, each CPU on a NUMA system has a local memory node whose contents can be accessed faster than the memory in the node local to another CPU or the memory on a bus shared by all CPUs."
Work in progress.
This is coming back into focus for me again as AWS now has instances as large as p2.16xlarge, with 16 GPUs, 64 cores, and 732 GiB of physical memory. These kinds of configurations can only be obtained through NUMA.