Common Linux Kernel Parameters

Background

To remain usable across a wide range of hardware, the operating system ships with very tolerant default values for many settings.
Left untuned, these defaults may not suit HPC environments or even moderately well-equipped hardware.
They can keep the hardware from delivering its full performance and may even interfere with certain applications, databases in particular.

OS Kernel Parameters That Databases Care About

The examples below use a host with 512 GB of RAM.
1. fs.aio-max-nr

Parameter description

```
aio-nr & aio-max-nr:

aio-nr is the running total of the number of events specified on the
io_setup system call for all currently active aio contexts.

If aio-nr reaches aio-max-nr then io_setup will fail with EAGAIN.

Note that raising aio-max-nr does not result in the pre-allocation or
re-sizing of any kernel data structures.

aio-nr shows the current system-wide number of asynchronous io requests.
aio-max-nr allows you to change the maximum value aio-nr can grow to.
```

Recommended setting

```
fs.aio-max-nr = 1xxxxxx
```

PostgreSQL and Greenplum never call io_setup to create aio contexts, so they do not require this setting. Oracle does need it if the database is to use aio. Setting it does no harm either way: if you adopt asynchronous IO later, the value will not have to be changed again.
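To see how close the system currently is to the limit, both counters can be read straight from procfs. A minimal sketch follows; the concrete value 1048576 is only illustrative, since the article leaves the exact digits elided:

```bash
# current number of active aio events vs. the system-wide ceiling
cat /proc/sys/fs/aio-nr /proc/sys/fs/aio-max-nr

# apply at runtime, then persist across reboots
sysctl -w fs.aio-max-nr=1048576
echo "fs.aio-max-nr = 1048576" >> /etc/sysctl.conf
sysctl -p
```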
2. fs.file-max

Parameter description

```
file-max & file-nr:

The value in file-max denotes the maximum number of file handles that the
Linux kernel will allocate.

When you get lots of error messages about running out of file handles,
you might want to increase this limit.

Historically, the kernel was able to allocate file handles dynamically,
but not to free them again.

The three values in file-nr denote:
  the number of allocated file handles,
  the number of allocated but unused file handles,
  the maximum number of file handles.

Linux 2.6 always reports 0 as the number of free file handles -- this is
not an error, it just means that the number of allocated file handles
exactly matches the number of used file handles.

Attempts to allocate more file descriptors than file-max are reported
with printk, look for "VFS: file-max limit <number> reached".
```

Recommended setting

```
fs.file-max = 7xxxxxxx
```

PostgreSQL maintains its own virtual file descriptor layer: the files it "opens" are mapped onto a bounded set of kernel file handles that it opens and closes itself (see the max_files_per_process parameter), so in practice it needs far fewer kernel file handles than this. As a rough sizing exercise: assume 1 GB of RAM supports 100 connections and each connection opens 1,000 files, so one PostgreSQL instance needs about 100,000 open files; a 512 GB machine could in theory run 500 such instances, which works out to 50 million file handles. The setting above leaves ample headroom.
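To compare actual kernel file-handle usage against the ceiling, and to redo the back-of-the-envelope arithmetic above, something like the following works (the numbers are simply the worked example from the text):

```bash
# allocated handles, allocated-but-unused handles, and the file-max limit
cat /proc/sys/fs/file-nr

# sizing sketch: 1000 files/connection * 100 connections per 1 GB instance * 500 instances
echo $((1000 * 100 * 500))   # -> 50000000 file handles
```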
3. kernel.core_pattern

Parameter description
```
core_pattern:

core_pattern is used to specify a core dumpfile pattern name.
. max length 128 characters; default value is "core"
. core_pattern is used as a pattern template for the output filename;
  certain string patterns (beginning with '%') are substituted with
  their actual values.
. backward compatibility with core_uses_pid:
      If core_pattern does not include "%p" (default does not)
      and core_uses_pid is set, then .PID will be appended to
      the filename.
. corename format specifiers:
      %<NUL>   '%' is dropped
      %%       output one '%'
      %p       pid
      %P       global pid (init PID namespace)
      %i       tid
      %I       global tid (init PID namespace)
      %u       uid
      %g       gid
      %d       dump mode, matches PR_SET_DUMPABLE and /proc/sys/fs/suid_dumpable
      %s       signal number
      %t       UNIX time of dump
      %h       hostname
      %e       executable filename (may be shortened)
      %E       executable path
      %<OTHER> both are dropped
. If the first character of the pattern is a '|', the kernel will treat
  the rest of the pattern as a command to run.  The core dump will be
  written to the standard input of that program instead of to a file.
```
Recommended setting

```
kernel.core_pattern = /xxx/core_%e_%u_%t_%s.%p

mkdir /xxx
chmod 777 /xxx
```

The target directory must have 777 permissions; if it is a symlink, the real directory behind it needs 777 as well. Also leave enough free space for the dumps.
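A quick way to confirm the pattern works end to end is to force a crash from a throwaway shell. The directory below is the placeholder path from the text, so substitute your real dump directory:

```bash
# allow core files in this shell, then deliberately crash a child process
ulimit -c unlimited
bash -c 'kill -SEGV $$'

# a file such as /xxx/core_bash_<uid>_<timestamp>_11.<pid> should now exist
ls -l /xxx/
```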
4. kernel.sem

Parameter description

```
kernel.sem = 4096 2147483647 2147483646 512000

4096        semaphores per set  (>= 17; PostgreSQL groups its processes 16 to a set, and each set needs 17 semaphores)
2147483647  semaphores system-wide  (2^31-1, and larger than 4096*512000)
2147483646  operations per semop() call  (2^31-1)
512000      number of semaphore sets  (assuming 100 connections per GB, 512 GB supports 51200 connections; even counting auxiliary processes, > 51200*2/16 sets is far more than enough)

# sysctl -w kernel.sem="4096 2147483647 2147483646 512000"

# ipcs -s -l

------ Semaphore Limits --------
max number of arrays = 512000
max semaphores per array = 4096
max semaphores system wide = 2147483647
max ops per semop call = 2147483646
semaphore max value = 32767
```

Recommended setting

```
kernel.sem = 4096 2147483647 2147483646 512000
```

A per-set limit of 4096 covers more scenarios, and erring on the large side does no harm; the key figure is the 512000 semaphore sets, which is also plenty.
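To sanity-check the sizing against your own connection target, the arithmetic from the text can be reproduced directly; the 100-connections-per-GB figure is the article's assumption, not a hard rule:

```bash
mem_gb=512
conns=$((mem_gb * 100))            # 51200 connections
sets_needed=$((conns * 2 / 16))    # the text's rule of thumb, including auxiliary processes
echo "$conns connections -> roughly $sets_needed semaphore sets (configured limit: 512000)"
```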
5. kernel.shmall, kernel.shmmax and kernel.shmmni

```
kernel.shmall = 107374182
kernel.shmmax = 274877906944
kernel.shmmni = 819200
```

Parameter description

```
Assuming 512 GB of RAM:

shmmax   maximum size of a single shared memory segment: 256 GB (half of RAM, in bytes)
shmall   maximum total size of all shared memory segments (80% of RAM, in pages)
shmmni   up to 819200 shared memory segments may be created (each database instance needs 2 segments at startup; with dynamically created segments the demand can be higher)

# getconf PAGE_SIZE
4096
```

Recommended setting

```
kernel.shmall = 107374182
kernel.shmmax = 274877906944
kernel.shmmni = 819200
```

For PostgreSQL 9.2 and earlier, starting the database places a large demand on the shared memory segment; budget for the following:

```
Connections:              (1800 + 270 * max_locks_per_transaction) * max_connections
Autovacuum workers:       (1800 + 270 * max_locks_per_transaction) * autovacuum_max_workers
Prepared transactions:    (770 + 270 * max_locks_per_transaction) * max_prepared_transactions
Shared disk buffers:      (block_size + 208) * shared_buffers
WAL buffers:              (wal_block_size + 8) * wal_buffers
Fixed space requirements: 770 kB
```

The recommended values above were sized against these pre-9.2 requirements; they remain suitable for later releases.
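Rather than hard-coding the numbers, the half-of-RAM and 80%-of-RAM rules from the text can be computed on the target host. This is only a sketch of that arithmetic:

```bash
mem_bytes=$(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)
page_size=$(getconf PAGE_SIZE)

shmmax=$((mem_bytes / 2))                    # half of RAM, in bytes
shmall=$((mem_bytes * 8 / 10 / page_size))   # 80% of RAM, in pages

echo "kernel.shmmax = $shmmax"
echo "kernel.shmall = $shmall"
```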
6. net.core.netdev_max_backlog

Parameter description

```
netdev_max_backlog
------------------
Maximum number of packets, queued on the INPUT side, when the interface
receives packets faster than kernel can process them.
```

Recommended setting

```
net.core.netdev_max_backlog=1xxxx
```

The longer the INPUT queue, the more processing it costs; if the host is managed with iptables, this value should be increased.
7. net.core.rmem_* and net.core.wmem_*

```
net.core.rmem_default
net.core.rmem_max
net.core.wmem_default
net.core.wmem_max
```

Parameter description

```
rmem_default
------------
The default setting of the socket receive buffer in bytes.

rmem_max
--------
The maximum receive socket buffer size in bytes.

wmem_default
------------
The default setting (in bytes) of the socket send buffer.

wmem_max
--------
The maximum send socket buffer size in bytes.
```

Recommended setting

```
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 4194304
```
8. net.core.somaxconn

Parameter description

```
somaxconn - INTEGER
    Limit of socket listen() backlog, known in userspace as SOMAXCONN.
    Defaults to 128.  See also tcp_max_syn_backlog for additional tuning
    for TCP sockets.
```

Recommended setting

```
net.core.somaxconn=4xxx
```
9. net.ipv4.tcp_max_syn_backlog

Parameter description

```
tcp_max_syn_backlog - INTEGER
    Maximal number of remembered connection requests, which have not
    received an acknowledgment from connecting client.
    The minimal value is 128 for low memory machines, and it will
    increase in proportion to the memory of machine.
    If server suffers from overload, try increasing this number.
```

Recommended setting

```
net.ipv4.tcp_max_syn_backlog=4xxx
```

pgpool-II uses this value to queue connections beyond num_init_children, so it determines how many connections can wait in that queue.
10. net.ipv4.tcp_keepalive_*

```
net.ipv4.tcp_keepalive_intvl=20
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_time=60
```

Parameter description

```
tcp_keepalive_time - INTEGER
    How often TCP sends out keepalive messages when keepalive is enabled.
    Default: 2hours.

tcp_keepalive_probes - INTEGER
    How many keepalive probes TCP sends out, until it decides that the
    connection is broken. Default value: 9.

tcp_keepalive_intvl - INTEGER
    How frequently the probes are send out. Multiplied by
    tcp_keepalive_probes it is time to kill not responding connection,
    after probes started. Default value: 75sec i.e. connection
    will be aborted after ~11 minutes of retries.
```

Recommended setting

```
net.ipv4.tcp_keepalive_intvl=20
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_time=60
```

After a connection has been idle for 60 seconds, a keepalive probe is sent every 20 seconds; after 3 unanswered probes the connection is closed. From the start of the idle period to the close takes 120 seconds in total.
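PostgreSQL can also override the kernel defaults per connection through its own settings (tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count), which may be more convenient than changing sysctl on a shared host. A sketch of the equivalent postgresql.conf entries, mirroring the values above:

```
# postgresql.conf -- per-backend equivalents of the sysctl values above
tcp_keepalives_idle = 60        # seconds of idle time before the first probe
tcp_keepalives_interval = 20    # seconds between probes
tcp_keepalives_count = 3        # lost probes before the connection is dropped
```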
11. net.ipv4.tcp_mem

```
net.ipv4.tcp_mem=8388608 12582912 16777216
```

Parameter description

```
tcp_mem - vector of 3 INTEGERs: min, pressure, max  (unit: pages)
    min: below this number of pages TCP is not bothered about its
    memory appetite.

    pressure: when amount of memory allocated by TCP exceeds this number
    of pages, TCP moderates its memory consumption and enters memory
    pressure mode, which is exited when memory consumption falls
    under "min".

    max: number of pages allowed for queueing by all TCP sockets.

    Defaults are calculated at boot time from amount of available memory.

With 64 GB of RAM the auto-calculated values are:
net.ipv4.tcp_mem = 1539615 2052821 3079230

With 512 GB of RAM the auto-calculated values are:
net.ipv4.tcp_mem = 49621632 66162176 99243264
```

Recommended setting

```
net.ipv4.tcp_mem=8388608 12582912 16777216
```

Leaving this parameter at the value the OS calculates automatically at boot is also perfectly acceptable.
12. net.ipv4.tcp_fin_timeout

Parameter description

```
tcp_fin_timeout - INTEGER
    The length of time an orphaned (no longer referenced by any
    application) connection will remain in the FIN_WAIT_2 state
    before it is aborted at the local end.  While a perfectly
    valid "receive only" state for an un-orphaned connection, an
    orphaned connection in FIN_WAIT_2 state could otherwise wait
    forever for the remote to close its end of the connection.
    Cf. tcp_max_orphans
    Default: 60 seconds
```

Recommended setting

```
net.ipv4.tcp_fin_timeout=5
```

Speeds up the reclamation of orphaned ("zombie") connections.
13. net.ipv4.tcp_synack_retries

Parameter description

```
tcp_synack_retries - INTEGER
    Number of times SYNACKs for a passive TCP connection attempt will
    be retransmitted. Should not be higher than 255. Default value
    is 5, which corresponds to 31 seconds till the last retransmission
    with the current initial RTO of 1 second. With this the final
    timeout for a passive TCP connection will happen after 63 seconds.
```

Recommended setting

```
net.ipv4.tcp_synack_retries=2
```

Shortens the SYN-ACK retransmission timeout.
14. net.ipv4.tcp_syncookies

Parameter description

```
tcp_syncookies - BOOLEAN
    Only valid when the kernel was compiled with CONFIG_SYN_COOKIES
    Send out syncookies when the syn backlog queue of a socket
    overflows. This is to prevent against the common 'SYN flood attack'
    Default: 1

    Note, that syncookies is fallback facility.
    It MUST NOT be used to help highly loaded servers to stand
    against legal connection rate. If you see SYN flood warnings
    in your logs, but investigation shows that they occur
    because of overload with legal connections, you should tune
    another parameters until this warning disappear.
    See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.

    syncookies seriously violate TCP protocol, do not allow
    to use TCP extensions, can result in serious degradation
    of some services (f.e. SMTP relaying), visible not by you,
    but your clients and relays, contacting you. While you see
    SYN flood warnings in logs not being really flooded, your server
    is seriously misconfigured.

    If you want to test which effects syncookies have to your
    network connections you can set this knob to 2 to enable
    unconditionally generation of syncookies.
```

Recommended setting

```
net.ipv4.tcp_syncookies=1
```

Protects against SYN flood attacks.
15. net.ipv4.tcp_timestamps

Parameter description

```
tcp_timestamps - BOOLEAN
    Enable timestamps as defined in RFC1323.
```

Recommended setting

```
net.ipv4.tcp_timestamps=1
```

tcp_timestamps is a TCP extension that uses timestamps on incoming packets to guard against wrapped sequence numbers (PAWS, Protect Against Wrapped Sequence numbers) and can improve TCP performance.
16. net.ipv4.tcp_tw_*

```
net.ipv4.tcp_tw_recycle
net.ipv4.tcp_tw_reuse
net.ipv4.tcp_max_tw_buckets
```

Parameter description

```
tcp_tw_recycle - BOOLEAN
    Enable fast recycling TIME-WAIT sockets. Default value is 0.
    It should not be changed without advice/request of technical
    experts.

tcp_tw_reuse - BOOLEAN
    Allow to reuse TIME-WAIT sockets for new connections when it is
    safe from protocol viewpoint. Default value is 0.
    It should not be changed without advice/request of technical
    experts.

tcp_max_tw_buckets - INTEGER
    Maximal number of timewait sockets held by system simultaneously.
    If this number is exceeded time-wait socket is immediately destroyed
    and warning is printed. This limit exists only to prevent
    simple DoS attacks, you _must_ not lower the limit artificially,
    but rather increase it (probably, after increasing installed memory),
    if network conditions require more than default value.
```

Recommended setting

```
net.ipv4.tcp_tw_recycle=0
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_max_tw_buckets = 2xxxxx
```

Do not enable net.ipv4.tcp_tw_recycle and net.ipv4.tcp_timestamps at the same time.
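Before adjusting the TIME-WAIT knobs it is worth measuring how many such sockets the host actually accumulates; a quick check with ss from iproute2 (the first output line is a header, so the count is off by one):

```bash
# sockets currently in TIME-WAIT vs. the configured bucket limit
ss -tan state time-wait | wc -l
cat /proc/sys/net/ipv4/tcp_max_tw_buckets
```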
17. net.ipv4.tcp_rmem and net.ipv4.tcp_wmem

```
net.ipv4.tcp_rmem
net.ipv4.tcp_wmem
```

Parameter description
```
tcp_wmem - vector of 3 INTEGERs: min, default, max
    min: Amount of memory reserved for send buffers for TCP sockets.
    Each TCP socket has rights to use it due to fact of its birth.
    Default: 1 page

    default: initial size of send buffer used by TCP sockets.  This
    value overrides net.core.wmem_default used by other protocols.
    It is usually lower than net.core.wmem_default.
    Default: 16K

    max: Maximal amount of memory allowed for automatically tuned
    send buffers for TCP sockets. This value does not override
    net.core.wmem_max.  Calling setsockopt() with SO_SNDBUF disables
    automatic tuning of that socket's send buffer size, in which case
    this value is ignored.
    Default: between 64K and 4MB, depending on RAM size.

tcp_rmem - vector of 3 INTEGERs: min, default, max
    min: Minimal size of receive buffer used by TCP sockets.
    It is guaranteed to each TCP socket, even under moderate memory
    pressure.
    Default: 1 page

    default: initial size of receive buffer used by TCP sockets.
    This value overrides net.core.rmem_default used by other protocols.
    Default: 87380 bytes. This value results in window of 65535 with
    default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit
    less for default tcp_app_win. See below about these variables.

    max: maximal size of receive buffer allowed for automatically
    selected receiver buffers for TCP socket. This value does not
    override net.core.rmem_max.  Calling setsockopt() with SO_RCVBUF
    disables automatic tuning of that socket's receive buffer size,
    in which case this value is ignored.
    Default: between 87380B and 6MB, depending on RAM size.
```
Recommended setting

```
net.ipv4.tcp_rmem=8192 87380 16777216
net.ipv4.tcp_wmem=8192 65536 16777216
```

These are the values recommended for many databases; they improve network throughput.
18. net.nf_conntrack_max and net.netfilter.nf_conntrack_max

```
net.nf_conntrack_max
net.netfilter.nf_conntrack_max
```

Applies to

```
CentOS 6
```

Parameter description

```
nf_conntrack_max - INTEGER
    Size of connection tracking table.  Default value is
    nf_conntrack_buckets value * 4.
```

Recommended setting

```
net.nf_conntrack_max=1xxxxxx
net.netfilter.nf_conntrack_max=1xxxxxx
```
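When netfilter connection tracking is in use, current table usage can be compared against the limit; note that these procfs entries exist only once the nf_conntrack module is loaded:

```bash
# currently tracked connections vs. the table size
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
```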
19. vm.dirty_*

```
vm.dirty_background_bytes
vm.dirty_expire_centisecs
vm.dirty_ratio
vm.dirty_writeback_centisecs
```

Parameter description
```
==============================================================

dirty_background_bytes

Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.

Note: dirty_background_bytes is the counterpart of dirty_background_ratio.
Only one of them may be specified at a time. When one sysctl is written it
is immediately taken into account to evaluate the dirty memory limits and
the other appears as 0 when read.

==============================================================

dirty_background_ratio

Contains, as a percentage of total system memory, the number of pages at
which the background kernel flusher threads will start writing out dirty
data.

==============================================================

dirty_bytes

Contains the amount of dirty memory at which a process generating disk
writes will itself start writeback.

Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may
be specified at a time. When one sysctl is written it is immediately taken
into account to evaluate the dirty memory limits and the other appears as
0 when read.

Note: the minimum value allowed for dirty_bytes is two pages (in bytes);
any value lower than this limit will be ignored and the old configuration
will be retained.

==============================================================

dirty_expire_centisecs

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads.  It is expressed in 100'ths
of a second.  Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.

==============================================================

dirty_ratio

Contains, as a percentage of total system memory, the number of pages at
which a process which is generating disk writes will itself start writing
out dirty data.

==============================================================

dirty_writeback_centisecs

The kernel flusher threads will periodically wake up and write `old' data
out to disk.  This tunable expresses the interval between those wakeups,
in 100'ths of a second.

Setting this to zero disables periodic writeback altogether.

==============================================================
```
Recommended setting

```
vm.dirty_background_bytes = 4096000000
vm.dirty_expire_centisecs = 6000
vm.dirty_ratio = 80
vm.dirty_writeback_centisecs = 50
```

These settings reduce how often database processes have to flush dirty pages themselves; size dirty_background_bytes according to the actual IOPS capability and the amount of RAM.
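To judge whether background writeback keeps up with the workload, the current dirty and writeback page counts can be watched while the load runs; a simple sketch:

```bash
# dirty data waiting to be flushed vs. data currently being written back
grep -E '^(Dirty|Writeback):' /proc/meminfo

# refresh every second to watch the trend under load
watch -n 1 "grep -E '^(Dirty|Writeback):' /proc/meminfo"
```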
20. vm.extra_free_kbytes

Applies to

```
CentOS 6
```

Parameter description

```
extra_free_kbytes

This parameter tells the VM to keep extra free memory between the threshold
where background reclaim (kswapd) kicks in, and the threshold where direct
reclaim (by allocating processes) kicks in.

This is useful for workloads that require low latency memory allocations
and have a bounded burstiness in memory allocations, for example a
realtime application that receives and transmits network traffic
(causing in-kernel memory allocations) with a maximum total message burst
size of 200MB may need 200MB of extra free memory to avoid direct reclaim
related latencies.
```

The goal is to have background reclaim do as much of the work as possible, starting this many kilobytes earlier than direct reclaim would, so that user processes can keep allocating memory quickly.

Recommended setting

```
vm.extra_free_kbytes=4xxxxxx
```
21. vm.min_free_kbytes

Parameter description

```
min_free_kbytes:

This is used to force the Linux VM to keep a minimum number
of kilobytes free.  The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.
```

Recommended setting

```
vm.min_free_kbytes = 2xxxxxx    # a common rule of thumb: 1 GB of vm.min_free_kbytes per 32 GB of RAM
```

Prevents the system from becoming unresponsive under heavy load and reduces the likelihood of memory-allocation deadlocks.

How to set it:

- echo 4194304 > /proc/sys/vm/min_free_kbytes, or
- sysctl -w vm.min_free_kbytes=4194304, or
- edit /etc/sysctl.conf and add vm.min_free_kbytes=4194304
Memory watermarks

On Linux, free memory has three watermarks, from high to low: high, low and min, related as follows:
min = vm.min_free_kbytes, low = min * 5 / 4, high = min * 3 / 2.
When free memory drops below high, the system considers memory to be under some pressure. When it drops below low, the kswapd daemon starts reclaiming memory. When it falls further to min, direct reclaim is triggered, which blocks the running program and increases latency.
vm.min_free_kbytes should therefore be neither too small nor too large. If it is too small (say only a few tens of MB), the gap between low and min becomes tiny and direct reclaim is triggered very easily, hurting efficiency. If it is too large, memory is wasted and kswapd takes longer to reclaim. The command above sets it to 4 GB; depending on physical memory size, values between 1 GB and 8 GB are typical.
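The 1 GB-per-32 GB rule of thumb and the resulting per-zone watermarks can be checked as follows; reading /proc/zoneinfo is purely for inspection, and the 1 GB to 8 GB guidance above still applies regardless of what the formula yields:

```bash
# rule of thumb from the text: 1 GB of min_free_kbytes per 32 GB of RAM
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
echo "suggested vm.min_free_kbytes = $((mem_kb / 32))"

# current value, and the per-zone min/low/high watermarks derived from it (in pages)
cat /proc/sys/vm/min_free_kbytes
grep -E '^Node|^ +(min|low|high) ' /proc/zoneinfo | head -n 20
```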
22. vm.mmap_min_addr

Parameter description

```
mmap_min_addr

This file indicates the amount of address space which a user process will
be restricted from mmapping.  Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory userspace processes should not be allowed to write to them.  By
default this value is set to 0 and no protections will be enforced by the
security module.  Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.
```

Recommended setting

```
vm.mmap_min_addr=6xxxx
```

Guards against problems caused by latent kernel NULL-dereference bugs.
23. vm.overcommit_memory and vm.overcommit_ratio

Parameter description
```
==============================================================

overcommit_kbytes:

When overcommit_memory is set to 2, the committed address space is not
permitted to exceed swap plus this amount of physical RAM. See below.

Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
of them may be specified at a time. Setting one disables the other (which
then appears as 0 when read).

==============================================================

overcommit_memory:

This value contains a flag that enables memory overcommitment.

When this flag is 0, the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.

When this flag is 1, the kernel pretends there is always enough
memory until it actually runs out.

When this flag is 2, the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.

The default value is 0.

See Documentation/vm/overcommit-accounting and
security/commoncap.c::cap_vm_enough_memory() for more information.

==============================================================

overcommit_ratio:

When overcommit_memory is set to 2, the committed address
space is not permitted to exceed swap + this percentage
of physical RAM. See above.

==============================================================
```
Recommended setting

```
vm.overcommit_memory = 0
vm.overcommit_ratio = 90
```

With vm.overcommit_memory = 0, vm.overcommit_ratio does not need to be set; it only takes effect when overcommit_memory = 2.
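If you do switch to strict accounting (overcommit_memory = 2), it helps to know how close existing commitments already are to the limit; both figures are exposed in /proc/meminfo:

```bash
# Committed_AS is the address space already promised; CommitLimit only applies in mode 2
grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo
cat /proc/sys/vm/overcommit_memory /proc/sys/vm/overcommit_ratio
```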
24. vm.swappiness

Parameter description

```
swappiness

This control is used to define how aggressive the kernel will swap
memory pages.  Higher values will increase agressiveness, lower values
decrease the amount of swap.

The default value is 60.
```

Recommended setting

```
vm.swappiness = 0
```

How to set it:

- echo 0 > /proc/sys/vm/swappiness, or
- sysctl -w vm.swappiness=0, or
- edit /etc/sysctl.conf and add vm.swappiness=0

Swap space plays a role similar to virtual memory on Windows: when physical memory runs short, a swap partition on disk is used as if it were memory. But disk is far slower than RAM, so once heavy swapping starts, latency rises sharply and services can become unavailable for long stretches, which is fatal for a big data cluster. vm.swappiness controls how eagerly the kernel uses swap; the default is 60, and higher values mean more swapping. For CDH clusters with plenty of memory this is usually set to 0 or 1: 0 means swap is used only when available physical memory falls below the minimum threshold vm.min_free_kbytes (see the section above), and 1 means swap is used only to the minimum extent necessary.
Two explanations of the exact mechanism behind this setting are commonly given:
- Swapping begins once physical memory usage exceeds (100 - vm.swappiness)%.
- vm.swappiness controls, during memory reclaim, whether more anonymous memory or more file cache is reclaimed. At 100, anonymous pages and file cache are reclaimed with equal priority; at the default of 60, file cache is reclaimed first.
25. vm.zone_reclaim_mode

Parameter description
```
zone_reclaim_mode:

Zone_reclaim_mode allows someone to set more or less aggressive approaches
to reclaim memory when a zone runs out of memory. If it is set to zero then
no zone reclaim occurs. Allocations will be satisfied from other zones /
nodes in the system.

This is value ORed together of

1 = Zone reclaim on
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages

zone_reclaim_mode is disabled by default. For file servers or workloads
that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.

zone_reclaim may be enabled if it's known that the workload is partitioned
such that each partition fits within a NUMA node and that accessing remote
memory would cause a measurable performance reduction. The page allocator
will then reclaim easily reusable pages (those page cache pages that are
currently not used) before allocating off node pages.

Allowing zone reclaim to write out pages stops processes that are writing
large amounts of data from dirtying pages on other nodes. Zone reclaim
will write out dirty pages if a zone fills up and so effectively throttle
the process. This may decrease the performance of a single process since
it cannot use all of system memory to buffer the outgoing writes anymore
but it preserve the memory on other nodes so that the performance of other
processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local node
unless explicitly overridden by memory policies or cpuset configurations.
```
Recommended setting

```
vm.zone_reclaim_mode=0
```

Disable NUMA-local zone reclaim.

How to set it:

- echo 0 > /proc/sys/vm/zone_reclaim_mode, or
- sysctl -w vm.zone_reclaim_mode=0, or
- edit /etc/sysctl.conf and add vm.zone_reclaim_mode=0

This parameter is tied to NUMA (Non-Uniform Memory Access).
Under NUMA, each CPU has its own directly attached ("local") memory region.
A CPU gets short response times only when it accesses physical addresses in its own directly attached memory (local access). Accessing memory attached to another CPU goes over the interconnect and is slower (remote access). This can leave memory use unbalanced across CPUs: some CPUs run short of local memory and must reclaim frequently, possibly swapping heavily and causing severe latency jitter, while the local memory of other CPUs sits idle. The result is an odd symptom: free shows spare physical memory, yet the system keeps swapping and some applications slow down dramatically. The NUMA reclaim policy, vm.zone_reclaim_mode, therefore needs adjusting. As the documentation above shows, it is a bitmask: 1 enables zone reclaim, 2 additionally lets reclaim write out dirty pages, and 4 additionally lets reclaim swap pages. With 0, an allocation that cannot be satisfied from local memory simply falls back to other memory nodes; any non-zero value makes the node reclaim locally first. Setting it to 0 therefore reduces the chance of swapping.
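To see whether the host is NUMA at all, and how memory is currently spread across nodes, numactl and numastat (from the numactl package, if installed) together with procfs give a quick view:

```bash
# node count, per-node memory sizes and free memory
numactl --hardware

# current reclaim mode
cat /proc/sys/vm/zone_reclaim_mode

# per-node allocation statistics; numa_miss/numa_foreign indicate cross-node traffic
numastat
```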
26. net.ipv4.ip_local_port_range

Parameter description
```
ip_local_port_range - 2 INTEGERS
    Defines the local port range that is used by TCP and UDP to
    choose the local port. The first number is the first, the
    second the last local port number. The default values are
    32768 and 61000 respectively.

ip_local_reserved_ports - list of comma separated ranges
    Specify the ports which are reserved for known third-party
    applications. These ports will not be used by automatic port
    assignments (e.g. when calling connect() or bind() with port
    number 0). Explicit port allocation behavior is unchanged.

    The format used for both input and output is a comma separated
    list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and
    10). Writing to the file will clear all previously reserved
    ports and update the current list with the one given in the
    input.

    Note that ip_local_port_range and ip_local_reserved_ports
    settings are independent and both are considered by the kernel
    when determining which ports are available for automatic port
    assignments.

    You can reserve ports which are not in the current
    ip_local_port_range, e.g.:

    $ cat /proc/sys/net/ipv4/ip_local_port_range
    32000   61000
    $ cat /proc/sys/net/ipv4/ip_local_reserved_ports
    8080,9148

    although this is redundant. However such a setting is useful
    if later the port range is changed to a value that will
    include the reserved ports.

    Default: Empty
```
Recommended setting

```
net.ipv4.ip_local_port_range=40000 65535
```

Restricts the range used for dynamically assigned local ports so that they cannot collide with listening ports.
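The documentation quoted above also mentions ip_local_reserved_ports, a complementary way to keep specific listening ports out of the ephemeral pool. The port 5432 below is simply PostgreSQL's conventional port, used as an illustration:

```bash
# keep ephemeral ports away from the service port even if the range overlaps it
sysctl -w net.ipv4.ip_local_port_range="40000 65535"
sysctl -w net.ipv4.ip_local_reserved_ports="5432"

cat /proc/sys/net/ipv4/ip_local_reserved_ports
```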
27. vm.nr_hugepages

Parameter description
```
==============================================================
nr_hugepages

Change the minimum size of the hugepage pool.
See Documentation/vm/hugetlbpage.txt
==============================================================
nr_overcommit_hugepages

Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.
See Documentation/vm/hugetlbpage.txt

The output of "cat /proc/meminfo" will include lines like:
......
HugePages_Total: vvv
HugePages_Free:  www
HugePages_Rsvd:  xxx
HugePages_Surp:  yyy
Hugepagesize:    zzz kB

where:
HugePages_Total is the size of the pool of huge pages.
HugePages_Free  is the number of huge pages in the pool that are not yet
                allocated.
HugePages_Rsvd  is short for "reserved," and is the number of huge pages
                for which a commitment to allocate from the pool has been
                made, but no allocation has yet been made. Reserved huge
                pages guarantee that an application will be able to
                allocate a huge page from the pool of huge pages at fault
                time.
HugePages_Surp  is short for "surplus," and is the number of huge pages in
                the pool above the value in /proc/sys/vm/nr_hugepages. The
                maximum number of surplus huge pages is controlled by
                /proc/sys/vm/nr_overcommit_hugepages.

/proc/filesystems should also show a filesystem of type "hugetlbfs"
configured in the kernel.

/proc/sys/vm/nr_hugepages indicates the current number of "persistent"
huge pages in the kernel's huge page pool. "Persistent" huge pages will be
returned to the huge page pool when freed by a task. A user with root
privileges can dynamically allocate more or free some persistent huge
pages by increasing or decreasing the value of 'nr_hugepages'.
```
Recommended setting

If you intend to use PostgreSQL's huge pages, set this parameter; it just needs to be large enough to cover the shared memory the database requires.
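One common way to size the pool (roughly the approach described in the PostgreSQL documentation) is to read the postmaster's peak virtual size and divide by the huge page size. Treat this as a sketch: it assumes a running instance and that $PGDATA points at its data directory:

```bash
# peak address space of the running postmaster, in kB
pid=$(head -n 1 "$PGDATA/postmaster.pid")
vmpeak=$(awk '/^VmPeak/ {print $2}' /proc/$pid/status)

# huge page size reported by the kernel, in kB (typically 2048)
hps=$(awk '/^Hugepagesize/ {print $2}' /proc/meminfo)

# round up to get a lower bound for vm.nr_hugepages
echo "vm.nr_hugepages should be at least $(( (vmpeak + hps - 1) / hps ))"
```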
28. fs.nr_open

Parameter description

```
nr_open:

This denotes the maximum number of file-handles a process can
allocate. Default value is 1024*1024 (1048576) which should be
enough for most machines. Actual limit depends on RLIMIT_NOFILE
resource limit.
```

It also bounds the file-handle limits in /etc/security/limits.conf: a single process cannot open more handles than fs.nr_open, so to raise the per-process file-handle limit you must raise fs.nr_open first.

Recommended setting

For a PostgreSQL database with very many objects (tables, views, indexes, sequences, materialized views and so on), a value of about 20 million is recommended, e.g. fs.nr_open=20480000.
Resource Limits That Databases Care About

1. Configure them in /etc/security/limits.conf, or with ulimit.
2. Check a running process's limits in /proc/$pid/limits.

```
#- core    - limits the core file size (KB)
#- memlock - max locked-in-memory address space (KB)
#- nofile  - max number of open files
#            recommended: 10,000,000 -- but fs.nr_open must first be raised above it via sysctl,
#            otherwise logins to the system will fail.
#- nproc   - max number of processes
(The four entries above are the ones that matter most.)
....
#- data         - max data size (KB)
#- fsize        - maximum filesize (KB)
#- rss          - max resident set size (KB)
#- stack        - max stack size (KB)
#- cpu          - max CPU time (MIN)
#- as           - address space limit (KB)
#- maxlogins    - max number of logins for this user
#- maxsyslogins - max number of logins on the system
#- priority     - the priority to run user process with
#- locks        - max number of file locks the user can hold
#- sigpending   - max number of pending signals
#- msgqueue     - max memory used by POSIX message queues (bytes)
#- nice         - max nice priority allowed to raise to values: [-20, 19]
#- rtprio       - max realtime priority
```
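For concreteness, the corresponding /etc/security/limits.conf entries for a dedicated database account might look like the following; the user name and the numbers are illustrative, and nofile assumes fs.nr_open has already been raised above it:

```
# /etc/security/limits.conf -- illustrative values for a database account
postgres  soft  nofile   10000000
postgres  hard  nofile   10000000
postgres  soft  nproc    655360
postgres  hard  nproc    655360
postgres  soft  memlock  unlimited
postgres  hard  memlock  unlimited
postgres  soft  core     unlimited
postgres  hard  core     unlimited
```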
IO Scheduling Settings That Databases Care About

1. The IO schedulers currently supported by the OS include cfq, deadline and noop.
```
/kernel-doc-xxx/Documentation/block
-r--r--r-- 1 root root   674 Apr  8 16:33 00-INDEX
-r--r--r-- 1 root root 55006 Apr  8 16:33 biodoc.txt
-r--r--r-- 1 root root   618 Apr  8 16:33 capability.txt
-r--r--r-- 1 root root 12791 Apr  8 16:33 cfq-iosched.txt
-r--r--r-- 1 root root 13815 Apr  8 16:33 data-integrity.txt
-r--r--r-- 1 root root  2841 Apr  8 16:33 deadline-iosched.txt
-r--r--r-- 1 root root  4713 Apr  8 16:33 ioprio.txt
-r--r--r-- 1 root root  2535 Apr  8 16:33 null_blk.txt
-r--r--r-- 1 root root  4896 Apr  8 16:33 queue-sysfs.txt
-r--r--r-- 1 root root  2075 Apr  8 16:33 request.txt
-r--r--r-- 1 root root  3272 Apr  8 16:33 stat.txt
-r--r--r-- 1 root root  1414 Apr  8 16:33 switching-sched.txt
-r--r--r-- 1 root root  3916 Apr  8 16:33 writeback_cache_control.txt
```
If you want the details of how each scheduling policy works, see the wiki pages or the kernel documentation listed above.

The active scheduler for a device can be read here:

```
cat /sys/block/vdb/queue/scheduler
noop [deadline] cfq
```

To change it:

```
echo deadline > /sys/block/hda/queue/scheduler
```

Or change it through the boot parameters:

```
grub.conf
elevator=deadline
```

Many benchmark results show that databases run more consistently under the deadline scheduler.
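Writes to /sys do not survive a reboot. Besides the grub elevator option, a udev rule is another common way to persist the choice per device; this is only a sketch -- the device-name pattern and the rule file name are assumptions, and on kernels that use blk-mq the scheduler is called mq-deadline instead of deadline:

```bash
# /etc/udev/rules.d/60-io-scheduler.rules (illustrative file name)
cat <<'EOF' > /etc/udev/rules.d/60-io-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]|vd[a-z]", ATTR{queue/scheduler}="deadline"
EOF

# apply without rebooting
udevadm control --reload-rules
udevadm trigger --subsystem-match=block
```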
Linux Kernel Parameters
- Settings made with sysctl -w take effect immediately but only until reboot; write them to /etc/sysctl.conf to make them permanent.
Network
Parameter | Description | Initial value |
---|---|---|
net.ipv4.tcp_tw_recycle | Fast recycling of TIME_WAIT connections. When off, the kernel does not check packet timestamps; when on, it does. Enabling it is not recommended: with timestamps that are not monotonically increasing it causes packet loss, and newer kernels have removed the parameter entirely. | 0 |
net.core.somaxconn | Length of the queue for connections that have completed the three-way handshake (ESTABLISHED) but have not yet been accept()ed. A long accept queue means the server is slow at accept() or has received a burst of new connections. If the value is too small, the server may receive a SYN and not reply, because the somaxconn table is full and the new connection is dropped. For highly concurrent workloads, try increasing it, at the possible cost of higher latency. | 128 |
net.ipv4.tcp_max_syn_backlog | Upper bound on half-open connections, historically used to defend against SYN flood attacks; with tcp_syncookies=1 the number of half-open connections may exceed this limit. | - |
net.ipv4.tcp_syncookies | Enables SYN cookies, which mitigate some SYN attacks and allow connections to proceed even when the SYN queue overflows. Cookies are validated with SHA1, which in theory increases CPU usage. | 1 |
net.core.rmem_default net.core.rmem_max net.ipv4.tcp_mem net.ipv4.tcp_rmem | These parameters size the receive buffers. Oversizing them wastes memory; undersizing them causes packet loss. Decide whether the workload is high-concurrency with many connections or low-concurrency with high per-connection throughput, and tune accordingly. The theoretical optimum for rmem_default is the bandwidth-delay product (bandwidth x RTT); with rmem_default configured, tcp_rmem is not configured separately. rmem_max is set to roughly 5x rmem_default. tcp_mem is the total memory available to TCP and is normally auto-configured by the OS to 3/32, 1/8 or 3/16 of the CVM's available memory; tcp_mem and rmem_default together also bound the maximum number of concurrent connections. | rmem_default=655360 rmem_max=3276800 |
net.core.wmem_default net.core.wmem_max net.ipv4.tcp_wmem | These parameters size the send buffers. On Tencent Cloud, sending is rarely the bottleneck, so they can usually be left alone. | - |
net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_time | TCP keepalive settings, defaulting to 75/9/7200: only after a TCP connection has been idle for 7200 seconds does the kernel start probing, and only after 9 failed probes (75 seconds apart) does it send an RST. These defaults are large for a server; depending on the workload they can be reduced to around 30/3/1800. | - |
net.ipv4.ip_local_port_range | Range of ports available for local (ephemeral) allocation; adjust as needed. | - |
tcp_tw_reuse | Allows TIME-WAIT sockets to be reused for new TCP connections, which helps when quickly restarting services that bind fixed ports, but carries risks behind NAT. Newer kernels accept the three values 0/1/2 and default to 2. | - |
net.ipv4.ip_forward net.ipv6.conf.all.forwarding | IP forwarding; set to 1 for Docker routing/forwarding scenarios. | 0 |
net.ipv4.conf.default.rp_filter | Reverse-path validation of packets received on an interface; may be 0/1/2. Following RFC 3704, 1 (strict reverse-path validation) is recommended, which helps defend against some DDoS attacks and IP spoofing. | - |
net.ipv4.conf.default.accept_source_route | Per the CentOS recommendation, do not accept IP packets that carry source-route information. | 0 |
net.ipv4.conf.all.promote_secondaries net.ipv4.conf.default.promote_secondaries | Whether a secondary IP address is promoted to primary when the primary IP address is removed. | 1 |
net.ipv6.neigh.default.gc_thresh3 net.ipv4.neigh.default.gc_thresh3 | Maximum number of entries kept in the ARP/neighbor cache; once the cache grows beyond this value, the garbage collector runs immediately. | 4096 |
Memory
Parameter | Description | Initial value |
---|---|---|
vm.vfs_cache_pressure | Defaults to 100 and controls how aggressively dentries are scanned. Relative to that baseline, larger values make the reclaim algorithm more willing to reclaim this memory. In many curl-heavy workloads, dentry build-up ends up consuming all available memory, which easily triggers OOM or kernel bugs. Balancing reclaim frequency against performance, 250 is a reasonable choice; adjust as needed. | 250 |
vm.min_free_kbytes | Computed automatically at boot from available physical memory as roughly 4 * sqrt(MEM). It is the minimum amount of memory, in KB, that the system keeps free at runtime, mostly for kernel threads, and does not need to be very large. A micro-burst of packets can, with some probability, break through vm.min_free_kbytes and cause OOM; on machines with large memory, setting vm.min_free_kbytes to about 1% of total memory is recommended. | - |
kernel.printk | Log levels for the kernel printk function; the first value is the console log level, configured here as 5. | 5 4 1 7 |
kernel.numa_balancing | Lets the kernel automatically migrate a process's memory to the corresponding NUMA node. In practice the benefit is limited and there are other performance side effects; it can be worth enabling for redis workloads. | 0 |
kernel.shmall kernel.shmmax | shmmax is the maximum size of a single shared memory segment, in bytes. shmall is the maximum total amount of shared memory, in pages. | kernel.shmmax=68719476736 kernel.shmall=4294967296 |
Process
Parameter | Description | Initial value |
---|---|---|
fs.file-max fs.nr_open | Maximum number of files all processes together, and a single process, can have open at once. file-max is auto-configured by the OS at boot, roughly 100,000 per GB of RAM. nr_open is a fixed 1048576 and caps the per-process open-file limit in user space; it is normally left alone, and the per-process limit is set with ulimit -n instead, with /etc/security/limits.conf as the corresponding config file. | ulimit open files = 100001 fs.nr_open=1048576 |
kernel.pid_max | Maximum number of processes (PIDs) in the system; the official images default to 32768. Adjust as needed. | - |
kernel.core_uses_pid | Determines whether the name of a generated coredump file includes the pid. | 1 |
kernel.sysrq | When enabled, allows subsequent operations on /proc/sysrq-trigger. | 1 |
kernel.msgmnb kernel.msgmax | Maximum total bytes in a single message queue (msgmnb) and maximum size of a single message (msgmax), respectively. | 65536 |
kernel.softlockup_panic | When set, the kernel panics on detecting a soft lockup; combined with a kdump configuration this produces a vmcore that can be used to analyze the cause of the soft lockup. | - |
IO
Parameter | Description | Initial value |
---|---|---|
vm.dirty_background_bytes vm.dirty_background_ratio vm.dirty_bytes vm.dirty_expire_centisecs vm.dirty_ratio vm.dirty_writeback_centisecs | These parameters govern how dirty pages are written back to disk. dirty_background_bytes/dirty_bytes and dirty_background_ratio/dirty_ratio express the thresholds as absolute byte counts and as percentages respectively; normally the ratio form is used. dirty_background_ratio is the share of system memory (default 10%) at which the kernel flush threads are woken to write dirty pages back asynchronously. dirty_ratio is the maximum dirty share: once reached, all dirty data must be committed to disk and new IO is blocked until the writeback completes, which usually shows up as IO stalls. In other words, the system first crosses vm.dirty_background_ratio and starts asynchronous writeback while applications keep writing; if dirtying outpaces writeback and vm.dirty_ratio is reached, the OS switches to synchronous flushing and blocks the application. vm.dirty_expire_centisecs is how long dirty pages may live before the flush threads consider them old enough to write out, in hundredths of a second. vm.dirty_writeback_centisecs is the wake-up interval of the flush threads, also in hundredths of a second. | - |