Oracle服务器不返回消息给客户端(失去连接),告警日志中“Fatal NI connect error 12170”、TNS-12535、TNS-00505、nt OS err code、nt main err code等错误处理
Tags: DCDora-12170OracleTNS-00505TNS-12535丢失连接假死故障处理长连接
现象
现象1
在数据库服务器上执行一个update语句或执行一个存储过程,时间大概40分钟,可以正常结束,可以获取到执行成功的返回消息,但是拿到远程客户端去执行的时候(无论是使用sqlplus还是plsql developer或者Navicat执行),执行了很久也没有返回结果,,,但是,通过查询会话,发现该SQL已经正常执行完成,只是没有返回消息给客户端:
1 | SELECT dbms_sqltune.report_sql_monitor(sql_id => '99qfh6psd3cpy',type => 'html',report_level => 'all') FROM dual; |
查看告警日志报错:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | Fatal NI connect error 12170. VERSION INFORMATION: TNS for Linux: Version 11.2.0.4.0 - Production Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production Time: 08-MAY-2017 10:24:32 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12560 nt main err code: 505 TNS-00505: Operation timed out nt secondary err code: 110 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.176.172.44)(PORT=55353)) Fatal NI connect error 12170. VERSION INFORMATION: TNS for 64-bit Windows: Version 19.0.0.0.0 - Production Oracle Bequeath NT Protocol Adapter for 64-bit Windows: Version 19.0.0.0.0 - Production Windows NT TCP/IP NT Protocol Adapter for 64-bit Windows: Version 19.0.0.0.0 - Production Version 19.3.0.0.0 Time: 13-4月 -2023 11:08:23 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS: 操作超时 ns secondary err code: 12560 nt main err code: 505 TNS-00505: 操作超时 nt secondary err code: 60 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.142.28.51)(PORT=53889)) Fatal NI connect error 12170. VERSION INFORMATION: TNS for Linux: Version 19.0.0.0.0 - Production Oracle Bequeath NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production Version 19.3.0.0.0 Time: 08-MAY-2023 19:04:19 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12560 nt main err code: 505 TNS-00505: Operation timed out nt secondary err code: 110 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.100.2.63)(PORT=62813)) 2023-05-08T19:04:23.016284+08:00 |
仔细分析出问题的时间点,就是SQL执行完成的时间。。。。
现象2
有意思的是,INACTIVE状态的会话并不会被断开,而ACTIVE状态的会话跑着跑着却会被断开,但其实服务端会话仍然在,这个通过开多个sqlplus客户端可以观察到。
分析
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | [oracle@lhrora19c ~]$ oerr ora 12170 12170, 00000, "TNS:Connect timeout occurred" // *Cause: The server shut down because connection establishment or // communication with a client failed to complete within the allotted time // interval. This may be a result of network or system delays; or this may // indicate that a malicious client is trying to cause a Denial of Service // attack on the server. // *Action: If the error occurred because of a slow network or system, // reconfigure one or all of the parameters SQLNET.INBOUND_CONNECT_TIMEOUT, // SQLNET.SEND_TIMEOUT, SQLNET.RECV_TIMEOUT in sqlnet.ora to larger values. // If a malicious client is suspected, use the address in sqlnet.log to // identify the source and restrict access. Note that logged addresses may // not be reliable as they can be forged (e.g. in TCP/IP). [oracle@lhrora19c ~]$ |
从MOS上的信息反馈看,这个类型错误提示是一种正常的Oracle工作机制。当客户端进程Client Process与服务器进程Server Process建立联系之后,两者就形成了“同生共死”的关系(专有连接模式)。除非客户端主动发起中断或者Server Process被异常kill。
在实际运行环境中,这种理想状态常常被打破。如果Client Process只是保持连接,不执行语句,会话就处于idle状态。这种连接很容易被诸如防火墙等网络层面设备切断。
在Oracle11gR2中,如果长期没有连接动作的Server Process被外力切断,Oracle就会自动将信息作为提示错误写入到alert log中,作为一种提示。在11R1版本中,这种信息是会写入到sqlnet.log中。
归纳MOS和网络中的各种方法,大体有两重策略,分别为使用DCD和禁用ADR。
DCD全称Dead Connection Detection,是一种基于主动测探方式检查Oracle僵尸客户端进程Client Process的策略。配置DCD的关键是设置sqlnet.expire_time参数在SQL Net体系下,Oracle会依据这个时间间隔给所有的Client Process发送网络通信包,用来确定Client是否存活。正是借助这个包通信,可以让防火墙认为这个网络连接还是处在active状态,不会进行强制断开动作。类似的机制还有Linux上的tcp keep live机制,也是使用类似的策略进行检查。
另一种方式也是Oracle推荐的,就是关闭11g的ADR机制。ADR(Automatic Diagnostic Repository)是Oracle进行自动诊断、自动提醒的工具组件。Oracle认为如果用户不需要在SQL Net组件中应用ADR,可以再sqlnet.ora中进行配置关闭。
nt secondary err code返回值的不同
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | The 'nt secondary err code' translates to underlying network transport timeouts for the following Operating Systems: For the Solaris system: nt secondary err code: 145: ETIMEDOUT 145 /* Connection timed out */ For the Linux operating system: nt secondary err code: 110 ETIMEDOUT 110 Connection timed out For the HP-UX system: nt secondary err code: 238: ETIMEDOUT 238 /* Connection timed out */ For AIX: nt secondary err code: 78: ETIMEDOUT 78 /* Connection timed out */ For Windows based platforms: nt secondary err code: 60 (which translates to Winsock Error: 10060) |
生成日志
可以配置trace进行跟踪,生成日志来分析:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | -- 服务端trace TRACE_LEVEL_SERVER = 16 TRACE_FILE_SERVER = SERVER TRACE_DIRECTORY_SERVER= /u01/app/oracle/network/trace TRACE_TIMESTAMP_SERVER = ON TRACE_UNIQUE_SERVER = ON DIAG_ADR_ENABLED=OFF -- 客户端trace DIAG_ADR_ENABLED = off TRACE_LEVEL_CLIENT = 16 TRACE_UNIQUE_CLIENT = on TRACE_DIRECTORY_CLIENT = C:\Users\Administrator\Desktop\instantclient_21_9\network\admin\log TRACE_FILE_CLIENT = client |
也可以使用事件跟踪“ORA-03135: 连接失去联系错误”生成trace:
1 2 3 4 5 6 7 8 9 10 11 | SELECT VALUE FROM V$DIAG_INFO WHERE NAME LIKE '%Default%'; 15:24:14 SQL> alter session set events '3135 trace name errorstack level 3'; 会话已更改。 已用时间: 00: 00: 00.02 15:24:16 SQL> show parameter name ORA-03135: 连接失去联系 进程 ID: 1000 会话 ID: 371 序列号: 15777 |
SQL查询排查
1 2 3 4 | select a.LAST_CALL_ET,a.sid,a.SERIAL#,a.status,a.sql_id,a.USERNAME,a.EVENT,port from v$session a where a.username is not null and a.PROGRAM='sqlplus.exe'; SELECT dbms_sqltune.report_sql_monitor(sql_id => 'dgrq7u1w7vp9j',type => 'html',report_level => 'all') FROM dual; |
解决
1、修改数据库服务器端的listener.ora 和 sqlnet.ora文件
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | -- listener.ora INBOUND_CONNECT_TIMEOUT_LISTENER = 604800 DIAG_ADR_ENABLED_LISTENER = OFF -- sqlnet.ora USE_NS_PROBES_FOR_DCD=TRUE sqlnet.expire_time = 5 SQLNET.INBOUND_CONNECT_TIMEOUT=604800 SQLNET.OUTBOUND_CONNECT_TIMEOUT=604800 SQLNET.RECV_TIMEOUT=604800 SQLNET.SEND_TIMEOUT=604800 TCP.CONNECT_TIMEOUT=604800 DISABLE_OOB=ON -- 12c SQLNET.EXPIRE_ACTION=trace |
2、若是Linux服务器,则配置如下参数:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | cat /proc/sys/net/ipv4/tcp_keepalive_time cat /proc/sys/net/ipv4/tcp_keepalive_intvl cat /proc/sys/net/ipv4/tcp_keepalive_probes cat /proc/sys/net/ipv4/tcp_max_tw_buckets cat /proc/sys/net/ipv4/tcp_max_syn_backlog cat /proc/sys/net/core/somaxconn cat >> /etc/sysctl.conf <<"EOF" net.ipv4.tcp_keepalive_time=300 net.ipv4.tcp_keepalive_intvl=30 net.ipv4.tcp_keepalive_probes=10 net.ipv4.tcp_max_tw_buckets = 262144 net.core.somaxconn = 4096 net.ipv4.tcp_max_syn_backlog = 4096 EOF sysctl -p |
对于Linux服务器,如果参数net.ipv4.tcp_max_tw_buckets
、net.core.somaxconn
和net.ipv4.tcp_max_syn_backlog
配置过小,则可能会导致该文中的问题。
对于Windows平台请参考:https://www.xmmup.com/linuxhewindowspingtaishangtcp_keepalive_timetcp_keepalive_intvlhetcp_keepalive_probesdepeizhi.html
3、重新reload监听器配置,或者重启监听器
其它可能原因排查
1、数据库防火墙原因,请关闭防火墙或配置防火墙的长连接。这里的防火墙涉及数据库服务端和客户端2个主机的防火墙。
2、检查客户端和服务器之间是否有丢包及延迟现象,可以使用ping或mtr测试,参考:https://www.xmmup.com/shiyong-mtr-fenxiwangluoyanchijidiubaowenti.html
3、检查profile文件是否有限制:
1 2 3 4 5 6 7 | show parameter resource_limit select * from dba_profiles d where d.resource_name='IDLE_TIME'; select u.username,d.profile, d.resource_name ,d.limit from dba_users u , dba_profiles d where u.PROFILE=d.profile and d.resource_name='IDLE_TIME' and U.username='LHR'; alter profile ORA_STIG_PROFILE limit IDLE_TIME UNLIMITED; |
4、若数据库是12.2版本及以上,请检查是否配置了参数MAX_IDLE_TIME
和MAX_IDLE_BLOCKER_TIME
。
5、是否配置了定时任务,定时清理会话,请查询:dba_jobs
和dba_scheduler_jobs
。
6、检查pga_aggregate_limit
和pga_aggregate_target
是否过小。
7、其它杀毒软件导致的短连接,请关闭Windows Defender 防火墙等。
8、是否有云层面、高层次、顶层的防火墙配置,例如,新华三防火墙“H3C SECPATH F5000-M”需要配置老化时间为较长时间:
参考:
https://zhiliao.h3c.com/questions/dispcont/114689
https://zhiliao.h3c.com/Theme/details/164108
另外,例如天融信防火墙配置TCP维持超时等时间,如下:
9、修改所有配置后,最后重启OS,再进行尝试。
其它处理
SQLNET.AUTHENTICATION_SERVICES的推荐值
另外,对于sqlnet.ora文件的参数SQLNET.AUTHENTICATION_SERVICES
的推荐值:
1 2 3 4 5 6 7 | -- windows SQLNET.AUTHENTICATION_SERVICES= (NTS) NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT, HOSTNAME) -- linux 建议注释掉SQLNET.AUTHENTICATION_SERVICES的值 SQLNET.AUTHENTICATION_SERVICES= (ALL) NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT, HOSTNAME) |
大事务分批处理
另外,对于大事务,可以考虑分批更新,参考:https://www.xmmup.com/oraclezaicaozuodashujuliangshiruhefenpicharufenpigengxinfenpishanchufenpitijiao.html:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | DECLARE TYPE RIDARRAY IS TABLE OF ROWID; L_RIDS RIDARRAY; CURSOR C IS SELECT ROWID FROM t_obj; BEGIN OPEN C; LOOP FETCH C BULK COLLECT INTO L_RIDS LIMIT 100000; FORALL I IN 1 .. L_RIDS.COUNT UPDATE t_obj SET owner = '2' WHERE ROWID = L_RIDS(I); COMMIT; EXIT WHEN C%NOTFOUND; END LOOP; CLOSE C; END; / |
若是因为网络不稳定的原因,或通过代理方式、或通过VPN等方式导致的时断时连,则可以考虑配置客户端软件的长连接。
对于Navicat来说,配置高级属性中的 保持连接间隔。选项:
对于PLSQL Developer软件,进入Preferences,然后选择左边的Connection选型,选择Check connection,配置后,每隔60秒会检查1次:
参数介绍
sqlnet.expire_time
sqlnet.expire_time = 5
指定用于定期验证客户端和服务器连接活动状态的时间间隔,单位为分钟。例如,在这里表示每隔5分钟发一个 probe 包证明 client/server connections 是 active 的。
To specify a time interval, in minutes, to send a check to verify that client/server connections are active.
DISABLE_OOB
"DISABLE_OOB"是Oracle数据库中的一个参数,指定是否在Out-Of-Band (OOB)通道上进行数据传输。
OOB通道是一个备用的、不同于主数据通道的通道,用于在主通道无法使用时进行数据传输。例如,当主通道出现故障或拥塞时,OOB通道可以用于重要数据的传输,以确保数据的可靠性和完整性。
如果将DISABLE_OOB参数设置为TRUE,则Oracle数据库将禁用OOB通道,只使用主数据通道进行数据传输。这可能会降低数据传输的可靠性和速度,但可以提高系统的安全性,因为攻击者无法利用OOB通道进行潜在的攻击。
如果将DISABLE_OOB参数设置为FALSE(默认值),则Oracle数据库将启用OOB通道,以便在必要时使用备用通道进行数据传输。
请参考: Oracle Net Performance Tuning (Doc ID 67983.1)
Data exception or break is a function in Oracle NET that allows a transaction to be interrupted before it is
completed. It returns both the client and the server to a condition from which they can continue. A break
such as Ctrl-c can be sent as part of the normal data stream (inband), or as a separate asynchronous
message (outband). An outband break is much faster and interrupts the flow of data.Out Of Band Breaks (OOB) are enabled by default provided the underlying protocol supports sending
urgent data.If the parameter DISABLE_OOB is set to OFF then it enables Oracle Net to send and receive "break"
messages using urgent data provided by the underlying protocol.本人提供Oracle(OCP、OCM)、MySQL(OCP)、PostgreSQL(PGCA、PGCE、PGCM)等数据库的培训和考证业务,私聊QQ646634621或微信db_bao,谢谢!