合 Oracle服务器不返回消息给客户端(失去连接),告警日志中“Fatal NI connect error 12170”、TNS-12535、TNS-00505、nt OS err code、nt main err code等错误处理
Tags: Oracle故障处理DCDora-12170TNS-00505TNS-12535丢失连接假死长连接
现象
现象1
在数据库服务器上执行一个update语句或执行一个存储过程,时间大概40分钟,可以正常结束,可以获取到执行成功的返回消息,但是拿到远程客户端去执行的时候(无论是使用sqlplus还是plsql developer或者Navicat执行),执行了很久也没有返回结果,,,但是,通过查询会话,发现该SQL已经正常执行完成,只是没有返回消息给客户端:
1 | SELECT dbms_sqltune.report_sql_monitor(sql_id => '99qfh6psd3cpy',type => 'html',report_level => 'all') FROM dual; |
查看告警日志报错:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | Fatal NI connect error 12170. VERSION INFORMATION: TNS for Linux: Version 11.2.0.4.0 - Production Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production Time: 08-MAY-2017 10:24:32 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12560 nt main err code: 505 TNS-00505: Operation timed out nt secondary err code: 110 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.176.172.44)(PORT=55353)) Fatal NI connect error 12170. VERSION INFORMATION: TNS for 64-bit Windows: Version 19.0.0.0.0 - Production Oracle Bequeath NT Protocol Adapter for 64-bit Windows: Version 19.0.0.0.0 - Production Windows NT TCP/IP NT Protocol Adapter for 64-bit Windows: Version 19.0.0.0.0 - Production Version 19.3.0.0.0 Time: 13-4月 -2023 11:08:23 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS: 操作超时 ns secondary err code: 12560 nt main err code: 505 TNS-00505: 操作超时 nt secondary err code: 60 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.142.28.51)(PORT=53889)) Fatal NI connect error 12170. VERSION INFORMATION: TNS for Linux: Version 19.0.0.0.0 - Production Oracle Bequeath NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production Version 19.3.0.0.0 Time: 08-MAY-2023 19:04:19 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12560 nt main err code: 505 TNS-00505: Operation timed out nt secondary err code: 110 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.100.2.63)(PORT=62813)) 2023-05-08T19:04:23.016284+08:00 |
仔细分析出问题的时间点,就是SQL执行完成的时间。。。。
现象2
有意思的是,INACTIVE状态的会话并不会被断开,而ACTIVE状态的会话跑着跑着却会被断开,但其实服务端会话仍然在,这个通过开多个sqlplus客户端可以观察到。
分析
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | [oracle@lhrora19c ~]$ oerr ora 12170 12170, 00000, "TNS:Connect timeout occurred" // *Cause: The server shut down because connection establishment or // communication with a client failed to complete within the allotted time // interval. This may be a result of network or system delays; or this may // indicate that a malicious client is trying to cause a Denial of Service // attack on the server. // *Action: If the error occurred because of a slow network or system, // reconfigure one or all of the parameters SQLNET.INBOUND_CONNECT_TIMEOUT, // SQLNET.SEND_TIMEOUT, SQLNET.RECV_TIMEOUT in sqlnet.ora to larger values. // If a malicious client is suspected, use the address in sqlnet.log to // identify the source and restrict access. Note that logged addresses may // not be reliable as they can be forged (e.g. in TCP/IP). [oracle@lhrora19c ~]$ |
从MOS上的信息反馈看,这个类型错误提示是一种正常的Oracle工作机制。当客户端进程Client Process与服务器进程Server Process建立联系之后,两者就形成了“同生共死”的关系(专有连接模式)。除非客户端主动发起中断或者Server Process被异常kill。
在实际运行环境中,这种理想状态常常被打破。如果Client Process只是保持连接,不执行语句,会话就处于idle状态。这种连接很容易被诸如防火墙等网络层面设备切断。
在Oracle11gR2中,如果长期没有连接动作的Server Process被外力切断,Oracle就会自动将信息作为提示错误写入到alert log中,作为一种提示。在11R1版本中,这种信息是会写入到sqlnet.log中。
归纳MOS和网络中的各种方法,大体有两重策略,分别为使用DCD和禁用ADR。
DCD全称Dead Connection Detection,是一种基于主动测探方式检查Oracle僵尸客户端进程Client Process的策略。配置DCD的关键是设置sqlnet.expire_time参数在SQL Net体系下,Oracle会依据这个时间间隔给所有的Client Process发送网络通信包,用来确定Client是否存活。正是借助这个包通信,可以让防火墙认为这个网络连接还是处在active状态,不会进行强制断开动作。类似的机制还有Linux上的tcp keep live机制,也是使用类似的策略进行检查。
另一种方式也是Oracle推荐的,就是关闭11g的ADR机制。ADR(Automatic Diagnostic Repository)是Oracle进行自动诊断、自动提醒的工具组件。Oracle认为如果用户不需要在SQL Net组件中应用ADR,可以再sqlnet.ora中进行配置关闭。
nt secondary err code返回值的不同
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | The 'nt secondary err code' translates to underlying network transport timeouts for the following Operating Systems: For the Solaris system: nt secondary err code: 145: ETIMEDOUT 145 /* Connection timed out */ For the Linux operating system: nt secondary err code: 110 ETIMEDOUT 110 Connection timed out For the HP-UX system: nt secondary err code: 238: ETIMEDOUT 238 /* Connection timed out */ For AIX: nt secondary err code: 78: ETIMEDOUT 78 /* Connection timed out */ For Windows based platforms: nt secondary err code: 60 (which translates to Winsock Error: 10060) |
生成日志
可以配置trace进行跟踪,生成日志来分析:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | -- 服务端trace TRACE_LEVEL_SERVER = 16 TRACE_FILE_SERVER = SERVER TRACE_DIRECTORY_SERVER= /u01/app/oracle/network/trace TRACE_TIMESTAMP_SERVER = ON TRACE_UNIQUE_SERVER = ON DIAG_ADR_ENABLED=OFF -- 客户端trace DIAG_ADR_ENABLED = off TRACE_LEVEL_CLIENT = 16 TRACE_UNIQUE_CLIENT = on TRACE_DIRECTORY_CLIENT = C:\Users\Administrator\Desktop\instantclient_21_9\network\admin\log TRACE_FILE_CLIENT = client |
也可以使用事件跟踪“ORA-03135: 连接失去联系错误”生成trace:
1 2 3 4 5 6 7 8 9 10 11 | SELECT VALUE FROM V$DIAG_INFO WHERE NAME LIKE '%Default%'; 15:24:14 SQL> alter session set events '3135 trace name errorstack level 3'; 会话已更改。 已用时间: 00: 00: 00.02 15:24:16 SQL> show parameter name ORA-03135: 连接失去联系 进程 ID: 1000 会话 ID: 371 序列号: 15777 |
SQL查询排查
1 2 3 4 | select a.LAST_CALL_ET,a.sid,a.SERIAL#,a.status,a.sql_id,a.USERNAME,a.EVENT,port from v$session a where a.username is not null and a.PROGRAM='sqlplus.exe'; SELECT dbms_sqltune.report_sql_monitor(sql_id => 'dgrq7u1w7vp9j',type => 'html',report_level => 'all') FROM dual; |