[Troubleshooting] RAC node fails to start with the error "has a disk HB, but no network HB"
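Today a colleague told me that an Oracle RAC 19c environment had become unusable and asked me to take a look.
This RAC environment runs on Huawei Cloud ECS, with CentOS 7.6 as the operating system. In my experience, when RAC fails to start there are two main causes: shared storage and the network. Common shared-storage causes are a disk that has dropped off, a failed disk, or a problem with the multipath software. Common network causes are a failed private-network NIC or broken connectivity between the nodes. (Note: changing the SSH port, or changing the oracle and grid passwords, does not affect normal RAC operation.)
Unfortunately, this environment had problems with both shared storage and the network. Let's go through them one at a time.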
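Cause 1: a shared disk had dropped off
First, check whether the shared disks are the same on both nodes. Node 2 turned out to be missing one disk, so I asked the customer to reattach it to node 2.
Checking again afterwards, the shared disks match: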
[root@oracle-rac2 ~]# ll /dev/asm*
lrwxrwxrwx 1 root root 3 Jul 30 11:09 /dev/asm-diska -> sde
lrwxrwxrwx 1 root root 3 Jul 30 11:09 /dev/asm-diskb -> sdd
lrwxrwxrwx 1 root root 3 Jul 30 11:03 /dev/asm-diskc -> sdc
lrwxrwxrwx 1 root root 3 Jul 30 11:03 /dev/asm-diskd -> sdb
lrwxrwxrwx 1 root root 3 Jul 30 10:55 /dev/asm-diske -> sda

[root@oracle-rac1 trace]# ll /dev/asm*
lrwxrwxrwx 1 root root 3 Jul 30 11:10 /dev/asm-diska -> sde
lrwxrwxrwx 1 root root 3 Jul 30 11:03 /dev/asm-diskb -> sdb
lrwxrwxrwx 1 root root 3 Jul 30 10:23 /dev/asm-diskc -> sda
lrwxrwxrwx 1 root root 3 Jul 30 11:10 /dev/asm-diskd -> sdd
lrwxrwxrwx 1 root root 3 Jul 30 11:03 /dev/asm-diske -> sdc

[root@oracle-rac2 ~]# $GRID_HOME/bin/kfod disks=asm st=true ds=true cluster=true
--------------------------------------------------------------------------------
 Disk          Size Header    Path              Disk Group   User     Group
================================================================================
   1:      81920 MB MEMBER    /dev/asm-diska    DATA         grid     asmadmin
   2:      81920 MB MEMBER    /dev/asm-diskb    OCR          grid     asmadmin
   3:      81920 MB MEMBER    /dev/asm-diskc    DATA         grid     asmadmin
   4:      81920 MB MEMBER    /dev/asm-diskd    DATA         grid     asmadmin
   5:      81920 MB MEMBER    /dev/asm-diske    DATA         grid     asmadmin
--------------------------------------------------------------------------------
ORACLE_SID ORACLE_HOME                                          HOST_NAME
================================================================================

[root@oracle-rac1 trace]# $GRID_HOME/bin/kfod disks=asm st=true ds=true cluster=true
--------------------------------------------------------------------------------
 Disk          Size Header    Path              Disk Group   User     Group
================================================================================
   1:      81920 MB MEMBER    /dev/asm-diska    DATA         grid     asmadmin
   2:      81920 MB MEMBER    /dev/asm-diskb    DATA         grid     asmadmin
   3:      81920 MB MEMBER    /dev/asm-diskc    DATA         grid     asmadmin
   4:      81920 MB MEMBER    /dev/asm-diskd    OCR          grid     asmadmin
   5:      81920 MB MEMBER    /dev/asm-diske    DATA         grid     asmadmin
--------------------------------------------------------------------------------
ORACLE_SID ORACLE_HOME                                          HOST_NAME
================================================================================
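The disk order differs between the two nodes here, but that does not matter. The devices are bound through udev, and ASM identifies each disk by its header rather than by its device name, so RAC operation is not affected.
Once the RAC nodes have started normally, you can see the following: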
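Because the device names differ between nodes, comparing the two kfod listings row by row is misleading. A minimal sketch of a name-independent comparison follows. It assumes the kfod column layout shown above (size in the second field, header status in the fourth, disk group in the sixth), and the function name `kfod_summary` is made up for illustration:

```shell
#!/bin/sh
# kfod_summary: read kfod output on stdin and print one line per disk group:
#   <disk group> <number of MEMBER disks> <total size in MB>
# The result is sorted, so it does not depend on device names or row order.
# Assumes the column layout from the listing above; other kfod options
# may produce different columns.
kfod_summary() {
  awk '$1 ~ /^[0-9]+:$/ && $4 == "MEMBER" {
         count[$6]++        # disks per disk group
         size[$6] += $2     # total MB per disk group
       }
       END { for (g in count) print g, count[g], size[g] }' | sort
}

# Usage (run on each node, then compare the two results):
#   $GRID_HOME/bin/kfod disks=asm st=true ds=true cluster=true | kfod_summary
```

For the listings above, both nodes should produce `DATA 4 327680` and `OCR 1 81920`, which confirms that the two nodes see the same set of disks even though the device names are assigned differently.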
-- Node 1
SYS@orcl1> set line 9999
SYS@orcl1> set pagesize 9999
SYS@orcl1> col path format a60
SYS@orcl1> SELECT a.group_number, disk_number,mount_status, a.name, path FROM v$asm_disk a order by a.disk_number;
select instance_name,status from v$instance;

GROUP_NUMBER DISK_NUMBER MOUNT_STATUS   NAME          PATH
------------ ----------- -------------- ------------- ---------------
           1           0 CACHED         DATA_0000     /dev/asm-diskc
           2           0 CACHED         OCR_0000      /dev/asm-diskd
           1           1 CACHED         DATA_0001     /dev/asm-diske
           1           2 CACHED         DATA_0002     /dev/asm-diska
           1           3 CACHED         DATA_0003     /dev/asm-diskb


-- Node 2
SQL> set line 9999
SQL> set pagesize 9999
SQL> col path format a60
SQL> SELECT a.group_number, disk_number,mount_status, a.name, path FROM v$asm_disk a order by a.disk_number;
select instance_name,status from v$instance;

GROUP_NUMBER DISK_NUMBER MOUNT_S NAME       PATH
------------ ----------- ------- ---------- ---------------
           2           0 CACHED  OCR_0000   /dev/asm-diskb
           1           0 CACHED  DATA_0000  /dev/asm-diske
           1           1 CACHED  DATA_0001  /dev/asm-diskc
           1           2 CACHED  DATA_0002  /dev/asm-diska
           1           3 CACHED  DATA_0003  /dev/asm-diskd
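Cause 2: traffic blocked by the security group
After logging in to the ECS instances, I found that only node 1 was running. Node 2 was not running the cluster services.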
[root@oracle-rac1 ~]# crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       oracle-rac1              STABLE
ora.chad
               ONLINE  ONLINE       oracle-rac1              STABLE
ora.net1.network
               ONLINE  ONLINE       oracle-rac1              STABLE
ora.ons
               ONLINE  ONLINE       oracle-rac1              STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr(ora.asmgroup)
      1        ONLINE  ONLINE       oracle-rac1              STABLE
      2        ONLINE  OFFLINE                               STABLE
      3        ONLINE  OFFLINE                               STABLE
ora.DATA.dg(ora.asmgroup)
      1        ONLINE  ONLINE       oracle-rac1              STABLE
      2        ONLINE  OFFLINE                               STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       oracle-rac1              STABLE
ora.OCR.dg(ora.asmgroup)
      1        ONLINE  ONLINE       oracle-rac1              STABLE
      2        ONLINE  OFFLINE                               STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.asm(ora.asmgroup)
      1        ONLINE  ONLINE       oracle-rac1              Started,STABLE
      2        ONLINE  OFFLINE                               STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.asmnet1.asmnetwork(ora.asmgroup)
      1        ONLINE  ONLINE       oracle-rac1              STABLE
      2        ONLINE  OFFLINE                               STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.cvu
      1        ONLINE  ONLINE       oracle-rac1              STABLE
ora.oracle-rac1.vip
      1        ONLINE  ONLINE       oracle-rac1              STABLE
ora.oracle-rac2.vip
      1        ONLINE  INTERMEDIATE oracle-rac1              FAILED OVER,STABLE
ora.orcl.db
      1        ONLINE  ONLINE       oracle-rac1              Open,HOME=/u01/app/o
                                                             racle/product/19.3.0
                                                             /dbhome_1,STABLE
      2        ONLINE  OFFLINE                               Instance Shutdown,ST
                                                             ABLE
ora.qosmserver
      1        ONLINE  ONLINE       oracle-rac1              STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       oracle-rac1              STABLE
--------------------------------------------------------------------------------
After starting the cluster services on node 2 with crsctl start has, watch the startup progress with crsctl stat res -t -init:
[root@oracle-rac2 ~]# crsctl start has
CRS-4123: Oracle High Availability Services has been started.
[root@oracle-rac2 ~]# crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               STABLE
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE                               STABLE
ora.crf
      1        ONLINE  ONLINE       oracle-rac2              STABLE
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  OFFLINE      oracle-rac2              STARTING
ora.cssdmonitor
      1        ONLINE  ONLINE       oracle-rac2              STABLE
ora.ctssd
      1        ONLINE  OFFLINE                               STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.evmd
      1        ONLINE  INTERMEDIATE oracle-rac2              STABLE
ora.gipcd
      1        ONLINE  ONLINE       oracle-rac2              STABLE
ora.gpnpd
      1        ONLINE  ONLINE       oracle-rac2              STABLE
ora.mdnsd
      1        ONLINE  ONLINE       oracle-rac2              STABLE
ora.storage
      1        ONLINE  OFFLINE                               STABLE
--------------------------------------------------------------------------------
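In a listing this long it is easy to miss the one resource that is holding up the stack. As a rough sketch, the output can be filtered down to the resources whose target is ONLINE but whose state is not. The function name `crs_not_online` is made up for illustration, and the parsing assumes the tabular layout shown above (a resource name on its own line, followed by numbered instance lines):

```shell
#!/bin/sh
# crs_not_online: read "crsctl stat res -t -init" output on stdin and print
#   <resource name> <state>
# for every instance whose Target is ONLINE but whose State is not ONLINE.
# Assumes the tabular layout shown above.
crs_not_online() {
  awk 'NF == 1 && $1 ~ /^ora\./ { name = $1; next }   # resource name line
       $1 ~ /^[0-9]+$/ && $2 == "ONLINE" && $3 != "ONLINE" {
         print name, $3                                # e.g. ora.cssd OFFLINE
       }'
}

# Usage:
#   crsctl stat res -t -init | crs_not_online
```

Resources whose target is OFFLINE, such as ora.diskmon above, are skipped on purpose, since they are not expected to be running.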
Running crsctl stat res -t -init several times shows that startup stays stuck at the ora.cssd service, so check the log:
By default, traceroute probes with UDP packets.
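Since a cloud security group can allow one protocol and block another, it can help to repeat the probe with each protocol and compare the results. The sketch below only prints the commands instead of running them. The `-I` (ICMP) and `-T` (TCP) options are those of the Linux traceroute shipped with CentOS 7 and normally need root. The function name `probe_cmds` and the default TCP port 22 are assumptions for illustration, not taken from this environment:

```shell
#!/bin/sh
# probe_cmds: print traceroute commands that probe the same host with
# UDP (the default), ICMP, and TCP, so the results can be compared.
#   $1 = target host or IP (e.g. the other node's private address)
#   $2 = TCP port for the -T probe (hypothetical default: 22)
probe_cmds() {
  host=$1
  port=${2:-22}
  echo "traceroute -n $host"                # UDP probes (default)
  echo "traceroute -n -I $host"             # ICMP echo probes
  echo "traceroute -n -T -p $port $host"    # TCP SYN probes to $port
}

# Usage: review the printed commands, then run them as root, e.g.
#   probe_cmds <private-ip-of-other-node> | sh
```

If the ICMP or TCP probe reaches the other node but the UDP probe does not, that suggests UDP traffic is being filtered somewhere along the path.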