Oracle 11g DG新特性--Automatic block repair

0    119    1

Tags:

👉 本文共约5957个字,系统预计阅读时间或需23分钟。

在Oracle 的世界中,corrupted block一般都是块分裂。(不是快分裂,呵呵) 一般是块头和块尾的TRAIL NUMBER不一致才导致的块分裂,这样的块分裂是逻辑损坏。

在强大的11GR2 中,Oracle 可以自动去检测CORRUPTED BLOCK而去修复。

在手工恢复的年代,如果遇到坏块,那需要RMAN 恢复对应的ARCHIVED LOGS去恢复;或者应用flashback logs去恢复;或者如果坏块所在的OBJECT是索引,那可以重新REBUILD ONLINE。那自动修复BLOCK是如何实现的?和常规的手工恢复是一样的?

先看下NEW FEATURE中的描述:

Automatic block repair allows corrupt blocks on the primary database or physical standby database to be automatically repaired, as soon as they are detected, by

transferring good blocks from the other destination. In addition, RECOVER BLOCK is enhanced to restore blocks from a physical standby database. The physical standby database must be in real-time query mode.

This feature reduces time when production data cannot be accessed, due to block corruption, by automatically repairing the corruptions as soon as they are detected in real-time using good blocks from a physical standby database. This reduces block recovery time by using up-to-date good blocks from a real-time, synchronized physical standby database as opposed to disk or tape backups or flashback logs.

主要有两点:

1) 这个自动修复是需要物理DG的配合才能完成的,而且要求DG建立在实时查询模式;

2) 该特性由于直接从DG中找到对应的BLOCK来恢复,而直接跳过RESTORE 归档日志或者FLASHBACK 日志,当然更快/更强大;

对物理DG有些限制:

1) 物理DG必须运行在实时查询模式,而该模式需要ACTIVE DATA GUARD Lience;

2) 物理DG 最好运行在实时应用模式;因为任何模式的DG保护模式都是支持自动块修复,但是修复主库中损坏块的效率大大依赖DG中没有损坏块的版本。

这段话说起来有点绕,还是举个简单的例子。

一个主库中CORRUPT块的SCN 为 1005,

而备库中对应块的SCN为1001,那ORACLE为修复该BLOCK,必须要让备库中的块SCN到达1005才能REPAIR。如果备库同步的时间越晚,那需要同步的时间就越多,修复的效率就越低。

反过来,如备库的块损坏,那ORACLE可以从主库中把对应的块拷贝过来,这需要在备库中的参数要有效的设置:

FAL_SERVER ,主库的NET SERICE NAME;

LOG_ARCHIVE_CONFIG

修复好后,会在库中alert日志写上日志;

这个自动修复,不会100%的成功,如果不能修复,那会报ORA-1578 的错误,

Error: ORA-1578

Text: ORACLE data block corrupted (file # %s, block # %s)

那这时需要DBA的介入,或者REBUILD或者跳过坏块导入/导出;

在oracle 11g DataGuard 中引入Automatic Block Media Repair,自动的块恢复

如果在主库发现有损坏的数据块,该特性将通过后台进程ABMR自动将物理备库(physical standby)上的好的数据块传输到主库(primary database)上恢复

同样的若物理备库上发现数据块损坏那么也可以利用到以上特性来修复

此特性默认是开启的

SQL> select * from v$version where rownum <=1;

BANNER

--------------------------------------------------------------------------------

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production

SQL> set linesize 132

SQL> column name format a30

SQL> column value format a25

SQL> select

2 x.ksppinm name,

3 y.ksppstvl value,

4 y.ksppstdf isdefault,

5 decode(bitand(y.ksppstvf,7),1,'MODIFIED',4,'SYSTEM_MOD','FALSE') ismod,

6 decode(bitand(y.ksppstvf,2),2,'TRUE','FALSE') isadj

7 from

8 sys.x$ksppi x,

9 sys.x$ksppcv y

10 where

11 x.inst_id = userenv('Instance') and

12 y.inst_id = userenv('Instance') and

13 x.indx = y.indx and

14 x.ksppinm like '%_&par%'

15 order by

16 translate(x.ksppinm, ' _', ' ')

17 /

Enter value for par: auto_bmr

old 14: x.ksppinm like '%_&par%'

new 14: x.ksppinm like '%_auto_bmr%'

NAME VALUE ISDEFAULT ISMOD ISADJ


_auto_bmr enabled TRUE FALSE FALSE

_auto_bmr_bg_time 3600 TRUE FALSE FALSE

_auto_bmr_fc_time 60 TRUE FALSE FALSE

_auto_bmr_pub_timeout 10 TRUE FALSE FALSE

_auto_bmr_req_timeout 60 TRUE FALSE FALSE

_auto_bmr_sess_threshold 30 TRUE FALSE FALSE

_auto_bmr_sys_threshold 100 TRUE FALSE FALSE

7 rows selected.

下面我们做一个实验:

主库上创建实验表

SQL> create table goolen as select * from dba_objects;

Table created.

SQL> select count(*) from goolen;

COUNT(*)

----------

74432

SQL> SELECT

2 dbms_rowid.rowid_relative_fno(rowid) REL_FNO,

3 dbms_rowid.rowid_block_number(rowid) BLOCKNO

4 from &table_name where rownum <=10;

Enter value for table_name: goolen

old 4: from &table_name where rownum <=10

new 4: from goolen where rownum <=10

REL_FNO BLOCKNO


​ 1 71969

​ 1 71969

​ 1 71969

​ 1 71969

​ 1 71969

​ 1 71969

​ 1 71969

​ 1 71969

​ 1 71969

​ 1 71969

10 rows selected.

SQL> select dest_name,status,error,target,process from v$archive_dest where substr(dest_name,-1) in (1,2);

DEST_NAME STATUS ERROR TARGET PROCESS


LOG_ARCHIVE_DEST_1 VALID PRIMARY ARCH

LOG_ARCHIVE_DEST_2 VALID STANDBY LGWR

+++备库查看此表是否已经同步到备库

SQL> select count(*) from goolen;

COUNT(*)

----------

本人提供Oracle(OCP、OCM)、MySQL(OCP)、PostgreSQL(PGCA、PGCE、PGCM)等数据库的培训和考证业务,私聊QQ646634621或微信db_bao,谢谢!

74432

+++我们看到备库已经把这张表同步过来了

+++下面我们在主库上人为破坏一个块

RMAN> recover datafile 1 block 71969 clear;

Starting recover at 2013-12-27 11:17:02

using channel ORA_DISK_1

Finished recover at 2013-12-27 11:17:03

+++查询一下

SQL> select count(*) from goolen;

COUNT(*)

----------

74432

++++然后我们查看alter日志中的信息:

Corrupt block relative dba: 0x00411921 (file 1, block 71969)

Bad header found during multiblock buffer read

Data in bad block:

type: 6 format: 2 rdba: 0x00411921

last change scn: 0x0002.1a0a61e7 seq: 0x1 flg: 0x04

spare1: 0x0 spare2: 0x0 spare3: 0x99

consistency value in tail: 0x61e70601

check value in block header: 0xa7f1

computed block checksum: 0x8723

Reading datafile '/tol/oradata/test11g/system01.dbf' for corruption at rdba: 0x00411921 (file 1, block 71969)

Reread (file 1, block 71969) found same corrupt data

Starting background process ABMR

Fri Dec 27 11:17:22 2013

ABMR started with pid=116, OS id=4266

Auto BMR service is active.

Requesting Auto BMR for (file# 1, block# 71969)

Waiting Auto BMR response for (file# 1, block# 71969)

Auto BMR successful

+++在备库模拟

++++在备库手工损坏一个block

RMAN> recover datafile 1 block 71969 clear;

Starting recover at 2013-12-27 11:17:52

using target database control file instead of recovery catalog

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=1424 device type=DISK

Finished recover at 2013-12-27 11:18:04

+++dbv检测

[oracle@dg_136 ~]$ dbv file=/tol/oradata/test11g/system01.dbf

DBVERIFY: Release 11.2.0.1.0 - Production on Fri Dec 27 11:20:21 2013

Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.

DBVERIFY - Verification starting : FILE = /tol/oradata/test11g/system01.dbf

Page 71969 is marked corrupt

Corrupt block relative dba: 0x00411921 (file 1, block 71969)

Bad check value found during dbv:

Data in bad block:

type: 6 format: 2 rdba: 0x00411921

last change scn: 0x0002.1a0a61e7 seq: 0x1 flg: 0x04

spare1: 0x0 spare2: 0x0 spare3: 0x0

consistency value in tail: 0x61e70601

check value in block header: 0x204b

computed block checksum: 0x82ae

DBVERIFY - Verification complete

Total Pages Examined : 256000

Total Pages Processed (Data) : 227158

Total Pages Failing (Data) : 0

Total Pages Processed (Index): 14076

Total Pages Failing (Index): 0

Total Pages Processed (Other): 3459

Total Pages Processed (Seg) : 1

Total Pages Failing (Seg) : 0

Total Pages Empty : 11306

Total Pages Marked Corrupt : 1

Total Pages Influx : 0

Total Pages Encrypted : 0

Highest block SCN : 436888803 (2.436888803)

++++查询这张表

SQL> select count(*) from goolen;

COUNT(*)

----------

74432

++++但是在备库的alter日志里面没有发现相关的信息

ORACLE 11g ACTIVE Data Guard 不仅提供了physical standby 可以open read-only ,还提供了一个无论是主备库发现currupt block的情况下,在满足一定条件下,透明的有ABMR后台进程自动恢复该块从另一方的可用的block.下面附上一段官方解释,开始我的测试。

Automatic Block Repair

Automatic block repair allows corrupt data blocks to be automatically repaired as soon as the corruption is detected. This feature reduces the amount of time that data is inaccessible due to block corruption. This reduces block recovery time by using up-to-date good blocks in real-time, as opposed to retrieving blocks from disk or tape backups, or from Flashback logs.

Note:
Oracle Active Data Guard is packaged as a separate database option for Oracle Enterprise Edition. It requires a license for production database and all of the physical standby databases that are used for the Oracle Active Data Guard option.

if
A corrupt data block is discovered on a primary database
then
A physical standby database operating in real-time query mode can be used to repair corrupt data blocks in a primary database. If possible, any corrupt data block encountered when a primary database is accessed will be automatically replaced with an uncorrupted copy of that block from a physical standby database operating in real-time query mode. An ORA-1578 error is returned when automatic repair is not possible.
if
A corrupt data block is discovered on a physical standby database
then
The server attempts to automatically repair the corruption by obtaining a copy of the block from the primary database if the following database initialization parameters are configured on the standby database:

Configure the LOG_ARCHIVE_CONFIG parameter with a DG_CONFIG list
Configure a LOG_ARCHIVE_DEST_n parameter for the primary database

You can also manually repair a corrupted data block by using the RMAN RECOVER BLOCK command.

read more

PORA92 is primary site, SORA245 is standby site

搭建测试环境11G ADG 部门省略。

1,测试主库的currupt block auto repair

alert log

ALTER SYSTEM: Flushing buffer cache
2014-02-20 14:22:23.779000 +08:00
Hex dump of (file 5, block 131) in trace file /oracle/diag/rdbms/pora92/pora92/trace/pora92_ora_21197.trc
Corrupt block relative dba: 0×01400083 (file 5, block 131)
Completely zero block found during buffer read
Reading datafile ‘/oradata/pora92/pora92/tbs_test01.dbf’ for corruption at rdba: 0×01400083 (file 5, block 131)
Reread (file 5, block 131) found same corrupt data (no logical check)
Starting background process ABMR
ABMR started with pid=32, OS id=21199
Automatic block media recovery service is active.
Automatic block media recovery requested for (file# 5, block# 131)
Automatic block media recovery successful for (file# 5, block# 131)
Automatic block media recovery successful for (file# 5, block# 131)

2,测试备库的currupt block auto repair

– on standby site

note:
now RMAN helped me to prove that there is a block corruption as the backup is failed with “ORA-19566: exceeded limit of 0 corrupt blocks”
Also, now Oracle reported the error in alert log as following

alert log

2014-02-20 15:22:53.353000 +08:00
Hex dump of (file 5, block 131) in trace file /oracle/diag/rdbms/sora245/sora245/trace/sora245_ora_2318.trc
Corrupt block relative dba: 0×01400083 (file 5, block 131)
Completely zero block found during backing up datafile
Reread of blocknum=131, file=/oradata/sora245/tbs_test01.dbf. found same corrupt data
Reread of blocknum=131, file=/oradata/sora245/tbs_test01.dbf. found same corrupt data
Reread of blocknum=131, file=/oradata/sora245/tbs_test01.dbf. found same corrupt data
Reread of blocknum=131, file=/oradata/sora245/tbs_test01.dbf. found same corrupt data
Reread of blocknum=131, file=/oradata/sora245/tbs_test01.dbf. found same corrupt data
Deleted Oracle managed file /flrvarea/SORA245/SORA245/backupset/2014_02_20/o1_mf_nnndf_TAG20140220T1522539jccbf9x.bkp

sys@sora245>select * from anbob.testbmr;
ID
——————–
1

sys@sora245>alter system flush buffer_cache;
System altered.

sys@sora245>select dbms_rowid.ROWID_RELATIVE_FNO(rowid), dbms_rowid.rowid_block_number(rowid) from anbob.testbmr;
DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID) DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)
———————————— ————————————
5 131

sys@sora245>select * from anbob.testbmr;
ID
——————–
1

sys@sora245>select * from V$DATABASE_BLOCK_CORRUPTION;
FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTIO
——————– ——————– ——————– ——————– ———
5 131 1 0 ALL ZERO

NOTE:
这里出了个小插曲, 可以查询select,但是块提示还是curuppted,此刻再次rman backup 还是报此块损坏,exp 逻辑正常,而且alert 日志也没有出现abmr的日志输出。

下面查看是不是禁用了ABMR?_auto_bmr 隐藏控制。和log_archive_config 参数archive_log_dest_N配置也是正常的

后来做了 dump block trace发现物理块上是空块,多次flush buffer cache的操作查询仍然可以,那只有来自buffer cache,下面确认的确是。

TIP:
mrec state means block in media recovery mode.

还曾经怀疑备库的noformated block不可以被automatic repair,其实不是的。解决上面的问题是在主库对该条记录进行更新,如果发于standby 的block 是空或zero all,会自动re-formated.

后做了data block trace 发现块是更新了,dbv, rman 也检查正常通过了。下面来看一下buffer cache中flush 不掉的mrec block,通过重启redo apply清除。

对于上面的standby 端的automatic block repair 是失败的,因为有BH中的mrec状态 block,虽然后来通过primary site的块更新覆盖的standby 的block,在下一篇我会重新测试,如果bh 中没有mrec的block 是不是可以触发ABMR.

上一篇 中,primary site 的ABMR 测试是成功的,而standby site 的ABMR没有成功, 后来是怀疑buffer header中有mrec 的block, 现在我接着测试如果buffer header 中没有此类block,ABMR是否成功?以前用bbed 标记corrupt block的方式, 如果有corrupt block时rman backup 又会怎么样?实验开始..

alert log writen the following :

Hex dump of (file 5, block 131) in trace file /oracle/diag/rdbms/sora245/sora245/trace/sora245_ora_10022.trc
Corrupt block relative dba: 0×01400083 (file 5, block 131)
Completely zero block found during multiblock buffer read
Reading datafile ‘/oradata/sora245/tbs_test01.dbf’ for corruption at rdba: 0×01400083 (file 5, block 131)
Reread (file 5, block 131) found same corrupt data (no logical check)
Automatic block media recovery requested for (file# 5, block# 131)
2014-02-21 13:41:02.134000 +08:00
Automatic block media recovery requested for (file# 5, block# 131)
Automatic block media recovery successful for (file# 5, block# 131)
Automatic block media recovery successful for (file# 5, block# 131)

TIP:
可以看到这次用了同样的dd 破坏造成的物理corrupt block,buffer header中没有mrec 状态的block 时是可以成功自动repair的(select时).

BBED 构造逻辑currup block及rman 备份测试

TIP:
可以看到bbed 只是修改了block的几个标志位造成corrupt block, rman 备份时会出错,通过select 可以触发ABMR,也会在primary site ,block 发现更新时覆盖standby site的corrupt block,此时alert 无记录。

标签:

头像

小麦苗

学习或考证,均可联系麦老师,请加微信db_bao或QQ646634621

您可能还喜欢...

发表回复

嘿,我是小麦,需要帮助随时找我哦
  • 18509239930
  • 个人微信

  • 麦老师QQ聊天
  • 个人邮箱
  • 点击加入QQ群
  • 个人微店

  • 回到顶部
返回顶部