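Analysis of Resolving an ASM Storage High-Availability Failure; Changing ASM Disk…

Date: 2019-10-06 | Editor: 威尼斯在线平台

Original title: Waiting for the Clouds to Part: Analysis of Resolving an ASM Storage High-Availability Failure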

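Today they wanted to test IBM's SVC storage synchronization, which required removing all the disks from the server and adding them back. Doing so changes the disk names, so the disk names in ASM had to be changed as well. The process is actually quite simple: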


1. Change the ASM instance's disk discovery path parameter, asm_diskstring, with the following command:

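Author | Jiang Jinsong (姜劲松), Oracle technical expert in the Expert Support Department at Enmotech (云和恩墨), certified as Oracle OCP, MySQL OCP, and RHCE. He has long served customers in the mobile carrier industry and specializes in Oracle performance tuning, fault diagnosis, and special recovery. He has 23 years of IT experience and is a senior expert in databases and in system hardware and software integration.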

alter system set asm_diskstring='/dev/rhdisk*';
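As a rough illustration of what the new discovery string selects, the sketch below approximates it with Python's `fnmatch` wildcards. This is an assumption for illustration only: ASM performs disk discovery itself, and the device names in `candidates` are hypothetical, not taken from the system described here.

```python
from fnmatch import fnmatchcase


def discovered(paths, diskstring):
    """Return the paths matched by a comma-separated discovery string.

    Approximation only: fnmatch-style wildcards stand in for ASM's own
    discovery logic, which is not reproduced here.
    """
    patterns = [p.strip() for p in diskstring.split(",") if p.strip()]
    return [p for p in paths if any(fnmatchcase(p, pat) for pat in patterns)]


# Hypothetical device names, for illustration only.
candidates = ["/dev/rhdisk2", "/dev/rhdisk3", "/dev/hdisk2", "/dev/rlv01"]
print(discovered(candidates, "/dev/rhdisk*"))  # ['/dev/rhdisk2', '/dev/rhdisk3']
```

A narrower string such as '/dev/rhdisk*' leaves fewer devices to examine than '/dev/*', which is one reason to set it explicitly.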

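He has developed, deployed, and operated marketing and billing systems serving millions of users; led operations for electric-power marketing systems serving tens of millions of users across 11 provinces; and designed and implemented more than 10 new-energy SaaS systems on Alibaba Cloud. He has worked as a development engineer, project manager, technical manager, project director, operations lead, and cloud platform architect.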

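2. Shut down the whole cluster and wait until the disks have been removed and re-added, then change the following attributes. Mine is a RAC environment, so the operations below must be run on every node.

Preface

Change the disk owner and group:

Oracle ASM stands for Automatic Storage Management, a feature Oracle introduced in Oracle 10g. It is a volume manager provided by Oracle as a replacement for the LVM supplied by the operating system, and it supports both single-instance configurations and multi-instance configurations such as RAC.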

[rac11g2@root]# chown grid:asmadmin /dev/rhdisk[2-4]

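ASM is a great convenience for Oracle database administrators: it manages disk groups automatically and provides data redundancy and optimization. It also offers a rich set of management and disaster-tolerance features; with proper configuration it can deliver efficient storage disaster tolerance at the database layer.

Set the disk permissions to 660:

This case walks through the analysis and resolution of a problem at a customer site where ASM storage disaster tolerance failed to meet its expected goal, as a way of discussing how to approach unexpected problems.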

[rac11g2@root]# chmod 660 /dev/rhdisk[2-4]

01 Problem Summary

Set the disk sharing attribute:

Background:

[rac11g2@root]# lsattr -El hdisk2|grep reserve_policy
reserve_policy  no_reserve                                          Reserve Policy                          True

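1. Oracle 12.2 RAC + ASM in Normal Redundancy mode. Database storage uses a dual-array redundant architecture to avoid service interruption and data loss from a single storage failure;

2. Each ASM disk group is designed with two failure groups (FGs): all disks of one FG reside on storage array #1, and all disks of the other FG reside on storage array #2;

3. The expectation is that if either storage array fails or loses power, the database instances are unaffected, no data is lost, and data resynchronizes automatically once the failed array is back online.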

[rac11g2@root]# chdev -l hdisk2 -a reserve_policy=no_reserve
[rac11g2@root]# chdev -l hdisk3 -a reserve_policy=no_reserve
[rac11g2@root]# chdev -l hdisk4 -a reserve_policy=no_reserve
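Before starting the cluster it can help to confirm that the ownership and mode changes actually took effect on every node. The following is a minimal sketch, not part of the original procedure: `check_disk` is a hypothetical helper name, and it covers only owner, group, and mode. The `reserve_policy` attribute is AIX-specific and still has to be checked with `lsattr` as shown above.

```python
import grp
import os
import pwd
import stat


def _name(lookup, ident):
    """Resolve a numeric uid/gid to a name, falling back to the number."""
    try:
        return lookup(ident)[0]
    except KeyError:
        return str(ident)


def check_disk(path, owner=None, group=None, mode=0o660):
    """Return a list of problems for one device path (empty if none).

    Owner and group are checked only when a value is supplied.
    """
    st = os.stat(path)
    problems = []
    actual_mode = stat.S_IMODE(st.st_mode)
    if actual_mode != mode:
        problems.append(f"{path}: mode {actual_mode:o}, expected {mode:o}")
    actual_owner = _name(pwd.getpwuid, st.st_uid)
    if owner is not None and actual_owner != owner:
        problems.append(f"{path}: owner {actual_owner}, expected {owner}")
    actual_group = _name(grp.getgrgid, st.st_gid)
    if group is not None and actual_group != group:
        problems.append(f"{path}: group {actual_group}, expected {group}")
    return problems
```

For the disks in this example one would call `check_disk("/dev/rhdisk2", "grid", "asmadmin")` on each node, and likewise for `rhdisk3` and `rhdisk4`; an empty list means the three checked properties match.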

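In actual high-availability testing, unplugging one storage array produced the following behavior:

3. Now the cluster can be started:

1. The CRS cluster is unaffected; the OCR and voting disks fail over automatically;

2. The database control file and redo logs hit I/O errors, blocking core processes such as LGWR and CKPT for a long time; Oracle then restarted the DB instance on its own (one or both instances), after which the database returned to normal;

3. Database data is intact, and automatic resynchronization works normally once the failed storage is back online.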

[rac11g1@root]# crsctl start cluster -all

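02 Test Process

Note: I once forgot to set the disk permissions to 660, and the database would not start. An ORA-00600 error appeared in the alert log, which gave me a scare, but the log makes it fairly easy to see that this is a permissions problem. After correcting the disk permissions and restarting, everything was fine:

1) Test type 1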

Sweep [inc][393409]: completed
Sweep [inc2][393409]: completed
NOTE: Loaded library: System 
ORA-15025: could not open disk "/dev/rhdisk4"
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 13: Permission denied
Additional information: 11
SUCCESS: diskgroup DATA was mounted
Errors in file /soft/Oracle/diag/rdbms/nint/nint1/trace/nint1_ckpt_19136654.trc  (incident=409793):
ORA-00600: internal error code, arguments: [kfioTranslateIO03], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /soft/oracle/diag/rdbms/nint/nint1/incident/incdir_409793/nint1_ckpt_19136654_i409793.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
NOTE: dependency between database nint and diskgroup resource ora.DATA.dg is established
ERROR: unrecoverable error ORA-600 raised in ASM I/O path; terminating process 19136654 
Dumping diagnostic data in directory=[cdmp_20120302172201], requested by (instance=1, osid=19136654 (CKPT)), summary=[incident=409793].
Fri Mar 02 17:22:01 2012
PMON (ospid: 14156014): terminating the instance due to error 469
System state dump requested by (instance=1, osid=14156014 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /soft/oracle/diag/rdbms/nint/nint1/trace/nint1_diag_21168306.trc
Fri Mar 02 17:22:02 2012
ORA-1092 : opitsk aborting process
Fri Mar 02 17:22:02 2012
License high water mark = 1
Instance terminated by PMON, pid = 14156014
USER (ospid: 15335672): terminating the instance
Instance terminated by USER, pid = 15335672
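The ORA-00600 in a log like this is a consequence; the cause is the ORA-15025 entry and the OS error that follows it. A minimal sketch of pulling those pairs out of an alert log is shown below. The function name and regular expressions are illustrative assumptions based only on the line shapes visible in the excerpt above, not on any documented log format.

```python
import re

DISK_RE = re.compile(r'^ORA-15025: could not open disk "([^"]+)"')
OS_ERR_RE = re.compile(r"Error: (\d+): (.+)$")


def disk_open_failures(lines):
    """Return (disk_path, os_errno, os_message) for each ORA-15025 entry.

    The OS error comes from the first 'Error: N: text' line after the
    ORA-15025 line; it is None if no such line appears before the next entry.
    """
    results = []
    current = None
    for raw in lines:
        line = raw.rstrip()
        m = DISK_RE.match(line)
        if m:
            if current is not None:
                results.append((current, None, None))
            current = m.group(1)
            continue
        m = OS_ERR_RE.search(line)
        if m and current is not None:
            results.append((current, int(m.group(1)), m.group(2)))
            current = None
    if current is not None:
        results.append((current, None, None))
    return results


sample = [
    'ORA-15025: could not open disk "/dev/rhdisk4"',
    "ORA-27041: unable to open file",
    "IBM AIX RISC System/6000 Error: 13: Permission denied",
    "Additional information: 11",
]
print(disk_open_failures(sample))  # [('/dev/rhdisk4', 13, 'Permission denied')]
```

An errno of 13 (permission denied) on a disk path points straight back at the owner and mode settings from step 2.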

1. Storage cable pull completed: 16:56:05

2. Instance went down: 16:57:37-16:57:39
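The gap between the two timestamps above can be worked out directly. This is a small arithmetic sketch; `seconds_between` is a hypothetical helper, and the result is the time from the end of the cable pull to the instance going down, which is not necessarily the same quantity as any internal Oracle timeout.

```python
from datetime import datetime


def seconds_between(start, end, fmt="%H:%M:%S"):
    """Whole seconds from start to end, both given as same-day clock times."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


print(seconds_between("16:56:05", "16:57:37"))  # 92
print(seconds_between("16:56:05", "16:57:39"))  # 94
```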

[Excerpt]

ASM log:

First, you can try to check the OS drive ownership , permission and reserve_policy attribute on all nodes. Then restart the ASM instance.
  1)Make sure that the hdisk# is owned by the OS user who installed the ASM Oracle Home ... and that the disk is mounted correctly (with the correct owner) 
  2)Make sure that the permissions are set correctly at the disk level ... 660 is normal ... but if there are problems use 777 as a test 
  ls -l /dev/rhdisk3 output:
  For 10gR2/11gR1 like:  crw-rw----  oracle:oinstall /dev/rhdisk3 
  For 11gR2 like:        crw-rw----  grid:asmadmin /dev/rhdisk3

2018-08-01T16:57:41.712885+08:00

NOTE: ASM client node11:node1:node1-rac disconnected unexpectedly

  How to change the drive ownership and permission ?
  For 10gR2/11gR1:
    # chown -R oracle:oinstall /dev/rhdisk[3-10]
    # chmod -R 660 /dev/rhdisk[3-10]
  For 11gR2:
    # chown -R grid:asmadmin /dev/rhdisk[3-10]
    # chmod -R 660 /dev/rhdisk[3-10]

DB:

  3)Make sure that the reserve_policy attribute of the needed hdisk# is no_reserve or no on all nodes.
    chdev -l hdisk# -a reserve_policy=no_reserve

2018-08-01T16:57:45.214182+08:00

Instance terminated by USER, pid = 10158

2018-08-01T16:57:36.704927+08:00

Errors in file /oracle/diag/rdbms/node1/node11/trace/node11_ckpt_10158.trc:

ORA-00206: error in writing (block 3, # blocks 1) of control file

ORA-00202: control file: '+DG_DATA_FAB/NODE1/CONTROLFILE/current.265.981318275'

ORA-15081: failed to submit an I/O operation to a disk

ORA-15081: failed to submit an I/O operation to a disk

ORA-15064: communication failure with ASM instance

2018-08-01T16:57:36.705340+08:00

Errors in file /oracle/diag/rdbms/node1/node11/trace/node11_ckpt_10158.trc:

ORA-00221: error on write to control file

ORA-00206: error in writing (block 3, # blocks 1) of control file

ORA-00202: control file: '+DG_DATA_FAB/NODE1/CONTROLFILE/current.265.981318275'

ORA-15081: failed to submit an I/O operation to a disk

ORA-15081: failed to submit an I/O operation to a disk

ORA-15064: communication failure with ASM instance

If it also fail by the first step, you may try to set the Oracle ASM parameter ASM_DISKSTRING to /dev/* or /dev/rhdisk*. The Step is:
1)Backup the ASM instance pfile(Parameter File) or spfile (Server Parameter File).
  Most in the $ORACLE_HOME/dbs. pfile name like is init+ASM1.ora, you can use cp command to backup it .and vi the content. 
  You to create spfile to pfile for backup,if use spfile. 
2)set ASM_DISKSTRING parameter
  use pfile ENV:
    Add or Edit "ASM_DISKSTRING" line to *.ASM_DISKSTRING='/dev/rhdisk*' in pfile. Startup the ASM instance using the pfile.
  
  use spfile ENV:
    $ ORACLE_SID=+ASM1;export ORACLE_SID
    
    $ sqlplus "/ as sysdba"
    or
    $ sqlplus "/ as sysasm"
    
    SQL> startup
    SQL> alter system set asm_diskstring='/dev/rhdisk*';
    SQL> select group_number,disk_number,path from v$asm_disk; 
        --You can get some disk info and the most disk's group_number  is not 0.

The Oracle CKPT process blocks on control file I/O errors, which causes the instance to restart itself. In every test, instance termination began after a 70 s timeout.

If ASM_DISKSTRING is NOT SET ... then the following default is used

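We suspected that the ASM instance was too slow to offline the disks and hoped to solve the problem by raising the CKPT blocking-time threshold, but found no corresponding parameter.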

    Default ASM_DISKSTRING per OS

Since the problem lies with the control file, could it be that the DATA disk group has so many disks that offline detection takes a long time?

    Operating System                       Default Search String
    =======================================
    Solaris (32/64 bit)                        /dev/rdsk/*
    Windows NT/XP                          \\.\orcldisk*
    Linux (32/64 bit)                          /dev/raw/*

We tried moving the control file to the REDO disk group, which has fewer disks, but the error still occurred at the control file:

    LINUX (ASMLIB)                         ORCL:*
    LINUX (ASMLIB)                         /dev/oracleasm/disks/* ( as a workaround )

systemstate dump file:

----- Beginning of Customized Incident Dump(s) -----

Process CKPT (ospid: 4693) is waiting for event 'control file sequential read'.

Process O009 (ospid: 5080) is the blocker of the wait chain.

===[ Wait Chain ]===

CKPT (ospid: 4693) waits for event 'control file sequential read'.

LGWR (ospid: 4691) waits for event 'KSV master wait'.

O009 (ospid: 5080) waits for event 'ASM file metadata operation'.
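The wait chain reads top to bottom: each process waits on the one listed after it, so the last entry is the one everything else is stuck behind. A minimal sketch of extracting that from the dump text follows; the regular expression is an assumption based only on the three "waits for event" lines shown above.

```python
import re

WAIT_RE = re.compile(r"^(\w+) \(ospid: (\d+)\) waits for event '([^']+)'\.?$")


def parse_wait_chain(text):
    """Return [(process, ospid, event), ...] in the order listed in the dump."""
    chain = []
    for line in text.splitlines():
        m = WAIT_RE.match(line.strip())
        if m:
            chain.append((m.group(1), int(m.group(2)), m.group(3)))
    return chain


dump = """\
Process O009 (ospid: 5080) is the blocker of the wait chain.
CKPT (ospid: 4693) waits for event 'control file sequential read'.
LGWR (ospid: 4691) waits for event 'KSV master wait'.
O009 (ospid: 5080) waits for event 'ASM file metadata operation'.
"""
chain = parse_wait_chain(dump)
print(" -> ".join(name for name, _, _ in chain))  # CKPT -> LGWR -> O009
print(chain[-1])  # ('O009', 5080, 'ASM file metadata operation')
```

Here the end of the chain is O009, waiting on an ASM file metadata operation, which matches the blocker named in the dump header.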

node1_lgwr_4691.trc

----- END DDE Actions Dump (total 0 csec) -----

ORA-15080: synchronous I/O operation failed to write block 1031 of disk 4 in disk group DG_REDO_MOD

ORA-27063: number of bytes read/written is incorrect

HPUX-ia64 Error: 11: Resource temporarily unavailable

Additional information: 4294967295

Additional information: 1024

NOTE: process _lgwr_node1 (4691) initiating offline of disk 4.4042263303 (DG_REDO_MOD_0004) with mask 0x7e in group 3 (DG_REDO_MOD) with client assisting

    HPUX                                       /dev/rdsk/* 
    HP-UX(Tru 64)                            /dev/rdisk/*
    AIX                                            /dev/*

2) Test type 2


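We tried multiplexing the control file:

1. Each storage array presents one 10 GB LUN to the servers;

2. Create one disk group on each LUN and multiplex the control file across the two disk groups.

We again simulated the failure of one storage array and found that the control file still became unreadable and unwritable, and the instance restarted.

The Oracle documentation shows that only ASM failure groups can provide high availability here, because every control file must be online; otherwise the instance is aborted outright.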

Multiplex Control Files on Different Disks

Every Oracle Database should have at least two control files, each stored on a different physical disk. If a control file is damaged due to a disk failure, the associated instance must be shut down. Once the disk drive is repaired, the damaged control file can be restored using the intact copy of the control file from the other disk and the instance can be restarted. In this case, no media recovery is required.

The behavior of multiplexed control files is this:

The database writes to all filenames listed for the initialization parameter CONTROL_FILES in the database initialization parameter file.

The database reads only the first file listed in the CONTROL_FILES parameter during database operation.

If any of the control files become unavailable during database operation, the instance becomes inoperable and should be aborted.

Note:

Oracle strongly recommends that your database has a minimum of two control files and that they are located on separate physical disks.

So this multiplexing approach does nothing for control file high availability.
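The difference between the two approaches can be stated as a toy model. This is only a sketch of the rules as quoted and as expected in this case, not of Oracle's implementation: a multiplexed control file set needs every copy available, while a file mirrored across two failure groups is expected to survive as long as one failure group remains online.

```python
def multiplexed_control_files_ok(copy_online):
    """Multiplexed control files: every listed copy must stay available."""
    return all(copy_online)


def mirrored_file_ok(failgroup_online):
    """Normal-redundancy expectation: one surviving failure group suffices."""
    return any(failgroup_online)


one_array_down = [True, False]  # array #1 up, array #2 unplugged
print(multiplexed_control_files_ok(one_array_down))  # False
print(mirrored_file_ok(one_array_down))              # True
```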

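3) Test type 3

Store the control file on RPT storage to avoid the blocking caused by control file synchronization.

Some test runs succeeded, but in others an error during **REDO LOG** reads or writes caused the DB to restart.

4) Test type 4

Create two independent disk groups, each on a different storage array, and multiplex the two members of each redo group across the two disk groups.

The failover test succeeded: the ASM instance dismounts the failed disk group, and the database is completely unaffected.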

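From the tests above we observed the following:

1. ASM failure groups handle database data files without any problem, and failover works.

2. When disks holding the control file or redo log files in a Normal redundancy disk group are taken offline, the instance blocks for a long time and then restarts itself. It runs normally after the restart, and data integrity is unaffected.

Across many repeated tests the problem appeared at random, so we strongly suspected an Oracle bug. MOS has a similar entry, "Bug 23179662 - ASM B-slave Process Blocking Fatal background Process like LGWR producing ORA-29771 (Doc ID 23179662.8)", but MOS states that the 20180417 PSU already fixes this bug, and the workaround is simply to restart the instance.

After a full week without resolving the problem, we adopted the following temporary solution: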

