I am not superstitious,
but when a database crashes due to a storage failure on a Fri, 13th
it is hard to avoid the jokes about bad luck :o)
This is a RAC 11.2.0.2 database
with 3 nodes, and oracle alert log files show no evidence of crash (of course,
as the ADR base directory points to the crashed storage). Checking AIX Error
Log facility it is possible to notice the storage issue:
LABEL: DSI_PROC
IDENTIFIER: <<Identifier>>
Date/Time: Fri Jul 13 11:43:56 CDT 2012
Sequence Number: 1072
Machine Id: <<machine ID>>
Node Id: <<server name>>
Class: S
Type: PERM
WPAR: Global
Resource Name: SYSVMM
Description
DATA STORAGE INTERRUPT, PROCESSOR
Probable Causes
SOFTWARE PROGRAM
Failure Causes
SOFTWARE PROGRAM
Recommended Actions
IF PROBLEM PERSISTS THEN DO THE FOLLOWING
CONTACT APPROPRIATE SERVICE REPRESENTATIVE
Detail Data
DATA STORAGE INTERRUPT STATUS REGISTER
0000 0000 4000 0000
SEGMENT REGISTER, SEGREG
0000 001B 9853 9590
DATA STORAGE INTERRUPT ADDRESS REGISTER
F000 0000 2000 0010
After the storage issue
was fixed, it was time to bring the database back up.
$ srvctl start database -d DB_NAME
PRCR-1079 : Failed to start resource ora.db_name.db
CRS-5017: The resource action "ora.db_name.db start"
encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
CRS-2674: Start of 'ora.db_name.db' on 'node-1' failed
CRS-5017: The resource action "ora.db_name.db start"
encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
CRS-2674: Start of 'ora.reap101.db' on 'node-2' failed
CRS-2632: There are no more servers to try to place resource 'ora.db_name.db'
on that would satisfy its placement policy
CRS-5017: The resource action "ora.db_name.db start"
encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
CRS-2674: Start of 'ora.db_name.db' on 'node-3' failed
The actual error is
identified by checking the alert log file:
Errors in file .../DB_NAME_lms1_11235114.trc (incident=251123):
ORA-04031: unable to allocate 1048528 bytes of shared memory
("shared pool","unknown object","sga
heap(1,0)","gcs resources")
The ORA-04031 error during
the startup occurred due to multiple partially-allocated permanent shared pool
chunks – analyzing the SGA heap dump generated by oracle it is possible to see
several chunks with allocated value less than the chunk size.
The issue is pretty
simple to resolve: increase SGA or memory target when using Automatic MemoryManagement
or the shared pool size when setting memory manually.
Considering the database
was up and running before the storage failure, the real question here is: what
has changed to cause the error during the database startup?
Previously the database
was started up using automatic memory management by setting sga_target.
The shared pool size was set also (the value specified worked as a minimum size
for the shared pool), but it was not enough as oracle increased the shared pool
size. This can be confirmed by checking the server parameter file for the
presence of __shared_pool_size with a value greater than the value
specified to the shared_pool_size parameter. Later the memory management
was changed from auto to manual by setting sga_target to 0 (the database
would be restarted during a scheduled maintenance window), but the parameter shared_pool_size
was not changed to reflect the real value being used by oracle.
It is interesting to note
SGA resizes continue occurring in 11.2 databases despite the fact that
automatic memory management is disabled – see oracle note 1269139.1. Therefore,
it is important to keep in mind to check the current memory size used by the
SGA components by querying gv$sga_dynamic_components before changing memory
parameters.
No comments:
Post a Comment