Wednesday, July 18, 2012

Fri, 13th :o(

I am not superstitious, but when a database crashes due to a storage failure on a Fri, 13th it is hard to avoid the jokes about bad luck :o)

This is a RAC 11.2.0.2 database with 3 nodes, and oracle alert log files show no evidence of  crash (of course, as the ADR base directory points to the crashed storage). Checking AIX Error Log facility it is possible to notice the storage issue:

LABEL:          DSI_PROC
IDENTIFIER:     <<Identifier>>

Date/Time:       Fri Jul 13 11:43:56 CDT 2012
Sequence Number: 1072
Machine Id:      <<machine ID>>
Node Id:         <<server name>>
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SYSVMM

Description
DATA STORAGE INTERRUPT, PROCESSOR

Probable Causes
SOFTWARE PROGRAM

Failure Causes
SOFTWARE PROGRAM
        Recommended Actions
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
DATA STORAGE INTERRUPT STATUS REGISTER
0000 0000 4000 0000

SEGMENT REGISTER, SEGREG
0000 001B 9853 9590

DATA STORAGE INTERRUPT ADDRESS REGISTER
F000 0000 2000 0010

After the storage issue was fixed, it was time to bring the database back up.

$ srvctl start database -d DB_NAME
PRCR-1079 : Failed to start resource ora.db_name.db
CRS-5017: The resource action "ora.db_name.db start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0

CRS-2674: Start of 'ora.db_name.db' on 'node-1' failed
CRS-5017: The resource action "ora.db_name.db start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0

CRS-2674: Start of 'ora.reap101.db' on 'node-2' failed
CRS-2632: There are no more servers to try to place resource 'ora.db_name.db' on that would satisfy its placement policy
CRS-5017: The resource action "ora.db_name.db start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0

CRS-2674: Start of 'ora.db_name.db' on 'node-3' failed

The actual error is identified by checking the alert log file:

Errors in file .../DB_NAME_lms1_11235114.trc  (incident=251123):
ORA-04031: unable to allocate 1048528 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","gcs resources")

The ORA-04031 error during the startup occurred due to multiple partially-allocated permanent shared pool chunks – analyzing the SGA heap dump generated  by oracle it is possible to see several chunks with allocated value less than the chunk size.

The issue is pretty simple to resolve: increase SGA or memory target when using Automatic MemoryManagement or the shared pool size when setting memory manually.

Considering the database was up and running before the storage failure, the real question here is: what has changed to cause the error during the database startup?

Previously the database was started up using automatic memory management by setting sga_target. The shared pool size was set also (the value specified worked as a minimum size for the shared pool), but it was not enough as oracle increased the shared pool size. This can be confirmed by checking the server parameter file for the presence of __shared_pool_size with a value greater than the value specified to the shared_pool_size parameter. Later the memory management was changed from auto to manual by setting sga_target to 0 (the database would be restarted during a scheduled maintenance window), but the parameter shared_pool_size was not changed to reflect the real value being used by oracle.

It is interesting to note SGA resizes continue occurring in 11.2 databases despite the fact that automatic memory management is disabled – see oracle note 1269139.1. Therefore, it is important to keep in mind to check the current memory size used by the SGA components by querying gv$sga_dynamic_components before changing memory parameters.

No comments:

Post a Comment