Had to fail... for sure! Curious things about Oracle ® Database Management: FlashCopy and data consistency

Last week I was called by the on call DBA to check a database refresh that failed. This is a reporting database that is refreshed every night from a production database using Flashcopy.

Basically FlashCopy is a storage volume snapshot tool, provided by IBM. FlashCopy creates a point-in-time copy of a source volume on the target volume (sort of LUN mirroring). Currently, almost every storage system is equipped with volume snapshot tools, like IBM FlashCopy, EMC TimeFinder (it is used for creating mirror images of standard devices – the mirror image is called as BCV - Business Continuance Volume), Hitachi ShadowImage, and NetApp SnapMirror. Anyway, it does not matter the vendor or the tool name, the idea is exactly the same: copy or clone volumes (LUN) using the storage high speed internal bus.

The issue was pretty simple to identify and solve, but it made me reflect about data consistency. Let me explain the issue first (file names and paths have been changed):

After the recover process oracle was showing the following message:

ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01195: online backup of file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/oradata/DB_NAME/system_01.dbf'

Here are more messages logged by the refresh scripts:

SQL> set autorecovery on

SQL> RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CHANGE 11430112340412

ORA-00279: change 11430112340412 generated at 07/04/2012 23:31:02 needed for thread 1

ORA-00289: suggestion : /oradata/refresh/arc/DB_NAME_1_1321_341213241.arc

ORA-00280: change 11430112340412 for thread 1 is in sequence #1321

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

auto

ORA-00332: archived log is too small - may be incompletely archived

ORA-00334: archived log:'/oradata/refresh/arc/DB_NAME_1_1321_341213241.arc'

As there was no available space on the target file system (/oradata/refresh/arc), the copy of the last archived log file generated during the storage copy just failed – instead of copying 500 MB, it was possible to copy just 100 MB. After cleaning up old files, and copying again the last archived log file, the refresh was complete successfully.

SQL> set autorecovery on

SQL> RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CHANGE 11430112340412

ORA-00279: change 11430112340412 generated at 07/04/2012 23:31:02 needed for thread 1

ORA-00289: suggestion : /oradata/refresh/arc/DB_NAME_1_1321_341213241.arc

ORA-00280: change 11430112340412 for thread 1 is in sequence #1321

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

auto

Log applied.

Media recovery complete.

Although the script used to copy the archived log files from the source to the target system did not check for errors, the root cause was the script used to identify the archived log files that should be copied to the target system, and applied in order to bring the database to a consistent state. That script was identifying more archived log files than the necessary. Therefore, soon or later, the size of the archived log files would become larger than the archive area on the target system.

About the data consistency:

Flashcopy is a concurrent point-in-time copy operation. It utilizes copy-on-write bitmap techniques to track which data blocks have been modified after the point-in-time copy was created. Actually the point-in-time copy is just the creation of a bitmap that indicates which part of the source volume has been copied to the target (no data is actually copied at that point) – during the bitmap creation the volume source is frozen for few seconds or less (time depends on the number and size of volumes being copied). The data is copied in background, and when a data block on the source volume is modified, the previous version of that data block is copied to the target volume before being overwritten.

As soon as the bitmap creation is complete, FlashCopy allows read and write operations on both the source and target volumes. Although the actual data copy operation occurs in background, and even taking time, the resulting data at the target appears as though the copy were made instantaneously.

Another technique of point-in-time copy is the split mirror (used by EMC TimeFinder). In this technique, the data is copied prior to the point-in-time copy. After a complete mirror of the data is created, the point-in-time copy is made by "splitting" the mirror. Note the target source is only available after splitting the mirror, and as the copy was already done, all changes on the source are continuously mirrored to the target until the split is performed.

And one more technique of point-in-time copy is the changed block. In this technique physical copy of the data is shared between the source and the target until the data is written. Shared control is implemented using a table of pointers to identify from where the data must be retrieved when both the source and target volumes are accessed. Read and write operations on both the source and target volumes are allowed as soon as the control table is created.

All these techniques ensure the data is copied consistently with a point-in-time, and the database will be in a consistent state also if there are no database I/O activities while the storage point-in-time copy (or split when using EMC TimeFinder) is done. Well, even considering the point-in-time copy is a quick operation, I do not think quiescing the database is a valid option. Therefore, it is essential to have the database in hot backup mode during the point-in-time copy or to use Oracle Recovery Manager (RMAN) to recover the database after the copy.

Had to fail... for sure! Curious things about Oracle ® Database Management

Tuesday, July 10, 2012

FlashCopy and data consistency

No comments:

Post a Comment

Search This Blog