Saturday, July 28, 2012

Physical data distribution

Tuning application queries is one of the biggest challenges for DBAs supporting production environments. The lack of knowledge about the application data model makes tuning difficult. I remember when I worked as a development DBA – based on my knowledge of the application design, I could tell whether an access path was optimal just by looking at the explain plan. It was easy to identify how a particular query should run.

Without application knowledge, improvements may still be obtained by considering the table sizes and the number of rows that can be eliminated as early as possible in the query, based on the join and filter conditions. When tables are accessed in the right order, rows are eliminated early, generating no more I/O operations (physical or logical) than necessary. It is pretty common to see problems created by just a few SQL statements with non-optimal execution plans generating so much unnecessary physical I/O that the overall performance of the database is degraded.

Normally, tuning OLTP queries is less complicated than tuning queries running on data warehouse systems, as the queries tend to be smaller and simpler. Last week I was working on a query that is used to show offers to customers contacting the call center. Most of the time (99%) the query runs very fast, in less than 100 milliseconds. But some executions take minutes, raising the average execution time for the query to 20 seconds – I mention the average time just to give a better idea of how extremely opposite the execution times are.

The execution time is related to the number of rows accessed by the query – fast executions access few rows, and slow executions access a few thousand rows. The query joins 7 tables, but only 3 are really relevant: one because it is the most selective table (it is the driving table), and the other two because they hold most of the records accessed by the query, making them the top contributors to the high execution times. This is the simplified query that I am going to use to show how physical rows that are unordered (relative to the join predicates) can impact the response time due to the high number of I/O operations:

SELECT /*+ ALL_ROWS */
       ... -- 1 indexed column from T1, 5 columns from T2, and 14 columns from T7
  FROM T1,
       T2,
       T7
 WHERE T1.UID = :B1
   AND T2.PER_ID = T1.ROW_ID
   AND T7.ID = T2.ROW_ID
   AND T7.FLAG = 'Y'

Predicate                      Rows
access(T1.UID='4850')              1
access(T2.PER_ID=T1.ROW_ID)    4,973
access(T7.ID=T2.ROW_ID)        4,893
filter(T7.FLAG='Y')                3
access(T3.ROW_ID=T7.ROW_ID)        3

The query was taking 41 seconds.

I/O Statistics
Consistent gets    19,281
Physical reads      7,585

The following table shows the number of blocks containing at least one row accessed by the query, before and after reorganizing the physical order of the rows in tables T2 and T7.

Table    Rows     Before (blocks)    After (blocks)
T2       4,973    4,426              75
T7       4,893    120                23
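
As a side note, block counts like these can be measured with the DBMS_ROWID package – a minimal sketch for T2, using the literal from the predicate list above (this is not a query from the original test):

-- Count the distinct (file, block) pairs that hold the rows of interest
SELECT COUNT(DISTINCT DBMS_ROWID.ROWID_RELATIVE_FNO(t.rowid) || '.' ||
             DBMS_ROWID.ROWID_BLOCK_NUMBER(t.rowid)) AS blocks,
       COUNT(*) AS rows_accessed
  FROM T2 t
 WHERE t.PER_ID IN (SELECT ROW_ID FROM T1 WHERE UID = '4850');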

 

The reorganization was pretty simple, as I was working only with the records related to a sample case instead of reorganizing the entire table (a SQL sketch follows the list):

- Created a temporary table with the relevant records;
- Deleted the records from the current table;
- Inserted the records back using a select ordered by the predicate column (PER_ID for T2 and ID for T7).
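
In SQL, the steps for T2 look roughly like this – a hedged sketch, with the staging table name (T2_REORG) invented for illustration:

-- 1. Stage the records for the sample case in a temporary table
CREATE TABLE T2_REORG AS
  SELECT *
    FROM T2
   WHERE PER_ID IN (SELECT ROW_ID FROM T1 WHERE UID = '4850');

-- 2. Remove them from the current table
DELETE FROM T2
 WHERE PER_ID IN (SELECT ROW_ID FROM T1 WHERE UID = '4850');

-- 3. Reinsert them ordered by the join column, packing them into few blocks
INSERT INTO T2
  SELECT * FROM T2_REORG ORDER BY PER_ID;

COMMIT;

DROP TABLE T2_REORG;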

 

After reorganizing tables T2 and T7, the query was taking 4.05 seconds.

I/O Statistics
Consistent gets    11,357
Physical reads      3,133

Analyzing the access path, it is possible to reduce the number of I/O operations a little more. Oracle is using an index on column ID to access table T7 when joining it to T2. The column FLAG, used as a filter condition, is not part of the index, so Oracle needs to access the table in order to check whether the FLAG value satisfies the filter condition (T7.FLAG='Y'). Adding the column FLAG to the index allows Oracle to check the condition without accessing the table. After adding the FLAG column to the index, this is the elapsed time and these are the statistics for the query:

Elapsed time: 2.44 seconds

I/O Statistics
Consistent gets    10,081
Physical reads      2,819
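
For reference, the index change described above would look something like this – a sketch only, as the real index names were not shown (T7_ID_IX and T7_ID_FLAG_IX are invented):

-- New index covers both the join column and the filter column,
-- so T7.FLAG='Y' can be evaluated without visiting the table
CREATE INDEX T7_ID_FLAG_IX ON T7 (ID, FLAG);

-- Drop the old single-column index once the new one is in place
DROP INDEX T7_ID_IX;

Creating the replacement before dropping the old index keeps the access path available in the meantime.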

 

Keeping the physical order of rows aligned with a specific logical order can be a nightmare. When the application inserts data through batch or bulk loads, it may be simple to keep the physical order of rows matching a logical order (index). But normally data is inserted at different points in time, and without any order, as a result of online processing. It would be necessary to schedule table reorganizations on a regular basis to keep the physical order close to the logical order – I support some databases with tables measured in hundreds of gigabytes, holding active data (not historical), and accessed 24x7x365 – reorganization is not feasible.

Concerning other access paths (for example, T2 was reorganized by PER_ID, but it has several other indexes used to satisfy different queries), the reorganization approach might decrease the efficiency of table access through other indexes. In this specific case of table T2, where the data was spread over thousands of blocks, the reorganization concentrated the rows in just a few hundred, improving all access paths.

Besides, this is a RAC database (for the sake of sanity I ran all tests on the same node, and yes, I used the Test environment :o), so with the reorganization approach there is a good chance of increasing the interconnect traffic, as the rows will be grouped into fewer (hot) blocks.

The bottom line is that there is no magic - bad design, or a design that does not easily reflect the business requirements, will cost money to fix, either by reengineering the application or by acquiring more powerful hardware.

Wednesday, July 18, 2012

Fri, 13th :o(

I am not superstitious, but when a database crashes due to a storage failure on a Fri, 13th, it is hard to avoid the jokes about bad luck :o)

This is a RAC 11.2.0.2 database with 3 nodes, and the Oracle alert log files show no evidence of the crash (of course, as the ADR base directory points to the crashed storage). Checking the AIX Error Log facility, it is possible to notice the storage issue:

LABEL:          DSI_PROC
IDENTIFIER:     <<Identifier>>

Date/Time:       Fri Jul 13 11:43:56 CDT 2012
Sequence Number: 1072
Machine Id:      <<machine ID>>
Node Id:         <<server name>>
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SYSVMM

Description
DATA STORAGE INTERRUPT, PROCESSOR

Probable Causes
SOFTWARE PROGRAM

Failure Causes
SOFTWARE PROGRAM
        Recommended Actions
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
DATA STORAGE INTERRUPT STATUS REGISTER
0000 0000 4000 0000

SEGMENT REGISTER, SEGREG
0000 001B 9853 9590

DATA STORAGE INTERRUPT ADDRESS REGISTER
F000 0000 2000 0010
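
For reference, entries like the one above are pulled from the AIX error log with the errpt command – typical invocations, with the identifier placeholder mirroring the scrubbed value above:

$ errpt                        # one-line summary of logged errors
$ errpt -a -j <<Identifier>>   # full detail for a specific error identifier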

After the storage issue was fixed, it was time to bring the database back up.

$ srvctl start database -d DB_NAME
PRCR-1079 : Failed to start resource ora.db_name.db
CRS-5017: The resource action "ora.db_name.db start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0

CRS-2674: Start of 'ora.db_name.db' on 'node-1' failed
CRS-5017: The resource action "ora.db_name.db start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0

CRS-2674: Start of 'ora.db_name.db' on 'node-2' failed
CRS-2632: There are no more servers to try to place resource 'ora.db_name.db' on that would satisfy its placement policy
CRS-5017: The resource action "ora.db_name.db start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0

CRS-2674: Start of 'ora.db_name.db' on 'node-3' failed

The actual error is identified by checking the alert log file:

Errors in file .../DB_NAME_lms1_11235114.trc  (incident=251123):
ORA-04031: unable to allocate 1048528 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","gcs resources")

The ORA-04031 error during startup occurred due to multiple partially-allocated permanent shared pool chunks – analyzing the SGA heap dump generated by Oracle, it is possible to see several chunks with an allocated value less than the chunk size.

The issue is pretty simple to resolve: increase the SGA or memory target when using automatic memory management, or increase the shared pool size when memory is set manually.
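
For example, with manual memory management, something along these lines – the 2G value is purely illustrative, as the proper size depends on the workload:

-- Reserve a larger shared pool for the next startup (all RAC instances)
ALTER SYSTEM SET shared_pool_size = 2G SCOPE=SPFILE SID='*';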

Considering the database was up and running before the storage failure, the real question here is: what has changed to cause the error during the database startup?

Previously, the database was started using automatic memory management, by setting sga_target. The shared pool size was also set (the value specified worked as a minimum size for the shared pool), but it was not enough, and Oracle increased the shared pool size. This can be confirmed by checking the server parameter file for the presence of __shared_pool_size with a value greater than the value specified for the shared_pool_size parameter. Later, the memory management was changed from automatic to manual by setting sga_target to 0 (the database would be restarted during a scheduled maintenance window), but the parameter shared_pool_size was not changed to reflect the real value being used by Oracle.
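
One way to perform that check is to dump the spfile to a text pfile and inspect it – a minimal sketch, with an arbitrary pfile path:

SQL> CREATE PFILE='/tmp/check_params.ora' FROM SPFILE;

$ grep -i shared_pool_size /tmp/check_params.ora

If __shared_pool_size appears with a value greater than shared_pool_size, Oracle had grown the pool beyond the configured minimum.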

It is interesting to note that SGA resizes continue to occur in 11.2 databases despite automatic memory management being disabled – see Oracle note 1269139.1. Therefore, it is important to check the current memory size used by the SGA components by querying gv$sga_dynamic_components before changing memory parameters.
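
A minimal sketch of such a check (column list trimmed to the essentials):

SELECT inst_id,
       component,
       ROUND(current_size / 1024 / 1024) AS current_mb
  FROM gv$sga_dynamic_components
 WHERE component IN ('shared pool', 'DEFAULT buffer cache')
 ORDER BY inst_id, component;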

Tuesday, July 10, 2012

FlashCopy and data consistency

Last week I was called by the on-call DBA to check a database refresh that had failed. This is a reporting database that is refreshed every night from a production database using FlashCopy.

Basically, FlashCopy is a storage volume snapshot tool provided by IBM. FlashCopy creates a point-in-time copy of a source volume on the target volume (a sort of LUN mirroring). Currently, almost every storage system is equipped with volume snapshot tools, like IBM FlashCopy, EMC TimeFinder (used for creating mirror images of standard devices – the mirror image is called a BCV, Business Continuance Volume), Hitachi ShadowImage, and NetApp SnapMirror. Anyway, the vendor or tool name does not matter; the idea is exactly the same: copy or clone volumes (LUNs) using the storage's high-speed internal bus.

The issue was pretty simple to identify and solve, but it made me reflect on data consistency. Let me explain the issue first (file names and paths have been changed):

After the recovery process, Oracle was showing the following message:

ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01195: online backup of file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/oradata/DB_NAME/system_01.dbf'

Here are more messages logged by the refresh scripts:

SQL> set autorecovery on
SQL> RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CHANGE 11430112340412
ORA-00279: change 11430112340412 generated at 07/04/2012 23:31:02 needed for thread 1
ORA-00289: suggestion : /oradata/refresh/arc/DB_NAME_1_1321_341213241.arc
ORA-00280: change 11430112340412 for thread 1 is in sequence #1321

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
auto

ORA-00332: archived log is too small - may be incompletely archived
ORA-00334: archived log: '/oradata/refresh/arc/DB_NAME_1_1321_341213241.arc'

As there was no available space on the target file system (/oradata/refresh/arc), the copy of the last archived log file generated during the storage copy failed – instead of the full 500 MB, only 100 MB were copied. After cleaning up old files and copying the last archived log file again, the refresh completed successfully.

SQL> set autorecovery on
SQL> RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CHANGE 11430112340412
ORA-00279: change 11430112340412 generated at 07/04/2012 23:31:02 needed for thread 1
ORA-00289: suggestion : /oradata/refresh/arc/DB_NAME_1_1321_341213241.arc
ORA-00280: change 11430112340412 for thread 1 is in sequence #1321

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
auto

Log applied.
Media recovery complete.

Although the script used to copy the archived log files from the source to the target system did not check for errors, the root cause was the script used to identify which archived log files should be copied to the target system and applied to bring the database to a consistent state. That script was identifying more archived log files than necessary. Therefore, sooner or later, the combined size of the archived log files would become larger than the archive area on the target system.
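
For reference, the set of archived logs actually required can be derived by SCN range on the source database – a hedged sketch, with :start_scn standing for the SCN captured when the copy began and :stop_scn for the UNTIL CHANGE target (both binds are illustrative):

-- Only logs whose SCN range overlaps [start_scn, stop_scn] are needed
SELECT thread#, sequence#, name
  FROM v$archived_log
 WHERE next_change#  > :start_scn
   AND first_change# <= :stop_scn
 ORDER BY thread#, sequence#;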

 

About data consistency:

FlashCopy is a concurrent point-in-time copy operation. It utilizes copy-on-write bitmap techniques to track which data blocks have been modified after the point-in-time copy was created. Actually, the point-in-time copy is just the creation of a bitmap that indicates which parts of the source volume have been copied to the target (no data is actually copied at that point) – during the bitmap creation the source volume is frozen for a few seconds or less (the time depends on the number and size of the volumes being copied). The data is copied in the background, and when a data block on the source volume is modified, the previous version of that data block is copied to the target volume before being overwritten.

As soon as the bitmap creation is complete, FlashCopy allows read and write operations on both the source and target volumes. Although the actual data copy occurs in the background, and may take a while, the resulting data at the target appears as though the copy had been made instantaneously.

Another point-in-time copy technique is the split mirror (used by EMC TimeFinder). In this technique, the data is copied prior to the point-in-time copy. After a complete mirror of the data is created, the point-in-time copy is made by "splitting" the mirror. Note that the target volume is only available after the mirror is split, and as the copy was already done, all changes on the source are continuously mirrored to the target until the split is performed.

One more point-in-time copy technique is changed block. In this technique, the physical copy of the data is shared between the source and the target until the data is written. The shared control is implemented using a table of pointers that identifies where the data must be retrieved from when the source and target volumes are accessed. Read and write operations on both the source and target volumes are allowed as soon as the control table is created.

All these techniques ensure the data is copied consistently with respect to a point in time, and the database will also be in a consistent state if there is no database I/O activity while the storage point-in-time copy (or split, when using EMC TimeFinder) is done. Well, even considering the point-in-time copy is a quick operation, I do not think quiescing the database is a valid option. Therefore, it is essential to have the database in hot backup mode during the point-in-time copy, or to use Oracle Recovery Manager (RMAN) to recover the database after the copy.
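
For the hot backup approach, the copy is bracketed roughly like this – a sketch of the sequence, with the storage step shown as a comment:

-- Put the database in hot backup mode (single command since 10g)
ALTER DATABASE BEGIN BACKUP;

-- ... trigger the storage point-in-time copy (or mirror split) here ...

-- Leave backup mode once the point-in-time copy is established
ALTER DATABASE END BACKUP;

-- Archive the current redo so the copy can be recovered to consistency
ALTER SYSTEM ARCHIVE LOG CURRENT;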