Today a customer asked about how to monitor recovery activity on MaxDB.
The typical questions here are: “Is it still running?” and “How long will it take?”
While the last question cannot be answered easily (and most often not at all), since the time necessary to finish a recovery largely depends on the performance of
the hardware and the data volume, the first question can rather easily be answered.
First step – X_CONS
The first step to check the recovery activity is to check what the database is currently doing at all.
Since a recovery can only be done in ADMIN mode, we can be pretty sure that everything we will see in the ‘x_cons show all’ output will belong to the recovery.
So let’s check this. I just started the recovery of some logbackupfiles on mytest db (MaxDB 7.6.04 Build 11/Win32):
C:\WINDOWS>x_cons DB760 show ac SERVERDB: DB760 ID UKT Win TASK APPL Current Timeout Region Wait tid type pid state priority cnt try item T11 6 0x81C Pager Vvectorio 0 0 671(s) T12 6 0x81C Pager Vvectorio 0 0 671(s) T13 6 0x81C Pager Vvectorio 0 0 671(s) T18 7 0xD5C User 1560 JobWait BckRec 0 0 3599(s) T28 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T29 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T30 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T31 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T32 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T33 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T34 4 0xEDC BUPmed AsynWaitRead 0 0 1130(s) T35 4 0xEDC BUPvol JobWait Redo 0 0 1130(s) T51 4 0xEDC Savepnt PagerWaitWritr 0 0 1361(s) T52 4 0xEDC RedoRea Vsuspend (204) 0 0 1130(s) T53 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T54 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T55 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T56 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T57 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T58 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T59 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T60 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T61 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T62 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T63 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T64 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T65 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T66 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T67 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T68 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T69 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T70 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T71 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T72 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T73 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T74 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T75 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T76 4 0xEDC RedoExe IO Wait (R) 0 0 2 2697(s) T77 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T78 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T79 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T80 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T81 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T82 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T83 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T84 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T85 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T86 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T87 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T88 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T89 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T90 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T91 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T92 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T93 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T94 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T95 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T96 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T97 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T98 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T99 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T100 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) T101 4 0xEDC RedoExe NoRedoJob(231) 0 0 1130(s) Console command finished (2008-06-11 10:55:33). |
Ok, the first task we see is:
ID UKT Win TASK APPL Current Timeout Region Wait T18 7 0xD5C User 1560 JobWait BckRec 0 0 3599(s) |
This is the user tasks that actually started the recovery.
It is in state “JobWaitBckRec” (waiting for the completion of a backup or a recovery) and does not work for itself.
Instead, there are several user tasks active with actually performing the necessary work for the recovery.
This work can be splitted in:
A) Reading data from the log-backup
We see this in this line:
ID UKT Win TASK APPL Current Timeout Region Wait tid type pid state priority cnt try item T34 4 0xEDC BUPmed AsynWaitRead 0 0 1130(s) T35 4 0xEDC BUPvol JobWait Redo 0 0 1130(s) |
B) Creating internal recovery files from the data and analyze the recovery files, so that transactions that don’t need to be redone aren’t actually redone and to maximize parallelity of recovery.
In our list above this task is actually doing this:
ID UKT Win TASK APPL Current Timeout Region Wait tid type pid state priority cnt try item T52 4 0xEDC RedoRea Vsuspend (204) 0 0 1130(s) |
Hmm… what could Vsuspend (204) mean here? Unfortuately not all suspends have names that actually tells what’s going on.
But you can help yourself with this command:
C:\WINDOWS>x_cons DB760 show suspends SERVERDB: DB760 List of suspend reasons: ======================== total suspends: 10530 DC OvFlow(043) : 2 ( 0.02% ) b13get_node: sysbuf_overflow (SVP) :-( IOWait(W)(044) : 5 ( 0.05% ) b13pfree_pno :-( PagerWaitWritr : 13 ( 0.12% ) Pager_Controller::WaitForPagerWritReply DB FULL (198) : 13 ( 0.12% ) FBM_HandleDBFull Vsuspend (203) : 1566 ( 14.87% ) kb39read_wait Vsuspend (204) : 886 ( 8.41% ) kb39write_wait JobWait BckRec : 12 ( 0.11% ) SrvTasks_BackupServer::WaitForAnyJobFini JobWait Redo : 3 ( 0.03% ) Rst_RedoManager::RedoLog Vsuspend (224) : 6 ( 0.06% ) kb39wait_for_redo Vsuspend (227) : 1608 ( 15.27% ) RedoReader waits for finished jobs Vsuspend (228) : 1 ( 0.01% ) RedoReader waits for end of all RedoTask Prep-End (230) : 1 ( 0.01% ) bd13GetNode NoRedoJob(231) : 3434 ( 32.61% ) Rst_RedoTrafficControl::ExecuteJobs() SyncRedo (232) : 2559 ( 24.30% ) Rst_RedoTrafficControl::WaitForIoSeq() LogIOwait(234) : 1 ( 0.01% ) Log_Queue::UserTaskEOTReady SVP-wait (243) : 3 ( 0.03% ) Log_Savepoint::StartSavepointAndWait No-Work (255) : 417 ( 3.96% ) Task is waiting for work Console command finished (2008-06-11 11:15:31). |
Although it does display which suspends occured the most often, it does not show us how long those suspends lasted – so for performance analysis this view is rather useless.
Anyhow, for each suspend we see a description next to it and for our Vsuspend (204) we find“kb39write_wait”.
“KB” stands for “Kommunikationsbasisschicht” or “Communication Basislayer” in english and refers to db internal functions within the data access manager.
So we can conclude that our task T52 is currently writing out some internal redo files.
C) Perform the recovery,that is redo the transaction that got committed and rollback those that either got rolled back as well in the first place or that did not finish.
In our case it’s just this task working on this:
ID UKT Win TASK APPL Current Timeout Region Wait tid type pid state priority cnt try item T76 4 0xEDC RedoExe IO Wait (R) 0 0 2 2697(s) |
To perform the recovery it has to read the redo files and perform the changes on the pages of the database.
If those pages aren’t found in the cache than the task has to read it from the data volume as it just happens here in our example.
D) During all this, pages are changed in the database, so the pager needs to write out those ‘dirty’ pages from time to time via a savepoint.
In the tasklist we find these tasks for that:
ID UKT Win TASK APPL Current Timeout Region Wait tid type pid state priority cnt try item T11 6 0x81C Pager Vvectorio 0 0 671(s) T12 6 0x81C Pager Vvectorio 0 0 671(s) T13 6 0x81C Pager Vvectorio 0 0 671(s) [...] T51 4 0xEDC Savepnt PagerWaitWritr 0 0 1361(s) |
Task 51 triggered the Savepoint and the asynchronous writing of the data pages
is than done with one task per volume (in this case there are just three data volumes).
To optimize the write performance the data is written out in groups of
blocks and thus we see “Vvectorio” for the suspend reason.
Detailed information – knldiag/knlmsg file
So with x_cons we have quite a good option to see if there is anything happening in the database.
More detailed information on the recovery process itself can be found in the knldiag(knlmsg) file of the database.
The following is an excerpt from the knldiag file during the recovery.
I did only minor reformatting to ease reading:
2008-06-11 10:54:57 0xD5C 20006 Log 0 queues, flushmode is 'MaximizeSafety', devstate is 'Cleared' 10:54:57 0xD5C 20007 Log Oldest not saved is ioseq NIL @ off 1 10:54:57 0xD5C 20009 Log LogVolume not completely written until now 10:54:57 0xD5C 20005 Log Restart from ioseq 1 @ off 1 to ioseq NIL @ off NIL 10:54:57 0xD5C 20010 Log Result after checking the log device: 'Ok' 10:54:57 0xD5C 9 Log The number of active logging-queues has been increased to 1 10:54:57 0xEDC 11 Rst 88 redo transactions readable and 55 redo tasks available. 10:54:57 0xEDC 6 Restart recovering log from tape from IOSeq '1' until IOSeq: 'NIL' 10:54:57 0xEDC 52024 RESTORE 200 pages <- "sapdb\backup\db760log.001" 10:54:58 0xEDC 20012 Log redo-read#50000:TR465841(2)[2396]@NIL.1936'InsertRecord':20080611:102906 10:54:59 0xEDC 20012 Log redo-read#100000:TR465963(37514)[3812]@NIL.3808'InsertRecord':20080611:103101 10:54:59 0xEDC 20012 Log redo-read#150000:TR465963(87493)[4597]@NIL.5600'InsertRecord':20080611:103103 10:54:59 0xEDC 20012 Log redo-read#200000:TR465992(36925)[5147]@NIL.5784'DeleteRecord':20080611:103129 10:55:01 0xEDC 20012 Log redo-read#250000:TR465992(86904)[5591]@NIL.1752'DeleteRecord':20080611:103130 10:55:01 0xEDC 20012 Log redo-read#300000:TR466018(36398)[6055]@NIL.1680'DeleteRecord':20080611:103155 10:55:01 0xEDC 20012 Log redo-read#350000:TR466018(86377)[6498]@NIL.5712'DeleteRecord':20080611:103156 10:55:01 0xB4C 19626 IO Async I/O thread stopped, 'C:\sapdb\backup\db760log.001' 10:55:01 0xEDC 52024 RESTORE 4712 pages <- "apdb\backup\db760log.001" 10:55:01 0xD5C 52012 RESTORE new tape required 4300 10:55:01 0xD5C 1 Backup Backupmedium #1 (C:\sapdb\backup\db760log.001) 10:55:01 0xD5C 6 KernelCo + Backup error occured, Error code 4300 "new_hostfile_required" 10:55:01 0xD5C 20000 Log + no update of LogInfoPage: DeviceState = Cleared, LogIsEmpty = true 10:55:02 0xF68 19625 IO Async I/O thread started, 'C:\sapdb\backup\db760log.002 10:55:02 0xEDC 52101 RESTORE Filetype: file 10:55:02 0xEDC 52024 RESTORE 200 pages <- "sapdb\backup\db760log.002" 10:55:02 0xEDC 20012 Log redo-read#400000:TR466023(36334)[6943]@NIL.1896'DeleteRecord':20080611:103159 10:55:02 0xEDC 20012 Log redo-read#450000:TR466023(86292)[7387]@NIL.6216'DeleteRecord':20080611:103200 10:55:02 0xEDC 20012 Log redo-read#500000:TR466027(36247)[7832]@NIL.1752'DeleteRecord':20080611:103204 10:55:03 0xEDC 20012 Log redo-read#550000:TR466027(86227)[8274]@NIL.5928'DeleteRecord':20080611:103205 10:55:03 0xEDC 20012 Log redo-read#600000:TR466032(36142)[8718]@NIL.3840'DeleteRecord':20080611:103208 10:55:03 0xEDC 20012 Log redo-read#650000:TR466032(86121)[9161]@NIL.1320'DeleteRecord':20080611:103210 10:55:03 0xEDC 20012 Log redo-read#700000:TR466037(36098)[9604]@NIL.5424'DeleteRecord':20080611:103212 10:55:03 0xEDC 20012 Log redo-read#750000:TR466037(86098)[10047]@NIL.1176'DeleteRecord':20080611:103214 10:55:04 0xEDC 20012 Log redo-read#800000:TR466038(36096)[10491]@NIL.4992'DeleteRecord':20080611:103216 10:55:04 0xEDC 20012 Log redo-read#850000:TR466038(86096)[10934]@NIL.744'DeleteRecord':20080611:103217 [...] 10:55:31 0xAD8 19625 IO Async I/O thread started, 'C:\sapdb\backup\db760log.010 10:55:31 0xEDC 52101 RESTORE Filetype: file 10:55:31 0xEDC 52024 RESTORE 200 pages <- "sapdb\backup\db760log.010" 10:55:33 0xEDC 20012 Log redo-read#4350000:TR466074(86047)[45950]@NIL.6008'UpdateRecord':20080611:104003 10:55:33 0xEDC 20012 Log redo-read#4400000:TR466075(36045)[46791]@NIL.8016'DeleteRecord':20080611:104024 10:55:34 0xEDC 20012 Log redo-read#4450000:TR466075(86045)[47234]@NIL.3768'DeleteRecord':20080611:104034 10:55:39 0xEDC 20012 Log last-redo-read empty errlist#4463954:TR466075(99999)[47357]@NIL.7728'DeleteRecord':20080611:105311 10:55:39 0xEDC 11 Log Transaction 466075 is open after log reading and forced to rollback 0 entries. 10:59:18 0xEDC 5 Rst RedoControl: End of redo processing 10:59:18 0xEDC 4 Rst + RedoControl: OldestCommitSeq: 1, RedoTasks: 55 , MaxRedoTrans: 88 10:59:18 0xEDC 6 Rst + RedoControl: EndOfLogReached: true, RedoIsAborted: false, RedoProcessingDisabled: false 10:59:18 0xEDC 12 Rst + RedoControl: redo reader T0 wait reason: isNotWaiting, waiting redo executer: 0 10:59:18 0xEDC 8 Rst + RedoControl: Transactions processed: 2582 (2582) 10:59:19 0xEDC 20039 Admin Kernel state: 'REDO LOG' finished [...] |
Here we see that redo-entries are read and redofiles are created:
10:55:03 0xEDC 20012 Log redo-read#700000:TR466037(36098)[9604]@NIL.5424'DeleteRecord':20080611:103212 |
At the end of the recovery there will be a summary, here marked with messagetype RST:
10:59:18 0xEDC 5 Rst RedoControl: End of redo processing 10:59:18 0xEDC 4 Rst + RedoControl: OldestCommitSeq: 1, RedoTasks: 55 , MaxRedoTrans: 88 10:59:18 0xEDC 6 Rst + RedoControl: EndOfLogReached: true, RedoIsAborted: false, RedoProcessingDisabled: false 10:59:18 0xEDC 12 Rst + RedoControl: redo reader T0 wait reason: isNotWaiting, waiting redo executer: 0 10:59:18 0xEDC 8 Rst + RedoControl: Transactions processed: 2582 (2582) |
as well as the “we’re done!” message:
2008-06-11 10:59:19 0xEDC 20039 Admin Kernel state: 'REDO LOG' finished |
Further reading
All these information are presented in a much better way in the “MaxDB Internals” course material.
It’s available for free at http://maxdb.sap.com/training/
Even more information can be found in the MaxDBWikihere in SDN
Best regards,
Lars