Skip to content

Questions to SAP Support: “Is my MaxDB recovery still running?”

Today a customer asked about how to monitor recovery activity on MaxDB.

The typical questions here are: “Is it still running?” and “How long will it take?”

While the last question cannot be answered easily (and most often not at all), since the time necessary to finish a recovery largely depends on the performance of
the hardware and the data volume, the first question can rather easily be answered.

First step – X_CONS

The first step to check the recovery activity is to check what the database is currently doing at all.
Since a recovery can only be done in ADMIN mode, we can be pretty sure that everything we will see in the ‘x_cons show all’ output will belong to the recovery.

So let’s check this. I just started the recovery of some logbackupfiles on mytest db (MaxDB 7.6.04 Build 11/Win32):

C:\WINDOWS>x_cons DB760 show ac
SERVERDB: DB760
ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait
 tid   type        pid state          priority cnt try    item
 T11    6  0x81C Pager           Vvectorio             0 0               671(s)
 T12    6  0x81C Pager           Vvectorio             0 0               671(s)
 T13    6  0x81C Pager           Vvectorio             0 0               671(s)
 T18    7  0xD5C User       1560 JobWait BckRec        0 0               3599(s)
 T28    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T29    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T30    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T31    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T32    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T33    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T34    4  0xEDC BUPmed          AsynWaitRead          0 0               1130(s)
 T35    4  0xEDC BUPvol          JobWait Redo          0 0               1130(s)
 T51    4  0xEDC Savepnt         PagerWaitWritr        0 0               1361(s)
 T52    4  0xEDC RedoRea         Vsuspend (204)        0 0               1130(s)
 T53    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T54    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T55    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T56    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T57    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T58    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T59    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T60    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T61    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T62    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T63    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T64    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T65    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T66    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T67    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T68    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T69    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T70    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T71    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T72    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T73    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T74    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T75    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T76    4  0xEDC RedoExe         IO Wait (R)           0 0      2        2697(s)
 T77    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T78    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T79    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T80    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T81    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T82    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T83    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T84    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T85    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T86    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T87    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T88    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T89    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T90    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T91    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T92    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T93    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T94    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T95    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T96    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T97    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T98   4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T99    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T100   4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
 T101   4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
Console command finished (2008-06-11 10:55:33).

Ok, the first task we see is:

ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait
T18    7  0xD5C User       1560 JobWait BckRec        0 0               3599(s)

This is the user tasks that actually started the recovery.
It is in state “JobWaitBckRec” (waiting for the completion of a backup or a recovery) and does not work for itself.
Instead, there are several user tasks active with actually performing the necessary work for the recovery.

This work can be splitted in:

A) Reading data from the log-backup

We see this in this line:

ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait
          tid   type        pid state          priority cnt try    item
T34    4  0xEDC BUPmed          AsynWaitRead          0 0               1130(s)
T35    4  0xEDC BUPvol          JobWait Redo          0 0               1130(s)

 

B) Creating internal recovery files from the data and analyze the recovery files, so that transactions that don’t need to be redone aren’t actually redone and to maximize parallelity of recovery.

In our list above this task is actually doing this:

ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait
          tid   type        pid state          priority cnt try    item
T52    4  0xEDC RedoRea         Vsuspend (204)        0 0               1130(s)

Hmm… what could Vsuspend (204) mean here? Unfortuately not all suspends have names that actually tells what’s going on.
But you can help yourself with this command:

C:\WINDOWS>x_cons DB760 show suspends
SERVERDB: DB760
List of suspend reasons:
========================
total suspends:   10530
 DC OvFlow(043) :       2 (   0.02% ) b13get_node: sysbuf_overflow (SVP)   :-(
 IOWait(W)(044) :       5 (   0.05% ) b13pfree_pno                         :-(
 PagerWaitWritr :      13 (   0.12% ) Pager_Controller::WaitForPagerWritReply
 DB FULL  (198) :      13 (   0.12% ) FBM_HandleDBFull
 Vsuspend (203) :    1566 (  14.87% ) kb39read_wait
 Vsuspend (204) :     886 (   8.41% ) kb39write_wait
 JobWait BckRec :      12 (   0.11% ) SrvTasks_BackupServer::WaitForAnyJobFini
 JobWait Redo   :       3 (   0.03% ) Rst_RedoManager::RedoLog
 Vsuspend (224) :       6 (   0.06% ) kb39wait_for_redo
 Vsuspend (227) :    1608 (  15.27% ) RedoReader waits for finished jobs
 Vsuspend (228) :       1 (   0.01% ) RedoReader waits for end of all RedoTask
 Prep-End (230) :       1 (   0.01% ) bd13GetNode
 NoRedoJob(231) :    3434 (  32.61% ) Rst_RedoTrafficControl::ExecuteJobs()
 SyncRedo (232) :    2559 (  24.30% ) Rst_RedoTrafficControl::WaitForIoSeq()
 LogIOwait(234) :       1 (   0.01% ) Log_Queue::UserTaskEOTReady
 SVP-wait (243) :       3 (   0.03% ) Log_Savepoint::StartSavepointAndWait
 No-Work  (255) :     417 (   3.96% ) Task is waiting for work
Console command finished (2008-06-11 11:15:31).

Although it does display which suspends occured the most often, it does not show us how long those suspends lasted – so for performance analysis this view is rather useless.

Anyhow, for each suspend we see a description next to it and for our Vsuspend (204) we find“kb39write_wait”.
“KB” stands for “Kommunikationsbasisschicht” or “Communication Basislayer” in english and refers to db internal functions within the data access manager.
So we can conclude that our task T52 is currently writing out some internal redo files.

C) Perform the recovery,that is redo the transaction that got committed and rollback those that either got rolled back as well in the first place or that did not finish.

In our case it’s just this task working on this:

ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait
          tid   type        pid state          priority cnt try    item
T76    4  0xEDC RedoExe         IO Wait (R)           0 0      2        2697(s)

To perform the recovery it has to read the redo files and perform the changes on the pages of the database.
If those pages aren’t found in the cache than the task has to read it from the data volume as it just happens here in our example.

D) During all this, pages are changed in the database, so the pager needs to write out those ‘dirty’ pages from time to time via a savepoint.

In the tasklist we find these tasks for that:

ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait
          tid   type        pid state          priority cnt try    item
T11    6  0x81C Pager           Vvectorio             0 0               671(s)
T12    6  0x81C Pager           Vvectorio             0 0               671(s)
T13    6  0x81C Pager           Vvectorio             0 0               671(s)
[...]
T51    4  0xEDC Savepnt         PagerWaitWritr        0 0               1361(s)

Task 51 triggered the Savepoint and the asynchronous writing of the data pages
is than done with one task per volume (in this case there are just three data volumes).
To optimize the write performance the data is written out in groups of
blocks and thus we see “Vvectorio” for the suspend reason.

Detailed information – knldiag/knlmsg file

So with x_cons we have quite a good option to see if there is anything happening in the database.
More detailed information on the recovery process itself can be found in the knldiag(knlmsg) file of the database.

The following is an excerpt from the knldiag file during the recovery.
I did only minor reformatting to ease reading:

2008-06-11
10:54:57      0xD5C     20006 Log      0 queues, flushmode is 'MaximizeSafety', devstate is 'Cleared'
10:54:57      0xD5C     20007 Log      Oldest not saved is ioseq NIL @ off 1
10:54:57      0xD5C     20009 Log      LogVolume not completely written until now
10:54:57      0xD5C     20005 Log      Restart from ioseq 1 @ off 1 to ioseq NIL @ off NIL
10:54:57      0xD5C     20010 Log      Result after checking the log device: 'Ok'
10:54:57      0xD5C         9 Log      The number of active logging-queues has been increased to 1
10:54:57      0xEDC        11 Rst      88 redo transactions readable and 55 redo tasks available.
10:54:57      0xEDC         6 Restart  recovering log from tape from IOSeq '1' until IOSeq: 'NIL'
10:54:57      0xEDC     52024 RESTORE  200 pages <- "sapdb\backup\db760log.001"
10:54:58      0xEDC     20012 Log      redo-read#50000:TR465841(2)[2396]@NIL.1936'InsertRecord':20080611:102906
10:54:59      0xEDC     20012 Log      redo-read#100000:TR465963(37514)[3812]@NIL.3808'InsertRecord':20080611:103101
10:54:59      0xEDC     20012 Log      redo-read#150000:TR465963(87493)[4597]@NIL.5600'InsertRecord':20080611:103103
10:54:59      0xEDC     20012 Log      redo-read#200000:TR465992(36925)[5147]@NIL.5784'DeleteRecord':20080611:103129
10:55:01      0xEDC     20012 Log      redo-read#250000:TR465992(86904)[5591]@NIL.1752'DeleteRecord':20080611:103130
10:55:01      0xEDC     20012 Log      redo-read#300000:TR466018(36398)[6055]@NIL.1680'DeleteRecord':20080611:103155
10:55:01      0xEDC     20012 Log      redo-read#350000:TR466018(86377)[6498]@NIL.5712'DeleteRecord':20080611:103156
10:55:01      0xB4C     19626 IO       Async I/O thread stopped, 'C:\sapdb\backup\db760log.001'
10:55:01      0xEDC     52024 RESTORE  4712 pages <- "apdb\backup\db760log.001"
10:55:01      0xD5C     52012 RESTORE  new tape required 4300
10:55:01      0xD5C         1 Backup   Backupmedium #1 (C:\sapdb\backup\db760log.001)
10:55:01      0xD5C         6 KernelCo  +   Backup error occured, Error code 4300 "new_hostfile_required"
10:55:01      0xD5C     20000 Log       +   no update of LogInfoPage: DeviceState = Cleared, LogIsEmpty = true
10:55:02      0xF68     19625 IO       Async I/O thread started, 'C:\sapdb\backup\db760log.002
10:55:02      0xEDC     52101 RESTORE  Filetype: file
10:55:02      0xEDC     52024 RESTORE  200 pages <- "sapdb\backup\db760log.002"
10:55:02      0xEDC     20012 Log      redo-read#400000:TR466023(36334)[6943]@NIL.1896'DeleteRecord':20080611:103159
10:55:02      0xEDC     20012 Log      redo-read#450000:TR466023(86292)[7387]@NIL.6216'DeleteRecord':20080611:103200
10:55:02      0xEDC     20012 Log      redo-read#500000:TR466027(36247)[7832]@NIL.1752'DeleteRecord':20080611:103204
10:55:03      0xEDC     20012 Log      redo-read#550000:TR466027(86227)[8274]@NIL.5928'DeleteRecord':20080611:103205
10:55:03      0xEDC     20012 Log      redo-read#600000:TR466032(36142)[8718]@NIL.3840'DeleteRecord':20080611:103208
10:55:03      0xEDC     20012 Log      redo-read#650000:TR466032(86121)[9161]@NIL.1320'DeleteRecord':20080611:103210
10:55:03      0xEDC     20012 Log      redo-read#700000:TR466037(36098)[9604]@NIL.5424'DeleteRecord':20080611:103212
10:55:03      0xEDC     20012 Log      redo-read#750000:TR466037(86098)[10047]@NIL.1176'DeleteRecord':20080611:103214
10:55:04      0xEDC     20012 Log      redo-read#800000:TR466038(36096)[10491]@NIL.4992'DeleteRecord':20080611:103216
10:55:04      0xEDC     20012 Log      redo-read#850000:TR466038(86096)[10934]@NIL.744'DeleteRecord':20080611:103217
[...]
10:55:31      0xAD8     19625 IO       Async I/O thread started, 'C:\sapdb\backup\db760log.010
10:55:31      0xEDC     52101 RESTORE  Filetype: file
10:55:31      0xEDC     52024 RESTORE  200 pages <- "sapdb\backup\db760log.010"
10:55:33      0xEDC     20012 Log      redo-read#4350000:TR466074(86047)[45950]@NIL.6008'UpdateRecord':20080611:104003
10:55:33      0xEDC     20012 Log      redo-read#4400000:TR466075(36045)[46791]@NIL.8016'DeleteRecord':20080611:104024
10:55:34      0xEDC     20012 Log      redo-read#4450000:TR466075(86045)[47234]@NIL.3768'DeleteRecord':20080611:104034
10:55:39      0xEDC     20012 Log      last-redo-read empty errlist#4463954:TR466075(99999)[47357]@NIL.7728'DeleteRecord':20080611:105311
10:55:39      0xEDC        11 Log      Transaction 466075 is open after log reading and forced to rollback 0 entries.
10:59:18      0xEDC         5 Rst      RedoControl: End of redo processing
10:59:18      0xEDC         4 Rst       +   RedoControl: OldestCommitSeq: 1, RedoTasks: 55 , MaxRedoTrans: 88
10:59:18      0xEDC         6 Rst       +   RedoControl: EndOfLogReached: true, RedoIsAborted: false, RedoProcessingDisabled: false
10:59:18      0xEDC        12 Rst       +   RedoControl: redo reader T0 wait reason:  isNotWaiting, waiting redo executer:  0
10:59:18      0xEDC         8 Rst       +   RedoControl: Transactions processed: 2582 (2582)
10:59:19 0xEDC 20039 Admin Kernel state: 'REDO LOG' finished
[...]

Here we see that redo-entries are read and redofiles are created:

10:55:03      0xEDC     20012 Log      redo-read#700000:TR466037(36098)[9604]@NIL.5424'DeleteRecord':20080611:103212

At the end of the recovery there will be a summary, here marked with messagetype RST:

10:59:18      0xEDC         5 Rst      RedoControl: End of redo processing
10:59:18      0xEDC         4 Rst       +   RedoControl: OldestCommitSeq: 1, RedoTasks: 55 , MaxRedoTrans: 88
10:59:18      0xEDC         6 Rst       +   RedoControl: EndOfLogReached: true, RedoIsAborted: false, RedoProcessingDisabled: false
10:59:18      0xEDC        12 Rst       +   RedoControl: redo reader T0 wait reason:  isNotWaiting, waiting redo executer:  0
10:59:18      0xEDC         8 Rst       +   RedoControl: Transactions processed: 2582 (2582)

as well as the “we’re done!” message:

2008-06-11 10:59:19 0xEDC 20039 Admin Kernel state: 'REDO LOG' finished

Further reading

All these information are presented in a much better way in the  “MaxDB Internals” course material.
It’s available for free at http://maxdb.sap.com/training/
Even more information can be found in the MaxDBWikihere in SDN

 

Best regards,
Lars

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: