Why You Need the Before-image Log by Gus Bjorklund

dmi · Сообщение **dmi** » 13 июл 2005, 14:07

I thought I'd write this in view of last week's thread
on the bi file. Hope it clarifies things a bit.

Thanks to Tom Bascom and George Potemkin for reviewing,
providing suggestions for improvement, and pointing out
errors.

-gus

Why You Need the Before-image Log
=================================
Gus Bjorklund, July 4, 2005

Introduction

The before-image log, also called the "bi file", contains
the primary transaction log[1] of the Progress RDBMS. It
is an integral and essential part of every database and is
just as important as the data extents used to store table
data and their associated indexes. Without the
before-image log extents, the database cannot be accessed.

If you should happen to lose your bi log or it becomes
corrupted somehow, your database will be hosed (that's a
technical term for "damaged beyond repair"). To reliably
recover from such a situation, you must restore the
database from a recent backup and roll forward using the
after-image logs. You do make backup copies of your
databases, don't you? You do use the after-image
journalling feature, don't you?

Some have advocated a number of short-cuts to avoid
restoring from backup in the event the bi log is lost.
Such methods hardly ever work and even worse: there is no
way to tell if they do. You will not discover that these
"recovery" methods don't work until much later when you
encounter errors caused by data corruption that has been
there since the apparently successful short-cut was taken.

If your data are valuable, if you experience loss of a data
extent or a transaction log extent, don't rely on any
recovery method other than restoring from a backup.

Purpose of the Before-image Log

Although it is called the "before-image" log, the data
stored in it are /not/ before-images of anything. The name
is historical and perhaps was poorly chosen. The
before-image log contains records, called "notes", that
allow previously made changes to the database to be
repeated (redone) or rolled back (undone).

The before-image log is mainly used for two things:

0) To perform crash recovery in the event of a failure,

1) To perform transaction rollback during normal
processing

In addition, the before-image log allows for a variety of
performance optimisations because it, along with the
storage manager's buffer management policies[2], permits
database blocks to be modified in-place without requiring
them to be written to the data extents right away (neither
at the time changes are made, nor at transaction commit
time).

How It Works:

During normal operation, copies of database blocks are read
from the data extents on disk into the database buffer pool
(also called the buffer cache) as needed, if a requested
block is not already present. When database changes are
made by running transactions, those changes are made
/only/ in memory, on the /copies/ of the blocks in the
buffer pool.

Eventually, the changed blocks must be written out to the
data extents on disk, but the database storage manager
postpones the write operations for as long as possible,
sometimes for many hours or even days. One important
reason for the postponement is that a given block will
often be updated more than once and it is advantageous to
avoid writing it for each change. Changed blocks are
normally written by an automatic asynchronous checkpoint
mechanism (a topic too complex to go into here), or when a
normal database shutdown takes place.

If the changes are made only in memory, what happens if the
system crashes and the changes are lost? The answer is that
the changes can be recovered using the information that is
recorded in the transaction log (the bi file). As each
change is made to any database block (and to some in-memory
data structures as well), a note describing the change is
first spooled to the transaction log buffers. For some
types of changes, multiple notes are generated. The change
notes in the transaction log buffers must be forced to disk
when the transaction that generated them is committed (or
shortly thereafter, more on that later).

The notes written to the before-image log contain storage
area numbers, block numbers, rowids, and other data which
are used to identify what was changed and precisely how it
was changed. As a result, the before-image log is tightly
coupled to its database. A before-image log can only be
used with its own database.

The storage manager meticulously follows what we call the
"write-ahead logging rule" or WAL rule. This rule states
that no changed database block can be written to a data
extent on-disk unless the transaction log notes describing
the changes to it have been written to the transaction log
/first/. Thus there will always be a record of the changes
on disk in the transaction log[6]. If the changes are lost,
we can use the notes in the transaction log to recreate
those changes by performing them again[3]. Similarly, we
can use the same log notes to undo changes that were made
by an uncommitted transaction.

The WAL rule makes it possible to update database blocks in
memory and not immediately write them to the data extents on
disk. Furthermore, we do not have to write the changes to
disk even when the transaction that made them commits. We
merely have to write the change notes to the transaction
log.

A result of the foregoing is that the on-disk versions of
the data extents are almost never up-to-date in a running
system. The complete and correct database state is made up
of three separate parts:

0) the blocks on disk in the various data extents

1) newer versions of some of those blocks in memory

2) the change notes in the before-image log

The data in the before-image log and the data in the data
extents /taken together/ is sufficient to recreate the
newer versions that are present in memory in the buffer
pool. That means it is ok if we lose what's in memory and
when the system crashes, we /do/ lose those changed
blocks. However, if a crash occurs and there is no bi log
(perhaps because the disk it was on failed), then it is
impossible to recreate the changes that were buffered in
memory and not written to disk. Note that it is impossible
to recreate the data in the before-image log.

If you use the after-image journalling feature, then you
can easily recover from the loss of either the before-image
log or data extents. The after-image log extents contain a
second copy of all the change notes that are written to the
before-image log. By restoring a backup of te database and
rolling forward all of the previous ai logs, you can
recreate all of the changes that were made since the backup.

Crash Recovery

During normal operation as transactions update the
database, before-image log notes are generated, spooled to
the log buffers, and written to the log, blocks are updated
in the buffer pool in shared-memory, and blocks are
occasionally written to the data extents. If a crash
occurs, the contents of shared-memory will be lost.
Assuming the crash was not caused by a failure that caused
the loss or corruption of the data on the disks, we will be
left only with what remains on disk in the before-image log
and in the outdated data extents.

When the database is restarted after a crash, the recovery
manager reads through the notes in the before-image log and
recreates any changes that were not previously written to
disk. To aid in this process, each database block contains
a version number that is incremented whenever a change is
made. The version number and the block number are copied
to the before-image note describing the change. As the
recovery manager reads the before-image notes, in the order
they were written, it compares the version number in the
note with the version number in the block, which it reads
into the buffer pool from disk if necessary. If the
block's version number is later than the note's, we know
the change described by the note was previously made[4].
We can discard it and read the next one. If the version
number matches, we make the change and increment the
block's version number[5]. Once crash recovery reaches the
end of the before-image log, we will have reproduced any
changes that were lost. Most of the changed blocks will be
in memory in the database buffer pool, but some may have
been written to the data extents if it was necessary to
make room for reading in blocks. Once all the notes have
been read, we can roll back any incomplete transactions and
recovery is complete. Note: changes to the in-memory
transaction table are also recorded in the before-image log
and it is reconstructed as the notes are read.

When Changed Blocks Are Written to Data Extents

Modified blocks in the database buffer pool are written to
their associated data extents on disk at the following
times:

0) When an asynchronous page writer (APW) processes the
modified blocks in the checkpoint queue and writes them.

1) When a block not in the buffer pool is requested and a
previously modified block must be written to make room
for the new block.

2) When an APW writes a modified block from the APW "write
it now queue" (called simply "the APW queue").

3) When an APW has no useful work to do and slowly scans
the modified block list, occasionally writing them.

4) Shortly after an online backup is initated, the online
backup tool writes all modified blocks in the buffer
pool. This is actually unnecessary.

5) When the database is shut down by way of a normal
shutdown.

In all of these cases, because we must adhere to the
write-ahead-logging rule, all of the transaction log notes
that describe the changes must be written to the
transaction log /before/ the data blocks can be written.

In the first five cases, the database is in use and usually
new changes are being made at the same time. There is no
way to know which data extents are up-to-date and which are
not. In the case labelled 3, while there may be little or
no update activity taking place, there is still no way to
be certain that all modified blocks have been written.
This is true even if the database has been idle for a long
time, even several hours.

Note that if you are not using the APWs the situation is
somewhat different. Without page writers, blocks will be
written at the following times:

0) When a process generating a note discovers that the
current log cluster is full. Before initialising the
next log cluster, all modified blocks on the checkpoint
queue will be written.

1) When a block not in the buffer pool is requested and a
previously modified block must be written to make room
for the new block.

3) Shortly after an online backup is initated, the online
backup tool writes all modified blocks in the buffer
pool. This is actually unnecessary.

4) When the database is shut down by way of a normal
shutdown.

Without page writers, all of these writes will take longer
than they do when page writers are used.

The disk-resident data extents are completely up-to-date
when /all/ of the following conditions have been satisfied:

0) The database is not in use in either multi-user mode or
in single-user mode and is not being used by any
database utility or any other program.

1) The database was shut down by way of a normal shutdown.

2) All data blocks that were written previously have been
written to disk from the operating system's buffer pool
by the operating system. It is sometimes difficult to
know when this condition is true.

If you are using the -directio startup option, or you are
using Version 10, then following completion of a successful
shutdown, it will be true.

If you are NOT using the -directio option or Version 10,
then it is not possible to determine /with certainty/ that
there are no unwritten blocks in the operating system
buffer pool. At shutdown, after writing all modified
blocks, the storage manager issues a sync() system call to
request the operating system to flush its buffers,
including data unrelated to the database. On some
operating systems, this system call returns before the
operating system's writes have been completed[7]. The
storage manager always waits to give the system time to
finish writing before declaring the shutdown complete. The
default wait time is 60 seconds, which may be insufficient.
The -G startup parameter can be used to increase the wait
time. If the system crashes before the operating system
has finished writing, some changes may be lost.

3) There are no unwritten data blocks in disk controller
buffers, buffers in the disks themselves, or disk
subsystem (for example, a disk array) buffers. Unless
you are using a known to be reliable, high-quality
storage subsystem with battery backup, disks should
always be configured so that writes are not buffered.

When the database is restarted after a normal shutdown, it
goes through some of the same processing used in crash
recovery. For no particular reason, we call this "warm
start" processing. Among other things, it reads
transaction log notes and verifies that no block changes
were lost after the shutdown. Unless the system crashes
after the shutdown but before it finished writing its
filesystem buffers, no changes will have been lost.

If all of the above conditions are met, and only then, you
can be sure the data extents on disk are up-to-date

An Example Failure Scenario:

The proutil tool can be used to skip crash recovery (with
the -F option) and truncate the bi log, discarding its
contents. Skipping crash recovery means the records in the
before-image log are not read or processed, any database
changes that were lost are not recreated, and any
incomplete transactions will not be rolled back. As a
result the database will likely be corrupted by performing
this operation. Furthermore, this corruption may not be
immediately apparent. You may be able to successfully open
the database and access some, but not all, of the data in
it. The following example illustrates what may happen.
The example is small to make it easy to understand. I've
left out a few details to keep it from becoming too
involved, but nothing important. Everything is applicable
to the real world.

Consider the following sequence of events:

Let's assume you have a very small buffer pool, with only 4
database buffers, and you are going to update 1 record and
then read another one.

To fetch the record that will be updated, we need to read
some stuff from the data extents on disk. First, we read
in the root block of an index into one of the 4 buffers.
Assume the index has two levels, so we will have to read a
second index block as well.

Next, we read a record block to fetch the record that is
going to be updated. Assume it has two fragments so we
will also have to read another record block.

At this point each of the 4 blocks in the buffer pool
contains one of the blocks we just read - two index blocks
and two record blocks.

Now, when we update the record, each of the two fragments
will be updated, causing the /both/ record blocks to become
dirty (i.e. modified). Transaction log notes will be
generated for each change.

After the update is finished, we have the following
situation:

- We generated 4 notes to the before-image log:

a) a transaction begin note,

b) a record update note for the first fragment,

c) a record update note for the second fragment,

d) a transaction end note.

- Nothing has been written to disk. So on disk, we still
have exactly what we started with.

- In the buffer pool, there are two dirty blocks which
have not been written to disk, but will need to be
written in the future.

Now we read the second record, again reading in two index
blocks. Because the previous index blocks were oldest,
they will be discarded and the new ones replace them and
move to the front of the lru chain.

Then we read in a single record block, replacing one of the
previously modified record blocks, which is written to
disk. We do not update this second record.

At this point, we have:

- the 4 notes from the previous update

- on disk, one of the two modified record blocks has
been written

- in memory we have two index blocks, one unmodified
record block, and one modified record block.

The modified record block contains one of the two fragments
from the record we updated earlier. The unmodified one
contains the second record we read but did not update.

Now assume that the the system crashes because someone
unplugs the server so they can plug in the vacuum
cleaner. Stuff like this actually happens! But failures
occur for many other reasons too. Maybe it was a laptop
instead of a vacuum cleaner.

After we plug in the server and reboot it, we have:

- the 4 before-image log notes from the update,
in the bi file.

- on disk, one of the two modified record blocks
which was written,

- the rest of the stuff that was in the data extents
before we started this exercise.

- whatever was in memory has disappeared. This
includes one of our updated record blocks. The
/only/ evidence that the block was updated is in
the notes in the before-image log.

If we start the database the normal way, it will go through
crash recovery, read the before-image log notes, discover
that one of the block updates was lost, and make that
update again. After crash recovery does this, we will
again have both updated record blocks. Everything is quite
fine.

If instead we use proutil to truncate the bi log with the
-F option, the bi notes are thrown away without being
processed. The note that contains the only evidence of the
updated block that was never written to disk is discarded.

We are left with whatever was on disk in the database. We
have one half of the updated record and the old version of
the other half. Or maybe there was no old version of the
other half because the original record fit in one block but
the updated one did not.

Rebuilding indexes will not repair the damaged record.
There is /no/ information available /anywhere/ that can be
used to repair it. You are hosed, big time.

When you skip crash recovery this way. the "damaged" flag
is set in the database and each time you start the
database, you receive a message that says the database is
damaged and you must dump and reload the data. If the
damage is severe, this may not be possible. Even when it
is possible, you are likely to lose some data.

Nota Bene:

If you do what is sometimes called a "mock index rebuild",
where you tell the index-rebuild tool that you want to
rebuild some indexes and then don't actually specify any,
the database damaged flag that got turned on because you
skipped crash recovery will be turned off. But nothing has
been fixed. You are still hosed, big time.

If you actually rebuild all the indexes, and the damage is
confined to indexes, then doing the index rebuild will
result in a good database. But index rebuild cannot and
does not repair damaged records. The index rebuild might
complete successfully even though damaged records are
present, as long as all those fragments that contain the
key fields for the indexes being rebuilt are readable.

The loss of a before-image file /may/ be recoverable via
proutil truncate bi with the -F option if you are absolutely
certain that the database was cleanly shutdown and you are
confident that the operating system and disk subsystem
flushed everything to disk. Unfortuntely, it is difficult
or impossible to tell when this recovery method is
successful.

Final Words

The before-image log is part of the database just like data
extents are. If you lose a data extent, you must restore
the database from a backup. If you lose your before-image
log, you must restore the database from backup. There is
no other solution that can be reliably expected to work.

Footnotes:

[1] The database geek's technical term for this kind of
transaction log is "undo-redo log".

[2] The buffer management policy is "no-force, steal" which
means that modified buffers do not have to be forced to disk
at transaction commit time, and a modified buffer can be
written to disk, possibly by another transaction, whenever
it is necessary to do so in order to make room to read in a
block that is not present in the buffer pool.

[3] This is the reason it is referred to as an "undo-redo
log" -- it allows us to redo changes we made previously.
Redo is also called "repeating history" or "replaying
transactions".

[4] It is possible for only some changes, but not all, to
be lost in the event of a crash. This might occur if a
modified block needs to be written to disk in order to make
room for another block that needs to be read into memory
(into the buffer pool). It may also occur if a block is
written but the operating system buffers it and has not yet
written it to disk.

[5] Note that this scheme requires that all changes must be
redone in the same order they were done to begin with, and
no change can be left out. Each change record is a
description of how to change the block from version "n" to
version "n+1". The situation for rollback is different,
more complicated, and we will not go into it here.

[6] Strictly speaking, only descriptions of /recent/
changes are available in the before-image log. This is
because space in the log is reused to reduce the amount of
disk space it occupies. When reusing log space, the log
manager first ensures that the data about to be discarded
(i.e overwritten with new data) will no longer be needed.

[7] You can tell if the sync() call is synchronous by
performing a simple experiment. At a time when the system
is active and many file writes are taking place, type a
"time sync" command in a command shell a few times and see
how long it takes. If it returns immediately the first
time, then sync() is probably not synchronous. If it takes
awhile (several seconds or a few minutes) and subsequent
sync commands return quickly, then it is synchronous and
the disk writes were completed before returning control to
the shell. This experiment only works when there are many
modified filesystem buffers present in the system buffer
pool.

-30-

--
|regards,
|gus
|-------------------------------------------------------------------
| Gus Bjorklund, Wizard and Vice President, Technology
| Progress Software, Bedford MA.
|
| 2 strings walk into a bar. The first string says to the bartender,
| "Bartender, I'll have a beer. u.5n$x5t?*&4ru!2[sACC~ErJ" The second
| string says "Pardon my friend, he isn't NUL terminated."