mirror of https://gitee.com/openkylin/linux.git
353 lines
17 KiB
ReStructuredText
353 lines
17 KiB
ReStructuredText
|
.. SPDX-License-Identifier: GPL-2.0
|
||
|
|
||
|
============================
|
||
|
XFS Self Describing Metadata
|
||
|
============================
|
||
|
|
||
|
Introduction
|
||
|
============
|
||
|
|
||
|
The largest scalability problem facing XFS is not one of algorithmic
|
||
|
scalability, but of verification of the filesystem structure. Scalabilty of the
|
||
|
structures and indexes on disk and the algorithms for iterating them are
|
||
|
adequate for supporting PB scale filesystems with billions of inodes, however it
|
||
|
is this very scalability that causes the verification problem.
|
||
|
|
||
|
Almost all metadata on XFS is dynamically allocated. The only fixed location
|
||
|
metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
|
||
|
other metadata structures need to be discovered by walking the filesystem
|
||
|
structure in different ways. While this is already done by userspace tools for
|
||
|
validating and repairing the structure, there are limits to what they can
|
||
|
verify, and this in turn limits the supportable size of an XFS filesystem.
|
||
|
|
||
|
For example, it is entirely possible to manually use xfs_db and a bit of
|
||
|
scripting to analyse the structure of a 100TB filesystem when trying to
|
||
|
determine the root cause of a corruption problem, but it is still mainly a
|
||
|
manual task of verifying that things like single bit errors or misplaced writes
|
||
|
weren't the ultimate cause of a corruption event. It may take a few hours to a
|
||
|
few days to perform such forensic analysis, so for at this scale root cause
|
||
|
analysis is entirely possible.
|
||
|
|
||
|
However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
|
||
|
to analyse and so that analysis blows out towards weeks/months of forensic work.
|
||
|
Most of the analysis work is slow and tedious, so as the amount of analysis goes
|
||
|
up, the more likely that the cause will be lost in the noise. Hence the primary
|
||
|
concern for supporting PB scale filesystems is minimising the time and effort
|
||
|
required for basic forensic analysis of the filesystem structure.
|
||
|
|
||
|
|
||
|
Self Describing Metadata
|
||
|
========================
|
||
|
|
||
|
One of the problems with the current metadata format is that apart from the
|
||
|
magic number in the metadata block, we have no other way of identifying what it
|
||
|
is supposed to be. We can't even identify if it is the right place. Put simply,
|
||
|
you can't look at a single metadata block in isolation and say "yes, it is
|
||
|
supposed to be there and the contents are valid".
|
||
|
|
||
|
Hence most of the time spent on forensic analysis is spent doing basic
|
||
|
verification of metadata values, looking for values that are in range (and hence
|
||
|
not detected by automated verification checks) but are not correct. Finding and
|
||
|
understanding how things like cross linked block lists (e.g. sibling
|
||
|
pointers in a btree end up with loops in them) are the key to understanding what
|
||
|
went wrong, but it is impossible to tell what order the blocks were linked into
|
||
|
each other or written to disk after the fact.
|
||
|
|
||
|
Hence we need to record more information into the metadata to allow us to
|
||
|
quickly determine if the metadata is intact and can be ignored for the purpose
|
||
|
of analysis. We can't protect against every possible type of error, but we can
|
||
|
ensure that common types of errors are easily detectable. Hence the concept of
|
||
|
self describing metadata.
|
||
|
|
||
|
The first, fundamental requirement of self describing metadata is that the
|
||
|
metadata object contains some form of unique identifier in a well known
|
||
|
location. This allows us to identify the expected contents of the block and
|
||
|
hence parse and verify the metadata object. IF we can't independently identify
|
||
|
the type of metadata in the object, then the metadata doesn't describe itself
|
||
|
very well at all!
|
||
|
|
||
|
Luckily, almost all XFS metadata has magic numbers embedded already - only the
|
||
|
AGFL, remote symlinks and remote attribute blocks do not contain identifying
|
||
|
magic numbers. Hence we can change the on-disk format of all these objects to
|
||
|
add more identifying information and detect this simply by changing the magic
|
||
|
numbers in the metadata objects. That is, if it has the current magic number,
|
||
|
the metadata isn't self identifying. If it contains a new magic number, it is
|
||
|
self identifying and we can do much more expansive automated verification of the
|
||
|
metadata object at runtime, during forensic analysis or repair.
|
||
|
|
||
|
As a primary concern, self describing metadata needs some form of overall
|
||
|
integrity checking. We cannot trust the metadata if we cannot verify that it has
|
||
|
not been changed as a result of external influences. Hence we need some form of
|
||
|
integrity check, and this is done by adding CRC32c validation to the metadata
|
||
|
block. If we can verify the block contains the metadata it was intended to
|
||
|
contain, a large amount of the manual verification work can be skipped.
|
||
|
|
||
|
CRC32c was selected as metadata cannot be more than 64k in length in XFS and
|
||
|
hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
|
||
|
metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
|
||
|
fast. So while CRC32c is not the strongest of possible integrity checks that
|
||
|
could be used, it is more than sufficient for our needs and has relatively
|
||
|
little overhead. Adding support for larger integrity fields and/or algorithms
|
||
|
does really provide any extra value over CRC32c, but it does add a lot of
|
||
|
complexity and so there is no provision for changing the integrity checking
|
||
|
mechanism.
|
||
|
|
||
|
Self describing metadata needs to contain enough information so that the
|
||
|
metadata block can be verified as being in the correct place without needing to
|
||
|
look at any other metadata. This means it needs to contain location information.
|
||
|
Just adding a block number to the metadata is not sufficient to protect against
|
||
|
mis-directed writes - a write might be misdirected to the wrong LUN and so be
|
||
|
written to the "correct block" of the wrong filesystem. Hence location
|
||
|
information must contain a filesystem identifier as well as a block number.
|
||
|
|
||
|
Another key information point in forensic analysis is knowing who the metadata
|
||
|
block belongs to. We already know the type, the location, that it is valid
|
||
|
and/or corrupted, and how long ago that it was last modified. Knowing the owner
|
||
|
of the block is important as it allows us to find other related metadata to
|
||
|
determine the scope of the corruption. For example, if we have a extent btree
|
||
|
object, we don't know what inode it belongs to and hence have to walk the entire
|
||
|
filesystem to find the owner of the block. Worse, the corruption could mean that
|
||
|
no owner can be found (i.e. it's an orphan block), and so without an owner field
|
||
|
in the metadata we have no idea of the scope of the corruption. If we have an
|
||
|
owner field in the metadata object, we can immediately do top down validation to
|
||
|
determine the scope of the problem.
|
||
|
|
||
|
Different types of metadata have different owner identifiers. For example,
|
||
|
directory, attribute and extent tree blocks are all owned by an inode, while
|
||
|
freespace btree blocks are owned by an allocation group. Hence the size and
|
||
|
contents of the owner field are determined by the type of metadata object we are
|
||
|
looking at. The owner information can also identify misplaced writes (e.g.
|
||
|
freespace btree block written to the wrong AG).
|
||
|
|
||
|
Self describing metadata also needs to contain some indication of when it was
|
||
|
written to the filesystem. One of the key information points when doing forensic
|
||
|
analysis is how recently the block was modified. Correlation of set of corrupted
|
||
|
metadata blocks based on modification times is important as it can indicate
|
||
|
whether the corruptions are related, whether there's been multiple corruption
|
||
|
events that lead to the eventual failure, and even whether there are corruptions
|
||
|
present that the run-time verification is not detecting.
|
||
|
|
||
|
For example, we can determine whether a metadata object is supposed to be free
|
||
|
space or still allocated if it is still referenced by its owner by looking at
|
||
|
when the free space btree block that contains the block was last written
|
||
|
compared to when the metadata object itself was last written. If the free space
|
||
|
block is more recent than the object and the object's owner, then there is a
|
||
|
very good chance that the block should have been removed from the owner.
|
||
|
|
||
|
To provide this "written timestamp", each metadata block gets the Log Sequence
|
||
|
Number (LSN) of the most recent transaction it was modified on written into it.
|
||
|
This number will always increase over the life of the filesystem, and the only
|
||
|
thing that resets it is running xfs_repair on the filesystem. Further, by use of
|
||
|
the LSN we can tell if the corrupted metadata all belonged to the same log
|
||
|
checkpoint and hence have some idea of how much modification occurred between
|
||
|
the first and last instance of corrupt metadata on disk and, further, how much
|
||
|
modification occurred between the corruption being written and when it was
|
||
|
detected.
|
||
|
|
||
|
Runtime Validation
|
||
|
==================
|
||
|
|
||
|
Validation of self-describing metadata takes place at runtime in two places:
|
||
|
|
||
|
- immediately after a successful read from disk
|
||
|
- immediately prior to write IO submission
|
||
|
|
||
|
The verification is completely stateless - it is done independently of the
|
||
|
modification process, and seeks only to check that the metadata is what it says
|
||
|
it is and that the metadata fields are within bounds and internally consistent.
|
||
|
As such, we cannot catch all types of corruption that can occur within a block
|
||
|
as there may be certain limitations that operational state enforces of the
|
||
|
metadata, or there may be corruption of interblock relationships (e.g. corrupted
|
||
|
sibling pointer lists). Hence we still need stateful checking in the main code
|
||
|
body, but in general most of the per-field validation is handled by the
|
||
|
verifiers.
|
||
|
|
||
|
For read verification, the caller needs to specify the expected type of metadata
|
||
|
that it should see, and the IO completion process verifies that the metadata
|
||
|
object matches what was expected. If the verification process fails, then it
|
||
|
marks the object being read as EFSCORRUPTED. The caller needs to catch this
|
||
|
error (same as for IO errors), and if it needs to take special action due to a
|
||
|
verification error it can do so by catching the EFSCORRUPTED error value. If we
|
||
|
need more discrimination of error type at higher levels, we can define new
|
||
|
error numbers for different errors as necessary.
|
||
|
|
||
|
The first step in read verification is checking the magic number and determining
|
||
|
whether CRC validating is necessary. If it is, the CRC32c is calculated and
|
||
|
compared against the value stored in the object itself. Once this is validated,
|
||
|
further checks are made against the location information, followed by extensive
|
||
|
object specific metadata validation. If any of these checks fail, then the
|
||
|
buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
|
||
|
|
||
|
Write verification is the opposite of the read verification - first the object
|
||
|
is extensively verified and if it is OK we then update the LSN from the last
|
||
|
modification made to the object, After this, we calculate the CRC and insert it
|
||
|
into the object. Once this is done the write IO is allowed to continue. If any
|
||
|
error occurs during this process, the buffer is again marked with a EFSCORRUPTED
|
||
|
error for the higher layers to catch.
|
||
|
|
||
|
Structures
|
||
|
==========
|
||
|
|
||
|
A typical on-disk structure needs to contain the following information::
|
||
|
|
||
|
struct xfs_ondisk_hdr {
|
||
|
__be32 magic; /* magic number */
|
||
|
__be32 crc; /* CRC, not logged */
|
||
|
uuid_t uuid; /* filesystem identifier */
|
||
|
__be64 owner; /* parent object */
|
||
|
__be64 blkno; /* location on disk */
|
||
|
__be64 lsn; /* last modification in log, not logged */
|
||
|
};
|
||
|
|
||
|
Depending on the metadata, this information may be part of a header structure
|
||
|
separate to the metadata contents, or may be distributed through an existing
|
||
|
structure. The latter occurs with metadata that already contains some of this
|
||
|
information, such as the superblock and AG headers.
|
||
|
|
||
|
Other metadata may have different formats for the information, but the same
|
||
|
level of information is generally provided. For example:
|
||
|
|
||
|
- short btree blocks have a 32 bit owner (ag number) and a 32 bit block
|
||
|
number for location. The two of these combined provide the same
|
||
|
information as @owner and @blkno in eh above structure, but using 8
|
||
|
bytes less space on disk.
|
||
|
|
||
|
- directory/attribute node blocks have a 16 bit magic number, and the
|
||
|
header that contains the magic number has other information in it as
|
||
|
well. hence the additional metadata headers change the overall format
|
||
|
of the metadata.
|
||
|
|
||
|
A typical buffer read verifier is structured as follows::
|
||
|
|
||
|
#define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc)
|
||
|
|
||
|
static void
|
||
|
xfs_foo_read_verify(
|
||
|
struct xfs_buf *bp)
|
||
|
{
|
||
|
struct xfs_mount *mp = bp->b_mount;
|
||
|
|
||
|
if ((xfs_sb_version_hascrc(&mp->m_sb) &&
|
||
|
!xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
|
||
|
XFS_FOO_CRC_OFF)) ||
|
||
|
!xfs_foo_verify(bp)) {
|
||
|
XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
|
||
|
xfs_buf_ioerror(bp, EFSCORRUPTED);
|
||
|
}
|
||
|
}
|
||
|
|
||
|
The code ensures that the CRC is only checked if the filesystem has CRCs enabled
|
||
|
by checking the superblock of the feature bit, and then if the CRC verifies OK
|
||
|
(or is not needed) it verifies the actual contents of the block.
|
||
|
|
||
|
The verifier function will take a couple of different forms, depending on
|
||
|
whether the magic number can be used to determine the format of the block. In
|
||
|
the case it can't, the code is structured as follows::
|
||
|
|
||
|
static bool
|
||
|
xfs_foo_verify(
|
||
|
struct xfs_buf *bp)
|
||
|
{
|
||
|
struct xfs_mount *mp = bp->b_mount;
|
||
|
struct xfs_ondisk_hdr *hdr = bp->b_addr;
|
||
|
|
||
|
if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
|
||
|
return false;
|
||
|
|
||
|
if (!xfs_sb_version_hascrc(&mp->m_sb)) {
|
||
|
if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
|
||
|
return false;
|
||
|
if (bp->b_bn != be64_to_cpu(hdr->blkno))
|
||
|
return false;
|
||
|
if (hdr->owner == 0)
|
||
|
return false;
|
||
|
}
|
||
|
|
||
|
/* object specific verification checks here */
|
||
|
|
||
|
return true;
|
||
|
}
|
||
|
|
||
|
If there are different magic numbers for the different formats, the verifier
|
||
|
will look like::
|
||
|
|
||
|
static bool
|
||
|
xfs_foo_verify(
|
||
|
struct xfs_buf *bp)
|
||
|
{
|
||
|
struct xfs_mount *mp = bp->b_mount;
|
||
|
struct xfs_ondisk_hdr *hdr = bp->b_addr;
|
||
|
|
||
|
if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
|
||
|
if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
|
||
|
return false;
|
||
|
if (bp->b_bn != be64_to_cpu(hdr->blkno))
|
||
|
return false;
|
||
|
if (hdr->owner == 0)
|
||
|
return false;
|
||
|
} else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
|
||
|
return false;
|
||
|
|
||
|
/* object specific verification checks here */
|
||
|
|
||
|
return true;
|
||
|
}
|
||
|
|
||
|
Write verifiers are very similar to the read verifiers, they just do things in
|
||
|
the opposite order to the read verifiers. A typical write verifier::
|
||
|
|
||
|
static void
|
||
|
xfs_foo_write_verify(
|
||
|
struct xfs_buf *bp)
|
||
|
{
|
||
|
struct xfs_mount *mp = bp->b_mount;
|
||
|
struct xfs_buf_log_item *bip = bp->b_fspriv;
|
||
|
|
||
|
if (!xfs_foo_verify(bp)) {
|
||
|
XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
|
||
|
xfs_buf_ioerror(bp, EFSCORRUPTED);
|
||
|
return;
|
||
|
}
|
||
|
|
||
|
if (!xfs_sb_version_hascrc(&mp->m_sb))
|
||
|
return;
|
||
|
|
||
|
|
||
|
if (bip) {
|
||
|
struct xfs_ondisk_hdr *hdr = bp->b_addr;
|
||
|
hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
|
||
|
}
|
||
|
xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
|
||
|
}
|
||
|
|
||
|
This will verify the internal structure of the metadata before we go any
|
||
|
further, detecting corruptions that have occurred as the metadata has been
|
||
|
modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
|
||
|
update the LSN field (when it was last modified) and calculate the CRC on the
|
||
|
metadata. Once this is done, we can issue the IO.
|
||
|
|
||
|
Inodes and Dquots
|
||
|
=================
|
||
|
|
||
|
Inodes and dquots are special snowflakes. They have per-object CRC and
|
||
|
self-identifiers, but they are packed so that there are multiple objects per
|
||
|
buffer. Hence we do not use per-buffer verifiers to do the work of per-object
|
||
|
verification and CRC calculations. The per-buffer verifiers simply perform basic
|
||
|
identification of the buffer - that they contain inodes or dquots, and that
|
||
|
there are magic numbers in all the expected spots. All further CRC and
|
||
|
verification checks are done when each inode is read from or written back to the
|
||
|
buffer.
|
||
|
|
||
|
The structure of the verifiers and the identifiers checks is very similar to the
|
||
|
buffer code described above. The only difference is where they are called. For
|
||
|
example, inode read verification is done in xfs_inode_from_disk() when the inode
|
||
|
is first read out of the buffer and the struct xfs_inode is instantiated. The
|
||
|
inode is already extensively verified during writeback in xfs_iflush_int, so the
|
||
|
only addition here is to add the LSN and CRC to the inode as it is copied back
|
||
|
into the buffer.
|
||
|
|
||
|
XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
|
||
|
the unlinked list modifications check or update CRCs, neither during unlink nor
|
||
|
log recovery. So, it's gone unnoticed until now. This won't matter immediately -
|
||
|
repair will probably complain about it - but it needs to be fixed.
|