FAT File System Troubleshooting

Armed with an understanding of the FAT file system, you can troubleshoot file system problems, even those involving the FAT itself. You can use ScanDisk to detect file system logic errors, but you should not allow it to automatically fix all errors found; if you do "fix" anything, append the results to the existing ScanDisk log for future reference. For more detailed manual data recovery work, you need a raw disk editor such as Norton DiskEdit.

Both manual recovery and automated tools such as ScanDisk use the following clues to detect file system errors:

This is why automatic "fixing" of errors is often a bad idea; once "fixed", the clues that point to damaged data are lost (so that you can't detect which data is damaged) and the contradictory redundant information that could have been used to repair the error is lost as well. You should set ScanDisk.ini (or the WinME equivalent) to safeguard against this risk.

Sanity-checking directories

Directories contain directory entries, each of which is 32 bytes long, and each starts with name data.
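As a rough sketch of what one of these 32-byte entries holds, the following decodes a short (8.3) entry. The field offsets are the standard FAT ones; the function name and the returned dictionary keys are made up for illustration, and the high cluster word at offset 20 only means anything on FAT32.

```python
import struct

def parse_dir_entry(raw):
    """Decode one 32-byte FAT short (8.3) directory entry."""
    if len(raw) != 32:
        raise ValueError("a directory entry is exactly 32 bytes")
    attr = raw[11]
    cluster_hi, = struct.unpack_from("<H", raw, 20)  # FAT32 only; zero elsewhere
    cluster_lo, = struct.unpack_from("<H", raw, 26)
    size, = struct.unpack_from("<I", raw, 28)
    return {
        "name": raw[0:8].decode("ascii", "replace").rstrip(),
        "ext": raw[8:11].decode("ascii", "replace").rstrip(),
        "attr": attr,
        "is_dir": bool(attr & 0x10),           # directory attribute bit
        "is_lfn": (attr & 0x3F) == 0x0F,       # long file name fragment
        "start_cluster": (cluster_hi << 16) | cluster_lo,
        "size": size,
    }
```

Feed it any 32-byte slice copied out of a directory cluster in your disk editor; implausible sizes or start clusters are a quick hint that the entry is garbage.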

Subdirectories (i.e. all directories other than the root) start with . and .. entries, so you can find "lost" directories by searching the raw disk for these (this is the method DiskEdit uses).

The . pointer should point to itself, i.e. the cluster address it occupies. If this address does not match, then one of the following applies:

The .. pointer points to the parent, and a value of zero means the parent is the root. So, to build up a root directory from scratch, look for and create entries for all found subdirs that have .. pointers to zero.
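The search for . and .. entries can be sketched as below. This is not how DiskEdit itself is written, just a minimal illustration; it assumes you have the volume as a bytes image and already know (or are guessing) the byte offset of the data area and the cluster size. The names find_lost_dirs, data_offset and cluster_size are hypothetical.

```python
import struct

DOT = b".".ljust(11)      # 8.3 name field of the "." entry, space padded
DOTDOT = b"..".ljust(11)  # 8.3 name field of the ".." entry

def _start_cluster(entry):
    hi, = struct.unpack_from("<H", entry, 20)  # high word, FAT32 only
    lo, = struct.unpack_from("<H", entry, 26)
    return (hi << 16) | lo

def find_lost_dirs(image, data_offset, cluster_size):
    """Scan cluster starts for a "." entry followed by a ".." entry."""
    hits = []
    cluster = 2  # the first cluster of the data area is numbered 2
    for pos in range(data_offset, len(image) - 63, cluster_size):
        first, second = image[pos:pos + 32], image[pos + 32:pos + 64]
        if (first[:11] == DOT and second[:11] == DOTDOT
                and first[11] & 0x10 and second[11] & 0x10):
            self_ptr = _start_cluster(first)
            hits.append({
                "cluster": cluster,
                "self_ptr": self_ptr,
                "parent_ptr": _start_cluster(second),  # 0 means root
                "self_ok": self_ptr == cluster,
            })
        cluster += 1
    return hits
```

Hits with parent_ptr of zero are candidates for a rebuilt root directory. If self_ok is false for every hit by the same amount, suspect your data_offset guess rather than the directories.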

LFN entries occur from last to first, and are then followed by the actual 8.3 entry that defines the file or directory thus named. If the sequence of clusters in a directory is broken, this may interrupt the chain of LFN fragments and be detected by ScanDisk as a "LFN error".
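To make the ordering concrete, here is a small sketch that walks a directory buffer and reports where the LFN sequence breaks. It relies on the standard layout (sequence number in the first byte, 40h flag on the fragment that comes first on disk, attribute 0Fh) and deliberately skips the LFN checksum test, so it is only a partial stand-in for what ScanDisk checks. The function name is made up.

```python
def lfn_errors(directory):
    """Return indexes of entries where an LFN run is out of sequence."""
    errors = []
    expected = None  # next sequence number due; None when outside a run
    for index in range(len(directory) // 32):
        entry = directory[index * 32:(index + 1) * 32]
        first, attr = entry[0], entry[11]
        if first == 0x00:            # end of listing
            break
        if first == 0xE5:            # deleted entry interrupts any run
            if expected is not None:
                errors.append(index)
            expected = None
        elif (attr & 0x3F) == 0x0F:  # LFN fragment
            seq = first & 0x1F
            if first & 0x40:         # flagged fragment starts a new run
                if expected is not None:
                    errors.append(index)
                expected = seq - 1
            elif expected is None or seq != expected:
                errors.append(index)
                expected = None
            else:
                expected = seq - 1
        else:                        # ordinary 8.3 entry
            if expected not in (None, 0):
                errors.append(index)  # run ended before reaching fragment 1
            expected = None
    return errors
```

An empty list means the fragments count down cleanly to 1 before each 8.3 entry; anything else points at the entry where the chain of fragments was interrupted.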

If the first byte of an entry is zero, then that is taken to mark the end of the listing and no further entries should follow. If such an entry occurs in the middle of a directory, and ScanDisk is allowed to "fix" this, then all entries following the zero entry will be lost. With this in mind, you should ensure valid first-byte values for all entries to the end; use the E5h character to "comment out" garbage entries, as "deleted".
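A minimal sketch of this check follows: it lists entries that sit beyond a zero entry, and shows the E5h "comment out" trick on an in-memory copy. It assumes the whole directory has been read into a bytearray; writing the result back to disk is up to you and your disk editor. The names are illustrative only.

```python
END_OF_LIST = 0x00  # first byte that marks the end of the listing
DELETED = 0xE5      # first byte that marks a deleted entry

def stranded_entries(directory):
    """Return indexes of entries with a non-zero first byte after a 00h entry."""
    stranded = []
    seen_end = False
    for index in range(len(directory) // 32):
        first_byte = directory[index * 32]
        if first_byte == END_OF_LIST:
            seen_end = True
        elif seen_end:
            stranded.append(index)
    return stranded

def comment_out(directory, index):
    """Mark entry `index` as deleted by setting its first byte to E5h."""
    directory[index * 32] = DELETED
```

If stranded_entries returns anything, comment out the offending zero entries before letting any tool "fix" the directory, then run the check again to confirm the list is empty.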

Non-unique directory entry names and invalid characters within 8.3 name fields can cause files or directories so named to be uncopyable or undeletable. You can attempt to use wildcard selection logic to work around this, but it's often better to fix such errors via direct editing of the disk.

Byte offsets and 4-byte garbage errors are suggestive of flaky hardware, such as bad cables, an overclocked PCI bus, poor chassis grounding of hard drives, overheating, etc. You can sometimes correct an offset error - or a stuck-bit error - and get full recovery of what looked like garbage.

Total garbage directories are the result of one of three things: a sector full of garbage being written into the cluster chain, cross-linking followed by overwrite, or a corrupted pointer that points to arbitrary data. In the last case, look for "bit-puns", i.e. a value that differs by one flipped bit may point to the real data.
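Hunting for bit-puns by hand is tedious, so here is a small sketch that lists every value one flipped bit away from a suspect pointer. The function name and the max_cluster filter are assumptions added for illustration; use width=32 for FAT32 pointers.

```python
def bit_puns(value, width=16, min_cluster=2, max_cluster=None):
    """List cluster addresses that differ from `value` by exactly one bit."""
    candidates = []
    for bit in range(width):
        candidate = value ^ (1 << bit)  # flip one bit
        if candidate < min_cluster:
            continue                    # clusters 0 and 1 never hold data
        if max_cluster is not None and candidate > max_cluster:
            continue                    # beyond the end of the volume
        candidates.append(candidate)
    return candidates
```

For example, bit_puns(0x0105, max_cluster=0x200) includes 0x0005 (bit 8 flipped) but drops 0x0305, which lies past the stated end of the volume. Look at each candidate cluster in the disk editor to see whether it holds a sane directory.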

Sanity-checking FAT

On hard drive volumes, each FAT typically starts with the byte F8h, which shows as a degree sign in the usual DOS character set (code page 437); you can check that each putative FAT starts in this way. Then, if possible, get a dual-window view with FAT1 in one window and FAT2 in the other, and compare the two to detect errors.
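If you have copied both FATs out to files, the side-by-side comparison can be sketched as follows. This assumes FAT16 (16-bit little-endian entries); the function names are made up, and for FAT32 you would unpack 32-bit values instead.

```python
import struct

def fat16_entries(fat):
    """Unpack a raw FAT16 table into a list of 16-bit entries."""
    count = len(fat) // 2
    return list(struct.unpack("<%dH" % count, fat[:count * 2]))

def starts_with_media_byte(fat, media=0xF8):
    """True if the table begins with the expected media descriptor byte."""
    return len(fat) > 0 and fat[0] == media

def diff_fats(fat1, fat2):
    """Return (index, fat1_value, fat2_value) for every entry that differs."""
    pairs = zip(fat16_entries(fat1), fat16_entries(fat2))
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]
```

An empty result from diff_fats means the two copies agree, which says nothing about whether they are both right, only that they are not contradicting each other.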

The most significant bytes of FAT addresses generally fall in a low range, especially where FAT32 is concerned. So arbitrary garbage thrown into the FAT is usually obvious, due to insanely high address values. You can also use this factor to determine whether a FAT is FAT12, FAT16, or FAT32 (the boot record will tell you - if it's sane).
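The "insanely high values" test can be automated along these lines. The sketch assumes FAT16 entries already unpacked into a list of integers, and that you know roughly how many clusters the volume has; cluster_count and special_min are illustrative parameter names, with FFF7h and above treated as the bad-cluster and end-of-file markers.

```python
def implausible_entries(entries, cluster_count, special_min=0xFFF7):
    """Return (index, value) for FAT entries that cannot be valid."""
    highest = cluster_count + 1  # clusters are numbered from 2 upwards
    suspects = []
    for index, value in enumerate(entries):
        if index < 2:
            continue             # the first two entries are reserved
        if value == 0:
            continue             # free cluster
        if 2 <= value <= highest:
            continue             # points at a cluster that exists
        if value >= special_min:
            continue             # bad-cluster or end-of-file marker
        suspects.append((index, value))
    return suspects
```

A handful of suspects suggests localized damage; a solid run of them suggests a sector of foreign data has been splatted into the FAT, or that you are reading the table with the wrong entry width.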

Empty space appears as zero values, making it hard to differentiate genuine FAT content from splatted-in sectors full of nulls. Raw disk format typically fills with F6h bytes, which show as "divide-by" characters in the usual DOS character set, and this helps.

Non-unique cluster addresses, for all values other than zero or those "special" values that signify bad clusters and end-of-file markers, are invalid for FAT. They may indicate cross-links in genuine FAT content, or result from arbitrary garbage thrown into the FAT area. A quick look in ASCII mode will usually spot the latter.
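Spotting duplicate addresses by eye gets hard on a large FAT, so here is a minimal sketch that does the counting. It assumes FAT16 entries unpacked into a list of integers, and treats zero and everything from FFF7h up as the "special" values; the function name is made up.

```python
from collections import defaultdict

def find_cross_links(entries, special_min=0xFFF7):
    """Map each cluster address that appears more than once to its sources."""
    pointed_from = defaultdict(list)
    for index, value in enumerate(entries):
        if index < 2 or value == 0 or value >= special_min:
            continue  # reserved, free, bad-cluster or end-of-file
        pointed_from[value].append(index)
    return {target: sources
            for target, sources in pointed_from.items()
            if len(sources) > 1}
```

Each key in the result is a cluster that two or more chains run into; the listed source clusters tell you where to start looking for the files involved.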

Normal non-empty FAT usually looks like a steadily increasing set of cluster addresses, punctuated with end-of-file markers and a few zeros where files have been deleted. However, as files are deleted and created, discontinuities will occur, either as files are fragmented, or simply created "further up" the volume.

The above should make it easy to tell a valid FAT from outright garbage, but subtle errors may be more difficult to decide between. In such cases, it's best to pull files off twice, once using each copy of the FAT. If you have to pick one, choose the one with more information - i.e. the one with a valid set of cluster addresses rather than zeros. If there was no data there, the non-zero values won't matter (though ScanDisk may recover "lost chains"), but if a directory entry points to a chain with zeros in it, this will typically fail and abort the copy operation.
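The "prefer the copy with more information" rule can be sketched as an entry-by-entry merge. This works on already-unpacked entry lists of equal width; where both copies are non-zero and disagree, the sketch arbitrarily keeps the FAT1 value and reports the conflict, which is an assumption of this example and not a recommendation. Those conflicts still need a human eye.

```python
def merge_fats(fat1_entries, fat2_entries):
    """Merge two FAT copies, preferring non-zero values over zeros."""
    merged, conflicts = [], []
    for index, (a, b) in enumerate(zip(fat1_entries, fat2_entries)):
        if a == b:
            merged.append(a)
        elif a == 0:
            merged.append(b)   # FAT2 has information that FAT1 lacks
        elif b == 0:
            merged.append(a)   # FAT1 has information that FAT2 lacks
        else:
            merged.append(a)   # arbitrary choice; flagged for manual review
            conflicts.append((index, a, b))
    return merged, conflicts
```

Work on copies and use the merged table only for reading files off the volume, never as something to write back over the originals until you are sure.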

Finally, make sure all FAT errors are fixed (or all data has been evacuated) before allowing any file writes to the affected volume!

Sanity-checking directory entries vs. FAT

Both the FAT chain info and the directory entry contain information about the length of the file, and where these mismatch, you have a problem. Either you will have an end-of-file marker appearing in the cluster chain in FAT where a valid "next cluster" address was expected, or you will have further cluster addresses beyond where the directory entry length value predicted an end-of-file marker.
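As an illustration of the mismatch test, the sketch below follows a chain through a FAT16 entry list and compares its length with what the directory entry's size implies. The names and the returned strings are made up; FFF8h and above are taken as end-of-file, and FFF7h as the bad-cluster marker.

```python
def follow_chain(entries, start, eoc_min=0xFFF8, special_min=0xFFF7):
    """Return (clusters, clean_end) for the chain beginning at `start`."""
    chain, seen = [], set()
    cluster = start
    limit = min(len(entries), special_min)
    while 2 <= cluster < limit and cluster not in seen:
        chain.append(cluster)
        seen.add(cluster)
        nxt = entries[cluster]
        if nxt >= eoc_min:
            return chain, True   # reached an end-of-file marker
        cluster = nxt
    return chain, False          # hit a zero, a bad value or a loop

def compare_length(entries, start, size, cluster_size):
    """Compare FAT chain length with the size in the directory entry."""
    if size == 0 and start == 0:
        return "match"           # empty file, no clusters allocated
    expected = -(-size // cluster_size)  # clusters needed, rounded up
    chain, clean_end = follow_chain(entries, start)
    if not clean_end:
        return "broken chain"
    if len(chain) < expected:
        return "chain ends early"
    if len(chain) > expected:
        return "chain runs long"
    return "match"
```

"chain ends early" is the case that aborts a copy; "chain runs long" merely leaves extra clusters hanging off the end.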

Of the two, the latter is preferable as this will not cause the copy operation to fail. This is the logic behind the use of a "flat-FAT", where the entire volume is treated as one single contiguous cluster chain (thus cross-linking all data on the volume). This is a brute-force way of inventing the FAT from scratch, and will fail wherever a file or directory contains more than one cluster and is fragmented. Volumes set up as "flat-FAT" are sane for reads, but not for writes.
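For what it's worth, a flat FAT is trivial to generate. The sketch below builds one as a list of FAT16 entries; the F8h media byte in the first reserved entry and the end-of-file marker on the very last cluster are assumptions of this example, and cluster_count is whatever your volume geometry says it should be.

```python
def flat_fat16(cluster_count, media=0xF8):
    """Build FAT16 entries linking every data cluster into one chain."""
    if cluster_count < 1:
        raise ValueError("need at least one data cluster")
    entries = [0xFF00 | media, 0xFFFF]     # the two reserved entries
    last = cluster_count + 1               # clusters are numbered 2..last
    entries.extend(range(3, last + 1))     # cluster n points to n + 1
    entries.append(0xFFFF)                 # end-of-file on the last cluster
    return entries
```

With such a table in place, a file is read correctly as long as its clusters really are contiguous, since the copy simply stops when the directory entry's length is reached.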

An alternative to the "flat-FAT" approach is the use of UnFormat; this rebuilds the FAT using similar assumptions about no fragmentation, but with the refinement that end-of-file markers are appropriately inserted. But there's a serious danger with Microsoft's UnFormat; if your cluster alignment is wrong, UnFormat will "fix" the . pointers to use the cluster addresses in effect. This is exactly the opposite of what you generally want - you'd rather believe the . pointers and fix the alignment so that their prophecy is fulfilled - and not only destroys the only clue you have to find the correct alignment, but invalidates all other pointer cluster addresses as well. Very, very nasty :-(

 

(C) Chris Quirke, all rights reserved - November 2002
