Eldritch filesystem horrors
Hopefully you will never see this one, but with Halloween upon us I thought I’d share a file system horror story from the depths of time. I’ve changed some of the names, but this is all true.
We have some processes that do a bit of metadata cleaning on datasets, and for a long time they ran on some old Ubuntu boxes (imagine cobwebs in a dark corner of the server room with a flickering light, if you like).
One quiet afternoon, a process fell over. When it was noticed and reported a little later I had a look at the logs. There was a mismatch in the number of files that had been copied across for processing, one of the sanity checks the process runs. This was rare, but not strange: they are big datasets and space might run out occasionally. I made sure there was enough free space and restarted it, but, when I checked back later, it had failed again. Digging deeper over the next few hours would lead me to a place beyond sanity checking…
The process in question copies a lot of files to local storage, then renames them and alters some metadata. I launched it manually and watched to see what was stopping the copy process; there were 14972 files to copy. I checked the output.
There were 14973 files.
I looked at the listing.
One file was listed twice.
me@box1:~$ ls /tmp/sis_3575/|sort |uniq -c|grep -v " 1 "
2 20240213113627471.dcm
Was this some bug where a hidden character had crept into a filename? I tried find:
me@box1:~$ find /tmp/sis_3575 -type f -name 20240213113627471.dcm
/tmp/sis_3575/20240213113627471.dcm
/tmp/sis_3575/20240213113627471.dcm
Maybe find was somehow matching two different names? Surely these must be different entries. The inode number would tell me:
me@box1:~$ ls /tmp/sis_3575/ -i|grep 20240213113627471
6812999 20240213113627471.dcm
6812999 20240213113627471.dcm
I stared in horror at the same inode and name appearing twice in a directory. Can they really be the same name?
me@box1:~ $ ls /tmp/sis_3575/|grep 20240213113627471.dcm|hexdump -C
00000000 32 30 32 34 30 32 31 33 31 31 33 36 32 37 34 37 |2024021311362747|
00000010 31 2e 64 63 6d 0a 32 30 32 34 30 32 31 33 31 31 |1.dcm.2024021311|
00000020 33 36 32 37 34 37 31 2e 64 63 6d 0a |3627471.dcm.|
0000002c
Was there some race condition in the loop that was renaming the files? I pregenerated the list of mv commands and ran them; the result was the same. It must be a problem with this file! I ran the list of renames as far as the 20240213113627471.dcm file, and everything was good. Then I ran the 20240213113627471.dcm rename.
me@box1:~$ mv /tmp/sis_3575/20240213113727481.MR /tmp/sis_3575/20240213113727481.dcm
me@box1:~$ ls /tmp/sis_3575/|sort |uniq -c|grep -v " 1 "
There were no duplicate entries.
me@box1:~$ ls /tmp/sis_3575|wc -l
14972
And the right number of files. What was happening? There was only one thing to do.
me@box1:~ $ cat /tmp/sis_3575.cmds|while read L ; do $L ; ls /tmp/sis_3575|wc -l|tee -a /tmp/sis_3575.cmds.wccounts ; done
Move each file one by one, and count them every time. At first everything was fine but then I watched with growing horror:
me@box1:~ $ uniq -c /tmp/sis_3575.cmds.wccounts
7528 14972
25 14973
1 14972
4 14973
7 14972
…
It carried on like this, sometimes there were runs of hundreds or even thousands with the right number, but always the horrible 14973 would reappear.
This could only be one thing: filesystem corruption. I forced an fsck on box1 and set the process running on a newer system.
A few days later I returned to box1. After looking at the e2fsck manual I’d found a weapon with which I hoped to kill the monster once and for all:
“The -D option will detect directory entries with duplicate names in a single directory, which e2fsck normally does not enforce for performance reasons.”
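In practice (and this is a sketch from memory, with a placeholder device name) that means making sure the filesystem is unmounted, or booting from rescue media, and running something like:
# the filesystem must not be mounted; /dev/sdXXX is a placeholder
$ e2fsck -f -D /dev/sdXXX
The -f forces a check even when the filesystem looks clean, and -D rebuilds the directory indexes, detecting duplicate names along the way.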
With box1 fscked and restarted, I tried again on the test data, but fsck had failed me. Once again, as the files were copied, the 14973rd file would suddenly appear.
There was worse news: 20240213113627471.dcm was still the file that appeared twice.
But there are weapons even more fearsome than fsck. In desperation I opened the filesystem in debugfs and looked at the directory. There were certainly other deleted entries, but only one 20240213113627471.dcm. If this wasn’t filesystem corruption then what was it?
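For anyone who wants to poke at a directory the same way, the session looks roughly like this (device and path are placeholders; debugfs opens the filesystem read-only unless you pass -w):
$ debugfs /dev/sdXXX
debugfs: ls -ld /tmp/sis_3575
debugfs: quit
The -d flag asks debugfs's ls to list deleted directory entries alongside the live ones.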
Box1 had a companion: box2 was installed at the same time. I ran my test on box2, but now, where there should have been 14972 files, there were only 1742. The filesystem on box2 had become read-only; was it trying to protect itself from the cursed 20240213113627471.dcm?
No. The disk had failed. Box1 had been sheltering box2 from jobs for some time, and now, with its defender stricken, the onslaught of data had killed box2 instantly.
I had pled with Google for answers before, but did so more fervently now. One story told of accessing a Samba share from a Mac, and of name-mapping mischief. But this was a local ext3 filesystem; there was no such mapping.
Then, finally, I found a crumpled parchment in the depths of the ext3-users list from 2008. https://listman.redhat.com/archives/ext3-users/2008-August/002272.html
The essence of it was that readdir() (the interface used to list directories) keeps track of its position in an ext3 hash-indexed directory using a hash of the filenames. There was both a kernel bug that didn't handle hash collisions correctly and, for a while, a default hash function that generated large numbers of collisions when filenames were similar. In such cases, tools like ls or find, which rely on readdir(), could report duplicate entries when listing directories containing large numbers of similarly named files.
Or, as the scroll before me read:
“It appears there's a kernel bug hiding here, since there shouldn't be duplicates returned by readdir() even if we have hash collisions.
It turns out though that the TEA hash we are currently using as the default is a really sucky hash.”
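That, incidentally, is why the phantom entry showed up in both ls and find: they walk the directory through the same kernel interface. On a modern box you can watch them do it with strace (exact syscall names vary a little by architecture and kernel version):
$ strace -e trace=getdents,getdents64 ls /tmp/sis_3575 > /dev/null
$ strace -e trace=getdents,getdents64 find /tmp/sis_3575 > /dev/null
Both traces are full of getdents calls — the system call behind readdir() — so a bug at that level haunts every tool that lists the directory.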
And finally, the magic to banish the 14973rd file: use debugfs to change the directory hash function on the filesystem:
$ debugfs -w /dev/sdXXX
debugfs: set_super_value def_hash_version half_md4
debugfs: quit
# Remove affected directories and reboot.
This changes the default hash for directories to half_md4, reducing the risk of collisions. The hash is recorded per directory when its index is created, which is why the affected directories have to be recreated afterwards.
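If you want to check what a filesystem is using before reaching for debugfs, dumpe2fs will tell you (device name is a placeholder again, and the output shown is roughly what to expect):
$ dumpe2fs -h /dev/sdXXX | grep -i hash
Default directory hash:   half_md4
Directory Hash Seed:      xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Any reasonably recent ext3 or ext4 filesystem should already show half_md4 here.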
The hash collision handling was seemingly fixed around kernel version 2.6.27-rc3, back in 2008, and so you can hope never to see this again. But the next time you ls or find, take a moment to think about box2 and the 14973rd file.
Sleep well.