
Music file corruption without a change in checksum



I recently found out the hard way that a music file can become corrupted without the md5sum or sha1 checksum changing.

 

This is because the checksum is a signature of the bit content of the data fork, whereas various extended attributes can (and do) change all the time in the resource fork that accompanies the file on filesystems such as Apple's HFS+.

 

If the resource fork metadata (not to be confused with music file tag metadata, which belongs to the data fork of the file) becomes corrupted, it can prevent copying or playing the music file. (So in this extreme sense, SandyK is right that files with identical checksums can sound different, but it is important to keep in mind that the difference in this case is due to file system corruption.)

 

This raises a problem: Typically we check (if at all) for file corruption based on the checksum of the data fork. Does anyone do this for resource forks? Does zfs or btrfs self-healing take this into account? Or are resource forks subject to bitrot that can go undetected?
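
For what it's worth, the mismatch is easy to see on a Mac: writing an extended attribute does not touch the data-fork checksum. A minimal sketch (the file and attribute names are just examples):

   md5 track.flac                                    # note the digest
   xattr -w com.example.note "anything" track.flac   # write an extended attribute
   md5 track.flac                                    # digest is unchanged
   ls -l@ track.flac                                 # -@ lists the attributes that did change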

Link to comment

The audio data should be entirely contained in the data fork. Otherwise copying the file to a non-mac system or over a generic transport such as the Internet would not be possible. A corrupted resource fork could result in loss of metadata or access control lists, but should never affect the sound reproduction if the file is at all readable, as it must be if data checksums still match. Moreover, FLAC files include an md5sum of the decompressed data, so they are easy to verify. Finally, CDs do not have a resource fork, so sandyk is still full of shit.
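
For example, assuming the reference flac tools are installed (file name is just an example):

   flac -t track.flac                 # decodes in test mode and checks against the embedded MD5
   metaflac --show-md5sum track.flac  # prints the MD5 stored for the raw audio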

Link to comment

I certainly agree with your last point, but having what might be a failing or at least corrupted disk, I have had about 4 separate instances in which the resource fork has become corrupted but the data fork is untouched (at least as far as the checksums reveal). Nevertheless, I was unable to copy the file or play it. (One thing I did not try, which I wish I had thought of before I overwrote the files with uncorrupted versions, was to delete all the extended attributes in the resource fork with xattr -c.) In any case, this is a genuine problem, at least on HFS+. I understand all the audio data is encapsulated in the data fork of the file, but this is little comfort if you cannot access it due to a corrupted resource fork.

 

What I would like to know is if such a problem can arise on zfs or btrfs, and if so, if there is a way of resolving it.

Link to comment
I certainly agree with your last point, but having what might be a failing or at least corrupted disk, I have had about 4 separate instances in which the resource fork has become corrupted but the data fork is untouched (at least as far as the checksums reveal). Nevertheless, I was unable to copy the file or play it. (One thing I did not try, which I wish I had thought of before I overwrote the files with uncorrupted versions, was to delete all the extended attributes in the resource fork with xattr -c.) In any case, this is a genuine problem, at least on HFS+

 

Unable to play using what software? iTunes? If you can calculate the data checksum, you must be able to read the data, but I can imagine some software refusing to play a file with a corrupt resource fork. I would not believe you if you said it played but had a collapsed soundstage or some other such vagary.

 

Personally, I run Linux which doesn't have any resource fork nonsense.

Link to comment
I recently found out the hard way that a music file can become corrupted without the md5sum or sha1 checksum changing.

 

This is because the checksum is a signature of the bit content of the data fork, whereas various extended attributes can (and do) change all the time in the resource fork that accompanies the file on filesystems such as Apple's HFS+.

 

If the resource fork metadata (not to be confused with music file tag metadata, which belongs to the data fork of the file) becomes corrupted, it can prevent copying or playing the music file. (So in this extreme sense, SandyK is right that files with identical checksums can sound different, but it is important to keep in mind that the difference in this case is due to file system corruption.)

 

This raises a problem: Typically we check (if at all) for file corruption based on the checksum of the data fork. Does anyone do this for resource forks? Does zfs or btrfs self-healing take this into account? Or are resource forks subject to bitrot that can go undetected?

 

I think you're onto a reason why the "bits are bits" crowd are not accounting for everything...

Link to comment
Unable to play using what software? iTunes?

 

iTunes, Audirvana, afplay (Apple's command-line player), Finder.

 

If you can calculate the data checksum, you must be able to read the data,

 

... in the data fork.

 

but I can imagine some software refusing to play a file with a corrupt resource fork. I would not believe you if you said it played but had a collapsed soundstage or some other such vagary.

 

Personally, I run Linux which doesn't have any resource fork nonsense.

 

I'm pretty sure zfs does. Also, anything with extended attributes is doing something along these lines, with the key being that you can change a value in an extended attribute without changing the checksum of the file itself.

 

Edit: According to the second table on this Wikipedia page, https://en.wikipedia.org/wiki/Comparison_of_file_systems , ext2, ext3, and ext4 have this nonsense as well.
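
The same demonstration works there too, for what it's worth; a rough sketch assuming the attr tools and a filesystem mounted with xattr support (file name made up):

   md5sum track.flac
   setfattr -n user.note -v "anything" track.flac   # set a user extended attribute
   getfattr -d track.flac                           # dump the attributes to confirm
   md5sum track.flac                                # digest is unchanged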

Link to comment

 

Well, that is really weird. As best I remember, in the case of Word documents on a PC that linked to resources such as images, the resources were just stashed into a subfolder, zipped or otherwise, so the whole thing could be copied up to the Web etc. and the link addresses would be relative to the main document. Or in some cases I suppose the resources could be embedded into the document. But I've never heard of those resources being anything except ordinary linked content on a PC, and why anyone would want to create a special O/S feature for resources still escapes me. Performance?

Link to comment

This is all over my head. What I want to know is how can I create a backup system that prevents this from happening?

 

Right now I have my music library in 3 locations.

 

on a Drobo connected to an iMac

 

in my Aurender X100

 

and on a fire and water proof Synology NAS

No electron left behind.

Link to comment
why anyone would want to create a special O/S feature for resources still escapes me. Performance?

 

Implement things like access control lists, extended attributes, and all sorts of other stuff. Recently I have been storing the audio checksum for ALAC files in the associated resource fork as a way to check for file corruption of the data fork once a week. (Yes, the irony is not lost on me that I might be creating a bigger problem than I am trying to prevent.)
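
Roughly along these lines, sketched with a made-up attribute name and file (this version hashes the whole data fork rather than the decoded audio, which would need an extra decode step):

   xattr -w com.example.md5 "$(md5 -q track.m4a)" track.m4a   # store the checksum as an extended attribute
   # later, e.g. from a weekly cron job:
   [ "$(xattr -p com.example.md5 track.m4a)" = "$(md5 -q track.m4a)" ] || echo "track.m4a: checksum mismatch"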

 

More here: Mac OS X 10.4 Tiger | Ars Technica

Link to comment
iTunes, Audirvana, afplay (Apple's command-line player), Finder.

 

If you can calculate the data checksum, you must be able to read the data,

... in the data fork.

 

You said yourself that all the audio data is in the data fork. If the data fork can be read by any means, the audio data can be retrieved and played. If some application refuses to play a file with a damaged resource fork, that's its problem. Since you are able to read the data fork (otherwise you couldn't validate the checksum), you can obviously write it out to a clean file which should be accepted by iTunes just like any downloaded file.
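
For example (hypothetical file names; plain cat only ever reads the data fork, so the copy starts life with no resource fork at all):

   cat damaged.m4a > clean.m4a    # copy just the data fork
   md5 damaged.m4a clean.m4a      # confirm the two data forks are identical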

 

I'm pretty sure zfs does. Also, anything with extended attributes is doing something along these lines, with the key being that you can change a value in an extended attribute without changing the checksum of the file itself.

 

Edit: According to the second table on this Wikipedia page, https://en.wikipedia.org/wiki/Comparison_of_file_systems , ext2, ext3, and ext4 have this nonsense as well.

 

Linux has extended attributes and ACLs, yes. That's not quite the same as the rather arbitrary forks supported by HFS+ and NTFS (NTFS even allows multiple data forks). More importantly, neither the OS nor typical applications demand the presence of extended attributes on a music file to work normally.

 

I get your point that it's possible to change something related to a file without affecting the data checksum. My counter-point is that doing so doesn't alter the validity of that data. If the checksum matches, you almost certainly have the same data as when you first created it.

Link to comment

It isn't exactly cheap to do, but it is possible. You just have to evaluate the cost of each ability against your budget and comfort levels.

 

First, active systems should not have your "master" data, just copies of it. That applies equally whether they access the music files from a server or as a local disk.

 

Second, have a high availability level in place to avoid system- or externally-caused data corruption. This could be RAID 5 or better storage just for your master files, or both higher-level RAID storage for your master files and RAID 1 storage on your active playback systems. In part this depends upon how you intend to make changes to your master files, such as playback counts and so forth. Those are not, for example, important to me, so it doesn't matter to me if I lose them on "active" systems. I also do not hold that data (i.e. iTunes libraries, JRMC libraries, Sonos libraries, etc.) on the master files. :)

 

Third, have a rigorous and automatic backup system that can handle multiple versions of data files. Deduplication and compression are, without doubt, your friends here. But they are, or at least can be, expensive friends. A Virtual Tape Library (VTL), combined with appropriate software, is often ideal here. Even a "virtual" Virtual Tape Library, which is available for low or no cost as a software-only implementation, is going to cost a significant amount: if not for the software, then for a machine (virtual or physical) to run it on and for the cost of storage.

 

Fourth, have a way to send a copy of the active data offsite on a regular basis. (Weekly works for me, meaning I am willing to lose up to one week of new files and changes...) Also, you need to have multiple copies of that offsite backup of the active data. You can do slightly better by having an offsite copy of all the data, active and inactive versioned data. One easy way to do this is to have the VTLs replicate to each other over the Internet. This replication can be automatic or manual, so long as it is performed periodically.
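
A much simpler, low-budget stand-in for that replication step is a weekly rsync over ssh; a sketch with made-up paths and host (note it keeps no versions, so it is not a substitute for the VTL approach):

   # e.g. run from a weekly cron entry
   rsync -a /srv/music/ backup@offsite.example.com:/srv/music/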

 

Fifth, it is always a good idea to have the data - at least the active data - periodically copied to an archival format. Tape, USB hard drives, a second NAS, Blu-ray data discs, etc. Keep several versions of this as well.

 

Even with all that, you still run the risk of some data loss, but it is greatly - massively - reduced.

 

But as you can see, it all depends upon how much investment you are willing to tie up in data protection.

 

For a real world example, I have three copies of my master music and video library, one of which is kept offsite and updated automatically each evening. The data has up to 90 versions of each file on it. Most of the data files only have 1, 2, or 3 versions. But the system is configured to support up to 90.

 

I take yearly copies of that data, including the versions, and keep around 10 years of those copies.

 

I don't want to lose any data. I have had to totally restore music libraries more than once, when experiments went pear-shaped. That can be annoying the first time it happens. :)

 

The downside is I pay for a lot of storage, so much so that online services - even very economical online services like tarsnap - are prohibitively expensive. I love the Akitio external drive enclosures these days.

 

-Paul

 

This is all over my head. What I want to know is how can I create a backup system that prevents this from happening?

 

Right now I have my music library in 3 locations.

 

on a Drobo connected to an iMac

 

in my Aurender X100

 

and on a fire and water proof Synology NAS

Anyone who considers protocol unimportant has never dealt with a cat DAC.

Robert A. Heinlein

Link to comment
No.

 

Unless you are suggesting audio information gets encoded in the resource fork. It does not.

 

Apologies for the OT, but perhaps this shouldn't be dismissed out of hand. If the audio *information* is the same but the player *application* can be made to behave differently (short of not playing back the audio file at all), then perhaps there's a possibility of a different audible outcome. Or perhaps not. Anyway, on with discussion of your more immediate problem....

One never knows, do one? - Fats Waller

The fairest thing we can experience is the mysterious. It is the fundamental emotion which stands at the cradle of true art and true science. - Einstein

Computer, Audirvana -> optical Ethernet to Fitlet3 -> Fibbr Alpha Optical USB -> iFi NEO iDSD DAC -> Apollon Audio 1ET400A Mini (Purifi based) -> Vandersteen 3A Signature.

Link to comment
Implement things like access control lists, extended attributes, and all sorts of other stuff. Recently I have been storing the audio checksum for ALAC files in the associated resource fork as a way to check for file corruption of the data fork once a week. (Yes, the irony is not lost on me that I might be creating a bigger problem than I am trying to prevent.) More here: Mac OS X 10.4 Tiger | Ars Technica

 

Ever since I first used a PC, I have backed up data onto at least 3 media in case of failure. Starting around 1988, I wrote my own 'Diff' utility to compare the contents of a given folder, then used batch files to validate multiple folders automatically. I never, ever used any auto-backup programs - instead I wrote an enhanced version of the classic 'Dirmatch' utility, which, when used in batch files, allows quick visual and manual updating of folders to backup media. The point is to never, ever write over an existing backed-up file with what might possibly be a corrupted copy from a current-use drive.
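
A rough modern equivalent of that compare-before-copy step, sketched with rsync and made-up paths (the -n dry run reports differences by checksum without touching the backup):

   rsync -rcn --itemize-changes /Music/ /Volumes/Backup/Music/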

 

The second principle I've adhered to over time is to keep all important data in a more-or-less generic format - a popular and heavily-used format that can easily be converted to an even more generic format if need be, even if I've not accessed the data for years. I've watched several major software projects get painted into a corner because of developers who were anxious to use new or advanced features to pad their resumes.

 

But in all that time I've learned to ignore 'metadata' or other such non-portable things, and when building my website for photos (for example), the thumbnails are just 200x150 etc. copies of the main photo, which I post visibly and link to the large photo.

 

When it comes to the music collection, I presume that any 'metadata' or other associated data for my music tracks are embedded in the tracks themselves, since I've moved and copied my music many times without accommodating any external files. I keep all of my computer settings such that all hidden and system files are always visible, especially to DOS, which counts the files and filesizes precisely.

Link to comment
ZFS handles everything :) jk. Actually, the resource forks are mapped by SMB to parallel files with the "._" prefix - which sort of works, kinda - and so these will also get checksummed.

 

OK, that is what I was interested in, in part. The ._foo file that accompanies foo is the resource fork stripped off. For music files, these are in general irrelevant garbage. (Yes, I am currently making some use of them, but frankly would rather be rid of them if it meant less chance of corruption).

 

 

But where does zfs itself encode the checksums it uses for comparison? (Assume for simplicity I don't want to back up ._foo files, but instead discard them.)
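
Whatever the answer, a scrub makes zfs re-verify them pool-wide; a minimal sketch with a made-up pool name:

   zpool scrub tank       # re-reads every block and verifies its checksum
   zpool status -v tank   # reports any checksum errors found or repaired
   zfs get checksum tank  # shows which checksum algorithm the dataset uses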

 

If the data fork can be read by any means, the audio data can be retrieved and played. If some application refuses to play a file with a damaged resource fork, that's its problem. Since you are able to read the data fork (otherwise you couldn't validate the checksum), you can obviously write it out to a clean file which should be accepted by iTunes just like any downloaded file.

 

I got rid of all my damaged resource fork files before I could test this, but I am hoping I can recover usability either by deleting all extended attributes (xattr -c foo) or by copying the file while stripping off the resource fork (ditto --norsrc foo bar). Edit: the man page suggests cp -X foo bar does the same.
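
Spelled out, the commands I have in mind are the following (untested against an actually corrupted fork; file names are made up):

   xattr -c track.m4a                   # clear all extended attributes in place
   ditto --norsrc track.m4a clean.m4a   # copy, leaving the resource fork behind
   cp -X track.m4a clean.m4a            # per the man page, also skips xattrs and the resource fork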

 

The problem on OS X isn't that the player applications themselves are stumbling on a corrupted resource fork, but rather that the resource fork must by necessity be processed by the system in order to update extended attributes like the file's last-opened time and stuff like that. In other words, although the presence of a resource fork is not required, if one is present, it is read.

 

In nonspecialist terms, I think this is a major cause underlying the file system fragility people complain about in OS X.

Link to comment
Apologies for the OT, but perhaps this shouldn't be dismissed out of hand. If the audio *information* is the same but the player *application* can be made to behave differently (short of not playing back the audio file at all), then perhaps there's a possibility of a different audible outcome. Or perhaps not. Anyway, on with discussion of your more immediate problem....

 

It is an easily tested hypothesis: see if the alleged sonic differences go away when you strip off the resource fork as described above. (This, by the way, would be best done as a double-blind test.)

Link to comment
Apologies for the OT, but perhaps this shouldn't be dismissed out of hand. If the audio *information* is the same but the player *application* can be made to behave differently (short of not playing back the audio file at all), then perhaps there's a possibility of a different audible outcome. Or perhaps not. Anyway, on with discussion of your more immediate problem....

 

A hypothetical player application can react in any way to anything. It might refuse to play files with an X in the filename, or it might distort the audio on Tuesdays. Let's not go there.

Link to comment

I wrote this in another conversation a few minutes ago:

 

One of the (many) things that irritate me about the Mac is how it adds lots of '.' files to my thumbdrive that transfers between my PC and Macbook. Every time I transfer files to the Mac, I plug the thumbdrive back into the PC and remove all of those files, so they're irrelevant. I hate doing the extra maintenance, but if I were to dig into the File Security settings and remove that feature and the constant "logging" that Mac and PC both do, then the O/S's would likely do other nasty things like invite virii into the system. So I just keep everything clean, quietly, and don't let the computer know anything it doesn't have to know to function reliably.
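
(On the Mac side, I gather the bundled dot_clean utility can merge or remove those ._ AppleDouble files in one pass; a sketch with a made-up volume name:)

   dot_clean -m /Volumes/THUMBDRIVE   # -m deletes the leftover ._ files outright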

Link to comment
