
My First NAS: Newbie Q&A on Hashing, Data & RAID Scrubbing and Check Summing for Backups


Recommended Posts

In pursuit of building (or rather, having my local IT guy build) my first NAS, I’ve sunk my newbie brain as deep as it will go into learning how best to use it once my builder finishes the hardware and OS installs and walks me through the GUI.

 

Of course, beyond basic storage capacity and drive redundancy to prevent user file loss, a NAS (or any server) and its file system (ZFS or Btrfs) are only as useful as their ability to prevent data corruption. Aside from the crazy maths (and terms like “pool”, which seems to have multiple meanings in the data storage biz), these articles were helpful for learning about hash functions and the tables of hash codes (“hashes”) they apparently create for each document, photo, audio or video file: https://en.wikipedia.org/wiki/Hash_function and https://en.wikipedia.org/wiki/Checksum

 

But please help me with these questions:

 

Is a hash code automatically created for every user file (e.g., document, photo, audio, video) the first time it gets written to the NAS? Or do you have to use some kind of app or NAS utility and enable it to generate and assign a hash code to every one of your files?

 

And where are those codes stored? Inside of the file’s own container? Or are all user file hash codes stored someplace else? In a “hash table” and/or on a drive partition on the RAID drive array?

 

Are these hash codes used by the ZFS and Btrfs file systems for routine data scrubbing?

https://blog.synology.com/how-data-scrubbing-protects-against-data-corruption

https://www.qnap.com/en/how-to/tutorial/article/how-to-prevent-silent-data-corruption-by-using-data-scrubbing-schedule

https://par.nsf.gov/servlets/purl/10100012

 

Then, as mentioned in the above links, following data scrubbing, are these hash codes also usually used for routine RAID scrubbing?

 

But for both data and RAID scrubbing, is data integrity ensured by comparing each file’s original hash code (created the first time the file was written, and stored wherever) against a hash code freshly computed from the file as it exists now? If the comparison shows that the codes differ, then one or more of the file’s bits have flipped, so the system knows the file is corrupt?
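If I’ve understood that right, the basic check would amount to something like this Python sketch (the file path and the stored-hash table are made-up illustrations; real ZFS/Btrfs keep per-block checksums in filesystem metadata, not an app-level table like this):

import hashlib

def file_hash(path: str, algo: str = "sha256") -> str:
    """Recompute a file's hash by reading it in 1 MiB chunks."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical table of hashes recorded when each file was first written.
stored_hashes = {
    "/volume1/music/track01.flac": "ab12cd34...",  # made-up path and digest
}

def scrub_check(path: str) -> bool:
    """Scrub-style check: does the file still match its originally recorded hash?"""
    return file_hash(path) == stored_hashes[path]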

 

If yes, then at that point will it flag me and ask whether I want the system to attempt a repair?

 

If I say yes, will it then try to overwrite the corrupt file with a good copy rebuilt from the array’s redundancy (e.g., a mirror copy, or parity on RAID 5)?

 

CAUTION: As RAID scrubbing puts mechanical stress and heat on HDDs, the rule of thumb seems to be to schedule it once monthly, and only when the drives are otherwise idle, so that no user-triggered read/write errors can occur. https://arstechnica.com/civis/threads/probably-a-dumb-question-raid-data-scrubbing-bad-for-disks-if-done-too-frequently.1413781/

 

 

Beyond scrubbing, what else can I and/or ZFS or Btrfs do to both prevent and repair bit rot?

 

And to minimize the risk of crashes:

Replace the RAIDed HDD array every 3 (consumer) to 5 (enterprise grade) years.

 

Do not install any software upgrade for the NAS until it has been around long enough for the NAS brand and the user community forum to declare it bug free.

 

What else can I do to minimize the risk of crashes?

 

Finally, when backing up from my (main) NAS to an (ideally identical??) NAS, Kunzite says here “…and I'm check summing my backups...”

https://forum.qnap.com/viewtopic.php?t=168535

 

But hash functions are never perfect, so, while rare, hash “collisions” are inevitable. https://en.wikipedia.org/wiki/Hash_collision And since those hash algorithms are used for data and RAID scrubbing, they are evidently also used for checksumming, to ensure that data transfers from the NAS to a backup device happen without file corruption.

 

Apparently, CRC-32 is among the least collision-proof hash algorithms. https://en.wikipedia.org/wiki/Hash_collision#CRC-32

 

Thus, for backups from the main NAS to the backup NAS, how much more worthwhile than MD5 is the SHA-256 hash function (algorithm) for preventing collisions and verifying the integrity of user files via checksumming, given that it uses twice the number of bits (256 vs. 128)?
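For context (and as a sanity check on my own question), both algorithms are used the same way for checksumming; the headline difference is digest length, which governs how unlikely an accidental collision is. A quick Python illustration (the sample bytes are arbitrary):

import hashlib

data = b"any chunk of audio or video data"  # arbitrary sample bytes

md5 = hashlib.md5(data)
sha256 = hashlib.sha256(data)

print(len(md5.digest()) * 8)     # 128 bits: roughly 2^64 files before ~50% odds of any collision
print(len(sha256.digest()) * 8)  # 256 bits: roughly 2^128 files before the same odds

My newbie understanding is that for catching accidental bit flips during a copy, either algorithm will flag a corrupted file; the extra bits mainly matter against deliberate collision attacks, which MD5 is known to be vulnerable to.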

 

But if it’s not much more advantageous even for potentially large audio files https://www.hdtracks.com/ , would SHA-256 be a lot more so than MD5 for checksumming during backups of DVD movie rips saved to uncompressed MKV and/or ISO files, given that video files are so much bigger than audio files?

 

And what would be a recommended checksum calculator app? https://www.lifewire.com/what-does-checksum-mean-2625825#toc-checksum-calculators

 

But if the app reports a checksum mismatch between the file on my main NAS and the copy to be updated on my backup NAS, how then do I repair the corrupt file?

 

Again, by using the file’s original hash code (stored someplace) created the first time it was ever written to the NAS?

 

If yes, would that app then prompt me to choose to have the system repair the file?

 

 

 

 

 


It appears hash codes are created for all user files stored on a Synology NAS, else this third-party duplicate file finder wouldn’t work. By definition a hash code changes when a file changes, else the file would be a duplicate of any other file with the same hash code. This is all background activity for the NAS file system. People worry more about the error detection / mirroring solution for recovery when an HDD cluster goes bad.

 

https://www.cisdem.com/resource/synology-duplicate-file-finder.html
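I don’t know how Cisdem implements it, but the general idea of a duplicate finder is just to group files by content hash; a rough Python sketch (the folder path is hypothetical):

import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Group files under root by content hash; any group with more than one path is a set of duplicates."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    h.update(chunk)
            groups[h.hexdigest()].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}

# Hypothetical usage on a locally mounted share: find_duplicates("/Volumes/music")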

 

Regards,

Dave

 

Audio system

On 9/2/2023 at 4:58 AM, nxrm said:

Is a hash code automatically created for every user file (e.g., document, photo, audio, video) the first time it gets written to the NAS? Or do you have to use some kind of app or NAS utility and enable it to generate and assign a hash code to every one of your files?

 

And where are those codes stored? Inside of the file’s own container? Or are all user file hash codes stored someplace else? In a “hash table” and/or on a drive partition on the RAID drive array?

 

Typically a hash is calculated for every cluster (a 4096-byte chunk of file data) that is written or overwritten.
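Conceptually (simplified; not the exact on-disk format or checksum algorithm of ZFS or Btrfs), per-cluster hashing is something like this Python sketch:

import hashlib

CLUSTER_SIZE = 4096  # one cluster = 4096-byte chunk of file data

def cluster_hashes(path):
    """Hash every 4096-byte cluster of a file. The filesystem keeps these hashes
    in its metadata area, separate from the file content itself."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            cluster = f.read(CLUSTER_SIZE)
            if not cluster:
                break
            hashes.append(hashlib.sha256(cluster).hexdigest())
    return hashes

# On scrub, each cluster is re-read, re-hashed, and compared with the stored
# metadata hash; a mismatch means that cluster is corrupt and must be repaired
# from redundancy (mirror or parity).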

 

The hash is stored in a metadata area of the filesystem, separate from the file content store; it is not appended to the end of the file or cluster. (There is an exception: on some expensive storage systems used at banks, such as NetApp or 3PAR, the HDD sector size is extended from 512 bytes to 528 bytes and the sector hash is stored in the extra 16-byte area.) This hash information is used during scrubs, and also for data deduplication (an optional feature). This is slightly off-topic, but data deduplication is used in cloud backup, and providers charge end users based on the “storage size before dedup” to make a profit. I created PowerPoint slides to explain file metadata, cluster hashing and data dedup in the following thread:

 

 

Sunday programmer since 1985

Developer of PlayPcmWin

