B1tF1ghter: You should never use md5 for anything important.
It's utterly worthless. Proven broken: practical collision attacks exist, so it doesn't actually protect from collisions.
There's a reason it calculates so fast: the algorithm is computationally cheap. (It does read and hash every byte of the file. If a run finishes faster than your storage device could read the entire file range, that is usually the OS file cache serving the data, not md5 skipping parts of it.) Speed is not evidence of strength.
It's a worthless alg for ANY backups.
It's in the process of being replaced in any place that has sane procedures and actual standards.
Unfortunately it's still perceived as "valid" by many in the IT world, as well as by a whole ocean of uninformed private people.
You should use something like sha256. Or sha512.
Or use both: this way you reduce the risk of a collision or rogue change to an absolute minimum. At the moment it would be near impossible to make 2 different algs (both high complexity and not proven broken) collide on the same files at once.
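And if you cannot spare the time, then use sha1 at the VERY LEAST (practical collisions have been demonstrated for it as well, so treat it as a bare minimum and nothing more).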
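To illustrate the "use both" suggestion above, here is a minimal sketch in Python, assuming the standard `hashlib` module. The function name `dual_digest` and the 1 MiB chunk size are illustrative choices, not something from the post. It reads the file once and feeds every chunk to both algorithms.

```python
import hashlib

def dual_digest(path, chunk_size=1024 * 1024):
    """Return (sha256_hex, sha512_hex) for the file at `path`, reading it once."""
    sha256 = hashlib.sha256()
    sha512 = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large backups do not have to fit in RAM.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
            sha512.update(chunk)
    return sha256.hexdigest(), sha512.hexdigest()
```

Storing both values next to the backup means a later check only passes if both digests match.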
kohlrak: I'm a bit more aware than you realize. However, good luck getting better than md5 from anyone. I used sha1 for a little thing that pulls naughty images out of spam emails I get (for an experiment), to reduce the frequency of duplicate images. It wasn't perfect: duplicates managed to get through on rare occasions.
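A minimal sketch of that kind of hash-based duplicate filter, for illustration only: the function name `unique_by_digest` is hypothetical, and it uses SHA-256 from `hashlib` rather than the sha1 mentioned above.

```python
import hashlib

def unique_by_digest(blobs):
    """Keep the first occurrence of each distinct byte string, keyed by SHA-256."""
    seen = set()
    unique = []
    for blob in blobs:
        digest = hashlib.sha256(blob).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(blob)
    return unique
```

A filter like this only catches byte-for-byte identical files. The same picture re-encoded, resized, or with a single byte changed gets a different digest and passes through, which is one plausible (unconfirmed) reason duplicates could still appear.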
For cryptography and collision protection... You should beware of taking any particular hash seriously. The very nature of hashing increases chances of collision (as opposed to multiple redundant stores), by virtue of being irreversible. This is fundamental to understanding hashes, and failure to understand it will only lead to disappointment. For storage and integrity, we need triple redundancy at the bare minimum.
B1tF1ghter: We are talking about 2 entirely different issues: you are talking about data retention, while I am talking about verifying the usefulness of the retained data, which is an entirely different problem.
You use Z copies for data retention; better redundancy means higher chances your data is going to be retained in the long term.
Meanwhile you use checksums to verify whether your retained copy Z_number is usable AT ALL (ergo whether it was modified through ANY means, such as corruption, or perhaps malicious actor involvement).
The two aren't competing means to achieve the same thing. They are 2 different matters entirely.
They should be used in conjunction with each other and NOT exclusively.
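As a rough sketch of using the two together: given several copies of a file and a digest recorded when the backup was made, check which copies are still usable. The names `sha256_of` and `intact_copies` are hypothetical, and the choice of SHA-256 is an assumption for the example.

```python
import hashlib

def sha256_of(path, chunk_size=1024 * 1024):
    """Hash the whole file at `path` in chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def intact_copies(copy_paths, expected_sha256):
    """Return the paths whose content still matches the recorded digest."""
    expected = expected_sha256.lower()
    good = []
    for path in copy_paths:
        try:
            if sha256_of(path) == expected:
                good.append(path)
        except OSError:
            # A missing or unreadable copy counts as failed, not as an error.
            continue
    return good
```

Redundancy gives you more than one candidate, and the checksum tells you which candidates can be trusted. If the returned list is empty, no amount of copies helped.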
And IMO I wouldn't call making a backup of exact copies on 3 drives of the same model manufactured in the same batch a "proper" backup by any means.
It's more of a "happy go lucky, wishful thinking based HOPING that they will not start to crap out all at the same time".
Ideally we would have a storage solution that would be impossible to be affected by environmental factors.
A master copy made from a material that is virtually indestructible: not affected by ANY electromagnetic fields or radiation (including visible light, such as sunlight), nor by temperature (something that could easily withstand 3000 degrees C for "reasons" [mostly deliberate attempts at sabotaging it by rogue personnel, but not actually limited to that: say fires, or, to give a more abstract example, maybe you dropped your backup into a volcano]), and sturdy enough to not get altered by even formidable force.
Something like a block made of material-better-than-titanium in which data would be physically engraved *.
With compression obviously, as a raw binary representation would not be very space-efficient.
So a coded representation naturally, with dictionary added somewhere on the device itself (for avoiding "we have the backup but the dictionary was forever lost somewhere in the archives so we cannot read this thing" situations).
A "data block" if you will.
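The "dictionary stored with the data" idea can be sketched in software terms. This is only an analogy for the engraved block, assuming Python's `zlib` preset-dictionary support (`zdict`); the container layout (4-byte length, then dictionary, then compressed payload) and the names `pack_block` and `unpack_block` are invented for the example.

```python
import struct
import zlib

def pack_block(payload, dictionary):
    """Compress `payload` with a preset dictionary and store the dictionary alongside it."""
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    compressed = compressor.compress(payload) + compressor.flush()
    # Layout: 4-byte big-endian dictionary length, the dictionary, the compressed data.
    return struct.pack(">I", len(dictionary)) + dictionary + compressed

def unpack_block(block):
    """Recover the payload using only what is stored inside the block itself."""
    (dict_len,) = struct.unpack(">I", block[:4])
    dictionary = block[4:4 + dict_len]
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(block[4 + dict_len:]) + decompressor.flush()
```

Because the dictionary travels inside the block, a reader needs nothing from an external archive to decode it, at the cost of the space the dictionary itself takes up.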
Unfortunately we do not have such a storage medium right now, but a few people worldwide are trying to cook up something like that.
* this is actually just like optical media (pressed anyway) on a concept level. Just far sturdier and more resilient to environmental factors.
The closest to that we have got so far is M-disc, which is based on what is basically a stone layer (versus the more or less organic and / or "chemically reactive" data layers used in "conventional" pressed optical media), which makes it far less susceptible to all sorts of factors, including higher resistance to temperature.
And in-before someone goes all "but it wasn't proven".
To those people:
shut up and at least make an ATTEMPT at understanding.
Synthetic tests.
Nobody makes realtime tests anymore. Technology used would be obsolete by the time the test would end.
Synthetic tests are designed to simulate a realtime workload while performing a drastically time-reduced test.
They are based on complex scientific calculations.
Almost everything is tested like that nowadays, including things like the longevity of the suspension in a 4x4 car.
Maybe it (M-disc) will not last the advertised ~1000 years. But if the test methodology was proper and the tests were done ok (I suspect they were at least proper-ish [if not better] considering military testing was involved), it should definitely last for at least a few DECADES. Which is considerably longer than what, for example, Verbatim or Taiyo Yuden discs are rated at (the original Taiyo Yuden, not the "chinese ripoff after assets purge"; also I'm talking about RECORDABLE ones and NOT pressed ones), and those are ALREADY "higher end than what most people would use".
Ultra-long-term data retention is a fascinating case. One I am particularly personally interested in (I am also interested in non-storage hardware that could work for more than a decade without failing in any way - something humanity doesn't consistently have atm - and it's somewhat infuriating for me personally how humanity still goes for profit over quality to this day, prioritizing designing stuff with repeated purchase [ergo income] in mind versus superior quality that would last for at least decades - this is something that has to change FAST - otherwise we could start drowning in ELECTRONIC trash in a few decades, not to mention that the working state of most infrastructure is based almost entirely on luck and hoping nothing BS will happen).
I could go on about this case for ages. But unless you want me to start writing a book here I will just not drag this off-topic too far.
So if you'll excuse me, we can continue this conversation (should you be interested) in private messages at some point (keep in mind I'm highly busy for the next 1.5 months and may respond with huge delays).