Machine Learning and Deep Learning datasets are MASSIVE. Just to put it into perspective, here are a couple of popular datasets across different modalities.
And all of this is before the data has been converted into tokens. Expect even more terabytes if you choose to preprocess and save the tokens before training. So high-density data storage was a core requirement when it came to building 🪐 Apollo 🛸.
However, with new Hard Disk Drives (HDDs) costing more than $20 per terabyte, the total cost of storage was causing the budget to grow out of control. In the end I went with refurbished HDDs with one leg in the grave 😱 (at least according to the internet). But how do we make educated decisions about the life left in a hard drive and avoid data loss due to hardware failures?
Warning
Some newer HDDs have a fun little power disable feature (PWDIS). I'm sure the engineers who made it thought it was useful, but what it means for you is that you may need a Molex-to-SATA power adapter to get your drives to properly draw power.
badblocks is a utility to find, you guessed it, bad blocks/sectors. As any drive gets used it will experience natural wear and tear that causes it to start to fail. Failures show up as sectors of the drive become unusable because data can no longer be reliably read from or written to those blocks. Since HDDs work by exploiting the physics of magnets to make plate-shaped magnetic rocks spin around super fast, this natural wear and tear is especially problematic.
First let's install badblocks and find the drives that we want to test.
$ yay -S e2fsprogs
$ fdisk -l
Let's assume we want to test drive /dev/sdx. If you try to run badblocks as is, it will throw Value too large for defined data type errors on large drives. To get around this, we will use the drive's block size and tell badblocks to batch the read and write tests to fit within that block size. Let's get /dev/sdx's recommended block size.
$ blockdev --getbsz /dev/sdx
For /dev/sdx the block size was 4096. Now let's run badblocks to scan our HDD.
$ badblocks -t random -w -s -b 4096 /dev/sdx
Warning
Running badblocks with the -w option is a destructive action that will overwrite any data on the drive. Back up any data that you don't want to lose or run badblocks non-destructively! Check out the wiki for more information.
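If wiping the drive is not an option, badblocks also has gentler modes. A rough sketch, reusing the 4096 block size from above:
# read-only scan, existing data is left untouched
$ badblocks -s -v -b 4096 /dev/sdx
# non-destructive read-write test (much slower, but preserves existing data)
$ badblocks -n -s -b 4096 /dev/sdx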
Tip
Since badblocks can take more than a day to run through larger drives, you can schedule the job to run in the background with nohup, &, and redirecting the output to a file. System reboots will still stop the background job.
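As a hedged example, the destructive scan from above can be pushed to the background like this (the log file name is just a placeholder):
# keep running after the terminal closes and send all output to a log file
$ nohup badblocks -t random -w -s -b 4096 /dev/sdx > badblocks_sdx.log 2>&1 &
# check on the progress later
$ tail -f badblocks_sdx.log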
While badblocks allows you to actively monitor your drive health by running tests, Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) helps passively track the health of your HDDs and Solid State Drives (SSDs). This also means that errors caused by the previous owner(s) should show up here too.
Let's start by installing the S.M.A.R.T. tools and finding the drives we want to check.
$ yay -S smartmontools
$ fdisk -l
# check if SMART is supported
$ smartctl -i /dev/sdx
You can manually kick off a short or a long S.M.A.R.T. test. The test will continue in the background and will take anywhere from a few minutes to several hours.
# see how long tests will take to run
$ smartctl -c /dev/sdx
# manually run tests
$ smartctl -t short /dev/sdx
$ smartctl -t long /dev/sdx
# this test finds drive damage during transport
$ smartctl -t conveyance /dev/sdx
Check if the manual test is done.
$ smartctl -l selftest /dev/sdx
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.7-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         43048     -
# 2  Short offline       Completed without error       00%         42997     -
# 3  Short offline       Completed without error       00%         42979     -
Warning
System reboots will cause the S.M.A.R.T. tests to abort.
The values that show up will be different for each brand and drive connector. Generally you can tell which values should be high and which should be low, but online wikis will have more up-to-date and detailed information.
SCSI S.M.A.R.T. Values :: This is used by drives that use the Universal Serial Bus (USB) connector.
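If you want to pull up the attribute table for your own drive, smartctl can print it directly. A quick sketch, assuming an ATA/SATA drive at /dev/sdx (USB enclosures sometimes need an extra -d option to pick the right bridge):
# dump the vendor attribute table
$ smartctl -A /dev/sdx
# or dump everything S.M.A.R.T. knows about the drive
$ smartctl -x /dev/sdx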
Example
In the S.M.A.R.T. values for the drive above, the UDMA_CRC_Error_Count is quite high. The wiki mentions that it could be due to a loose cable connection. And it was!
We can automate periodically running S.M.A.R.T. tests to keep up-to-date health data on our drives. smartd is a service that does exactly that, but first we need to enable it.
$ systemctl enable smartd
$ systemctl start smartd
Let's edit /etc/smartd.conf to configure smartd.
# /etc/smartd.conf
# `DEVICESCAN` :: run tests on all S.M.A.R.T. enabled drives
# `-a` :: monitor all S.M.A.R.T. values
# `-o on` :: enable automatic offline data collection
# `-S on` :: enable attribute autosave
# `-n standby,q` :: skip checks while the drive is spun down (increases drive lifespan)
# `-s (S/../.././02|L/../../4/03)` :: run a short test daily at 02:00 and a long test every Thursday at 03:00
# `-W 4,35,40` :: report temperature changes of 4°C or more, log at 35°C, warn at 40°C
# `-m [email protected]` :: send an email for alerts
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../4/03) -W 4,35,40 -m [email protected]
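After saving the file, restart the service so the new directives take effect. If you want to sanity-check the configuration first, smartd can run through it once in debug mode:
# reload the new configuration
$ systemctl restart smartd
# optional: parse the config, register devices, run one check cycle, and exit
$ smartd -q onecheck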
Depending on the results of the badblocks and S.M.A.R.T. tests, you may want to keep using the drive or send it back if it's still under warranty. Less-than-perfect drives are acceptable if you are using other tools like the Zettabyte File System (ZFS) to protect your data.
ZFS is a software-based Redundant Array of Inexpensive Drives (RAID-Z) solution that helps replicate data and maintain data integrity given the unpredictable nature of drive failure and data loss. However, all this power does come with a lot of responsibility. ZFS expects you to know what you're doing and isn't straightforward to expand.
Let's start by installing ZFS and finding the drives to use. For the ZFS package we're going to use Dynamic Kernel Module Support (DKMS) so the module can be rebuilt against whatever kernel we're running. This will be good if we ever need to upgrade or downgrade the drivers.
# install and check zfs
$ yay -S zfs-dkms zfs-utils
$ modprobe zfs
# find the drives
$ fdisk -l
Enable ZFS daemons to auto mount your pool on boot.
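The exact unit names can vary a little between packages, but with the Arch zfs-utils package enabling a set like this is typical:
# import pools from the cache file and mount their datasets on boot
$ systemctl enable zfs-import-cache.service zfs-mount.service
# pull in the ZFS targets and the event daemon
$ systemctl enable zfs-import.target zfs.target zfs-zed.service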
ZFS is smart enough to handle being given disks either by UUID or by /dev/sdx path. Internally ZFS maps to the drive's unique identifier, so you will not have to worry about the pool being mounted correctly on reboots.
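As a minimal sketch (the raidz1 layout and the four drive paths here are just placeholders), creating a pool named black_hole could look like this:
# create a raidz1 (single parity) pool from four drives
# ashift=12 forces 4 KiB sectors, which matches most modern HDDs
$ zpool create -o ashift=12 black_hole raidz1 /dev/sdw /dev/sdx /dev/sdy /dev/sdz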
Check the status of your new pool.
$ zpool status black_hole
$ zfs get all black_hole
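ZFS can also verify every checksum in the pool on demand with a scrub, which is worth running periodically as part of keeping an eye on things:
# walk the whole pool and verify checksums, repairing from parity where possible
$ zpool scrub black_hole
# progress and any repaired or unrecoverable errors show up in the status output
$ zpool status black_hole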
Tip
Finding large files and folders to clean up using just the CLI can be difficult. The du command lets you explore disk usage from the shell, and sudo will be required to search folders owned by root.
$ du -ah --max-depth 1
$ du -ah --max-depth 3 ~/downloads
If you have run your tests and decided to use the drive in your ZFS array, then there's not much more to do. Now that you have the patient's history and charts, you can make educated decisions about the security of your data. Keep monitoring the ZFS and S.M.A.R.T. values to ensure that your system's data is secure and use the money you saved to buy some ice cream 🍦!
Tip
Just because we have done all this doesn’t mean that your data is fully secure. Always follow the 3-2-1 rule for data and keep backups.
Have 3 copies of your data. Typically one copy is production/live and the other two are for backup.
On 2 different types of media. Currently our HDDs are magnetic disks, but other options include flash storage with SSDs, tape storage with tape drives, and more!
1 of the copies should be offsite and away from your system. This way if your house catches on fire you can still get the old data out of the cloud or your parent’s basement.
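As one example of keeping those extra copies with the tools already in play here (backup_pool is a hypothetical second pool, say on an external drive or an offsite machine), ZFS snapshots can be replicated wholesale:
# take a recursive, read-only point-in-time snapshot of the pool
$ zfs snapshot -r black_hole@backup-2024-01-01
# replicate it to another pool; the stream could also be piped over ssh to another box
$ zfs send -R black_hole@backup-2024-01-01 | zfs recv -F backup_pool/black_hole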