Machine Learning and Deep Learning datasets are MASSIVE. Just to put it into perspective, here are a few popular datasets across different modalities.

Dataset                | Approx. Size | Description
LAION 5B               | 9.5 TB       | dataset of labeled images used to train OpenCLIP
The Pile               | 825 GB       | open-source dataset of scraped internet data for language modeling
Mozilla’s Common Voice | 80 GB        | a human voice dataset used to train DeepSpeech
Meta’s SA-1B           | 10 TB        | dataset of annotated image segments used to train the Segment Anything Model (SAM)
YouTube 8M             | 80 GB        | dataset of annotated YouTube videos and their links
Berkeley’s BDD100K     | 1.9 TB       | dataset of annotated driving videos

And all of this is before the data has been converted into tokens; if you choose to preprocess and save the tokens before training, that is even more terabytes worth of data. So high-density data storage was a core requirement when it came to building 🪐 Apollo 🛸.

However, with new Hard Disk Drives (HDDs) costing more than $20 per terabyte, the total cost of storage was causing the budget to grow out of control. In the end I went with refurbished HDDs with one leg in the grave 😱 (at least according to the internet). But how do we make educated decisions about the life left in a hard drive and avoid data loss due to hardware failures?

Warning

Some newer HDDs have a fun little power disable feature (PWDIS). I’m sure the engineers who made it thought it was useful, but what it means for you is that you will need an ATA power adapter to get your drives to properly draw power.

badblocks :: The Bad Boy in Town

badblocks is a utility to find, you guessed it, bad blocks/sectors. As any drive gets used it experiences natural wear and tear that eventually causes it to fail. Failures show up as sectors of the drive becoming unusable because data can no longer be reliably read from or written to those blocks. Since HDDs work by exploiting the physics of magnets to make plate-shaped magnetic rocks spin around super fast, this natural wear and tear is especially problematic.

First, let’s install badblocks and find the drives that we want to test.

$ yay -S e2fsprogs
$ fdisk -l

Let’s assume we want to test drive /dev/sdx. If you try to run badblocks as is, it will throw Value too large for defined data type errors on large drives. To get around this, we will use the drive’s block size and tell badblocks to batch the read and write tests to fit within that block size. Let’s get /dev/sdx’s recommended block size.

$ blockdev --getbsz /dev/sdx

For /dev/sdx the block size was 4096. Now let’s run badblocks to scan our HDD.

$ badblocks -t random -w -s -b 4096 /dev/sdx

Warning

Running badblocks with the -w option is a destructive action that will overwrite any data on the drive. Back up any data that you don’t want to lose or run badblocks non-destructively! Check out the wiki for more information.
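
If the drive already has data on it that you care about, badblocks also has a non-destructive read-write mode; it is slower, but it preserves the existing contents. A minimal pass could look like this:

# non-destructive read-write test (preserves existing data)
$ badblocks -n -s -b 4096 /dev/sdx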

Tip

Since badblocks can take more than a day to run through a larger drive, you can schedule the job to run in the background with nohup, &, and by piping the output to a file. System reboots will still stop the background job.

$ nohup badblocks -t random -w -s -b 4096 /dev/sdx > dev.sdx.txt &

Tip

If badblocks fails partway through, you can continue the job by specifying where to pick back up. Note that badblocks expects the last block to check followed by the first block, so pass both to resume from the middle of the drive.

$ nohup badblocks -t random -w -s -b 4096 /dev/sdx $LAST_BLOCK $START_BLOCK > dev.sdx.2.txt &

Making S.M.A.R.T. Decisions

While badblocks allows you to actively test your drive health, Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) passively tracks the health of your HDDs and Solid State Drives (SSDs). This means that errors caused by the previous owner(s) should show up here as well.

Let’s start by installing the S.M.A.R.T. tools and finding the drives we want to check.

$ yay -S smartmontools
$ fdisk -l
 
# check if SMART is supported
$ smartctl -i /dev/sdx

Manual Tests

You can manually kick off a short or a long S.M.A.R.T. test. The test will continue in the background and can take anywhere from a few minutes to several hours.

# see how long tests will take to run
$ smartctl -c /dev/sdx
 
# manually run tests
$ smartctl -t short /dev/sdx
$ smartctl -t long /dev/sdx
# this test finds drive damage during transport
$ smartctl -t conveyance /dev/sdx

Check if the manual test is done.

$ smartctl -l selftest /dev/sdx
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.7-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     43048         -
# 2  Short offline       Completed without error       00%     42997         -
# 3  Short offline       Completed without error       00%     42979         -

Warning

System reboots will cause the S.M.A.R.T. tests to abort.
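
If you want to know how far along a running test is before rebooting, the self-test execution status (including the percent remaining) is part of the capabilities output:

$ smartctl -c /dev/sdx | grep -A 2 "Self-test execution status"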

Understanding S.M.A.R.T. Values

Let’s start by printing out the S.M.A.R.T. values.

$ smartctl -A /dev/sdx
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.7-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       96
  3 Spin_Up_Time            0x0007   154   154   024    Pre-fail  Always       -       448 (Average 405)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       37
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   094   094   000    Old_age   Always       -       43051
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       37
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       1603
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       1603
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 22/49)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       2727

The values that show up will differ for each brand and drive connector. Generally you can tell which values should be high and which should be low, but online wikis will have more up-to-date and detailed information.
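
As a quick first pass before digging into individual attributes, smartctl can also print the drive’s coarse overall self-assessment:

# overall health verdict (PASSED/FAILED)
$ smartctl -H /dev/sdx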

Example

In the S.M.A.R.T. values for the drive above, the UDMA_CRC_Error_Count is quite high. The wiki mentions that it could be due to a loose cable connection. And it was!
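
If you run into the same issue, reseat the cable and then watch whether the raw CRC error count keeps climbing (the counter itself does not reset, so what matters is that it stops increasing):

# filter the attribute table down to the CRC error count
$ smartctl -A /dev/sdx | grep -i crc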

Automated Tests and Alerts

We can automate periodically running S.M.A.R.T. tests to keep up-to-date health data on our drives. smartd is a service that does exactly that, but first we need to enable it.

$ systemctl enable smartd
$ systemctl start smartd

Let’s edit /etc/smartd.conf to configure smartd.

# /etc/smartd.conf
 
# `DEVICESCAN` :: run tests on all S.M.A.R.T.-enabled drives
# `-a` :: for all S.M.A.R.T. values
# `-o on` :: save offline data
# `-S on` :: save attribute data
# `-n standby,q` :: skips the test if the drive is not active (increases drive lifespan)
# `-s (S/../.././02|L/../../4/03)` :: runs a daily short test and a long test on the 4th of each month
# `-W 4,35,40` :: log changes and dangerous operating temps
# `-m [email protected]` :: send an email for alerts
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../4/03) -W 4,35,40 -m [email protected]
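
Before trusting the schedule, it is worth checking that the config parses and that the service picks it up. One way to sanity check it (smartd’s onecheck mode registers the drives, runs the monitoring pass once in debug mode, and exits):

# validate /etc/smartd.conf by running the checks once
$ smartd -q onecheck
 
# restart smartd so it picks up the new config
$ systemctl restart smartd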

ZFS, the Final Frontier

Depending on the results of the badblocks and S.M.A.R.T. tests, you may want to keep using the drive or send it back if it’s still under warranty. Less-than-perfect drives are acceptable if you are using other tools like the Zettabyte File System (ZFS) to protect your data.

ZFS is a software-based Redundant Array of Inexpensive Disks (RAID-Z) solution that helps replicate data and maintain data integrity given the unpredictable nature of drive failure and data loss. However, all this power does come with a lot of responsibility. ZFS expects you to know what you’re doing and isn’t straightforward to expand.

Let’s start by installing ZFS and finding the drives to use. For the ZFS package we’re going to use the Dynamic Kernel Module Support (DKMS) version so the module can be rebuilt against whatever kernel we are running. This will be good if we ever need to upgrade or downgrade the drivers.

# install and check zfs
$ yay -S zfs-dkms zfs-utils
$ modprobe zfs
 
# find the drives
$ fdisk -l

Enable the ZFS daemons to auto-mount your pool on boot.

$ systemctl enable zfs-import-cache
$ systemctl enable zfs-import-scan
$ systemctl enable zfs-mount
$ systemctl enable zfs.target

To keep things simple, I will be using all the drives in a single raidz3 pool. To keep with the theme, let’s name the pool black_hole.

# create zfs pool
$ zpool create black_hole raidz3 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
$ zfs set mountpoint=/mnt/black_hole black_hole
 
# (optional) enable compression
$ zfs set compression=on black_hole
 
# (optional) create encrypted dataset
$ zfs create -o encryption=on -o keyformat=passphrase black_hole/enc
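
If you go with the encrypted dataset, note that the key is not loaded automatically after a reboot. A minimal sketch of bringing it back up, assuming the passphrase keyformat above:

# load the passphrase key and mount the encrypted dataset
$ zfs load-key black_hole/enc
$ zfs mount black_hole/enc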

Note

ZFS is smart enough to handle disks given either by UUID or by /dev/sdx name. Internally ZFS maps to the drive’s unique identifier, so you will not have to worry about the pool mounting incorrectly on reboots.
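
If you want to see or use the stable identifiers yourself, they live under /dev/disk/by-id, and an existing pool can be re-imported with them:

# list the stable identifiers for your drives
$ ls -l /dev/disk/by-id/
 
# re-import the pool using the by-id paths
$ zpool export black_hole
$ zpool import -d /dev/disk/by-id black_hole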

Check the status of your new pool.

$ zpool status black_hole
$ zfs get all black_hole

Tip

Finding large files and folders to clean up using just the CLI can be difficult. You can use the du command to explore your system and find candidates to delete. sudo will be required to search folders owned by root.

$ du -ah --max-depth 1
$ du -ah --max-depth 3 ~/downloads

Tip

Check out Learn X in Y minutes, Where X=zfs for a quick start zfs guide.

What Now?

If you have run your tests and decided to use the drive in your ZFS array, then there’s not much more to do. Now that you have the patient’s history and charts, you can make educated decisions about the security of your data. Keep monitoring ZFS and the S.M.A.R.T. values to ensure that your system’s data is safe, and use the money you saved to buy some ice cream 🍦!
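
For ongoing ZFS monitoring, a periodic scrub forces the pool to read every block and verify it against its checksums, repairing what it can from parity:

# scrub the pool and check the results once it finishes
$ zpool scrub black_hole
$ zpool status black_hole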

Tip

Just because we have done all this doesn’t mean that your data is fully secure. Always follow the 3-2-1 rule for data and keep backups.

  • Have 3 copies of your data. Typically one copy is production/live and the other two are backups.
  • On 2 different types of media. Currently our HDDs are magnetic disks, but other options include flash storage with SSDs, tape storage with tape drives, and more!
  • 1 of the copies should be offsite and away from your system. This way, if your house catches on fire, you can still get the old data out of the cloud or your parent’s basement (one way to ship a copy offsite with ZFS is sketched below).
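
One way to get that offsite copy with the tools already in play is a ZFS snapshot plus zfs send. A minimal sketch, where backup-host and backup_pool are hypothetical placeholders for your own remote machine and pool:

# snapshot the pool and replicate it to a remote machine
$ zfs snapshot black_hole@offsite-backup
$ zfs send black_hole@offsite-backup | ssh backup-host zfs recv backup_pool/black_hole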