Machine Learning and Deep Learning datasets are MASSIVE. Just to put it into perspective, here are a few popular datasets across different modalities.
| Dataset | Approx. Size | Description |
| --- | --- | --- |
| LAION-5B | 9.5 TB | dataset of labeled images used to train OpenCLIP |
| The Pile | 825 GB | open-source dataset of scraped internet data for language modeling |
| Mozilla’s Common Voice | 80 GB | a human voice dataset used to train DeepSpeech |
| Meta’s SA-1B | 10 TB | dataset of annotated image segments used to train the Segment Anything Model (SAM) |
| YouTube-8M | 80 GB | dataset of annotated YouTube videos and their links |
| Berkeley’s BDD100K | 1.9 TB | dataset of annotated driving videos |
And all this data is before it has been converted into tokens. That will be even more terabytes worth of data if you choose to preprocess and save the tokens before training. So high-density data storage was a core requirement when it came to building 🪐 Apollo 🛸.
However, with new Hard Disk Drives (HDDs) costing more than $20 per terabyte, the total cost of storage was causing the budget to grow out of control. In the end I went with refurbished HDDs with one leg in the grave 😱 (at least according to the internet). But how do we make educated decisions about the life left in a hard drive and avoid data loss due to hardware failures?
Warning
Some newer HDDs have a fun little power disable feature (PWDIS). I’m sure the engineers who made it thought it was useful, but what it means for you is that you will need an ATA power adapter to get your drives to properly draw power.
`badblocks` :: The Bad Boy in Town
`badblocks` is a utility to find, you guessed it, bad blocks/sectors. As any drive gets used it will experience natural wear and tear, causing the drive to start to fail. Failures show up as sectors of the drive become unusable because data can no longer be reliably read from or written to those blocks. Since HDDs work by exploiting the physics of magnets to make plate-shaped magnetic rocks spin around super fast, this natural wear and tear is especially problematic.
First let’s install `badblocks` and find the drives that we want to test.
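A minimal sketch, assuming a Debian/Ubuntu system (`badblocks` ships with the `e2fsprogs` package):

```bash
# badblocks is part of e2fsprogs on most distros
sudo apt install e2fsprogs

# list block devices with model and serial info to pick the drive to test
lsblk -o NAME,SIZE,MODEL,SERIAL
```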
Let’s assume we want to test drive `/dev/sdx`. If you try to run `badblocks` as is, it will throw `Value too large for defined data type` errors on large drives. To get around this, we will use the drive’s block size and tell `badblocks` to batch the read and write tests to fit in the block size. Let’s get `/dev/sdx`’s recommended block size.
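One way to ask the kernel, using `blockdev`:

```bash
# print the block size the kernel reports for the drive
sudo blockdev --getbsz /dev/sdx
```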
For `/dev/sdx` the block size was `4096`. Now let’s run `badblocks` to scan our HDD.
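A sketch of a destructive read-write scan with the flags I’d reach for (the output path is a placeholder):

```bash
# -b: use the block size we found, -c: test 64 blocks per batch,
# -w: destructive write-mode test, -s: show progress, -v: verbose,
# -o: write any bad blocks found to a file
sudo badblocks -b 4096 -c 64 -wsv -o sdx-badblocks.txt /dev/sdx
```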
Warning
Running `badblocks` with the `-w` option is a destructive action that will overwrite any data on the drive. Back up any data that you don’t want to lose or run `badblocks` non-destructively! Check out the wiki for more information.
Tip
Since `badblocks` can take more than a day to run through larger drives, you can schedule the job to run in the background with `nohup`, `&`, and by piping the output to a file, as sketched below. System reboots will still stop the background job.
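A minimal sketch of what that could look like (file names are placeholders):

```bash
# cache sudo credentials first so the backgrounded job doesn't stall on a password prompt
sudo -v

# keep the scan running after the terminal closes (but not after a reboot);
# stdout and stderr both go to the log file
sudo nohup badblocks -b 4096 -c 64 -wsv -o sdx-badblocks.txt /dev/sdx > sdx-badblocks.log 2>&1 &
```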
Tip
If `badblocks` fails partway through, you can continue the job by specifying a block to start at, as sketched below.
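The positional arguments after the device are the last block to check and then the first block to start from, so a resumed run could look like this (block numbers are hypothetical):

```bash
# resume a scan: check blocks 123456 through 976754645
# (positional order is LAST_BLOCK then FIRST_BLOCK)
sudo badblocks -b 4096 -c 64 -wsv -o sdx-badblocks.txt /dev/sdx 976754645 123456
```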
Making S.M.A.R.T. Decisions
While `badblocks` allows you to actively monitor your drive health by running tests, Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) helps passively track the health of your HDDs and Solid State Drives (SSDs). This also means that errors caused by the previous owner(s) should show up here.
Let’s start by installing the S.M.A.R.T. tools and finding the drives we want to check.
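Assuming Debian/Ubuntu, the tooling lives in the `smartmontools` package:

```bash
sudo apt install smartmontools

# list the drives that smartctl can talk to
sudo smartctl --scan
```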
Manual Tests
You can manually kick off a short or a long S.M.A.R.T. test. The test will continue in the background and will take anywhere from a few minutes to several hours.
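For example, to kick off a short test:

```bash
# start a short self-test; swap "short" for "long" for the thorough version
sudo smartctl -t short /dev/sdx
```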
Check if the manual test is done.
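The self-test log shows both finished and in-progress tests:

```bash
# show the self-test log for the drive
sudo smartctl -l selftest /dev/sdx
```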
Warning
System reboots will cause the S.M.A.R.T. tests to abort.
Understanding S.M.A.R.T. Values
Let’s start by printing out the S.M.A.R.T. values.
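For example:

```bash
# -a prints all SMART info, including the vendor attribute table
sudo smartctl -a /dev/sdx
```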
The values that show up will be different for each brand and drive connector. Generally you can tell which values should be high and which should be low, but online wikis will have more up-to-date and detailed information.
- ATA S.M.A.R.T. Values :: Used by standard HDDs and SSDs with SATA connectors.
- NVMe S.M.A.R.T. Values :: Used by drives with the M.2 connector.
- SCSI S.M.A.R.T. Values :: Used by drives with the Universal Serial Bus (USB) connector.
Example
In the S.M.A.R.T. values for the drive above, the `UDMA_CRC_Error_Count` is quite high. The wiki mentions that it could be due to a loose cable connection. And it was!
Automated Tests and Alerts
We can automate periodically running S.M.A.R.T. tests to keep up-to-date health data on our drives. `smartd` is a service that helps us do exactly that, but first we need to enable it.
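Assuming the systemd unit is named `smartd` (on some distros it is `smartmontools`):

```bash
# enable the daemon now and on every boot
sudo systemctl enable --now smartd
```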
Let’s edit `/etc/smartd.conf` to configure `smartd`.
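A minimal sketch of one entry, following the scheduling syntax from the `smartd.conf` man page (`/dev/sdx` and the mail target are placeholders):

```
# /etc/smartd.conf
# -a: monitor all attributes
# -s: run a short self-test every day at 02:00 and a long one Saturdays at 03:00
# -m: mail warnings to root
/dev/sdx -a -s (S/../.././02|L/../../6/03) -m root
```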
ZFS, the Final Frontier
Depending on the results of the `badblocks` and S.M.A.R.T. tests, you may want to keep using the drive or send it back if it’s still under warranty. Imperfect drives are acceptable if you are using other tools like the Zettabyte File System (ZFS) to protect your data.
ZFS is a software-based Redundant Array of Inexpensive Disks (RAID) solution, with its own RAID-Z levels, that helps replicate data and maintain data integrity given the unpredictable nature of drive failure and data loss. However, all this power does come with a lot of responsibility. ZFS expects you to know what you’re doing and isn’t straightforward to expand.
Let’s start by installing ZFS and finding the drives to use. For the ZFS package we’re going to use Dynamic Kernel Module Support (DKMS) to allow the system to easily swap between driver versions. This will be good if we ever need to upgrade or downgrade the drivers.
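Assuming Debian/Ubuntu, where the DKMS build is packaged as `zfs-dkms`:

```bash
# zfs-dkms rebuilds the ZFS kernel module whenever the kernel changes
sudo apt install zfs-dkms zfsutils-linux

# find the drives to pool
lsblk -o NAME,SIZE,MODEL,SERIAL
```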
Enable the ZFS daemons to auto-mount your pool on boot.
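Assuming a systemd-based distro, the relevant units are:

```bash
# import the pool from the cache file and mount its datasets at boot
sudo systemctl enable zfs-import-cache.service zfs-mount.service zfs.target
```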
To keep things simple I will be using all the drives in a single raidz3 pool. To keep with the theme, let’s name the pool `black_hole`.
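A sketch with six placeholder device names:

```bash
# one raidz3 vdev survives up to three simultaneous drive failures
sudo zpool create black_hole raidz3 \
  /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
```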
Note
ZFS is smart enough to handle getting disks by either UUID or `/dev/sdx`. Internally, ZFS maps to the UUID of the drive, so you will not have to worry about the pool being mounted correctly on reboots.
Check the status of your new pool.
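For example:

```bash
# overall pool health plus per-drive state and error counters
zpool status black_hole
```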
Tip
Finding large files and folders to clean up using just the CLI can be difficult. If you are looking for files to clean up, you can use the `du` command to explore your system, as sketched below. `sudo` will be required to search folders owned by root.
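One way to surface the biggest directories, largest first:

```bash
# human-readable sizes of top-level directories, sorted largest first
sudo du -h -d 1 / 2>/dev/null | sort -rh | head -n 20
```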
Tip
Check out Learn X in Y minutes, Where X=zfs for a quick-start ZFS guide.
What Now?
If you have run your tests and decided to use the drive in your ZFS array, then there’s not much more to do. Now that you have the patient’s history and charts, you can make educated decisions about the security of your data. Keep monitoring the ZFS and S.M.A.R.T. values to ensure that your system’s data is secure, and use the money you saved to buy some ice cream 🍦!
Tip
Just because we have done all this doesn’t mean that your data is fully secure. Always follow the 3-2-1 rule for data and keep backups.
- Have 3 copies of your data. Typically one copy is production/live and the other two are for backup.
- On 2 different types of media. Currently our HDDs are magnetic disks, but other options include flash storage with SSDs, tape storage with tape drives, and more!
- 1 of the copies should be offsite and away from your system. This way, if your house catches on fire you can still get the old data out of the cloud or your parent’s basement.