Data Archiving Specification
This page serves as a home for my evolving approach to digital data archival. It contains my intentions, thought process, resources, and the current schema.
Why
- Data created over a lifetime begs to be archived, or risks extinction
- Housing data in the cloud has risks (service shutting down, privacy, etc.)
- Streaming services can’t guarantee content availability in the future
- There’s some emotional satisfaction in building a system that lasts
Goals
- Simple - The simpler it is, the longer I’ll use and trust it
- Intuitive - Should be useable without external knowledge a decade from now
- Digestible - External organizing tools should be able to map it
Backups
The 3-2-1 backup rule should be followed as closely as possible:
3 - “Data doesn’t exist if there’s not at least three copies of it.”
2 - “Two copies are local, but on different devices.”
1 - “At least one copy off-site.”
RAID is not a backup!
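The rule above can be expressed as a small check. This is only a sketch: the `Copy` class, its fields, and the device labels are hypothetical names chosen for illustration, and the rule is encoded exactly as quoted (three copies, two local on different devices, one off-site).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One copy of a dataset. Field values are hypothetical labels."""
    device: str    # e.g. "nas" or "usb-drive"; a whole RAID array counts as one device
    offsite: bool  # True if the copy lives at a different physical location


def satisfies_3_2_1(copies: list[Copy]) -> bool:
    """Return True if the copies meet the 3-2-1 rule as quoted above."""
    # Distinct devices holding a local (on-site) copy.
    local_devices = {c.device for c in copies if not c.offsite}
    has_offsite = any(c.offsite for c in copies)
    return len(copies) >= 3 and len(local_devices) >= 2 and has_offsite
```

For example, a NAS, a USB drive, and one off-site copy pass the check, while two copies on the same array plus one off-site copy do not, because only one local device is involved.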
Common Strategies
From my research, there seem to be a few schools of thought on how to organize data.
TODO: Add sources here.
By data type
Intent:
- “I want to browse all the books in my collection.”
- “I want to point my media player at the movies directory.”
Pros:
- Can be easily digested by external apps
- Less effort than manually organizing by arbitrary subjects
- Fewer total directories and less nesting
Cons:
- Directory structure isn't intuitive when looking for a specific subject
- Requires external dependencies for organizing by topic
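Part of why this approach takes little effort is that the target directory can be picked mechanically. A minimal sketch, assuming the file extension is a good enough signal; the `TYPE_DIRS` mapping and the `type_directory` helper are hypothetical and would need adjusting to a real collection:

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-directory mapping, for illustration only.
TYPE_DIRS = {
    ".epub": "ebooks",
    ".pdf": "ebooks",
    ".mp3": "music",
    ".flac": "music",
    ".mkv": "movies",
    ".mp4": "movies",
}


def type_directory(filename: str, default: str = "misc") -> str:
    """Return the top-level directory for a file, chosen only by its extension."""
    suffix = PurePosixPath(filename).suffix.lower()
    return TYPE_DIRS.get(suffix, default)
```

The same sketch also shows the main drawback: nothing in the result says what the file is about, so finding content by topic needs some other tool.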
By subject
Intent:
- “I want to browse all content about Linux.”
- “I want to save a random article about Woodworking.”
Pros:
- More intuitive to navigate the file system when seeking data
- Files can still be found by type with a simple search, which is much easier than searching by subject
Cons:
- Harder for external apps to digest
- Much more effort to initially organize
By date
Intent:
- “I want to see all photos I took in 2015.”
- “I want to see all movies released in 1945.”
Pros:
- Good for timelines of personal data (photos, school, medical, etc)
Cons:
- Difficult to navigate
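A date-based layout can also be generated mechanically. The sketch below assumes a `root/YYYY/MM/filename` structure, which is one possible convention rather than anything this page prescribes; `dated_path` is a hypothetical helper name.

```python
from datetime import date
from pathlib import PurePosixPath


def dated_path(root: str, taken: date, filename: str) -> PurePosixPath:
    """Build root/YYYY/MM/filename from the date the file was created."""
    # Zero-padded year and month keep directories in chronological order when sorted by name.
    return PurePosixPath(root) / f"{taken.year:04d}" / f"{taken.month:02d}" / filename
```

With this layout, "all photos from 2015" is a single directory, but finding a specific event means remembering roughly when it happened.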
Other Considerations
The level of effort to implement and maintain a system is important. At least three aspects need to be considered:
- Initial - Effort to set up the system and do the initial sorting
- Modification - Effort/complexity to add/edit data and update references
- Navigation - Effort to navigate and find data
Example Schema
TODO: Move this to a gist and embed it so the history is easily accessible.
tank/ (root zpool)
    archive/ (low avail, high compression dataset)
        literature/
            cheatsheets/
            manuals/
            papers/
            quotes/
        software/
            fonts/
            operatingsystems/
            games/
            windows/
    media/ (high avail, low compression dataset - unoriginal content)
        .dump/ (staging area for new data)
            audiobooks/
            ebooks/
            movies/
            movies3d/
            music/
            tmp/
            tv/
        README.txt (info about why things are organized this way)
        audio/
            audiobooks/
            comedy/
            music/
        literature/
            ebooks/
        video/
            courses/
            comedy/
            movies/
                Movie Name, The (YYYY)/
                    metadata/
            movies3d/
                Movie Name, The (YYYY)/
                    metadata/
            tv/
                Show Name, The (YYYY)/
                    Season 01/
                    metadata/
            youtube/
                Channel Name/
                    video.name-videoid.mkv
                    video.name-videoid-thumb.mkv
                Steve Jobs/
                    something_something_interview-e4dF2s.mkv
                Elon Musk/
        subject/ (content by subject... needs more thought)
            README.txt (info file outlining how subjects are structured)
            Psychology/
                README.txt (notes/paths to related media in ebooks, movies, etc.)
                Depression/
                    Depression Education Course/
                        01 - introduction.mp4
                    Depression Sucks Paper.pdf
            Steve Jobs/
                README.txt (paths to related content (books, movies, yt, etc.))
    vault/ (high avail, low compression dataset - original content)
        .dump/
            photos/
            documents/
        README.txt
        dwelling/
            Travel Trailer/
        finance/
            taxes/
            retirement/
            budget.xls
        travel/
            201804 - California Road Trip/
                photos/
                itinerary.pdf
        health/
            fitness/
            therapy/
        hobby/
            hiking/
            homelab/
            woodworking/
            diving/
            gaming/
            coding/
            trading/
        work/
            resumes/
            hiring/
            201910 - Company/ (job start date YYYYMM)
        volunteer/
            pcta/
        misc/
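Two naming patterns appear in the schema above: "Movie Name, The (YYYY)" for titles and "YYYYMM - Name" for dated folders. The sketch below shows one way to generate both consistently. The function names and the `ARTICLES` tuple are hypothetical; the schema itself only shows "The" being moved to the end.

```python
from datetime import date

# Assumption: "A" and "An" are treated like "The"; the schema only shows "The".
ARTICLES = ("The", "A", "An")


def movie_dir(title: str, year: int) -> str:
    """Format a title as 'Name, Article (YYYY)' so names sort by their first real word."""
    first, _, rest = title.partition(" ")
    if rest and first in ARTICLES:
        title = f"{rest}, {first}"
    return f"{title} ({year})"


def dated_dir(start: date, label: str) -> str:
    """Format a folder name as 'YYYYMM - Label' so folders sort chronologically."""
    return f"{start:%Y%m} - {label}"
```

Generating names this way, rather than typing them by hand, keeps sorting predictable and makes the folders easier for external tools to parse.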
Schema Notes
- Adding the ZFS datasets may be overkill. It makes the schema a bit less file system agnostic. I may move it to a separate post at some point.
- Need to think about data security (do I want my taxes available to anyone with access to the vault?).
Resources
TODO: Find and add more of the resources I’ve used.