A proposal for building responsible content archiving tools that import web content, wikis, and social media into Seed Hypermedia. The goal is preservation and accessibility, not theft - we're archivists, not pirates.
Why Archive to SHM?
Content on the web is ephemeral. Tweets get deleted. Websites go offline. Wikis get vandalized. Platforms shut down. By archiving to SHM, we create:
• Permanent, content-addressed copies that can't be altered
• Cryptographic proof of what existed and when
• Decentralized storage across the p2p network
• Clear provenance and attribution metadata
Core Principles
1. One Key Per Source
Each archived source gets its own cryptographic identity. An archive of @elonmusk's tweets would have a dedicated key, separate from an archive of Wikipedia articles. This provides:
• Clear namespace separation
• Ability to verify all content from one source
• Easy discovery (follow the archive account)
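To make one-key-per-source concrete, here is a minimal Python sketch using the cryptography package. The on-disk PEM layout and the source-ID format are placeholder assumptions; a real archiver would go through Seed's own key management rather than writing unencrypted keys to disk.

# Sketch: one dedicated signing key per archived source (illustrative;
# real key management should go through Seed's own identity tooling).
from pathlib import Path
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

KEY_DIR = Path("archive-keys")  # assumed local layout

def key_for_source(source_id: str) -> Ed25519PrivateKey:
    """Load or create the dedicated key for one source, e.g. 'twitter:username'."""
    KEY_DIR.mkdir(exist_ok=True)
    path = KEY_DIR / f"{source_id.replace(':', '_')}.pem"
    if path.exists():
        return serialization.load_pem_private_key(path.read_bytes(), password=None)
    key = Ed25519PrivateKey.generate()
    path.write_bytes(key.private_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.NoEncryption(),  # fine for a sketch only
    ))
    return key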
2. Rich Provenance Metadata
Every archived document includes metadata explaining its origin:
{
"name": "Tweet by @username - 2024-01-15",
"archive_source": "twitter",
"archive_source_url": "https://twitter.com/username/status/123",
"archive_source_author": "@username",
"archive_timestamp": "2024-01-16T10:30:00Z",
"archive_tool": "shm-twitter-archiver/1.0",
"archive_note": "Archived for preservation purposes"
}
3. Responsible Archiving
We are archivists, not content thieves. Guidelines:
• Always attribute the original creator
• Link back to the original source when possible
• Respect robots.txt and explicit no-archive requests (a robots.txt check is sketched after this list)
• Focus on preservation value (historical, at-risk content)
• Don't monetize others' content
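The robots.txt rule is easy to automate. A minimal sketch using Python's standard-library robotparser; the user-agent string is an assumption:

# Sketch: check robots.txt before archiving a URL (stdlib only).
from urllib import robotparser
from urllib.parse import urlparse

ARCHIVER_UA = "shm-archiver/0.1"  # assumed user-agent string

def may_archive(url: str) -> bool:
    """Return True if the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return False  # be conservative if robots.txt is unreachable
    return rp.can_fetch(ARCHIVER_UA, url)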
Architecture
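All of the archivers below share one pipeline: fetch the source content, convert it into SHM blocks, upload any media to IPFS, then sign and publish under the source's dedicated key. A hedged sketch of that shared interface in Python; every class and method name here is hypothetical, standing in for the real Seed client APIs:

# Sketch of the shared archiver pipeline; all names are hypothetical
# stand-ins for the real Seed Hypermedia client APIs.
from abc import ABC, abstractmethod

class Archiver(ABC):
    """One subclass per source type (web page, wiki article, tweet, feed item)."""

    @abstractmethod
    def fetch(self, ref: str) -> dict:
        """Download raw content plus provenance fields for one item."""

    @abstractmethod
    def to_blocks(self, raw: dict) -> list[dict]:
        """Convert raw content into SHM-style blocks."""

    def archive(self, ref: str) -> dict:
        raw = self.fetch(ref)
        doc = {
            "metadata": raw["metadata"],  # provenance, as in the schema above
            "blocks": self.to_blocks(raw),
        }
        # Signing with the source's dedicated key and publishing through the
        # Seed daemon are out of scope for this sketch.
        return doc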
Source-Specific Strategies
Web Pages
Convert HTML to SHM blocks. Preserve structure (headings, paragraphs, lists, code). Upload images to IPFS. Store original URL and archive date.
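A minimal sketch of the HTML-to-blocks step using requests and BeautifulSoup. The block dictionaries are illustrative stand-ins for the real SHM block schema, and image upload to IPFS is omitted:

# Sketch: flatten an HTML page into simple SHM-like blocks.
import requests
from bs4 import BeautifulSoup

def html_to_blocks(url: str) -> list[dict]:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    blocks = []
    for el in soup.find_all(["h1", "h2", "h3", "p", "pre", "li"]):
        text = el.get_text(" ", strip=True)
        if not text:
            continue
        kind = {"p": "paragraph", "pre": "code", "li": "list_item"}.get(
            el.name, "heading")
        blocks.append({"type": kind, "text": text})
    return blocks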
Wikipedia
Use MediaWiki API to fetch articles. Preserve wiki markup or convert to SHM. Track revision IDs for version provenance. Great candidate for bulk archiving.
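A sketch of fetching one article's current wikitext and revision ID through the standard MediaWiki API. The endpoint and query parameters are MediaWiki's own; the returned document shape is our invention:

# Sketch: fetch an article's current wikitext and revision id.
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_article(title: str) -> dict:
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "ids|timestamp|content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    page = requests.get(API, params=params, timeout=30).json()["query"]["pages"][0]
    rev = page["revisions"][0]
    return {
        "title": page["title"],
        "revision_id": rev["revid"],            # for version provenance
        "revision_timestamp": rev["timestamp"],
        "wikitext": rev["slots"]["main"]["content"],
    }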
Twitter/X
Archive individual tweets or entire accounts. Preserve media (images, videos). Thread reconstruction. Quote tweets as embeds. Handle deletion gracefully.
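Thread reconstruction is pure data wrangling once the tweets are fetched. A sketch assuming each tweet dict carries id, in_reply_to, and author fields; the actual field names will differ by API or export format:

# Sketch: rebuild a linear thread from already-fetched tweets.
def reconstruct_thread(tweets: list[dict], root_id: str) -> list[dict]:
    """Return the root tweet followed by the author's own replies, in order."""
    by_parent = {t["in_reply_to"]: t for t in tweets if t.get("in_reply_to")}
    by_id = {t["id"]: t for t in tweets}
    thread = [by_id[root_id]]
    while thread[-1]["id"] in by_parent:
        nxt = by_parent[thread[-1]["id"]]
        if nxt["author"] != thread[0]["author"]:
            break  # a reply by someone else ends the thread
        thread.append(nxt)
    return thread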
YouTube
Archive metadata, descriptions, and transcripts. Video files are large - consider storing references or thumbnails. Channel archiving for at-risk content.
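A sketch of metadata-only archiving with the yt-dlp library, skipping the video file itself. Which info fields are present varies by video, so treat every key below as best-effort:

# Sketch: archive a video's metadata (not the file) via yt-dlp.
from yt_dlp import YoutubeDL

def fetch_video_metadata(url: str) -> dict:
    with YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
        info = ydl.extract_info(url, download=False)
    return {
        "name": info.get("title"),
        "archive_source": "youtube",
        "archive_source_url": info.get("webpage_url", url),
        "archive_source_author": info.get("uploader"),
        "description": info.get("description"),
        "thumbnail_url": info.get("thumbnail"),  # store this, not the video
    }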
RSS/Atom Feeds
Continuous archiving of blog posts and news. Great for ongoing preservation. Each feed item becomes a document.
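A sketch using the feedparser library to turn each feed entry into a document stub; the output shape mirrors the provenance schema above but is still illustrative:

# Sketch: one document stub per RSS/Atom feed entry.
import feedparser

def feed_to_documents(feed_url: str) -> list[dict]:
    feed = feedparser.parse(feed_url)
    return [
        {
            "name": entry.get("title", "(untitled)"),
            "archive_source": "rss",
            "archive_source_url": entry.get("link"),
            "archive_source_published": entry.get("published"),
            "body": entry.get("summary", ""),
        }
        for entry in feed.entries
    ]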
Profile Metadata Schema
Each archive account's home document should clearly identify itself:
{
"name": "Archive: @username Twitter",
"description": "Archived tweets from @username for preservation",
"icon": "ipfs://...",
// Custom archive metadata
"archive_type": "twitter_account",
"archive_subject": "@username",
"archive_subject_url": "https://twitter.com/username",
"archive_started": "2024-01-01",
"archive_operator": "IonBobcat",
"archive_policy": "Public content only, respecting deletion requests"
}
Implementation Roadmap
Phase 1: Web Page Archiver
Build a tool that takes a URL and creates an SHM document. HTML parsing, image upload, metadata extraction. This is the foundation for all other archivers.
Phase 2: Wikipedia Archiver
MediaWiki API integration. Bulk article archiving. Version tracking. Good test case for large-scale archiving.
Phase 3: Social Media Archivers
Twitter, Mastodon, Bluesky. Account-level archiving. Media handling. Thread reconstruction.
Phase 4: Continuous Archiving
RSS feed monitoring. Scheduled archiving jobs. Change detection and updates.
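Change detection can be as simple as fingerprinting each item's body and comparing against what was last archived. A stdlib-only sketch; the seen map would be persisted between runs:

# Sketch: skip re-archiving items whose content hasn't changed.
import hashlib
import json

def content_fingerprint(doc: dict) -> str:
    """Stable hash of a document's body, ignoring volatile metadata."""
    body = json.dumps(doc.get("body", ""), sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def needs_update(doc: dict, seen: dict[str, str]) -> bool:
    """seen maps source URL -> fingerprint from the last archiving run."""
    url = doc["archive_source_url"]
    fp = content_fingerprint(doc)
    if seen.get(url) == fp:
        return False
    seen[url] = fp  # record the new fingerprint for next time
    return True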
Technical Challenges
• Rate limiting and API access for social platforms (a simple backoff wrapper is sketched after this list)
• Large media files (videos, high-res images)
• Dynamic content (JavaScript-rendered pages)
• Authentication for private/protected content
• Storage costs for large archives
• Legal considerations (DMCA, copyright, terms of service)
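For the rate-limiting point, a simple exponential-backoff wrapper goes a long way. A stdlib-only sketch:

# Sketch: retry a rate-limited call with exponential backoff.
import time

def with_backoff(call, attempts: int = 5, base_delay: float = 1.0):
    """Run call() and retry on failure, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * (2 ** attempt))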
Why This Matters
The Internet Archive does incredible work, but it's centralized. Wikipedia articles can be rewritten or deleted. Twitter can ban accounts. Websites can be seized. By creating decentralized, cryptographically signed archives, we ensure that important content survives regardless of platform politics or technical failures.
This is digital preservation for the long term. Our archives could outlive the platforms they came from.