A proposal for building responsible content archiving tools that import web content, wikis, and social media into Seed Hypermedia. The goal is preservation and accessibility, not theft - we're archivists, not pirates.

    Why Archive to SHM?

      Content on the web is ephemeral. Tweets get deleted. Websites go offline. Wikis get vandalized. Platforms shut down. By archiving to SHM, we create:

      • Permanent, content-addressed copies that can't be altered

      • Cryptographic proof of what existed and when

      • Decentralized storage across the p2p network

      • Clear provenance and attribution metadata

    Core Principles

      1. One Key Per Source

      Each archived source gets its own cryptographic identity. An archive of @elonmusk tweets would have a dedicated key, separate from an archive of Wikipedia articles (a key-generation sketch follows the list below). This provides:

      • Clear namespace separation

      • Ability to verify all content from one source

      • Easy discovery (follow the archive account)
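
      To make the separation concrete, here is a minimal key-generation sketch in Python using the cryptography library. The on-disk key store, file names, and source identifiers are illustrative assumptions, not part of SHM; the point is only that each source gets its own signing key.

      # One signing key per archived source; keys are created on first use and
      # reused afterwards. The key store layout is a hypothetical example.
      from pathlib import Path
      from cryptography.hazmat.primitives import serialization
      from cryptography.hazmat.primitives.asymmetric import ed25519

      KEY_DIR = Path("keys")  # hypothetical local key store

      def key_for_source(source_id: str) -> ed25519.Ed25519PrivateKey:
          """Load or create the dedicated signing key for one archived source."""
          KEY_DIR.mkdir(exist_ok=True)
          key_path = KEY_DIR / f"{source_id}.pem"
          if key_path.exists():
              return serialization.load_pem_private_key(key_path.read_bytes(), password=None)
          key = ed25519.Ed25519PrivateKey.generate()
          key_path.write_bytes(key.private_bytes(
              encoding=serialization.Encoding.PEM,
              format=serialization.PrivateFormat.PKCS8,
              encryption_algorithm=serialization.NoEncryption(),
          ))
          return key

      # Separate identities for separate sources:
      twitter_key = key_for_source("twitter-username")
      wikipedia_key = key_for_source("wikipedia-en")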

      2. Rich Provenance Metadata

      Every archived document includes metadata explaining its origin:

      {
        "name": "Tweet by @username - 2024-01-15",
        "archive_source": "twitter",
        "archive_source_url": "https://twitter.com/username/status/123",
        "archive_source_author": "@username",
        "archive_timestamp": "2024-01-16T10:30:00Z",
        "archive_tool": "shm-twitter-archiver/1.0",
        "archive_note": "Archived for preservation purposes"
      }

      3. Responsible Archiving

      We are archivists, not content thieves. Guidelines:

      • Always attribute the original creator

      • Link back to the original source when possible

      • Respect robots.txt and explicit no-archive requests

      • Focus on preservation value (historical, at-risk content)

      • Don't monetize others' content

    Architecture

      (Architecture diagram to be added.)

    Source-Specific Strategies

      Web Pages

      Convert HTML to SHM blocks. Preserve structure (headings, paragraphs, lists, code). Upload images to IPFS. Store original URL and archive date.
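
      A minimal sketch of the HTML-to-blocks step, assuming requests and BeautifulSoup for fetching and parsing; the block dictionaries and the tool name are simplified stand-ins for real SHM blocks, and IPFS image upload is omitted:

      # Fetch a page and flatten it into typed text blocks plus provenance metadata.
      from datetime import datetime, timezone
      import requests
      from bs4 import BeautifulSoup

      def archive_page(url: str) -> dict:
          html = requests.get(url, timeout=30).text
          soup = BeautifulSoup(html, "html.parser")
          blocks = []
          for el in soup.find_all(["h1", "h2", "h3", "p", "pre", "li"]):
              text = el.get_text(strip=True)
              if not text:
                  continue
              if el.name.startswith("h"):
                  kind = "heading"
              elif el.name == "pre":
                  kind = "code"
              elif el.name == "li":
                  kind = "list_item"
              else:
                  kind = "paragraph"
              blocks.append({"type": kind, "text": text})
          return {
              "metadata": {
                  "archive_source": "web",
                  "archive_source_url": url,
                  "archive_timestamp": datetime.now(timezone.utc).isoformat(),
                  "archive_tool": "shm-web-archiver/0.1",  # hypothetical tool name
              },
              "blocks": blocks,
          }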

      Wikipedia

      Use MediaWiki API to fetch articles. Preserve wiki markup or convert to SHM. Track revision IDs for version provenance. Great candidate for bulk archiving.
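
      A sketch of the fetch step against the public MediaWiki API (the standard action=query / prop=revisions call); the shape of the returned dictionary is an assumption for illustration:

      # Fetch the latest revision of one article, keeping the revision ID for provenance.
      import requests

      API = "https://en.wikipedia.org/w/api.php"

      def fetch_article(title: str) -> dict:
          params = {
              "action": "query",
              "prop": "revisions",
              "rvprop": "ids|timestamp|content",
              "rvslots": "main",
              "titles": title,
              "format": "json",
              "formatversion": "2",
          }
          data = requests.get(API, params=params, timeout=30).json()
          page = data["query"]["pages"][0]
          rev = page["revisions"][0]
          return {
              "title": page["title"],
              "revision_id": rev["revid"],            # track for version provenance
              "revision_timestamp": rev["timestamp"],
              "wikitext": rev["slots"]["main"]["content"],
          }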

      Twitter/X

      Archive individual tweets or entire accounts. Preserve media (images, videos). Thread reconstruction. Quote tweets as embeds. Handle deletion gracefully.
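
      Thread reconstruction is mostly a data-structure problem once the tweets are fetched. A minimal sketch, assuming each tweet is already a dict with "id" and "in_reply_to" fields (field names vary by API version and export format, so treat them as assumptions):

      # Order a self-reply chain oldest-to-newest by following reply links.
      def reconstruct_thread(tweets: list[dict]) -> list[dict]:
          by_id = {t["id"]: t for t in tweets}
          # The root is the tweet whose parent is not part of this set.
          roots = [t for t in tweets if t.get("in_reply_to") not in by_id]
          thread = []
          current = roots[0] if roots else None
          while current:
              thread.append(current)
              current = next(
                  (t for t in tweets if t.get("in_reply_to") == current["id"]), None
              )
          return thread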

      YouTube

      Archive metadata, descriptions, and transcripts. Video files are large - consider storing references or thumbnails. Channel archiving for at-risk content.
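
      A minimal sketch of metadata-only archiving using the yt-dlp Python API; the info-dictionary fields used below are common ones, but the exact set available per video is an assumption:

      # Extract metadata without downloading the video file itself.
      from yt_dlp import YoutubeDL

      def archive_video_metadata(url: str) -> dict:
          with YoutubeDL({"skip_download": True, "quiet": True}) as ydl:
              info = ydl.extract_info(url, download=False)
          return {
              "title": info.get("title"),
              "description": info.get("description"),
              "uploader": info.get("uploader"),
              "upload_date": info.get("upload_date"),
              "duration": info.get("duration"),
              "thumbnail": info.get("thumbnail"),
              "archive_source_url": url,
          }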

      RSS/Atom Feeds

      Continuous archiving of blog posts and news. Great for ongoing preservation. Each feed item becomes a document.
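
      A minimal sketch using feedparser to turn feed items into archive documents; the output shape mirrors the provenance metadata above and is an illustrative placeholder, not the SHM format:

      # One archive document per feed entry.
      import feedparser

      def archive_feed(feed_url: str) -> list[dict]:
          feed = feedparser.parse(feed_url)
          docs = []
          for entry in feed.entries:
              docs.append({
                  "name": entry.get("title", "Untitled"),
                  "archive_source": "rss",
                  "archive_source_url": entry.get("link"),
                  "archive_source_published": entry.get("published"),
                  "content": entry.get("summary", ""),
              })
          return docs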

    Profile Metadata Schema

      Each archive account's home document should clearly identify itself:

      {
        "name": "Archive: @username Twitter",
        "description": "Archived tweets from @username for preservation",
        "icon": "ipfs://...",
        
        // Custom archive metadata
        "archive_type": "twitter_account",
        "archive_subject": "@username",
        "archive_subject_url": "https://twitter.com/username",
        "archive_started": "2024-01-01",
        "archive_operator": "IonBobcat",
        "archive_policy": "Public content only, respecting deletion requests"
      }

    Implementation Roadmap

      Phase 1: Web Page Archiver

      Build a tool that takes a URL and creates an SHM document. HTML parsing, image upload, metadata extraction. This is the foundation for all other archivers.

      Phase 2: Wikipedia Archiver

      MediaWiki API integration. Bulk article archiving. Version tracking. Good test case for large-scale archiving.

      Phase 3: Social Media Archivers

      Twitter, Mastodon, Bluesky. Account-level archiving. Media handling. Thread reconstruction.

      Phase 4: Continuous Archiving

      RSS feed monitoring. Scheduled archiving jobs. Change detection and updates.
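
      Change detection can be as simple as hashing the fetched content and comparing it against the hash stored on the previous run, re-archiving only when the hash differs. A minimal sketch (the local state file is an assumed implementation detail):

      # Remember a content hash per URL and skip unchanged pages on later runs.
      import hashlib
      import json
      from pathlib import Path

      STATE = Path("seen_hashes.json")  # hypothetical local state file

      def should_rearchive(url: str, content: str) -> bool:
          seen = json.loads(STATE.read_text()) if STATE.exists() else {}
          digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
          if seen.get(url) == digest:
              return False  # unchanged since the last run
          seen[url] = digest
          STATE.write_text(json.dumps(seen, indent=2))
          return True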

    Technical Challenges

      • Rate limiting and API access for social platforms

      • Large media files (videos, high-res images)

      • Dynamic content (JavaScript-rendered pages)

      • Authentication for private/protected content

      • Storage costs for large archives

      • Legal considerations (DMCA, copyright, terms of service)

    Why This Matters

      The Internet Archive does incredible work, but it is a single, centralized point of failure. Wikipedia articles can be rewritten or deleted. Twitter can ban accounts and take their tweets with them. Websites can be seized. By creating decentralized, cryptographically signed archives, we ensure that important content survives regardless of platform politics or technical failures.

      This is digital preservation for the long term. Our archives could outlive the platforms they came from.