Date: Sun, 4 Dec 2022 20:04:11 -0800
From: Bret Victor
Subject: Re: Dynamicland website
I was working my way through editing archive items, and midway through year two I did a little temporal extrapolation and realized that this website was going to be a longer project than my usual projects, and if I kept at it with my usual intensity, I was going to burn out.

So I gave myself a "milestone" -- I would edit the archive through the end of 2015, and then build all the infrastructure to make it into an actual on-line web-site, and that would be the end of this milestone.

I am relieved to announce the end of this milestone.

secret temporary website

http://dynamic.land/archive
*****************
******************

What is there:

 - a first pass on items through the end of 2015, meaning that there are descriptions, media, and emails that are somewhat organized, with some exceptions (e.g. the hypercard in the world stuff is waiting on me editing those videos, a lot of the "inserts" will get cleaned up when there are more items ready)

 - a zeroeth pass on a handful of items after 2015, meaning mostly just email threads

 - a first pass on the design

Still a ton more work to do, but it doesn't not exist.

Besides "content", Archive Kit can now transform pieces of paper representing our private thoughts into an actual public website on the internet.  There's a bit more to do here, but it's in pretty good shape.  A "content management system" involves a lot of dumb technical details (media transcoding, syncing with servers, etc) and I am very happy (to believe) that I've come out the other side on this.

Behold:


The archive website is truly generated by Realtalk.  Our inputs are a binder full of cards, and the poster above.  Make some edits, wish the archive is deployed, and it's now online.  There's intended to be very little need for mucking around in server directories with ssh, etc.

I'll take you down the pages on the left side of the poster:


Mount archive drive
If the archive drive is plugged in on startup, this page mounts it and claims it's mounted, which kicks off everything else in Archive Kit.

Archive paths
The archive drive has a lot going on.  This page defines functions that generate pathnames for all the different kinds of files.

  sources/                all original files. Other than adding mirrored files and tags, we don't touch this.

  

  archived/               all generated by Archive Kit
     originals/             symlinks from archived media filenames to original files in sources
     email-bodies/          HTML extracted from mboxes
     attachments/           attachment files extracted from mboxes
     media/                 stripped, downsampled derivations of files: full, preview, thumbnail
     items/                 archive item webpages, with local media and email URLs 
     items-cdn/             archive item webpages, with CDN media and email URLs
     emails/                email webpages, with local media URLs
     emails-cdn/            email webpages, with CDN media URLs

     

  dynamicland.org/        rsync with web host
     archive-resources/     javascript, css, fonts, etc
     archive/               symlinks to the public subset of archived/items-cdn
     archived-emails/       symlinks to the public subset of archived/emails-cdn

     

  files.dynamicland.org/  sync with CDN
     archived-media/        symlinks to the public subset of archived/media

This page also makes symlinks in cache/www so you can browse the website on the local machine.  But all archive data is stored on the archive drive, not on the local drive or in Realtalk refs.  This means that you can work on the archive from any machine that you've plugged the drive into.

Archived media
This page kicks off the media and email processing pipeline described over the next pages, and presents the results with two claims:

    Claim the archived media files are (files).
    Claim the archived emails are (msgs).

Any page that needs to know about archived media will Notice these claims, and will thus automatically update when anything new gets added.

Scan archived media
Looks through the "sources/Photos and videos" directory for media files, and gets their info (via the "Media info" page).  

Like many of the following steps, this can be time consuming, so these steps use a caching mechanism which only checks what's changed, and they run as coroutines and display feedback on progress.

Tag archived media
Deduplicates media files by md5 (we have a lot of duplicates squirreled away in various directories!) and forms tags from the pathnames, as well as any manually-added tags.

Downsample archived media
Some media files are presented as-is (pdf, gif, png), but all others are re-encoded into a standard format (jpg, mp4, m4a) with all metadata stripped off except for the date and color profile.  Images are encoded at "preview" size (900 px small dimension) and "full" size.  Videos are encoded at 720 px small dimension.  Every file also gets a 240x240 square thumbnail.

We use imagemagick for images and ffmpeg for video and audio, and it was a bit of a slog to figure out all of the right options to properly deal with all our files.  There's nothing like letting the downsampler run overnight, and finding the next morning that all of the portrait-mode media had been stretched to landscape.

Scan archived mailboxes
Looks through the "sources/Mailboxes" directory for mbox files, parses them (via the "Reading mailboxes" page), generates presentable html for the body of each message, and saves attachments.

Redacting emails
Attempts to make a message suitable for public display.  Redacts passwords, email addresses, phone numbers, door codes, links to google maps and calendars, and "special requests", which include names of certain people and projects outside our group, and at least one paragraph that I wrote to Alan that I shouldn't have sent to the group.  Redactions appear as a tasteful light-grey bar.

Generate email webpages
Generates the final html file for each message, from the body and attachments saved earlier.  This mostly amounts to adding the style sheet and message header, and it's its own pass because I wanted to be able to quickly change the style sheet without having to process all the mboxes again (which takes ten minutes or so).

We generate two copies of each webpage, one for local browsing that uses local URLs for the media files ("/archived-media/2014/02/...") and one for uploading that uses URLs to the CDN ("https://files.dynamicland.org/archived-media/2014/02").  

Parsing archive text
Our archive items are represented as playing cards with wiki-like markup on them.  This page parses that markup and turns it into html and a table of metadata.

Archive translator
Generates the full html for each item.  The page above supplies the "content", but this page adds the frame around the content, including links to the previous and next items.  This page also acts as the language translator, so pages of the form

    >> 2015/Laser Socks [in DL archive]

are translated into

    Claim (you) is an archive item with HTML (html) options (metadata).

which is used for printing and other things.

Generate archive webpages
Writes the html file for each archive item, and for each archive "insert" (those black-background synopses that tie together multiple items -- I don't know what to call them yet.)  Also makes claims about the archive items.  (These claims are made for all items collected in the archive, as opposed to the translator's claim above, which is only made by cards that are currently present.)

    Claim (site) collects archive "item" (p) with options (metadata).
    Claim the listed archive items are (items).

    

    Claim (site) collects archive "insert" (p) with options (metadata).
    Claim the archive inserts are (inserts).

The index and timeline are generated from the "listed" archive items, which exclude items marked "unlisted".  Like an unlisted youtube video, we can hide items into our archive that can only be accessed by knowing the URL.

Generate archive timeline
Generates the svg for the timeline at the top of the archive website.

Generate public archives
The archive drive holds our own personal archive -- all media and all emails.  Most of that will never touch a public server.  The only files that get uploaded are those that are explicitly referenced in an archive item, or in an email that's referenced in an archive item.  And only the downsampled versions, with metadata stripped.

This page figures out which files should be public, and creates directory trees of symlinks, suitable for rsyncing.

Archive item webpages, email webpages, and "resources" (javascript, css, fonts) are rsync'd with the dynamicland.org server (currently DreamHost).

Media files are synced with Amazon S3 via Amazon's command-line tool.  Our S3 "bucket" is in northern California, but will eventually be cached around the world via Cloudfront's CDN, and will be accessible as files.dynamicland.org. Apparently this is all pretty inexpensive these days, and using DreamHost as a video server is probably a bad idea.

Kick off the syncing with:

    Wish the public archive is deployed.


That's basically Archive Kit.  These are all one-pagers except for a couple two-pagers.  On the right of the posters are the "resources", and some general utilities.  And then there's a handful of ad-hoc tools that I've thrown together for wrangling the archive: