DWeb Archiving was a session at IndieWebCamp SF 2018.
Notes archived from: https://etherpad.indieweb.org/DWebArchiving
IndieWebCamp SF 2018
Session: DWeb Archiving
When: 2018-07-31 12:30
Participants
- Sawood Alam
- dietrich
- Add yourself here…
Notes
- web archiving
- how to make it decentralized
- IA (Internet Archive) is the largest, but there are many others
- countries are investing, for cultural preservation
- but these efforts are often short-lived
- Archive-It (archive-it.org)
- IA's subscription service
- European efforts - large national collections/archives
- one such archive then changed its domain name
- which broke every link
- our aggregator detected something was wrong
- they did make an announcement, but with no details
- had to look in the IA to find the new location (WTF)
- -> need to decentralize
- archives, even when stable and well run, can still be shut down by govt, etc
- the short life and history of the web is in these few archives
- newspapers survive as hard copies, even from 200 years ago
- but try finding a website from 10 years ago
- Old Dominion University
- lab in the CS dept
- questions like "how much of the web is archived"
- IPWayback (InterPlanetary Wayback, ipwb)
- hackathon project
- warc files
- HTTP messages + extra WARC headers
- records are concatenated serially into one file
- a way to avoid inode limits
- no worries about file types, etc
- byte offsets in the file locate individual records (see the sketch below)
- higher level tools: indexers, storage compression, etc
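A rough sketch of how those offsets get used (not from the session; assumes the warcio Python library and a hypothetical example.warc.gz):

    from warcio.archiveiterator import ArchiveIterator

    # A CDX(J) index entry maps a URI + capture datetime to a byte offset in the
    # WARC file. Because each record in a .warc.gz is its own gzip member, you can
    # seek straight to that offset and read a single record without a full scan.
    offset = 0  # hypothetical; in practice this comes from a CDX index lookup
    with open("example.warc.gz", "rb") as stream:
        stream.seek(offset)
        record = next(iter(ArchiveIterator(stream)))
        print(record.rec_headers.get_header("WARC-Target-URI"))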
- problems: lots of duplicated content to dedupe, etc
- leverage the deduping that content addressability gives you
- store the headers and the payload separately
- in the index, instead of an offset, store two hashes: one for the headers, one for the payload (see the sketch below)
- doesn't matter where it's stored anymore - location independent
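A minimal sketch of the split-hash idea above, not ipwb's actual code (ipwb adds the two parts to IPFS and records the returned hashes in a CDXJ index); this version just derives the two digests locally, assuming the warcio library and a local example.warc.gz:

    import hashlib
    from warcio.archiveiterator import ArchiveIterator

    # Walk a WARC file and derive two digests per response record: one for the
    # HTTP headers, one for the payload. Identical payloads collapse to the same
    # digest no matter where they were captured, which is where the deduplication
    # and location independence come from.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            headers = record.http_headers.to_str().encode("utf-8")
            payload = record.content_stream().read()
            print(record.rec_headers.get_header("WARC-Target-URI"))
            print("  headers:", hashlib.sha256(headers).hexdigest())
            print("  payload:", hashlib.sha256(payload).hexdigest())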
- index is not decentralized yet
- ipns may be a solution - url path -> digest
- ipns doesn't have history though
- one way is using a blockchain for history
- then you can get the history and when each change happened (toy illustration below)
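A toy illustration of that gap (hypothetical code, not a real IPNS or blockchain API): IPNS-style resolution only returns the latest digest, while an append-only log can also answer "what was this at time x":

    from datetime import datetime, timezone

    history = {}  # url path -> [(timestamp, content digest), ...]

    def publish(url, digest):
        # Append-only log: every update is kept, with the time it happened.
        history.setdefault(url, []).append((datetime.now(timezone.utc), digest))

    def resolve(url, at=None):
        entries = history[url]
        if at is None:
            return entries[-1]  # IPNS-style: latest mapping only
        # Memento-style: the latest entry at or before the requested datetime
        return max((e for e in entries if e[0] <= at), key=lambda e: e[0])

    publish("example.com/", "digest-v1")  # hypothetical digests
    publish("example.com/", "digest-v2")
    print(resolve("example.com/"))        # latest only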
- protocol called Memento - has an RFC (RFC 7089)
- you can query "example.com at date time x"
- can interop across different archives (see the TimeGate sketch below)
- ipns blockchain on github
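A minimal sketch of the Memento query described above, using Python's requests against the public Time Travel TimeGate (per RFC 7089 a TimeGate typically answers with a redirect to the memento closest to the requested datetime):

    import requests

    # Content negotiation in the datetime dimension: ask for example.com as it
    # looked around a given moment.
    resp = requests.get(
        "http://timetravel.mementoweb.org/timegate/http://example.com/",
        headers={"Accept-Datetime": "Tue, 31 Jul 2018 12:30:00 GMT"},
        allow_redirects=False,
    )
    print(resp.status_code)               # typically 302
    print(resp.headers.get("Location"))   # URI of the closest memento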
- some namespacing would help - for sets of what is archived
- IPLD may help - lazy evaluation to find all the archived copies of a given page
- namespaces would allow different views - logged-in vs geographic differences in rendering, etc
- standardized way of aggregating across those differences
- how to protect personal info in the archive
- how to do personal web archiving
- dweb storage is not magic
- someone still has to actually store the bytes
- need a resilient way of preserving/serving if someone dies, goes offline, etc
- ipfs cluster and other tools
- British Library
- aware of ipfs, but not taken seriously
- can't say blockchain
- so much invested in status quo
- still just finally got the hang of http
- how to elevator pitch decentralization
- branding
- "search" means google
- "wayback machine" - no, there are more
- IIPC, and more
- some countries block IA
- MemGator - Memento aggregator
- collects from 12-13 different web archives
- at one point IA went down from a DDoS
- but the MemGator demo still worked because it pulled from the other archives (see the sketch below)
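A sketch of what that demo does, assuming a local MemGator instance on its default port (1208) and its /timemap/link/ route:

    import requests

    # Fetch an aggregated TimeMap: the mementos for a URI collected across all
    # the archives MemGator polls. If one upstream archive is down, the rest
    # still contribute entries.
    timemap = requests.get("http://localhost:1208/timemap/link/http://example.com/")
    for line in timemap.text.splitlines():
        if 'rel="memento"' in line:
            print(line.strip())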
- YaCy
- a body of knowledge without search is not useful
- focus on critical availability
- auto-balance availability of resources across a DHT
- web archiving people + FFF
Archiving Tools
- https://oduwsdl.github.io/Reconstructive/
- https://github.com/oduwsdl/ipwb
- https://github.com/oduwsdl/MemGator
- https://universalviewer.io/
Remote Thoughts
Hashing of content is a great way to deal with archiving, if we can get uniform hashes and query-by-hash across systems. Current systems tend to suck at this - they hash not the resource itself, but the resource plus some local context, so the same resource gets a different hash. IA uses SHA1 by default, but I've been waiting for them to ship query-by-SHA1 since 2016.
see https://www.svgshare.com/dweb for an example
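A small illustration of the point above (hypothetical archives and payloads): hashing the resource alone gives a uniform, queryable digest, while hashing resource-plus-local-context breaks cross-system lookup:

    import hashlib

    # The same archived payload, wrapped in different local context by two
    # hypothetical archives (storage metadata, response headers, etc).
    payload = b"<html>the same archived page body</html>"
    context_a = b"X-Archive-Src: archive-a.warc.gz\r\n"
    context_b = b"X-Archive-Src: archive-b.warc.gz\r\n"

    # Hash the resource alone: both archives derive the same digest, so a
    # query-by-hash can be answered by either of them.
    print(hashlib.sha1(payload).hexdigest())

    # Hash resource + context: the digests diverge and lookup by hash breaks.
    print(hashlib.sha1(context_a + payload).hexdigest())
    print(hashlib.sha1(context_b + payload).hexdigest())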
The Reynolds Journalism Institute hosts an annual news/archiving conference; a lot of what they discuss covers mass archiving: https://www.rjionline.org/search/results?q=dodging+the+memory+hole