DWeb Archiving was a session at IndieWebCamp SF 2018.
Notes archived from: https://etherpad.indieweb.org/DWebArchiving
IndieWebCamp SF 2018
Session: DWeb Archiving
When: 2018-07-31 12:30
Participants
- Sawood Alam
- dietrich
- Add yourself here…
Notes
- web archiving
- how to make it decentralized
- IA (Internet Archive) is the largest, but there are many others
- countries are investing, for cultural preservation
- but these efforts are often short-lived
- Archive-It (archive-it.org)
- IA's subscription service
- European efforts - large national collections/archives
- one such archive then changed its domain name
- which broke every link
- our aggregator detected something was wrong
- they did make an announcement, but with no details
- had to look in the IA to find the new location (WTF)
- -> need to decentralize
- archives, even when stable and well run, can still be shut down by govt, etc
- the short life and history of the web is in these few archives
- newspapers survive as hard copies, even from 200 years ago
- but try finding a website from 10 years ago
- Old Dominion University
- lab in the CS dept
- questions like "how much of the web is archived"
- IPWayback (InterPlanetary Wayback, ipwb)
- hackathon project
- warc files
- HTTP messages + extra WARC headers
- records are concatenated serially into one file
- a way to avoid inode limits
- no worries about file types, etc
- byte offsets in the file locate individual records (see the sketch below)
- higher level tools: indexers, storage compression, etc
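A rough sketch of how those offsets get used (not from the session; assumes the warcio Python library and a hypothetical example.warc.gz):

    from warcio.archiveiterator import ArchiveIterator

    # A CDX(J) index entry maps a URI + capture datetime to a byte offset in the
    # WARC file. Because each record in a .warc.gz is its own gzip member, you can
    # seek straight to that offset and read a single record without a full scan.
    offset = 0  # hypothetical; in practice this comes from a CDX index lookup
    with open("example.warc.gz", "rb") as stream:
        stream.seek(offset)
        record = next(iter(ArchiveIterator(stream)))
        print(record.rec_headers.get_header("WARC-Target-URI"))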
- problems: lots of duplicated content to dedupe, etc
- leverage the deduping that content addressability gives you
- store the headers and the payload separately
- in the index, instead of an offset, store two hashes: one for the headers, one for the payload (see the sketch below)
- doesn't matter where it's stored anymore - location independent
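A minimal sketch of the split-hash idea above, not ipwb's actual code (ipwb adds the two parts to IPFS and records the returned hashes in a CDXJ index); this version just derives the two digests locally, assuming the warcio library and a local example.warc.gz:

    import hashlib
    from warcio.archiveiterator import ArchiveIterator

    # Walk a WARC file and derive two digests per response record: one for the
    # HTTP headers, one for the payload. Identical payloads collapse to the same
    # digest no matter where they were captured, which is where the deduplication
    # and location independence come from.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            headers = record.http_headers.to_str().encode("utf-8")
            payload = record.content_stream().read()
            print(record.rec_headers.get_header("WARC-Target-URI"))
            print("  headers:", hashlib.sha256(headers).hexdigest())
            print("  payload:", hashlib.sha256(payload).hexdigest())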
- index is not decentralized yet
- ipns may be a solution - url path -> digest
- ipns doesn't have history though
- one way is using a blockchain for history
- then you can get the history and when each change happened (toy illustration below)
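A toy illustration of that gap (hypothetical code, not a real IPNS or blockchain API): IPNS-style resolution only returns the latest digest, while an append-only log can also answer "what was this at time x":

    from datetime import datetime, timezone

    history = {}  # url path -> [(timestamp, content digest), ...]

    def publish(url, digest):
        # Append-only log: every update is kept, with the time it happened.
        history.setdefault(url, []).append((datetime.now(timezone.utc), digest))

    def resolve(url, at=None):
        entries = history[url]
        if at is None:
            return entries[-1]  # IPNS-style: latest mapping only
        # Memento-style: the latest entry at or before the requested datetime
        return max((e for e in entries if e[0] <= at), key=lambda e: e[0])

    publish("example.com/", "digest-v1")  # hypothetical digests
    publish("example.com/", "digest-v2")
    print(resolve("example.com/"))        # latest only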
- protocol called Memento - has an RFC (RFC 7089)
- you can query "example.com at date time x"
- can interop across different archives (see the TimeGate sketch below)
- ipns blockchain on github
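A minimal sketch of the Memento query described above, using Python's requests against the public Time Travel TimeGate (per RFC 7089 a TimeGate typically answers with a redirect to the memento closest to the requested datetime):

    import requests

    # Content negotiation in the datetime dimension: ask for example.com as it
    # looked around a given moment.
    resp = requests.get(
        "http://timetravel.mementoweb.org/timegate/http://example.com/",
        headers={"Accept-Datetime": "Tue, 31 Jul 2018 12:30:00 GMT"},
        allow_redirects=False,
    )
    print(resp.status_code)               # typically 302
    print(resp.headers.get("Location"))   # URI of the closest memento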
- some namespacing would help - for sets of what is archived
- IPLD may help - lazy evaluation to find all the archived copies of a given page
- namespaces would allow different views - logged-in vs geographic differences in rendering, etc
- standardized way of aggregating across those differences
- how to protect personal info in the archive
- how to do personal web archiving
- dweb storage is not magic
- someone still has to actually store the bytes
- need a resilient way of preserving/serving if someone dies, goes offline, etc
- ipfs cluster and other tools
- British Library
- aware of ipfs, but not taken seriously
- can't say blockchain
- so much invested in status quo
- still just finally got the hang of http
- how to elevator pitch decentralization
- branding
- "search" means google
- "wayback machine" - no, there are more
- IIPC, and more
- some countries block IA
- MemGator - Memento aggregator
- collects from 12-13 different web archives
- at one point IA went down from a DDoS
- but the MemGator demo still worked because it pulled from the other archives (see the sketch below)
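A sketch of what that demo does, assuming a local MemGator instance on its default port (1208) and its /timemap/link/ route:

    import requests

    # Fetch an aggregated TimeMap: the mementos for a URI collected across all
    # the archives MemGator polls. If one upstream archive is down, the rest
    # still contribute entries.
    timemap = requests.get("http://localhost:1208/timemap/link/http://example.com/")
    for line in timemap.text.splitlines():
        if 'rel="memento"' in line:
            print(line.strip())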
- YaCy
- a body of knowledge without search is not useful
- focus on critical availability
- auto-balance availability of resources across a DHT
- web archiving people + FFF
Archiving Tools
- https://oduwsdl.github.io/Reconstructive/
- https://github.com/oduwsdl/ipwb
- https://github.com/oduwsdl/MemGator
- https://universalviewer.io/
Remote Thoughts
Hashing of content is a great way to deal with archiving, if we can get uniform hashes and query-by-hash across systems. Current systems tend to suck at this - they hash not the resource itself, but the resource plus some local context, so the same resource gets a different hash. IA uses SHA1 by default, but I've been waiting for them to ship query-by-SHA1 since 2016.
see https://www.svgshare.com/dweb for an example
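A small illustration of the point above (hypothetical archives and payloads): hashing the resource alone gives a uniform, queryable digest, while hashing resource-plus-local-context breaks cross-system lookup:

    import hashlib

    # The same archived payload, wrapped in different local context by two
    # hypothetical archives (storage metadata, response headers, etc).
    payload = b"<html>the same archived page body</html>"
    context_a = b"X-Archive-Src: archive-a.warc.gz\r\n"
    context_b = b"X-Archive-Src: archive-b.warc.gz\r\n"

    # Hash the resource alone: both archives derive the same digest, so a
    # query-by-hash can be answered by either of them.
    print(hashlib.sha1(payload).hexdigest())

    # Hash resource + context: the digests diverge and lookup by hash breaks.
    print(hashlib.sha1(context_a + payload).hexdigest())
    print(hashlib.sha1(context_b + payload).hexdigest())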
The Reynolds Journalism Institute hosts an annual news/archiving conference; a lot of what they discuss covers mass archiving: https://www.rjionline.org/search/results?q=dodging+the+memory+hole