From IndieWeb

wget is a Unix command-line utility for recursively downloading and archiving web pages.

How To

Most Linux distros come with wget preinstalled. If not, install it with your distribution's package manager.
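For example, on a Debian-based distro (package managers vary; use dnf, pacman, etc. elsewhere):

```shell
# Debian/Ubuntu; the package is simply named "wget"
sudo apt-get install wget
```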

Mac OS X

  • I installed wget via homebrew, and don’t recall having any problems, although homebrew can be a pain if you install lots of stuff from source. brew install wget --Waterpigs.co.uk 09:39, 26 April 2013 (PDT)

Archive a Site

Creating a full mirror of a site can be a challenge, especially if any of the site's content is loaded via JavaScript. Assuming you have a typical HTML-only site, you can create a mirror with the following command:

wget --mirror --page-requisites --convert-links -e robots=off -P . http://example.com/

Add --span-hosts --domains=example.com,s3.amazonaws.com,subdomain.example.com if assets required to display the pages live on other domains or subdomains. (Be sure to include the primary domain in the list too; --domains only takes effect when --span-hosts allows wget to leave the starting host.)
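For example, a mirror that also pulls required assets from S3 might look like this (the extra domains here are illustrative):

```shell
# --span-hosts lets wget leave the starting host;
# --domains restricts recursion to the listed hosts only
wget --mirror --page-requisites --convert-links -e robots=off \
  --span-hosts --domains=example.com,s3.amazonaws.com \
  -P . http://example.com/
```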

wget will fetch the first page, then recursively follow every link it finds on the same domain (including CSS, JS, and images), saving everything into a folder in the current directory named after the site's domain.

You can then browse through all the site's files, and most links should work, since the --convert-links option will rewrite the links it finds in the HTML to the local version. You can also configure a local web server to serve this folder at a URL.
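For quick local browsing, any static file server will do; for example (assuming Python 3 is available):

```shell
# serve the mirrored folder at http://localhost:8080/
cd example.com
python3 -m http.server 8080
```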

To download everything, including images, JavaScript files, etc., use:

wget \
-e robots=off \
--timeout=360 \
--no-clobber \
--no-directories \
--adjust-extension \
--span-hosts \
--wait=1 \
--random-wait \
--convert-links \
--page-requisites \
--directory-prefix=[dir to save to] \
http://example.com/

Serve the archive with nginx

server {
  listen 80;
  server_name archive.example.com;
  root /web/sites/example.com;  # set this to wherever your archive is stored
  index index.html;
  default_type text/html; # treat files with unknown extensions as HTML rather than prompting the browser to download them
}
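After saving the configuration, validate it and reload nginx (these commands assume a systemd-based setup; adjust for yours):

```shell
sudo nginx -t                 # check the config for syntax errors
sudo systemctl reload nginx   # apply the change without downtime
```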

See Also