Using WGET to Mirror from Internet Archive

Simple wget mirroring commands will not work on the Internet Archive because the site has some JavaScript-based protection. Mirroring a website from the Internet Archive is against their TOS (or so I’ve read). If the rules don’t apply to you, here’s how to get around that and use wget to mirror a website from archive.org.

Sample:

wget --exclude-domains rail2000.org -e robots=off -nH --cut-dirs=2 \
  --base=http://web.archive.org/web/20010202020600/http://www.rail2000.org/ \
  -r -l 3 -N -k -p -R js -Gbase \
  http://web.archive.org/web/20010202020600/www.rail2000.org/
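If you want to reuse the sample for another snapshot, it can be wrapped in a small shell function. This is only a sketch: the helper names wayback_url and mirror_snapshot are made up for illustration, and the host is assumed to be given with its "www." prefix as in the sample. It leaves out -Gbase, which at least one reader's wget did not accept.

```shell
#!/bin/sh
# Hypothetical helpers; these names do not come from the original post.

# Build a Wayback Machine snapshot URL from a timestamp and a host name.
wayback_url() {
    printf 'http://web.archive.org/web/%s/%s/\n' "$1" "$2"
}

# Run the sample wget invocation for one snapshot.
#   $1 = snapshot timestamp, e.g. 20010202020600
#   $2 = archived host,      e.g. www.rail2000.org
mirror_snapshot() {
    ts=$1
    host=$2
    # Drop a leading "www." so --exclude-domains gets the bare domain,
    # matching the sample command.
    domain=${host#www.}
    # -e robots=off  ignore robots.txt
    # -nH            do not create a host-name directory
    # --cut-dirs=2   drop the leading web/<timestamp> path components
    # -r -l 3        recurse, at most 3 levels deep
    # -N -k -p       timestamping, convert links, fetch page requisites
    # -R js          reject files with a "js" suffix
    wget --exclude-domains "$domain" -e robots=off -nH --cut-dirs=2 \
        --base="http://web.archive.org/web/$ts/http://$host/" \
        -r -l 3 -N -k -p -R js \
        "$(wayback_url "$ts" "$host")"
}
```

Calling mirror_snapshot 20010202020600 www.rail2000.org should then run the same command as the sample, minus -Gbase.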

4 thoughts on “Using WGET to Mirror from Internet Archive”

  1. It still works. wget doesn’t seem to understand the -Gbase switch, but with that removed it does the right thing.

    If you were to use this (hypothetically) to mirror a site, you might still want to use (shell script + sed) or perl to strip out the Internet Archive’s toolbar. This wouldn’t be too hard as it is clearly marked in each file.

  2. Why worry about robots.txt when the blog on archive.org contains a posting about how to use wget to bulk download from archive.org? The posting explicitly mentions “-e robots=off” to work around the robots exclusion file.
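The sed approach suggested in the first comment could look something like the sketch below. It assumes the toolbar is wrapped in a pair of HTML comments reading BEGIN WAYBACK TOOLBAR INSERT and END WAYBACK TOOLBAR INSERT, each on its own line; check a downloaded file and adjust the patterns if your snapshots are marked differently. The function name strip_wayback_toolbar is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper; assumes the toolbar sits between these two marker
# comments, each on its own line. Verify against your own files.
strip_wayback_toolbar() {
    # Delete every line from the BEGIN marker through the END marker.
    # Reads the named files, or standard input if none are given.
    sed '/<!-- BEGIN WAYBACK TOOLBAR INSERT -->/,/<!-- END WAYBACK TOOLBAR INSERT -->/d' "$@"
}
```

For a single file, strip_wayback_toolbar index.html > index.clean.html writes a cleaned copy. Because sed deletes whole lines, any page content that shares a line with one of the markers would be removed too.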
