Download one page from a website with all its prerequisites

Often I want to simply back up a single page from a website. Until now I always had half-working solutions, but today I found one solution using wget which works really well, and I decided to document it here. That way I won’t have to search for it again, and you, dear readers, can benefit from it, too ☺

Update 2020: You can also use the copyweb script from pyFreenet: copyweb -d TARGET_FOLDER URL
Install it via pip3 install --user pyFreenet3.
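
For example, to save this very page into a folder called saved-page (the folder name is just a placeholder):

copyweb -d saved-page http://draketo.de/english/download-web-page-with-all-prerequisites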

In short, this is the command:

wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories --span-hosts --adjust-extension --no-check-certificate -e robots=off -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4' [URL]

Optionally add --directory-prefix=[target-folder-name]

(see The meaning of the options and Getting wget below for an explanation)

That’s it! Have fun copying single pages! (but before passing them on, ensure that you have the right to do so)

Does this really work?

As a test, how about running this:

wget -np -N -k -p -nd -nH -H -E --no-check-certificate -e robots=off -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4' --directory-prefix=download-web-site http://draketo.de/english/download-web-page-with-all-prerequisites

(this command uses the short forms of the options)

Then test the downloaded page with Firefox:

firefox download-web-site/download-web-page-with-all-prerequisites.html

Getting wget

If you run GNU/Linux, you likely already have it - and if not, then your package manager has it. GNU wget is one of the standard tools available everywhere.
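
For example, on Debian- or Fedora-based systems one of these should be enough (adjust to your distribution’s package manager):

sudo apt install wget
sudo dnf install wget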

Some information in the (sadly) typically terse style can be found on the wget website from the GNU project: gnu.org/s/wget.

In case you run Windows, have a look at Wget for Windows from the gnuwin32 project or at GNU Wget for Windows from eternallybored.

Alternatively you can get a graphical interface via WinWGet from cybershade.

Or you can get serious about having good tools and install MSYS or Cygwin - the latter gets you some of the functionality of a Unix working environment on Windows, including wget.

If you run Mac OS X, either get wget via Fink, Homebrew or MacPorts, or follow the guide from osxdaily or the German guide from dirk (likely there are more guides - these two were just the first hits in Google).
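
For example, assuming you already have Homebrew or MacPorts set up, one of these should be enough:

brew install wget
sudo port install wget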

The meaning of the options (and why you need them):

  • --no-parent: Only get this file, not other articles higher up in the filesystem hierarchy.
  • --timestamping: Only get newer files (don’t redownload files).
  • --page-requisites: Get all files needed to display this page.
  • --convert-links: Change files to point to the local files you downloaded.
  • --no-directories: Do not create directories: Put all files into one folder.
  • --no-host-directories: Do not create separate directories per web host: Really put all files in one folder.
  • --span-hosts: Get files from any host, not just the one with which you reached the website.
  • --adjust-extension: Add a .html extension to the file.
  • --no-check-certificate: Do not check SSL certificates. This is necessary if your system is missing the certificate of one of the hosts the page pulls files from. Just use this. If people with enough power to snoop on your browsing wanted to serve you a changed website, they could simply use one of the fake certificate authorities they control.
  • -e robots=off: Ignore robots.txt files which tell you not to spider and save this website. You are no robot, but wget does not know that, so you have to tell it.
  • -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4': Fake being an old Firefox to avoid blocking based on being wget.
  • --directory-prefix=[target-folder-name]: Save the files into a subfolder to avoid having to create the folder first. Without that option, all files are created in the folder in which your shell is at the moment. Equivalent to mkdir [target-folder-name]; cd [target-folder-name]; [wget without --directory-prefix] (see the sketch after this list).
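
To spell out that last equivalence as a sketch ([target-folder-name] is just a placeholder):

wget --directory-prefix=[target-folder-name] [other options] [URL]

does roughly the same as

mkdir [target-folder-name]
cd [target-folder-name]
wget [other options] [URL]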

Conclusion

If you know the required options, mirroring single pages from websites with wget is fast and easy.

Note that if you want to get the whole website, you can just replace --no-parent with --mirror.
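
As a sketch, the whole-website command would then look like this (with recursion enabled you might also want to drop --span-hosts, so the download does not wander off to other sites):

wget --mirror --timestamping --convert-links --page-requisites --no-directories --no-host-directories --span-hosts --adjust-extension --no-check-certificate -e robots=off -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4' [URL]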

Happy Hacking!
