Often I want to simply back up a single page from a website. Until now I only had half-working solutions, but today I found a solution using wget which works really well, and I decided to document it here. That way I won’t have to search for it again, and you, dear readers, can benefit from it, too ☺
wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories --span-hosts --adjust-extension --no-check-certificate -e robots=off -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4' [URL]
That’s it! Have fun copying single pages! (But before passing them on, ensure that you have the right to do so.)
As a test, how about running this:
wget -np -N -k -p -nd -nH -H -E --no-check-certificate -e robots=off -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4' --directory-prefix=download-web-site http://draketo.de/english/download-web-page-with-all-prerequisites
(this command uses the short forms of the options)
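Then test the downloaded page with Firefox (with the options above, wget should save the page under this name):
firefox download-web-site/download-web-page-with-all-prerequisites.html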
If you run GNU/Linux, you likely already have wget, and if not, your package manager has it. GNU wget is one of the standard tools available everywhere.
Some information in the (sadly) typically terse style can be found on the wget website from the GNU project: gnu.org/s/wget.
Alternatively you can get a graphical interface via WinWGet from cybershade.
If you run Mac OS X, either get wget via Fink, Homebrew or MacPorts, or follow the guide from osxdaily or the German guide from dirk (there are likely more guides; these two were just the first hits on Google).
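To check quickly whether wget is already installed, something like the following minimal sketch should do. The helper name have_tool is made up for this example; command -v is the standard POSIX way to look up a command.

```shell
#!/bin/sh
# have_tool NAME: succeed if NAME can be found as a command.
# (have_tool is a hypothetical helper name, not part of wget.)
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool wget; then
    # Print only the first line of the version banner.
    wget --version | head -n 1
else
    echo "wget not found; install it with your package manager" >&2
fi
```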
--no-parent: Only get this file, not other articles higher up in the filesystem hierarchy.
--timestamping: Only get newer files (don’t redownload files).
--page-requisites: Get all files needed to display this page.
--convert-links: Change the links in the downloaded files so that they point to the local files you downloaded.
--no-directories: Do not create directories: Put all files into one folder.
--no-host-directories: Do not create separate directories per web host: Really put all files in one folder.
--span-hosts: Get files from any host, not just the one with which you reached the website.
--adjust-extension: Add a .html extension to downloaded HTML files whose names do not already end in one.
--no-check-certificate: Do not check SSL certificates. This is necessary if your system is missing a certificate that one of the hosts uses. Just use this. If people with enough power to snoop on your browsing wanted to serve you a changed website, they could simply use one of the fake certificate authorities they control.
-e robots=off: Ignore robots.txt files which tell you to not spider and save this website. You are no robot, but wget does not know that, so you have to tell it.
-U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4': Pretend to be an old Mozilla browser (SeaMonkey) to avoid being blocked for being wget.
--directory-prefix=[target-folder-name]: Save the files into a subfolder to avoid having to create the folder first. Without that option, all files are created in the folder your shell is currently in. Equivalent to
mkdir [target-folder-name]; cd [target-folder-name]; [wget without --directory-prefix]
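If you do not want to retype the options every time, you could wrap them in a small shell function. The following is only a sketch under a few assumptions: the name save_page, the default folder saved-page and the DRY_RUN switch are all made up for this example, and I left out -U and --no-check-certificate to keep it short (add them back if a site needs them).

```shell
#!/bin/sh
# save_page URL [TARGET_DIR]: hypothetical wrapper around the wget call above.
# Set DRY_RUN=1 to print the command instead of running it.
save_page() {
    url=$1
    target=${2:-saved-page}   # assumed default folder name
    if [ -z "$url" ]; then
        echo "usage: save_page URL [TARGET_DIR]" >&2
        return 2
    fi
    # Build the command line as the function's positional parameters.
    set -- wget --no-parent --timestamping --convert-links --page-requisites \
        --no-directories --no-host-directories --span-hosts --adjust-extension \
        -e robots=off --directory-prefix="$target" "$url"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"    # show what would be run
    else
        "$@"                  # actually run wget
    fi
}
```

With this in your shell startup file, save_page [URL] my-folder would fetch the page into my-folder, and setting DRY_RUN=1 first lets you inspect the command before anything is downloaded.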
If you know the required options, mirroring single pages from websites with wget is fast and easy.
Note that if you want to get the whole website, you can just replace