It’s perhaps the best way for someone who has a good handle on it. The docs say it “sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.” So you would need to tune it so that it’s not grabbing objects that are irrelevant to the view, and probably exclude some file types like videos and audio (a rough starting point is sketched just after this list). If you get a well-tuned command worked out, that would be quite useful. But I do see a couple of shortcomings nonetheless:
- If you’re on a page that required you to log in and do some interactive things to get there, then I think passing the cookie from the GUI browser to wget would be non-trivial.
- If you’re on a capped internet connection, you might want to save from the browser’s cache rather than refetch everything.
But those issues aside, I like the fact that wget does not rely on a plugin.
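For example, something like this might be a starting point (very much a sketch; the URL is a placeholder and the reject list is just a guess to tune from):

```
# -m  : mirror (equivalent to -r -N -l inf --no-remove-listing)
# -k  : convert links so the local copy browses offline
# -p  : also fetch page requisites (CSS, images, scripts)
# -np : don't ascend into parent directories
# --reject : skip heavy media types
# --wait / --random-wait : go easy on the server
wget -m -k -p -np \
     --reject 'mp4,mkv,webm,avi,mov,mp3,flac,ogg' \
     --wait=1 --random-wait \
     https://example.com/some/section/
```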
wget has a `--load-cookies file` option. It wants the original Netscape cookie file format, so depending on your GUI browser you may have to convert it. I recall in one case I had to parse the session ID out of a cookie file and then build the expected format around it; I don’t recall the circumstances. Another problem: some anti-bot mechanisms crudely look at user-agent headers and block wget or curl attempts on that basis alone.
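To illustrate both workarounds (the domain, session ID, expiry timestamp, and user-agent string below are all placeholders):

```
# Netscape cookie format: 7 TAB-separated fields per line:
#   domain  include-subdomains  path  secure  expiry(unix)  name  value
printf '# Netscape HTTP Cookie File\n' > cookies.txt
printf '.example.com\tTRUE\t/\tTRUE\t1767225600\tsessionid\tabc123\n' >> cookies.txt

# Feed the cookies to wget and present a browser-like user-agent
# to get past crude UA-based blocking.
wget --load-cookies cookies.txt \
     --user-agent='Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0' \
     -m -k -p -np https://example.com/members-only/
```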
(edit) When cookies are not an issue, `wkhtmltopdf` is a good way to get a PDF of a webpage. So you could have a script do a `wget` to get the HTML faithfully, a `wkhtmltopdf` to get a PDF, then a `pdfattach` to put the HTML inside the PDF (rough sketch at the end of this post).

(edit2) It’s worth noting there is a project called `curl-impersonate`, which makes curl look more like a GUI browser so it gets more equal treatment. I don’t think it goes as far as adding a javascript engine; as I understand it, it mimics a real browser’s TLS and HTTP fingerprints rather than just the user-agent string.
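For what it’s worth, here is a rough sketch of that wget + wkhtmltopdf + pdfattach idea (URL and file names are placeholders; assumes `pdfattach` from poppler-utils, a page that needs no login, and note that wkhtmltopdf re-fetches the page, so it’s not friendly to capped connections):

```
#!/bin/sh
set -e
url='https://example.com/some/article'

# 1. Save the page and its requisites as browsable HTML
wget -E -k -p -np -P snapshot "$url"

# 2. Render the same URL to a PDF (this fetches it again)
wkhtmltopdf "$url" article.pdf

# 3. Embed the saved HTML inside the PDF as an attachment
html_file=$(find snapshot -name '*.html' | head -n 1)
pdfattach article.pdf "$html_file" article-with-html.pdf
```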