 7.2 Advanced Usage
 ==================
 
    • You have a file that contains the URLs you want to download?  Use
      the ‘-i’ switch:
 
           wget -i FILE
 
      If you specify ‘-’ as file name, the URLs will be read from
      standard input.
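
      For example, URLs kept in a file (‘urls.txt’ here is just an
      illustrative name) can equally well be piped in:

           cat urls.txt | wget -i -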
 
    • Create a five levels deep mirror image of the GNU web site, with
      the same directory structure the original has, with only one try
      per document, saving the log of the activities to ‘gnulog’:
 
           wget -r -t1 https://www.gnu.org/ -o gnulog
 
    • The same as the above, but convert the links in the downloaded
      files to point to local files, so you can view the documents
      off-line:
 
           wget --convert-links -r -t1 https://www.gnu.org/ -o gnulog
 
    • Retrieve only one HTML page, but make sure that all the elements
      needed for the page to be displayed, such as inline images and
      external style sheets, are also downloaded.  Also make sure the
      links in the downloaded page point to the downloaded copies.
 
           wget -p --convert-links http://www.example.com/dir/page.html
 
      The HTML page will be saved to ‘www.example.com/dir/page.html’, and
      the images, stylesheets, etc., somewhere under ‘www.example.com/’,
      depending on where they were on the remote server.
 
    • The same as the above, but without the ‘www.example.com/’
      directory.  In fact, I don’t want to have all those random server
      directories anyway—just save _all_ those files under a ‘download/’
      subdirectory of the current directory.
 
           wget -p --convert-links -nH -nd -Pdownload \
                http://www.example.com/dir/page.html
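
      If you would rather keep the remote directory layout and drop only
      the host name part, leaving out ‘-nd’ and keeping ‘-nH’ should
      place the files under ‘download/dir/’ instead:

           wget -p --convert-links -nH -Pdownload \
                http://www.example.com/dir/page.html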
 
    • Retrieve the index.html of ‘www.lycos.com’, showing the original
      server headers:
 
           wget -S http://www.lycos.com/
 
    • Save the server headers with the file, perhaps for post-processing.
 
           wget --save-headers http://www.lycos.com/
           more index.html
 
    • Retrieve the first two levels of ‘wuarchive.wustl.edu’, saving them
      to ‘/tmp’.
 
           wget -r -l2 -P/tmp ftp://wuarchive.wustl.edu/
 
    • You want to download all the GIFs from a directory on an HTTP
      server.  You tried ‘wget http://www.example.com/dir/*.gif’, but
      that didn’t work because HTTP retrieval does not support globbing.
      In that case, use:
 
           wget -r -l1 --no-parent -A.gif http://www.example.com/dir/
 
      More verbose, but the effect is the same.  ‘-r -l1’ means to
      retrieve recursively (⇒Recursive Download), with maximum
      depth of 1.  ‘--no-parent’ means that references to the parent
      directory are ignored (⇒Directory-Based Limits), and
      ‘-A.gif’ means to download only the GIF files.  ‘-A "*.gif"’ would
      have worked too.
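
      ‘-A’ also takes a comma-separated list of suffixes or patterns, so
      several file types can be collected in one run (the ‘.png’ suffix
      below is only an illustration):

           wget -r -l1 --no-parent -A ".gif,.png" \
                http://www.example.com/dir/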
 
    • Suppose you were in the middle of downloading when Wget was
      interrupted.  Now you do not want to clobber the files already
      present.  In that case, use ‘-nc’ (‘--no-clobber’):
 
           wget -nc -r https://www.gnu.org/
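
      If some of those files were themselves cut off in mid-transfer,
      ‘-c’ (continue) is the alternative that resumes partially
      retrieved files instead of merely skipping the ones already
      present:

           wget -c -r https://www.gnu.org/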
 
    • If you want to supply your username and password as part of an
      HTTP or FTP URL, use the appropriate URL syntax (⇒URL Format).
 
           wget ftp://hniksic:mypassword@unix.example.com/.emacs
 
      Note, however, that this usage is not advisable on multi-user
      systems because it reveals your password to anyone who looks at the
      output of ‘ps’.
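
      A safer alternative (assuming your Wget provides ‘--user’ and
      ‘--ask-password’) is to give only the user name on the command
      line and let Wget prompt for the password:

           wget --user=hniksic --ask-password \
                ftp://unix.example.com/.emacs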
 
    • You would like the output documents to go to standard output
      instead of to files?
 
           wget -O - http://jagor.srce.hr/ http://www.srce.hr/
 
      You can also combine ‘-O -’ with ‘-i -’ and build pipelines to
      retrieve documents from remote hotlists:
 
           wget -O - http://cool.list.com/ | wget --force-html -i -