荒天翔鷗的天地: An Introduction to Using Wget

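Preface

Several open-source command-line HTTP/FTP clients are in wide use, such as curl, Wget, HTTrack, and Aria2. Each has its own strengths: curl supports the largest number of network transfer protocols [1]; HTTrack can copy a website for offline browsing (colloquially called "ripping" a site, though that term can be misleading); Wget combines some of the strengths of both curl and HTTrack; and Aria2 supports BitTorrent.

This article covers Wget. It handles ordinary HTTP use cases and is installed by default on Ubuntu 20.04, whereas the other tools mentioned above must be installed separately. Graphical front ends exist, but only command-line use is discussed here. The URLs in several of the examples are fictitious and serve purely for illustration.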

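About Wget

Wget's development dates back to 1995. It is part of the GNU Project, and its name combines "World Wide Web" with "get" (as in HTTP GET). It downloads files over HTTP, HTTPS, FTP, and FTPS. Because it is a non-interactive command-line tool, it is easy to run from scripts, scheduled jobs, and terminals without a graphical interface. Some of its features [2]:

Resuming previously failed downloads
Wildcards in file names
Recursive retrieval of directories
Converting absolute links in downloaded pages to relative ones
Support for many operating systems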

Wget Usage

Start with Wget's usage summary and command options:


$ wget --help

The output is rather long, so only part of it is excerpted here:


Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version                   display the version of Wget and exit
  -h,  --help                      print this help
  -b,  --background                go to background after startup

Logging and input file:
  -o,  --output-file=FILE          log messages to FILE
  -i,  --input-file=FILE           download URLs found in local or external FILE
  -F,  --force-html                treat input file as HTML

Download:
  -t,  --tries=NUMBER              set number of retries to NUMBER (0 unlimits)
  -O,  --output-document=FILE      write documents to FILE
  -nc, --no-clobber                skip downloads that would download to
                                     existing files (overwriting them)
  -c,  --continue                  resume getting a partially-downloaded file
  -N,  --timestamping              don't re-retrieve files unless newer than
                                     local
       --no-if-modified-since      don't use conditional if-modified-since get
                                     requests in timestamping mode
       --no-use-server-timestamps  don't set the local file's timestamp by
                                     the one on the server
  -S,  --server-response           print server response
       --spider                    don't download anything
  -w,  --wait=SECONDS              wait SECONDS between retrievals
       --waitretry=SECONDS         wait 1..SECONDS between retries of a retrieval
       --random-wait               wait from 0.5*WAIT...1.5*WAIT secs between retrievals
  -Q,  --quota=NUMBER              set retrieval quota to NUMBER
       --limit-rate=RATE           limit download rate to RATE
       --restrict-file-names=OS    restrict chars in file names to ones OS allows
       --user=USER                 set both ftp and http user to USER
       --password=PASS             set both ftp and http password to PASS
       --ask-password              prompt for passwords

Directories:
  -P,  --directory-prefix=PREFIX   save files to PREFIX/..

HTTP options:
       --http-user=USER            set http user to USER
       --http-password=PASS        set http password to PASS
       --default-page=NAME         change the default page name (normally
                                     this is 'index.html'.)
  -E,  --adjust-extension          save HTML/CSS documents with proper extensions
       --proxy-user=USER           set USER as proxy username
       --proxy-password=PASS       set PASS as proxy password
  -U,  --user-agent=AGENT          identify as AGENT instead of Wget/VERSION

Recursive download:
  -r,  --recursive                 specify recursive download
  -l,  --level=NUMBER              maximum recursion depth (inf or 0 for infinite)
  -k,  --convert-links             make links in downloaded HTML or CSS point to
                                     local files
  -m,  --mirror                    shortcut for -N -r -l inf --no-remove-listing
  -p,  --page-requisites           get all images, etc. needed to display HTML page

Recursive accept/reject:
  -A,  --accept=LIST               comma-separated list of accepted extensions
  -R,  --reject=LIST               comma-separated list of rejected extensions
       --accept-regex=REGEX        regex matching accepted URLs
       --reject-regex=REGEX        regex matching rejected URLs
       --regex-type=TYPE           regex type (posix|pcre)
  -D,  --domains=LIST              comma-separated list of accepted domains
       --exclude-domains=LIST      comma-separated list of rejected domains
       --follow-tags=LIST          comma-separated list of followed HTML tags
       --ignore-tags=LIST          comma-separated list of ignored HTML tags
  -H,  --span-hosts                go to foreign hosts when recursive
  -I,  --include-directories=LIST  list of allowed directories
  -X,  --exclude-directories=LIST  list of excluded directories
  -np, --no-parent                 don't ascend to the parent directory

Some Use Cases

Downloading Files

To download a file, just pass its URL to wget:


$ wget https://ftp.gnu.org/gnu/wget/wget-latest.tar.gz

You can also save the download under a different file name:


$ wget -O wget.tar.gz https://ftp.gnu.org/gnu/wget/wget-latest.tar.gz

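You can give several URLs on the command line, or simply list them in a text file and pass that file as input with the -i option. For example, suppose the text file files.txt contains: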


ftp://ftp.gnu.org/gnu/wget/wget-latest.tar.gz
ftp://ftp.gnu.org/gnu/wget/wget-latest.tar.gz.sig

Have Wget read files.txt and download the files listed in it:


$ wget -i files.txt
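The input-file workflow can also be scripted. The following is a minimal sketch, not a definitive recipe: it writes the two URLs from the text into files.txt with a here-document and defines a small wrapper that also sends Wget's messages to a log file with -o. The function name fetch_list and the log file name download.log are assumptions made for illustration.

```shell
list_file="files.txt"

# Write one URL per line; these are the two URLs used in the text.
cat > "$list_file" <<'EOF'
ftp://ftp.gnu.org/gnu/wget/wget-latest.tar.gz
ftp://ftp.gnu.org/gnu/wget/wget-latest.tar.gz.sig
EOF

fetch_list() {
    # $1: text file with one URL per line
    # $2: log file that receives Wget's messages (hypothetical name)
    wget -i "$1" -o "$2"
}
```

Calling `fetch_list files.txt download.log` would then download both files and leave the progress messages in download.log instead of on the terminal.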

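Resuming Downloads

Resuming here means that an earlier download was interrupted for some reason, and you now want to fetch only the part that is still missing rather than start over from the beginning. Use the -c option for this (but note that some servers may not support it):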


$ wget -c http://cdimage.ubuntu.com/ubuntu-mate/releases/20.04/release/ubuntu-mate-20.04-desktop-amd64.iso
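For a large file on an unreliable connection, -c can be wrapped in a retry loop so that each new attempt continues from the partial file. This is only a sketch under stated assumptions: the function name resume_download, the default of 5 attempts, and the RETRY_DELAY variable are all hypothetical and not part of Wget.

```shell
resume_download() {
    # $1: URL to fetch
    # $2: maximum number of attempts (defaults to 5)
    url=$1
    max_attempts=${2:-5}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # -c continues from the partial file left by an earlier attempt.
        if wget -c "$url"; then
            echo "finished after $attempt attempt(s)"
            return 0
        fi
        echo "attempt $attempt failed" >&2
        attempt=$((attempt + 1))
        # Pause before the next attempt, but not after the last one.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "${RETRY_DELAY:-10}"
        fi
    done
    return 1
}
```

A call such as `resume_download URL 3` gives up after three failed attempts and returns a non-zero status, which a calling script can check.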

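Saving a Complete Web Page

Sometimes you want to save a web page in full, including its images and CSS files. This is similar to a browser's "Save As" with the complete-page option, which stores the page and its related files in a subdirectory.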


$ wget -pkH -P my_page URL

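Here URL stands for some web page. Options that take no argument can be combined, as -p -k -H are combined into -pkH here. The -H option is added because images on the page may come from other sites. Note that without the -P option, which stores everything in a newly created my_page subdirectory, Wget creates several directories in the current directory, one per domain name, which looks rather cluttered.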
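In scripts, the long option names from the help listing can be easier to read than a combined cluster such as -pkH. The sketch below spells out the same command with long options; the function name save_page is hypothetical.

```shell
save_page() {
    # $1: URL of the page to archive
    # Long-option equivalent of: wget -pkH -P my_page URL
    wget --page-requisites --convert-links --span-hosts \
         --directory-prefix=my_page "$1"
}
```

Running `save_page URL` should behave the same as the short form, with the page and its requisites stored under my_page.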

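Limiting the Download Rate

To keep Wget from using all of the available network bandwidth while downloading, you can limit the transfer rate. The unit is bytes per second, and a k or m suffix denotes kilobytes or megabytes: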


$ wget --limit-rate=1.5m URL
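To see what a rate such as 1.5m amounts to in bytes per second, the arithmetic can be sketched with a small helper. It assumes that Wget treats k as 1024 bytes and m as 1024 × 1024 bytes; the function name rate_to_bytes is hypothetical and awk is assumed to be available.

```shell
rate_to_bytes() {
    # $1: a rate such as 1.5m, 200k, or a plain number of bytes
    awk -v r="$1" 'BEGIN {
        unit = tolower(substr(r, length(r), 1))
        mult = 1
        # Assumption: k and m are binary multiples (1024 and 1024*1024).
        if (unit == "k") { mult = 1024; r = substr(r, 1, length(r) - 1) }
        else if (unit == "m") { mult = 1024 * 1024; r = substr(r, 1, length(r) - 1) }
        printf "%d\n", r * mult
    }'
}

rate_to_bytes 1.5m   # prints 1572864
rate_to_bytes 200k   # prints 204800
```

Under that assumption, --limit-rate=1.5m caps the transfer at roughly 1.57 million bytes per second.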

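Limiting the Download Quota

For recursive downloads and downloads driven by an input file, this feature keeps Wget from fetching too much, for instance when you only want a preliminary test download before stopping. For example, to stop downloading once the 3 MB quota is reached: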


$ wget -Q3m -i files.txt

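Note that the -Q option has no effect when downloading a single file: the whole file is downloaded regardless of the quota. As stated above, it only takes effect for recursive downloads and downloads from an input file.

Changing the User Agent

The User-Agent header string that Wget sends to the web server is 'Wget/version', where version is Wget's version number. When necessary, you can specify a different string with the -U option, for example: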


$ wget -U "Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0 Waterfox/56.2.11" URL

Spider

The --spider option only checks whether a file exists and does not actually download it:


$ wget --spider https://ftp.gnu.org/gnu/wget/wget-latest.tar.gz

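As another example, after exporting your browser's bookmarks to an HTML file named bookmarks.html, you can have Wget check the links in the bookmarks: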


$ wget --spider -Fi bookmarks.html
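When a per-link report is wanted, --spider can be run once per URL and its exit status inspected. The sketch below reads a plain text file with one URL per line rather than an HTML bookmark file; the function name check_links and the OK/BROKEN labels are hypothetical, and -q (quiet) is used to suppress Wget's own messages.

```shell
check_links() {
    # $1: text file with one URL per line
    while IFS= read -r url; do
        # Skip empty lines.
        [ -n "$url" ] || continue
        # wget exits with status 0 when the check succeeds.
        if wget --spider -q "$url"; then
            echo "OK      $url"
        else
            echo "BROKEN  $url"
        fi
    done < "$1"
}
```

For instance, `check_links links.txt` prints one line per URL, which is easy to filter with grep afterwards.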

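Copying a Website

Before discussing how to copy a website, a few reminders:

If the amount to download is large, it generates heavy network traffic. It is best to pause briefly between retrievals so that the server is not overloaded.
Copying generally works better for sites made of static pages; with dynamic pages, some content may not be reproduced perfectly.

For example, to copy just one directory of a website: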


$ wget -mpknp http://www.mydomain.com/mydir/

The options can be adjusted to suit the situation. Here is another example, this time copying an entire website:


$ wget --restrict-file-names=windows -P mydomain -o log.txt -w 1 --random-wait -mpkEH http://www.mydomain.com

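If the page assets of the site you are copying, such as images, all live on the site itself or on a few machines within the same domains, you can add the --domains option so that only files from the listed domains are retrieved. According to the Wget manual, --domains does not turn on -H by itself, so -H is still needed for Wget to leave the starting host; the domain list then limits which hosts it may visit.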
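As a sketch of restricting a site copy to a list of domains, the function below reuses the options from the whole-site example and adds -D after -H, since the Wget manual notes that -D does not enable host spanning on its own. The function name mirror_site and the domain names in the usage example are hypothetical.

```shell
mirror_site() {
    # $1: start URL
    # $2: comma-separated list of domains that recursion may visit
    wget --restrict-file-names=windows -P mydomain -o log.txt \
         -w 1 --random-wait -mpkEH -D "$2" "$1"
}
```

A call such as `mirror_site http://www.mydomain.com mydomain.com,static.mydomain.com` would follow links only to hosts under those two domains.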

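If you have an aging website built with an outdated programming language or site-building software, one that is hard to update and may carry security vulnerabilities, consider archiving it as static pages; Wget is a very handy tool for copying the site in that case. Perhaps not everything on the site needs to be kept. If so, first set up a site with the same content on another machine under the original conditions, log in to its administration interface there, remove the unwanted content so that only what you want remains, and then copy that site with Wget, saving everything as static pages. Once this is done, the old site can be taken down and replaced with a new one. The saved static pages, that is, the old site's content, can be copied into a subdirectory of the new site, and a link on the new site's pages can point to the home page of the old content. This is only a rough outline; the actual details also involve adjusting the web server configuration, which strays too far from the topic of this article and is not covered here.

The above is only a brief introduction to using Wget; see the Wget manual [3] for details.

One last reminder: Wget does not do the work that JavaScript is meant to do. If the page content you want to capture is produced only after JavaScript has run, you may have to use another approach, such as the one in 〈以Python與無頭式Firefox或Chrome做網頁抓取〉 (Web Scraping with Python and Headless Firefox or Chrome).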

References

[1] https://curl.haxx.se/docs/comparison-table.html
[2] https://www.gnu.org/software/wget/
[3] https://www.gnu.org/software/wget/manual/

update: 2020-5-26


Saturday, May 23, 2020

