why would curl and wget result in a 403 forbidden?

  • I try to download a file with wget and curl and it is rejected with a 403 error (forbidden).

    I can view the file using the web browser on the same machine.

    I try again with my browser's user agent, obtained from http://www.whatsmyuseragent.com. I do this:

    wget -U 'Mozilla/5.0 (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0' http://...
    

    and

    curl -A 'Mozilla/5.0 (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0' http://...
    

    but it is still forbidden. What other reasons might there be for the 403, and how can I alter the wget and curl commands to overcome them?

    (this is not about being able to get the file - I know I can just save it from my browser; it's about understanding why the command-line tools work differently)

    update

    Thanks for all the excellent answers to this question. The specific problem I had encountered was that the server was checking the referrer. By adding a Referer header on the command line, I could get the file using curl and wget.

    The server that checked the referrer bounced through a 302 to another location that performed no checks at all, so a curl or wget of that site worked cleanly.
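    To see a redirect like the 302 described here, curl can report the final status code and the URL it ended up at. The following is a minimal sketch, not a definitive recipe: the function name final_status and the CURL override variable are hypothetical, introduced only so the command can be dry-run without touching the network.

```shell
# Print the final HTTP status and the URL reached after following redirects.
# $1 = URL to fetch, $2 = referring page to send in the Referer header.
# CURL is a hypothetical override (e.g. CURL=echo) for printing the arguments
# instead of sending a request; it defaults to the real curl.
final_status() {
  ${CURL:-curl} --silent --location --output /dev/null \
    --referer "$2" \
    --write-out '%{http_code} %{url_effective}\n' \
    "$1"
}
```

    Call it as final_status URL REFERRING_PAGE. Dropping --location makes curl stop at the first response, so a redirecting server should then report its 3xx code rather than the final one.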

    If anyone is interested, this came about because I was reading this page to learn about embedded CSS and was trying to look at the site's css for an example. The actual URL I was getting trouble with was this and the curl I ended up with is

    curl -L -H 'Referer: http://css-tricks.com/forums/topic/font-face-in-base64-is-cross-browser-compatible/' http://cloud.typography.com/610186/691184/css/fonts.css
    

    and the wget is

    wget --referer='http://css-tricks.com/forums/topic/font-face-in-base64-is-cross-browser-compatible/' http://cloud.typography.com/610186/691184/css/fonts.css
    

    Very interesting.

    Pages that check referer are really annoying. The header is supposed to be optional and used for gathering statistics.

    The easiest thing I've found is to convert it to a zip file and use it that way.

  • Lekensteyn (correct answer)

    6 years ago

    A browser's HTTP request may contain headers that curl or wget do not set by default. For example:

    • Cookie: this is the most likely reason a request would be rejected; I have seen this happen on download sites. Given a cookie key=val, you can set it with the -b key=val (or --cookie key=val) option for curl.
    • Referer (sic): when clicking a link on a web page, most browsers send the current page as the referrer. It should not be relied on, but even eBay failed to reset a password when this header was absent, so it does happen. The curl option for this is -e URL (or --referer URL).
    • Authorization: this is becoming less popular now due to the uncontrollable UI of the username/password dialog, but it is still possible. It can be set in curl with the -u user:password (or --user user:password) option.
    • User-Agent: some requests will yield different responses depending on the User Agent. This can be used in a good way (providing the real download rather than a list of mirrors) or in a bad way (reject user agents which do not start with Mozilla, or contain Wget or curl).
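
    The headers listed above can be combined in one request. The sketch below assumes the cookie, referring page and URL are copied from your own browser session; the function names and the CURL/WGET override variables are hypothetical, and the user agent string is the one from the question.

```shell
# User agent string from the question; substitute your own browser's.
UA='Mozilla/5.0 (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0'

# $1 = URL, $2 = referring page, $3 = cookie in key=val form.
# CURL and WGET are hypothetical overrides (e.g. CURL=echo) that print the
# arguments instead of sending a request.
browser_like_curl() {
  ${CURL:-curl} --location \
    --user-agent "$UA" \
    --referer "$2" \
    --cookie "$3" \
    "$1"
}

# wget has no dedicated cookie-string option, so the Cookie header is sent
# with --header. wget follows redirects by default.
browser_like_wget() {
  ${WGET:-wget} \
    --user-agent="$UA" \
    --referer="$2" \
    --header="Cookie: $3" \
    "$1"
}
```

    Adding one header at a time and retrying is a simple way to find out which one the server actually checks. For HTTP authentication, curl's --user user:password can be appended in the same way.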

    You can normally use the Developer tools of your browser (Firefox and Chrome support this) to read the headers sent by your browser. If the connection is not encrypted (that is, not using HTTPS), then you can also use a packet sniffer such as Wireshark for this purpose.
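
    For the other half of the comparison, curl's verbose trace shows the request it actually sends; those lines are prefixed with "> ". A small sketch follows, with hypothetical helper names (request_lines, show_request), that keeps only those lines so they can be set side by side with what the browser's developer tools report.

```shell
# Keep only the request lines from a curl --verbose trace read on stdin.
# Request lines start with "> "; carriage returns are stripped.
request_lines() {
  sed -n 's/^> //p' | tr -d '\r'
}

# Show the request curl sends for the given arguments; the response body is
# discarded. The verbose trace goes to stderr, hence the 2>&1.
show_request() {
  curl --silent --verbose --output /dev/null "$@" 2>&1 | request_lines
}
```

    For example, show_request -A 'some agent' URL prints the request line and headers for that invocation. Any header the browser sends that is missing from this output is a candidate for the cause of the 403.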

    Besides these headers, websites may also trigger actions behind the scenes that change state. For example, opening a page may fire a background request that prepares the download link, or the page may redirect. These actions typically use JavaScript, but a hidden frame can serve the same purpose.
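
    When the state is carried in cookies set by an earlier page, visiting that page first and reusing its cookies can be enough. This is a sketch under that assumption only: it does not run JavaScript, and the function name fetch_with_session and the CURL override variable are hypothetical.

```shell
# $1 = page that sets up the session, $2 = URL of the file to download.
# CURL is a hypothetical override (e.g. CURL=echo) for a dry run.
fetch_with_session() {
  _jar=$(mktemp) || return 1

  # First request: visit the page and save any cookies it sets.
  ${CURL:-curl} --silent --location --cookie-jar "$_jar" --output /dev/null "$1"

  # Second request: send those cookies back, with the page as the referrer,
  # and save the file under its remote name.
  ${CURL:-curl} --location --cookie "$_jar" --referer "$1" --remote-name "$2"
  _status=$?

  rm -f "$_jar"
  return "$_status"
}
```

    If the download still fails, the preparation step probably happens in a script, and the request that script makes has to be found in the browser's developer tools and replayed explicitly.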

    If you are looking for a method to easily fetch files from a download site, have a look at plowdown, included with plowshare.

    Another really perverse possibility would be that the server for some reason was configured to return 403 instead of 200 on success.

    This gave me the clue I needed. After trying cookies, I found the referrer to be the problem (now, if only that could be spelt properly!!!)

    If it is *still failing* in `wget` try adding `--auth-no-challenge`. Works like magic.

Licensed under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM