How to discover a direct download URL?

Let’s say we need a link to a file hosted somewhere on the Internet, and not just to copy and paste it into the browser. For example, we might write a program that downloads files from a service, or save the link and provide it somewhere else. Basically, our goal is to automate things and skip human interaction.

Most of the time, resources like files are handled and managed by separate services. This means that sometimes we aren’t given direct access to a file because access is controlled by an application. The reasons are often simple: authorization, counting downloads, showing ads, preparing the file for download, or preventing hotlinking. In general, it’s all server-side processing.

Example

We will use GitHub as an example. Repositories hosted there can be downloaded as a ZIP file. I imagine GitHub doesn’t store archives of all repositories all the time, but generates them on demand and deletes them later to save space. With Chrome Dev Tools (Network tab) we can see what happens after clicking the Download ZIP button.

[Screenshot: requests shown in the Network tab of Chrome Dev Tools]

Actually, two requests were made. The first, to https://github.com/shakiba/planck.js/archive/master.zip, was redirected (HTTP status 302) to the real file address: https://codeload.github.com/shakiba/planck.js/zip/master. In the response headers of the first request we can check the redirect location:

[Screenshot: response headers of the first request, including Location]

We can repeat the request in a terminal using the curl tool. Right-clicking the request in Dev Tools (Copy -> Copy as cURL) gives us a command ready to paste into the terminal. Running it in this case prints:

<html><body>You are being <a href="https://codeload.github.com/shakiba/planck.js/zip/master">redirected</a>.</body></html>

For readability we can drop all the headers from the copied command and add the -I option, which makes curl send a HEAD request and print only the response headers:

$ curl https://github.com/shakiba/planck.js/archive/master.zip -I
HTTP/1.1 302 Found
Server: GitHub.com
Date: Sun, 19 Mar 2017 14:36:50 GMT
Content-Type: text/html; charset=utf-8
Status: 302 Found
Cache-Control: no-cache
Vary: X-PJAX
Location: https://codeload.github.com/shakiba/planck.js/zip/master
...

Adding the -L option makes curl follow the redirect:

$ curl https://github.com/shakiba/planck.js/archive/master.zip -I -L
...

HTTP/1.1 200 OK
Content-Length: 518982
Access-Control-Allow-Origin: https://render.githubusercontent.com
Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'
Strict-Transport-Security: max-age=31536000
Vary: Authorization,Accept-Encoding
X-Content-Type-Options: nosniff
X-Frame-Options: deny
X-XSS-Protection: 1; mode=block
ETag: "4e10a37c09e7f2f808c0aed1bba92b4c86d3d5fd"
Content-Type: application/zip
Content-Disposition: attachment; filename=planck.js-master.zip

Here we can see some file properties: name, size, and type. We can also retrieve a specific header only by piping the output through grep:

$ curl https://github.com/... -I | grep -Fi Location
...
Location: https://codeload.github.com/shakiba/planck.js/zip/master
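The file properties visible in the 200 OK response (name, size, type) can also be read programmatically. Below is a rough sketch in Ruby, the language used later in this post. It assumes the raw header text has already been captured as a string; the helper names parse_headers and file_properties are made up for illustration.

```ruby
# Turns raw header text (as printed by `curl -I`) into a Hash.
# Header names are lowercased; the status line has no colon and is skipped.
def parse_headers(raw)
  raw.each_line.with_object({}) do |line, headers|
    name, value = line.split(':', 2)
    next if value.nil?
    headers[name.strip.downcase] = value.strip
  end
end

# Picks out the file name, size and type from parsed headers.
# Any value whose header is missing comes back as nil.
def file_properties(headers)
  disposition = headers['content-disposition'].to_s
  {
    name: disposition[/filename="?([^";]+)"?/, 1],
    size: headers['content-length']&.to_i,
    type: headers['content-type']
  }
end
```

If the text contains several header blocks (as with -I -L), later headers overwrite earlier ones with the same name, so the result describes the final response.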

To download the file, use -L (follow redirects) and -O (write the output to a file named after the remote file):

$ curl https://github.com/shakiba/planck.js/archive/master.zip -L -O

Code example

To achieve the above I will use Ruby:

What I did was run the same curl commands, but from Ruby. Surrounding text with backticks (`…`) is one way to run system commands from Ruby code. It captures the command’s output as a string that can be assigned to a variable. It’s not recommended in this case because we can’t be sure curl will be available on every system this code runs on.
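A minimal sketch of the backtick approach might look like the following. It assumes curl is installed and on the PATH; the method names location_from_headers and redirect_location_via_curl are made up for illustration.

```ruby
require 'shellwords'

# Extracts the Location header value from raw `curl -I` output.
# Returns nil when there is no Location header.
def location_from_headers(raw_headers)
  line = raw_headers.lines.find { |l| l =~ /\Alocation:/i }
  line && line.split(':', 2).last.strip
end

# Runs `curl -sI <url>` through backticks and returns the redirect target.
# -s hides the progress meter, -I asks for headers only.
# The URL is shell-escaped because it is interpolated into a command line.
def redirect_location_via_curl(url)
  headers = `curl -sI #{Shellwords.escape(url)}`
  location_from_headers(headers)
end

# Usage (needs curl and network access):
#   puts redirect_location_via_curl('https://github.com/shakiba/planck.js/archive/master.zip')
```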

This task is simple with Net::HTTP, Ruby’s built-in HTTP client:
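As a minimal sketch, assuming a HEAD request is enough to trigger the redirect (the same thing curl -I does), it could be written like this. The method name redirect_location is made up for illustration.

```ruby
require 'net/http'
require 'uri'

# Sends a single HEAD request and returns the value of the Location header,
# or nil if the server did not send one. Net::HTTP does not follow redirects
# on its own, so the first response is what we get back.
def redirect_location(url)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
  response['Location']
end

# Usage (needs network access):
#   puts redirect_location('https://github.com/shakiba/planck.js/archive/master.zip')
```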

Now we can do whatever we want with the link string from response['Location']. This method isn’t guaranteed to work with every service, but it’s enough for simple cases like this one. I didn’t show how to download files from Ruby because this post isn’t about that, and I have never actually done it, so maybe there will be another post about it.
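One case where a single Location lookup may fall short is a service that redirects more than once. The sketch below is a rough equivalent of curl -I -L: it keeps sending HEAD requests until a response is no longer a redirect. The method name resolve_final_url and the default limit of 5 hops are assumptions made for illustration.

```ruby
require 'net/http'
require 'uri'

# Follows redirects with HEAD requests and returns the last URL that did
# not redirect. Raises if the chain is longer than max_hops.
def resolve_final_url(url, max_hops = 5)
  uri = URI(url)
  max_hops.times do
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.head(uri.request_uri)
    end
    location = response['Location']
    # Stop when the response is not a 3xx or carries no target.
    return uri.to_s unless response.is_a?(Net::HTTPRedirection) && location
    # Location may be relative, so resolve it against the current URL.
    uri = URI.join(uri.to_s, location)
  end
  raise "Too many redirects (more than #{max_hops})"
end

# Usage (needs network access):
#   puts resolve_final_url('https://github.com/shakiba/planck.js/archive/master.zip')
```

Services that require cookies, authorization, or JavaScript before handing out the file will still need more than this.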
