Let’s say we need the link to a file hosted somewhere on a server on the Internet, and not just to copy it and paste it into the browser. For example, we might write a program that downloads files from a service, or save the link and provide it somewhere else. Basically, our goal is to automate things and skip human interaction.
Most of the time, resources such as files are handled and managed by various services. This means that sometimes we aren’t given direct access to the file because it’s controlled by an application. The reasons are often simple: authorization, counting downloads, showing ads, preparing the file for download, preventing hotlinking. In general, it’s all server-side processing.
We will use GitHub as an example. Repositories hosted there can be downloaded as a ZIP file. I imagine GitHub doesn’t keep a ZIP file for every repository at all times, but generates them on demand and deletes them later to save space. With Chrome Dev Tools (Network tab) we can see what happens after clicking the Download ZIP button.
Two requests were actually made. The first, to https://github.com/shakiba/planck.js/archive/master.zip, was redirected (HTTP status 302) to the real file address: https://codeload.github.com/shakiba/planck.js/zip/master. The redirect location is given in the response headers of the first request.
We can repeat the request in the terminal using the curl tool. Right-clicking the request in Dev Tools (Copy -> Copy as cURL) gives us a command ready to paste into the terminal. Running it in this case prints:
<html><body>You are being <a href="https://codeload.github.com/shakiba/planck.js/zip/master">redirected</a>.</body></html>
For readability we can drop all the headers from the previous command, and add the -I option at the end to show only the response headers:
$ curl https://github.com/shakiba/planck.js/archive/master.zip -I
HTTP/1.1 302 Found
Server: GitHub.com
Date: Sun, 19 Mar 2017 14:36:50 GMT
Content-Type: text/html; charset=utf-8
Status: 302 Found
Cache-Control: no-cache
Vary: X-PJAX
Location: https://codeload.github.com/shakiba/planck.js/zip/master
...
The -L option makes curl follow the redirect:
$ curl https://github.com/shakiba/planck.js/archive/master.zip -I -L
...
HTTP/1.1 200 OK
Content-Length: 518982
Access-Control-Allow-Origin: https://render.githubusercontent.com
Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'
Strict-Transport-Security: max-age=31536000
Vary: Authorization,Accept-Encoding
X-Content-Type-Options: nosniff
X-Frame-Options: deny
X-XSS-Protection: 1; mode=block
ETag: "4e10a37c09e7f2f808c0aed1bba92b4c86d3d5fd"
Content-Type: application/zip
Content-Disposition: attachment; filename=planck.js-master.zip
Here we can see some file properties such as name, size, and type. We can also retrieve only a specific header by piping the output through grep:
$ curl https://github.com/... -I | grep -Fi Location
...
Location: https://codeload.github.com/shakiba/planck.js/zip/master
To download the file, use -L (follow redirects) and -O (write the output to a file):
$ curl https://github.com/shakiba/planck.js/archive/master.zip -L -O
To do the same things from code, I will use Ruby.
What I actually did was run the same curl commands from Ruby. Surrounding text with backticks (`…`) is one way to run system commands from Ruby code. It captures the command’s output as a string that can be assigned to a variable. It’s not recommended in this case, because we can’t be sure curl will be available on every system this code runs on.
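A minimal sketch of the backtick approach, assuming curl is installed and on the PATH. The helper names `curl_headers` and `location_from_headers` are hypothetical, and the `-s` flag is added here only to silence curl's progress meter:

```ruby
require 'shellwords'

# Hypothetical helper: runs `curl -I` through backticks and returns
# the raw response headers as a single string.
# Assumes curl is available on the system.
def curl_headers(url)
  `curl -s -I #{Shellwords.escape(url)}`
end

# Hypothetical helper: picks the value of the Location header out of
# raw header text, or returns nil when there is no such header.
# The match is case-insensitive, like `grep -Fi Location`.
def location_from_headers(raw_headers)
  line = raw_headers.lines.find { |l| l =~ /\Alocation:/i }
  line && line.split(':', 2).last.strip
end
```

Calling `location_from_headers(curl_headers('https://github.com/shakiba/planck.js/archive/master.zip'))` should return the same Location value that grep printed earlier, as long as the service still redirects that URL.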
The task is simple with Net::HTTP, Ruby’s built-in HTTP client.
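A minimal sketch of such a request, assuming the same repository URL as in the curl examples. The helper names `redirect_target` and `fetch_redirect_target` are hypothetical. Net::HTTP does not follow redirects on its own, so the 302 response can be inspected directly:

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: returns the redirect target of a response,
# or nil when the response is not a 3xx redirect.
def redirect_target(response)
  response.is_a?(Net::HTTPRedirection) ? response['Location'] : nil
end

# Hypothetical helper: sends a HEAD request (similar to `curl -I`)
# and returns the redirect target without following it.
def fetch_redirect_target(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
  redirect_target(response)
end

# Example call (needs network access):
# puts fetch_redirect_target('https://github.com/shakiba/planck.js/archive/master.zip')
```

Checking for `Net::HTTPRedirection` first avoids treating a 200 or an error response as if it carried a link.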
Now we can do whatever we want with the link string from response['Location']. This method isn’t guaranteed to work with every service, but it’s enough for simple cases like this one. I didn’t show how to download files from Ruby because this post isn’t about that, and I have never actually done it, so maybe there will be another post on the topic.