Downloading https web content - the hard way (for wireshark fans)

Problem: You want to download some content (e.g. a video) from a website that only wants you to stream it, not download it. You may want to do this to view it later (maybe it expires after a while), to view it offline, or to view it on a different device. Please note that circumventing the content owner's wishes may be a legal problem. Since there are less sophisticated ways of doing it anyway (e.g. screen recording, either in software or with a camera pointed at your screen), we'll look at it purely from a technical point of view.

Until recently, downloading video content was fairly easy: open the browser's developer tools (F12), go to the Network tab, identify the video resource, copy the request as a curl command and paste it somewhere for offline download. But sites got clever: they now detect when the developer tools are open and stop loading the video content, so you can't find the video link.

The solution is more convoluted: we're going to find the URL by doing a packet capture, so that the page can't detect the snooping attempt. But the sites aren't stupid either, and most of them use HTTPS. So you need to decrypt the HTTPS traffic to look inside.

Fortunately, since you control the browser you can tell it to dump the encryption keys to a file: https://redflagsecurity.net/2019/03/10/decrypting-tls-wireshark/

So, you can start a new browser instance (linux) with:
SSLKEYLOGFILE=.ssl.log chromium-browser

and navigate to the desired site. Load the page that holds the video you want to download.
Also prepare Wireshark to capture packets on your outgoing interface (full capture, not just headers). Make sure you configure Wireshark as described in the guide above, so that it decrypts TLS traffic with the keys from the same file the browser writes to (here .ssl.log, in the directory where you started the browser).
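
Before starting the capture, it can help to confirm that the browser is really writing session keys. A minimal sketch, assuming the file follows the NSS key log format that Wireshark reads (one entry per line: a label such as CLIENT_RANDOM, then two hex strings). The function name keylog_count is made up for illustration:

```bash
#!/bin/bash
# Count lines that look like key log entries: "<LABEL> <hex> <hex>".
# Comment lines (starting with '#') and anything else are ignored.
keylog_count() {
    grep -c -E '^[A-Z0-9_]+ [0-9a-fA-F]+ [0-9a-fA-F]+$' "$1"
}

# Usage: keylog_count .ssl.log
# A count of 0 after loading an https page suggests the browser is not logging keys.
```

If the count stays at 0, check that the browser was really started from the shell where SSLKEYLOGFILE was set, and that no older instance was already running.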

Press play on your video (it shouldn't matter much where you start, but the beginning is best) and let it play for a short while, 10-20 s. You can then pause the video and stop the capture. (An alternative would be to play the whole video in the browser while the capture is running and extract it via File -> Export Objects -> HTTP, but that takes a long time because you need to play the whole thing in your browser.)

Now comes the tricky part, where you kind of need to know your way around Wireshark. Your challenge is to find the data stream in your packet capture. It is usually the largest transfer between you and a server. You should be able to find it relatively easily by going to Statistics -> Conversations and sorting the TCP traffic by Bytes. The largest transfer should be your desired content. Now you know the destination IP address. If you right click on it and select Apply as Filter -> Selected -> A<->B you should see only the relevant traffic in Wireshark.

After the TLS handshake you should see an HTTP GET request that we need to "convert" into a curl command. Thankfully there's an easy way to do that... Select the Hypertext Transfer Protocol section in the GET packet -> Right click -> Copy -> All Visible Selected Tree Items and you should get something like this in your clipboard:

Hypertext Transfer Protocol
    GET /secip/0/1V6iYKTjWZJtAfOEC39TRg/ODAuOTcuMjM4Ljc3/1557435600/hls-vod-s1/flv/api/files/videos/2017/11/14/1510638299qz9g3.mp666Frag5Num5 HTTP/1.1\r\n
    Host: hty4e3.vkcache.com\r\n
    Connection: keep-alive\r\n
    Origin: https://hqq.tv\r\n
    User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36\r\n
    Accept: */*\r\n
     [truncated]Referer: https://hqq.tv/sec/player/embed_player_2048452040101183.php?iss=ODAuOTcuMjM4Ljc3&vid=RhzS9QBcmImK&at=b3ed763ec5f85212ad4e9c275a4094a9&autoplayed=yes&referer=on&http_referer=aHR0cHM6Ly93YWF3LnR2L3dhdGNoX3ZpZGVvLnBocD92
    Accept-Encoding: gzip, deflate, br\r\n
    Accept-Language: en-US,en;q=0.9,ro;q=0.8\r\n
    \r\n
    [Full request URI: https://hty4e3.vkcache.com/secip/0/1V6iYKTjWZJtAfOEC39TRg/ODAuOTcuMjM4Ljc3/1557435600/hls-vod-s1/flv/api/files/videos/2017/11/14/1510638299qz9g3.mp666Frag5Num5]
    [HTTP request 1/1]
    [Response in frame: 2535]

You need to do some trimming in a text editor:
* remove \r\n from the lines (can be done with find and replace)
* remove Hypertext Transfer Protocol
* remove any [truncated] entries
* remove anything after the lonely \r\n or blank line (signifies end of headers)
* reduce the indent of everything so that everything is left-aligned
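
The trimming steps above can also be scripted. A rough sketch, assuming the clipboard text has exactly the shape shown earlier (literal \r\n at the end of each header line, an optional [truncated] prefix, and a lone \r\n marking the end of the headers). The function name is hypothetical:

```bash
#!/bin/bash
# Read Wireshark's "All Visible Selected Tree Items" text on stdin and
# print only the request line and headers, left-aligned.
clean_wireshark_headers() {
    sed -e 's/^[[:space:]]*//' \
        -e '/^Hypertext Transfer Protocol$/d' \
        -e 's/^\[truncated\]//' \
        -e 's/\\r\\n$//' |
    awk 'NF == 0 { exit } { print }'   # stop at the first empty line
}

# Usage: clean_wireshark_headers < clipboard.txt > request.txt
```

The sed expressions strip the indent, drop the protocol title line, remove the [truncated] marker and the literal \r\n suffix. The awk step then cuts everything from the now-empty end-of-headers line onwards.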

The end result should look like:

GET /secip/0/1V6iYKTjWZJtAfOEC39TRg/ODAuOTcuMjM4Ljc3/1557435600/hls-vod-s1/flv/api/files/videos/2017/11/14/1510638299qz9g3.mp666Frag5Num5 HTTP/1.1
Host: hty4e3.vkcache.com
Connection: keep-alive
Origin: https://hqq.tv
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36
Accept: */*
Referer: https://hqq.tv/sec/player/embed_player_2048452040101183.php?iss=ODAuOTcuMjM4Ljc3&vid=RhzS9QBcmImK&at=b3ed763ec5f85212ad4e9c275a4094a9&autoplayed=yes&referer=on&http_referer=aHR0cHM6Ly93YWF3LnR2L3dhdGNoX3ZpZGVvLnBocD92
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,ro;q=0.8

Now we can use h2c (headers to curl) to convert it into a curl request: https://curl.haxx.se/h2c/. Simply paste the string in the form and click convert and it should produce something like:

curl --compressed --header "Accept-Language: en-US,en;q=0.9,ro;q=0.8" --header "Connection: keep-alive" --header "Origin: https://hqq.tv" --header "Referer: https://hqq.tv/sec/player/embed_player_2048452040101183.php?iss=ODAuOTcuMjM4Ljc3&vid=RhzS9QBcmImK&at=b3ed763ec5f85212ad4e9c275a4094a9&autoplayed=yes&referer=on&http_referer=aHR0cHM6Ly93YWF3LnR2L3dhdGNoX3ZpZGVvLnBocD92" --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36" https://hty4e3.vkcache.com/secip/0/1V6iYKTjWZJtAfOEC39TRg/ODAuOTcuMjM4Ljc3/1557435600/hls-vod-s1/flv/api/files/videos/2017/11/14/1510638299qz9g3.mp666Frag5Num5

Time to test it. Remember to redirect output to a file:

curl --compressed --header "Accept-Language: en-US,en;q=0.9,ro;q=0.8" --header "Connection: keep-alive" --header "Origin: https://hqq.tv" --header "Referer: https://hqq.tv/sec/player/embed_player_2048452040101183.php?iss=ODAuOTcuMjM4Ljc3&vid=RhzS9QBcmImK&at=b3ed763ec5f85212ad4e9c275a4094a9&autoplayed=yes&referer=on&http_referer=aHR0cHM6Ly93YWF3LnR2L3dhdGNoX3ZpZGVvLnBocD92" --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36" https://hty4e3.vkcache.com/secip/0/1V6iYKTjWZJtAfOEC39TRg/ODAuOTcuMjM4Ljc3/1557435600/hls-vod-s1/flv/api/files/videos/2017/11/14/1510638299qz9g3.mp666Frag5Num5 > test

It should have downloaded a file called test. Let's see what it is:

$ file test
test: MPEG transport stream data
$ mediainfo test
General
ID                                       : 1 (0x1)
Complete name                            : test
Format                                   : MPEG-TS
File size                                : 4.71 MiB
Duration                                 : 19 s 960 ms
Overall bit rate mode                    : Variable
Overall bit rate                         : 1 977 kb/s
FileExtension_Invalid                    : ts m2t m2s m4t m4s tmf ts tp trp ty

In some cases the server might give you back gzipped data. The file command will tell you if that's the case, and you will need to uncompress the data before proceeding. You should now be able to play the file in VLC to make sure the data is fine.
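
If the file command is not at hand, the gzip case can also be spotted from the first two bytes, which are always 1f 8b for gzip data. A small sketch with hypothetical helper names, not something the site or curl requires:

```bash
#!/bin/bash
# Print "gzip" if the file starts with the gzip magic bytes 1f 8b, else "other".
detect_gzip() {
    magic=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then echo gzip; else echo other; fi
}

# Decompress in place only when needed; the compressed copy is kept as NAME.gz.
gunzip_if_needed() {
    if [ "$(detect_gzip "$1")" = "gzip" ]; then
        mv "$1" "$1.gz" && gzip -dc "$1.gz" > "$1"
    fi
}

# Usage: gunzip_if_needed test
```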

Now, there's one more issue. For efficiency reasons, caching services like vkcache.com store large files in chunks (2-10 MB in size). The web player knows how to request the next chunk, but our capture has only one. You'll need to guess the other fragment names and download all of them. As you can see, the server file name is 1510638299qz9g3.mp666Frag5Num5. The most likely parts to iterate on are Frag5 and Num5. Try one, then the other, and if you don't get different chunks, try both. How many chunks can you expect? That depends on the length of your content. For 1 hour of content you can expect ~300 2 MB chunks. You can always try to download chunks that are not there; we'll remove them later.
Note one small change: we need to add -L (follow redirects) to the curl command line (it's not part of the generated command), because for some chunks the site will redirect you to some other storage and curl needs to be able to follow it.
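
The loop bound can be cross-checked with a back-of-the-envelope estimate. This sketch assumes every chunk lasts about as long as the 20-second sample inspected with mediainfo, which may not hold for other streams:

```bash
#!/bin/bash
# Ceiling division: number of chunks needed to cover total_s seconds
# when each chunk holds chunk_s seconds of video.
estimate_chunks() {
    total_s=$1
    chunk_s=$2
    echo $(( (total_s + chunk_s - 1) / chunk_s ))
}

estimate_chunks 3600 20   # one hour at ~20 s per chunk, prints 180
```

Under that assumption an upper bound of 300 leaves comfortable headroom for an hour of content. Longer videos need a larger bound.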
Let's see what happens when we run this little script:

$ cat downloader.sh
#!/bin/bash

for F in $(seq 5 5); do
    for N in $(seq 0 300); do
        echo "Downloading Frag $F Num $N"
        curl -L --compressed --header "Accept-Language: en-US,en;q=0.9,ro;q=0.8" --header "Connection: keep-alive" --header "Origin: https://hqq.tv" --header "Referer: https://hqq.tv/sec/player/embed_player_2048452040101183.php?iss=ODAuOTcuMjM4Ljc3&vid=RhzS9QBcmImK&at=b3ed763ec5f85212ad4e9c275a4094a9&autoplayed=yes&referer=on&http_referer=aHR0cHM6Ly93YWF3LnR2L3dhdGNoX3ZpZGVvLnBocD92" --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36" "https://hty4e3.vkcache.com/secip/0/1V6iYKTjWZJtAfOEC39TRg/ODAuOTcuMjM4Ljc3/1557435600/hls-vod-s1/flv/api/files/videos/2017/11/14/1510638299qz9g3.mp666Frag${F}Num${N}" > "Frag${F}Num${N}.ts"
    done
done

At some point the chunks will start to output 0 bytes downloaded - that's how you know when to stop.
When it's done you should be left with a bunch of Frag5Num***.ts files in your current directory. Take your time and test a few files (make sure they play in VLC and that file/mediainfo output makes sense).
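
The loop can also stop by itself instead of always running to the upper bound. A minimal sketch, assuming the server answers requests for missing chunks with an HTTP error status such as 404, which curl's --fail option turns into a non-zero exit code. The function names and the output prefix are made up for illustration, and the request headers from your own curl command still need to be added to fetch_chunk:

```bash
#!/bin/bash
# Fetch one chunk: $1 is the URL, $2 the output file.
# Returns non-zero on HTTP errors because of --fail.
fetch_chunk() {
    curl -L --fail --silent --compressed -o "$2" "$1"
}

# Download chunks first..last and stop at the first one that fails.
# $1: URL up to (not including) the chunk number, $2: output file prefix.
download_until_missing() {
    base_url=$1; out_prefix=$2; first=$3; last=$4
    n=$first
    while [ "$n" -le "$last" ]; do
        if ! fetch_chunk "${base_url}${n}" "${out_prefix}${n}.ts"; then
            rm -f "${out_prefix}${n}.ts"   # drop any partial or empty file
            echo "Chunk $n failed, stopping"
            break
        fi
        n=$((n + 1))
    done
}

# Usage (URL_PREFIX is a placeholder ending in "...Frag5Num"):
# download_until_missing "$URL_PREFIX" Frag5Num 0 300
```

Stopping at the first failure is a design choice that saves requests, but a single transient error would cut the download short, so it is worth checking the last chunk number against the expected length.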

Next, let's delete "empty" files. They're not exactly empty, but should contain an error response:

$ cat Frag5Num299.ts
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /hls-vod/flv/api/files/videos/2017/11/14/1510638299qz9g3.mp4Frag5Num299.ts was not found on this server.</p>
</body></html>

$ find . -name "*.ts" -size -10k -print -delete
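
The size threshold is a heuristic. A stricter check, sketched here under the assumption that valid chunks are plain MPEG transport streams (as the file output earlier suggests), looks for the 0x47 sync byte that starts every 188-byte TS packet. The helper name is hypothetical:

```bash
#!/bin/bash
# Succeeds if bytes 0 and 188 of the file are both the TS sync byte 0x47.
# Files shorter than 189 bytes fail the check.
looks_like_mpegts() {
    b0=$(od -An -tx1 -N1 "$1" 2>/dev/null | tr -d ' \n')
    b188=$(od -An -tx1 -j188 -N1 "$1" 2>/dev/null | tr -d ' \n')
    [ "$b0" = "47" ] && [ "$b188" = "47" ]
}

# Usage: list suspicious chunks so they can be inspected before deleting.
# for f in ./*.ts; do looks_like_mpegts "$f" || echo "not MPEG-TS: $f"; done
```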

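Now we need to build the list of chunks that ffmpeg will concatenate into one MPEG-TS file, with this script (adjust as needed):

$ cat concatenate.sh
#!/bin/bash
# Build the input list for ffmpeg's concat demuxer from the chunks that exist.
base=Frag5Num
: > filelist.txt    # start from an empty list so reruns don't duplicate entries
for N in $(seq 0 300); do
    if [ -f "${base}${N}.ts" ]; then
        echo "file '${base}${N}.ts'" >> filelist.txt
    fi
done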

$ bash ./concatenate.sh

We can now run ffmpeg to stitch the chunks together in one file:

$ ffmpeg -f concat -safe 0 -i filelist.txt -c copy output.ts

So - what can the media provider do against this kind of attack? Well, lots, actually...
* they can use one-time URLs/headers that expire once used. But this would add complexity on their end and would kill their load balancers/caches.
* they can use harder-to-guess chunk names. The player normally receives a list of them, so they don't need to be consecutive. But the attack would then shift to intercepting that list, which removes the guesswork (I was too lazy to look for it).
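
On the second point: the hls-vod part of the URL hints that the list is an HLS playlist, though that is an assumption, since the playlist itself was not captured here. In an .m3u8 media playlist, lines starting with # are tags or comments and every other non-empty line is a segment URI, so pulling the chunk names out of an intercepted playlist could be sketched as:

```bash
#!/bin/bash
# Read an HLS media playlist on stdin and print one segment URI per line.
list_segments() {
    tr -d '\r' | grep -v -e '^#' -e '^[[:space:]]*$'
}

# Usage: list_segments < playlist.m3u8 > segments.txt
```

If the intercepted file is a master playlist, the lines printed would be the per-quality playlists rather than the chunks, and one of those would need to be fetched and parsed the same way.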

Leave your ideas/suggestions in the comments below

ACiD-One said…

Hi! Thanks for this, I was able to download a video from hqq using your method, combined with something I found.

To get the URL for the video (the one we're going to use for curl), you can use the inspector tool, but as you know they block the dev tools and prevent the video from loading if they detect it's open. After I stumbled upon your blog post I ran some tests myself and had no luck with Wireshark, so I investigated their JavaScript and found that the function they use to block the dev tools looks something like this:

(this is pseudo code from the top of my head)

if(cookie('userid') != 1){
detectDevtools();
}

So you can easily bypass it by creating the cookie yourself with that name (userid) and "1" as value. I tried it and it worked perfectly. Now you can refresh the browser with your dev tools open and it will let you play the video, after that you can easily identify the url :)

I then used your script to download the fragments and it worked like a charm! This could probably be automated in Python with Selenium, as doing it manually is a pain!

Greetings from Venezuela.

September 20, 2019 at 5:42 AM

al said…

Hello Adrian Popa, I am Romanian too, but now I live in Greece!
I want to download this Greek movie, but it's too complicated with Wireshark.
I think the solution proposed by ACiD-One from Venezuela is much easier and simpler!
Unfortunately ACiD-One didn't explain in detail how to do it!
Maybe you can help me and explain it, because I am not too experienced!
How can I find the cookie, and where must I set my own cookie to block the one which triggers the stupid detectDevtools, so I can finally identify the URL of the video?
By the way, I am reading and enjoying the superb Odroid Magazine!
Thank you for your proposal!
Thank you very much!

February 11, 2020 at 6:55 PM
