A Brief Sojourn in Network Security

March, 2021


I know so much more about Wikipedia now than I ever meant to learn. I've probably forgotten more about Wikipedia's inner workings than most of its power users and top contributors ever need to learn. Over the past few years, I've spent a lot of moonlighting hours decompressing from physics, stuck in bus stations, frustrated at other projects and taking out that frustration on what this project could have been, only to find that every question my collaborator and I had sought to ask had been answered more than a year before we'd begun working on it, most comprehensively in Tao Wang's 2015 doctoral thesis, done at the CrySP lab at Waterloo and brought to my attention recently by my friend (and CrySP alum) Anna Lorimer. Many thanks to Anna. At the time of writing, Dr. Wang is a professor of computer science at HKUST, and Anna is a cryptography researcher at the University of Chicago.

It all started in the Winter of 2016. My friend Tim McLean and I were bored and looking for something to do, and we started playing around with some ideas about website fingerprinting. Tim's a security engineer, primarily, and had been spending a lot of his time in those days reading about side-channel attacks. HTTP/2 had only come out that past May. There was a lot to be learned by civilians like me about how information was transferred over the internet in the first place, let alone in this new framework. We had some notion of being able to identify websites from their packet sizes, in load order.

Now, I don't really know the first thing about the Internet Protocol or the Transmission Control Protocol, or much about networking generally. But what I did have was a working understanding of statistical modelling and a bit of time on my hands. I still don't have much of a grasp of the technical points, so if what I write has some errors in it, it will not come as a shock to me (and if you spot one or many, do let me know).

My understanding is that as a webpage is loaded, it calls things in a particular order. First, you might have some site- or process-specific overhead: things like logos or navigation icons that don't change from page to page but must be loaded with each one. The page itself might be some HTML and then some associated CSS, JavaScript, and media. The HTML needs to come in first, because it's what dictates where everything else goes; the other items just get called from it. If that's right, it explains why a poor load job leaves you with a very bare-bones white page with no formatting and big walls of text.
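To make that concrete, here's a small sketch of my own (purely an illustration, not anything we actually wrote for the project) that walks an HTML document and lists the sub-resources it references, in document order. The point is just that the browser can't start fetching any of the rest until the HTML itself has arrived.

```python
# Illustrative only: list the resources an HTML page references, in order.
from html.parser import HTMLParser

class ResourceLister(HTMLParser):
    """Collects stylesheet, script, and media URLs as they appear in the HTML."""
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "stylesheet":
            self.resources.append(("css", attrs.get("href")))
        elif tag == "script" and "src" in attrs:
            self.resources.append(("js", attrs["src"]))
        elif tag in ("img", "audio", "video", "source") and "src" in attrs:
            self.resources.append((tag, attrs["src"]))

parser = ResourceLister()
parser.feed("<html><head><link rel='stylesheet' href='site.css'>"
            "<script src='app.js'></script></head>"
            "<body><img src='logo.png'></body></html>")
print(parser.resources)  # [('css', 'site.css'), ('js', 'app.js'), ('img', 'logo.png')]
```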

Tim, for his part, wrote a script for us to load the web pages and collect the traffic data, which for our purposes meant the URLs and the packets, as well as the time and date at which we collected them. We then assembled that into a database. This database would be used for comparison with the data collected from an unwitting target. That target data wouldn't include the URLs and wouldn't have the date and time associated with it. All that would be available to the attacker would be the target's packet sizes, and which website they were on (but not which specific page). I'll explain the approach we took both in words and with a pretty diagram.
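For a sense of the shape of that data, here's a hedged sketch of the kind of record I mean. The field names, the URL, and the packet sizes are all my own made-up illustration, not Tim's actual schema: a database entry pairs a URL with the ordered packet sizes seen while loading it, plus a timestamp, whereas a target trace keeps only the packet sizes.

```python
# Illustrative only: a guess at the shape of the records, not our real schema.
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class Trace:
    packet_sizes: List[int]                   # packet sizes, in the order they arrived
    url: Optional[str] = None                 # known for database entries, absent for targets
    collected_at: Optional[datetime] = None   # likewise only known on our side

# A database entry (we know what we loaded and when; sizes are invented):
db_entry = Trace([1500, 1500, 1432, 640, 1500],
                 "https://en.wikipedia.org/wiki/HTTP/2",
                 datetime(2020, 9, 12))

# A target observation (all the attacker sees is the sizes):
target = Trace([1500, 1500, 1432, 655, 1500])
```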

[Figure: diagrammatic description of our statistical attack process]

In point of fact, this diagram details more than I managed to accomplish before finding out we'd gotten scooped before even beginning. I never got around to doing the page-view correlations, but I imagine I would've tried to see which articles might have linked to each other, or come up with some categorization scheme following the categories many wiki articles already dutifully provide. The frequency analysis would have been a last-ditch attempt to increase accuracy: guess whichever of the remaining pages was more popular. We probably would have only used that in the unlikely event of a tie, as in the sketch below.
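Since we never built it, this is purely hypothetical, with made-up view counts, but the tiebreak I had in mind would have amounted to something like this: among candidates that score equally well, guess the most-viewed page.

```python
# Hypothetical tiebreak (never implemented): among equally good matches,
# guess the more popular page. The popularity numbers are invented.
candidates = ["Apple", "Apple_Inc.", "Apple_(album)"]
page_views = {"Apple": 120_000, "Apple_Inc.": 450_000, "Apple_(album)": 8_000}

best_guess = max(candidates, key=lambda page: page_views.get(page, 0))
print(best_guess)  # Apple_Inc.
```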

As for the direct matching, I wrote a few scripts that post-processed the records for each URL to make sure the vectors would all be the same length, so that we might compare pages sans media against pages which included images, sound, and/or video. Then I put those vectors, now usually something like 13-dimensional, into a cosine-similarity test. After a bit of tweaking one Saturday night a couple of weeks ago, I'd managed to get a match rate of just over 30% between a database and some generated target data collected six months apart from each other. That was an exciting result. We'd spent some time on this over the years: a few late nights here and there, a few hours tinkering away getting the code together, me getting familiar with LaTeX-ing a proper article in multiple columns! It was nice to think that maybe it would go somewhere after all. We even had a title agreed upon; we were gonna call the paper "Handshake the Devil: Statistical Analysis of TLS Traffic Records and Cryptanalytic Implications". But alas...
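If you want a picture of what that matching step looks like, here's a minimal sketch of the idea as I describe it above: pad each packet-size trace to a common length, then score a target against every database entry by cosine similarity and take the best match. The details (zero-padding, numpy, the invented packet sizes and URLs) are my choices for illustration, not necessarily what my actual scripts did.

```python
# Sketch of the direct-matching step: fixed-length vectors + cosine similarity.
import numpy as np

def pad(sizes, length):
    """Zero-pad (or truncate) a packet-size list to a fixed dimension."""
    v = np.zeros(length)
    v[:min(len(sizes), length)] = sizes[:length]
    return v

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(target_sizes, database, dim=13):
    """Return the database URL whose trace is most similar to the target."""
    t = pad(target_sizes, dim)
    return max(database, key=lambda url: cosine(t, pad(database[url], dim)))

database = {  # made-up packet sizes, purely for illustration
    "https://en.wikipedia.org/wiki/Tor_(network)": [1500, 1500, 1432, 640, 977],
    "https://en.wikipedia.org/wiki/HTTP/2":        [1500, 1500, 1500, 1210, 388, 1500],
}
print(best_match([1500, 1500, 1444, 630, 960], database))
```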

I asked Anna if she thought this was a direction worth moonlighting on. She was encouraging, but followed up fairly quickly with the thesis linked above, and all hopes were dashed. According to her, and she is an authority, website-fingerprinting attack research these days mostly runs on machine learning. I've got no serious background in ML, so that's a bridge too far for the time being, and I expect that by the time I've got cause to learn enough, the avenue will have been explored. So for the first time, I took all the data and all the code we wrote, and all the papers we read over those five years, and I put them on a USB drive which now rests in a box in the depths of my desk drawer. It's a sad but relieving feeling to put something like that away. I'll be holding onto some of the nice memories of our working sessions, as well as much of the knowledge I gained, for quite some time.