Machine learning isn’t a silver bullet for network security.
I know, right now you’re saying, “But wait, Matt, all I hear is that every hot new company in the space is based on machine learning and that VCs are funding machine learning companies in record numbers, even with the economy in question.”
And you’re right, they are – usually for very good reasons.
I am not advocating against an entire field of study, only against its recent inappropriate application in analyzing the entirety of your network.
Algorithmic learning theory, clustering, self-organizing maps and all that other neat-sounding stuff could be very useful in specific areas of security, under the right circumstances.
For instance, UBA and EDR are very interesting and seem plausible from a volumetric standpoint. I think that what companies like Exabeam and Cylance are doing is very interesting and shows a lot of promise. There are also very viable technology solutions outside of the network security realm that are doing great things with ML.
What I take issue with is the notion that machine learning models can be effectively applied to network detection as the principal means of detecting complex attacks. I’ve sat through countless vendor presentations, evals and partnership opportunities over the last two years and have observed one of the following outcomes in all of them:
Data and Feature Selection is Bad
The data scientists looking at the network protocol data we’ve captured typically choose to extract byte and packet counts as features, so that they can detect a deviation in network usage. The usual story here is that this type of algorithm can detect someone who is suddenly “working” in off hours. The underlying assumption is usually “someone else is on that machine!” or “Malware is exfiltrating the super secret sauce!”
The reality is that it typically ends in a conversation with the end user consisting of “Yes, I have been backing up my entire hard drive to Dropbox for the last 2 weeks” or “You, um, weren’t looking at which torrents I was seeding, right? Cool, then let’s just call them Nintendo ROMs.”
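To make that false-positive problem concrete, here is a minimal sketch of the kind of volume-deviation check described above. The function name, the z-score threshold and the sample numbers are all illustrative assumptions, not any vendor's actual model.

```python
from statistics import mean, stdev

def volume_outliers(baseline_mb, observed_mb, z_threshold=3.0):
    """Flag observed daily volumes whose z-score against the baseline exceeds the threshold."""
    mu = mean(baseline_mb)
    sigma = stdev(baseline_mb)
    flagged = []
    for day, value in enumerate(observed_mb):
        z = (value - mu) / sigma if sigma > 0 else 0.0
        if abs(z) > z_threshold:
            flagged.append((day, value))
    return flagged

# Hypothetical daily upload volume (MB) for one workstation.
baseline = [120, 135, 110, 128, 140, 125, 132]
# On day 2 the user starts backing up a hard drive to cloud storage.
observed = [130, 122, 4800, 5100, 4950]

alerts = volume_outliers(baseline, observed)
print(alerts)  # the three backup days are flagged
```

The check fires on every backup day, and nothing in a byte count can tell a backup from an exfiltration. Someone still has to go and ask the user.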
Cost to Performance for Data Size is Abysmal
The number of features we do want to analyze with ML causes a ghastly look of fear and loathing in most data scientists. This is because the values vary wildly and there is a MASSIVE amount of data. PacketSled captures all network traffic from layer 2 to layer 7 and we retain it for a long time (our smallest customer ingests and stores about 100 million events per day). The typical response here is something akin to “doing this work across this dataset is simply cost prohibitive unless you only want to look for very specific problems.”
Just as we don’t want overly broad detections that bury us in false positives, we don’t want to look only for specific, minute problems. That’s the problem that we had with signature-only approaches. They were too specific and only meaningful during a small snapshot in time. ML was supposed to be better than that.
Malicious Garbage In, Nothing Out?
Even if platforms that use ML as a primary method of detecting bad stuff could ingest and process all the data, extract all the features we want, and cluster them all appropriately, there is a massive philosophical issue at play here – the sanctity of customers’ baselines. The same vendors who are pitching machine learning as their core technology advantage will be the first to cite the Verizon statistics – “nearly 100% of networks are compromised!”
Ok, so then how does your model get a clean baseline of the network traffic from a dirty network? Show me that trick and I will show you how to moonwalk from North Beach to Alcatraz.
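A toy illustration of the dirty-baseline problem, assuming a deliberately simple "model" that treats every source and destination pair seen during training as normal. The hostnames and function names are hypothetical.

```python
def learn_baseline(flows):
    """Treat every (source, destination) pair seen during training as normal."""
    return {(f["src"], f["dst"]) for f in flows}

def novel_flows(baseline, flows):
    """Return flows whose (source, destination) pair was never seen in training."""
    return [f for f in flows if (f["src"], f["dst"]) not in baseline]

# Training traffic from a network that was already compromised.
training = [
    {"src": "10.0.0.5", "dst": "mail.example.com"},
    {"src": "10.0.0.5", "dst": "c2.badhost.example"},  # existing beacon
]
live = [
    {"src": "10.0.0.5", "dst": "c2.badhost.example"},   # beacon continues
    {"src": "10.0.0.5", "dst": "newsite.example.org"},  # harmless new site
]

baseline = learn_baseline(training)
alerts = novel_flows(baseline, live)
print([f["dst"] for f in alerts])
```

The existing beacon is learned as normal and passes silently, while the harmless new site is the only thing flagged. A more sophisticated model changes the math, not the underlying problem.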
As if the depth problem, the feature problem, and the baseline issue weren’t enough, there is the issue of time. Machine learning algorithms need to be limited to a bounded time window in order to ingest data in sizes that can be processed meaningfully. Analyzing data over very lengthy periods of time, even if you’re only looking at a handful of attributes, causes serious performance issues, if not outright failures.
Long-running and widely scoped memory is necessary. Imagine IBM’s Watson running on a Palm Pilot with only an SD card’s worth of knowledge – a 10-year-old with Google would most certainly be able to answer questions with greater accuracy, faster. Incidentally, it is important to point out that machine learning algorithms could never do the job that Watson does.
So what happens when a user downloads an arbitrary executable off the internet and executes it, and it lies dormant for 30 days, then phones home? I can tell you one machine that’s learning something there – the one on the other end of that command and control session back to somewhere nefarious. Without a long-term forensic data set telling you what happened 30 days ago, you’re in big trouble.
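The retention point can be sketched in a few lines. This is a hypothetical lookup, assuming events are simple records with a host, a type and a timestamp; the field names, dates and file names are made up for illustration.

```python
from datetime import datetime, timedelta

def find_origin(events, callback, retention_days):
    """Return executable downloads by the calling host that fall inside the retention window."""
    cutoff = callback["time"] - timedelta(days=retention_days)
    return [
        e for e in events
        if e["host"] == callback["host"]
        and e["type"] == "exe_download"
        and cutoff <= e["time"] < callback["time"]
    ]

events = [
    {"host": "10.0.0.5", "type": "exe_download",
     "time": datetime(2016, 1, 1, 9, 0), "name": "setup.exe"},
    {"host": "10.0.0.7", "type": "exe_download",
     "time": datetime(2016, 1, 29, 9, 0), "name": "tool.exe"},
]
# The dormant executable phones home 30 days after it was downloaded.
callback = {"host": "10.0.0.5", "type": "c2_callback",
            "time": datetime(2016, 1, 31, 9, 0)}

short_window = find_origin(events, callback, retention_days=7)
long_window = find_origin(events, callback, retention_days=90)
print(len(short_window), [e["name"] for e in long_window])
```

With a 7-day window the download that started it all is already gone. With 90 days of history the callback can be tied back to the original file.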
Your network is a living organism that is constantly evolving, and it is chaotic. Baselining a chaotic moving target is not just impractical – it is impossible.
That said, machine learning does have a place in network security. We need to use ML models as an atomic input to a chain of events that paints a bigger picture. We can’t ask ML to sort through billions of objects, both in real time and historically, and solve for any meaningful number of scenarios.
What the enterprise needs is not a magic math robot that observes all things. We need to package the knowledge of security experts by automatically chaining micro-analytics, threat intelligence and metadata with forensically sound network traffic and files to understand and mitigate attacks in record time. We need to embed expert logic into our approach and make it possible for IR folks to stop running down false positives and do their actual jobs, responding to security events.
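As a rough sketch of what chaining could look like, the example below treats an ML anomaly score as just one weak signal alongside a threat intelligence match and a file metadata check, and escalates only when several agree. Every function name, field and threshold here is a hypothetical stand-in, not a description of any product.

```python
def ml_anomaly(evt):
    # Stand-in for an ML model's output; the 0.8 cutoff is an arbitrary assumption.
    return evt.get("anomaly_score", 0.0) > 0.8

def intel_match(evt, bad_domains):
    # Destination appears on a threat intelligence list.
    return evt.get("dst") in bad_domains

def unsigned_executable(evt):
    # File metadata: an executable without a valid signature.
    return evt.get("file_type") == "exe" and not evt.get("signed", False)

def assess_host(events, bad_domains, min_signals=2):
    """Collect which weak signals fired for a host; escalate only when several agree."""
    signals = set()
    for evt in events:
        if ml_anomaly(evt):
            signals.add("ml_anomaly")
        if intel_match(evt, bad_domains):
            signals.add("threat_intel")
        if unsigned_executable(evt):
            signals.add("unsigned_exe")
    return {"signals": sorted(signals), "escalate": len(signals) >= min_signals}

bad_domains = {"c2.badhost.example"}

# A user running a big backup: only the ML signal fires.
backup_host = [{"dst": "backup.example.com", "anomaly_score": 0.95}]
# A host that pulled an unsigned executable and then talks to a known-bad domain.
infected_host = [
    {"dst": "cdn.example.net", "file_type": "exe", "signed": False},
    {"dst": "c2.badhost.example", "anomaly_score": 0.9},
]

print(assess_host(backup_host, bad_domains))
print(assess_host(infected_host, bad_domains))
```

The backup user produces a single signal and stays out of the IR queue, while the infected host trips three independent checks and gets escalated.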