Sorting Haystacks to Find CTI Needles
CTI systems face several major problems, ranging from the size of the collection network to its diversity, which ultimately affect the level of trust that can be placed in its signals. Are they fresh enough and reliable enough to avoid false positives or poisoning? Am I taking the risk of acting on outdated data? The difference between information and actionable information is large: a piece of information is just a decision aid, whereas an actionable piece of information can be directly weaponized against an aggressor. If raw data is a hayfield, information is a haystack, and actionable signals are the needles.
To illustrate the size and variety problem of a collection network, without naming anyone in particular, let’s imagine you are a large CDN provider. Your role is to deliver content over HTTP, at scale. This attracts a lot of “attention” and signals, but only at the HTTP layer. Also, any clever attacker would probably avoid scanning your IP ranges (which are public and known from your AS). Therefore, you only receive indiscriminate “Gatling gun” scanners or direct attacks via the HTTP layer. This is a very narrow focus.
Now, if you are a big EDR/XDR vendor, or any glorified antivirus, you can also argue that you have a very large detection network spanning millions of devices… from rich companies. Because let’s be honest, not every nonprofit, public hospital, or local library can afford these tools. Therefore, you potentially only see threats aimed at sophisticated, well-funded targets, and mostly those carried by malware on LAN machines.
On the honeypot front, there is no silver bullet either. The “Gatling gun scanners” represent the Internet’s background radiation: a kind of static noise that is constantly present around Internet-connected devices. Here, the problem is that no decent cybercriminal group is going to spend meaningful resources on honeypot machines. What’s the point of investing DDoS resources to knock down a straw doll? Would you use exploits or meaningful tools, let alone burn your IP, on a “potential” target? Honeypots collect “intent” and automated exploits, something like “This IP wants to know if you are (still) vulnerable to log4j”.
Honeypots can be attractive to some extent, but they are limited to low-hanging fruit. Also, your diversity is limited by your ability to spread across places. If all your probes (honeypots) sit in ten or, worse, only three or four distinct clouds, you can’t see everything, and you can be “evaded”, meaning criminals can deliberately skip your IP ranges to avoid detection. You’ll also need to set up deployment systems for each platform, and you’ll still only see the IPs that are not avoiding GCP, AWS, or whatever cloud you’re using. And since the provider is not an NGO, the size of your network is also limited by… money. If an automated honeypot running on XYZ cloud costs you $20 per month, your pockets must be deep to run thousands of them.
To curb the trajectory of mass cybercrime, we need to act on an inherently limited resource, otherwise you can’t create a proper “shortage”. The famous Conti leaks highlight the true pain points of major cybercrime groups. Obviously (crypto) money laundering, recruitment, payroll: the classics, as you’d expect. But interestingly enough, when you read the exchanges on their internal chat system, you can see that acquiring IPs, changing them, borrowing, renting, or cleaning them, installing tools, migrating operations and C2, etc. is expensive, both in terms of time and money.
Hash variations are almost unlimited: SHA-1 alone offers a space of 2^160 possibilities. So collecting hashes is one thing, but you can be almost certain that each new malware variation will have a different signature. Most decent cybercriminal groups’ CI/CD procedures already include a one-byte modification before sending the payload to the target.
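To make the one-byte point concrete, here is a minimal Python sketch. The payload bytes are made up for illustration; the only thing it shows is that appending a single byte produces an unrelated SHA-1 digest, so a blocklist keyed on the original hash misses the variant.

```python
import hashlib

# Hypothetical payload, used only for illustration.
payload = b"MZ\x90\x00 pretend this is a malicious binary"

# The "one-byte modification" step: append a single junk byte.
variant = payload + b"\x00"

original_digest = hashlib.sha1(payload).hexdigest()
variant_digest = hashlib.sha1(variant).hexdigest()

# A blocklist keyed on the original hash does not match the variant.
blocklist = {original_digest}
print(original_digest in blocklist)  # True
print(variant_digest in blocklist)   # False

# Size of the SHA-1 output space: 2^160 possible digests.
print(2 ** 160)
```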
Targeting domain names also means struggling against a practically infinite space. You can register domain1, domain2, domain3, etc. Technically, there is no limit to the number of variations. There are smart systems out there that protect your brand by checking whether any domain names similar to yours have been registered recently. This pre-crime style of system is especially helpful for dealing with future phishing attempts. You start being proactive with this kind of attitude and tools.
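As a rough idea of how such a brand-protection check could work, the sketch below compares newly registered domains against a brand domain using plain edit distance. The feed, the `lookalikes` helper and the distance threshold are all assumptions for illustration; real products likely use richer signals (homoglyphs, keyboard proximity, TLD swaps).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lookalikes(brand_domain, newly_registered, max_distance=2):
    """Return newly registered domains within max_distance edits of the brand."""
    return [d for d in newly_registered
            if d != brand_domain and edit_distance(d, brand_domain) <= max_distance]


# Hypothetical feed of recently registered domains.
feed = ["examp1e.com", "exarnple.com", "example.com", "weather-news.org"]
print(lookalikes("example.com", feed))  # ['examp1e.com', 'exarnple.com']
```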
Tracking and indexing malicious binaries based on their hashes or the C2s they try to contact, or even indexing IPs that try to auto-exploit known CVEs, is useful, but it remains a rather reactive attitude. You don’t counterattack by knowing the enemy’s position or tactics; you do it by disabling its offensive capabilities, and this is where IP addresses really get interesting. This system is decades old and will still be here after us.
Now there is a resource that is actually scarce: IPv4. The historical IP space is limited to about 4 billion addresses. Bringing the fight to this ground is efficient because, when resources are scarce, you can actually be proactive and burn IP addresses as soon as you notice they are being used by the enemy. That said, this landscape is constantly evolving. VPN providers, Tor, and residential proxy apps offer cybercriminals a way to borrow IP addresses, not to mention the fact that they can leverage already compromised servers available on the dark web.
So an IP address in use right now may not be in use an hour from now, and you then generate a false positive if you keep blocking it. The solution is a crowdsourcing tool that protects businesses of all sizes, in all types of places (geographies, clouds, homes, private corporate DMZs, etc.), and across all types of protocols. If the network is large enough, this IP rotation is not a problem: when the network stops reporting an IP you can drop it, whereas new IPs appearing in a number of reports need to be added to the block list. The bigger the network, the closer to real time it becomes.
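The add-and-drop logic described here can be sketched in a few lines of Python. This is not CrowdSec's actual consensus algorithm; the `CrowdBlocklist` class, the one-hour TTL and the three-reporter threshold are assumptions chosen only to show the idea: an IP is blocked while enough distinct participants have reported it recently, and it falls off the list by itself once reports stop.

```python
from collections import defaultdict

REPORT_TTL = 3600   # assumed: a report counts for one hour
MIN_REPORTERS = 3   # assumed: distinct reporters needed before blocking


class CrowdBlocklist:
    def __init__(self):
        # ip -> {reporter_id: timestamp of that reporter's latest report}
        self.reports = defaultdict(dict)

    def report(self, ip, reporter_id, now):
        self.reports[ip][reporter_id] = now

    def blocked(self, now):
        """IPs with enough distinct reporters inside the TTL window."""
        result = set()
        for ip, by_reporter in self.reports.items():
            fresh = [t for t in by_reporter.values() if now - t <= REPORT_TTL]
            if len(fresh) >= MIN_REPORTERS:
                result.add(ip)
        return result


blocklist = CrowdBlocklist()
for reporter in ("a", "b", "c"):
    blocklist.report("192.0.2.10", reporter, now=0)
blocklist.report("198.51.100.7", "a", now=0)  # a single report is not enough

print(blocklist.blocked(now=60))    # {'192.0.2.10'}
print(blocklist.blocked(now=7200))  # set(): reports expired, the IP is dropped
```

Counting distinct reporters rather than raw reports is a deliberate choice in this sketch: it makes it harder for a single participant to push an IP onto the list.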
You can monitor almost all protocols except UDP-based ones, which should be excluded because it’s easy to spoof packets over UDP. If you take reports on UDP-based protocols into account when banning IPs, you can easily be fooled. Other than that, every protocol is good to monitor. Also, you can definitely look for CVE exploitation but, even better, for behavior. That way, you can catch business-oriented aggression that is not necessarily CVE-based. A simple example, beyond the classic L7 DDoS, scanning, credential bruteforce, or stuffing, is scalping. Scalping is the act of automatically buying products on a website with a bot and reselling them for a profit, on eBay for example. This is a business-layer issue, not a security-related one. CrowdSec’s open-source system is designed precisely to enable this strategy.
Finally, for the past two decades, we’ve been told, “IPv6 is coming, get ready”. Well… let’s say we have had time to prepare. But it really does exist now, and 5G deployment will only accelerate its usage. IPv6 changes the stage with a new pool of 2^128 IP addresses. This is still limited in many ways, not least because not all IPv6 ranges are in use yet, but also because everyone gets multiple IPv6 addresses at once instead of just one. Still, we are now talking about a very large number of them.
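To put numbers on the two pools, and to show one possible way of coping with “multiple addresses per user”, here is a small sketch using Python's standard `ipaddress` module. Grouping IPv6 reports by /64 prefix is an assumption made for this example (a /64 is a common size for a single subscriber's allocation), not something the article prescribes; `ban_key` is a hypothetical helper.

```python
import ipaddress
from collections import Counter

print(2 ** 32)   # IPv4 pool: 4294967296 addresses
print(2 ** 128)  # IPv6 pool


def ban_key(ip: str, v6_prefix: int = 64) -> str:
    """Key used for counting reports: the address itself for IPv4,
    the enclosing prefix (assumed /64) for IPv6."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        return str(addr)
    network = ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False)
    return str(network)


# Hypothetical reports: three IPv6 addresses from the same /64, one IPv4.
reports = ["2001:db8:1:2::1", "2001:db8:1:2::beef",
           "2001:db8:1:2:aaaa::5", "203.0.113.9"]
print(Counter(ban_key(ip) for ip in reports))
```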
Let’s combine AI & Crowdsourcing
As data begins to flow massively from large crowdsourced networks, and as the resource you’re trying to shrink grows larger, AI sounds like a logical path to explore.
The network effect is already a good start on its own. An example here could be credential attacks. If an IP tries multiple login/password pairs on your premises, you’d call it credential bruteforce. Now, on a network scale, if the same IP knocks at different places using different login/password pairs, that’s credential stuffing: someone is trying to reuse stolen credentials in multiple places to see if they are valid. The fact that you’re seeing the same action, leveraging the same credentials, from multiple angles gives you an additional indication of the purpose of the behavior itself.
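The distinction drawn above can be expressed as a simple rule over pooled failed-login events. The sketch below is only an illustration of that rule: the event format, the `classify` helper and the `MIN_ATTEMPTS` threshold are assumptions, and a real system would also look at timing and success rates.

```python
from collections import defaultdict

MIN_ATTEMPTS = 3  # assumed: ignore IPs with fewer distinct credential pairs


def classify(events):
    """events: iterable of (ip, site, login, password) failed logins."""
    pairs_by_site = defaultdict(lambda: defaultdict(set))
    for ip, site, login, password in events:
        pairs_by_site[ip][site].add((login, password))

    verdicts = {}
    for ip, sites in pairs_by_site.items():
        total_pairs = sum(len(pairs) for pairs in sites.values())
        if total_pairs < MIN_ATTEMPTS:
            continue
        # Many pairs at one site: bruteforce. Pairs spread over sites: stuffing.
        verdicts[ip] = ("credential stuffing" if len(sites) > 1
                        else "credential bruteforce")
    return verdicts


# Hypothetical events pooled from three participants.
events = [
    ("192.0.2.1", "shop", "alice", "hunter2"),
    ("192.0.2.1", "shop", "alice", "123456"),
    ("192.0.2.1", "shop", "alice", "letmein"),
    ("198.51.100.9", "shop", "bob", "pw1"),
    ("198.51.100.9", "forum", "carol", "pw2"),
    ("198.51.100.9", "bank", "dave", "pw3"),
    ("203.0.113.4", "forum", "erin", "oops"),
]
print(classify(events))
```

Note that the second IP would look harmless to each individual site (one failed login each); it only stands out once the events are pooled.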
Now, to be honest, you don’t really need AI to tell credential bruteforce from credential reuse or credential stuffing, but there are places where it can excel, especially when working with large networks that produce lots of data.
Another example could be a massive Internet scan made using 1024 hosts. Each host scans only one port, so it may go unnoticed. Unless you notice, in many different places, that the same IP is scanning the same port in the same time frame. Again, barely visible on an individual scale, clearly visible on a large scale.
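A minimal sketch of that cross-site correlation, under assumed names and thresholds (`WINDOW`, `MIN_SITES`, and the event tuple layout are all hypothetical): count, per source IP and port, how many distinct participants saw a probe in the same time bucket.

```python
from collections import defaultdict

WINDOW = 600    # assumed: 10-minute time buckets, in seconds
MIN_SITES = 3   # assumed: distinct sites needed to flag a scanner


def coordinated_scanners(events):
    """events: iterable of (timestamp, site, src_ip, dst_port) probes."""
    sites = defaultdict(set)
    for timestamp, site, src_ip, dst_port in events:
        bucket = timestamp // WINDOW
        sites[(src_ip, dst_port, bucket)].add(site)
    return {(ip, port)
            for (ip, port, _bucket), seen in sites.items()
            if len(seen) >= MIN_SITES}


# Hypothetical probes reported by three participants.
events = [
    (0, "site-a", "192.0.2.77", 22),
    (120, "site-b", "192.0.2.77", 22),
    (300, "site-c", "192.0.2.77", 22),
    (30, "site-a", "198.51.100.3", 443),
    (4000, "site-b", "198.51.100.3", 443),  # same port, but far apart in time
]
print(coordinated_scanners(events))  # {('192.0.2.77', 22)}
```

Fixed buckets keep the example short, but a burst that straddles a bucket boundary can be split in two; a sliding window would avoid that.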
AI algorithms are good at identifying exactly these patterns, which would be invisible if you only looked at one place at a time but are striking at the scale of a large network.
Representing data in appropriate structures, using graphs and embeddings, can reveal complex levels of interaction between IP addresses, ranges, or even ASes (Autonomous Systems). This leads to the identification of cohorts of machines working in unison towards the same goal. If multiple IP addresses sequence an attack in several steps, such as scanning, exploiting, installing a backdoor, and then using the target server to join a DDoS attempt, the pattern may repeat itself in the logs. So if the first IP of the cohort is visible at a certain timestamp, the second one 10 minutes later, and so on, and this pattern repeats with the same IPs in multiple places, you can safely tell everyone to ban all four IP addresses at once.
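The cohort idea can be approximated without any machine learning, which helps to see what the graph representation buys you. In the sketch below (all names and thresholds are assumptions, and this is a plain graph heuristic rather than an embedding model), an edge links two IPs when one follows the other closely at several sites; connected components of that graph are candidate cohorts.

```python
from collections import defaultdict

MAX_GAP = 900   # assumed: max seconds between two consecutive steps
MIN_SITES = 2   # assumed: sites that must show the same hand-off


def find_cohorts(events):
    """events: iterable of (site, timestamp, ip). Returns a list of IP sets."""
    by_site = defaultdict(list)
    for site, timestamp, ip in events:
        by_site[site].append((timestamp, ip))

    # For each ordered IP pair, record the sites where one closely follows the other.
    pair_sites = defaultdict(set)
    for site, hits in by_site.items():
        hits.sort()
        for (t1, ip1), (t2, ip2) in zip(hits, hits[1:]):
            if ip1 != ip2 and t2 - t1 <= MAX_GAP:
                pair_sites[(ip1, ip2)].add(site)

    # Hand-offs seen at enough sites become edges of an undirected graph.
    graph = defaultdict(set)
    for (ip1, ip2), sites in pair_sites.items():
        if len(sites) >= MIN_SITES:
            graph[ip1].add(ip2)
            graph[ip2].add(ip1)

    # Connected components of that graph are the candidate cohorts.
    cohorts, seen = [], set()
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        cohorts.append(component)
    return cohorts


# Hypothetical logs: the same four IPs act 10 minutes apart at two sites.
steps = ["198.51.100.1", "198.51.100.2", "198.51.100.3", "198.51.100.4"]
events = [("site-a", i * 600, ip) for i, ip in enumerate(steps)]
events += [("site-b", 5000 + i * 600, ip) for i, ip in enumerate(steps)]
events.append(("site-a", 9000, "203.0.113.50"))  # unrelated visitor
print(find_cohorts(events))
```

With these events the function returns a single cohort containing the four `198.51.100.x` addresses, while the unrelated visitor is left out.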
The synergy between AI and crowdsourced signals allows each to effectively overcome the other’s limitations. While crowdsourced signals provide ample real-time data on cyberthreats, they may lack precision and context, ultimately leading to false positives. AI algorithms, on the other hand, usually only become relevant after ingesting large amounts of data. In return, the models can help refine and analyze these signals, removing noise and uncovering hidden patterns.
It’s a powerful couple to marry.