How does spam filtering work?

Spam filtering is a complicated set of rules to prevent the delivery of unwanted email. Sorry - it’s going to get technical in here!

Let’s start with how email is transported. As an example, let’s say homer331@gmail.com has sent an email to lisa.simpson@example.org. After Homer clicks “send”, Gmail may run some checks of its own on the email before sending it on. When Gmail is satisfied, it does a DNS lookup to find out where to send the email. DNS is a big topic, but let’s just say that it’s like a phone book for the internet. Gmail asks DNS for the mail exchanger (MX) records for example.org and finds that its email can be delivered to three hosts: spamf.radishnetworks.net, spamf2.radishnetworks.net, and spamf3.radishnetworks.net. These hosts can have different priorities, and let’s say that spamf.radishnetworks.net has the highest priority. Now Gmail opens a network connection to spamf.radishnetworks.net and the spam filtering begins!
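If you like to see things in code, here is a minimal sketch of how a sending server picks which host to try first. It does no real DNS lookup: the record list is hard-coded and the preference numbers are made up for illustration. The one real-world detail it relies on is that in MX records a lower preference number means a higher priority.

```python
from typing import NamedTuple


class MxRecord(NamedTuple):
    preference: int  # lower number = higher priority
    host: str


# Hypothetical MX records for example.org. The hosts come from the example
# above; the preference numbers are invented for this sketch.
MX_RECORDS = [
    MxRecord(20, "spamf2.radishnetworks.net"),
    MxRecord(10, "spamf.radishnetworks.net"),
    MxRecord(30, "spamf3.radishnetworks.net"),
]


def delivery_order(records):
    """Return the hosts in the order a sending server would try them."""
    return [record.host for record in sorted(records, key=lambda r: r.preference)]


order = delivery_order(MX_RECORDS)
print("Try first:", order[0])
print("Fallbacks:", order[1:])
```

If the first host can't be reached, the sender simply works its way down the list.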

The sending server communicates with the recipient’s spam filter and establishes an envelope. Well, that’s if the firewall allows it. Just like old-school mail, the sender says who the message is From and who it is To, and the envelope also carries a lot of record-keeping information. So now let’s switch perspective to the spam filter and have a look at what information we have. The spam filter has:

  • IP address: since this is an internet connection, the spam filter knows the IP address of the sending server

  • From field: in our example From is homer331@gmail.com

  • To field: in our example To is lisa.simpson@example.org

The spam filter has these three variables to work with and believe me, they are heavily leveraged! Let’s start with the IP address. The IP address is matched against blacklists and whitelists. If the IP address shows up on a blacklist then the spam filter politely ends the conversation right there. We have geoblocking capability, so messages can be blocked based on what country they come from. There is also a technology, commonly known as SPF (Sender Policy Framework), where the sender’s domain publishes which IP addresses are allowed to send its mail and what to do with a sender that is not on the allow list. Our spam filter looks up these records, checks whether the IP address is on the allow list, and checks whether the sender’s policy says to drop failures.

The From field is also subject to black/whitelists, and these can be set at the user, domain, or global level. In a Radish Networks quarantine report, you can click “Block” and that tells our spam filter to block that sender. Sometimes our administrators will manually add a From address to our global blocklist during a novel spam attack.

The last step in the envelope filtering is the “To” field. The spam filter verifies that the “To” address is valid and that its domain is in our list of accepted domains.
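To make the order of the envelope checks concrete, here is a rough sketch in Python. This is not our actual filter code. Every list, address range, and function name below is hypothetical (the IP ranges are reserved documentation ranges, not Gmail’s real ones), geoblocking is left out, and the sender allow list is a hard-coded stand-in for a real DNS lookup.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Envelope:
    client_ip: str  # IP address of the sending server
    mail_from: str  # envelope From
    rcpt_to: str    # envelope To


# Everything below is made-up example data.
IP_BLOCKLIST = [ipaddress.ip_network("203.0.113.0/24")]
# Stand-in for the sender domain's published allow list and failure policy.
SENDER_ALLOWED_NETS = {"gmail.com": [ipaddress.ip_network("198.51.100.0/24")]}
SENDER_POLICY_REJECT = {"gmail.com": True}
FROM_BLOCKLIST = {"spammer@bad.example"}
ACCEPTED_DOMAINS = {"example.org"}
VALID_RECIPIENTS = {"lisa.simpson@example.org"}


def domain_of(address):
    return address.rsplit("@", 1)[-1].lower()


def check_envelope(env):
    """Run the envelope checks in order and stop at the first failure."""
    ip = ipaddress.ip_address(env.client_ip)
    if any(ip in net for net in IP_BLOCKLIST):
        return "reject: IP on blocklist"

    sender_domain = domain_of(env.mail_from)
    allowed = SENDER_ALLOWED_NETS.get(sender_domain)
    if allowed is not None and not any(ip in net for net in allowed):
        # Only drop the message if the sender's own policy asks for that.
        if SENDER_POLICY_REJECT.get(sender_domain, False):
            return "reject: sender policy failure"

    if env.mail_from.lower() in FROM_BLOCKLIST:
        return "reject: sender on blocklist"
    if domain_of(env.rcpt_to) not in ACCEPTED_DOMAINS:
        return "reject: domain not accepted"
    if env.rcpt_to.lower() not in VALID_RECIPIENTS:
        return "reject: unknown recipient"
    return "accept"


homer = Envelope("198.51.100.7", "homer331@gmail.com", "lisa.simpson@example.org")
print(check_envelope(homer))
```

The cheap checks run first on purpose: if the IP address is already on a blocklist there is no point in looking at anything else.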

Whew! And that’s just the envelope! Just like old-school mail, email has an envelope and the message inside. Once the message has passed the envelope checks, the spam filter moves on to checking the message contents. Here we have a ton of tests but most are pretty intuitive:

  • Virus scanning based on signatures (Radish uses two scan engines)

  • Links are checked with black/whitelists

  • The entire message is ‘digested’ to a one-way hash and that hash is compared with blacklists

  • Macros are blocked

  • Attachments are ‘sandboxed’, which means they are uploaded to specialized virtual environments to see if they exhibit malicious activity (I’m not even kidding)

  • Attachments with specific extensions are blocked (like portable executable files)

  • Password-protected archives are blocked (these are pretty much always bad. If you need to 'send' an archive, use OneDrive, Nextcloud, Google Drive, Box, Dropbox, etc.)

  • Bayesian filtering which scores the “spam-i-ness” of the message. Good content decreases the score and bad content increases it.

  • Pattern matching is used mainly with novel spam attacks. Here an administrator will blacklist a message based on a specific phrase in the message.

  • Anti spoofing checks

  • “pen-pal” scoring. The idea here is that if Lisa sends Homer email then we can decrease the spam score for when Homer sends Lisa. This doesn’t count for a lot though, just in case Homer’s account gets hacked.

  • Our optional Link lock service can follow a chain of URLs just in case someone thinks they are being sneaky by using a URL redirector in front of their target. See more info here.
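A few of the content checks above are simple enough to sketch in code. The example below shows three of them: digesting a message to a one-way hash and looking it up in a blocklist, blocking attachments by extension, and a toy version of Bayesian scoring. Treat all of it as illustration only. SHA-256 is my assumption (any one-way hash would illustrate the idea), and the extension list and word probabilities are invented numbers, not anything taken from our production system.

```python
import hashlib
import math
from pathlib import PurePosixPath

# Hypothetical example data.
BLOCKED_EXTENSIONS = {".exe", ".scr", ".dll"}
KNOWN_BAD_MESSAGE = b"Claim your free prize now"
HASH_BLOCKLIST = {hashlib.sha256(KNOWN_BAD_MESSAGE).hexdigest()}

# Invented probabilities that a word shows up in (spam, good mail).
WORD_PROBS = {
    "prize": (0.30, 0.01),
    "free": (0.40, 0.05),
    "meeting": (0.02, 0.30),
    "homework": (0.01, 0.20),
}


def message_digest(raw):
    """Digest the whole message to a one-way hash."""
    return hashlib.sha256(raw).hexdigest()


def has_blocked_extension(filename):
    """Check only the final extension, so 'invoice.pdf.exe' is caught."""
    return PurePosixPath(filename.lower()).suffix in BLOCKED_EXTENSIONS


def bayes_score(text):
    """Sum of log-likelihood ratios for the known words in the text.

    Words more common in spam push the score up, words more common in
    good mail pull it down, and unknown words are ignored.
    """
    score = 0.0
    for word in text.lower().split():
        if word in WORD_PROBS:
            p_spam, p_good = WORD_PROBS[word]
            score += math.log(p_spam / p_good)
    return score


print(message_digest(KNOWN_BAD_MESSAGE) in HASH_BLOCKLIST)
print(has_blocked_extension("Invoice.PDF.exe"))
print(bayes_score("free prize inside") > 0)
print(bayes_score("homework meeting tonight") < 0)
```

Notice that an exact hash only catches byte-for-byte copies of a known message, which is one reason it is just one test among many.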

So that’s inbound scanning. But did you know we do the same set of filtering for outbound email too? Really! We need to ensure that our system doesn’t get blacklisted, and we are also very interested in any indicators of compromise on our customers' devices. Everything is scanned and logged, in both directions.

Those are pretty much all of the checks I want to make public, but I’ll chat just a bit about corrective actions. When you’re filtering on so many rules, mistakes can happen. Radish Networks spam filtering uses a scoring system, so one failed test might not drop the email. We also set the score thresholds that trigger each action. Actions include sending the message on as-is, or maybe sending it on but adding [SPAM] to the subject line. Or maybe we should quarantine the message and let the user decide on releasing it from a daily report. And of course we have the action of throwing the message in the trash. There are a LOT of configuration options and our new system allows us to delegate which options are available to whom.
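Here is a small sketch of how a score-and-threshold system like this can map test results to actions. The test names, weights, and thresholds are all hypothetical; the real values are configurable and I’m not publishing them.

```python
# Hypothetical weights: positive looks like spam, negative looks like good mail.
TEST_WEIGHTS = {
    "bayes_spammy": 3.0,
    "link_on_blocklist": 4.0,
    "spoofing_suspected": 2.5,
    "pen_pal": -1.0,  # small credit for an existing correspondent
}

# Hypothetical thresholds, checked from most to least severe.
DISCARD_THRESHOLD = 9.0
QUARANTINE_THRESHOLD = 5.0
TAG_THRESHOLD = 3.0


def total_score(triggered_tests):
    """Add up the weights of every test that fired for this message."""
    return sum(TEST_WEIGHTS[name] for name in triggered_tests)


def choose_action(score):
    if score >= DISCARD_THRESHOLD:
        return "discard"
    if score >= QUARANTINE_THRESHOLD:
        return "quarantine"
    if score >= TAG_THRESHOLD:
        return "tag"
    return "deliver"


def apply_action(action, subject):
    """Only the 'tag' action changes the subject line."""
    return f"[SPAM] {subject}" if action == "tag" else subject


score = total_score(["link_on_blocklist"])
action = choose_action(score)
print(action, "->", apply_action(action, "You have won"))
```

Because the result depends on the total rather than on any single test, one false alarm (say, a spammy-sounding but legitimate newsletter) usually isn’t enough to lose a message.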

Spam filtering is an absolute arms race. There are attempts to circumvent or exploit every single filter I’ve described. The technology employed today will inevitably be replaced in the future in the never-ending struggle against unwanted email.

Stay safe and think before you click!