THE BASICS OF SPAM FILTERING
What you need to know to pick the right tools to fight spam.
by Ben Westbrook, President, Mail-Filters.com

There are over 570 million electronic mailboxes on the Internet, and every one is a target for bombardment with unsolicited commercial e-mail (UCE), more popularly dubbed spam. No longer just a means of interpersonal communication, e-mail has become a powerful and cost-effective business tool. Nonetheless, we are close to swamping the greatest person-to-person communication tool since the advent of the postal system and the telephone.

To say that spam is a problem is the understatement of the year. It wastes resources, and resources are money. You pay for your bandwidth. You pay for your data storage. Do you want to pay to have unwanted, irrelevant e-mail messages clogging your inbox?

There are four ways to deal with junk e-mail:
  • Live with it.
  • Legislate against it.
  • Litigate against it.
  • Fight it with technology.

Living with spam is what most people do now. Unless you never give out your e-mail address, never publish that address on any site, and frequently change it for good measure, avoiding spam is almost impossible. Legislation and litigation make for great 8-second sound bites, but frequently run into the global nature of the Internet and, in the United States, something called the First Amendment. Only in the EU, where there is strong government support for privacy protection, has anti-spam legislation begun to advance significantly.

In the long run, an approach that combines all of these solutions could be effective, but we believe that technology has to play the major role in bringing spam under control, both now and in the long run. In the typical scenario, a spammer sends UCE through an ISP into the Internet. It is then delivered by another ISP to an organization’s e-mail server, which distributes it to users within that organization. Legislation and litigation are designed to stop a spammer from sending UCE in the first place, or at least to pressure ISPs into refusing to handle it.

Technology fights spam at two places: at the end user’s computer, and at the ISP’s or organization’s e-mail server. More than 80 vendors now offer solutions to fight the spam epidemic and most claim the exact same statistic: 90%+ effectiveness in catching spam. None of these tools seems to mention any statistics, however, on the rate of false-positive matches for spam. As a result, deciding which one is best is almost as frustrating as handling spam itself.

The greatest range of technology solutions for suppressing spam exists for individual end users. Some solutions reject e-mail unless the user expressly gives permission for it to pass. Some provide multiple disposable e-mail addresses to be used when exposing the user’s location during e-commerce or communications. Some are spam filters.

For the IT department running an organization’s e-mail server or an ISP running multiple e-mail servers, the technology solution always takes the form of a spam filter, which can be located either at that server or off-site. Such spam filters are often compared to anti-virus software. A far more useful comparison, however, is to a metal detector at an airport. If the detector is too sensitive, everyone gets stopped. If it is not sensitive enough, undesirable objects will get through.

Like a metal detector, a restrictive or rigorous spam filter, unless properly designed, is likely to incorrectly classify many legitimate messages as spam. These are the infamous false positives. What’s more, spam filters face a unique challenge: spam is subjective. No one would argue for receiving a particular computer virus on the grounds that it is valuable. One person’s spam, however, could easily be another person’s gold.

SPAM TECHNOLOGY CHECKLIST
For corporations and ISPs, the checklist for a technology solution to spam begins with a decision as to whether all e-mail must stay on-site. If this is not important, then a service that filters spam for you is a possible choice. If, however, you want the e-mail to stay on-site, you will need to use a filtering package installed on an on-site server. In either case, look for a filtering package with these characteristics:
  • It uses specific spam-filtering techniques rather than general techniques.
  • It can be customized at both the server and user level.
  • It is automatically updated by the vendor.
  • It does not delete spam, but tags it and files it where it can be inspected before it is disposed of.

As a result, spam-filtering packages need to be measured by two parameters: effectiveness and accuracy. Effectiveness is measured by the percentage of spam that is caught. This percentage should be as high as possible. Accuracy is measured by the percentage of legitimate e-mails incorrectly identified as spam (the false positives). This second percentage should be as low as possible. A well-designed package of spam filters can score well on both measures.
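To make the two measures concrete, here is a minimal sketch of how they might be computed from a hand-labeled batch of mail. The function name `filter_metrics` and the sample counts are illustrative assumptions, not figures from any vendor.

```python
def filter_metrics(results):
    """Compute (effectiveness, false-positive rate) as percentages.

    results: iterable of (is_spam, flagged) boolean pairs, one per message.
    Both names are hypothetical; the labels would come from manual review.
    """
    spam_total = spam_caught = legit_total = legit_flagged = 0
    for is_spam, flagged in results:
        if is_spam:
            spam_total += 1
            spam_caught += flagged      # True counts as 1
        else:
            legit_total += 1
            legit_flagged += flagged
    effectiveness = 100.0 * spam_caught / spam_total if spam_total else 0.0
    false_positive_rate = 100.0 * legit_flagged / legit_total if legit_total else 0.0
    return effectiveness, false_positive_rate


# Made-up sample: 10 spam messages (9 caught), 20 legitimate (1 wrongly flagged).
sample = ([(True, True)] * 9 + [(True, False)]
          + [(False, True)] + [(False, False)] * 19)
effectiveness, false_positive_rate = filter_metrics(sample)
print(f"caught {effectiveness:.1f}% of spam, "
      f"misflagged {false_positive_rate:.1f}% of legitimate mail")
```

A vendor quoting only the first number is telling half the story; both are needed to compare packages.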

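While many factors determine the effectiveness and accuracy of any given filter, four stand out: the filtering techniques it uses, how it can be customized, how it is maintained, and how it handles the messages it marks as spam.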

Any spam-filtering package should be able to take advantage of special codes issued by organizations and companies like TRUSTe and Habeas that identify legitimate communications. Alternatively, you may want the package to be able to check lists of known spammers such as the Realtime Blackhole List (RBL) of open relays maintained by MAPS.
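Many blocklists of this kind are published as DNS zones, and the common convention is to reverse the octets of the sending server's IPv4 address and append the list's zone name. The sketch below only builds that query name and performs no network lookup. The zone `dnsbl.example.org` is a placeholder, not the address of any real list, and the function name is hypothetical.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name a filter would look up for an IPv4 address.

    zone is the blocklist's DNS zone; the value used below is a placeholder.
    Raises ValueError if ip is not a valid IPv4 address.
    """
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


# 192.0.2.10 is a documentation-only address, used here purely as an example.
print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```

If the resulting name resolves, the address is on the list; how a filter acts on that answer is a policy choice for the administrator.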

Spam-filtering techniques can be classified into two categories: specific and general. Specific filtering techniques look for characteristics taken from actual spam messages, such as key phrases or words, the source of the spam, or the specific action requested of the recipient. General filtering techniques typically look for a series of suspicious elements in an e-mail, such as an ALL CAPITALIZED SUBJECT field, exclamation points in the subject field, or obviously fake names in the "From" field (including numbers, etc.). In general, specific filtering techniques are more accurate than general filtering techniques; however, they require significantly more maintenance by the vendor because they depend on someone seeing the actual spam message in order to create and update the spam signatures.
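The difference between the two categories can be sketched in a few lines. This is an illustration only: the signature phrases, function names, and one-point-per-trait scoring are assumptions made for the example, not any product's actual rules.

```python
import re

# Hypothetical signatures; a real vendor would derive these from observed spam.
SPAM_SIGNATURES = [
    "act now to claim your prize",
    "lowest mortgage rates guaranteed",
]


def specific_match(body):
    """Specific technique: look for phrases taken from actual spam messages."""
    text = body.lower()
    return any(signature in text for signature in SPAM_SIGNATURES)


def general_score(subject, from_name):
    """General technique: count the suspicious traits named in the text."""
    score = 0
    letters = [c for c in subject if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        score += 1  # ALL CAPITALIZED subject field
    if "!" in subject:
        score += 1  # exclamation points in the subject field
    if re.search(r"\d", from_name):
        score += 1  # numbers in the "From" name
    return score


print(general_score("FREE MONEY!!!", "bob12345"))
print(general_score("URGENT: Q3 BUDGET REVIEW", "Finance Dept"))
print(specific_match("Please ACT NOW to claim your prize today"))
```

The second call shows why general rules are less accurate: a legitimate but shouted subject line still picks up a point, whereas a signature match fires only on text already seen in real spam.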

Customizing or "tuning" a filter is essential to compensate for the subjective nature of spam. Nonetheless, many filters are characterized by simple slider bars or other mechanisms that provide a simple means to make the technique “aggressive” or “conservative,” which equates to a compromise between effectiveness and accuracy. The real goal is to increase both the filter’s effectiveness and its accuracy. As an alternative, some solutions enable a user or administrator to manually define new rules to tune. The best solutions are automated and allow for tuning at both an individual and domain level.
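The compromise built into a slider bar is easy to see with numbers. In the sketch below, each message carries a spam score between 0 and 1, and the slider is just a cutoff on that score. The scores, labels, and function name are invented for illustration.

```python
# Made-up (score, is_spam) pairs; a real filter would produce the scores itself.
scored = [
    (0.95, True), (0.80, True), (0.60, True), (0.40, True),
    (0.70, False), (0.30, False), (0.20, False), (0.10, False),
]


def rates_at(threshold, scored):
    """Return (fraction of spam caught, fraction of legitimate mail misflagged)."""
    spam = [score for score, is_spam in scored if is_spam]
    legit = [score for score, is_spam in scored if not is_spam]
    caught = sum(score >= threshold for score in spam) / len(spam)
    misflagged = sum(score >= threshold for score in legit) / len(legit)
    return caught, misflagged


# Moving from "conservative" (high cutoff) to "aggressive" (low cutoff).
for threshold in (0.75, 0.50, 0.25):
    caught, misflagged = rates_at(threshold, scored)
    print(f"cutoff {threshold:.2f}: caught {caught:.0%}, misflagged {misflagged:.0%}")
```

With this sample, every step that catches more spam also misflags more legitimate mail. Improving both at once requires better rules or per-user tuning, not a different slider position.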

As for maintenance, the filter supplier is best qualified to track the constantly changing tricks of the spammers. Updates to counter these tricks should be sent automatically to the filter at least daily, and probably even more frequently.

A user’s trust in a filtering solution is built up over time. Some prefer to receive all e-mail, regardless of whether it is marked as spam, until that trust is earned. A good filter solution allows the individual’s e-mail client to move marked spam into a suspected spam folder, away from the Inbox, but within easy inspection range. After a week or two, most individuals have an idea of the level of competence of their filtering solution and can feel comfortable either ignoring the suspected spam or automatically deleting it.
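A tag-and-file arrangement of this kind might look like the following sketch, in which the filter adds a header and the mail client routes on it. The header name `X-Suspected-Spam`, the folder names, and both functions are hypothetical; real products use their own conventions.

```python
from email.message import EmailMessage

SPAM_HEADER = "X-Suspected-Spam"  # hypothetical header name, not a standard


def tag_message(msg, is_spam):
    """Filter side: mark the message instead of deleting it."""
    if is_spam:
        msg[SPAM_HEADER] = "yes"
    return msg


def choose_folder(msg):
    """Client side: file marked mail away from the Inbox, still easy to inspect."""
    return "Suspected Spam" if msg.get(SPAM_HEADER) == "yes" else "Inbox"


suspect = EmailMessage()
suspect["Subject"] = "FREE MONEY!!!"
suspect.set_content("Act now.")
tag_message(suspect, is_spam=True)

wanted = EmailMessage()
wanted["Subject"] = "Lunch on Friday?"
wanted.set_content("Noon at the usual place?")
tag_message(wanted, is_spam=False)

print(choose_folder(suspect))
print(choose_folder(wanted))
```

Because nothing is discarded, a false positive costs the user a glance at a folder rather than a lost message.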

An alternative is to sideline the spam at the server. This saves bandwidth for the organization and reduces what an individual needs to know. It does, however, make it harder for individuals to review what has been marked as spam. That leaves it up to administrators to clear any false positives out of the suspected spam folder, so the filter must be highly accurate or it will unduly burden the IT department. Both methodologies are useful, depending on the organization. It is highly recommended not to automatically delete spam until the filter has been tuned for maximum effectiveness and accuracy.