REVERSE SPAM FILTERING
MetaNote 1: This is a work in progress.

MetaNote 2: The copyright notice above refers to my writing. I would be delighted if people adopt this strategy or terminology (“greenlist” etc.), and help refine it.

MetaNote 3: This page was mentioned in the 22 August 2002 issue of The Naked PC Newsletter, which has over 100,000 subscribers!

Keywords (for bots): spam fighting, junk email, UCE, UBE, filter, filtering, procmail, negative filter, reverse filter, inverse filter, The Art of War
Therefore, one hundred victories in one hundred battles is not the most skillful. Subduing the other’s military without battle is the most skillful.
The Art of War by Sun Tzu, Chapter 3
After years of thinking about, writing about, and filtering messages, I've decided that the best strategy for me is not to filter spam, but instead to filter non-spam and let the dregs, which are often spam, fall through my filters and land in catchall mailboxes. I then periodically open each catchall box, sort it by spam score, and visually scan it for non-spam. Since the non-spam messages, if there are any, bubble to the top of the sorted-by-spam-score box, I need to look carefully at only the top of each catchall mailbox. When I find a non-spam message in a catchall box, I “bounce forward” it to one of my magnet-updater email addresses so that my non-spam filters will catch this type of message in the future.
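The catchall review step can be illustrated with a minimal Python sketch. It assumes each message already carries a numeric spam score; the `Message` class, the helper names, and the sample addresses are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    subject: str
    spam_score: float  # assumed to have been assigned earlier by a scorer


def review_order(catchall):
    """Return the catchall messages with the least spammy first."""
    return sorted(catchall, key=lambda m: m.spam_score)


def top_candidates(catchall, limit=3):
    """Return the few lowest-scoring messages, the ones worth a careful look."""
    return review_order(catchall)[:limit]


# Made-up sample data for the sketch.
catchall = [
    Message("promo@example.net", "Cheap pills", 14.2),
    Message("old.friend@example.org", "Long time no see", 1.3),
    Message("lottery@example.com", "You have won", 9.8),
]

for m in top_candidates(catchall, limit=2):
    print(f"{m.spam_score:5.1f}  {m.sender}  {m.subject}")
```

Any non-spam that slipped past the greenlists and bluelists should appear near the start of this ordering, so only the first few entries need careful reading.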
For an example of a strategy that seems antithetical to the wisdom of The Art of War, see Paul Graham's article Filters that Fight Back. For a discussion about this article and some other strategies, see the 10-August-2003 thread Paul Graham: Filters that Fight Back at slashdot.org. An especially insightful comment is Re: And now by the mysterious Zeinfeld, a former experimental physicist.
Below is a flowchart that illustrates my strategy. Note that this flowchart leaves out a lot of the SMTP level filtering that is essential nowadays. For example, nowadays it is common for an SMTP server to:
My focus below is on aspects of the mail flow that are needed for my reverse spam filtering strategy. For example, recipient splitting (exploding) is essential so that each user can use his personal bluelist and personal greenlist to pre-filter his mail before expensive or error-prone direct spam filtering -- such as content-based filtering, Bayesian filtering, etc. -- is done.
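Recipient splitting and header injection might look roughly like the Python sketch below. The header names `X-Envelope-From` and `X-Envelope-To`, the function name `explode`, and the sample addresses are assumptions made for illustration; a real mail transfer agent does this internally and may use different header names.

```python
import copy
from email.message import EmailMessage

# Hypothetical header names; pick whatever your mail system actually injects.
ENVELOPE_FROM_HEADER = "X-Envelope-From"
ENVELOPE_TO_HEADER = "X-Envelope-To"


def explode(message, envelope_sender, envelope_recipients, local_domains):
    """Make one copy of the message per local envelope recipient.

    Each copy records the envelope sender and the single envelope recipient
    it was made for, so per-user filters can later consult that user's own
    bluelist and greenlist.
    """
    copies = []
    for rcpt in envelope_recipients:
        domain = rcpt.rpartition("@")[2].lower()
        if domain not in local_domains:
            continue  # not a local recipient, so no local copy is made
        instance = copy.deepcopy(message)
        instance[ENVELOPE_FROM_HEADER] = envelope_sender
        instance[ENVELOPE_TO_HEADER] = rcpt
        copies.append(instance)
    return copies


msg = EmailMessage()
msg["From"] = "alice@example.org"
msg["To"] = "list@example.org"
msg["Subject"] = "Hello"
msg.set_content("Hi there")

copies = explode(
    msg,
    "alice@example.org",
    ["me@example.com", "you@example.com", "other@elsewhere.example"],
    {"example.com"},
)
for c in copies:
    print(c[ENVELOPE_TO_HEADER])
```

Recording the envelope recipient matters because the To and Cc headers often do not reveal which of your addresses a message was actually delivered to (Bcc and mailing lists are the usual cases).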
A message sent to one or more of my addresses (& possibly other addresses)
||
\/
Receiving Message Transfer Agent on My Mail-Hosting Provider's Mail Server
||
\/
No MX or A record for hostname of envelope sender? =====> yes =====> SMTP-level reject
|| no
\/
Sent through an open relay or open proxy server? or Hostname of envelope sender is in a respected blocklist (for example RFC-ignorant)? or Caught by other SMTP level anti-spam techniques? =====> yes =====> SMTP-level reject
|| no
\/
“Explode/Split” message so there is a separate copy for each local recipient address
||
\/
Inject two headers: 1) original envelope sender and 2) original envelope recipient of this instance of the exploded message
||
\/
Local Message Delivery Agent (leaving the SMTP world; no longer kosher to bounce back)
||
\/
Global Server-Based Filters (e.g. snag viruses; should be user configurable)
||
\/
Does this seem like a virus? =====> yes =====> violet (virus quarantine) box¹
|| no
\/
Personal Filters (server-based or client-based or a combination of these)
||
\/
Sent to one of my secret magnet-updater addresses (possibly via Bcc or “bounce forward”)? =====> yes =====> magnet-updater program (updates blue & green magnets) =====> magenta (magnet) box
|| no
\/
Mailing list or other solicited bulk email? =====> yes =====> appropriate blue (bulk) box
|| no
\/
From a trusted non-bulk sender (but not From one of my addresses²)? =====> yes =====> green box
|| no
\/
From a sender who is trusted by one of my trusted senders³ (but not From one of my addresses)? =====> yes =====> lime green box [auto move undeleted messages to green box]²
|| no
\/
Analyze & tag with a spam score or probability
||
\/
Is spam score low? =====> yes =====> yellow box [auto move undeleted messages to green box]²
|| no
\/
red (rubbish) box [use mail client to list messages ordered by score; move non-spam to green box² & delete the rest]
---
¹ The word “box” can be interpreted to mean a traditional
mailbox, a bucket,
or a label. Labels, which are also called keywords, can be used to construct
virtual mailboxes (aka virtual folders or smart folders).
² When a message is moved to the green box, its sender will be automatically greenlisted (for example, by a script that is regularly run against the green box).
Tip: Do not put your own email addresses in your greenlist because spammers often forge the From header with your address as a way to sneak into your green box! Another reason to keep your addresses out of your greenlist is that you can see how spammy your messages are (because they will get analyzed and tagged with a spam score).
³ LOAF is a GPL'd distributed-social-network filter that is a private way to greenlist your correspondents and limelist the correspondents of your correspondents (aka your “2nd degree correspondents”).
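The personal-filter portion of the flowchart, together with the auto-greenlisting described in footnote ², can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Prefs` class, the representation of the bluelist as a mapping from a list identifier to a box name, the `low_score` threshold of 5.0, and all addresses are hypothetical, and the magnet-updater step is left out.

```python
from dataclasses import dataclass


@dataclass
class Prefs:
    my_addresses: frozenset   # my own addresses (never trusted as senders)
    bluelist: dict            # list identifier -> name of the blue (bulk) box
    greenlist: set            # trusted non-bulk senders
    limelist: set             # senders trusted by my trusted senders
    low_score: float = 5.0    # hypothetical cut-off between yellow and red


def choose_box(sender, list_id, score_fn, prefs):
    """Pick a box, consulting the cheap lists before the expensive scorer."""
    sender = sender.lower()
    if list_id is not None and list_id in prefs.bluelist:
        return prefs.bluelist[list_id]
    # A From header showing one of my own addresses is easy to forge,
    # so such messages skip the green and lime green checks.
    if sender not in prefs.my_addresses:
        if sender in prefs.greenlist:
            return "green"
        if sender in prefs.limelist:
            return "lime green"
    score = score_fn()  # only the dregs reach the spam scorer
    return "yellow" if score < prefs.low_score else "red"


def harvest_greenlist(green_box_senders, prefs):
    """Greenlist every sender found in the green box, except my own addresses."""
    for sender in green_box_senders:
        sender = sender.lower()
        if sender not in prefs.my_addresses:
            prefs.greenlist.add(sender)


prefs = Prefs(
    my_addresses=frozenset({"me@example.com"}),
    bluelist={"procmail-users.example.org": "blue/procmail"},
    # "me@example.com" is listed here by mistake to show that the guard holds.
    greenlist={"friend@example.org", "me@example.com"},
    limelist={"friend-of-friend@example.org"},
)

print(choose_box("Friend@example.org", None, lambda: 99.0, prefs))  # green
print(choose_box("me@example.com", None, lambda: 12.0, prefs))      # red
```

Passing the scorer as a callable keeps the ordering of the flowchart visible: a message caught by a bluelist, greenlist, or limelist is never scored at all.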
The keys to implementing this strategy are
I implement this strategy using Procmail and SpamAssassin (SA), with each message's spam score (SA's _HITS_ token) inserted into the beginning of the Subject header. This makes it easy to order messages by score because I can simply do a sort-by-subject in my mail client. I recommend that you surround the Subject tag with braces (squiggly brackets) because 1) these tagged Subjects will then sort below almost all other Subjects when the Subjects are sorted using an ASCII sort (because { is ASCII character 123 out of 127); and 2) a popular alternative, square brackets, will not work in an IMAP sort because the IMAP sort and thread specification says to ignore text that is in square brackets during a sort. If you want to muck around with the format of the _HITS_ token, see Keith C. Ivey's message Re: adding SPAM hits score to headers in the SA mailing list.

Below is my user_prefs file (comments are preceded by #). The first two settings are essential to my deflexion strategy.

    # Site-wide settings go in ...etc/mail/spamassassin/local.cf (location depends on installation)
    # User-specific settings override site-wide settings and go in ~/.spamassassin/user_prefs

    # To insert the spam score at the beginning of the Subject, include the following two settings.
    # This is an essential part of my deflexion strategy.
    rewrite_subject 1
    subject_tag {* _HITS_ *}

    # Because automatic spam/non-spam separation is not as simple as black/white or yes/no,
    # I want the "spamminess" to be put in the X-Spam-Level header and Subject header
    # (see above 2 settings) of every message that gets a score (hits) of 1.00 or more.
    # [Note that this is a much lower threshold than what most people use because I use
    # the scores to stratify my messages (rather than to try to say 'YES this is spam.')]
    required_hits 1.00

    # I need the spam-level "stars" (default character is *) for my Procmail recipes
    spam_level_stars 1

    # I use R (for Red) as the spam-level character because it is easier than * to match in Procmail and
    # because spammers are more likely to forge SA headers using the default spam-level character, which is *
    add_header all Level _STARS(R)_
    # NOTE: add_header was introduced in 2.60. If you use SA 2.55 or earlier, use the following instead:
    ## spam_level_char R

    # I (unfortunately) read only English so I use the following settings
    ok_languages en
    ok_locales en

    # Since I use Pine 4.61 as my mail client and Dallman Ross's Virus Snaggers (vsnag),
    # I do not need to worry about malicious attachments. And I need each
    # message to be close to its original state for the message processing that
    # I do using Procmail, IMAP, and "bounce forward"
    report_safe 0
    # NOTE: report_safe is available in SA version 2.50 and later (and you need 2.51 or
    # later to be able to set its value to 2). If you are using 2.4x or earlier, use
    # defang_mime instead (but you really should upgrade!)

    # NOTE: use_bayes is available in SA version 2.50 and later
    # I do not use SA's Bayesian analysis because with my deflexion system, almost every
    # message that SA sees is spam and thus SA cannot automatically learn much about my
    # non-spam messages (and I don't want to spend time using sa-learn to train SA; I'd
    # rather spend that time training my procmail-invoked greenlists and bluelists!)
    use_bayes 0

    # NOTE: trusted_networks is available in SA version 2.60 and later
    # Some of my email addresses are hosted on other systems and automatically
    # redirected or forwarded to this system. To save SA from checking those
    # systems' IP addresses, I tell SA to trust them.
    ## trusted_networks ip.add.re.ss[/mask]

    # I do not skip RBL checks because they improve the accuracy of SA's scores.
    # NOTE 1: SA 2.60-rc4 and later are much better at timing out if a Real-time
    # Block List (RBL) is having problems or is dead. NOTE 2: SA does not use RBLs
    # to "block" mail, it uses them to help determine a message's score.
    skip_rbl_checks 0

    # I use the following so that I (& my users) will be able to see that a message
    # was processed by SA running at spam.deflexion.com. I prepend 'www.' because
    # some mail clients, for example pine, will then turn it into a link.
    add_header all Checker-Version SpamAssassin _VERSION_ (_SUBVERSION_) on www.spam.deflexion.com

    # If you are thinking about changing some of the default scores that SA gives, I recommend
    # that you first read Matt Kettler's SA-rules-howto.txt and note that in order for
    # user rules to work with spamd 2.55 & later, local.cf must contain allow_user_rules
    # (if you run spamassassin rather than spamd, custom user rules will work by default)
    ## put custom user rules here

    # For info about the following rule, see Re: habeas problems in the SA discussion group
    # If you use Bayes, I suggest you read the thread bayes should ignore habeas headers?
    score HABEAS_SWE -0.1
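The reasoning behind the brace-delimited Subject tag can be checked with a small Python sketch. It assumes the tag renders as something like `{* 12.30 *}` with a zero-padded score; the exact rendering of the score depends on your SA version, so treat that format, the regular expression, and the sample Subjects as assumptions.

```python
import re

# Assumed tag shape: "{* <score> *}" at the very start of the Subject.
TAG = re.compile(r"^\{\*\s*(-?\d+(?:\.\d+)?)\s*\*\}\s*")


def score_from_subject(subject):
    """Return the score embedded in a tagged Subject, or None if untagged."""
    m = TAG.match(subject)
    return float(m.group(1)) if m else None


subjects = [
    "{* 12.30 *} Cheap pills",
    "Re: lunch",
    "zoo plans",
    "{* 01.10 *} Long time no see",
    "[mylist] digest",
]

# Python compares strings by code point, which matches an ASCII sort here.
# "{" is character 123, so tagged Subjects land after upper-case letters,
# "[" (91), and lower-case letters (97 to 122).
for s in sorted(subjects):
    print(s)
```

A plain string sort orders the tagged Subjects by score only when every score has the same width, which is why the sketch assumes zero padding. `score_from_subject` recovers the number for cases where a numeric sort is needed instead.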
The above SpamAssassin user_prefs variables work with
SA 2.64. Some of these variables do not exist in earlier or later
versions of SA so make sure that you read the documentation
for your version of SA. For example, In SA 3.x,
The |
To check that your user_prefs file is syntactically valid, type

    spamassassin --lint

and make sure that this invokes the spamassassin that is processing your messages! To check which spamassassin the above command invokes, type

    which spamassassin

To check what version this is, type

    spamassassin --version

To see what tests SA performs, for example to check if network tests are happening, type

    spamassassin --lint --debug

However, you can implement this strategy using many different tools. For example,
if your mail client can do filtering and can be set up to do spam-scoring
(possibly via a plug-in), you can use it to do the scoring, filtering,
and processing of your mailboxes. Another option is to find a mail hosting
service that gives their users the option to use server-based filtering
and spam-detection tools. My IMAP
Service Providers page includes many such providers.
explanation of my strategy (why I do things this way and in this order);
the imperfect metaphor I'm using (light:prism :: incoming-messages:message-filtering
software); details about how I implement and adapt it using Procmail,
SpamAssassin, Mozilla, Pine, Mulberry; my magnet-updater script
--list-name
; also see the
section Avoiding False Positives With Greenlists and Bluelists
Reverse Spam Filtering: Winning Without Fighting
1st published 16-Feb-2002