SLAMMING SPAM & PUMMELING POP-UPS
Steeped in the history of 18th-century philosophy, the latest production release of Mozilla stands ready to make a lot of history and not just as a browser.
by Jack Fegreus, September 4, 2003
Thomas Bayes, full-time Presbyterian minister in Tunbridge Wells and part-time mathematician extraordinaire, was the first to use probability inductively and established a mathematical basis for probability inference. In more practical terms, he devised a means to calculate the probability that an event will occur in a future trial based on the number of times the event has and has not occurred in earlier trials. The classic example is a jar with 100 jelly beans, where 90 are white and 10 are red. What are the odds of choosing a red jelly bean after picking out 10 white jelly beans in a row?

For discerning readers, the 1-in-9 answer has probably induced several yawns, but let's take this a step further. Suppose the snow season at a mountain resort runs over a 120-day period. Now let's suppose that in studying past weather reports we determine that the average number of days with significant snowfall is 30, and that on 10 days there is enough snow to close the road to the resort. What are the odds that in the coming winter season the first storm will not come for 30 days, will last for two days, and will be severe enough to close the road on the second day? Just plug the numbers into Bayes's Theorem and the odds can be cranked straight out: about 1.5-in-1,000,000, which is a tad better than being hit by a meteor in the year 2012.
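The arithmetic behind both examples can be sketched in a few lines of Python by chaining conditional probabilities day by day. The snow calculation is one reading of the problem that lands near the quoted figure, not necessarily the author's own: it assumes 30 clear days, then a snow day drawn from the 30 snow days among the 90 days left, then a road-closing day drawn from the 10 closing days among the 89 days left.

```python
from fractions import Fraction

# Jelly beans: 10 white beans are already out of the jar,
# so 90 beans remain and 10 of them are red.
p_red = Fraction(10, 100 - 10)

# Snow season (assumed reading): 120 days, 30 snowy, 10 with the road closed.
p_season = Fraction(1)
for day in range(30):
    # Days 1-30 stay clear: 90 clear days out of 120, then 89 of 119, and so on.
    p_season *= Fraction(90 - day, 120 - day)
p_season *= Fraction(30, 90)  # day 31: snow arrives (assumption: any of the 30 snow days)
p_season *= Fraction(10, 89)  # day 32: a road-closing day among the 89 days left

print(p_red)            # 1/9
print(float(p_season))
```

Under these assumptions the second figure comes out near 1.5 in a million. Counting the road-closing days more strictly (for example, allowing for day 31 itself being one of them) shifts the result slightly.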
What we've been calculating is "conditional probability." This field of mathematics has long found a home in the life sciences, where answers to a posteriori problems like "Did the medication cure the disease?" are quite important. Along these lines, Bayes's Theorem can be expressed in a variety of forms, including one that is particularly useful for inferring causes from their effects. As a result, probability theory is central to decision and game theory. Statistical testing, confidence intervals, and regression methods are all markedly prevalent in the practical sciences.

There is, however, another academic domain where the work of Rev. Bayes is quite important: epistemology, the theory of knowledge. Intuitively, conditional probability can be thought of as the re-evaluation of a probability based on additional information. In the case of our long-range weather forecast, the first day's probability of storm-free weather is 90-in-120, but the probability on the next day changes to 89-in-119 once we know that the first day was clear. In other words, we learned.

In epistemology, subjectivists model beliefs and opinions using probability functions. Learning is then modeled by the updating of those opinion functions. Subjectivists think of learning as a process of belief revision in which an a priori subjective probability P is replaced by an a posteriori probability Q that incorporates newly acquired information. Out of the wellhead of what is dubbed Probabilistic or Bayesian Learning flows a torrent of Artificial Intelligence algorithms for use in areas such as speech recognition, image recognition, and diagnosis. The common thread is the use of probabilistic criteria to select a most likely hypothesis.
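The belief-revision step described above, in which a prior P is replaced by a posterior Q and the most likely hypothesis is then selected, can be illustrated with a minimal Python sketch. The two-jar setup, the hypothesis names, and the `update` helper are all invented for illustration; only the update rule itself is Bayes's Theorem.

```python
from fractions import Fraction

def update(prior, likelihood):
    """Return the posterior Q(h) = P(h) * P(evidence | h) / P(evidence)."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}

# Hypothetical setup: a bean is drawn from one of two jars, each equally likely a priori.
prior = {"mostly_white": Fraction(1, 2), "half_and_half": Fraction(1, 2)}

# Chance of drawing a red bean from each jar (10 of 100 versus 50 of 100).
p_red_given_jar = {"mostly_white": Fraction(10, 100), "half_and_half": Fraction(50, 100)}

# We observe a red bean and revise our beliefs accordingly.
posterior = update(prior, p_red_given_jar)

# The "most likely hypothesis" is simply the one with the largest posterior.
map_hypothesis = max(posterior, key=posterior.get)

print(posterior["mostly_white"], posterior["half_and_half"])  # 1/6 5/6
print(map_hypothesis)                                         # half_and_half
```

Feeding the posterior back in as the next prior repeats the process for each new observation, which is the sense in which the model "learns."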
That notion of "most likely" has a very practical and, unfortunately, increasingly urgent application on the average business desktop: the identification of UCE (unsolicited commercial e-mail), a.k.a. spam. Like pornography, which more often than not is its subject matter, spam is something just about everyone can recognize on sight. The nagging question, however, is what's the best way to automate a cleanup of one's inbox?

Most spam filters attempt to recognize key message properties with a high probability for predicting spam. One such feature-recognizing filter is SpamAssassin, which assigns a weighted spam "score" to various e-mail features. Are the message headers corrupted? Was the message generated by an application that forges an MS Outlook ID? Is the sending e-mail server on a blacklist? Does the body contain suspect words? These are some of the tests utilized by SpamAssassin as it generates an overall measure (the sum of the individual tests) of the likelihood that a particular message is spam.
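The additive scoring idea can be sketched as follows. This is an illustration of the approach only, not SpamAssassin's actual rules: the test names, the weights, the word list, and the cutoff are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Message:
    headers_ok: bool
    claims_outlook_but_forged: bool
    relay_blacklisted: bool
    body: str

SUSPECT_WORDS = {"guaranteed", "winner"}  # hypothetical word list

# (name, weight, test) triples; names and weights are invented for illustration.
TESTS = [
    ("corrupt_headers", 2.0, lambda m: not m.headers_ok),
    ("forged_outlook_id", 3.0, lambda m: m.claims_outlook_but_forged),
    ("blacklisted_relay", 2.5, lambda m: m.relay_blacklisted),
    ("suspect_words", 1.5, lambda m: any(w in m.body.lower() for w in SUSPECT_WORDS)),
]

THRESHOLD = 5.0  # hypothetical cutoff above which a message is treated as spam

def spam_score(message):
    """Sum the weights of every test that fires, and report which ones did."""
    hits = [(name, weight) for name, weight, test in TESTS if test(message)]
    return sum(weight for _, weight in hits), [name for name, _ in hits]

msg = Message(
    headers_ok=False,
    claims_outlook_but_forged=True,
    relay_blacklisted=False,
    body="You are a guaranteed WINNER",
)
score, fired = spam_score(msg)
is_spam = score >= THRESHOLD
print(score, fired, is_spam)
```

The weakness of this design is that the features and weights are fixed in advance by whoever wrote the rules, rather than learned from the mail a particular user actually receives.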
The e-mail module is an excellent example. To prevent an open relay, we require user authorization in order to access the SMTP module of our Openexchange Server. When configuring either an IMAP or POP connection to the Open magazine mail server, there was no way to tell the Mozilla e-mail client that SMTP authorization would be necessary.

So we continued without entering this critical setting. With the Outlook client, this tactic would leave the client unable to connect to the server when attempting to send mail. With Mozilla, the e-mail client simply informed us of the need to supply a password on the first attempt to send, and continued to work perfectly once the password was entered.

The spam filter is even easier to use. It will probably be a bit disconcerting for those used to configuring a function before using it, but there is nothing to enter. Remember that one of the more powerful aspects of the Bayesian Learning model is the ability to determine a most likely a posteriori (MAP) hypothesis given a particular outcome. So don't be surprised when first starting the e-mail client if everything comes up as spam. Mozilla's Bayesian classifier needs to know what you, the user, consider to be spam. Once one or two e-mail messages are declared not to be spam, it will likely be a very long time before another e-mail message triggers a query concerning the nature of that message. At any time, however, the user can proactively intervene and declare an unmarked message to be a false negative. By declaring a message to be spam, the user effectively teaches the classifier the user's definition of spam.
In particular, the classifier scans the e-mail for patterns of tokens. Tokens are simply groups of symbols, which can be letters, numbers, or typographic symbols grouped in any combination. The Bayesian classifier then assigns an actual probability to the tokens it discovers. An HTML color code can turn out to be just as telling a spam indicator as five exclamation points in a row.

Finally, it is relatively easy to manage messages that have been marked as junk and to remove junk mail. The e-mail module now has junk-mail context menu items, a “delete junk mail” menu item, and many other usability improvements for junk-mail controls.
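To make the token-probability idea concrete, here is a minimal naive Bayes sketch in Python. It is not Mozilla's implementation, whose details the article does not give: the `TokenClassifier` class, the whitespace tokenizer, the add-one smoothing, and the sample messages are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

class TokenClassifier:
    """Toy naive Bayes junk filter trained by user-marked messages."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Assumption: a token is any run of non-whitespace symbols.
        self.token_counts[label].update(text.split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            counts = self.token_counts[label]
            total_tokens = sum(counts.values())
            # Prior: share of messages the user has marked with this label (smoothed).
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            for token in text.split():
                # Per-token likelihood with add-one smoothing for unseen tokens.
                score += math.log((counts[token] + 1) / (total_tokens + len(vocab) + 1))
            log_score[label] = score
        # Normalise the two log scores into a probability of spam.
        top = max(log_score.values())
        weights = {label: math.exp(s - top) for label, s in log_score.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])

clf = TokenClassifier()
clf.train("FREE !!!!! click #FF0000 now", "spam")
clf.train("win FREE prize !!!!!", "spam")
clf.train("meeting notes for Monday", "ham")
clf.train("lunch on Monday", "ham")

p_junk = clf.spam_probability("FREE prize !!!!!")
p_work = clf.spam_probability("Monday meeting")
print(round(p_junk, 2), round(p_work, 2))
```

Because every token is counted the same way, a color code such as `#FF0000` or a run of exclamation points carries weight purely through how often it appears in mail the user has marked as junk, with no hand-written rule required.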
Joining spam in the Internet hall of shame is the pop-up, or worse, pop-under window for advertising. These devices are proliferating across the Net, making it painful to research complex material. One way to block these noisome schemes is via a firewall. The Mandrake MNF Firewall can be set to block content from servers associated with the delivery of these dodgy windows. The one drawback, of course, is the need to list the sites at the outset.

Mozilla can be set to decline pop-up windows by default, just as it can be set to refuse cookies. Then if there are sites that utilize login screens or other legitimate automatic window launching, the user can add those sites to a list of exceptions to the policy. The analogy to cookies carries down to a small set of icons on the bottom of the browser. These icons appear only if a cookie has been refused as part of a security policy or the launching of an unrequested window has been suppressed. This feature alone is enough to justify a switch to Mozilla.

Finally, openBench Labs made a cursory examination of the functionality of the HTML editor, dubbed Composer. For basic single-page editing, the interface to deal with text, tables, and images is quite good. Composer now supports click-and-drag dynamic image and table resizing in real time. Unfortunately, Composer is just that, a simple single-page editor. What would be interesting would be the integration of Composer with the likes of Eclipse.