1/29/2003

Exchange 2000::Orphaned Mailbox

PROBLEM:
An Active Directory user object was erroneously deleted. The MMC where this misdeed took place did not have the Exchange Admin plug-in installed. The mailbox still exists, but it references a user object that no longer exists. It can't be "reconnected" through Exchange System Manager because the mailbox was never marked for deletion.
- or -
A user was deleted properly (from an MMC with the Exchange Admin plug-in, choosing to remove the mailbox), but the mailbox was not actually deleted, so the mailbox is associated with a non-existent user.
SOLUTION:
Right-click the "Mailboxes" container under the information store in Exchange System Manager and choose "Run Cleanup Agent." The orphaned mailbox will now appear with a red X icon and can be reconnected to another user or purged manually.

1/22/2003

E-Mail::Exchange::Outlook

PROBLEM: A message is stuck in my Outlook OUTBOX.
ADDITIONAL INFORMATION: This problem is especially likely with "deferred" messages set up to be delivered at a future time.
CAUSE: According to MS Article 262052: http://support.microsoft.com/default.aspx?scid=kb;en-us;262052
"After you send an e-mail message, Outlook moves the e-mail message to your Outbox. When Outlook connects to your mail server, the e-mail message is delivered and a copy of the sent e-mail message appears in your Outbox. If you open and close an e-mail message while it is in your Outbox, the e-mail message may not be sent because you may have changed its status. If the status of the e-mail message has been changed, the title of the e-mail message no longer appears in italic."
SOLUTION: According to MS Article 262052: http://support.microsoft.com/default.aspx?scid=kb;en-us;262052
"To send the e-mail message, open the message, and then click Send. The message title changes back to the italic format in the Outbox Messages view. The next time that you make a connection to the mail server, the e-mail message is delivered. "

1/20/2003

E-Mail::SPAM::Spam Filtering


Paul Graham discussed a very intelligent spam-filtering algorithm at a recent spam conference.

His complete article is at:
http://paulgraham.com/spam.html

Here is a quote that sums up his approach:
Most hackers' first instinct is to try to write software that recognizes individual properties of spam. You look at spams and you think, the gall of these guys to try sending me mail that begins "Dear Friend" or has a subject line that's all uppercase and ends in eight exclamation points. I can filter out that stuff with about one line of code.

And so you do, and in the beginning it works. A few simple rules will take a big bite out of your incoming spam. Merely looking for the word "click" will catch 79.7% of the emails in my spam corpus, with only 1.2% false positives.

I spent about six months writing software that looked for individual spam features before I tried the statistical approach. What I found was that recognizing that last few percent of spams got very hard, and that as I made the filters stricter I got more false positives.

False positives are innocent emails that get mistakenly identified as spams. For most users, missing legitimate email is an order of magnitude worse than receiving spam, so a filter that yields false positives is like an acne cure that carries a risk of death to the patient.

The more spam a user gets, the less likely he'll be to notice one innocent mail sitting in his spam folder. And strangely enough, the better your spam filters get, the more dangerous false positives become, because when the filters are really good, users will be more likely to ignore everything they catch.

I don't know why I avoided trying the statistical approach for so long. I think it was because I got addicted to trying to identify spam features myself, as if I were playing some kind of competitive game with the spammers. (Nonhackers don't often realize this, but most hackers are very competitive.) When I did try statistical analysis, I found immediately that it was much cleverer than I had been. It discovered, of course, that terms like "virtumundo" and "teens" were good indicators of spam. But it also discovered that "per" and "FL" and "ff0000" are good indicators of spam. In fact, "ff0000" (html for bright red) turns out to be as good an indicator of spam as any pornographic term.
_ _ _

Here's a sketch of how I do statistical filtering. I start with one corpus of spam and one of nonspam mail. At the moment each one has about 4000 messages in it. I scan the entire text, including headers and embedded html and javascript, of each message in each corpus. I currently consider alphanumeric characters, dashes, apostrophes, and dollar signs to be part of tokens, and everything else to be a token separator. (There is probably room for improvement here.) I ignore tokens that are all digits, and I also ignore html comments, not even considering them as token separators.

I count the number of times each token (ignoring case, currently) occurs in each corpus. At this stage I end up with two large hash tables, one for each corpus, mapping tokens to number of occurrences.
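
(This sketch is mine, not Graham's.) Here is roughly how I picture the tokenize-and-count step in Python; the token regex and the good/bad directory names are just my guesses at a reasonable setup:

import os, re
from collections import Counter

# Tokens are runs of alphanumerics, dashes, apostrophes, and dollar signs
# (Graham's rule); everything else separates tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'\-]+")
HTML_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def tokens(text):
    text = HTML_COMMENT_RE.sub("", text)   # drop html comments entirely, not even as separators
    for tok in TOKEN_RE.findall(text):
        if not tok.isdigit():              # ignore tokens that are all digits
            yield tok.lower()              # ignore case

def count_corpus(directory):
    # Map token -> number of occurrences across every message file in the directory.
    counts = Counter()
    for name in os.listdir(directory):
        with open(os.path.join(directory, name), errors="ignore") as f:
            counts.update(tokens(f.read()))
    return counts

good = count_corpus("good")   # hypothetical directory of nonspam messages
bad = count_corpus("bad")     # hypothetical directory of spam messages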

Next I create a third hash table, this time mapping each token to the probability that an email containing it is a spam, which I calculate as follows [1]:

(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
   (unless (< (+ g b) 5)
     (max .01
          (min .99 (float (/ (min 1 (/ b nbad))
                             (+ (min 1 (/ g ngood))
                                (min 1 (/ b nbad)))))))))

where word is the token whose probability we're calculating, good and bad are the hash tables I created in the first step, and ngood and nbad are the number of nonspam and spam messages respectively.

I explained this as code to show a couple of important details. I want to bias the probabilities slightly to avoid false positives, and by trial and error I've found that a good way to do it is to double all the numbers in good. This helps to distinguish between words that occasionally do occur in legitimate email and words that almost never do. I only consider words that occur more than five times in total (actually, because of the doubling, occurring three times in nonspam mail would be enough). And then there is the question of what probability to assign to words that occur in one corpus but not the other. Again by trial and error I chose .01 and .99. There may be room for tuning here, but as the corpus grows such tuning will happen automatically anyway.

The especially observant will notice that while I consider each corpus to be a single long stream of text for purposes of counting occurrences, I use the number of emails in each, rather than their combined length, as the divisor in calculating spam probabilities. This adds another slight bias to protect against false positives.
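
For those of us who don't read Lisp, here is my rough Python translation of that per-word calculation. It assumes good and bad map tokens to occurrence counts and that ngood and nbad are the nonspam and spam message counts, as described above:

def word_probability(word, good, bad, ngood, nbad):
    # Spam probability for one token, following the formula above.
    g = 2 * good.get(word, 0)      # doubling the good counts biases against false positives
    b = bad.get(word, 0)
    if g + b < 5:                  # too rare to trust; no entry in the table
        return None
    p = min(1.0, b / nbad) / (min(1.0, g / ngood) + min(1.0, b / nbad))
    return max(0.01, min(0.99, p))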

When new mail arrives, it is scanned into tokens, and the most interesting fifteen tokens, where interesting is measured by how far their spam probability is from a neutral .5, are used to calculate the probability that the mail is spam. If probs is a list of the fifteen individual probabilities, you calculate the combined probability thus:

(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x))
                                     probs)))))

One question that arises in practice is what probability to assign to a word you've never seen, i.e. one that doesn't occur in the hash table of word probabilities. I've found, again by trial and error, that .4 is a good number to use. If you've never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar.

There are examples of this algorithm being applied to actual emails in an appendix at the end.

I treat mail as spam if the algorithm above gives it a probability of more than .9 of being spam. But in practice it would not matter much where I put this threshold, because few probabilities end up in the middle of the range.
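
Again, this is my sketch rather than Graham's code, but the classification step might look like this in Python. Here probs stands for his third hash table (token to spam probability); the .4 default for unseen words and the .9 threshold are the values from the article:

def spam_probability(message_tokens, probs, unknown=0.4):
    # probs maps token -> spam probability; unseen tokens get .4.
    candidates = [probs.get(tok, unknown) for tok in set(message_tokens)]
    # The fifteen most interesting tokens: farthest from a neutral .5.
    interesting = sorted(candidates, key=lambda p: abs(p - 0.5), reverse=True)[:15]
    prod, inv_prod = 1.0, 1.0
    for p in interesting:
        prod *= p
        inv_prod *= 1.0 - p
    return prod / (prod + inv_prod)

def is_spam(message_tokens, probs):
    return spam_probability(message_tokens, probs) > 0.9
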
_ _ _

One great advantage of the statistical approach is that you don't have to read so many spams. Over the past six months, I've read literally thousands of spams, and it is really kind of demoralizing. Norbert Wiener said if you compete with slaves you become a slave, and there is something similarly degrading about competing with spammers. To recognize individual spam features you have to try to get into the mind of the spammer, and frankly I want to spend as little time inside the minds of spammers as possible.

But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like SpamAssassin assign a spam "score" to email. The Bayesian approach assigns an actual probability. The problem with a "score" is that no one knows what it means. The user doesn't know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word "sex" in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.
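
To check his arithmetic: for two tokens with probabilities a and b and no other evidence, the combined probability works out to ab / (ab + (1 - a)(1 - b)):

a, b = 0.97, 0.99                            # p("sex") and p("sexy") from his corpus
print(a * b / (a * b + (1 - a) * (1 - b)))   # 0.99968..., about a 99.97% chance of spam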

Because it is measuring probabilities, the Bayesian approach considers all the evidence in the email, both good and bad. Words that occur disproportionately rarely in spam (like "though" or "tonight" or "apparently") contribute as much to decreasing the probability as bad words like "unsubscribe" and "opt-in" do to increasing it. So an otherwise innocent email that happens to include the word "sex" is not going to get tagged as spam.

Ideally, of course, the probabilities should be calculated individually for each user. I get a lot of email containing the word "Lisp", and (so far) no spam that does. So a word like that is effectively a kind of password for sending mail to me. In my earlier spam-filtering software, the user could set up a list of such words and mail containing them would automatically get past the filters. On my list I put words like "Lisp" and also my zipcode, so that (otherwise rather spammy-sounding) receipts from online orders would get through. I thought I was being very clever, but I found that the Bayesian filter did the same thing for me, and moreover discovered a lot of words I hadn't thought of.

When I said at the start that our filters let through less than 5 spams per 1000 with 0 false positives, I'm talking about filtering my mail based on a corpus of my mail. But these numbers are not misleading, because that is the approach I'm advocating: filter each user's mail based on the spam and nonspam mail he receives. Essentially, each user should have two delete buttons, ordinary delete and delete-as-spam. Anything deleted as spam goes into the spam corpus, and everything else goes into the nonspam corpus.

You could start users with a seed filter, but ultimately each user should have his own per-word probabilities based on the actual mail he receives. This (a) makes the filters more effective, (b) lets each user decide their own precise definition of spam, and (c) perhaps best of all makes it hard for spammers to tune mails to get through the filters. If a lot of the brain of the filter is in the individual databases, then merely tuning spams to get through the seed filters won't guarantee anything about how well they'll get through individual users' varying and much more trained filters.
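
If I ever build this myself, the per-user training side might be as simple as the sketch below (mine, not Graham's; it reuses the tokens() helper and Counter tables from the counting sketch earlier, and the class and method names are made up):

from collections import Counter

class PersonalFilter:
    # Two delete buttons: ordinary delete and delete-as-spam.
    def __init__(self):
        self.good, self.bad = Counter(), Counter()   # token occurrence counts
        self.ngood, self.nbad = 0, 0                 # message counts

    def delete(self, text):
        # Ordinary delete: the message goes into the nonspam corpus.
        self.good.update(tokens(text))
        self.ngood += 1

    def delete_as_spam(self, text):
        # The message goes into the spam corpus.
        self.bad.update(tokens(text))
        self.nbad += 1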

1/15/2003

E-Mail::Exchange::Tool


OnTrack Power Controls
http://www.ontrack.com/powercontrols/
It can restore granular mailbox items from an offline EDB without any brick-level backup.
I have never used this product, but it looks like something I might want someday.

1/14/2003

E-Mail::Spam::RBLs Bad?


There is a long article at http://theory.whirlycott.com/~phil/antispam/rbl-bad/rbl-bad.html discussing alternatives to using Real-Time Black Lists (RBLs) to block delivery of messages from mail servers known to allow open relaying.
I haven't read the entire thing, but I think--so far--I disagree with most of it.
I don't use RBLs for some of these reasons, but I respect them in principle and certainly wouldn't attack the ideas or their implementations as unconstitutional, discriminatory, or a form of censorship. A lot of those points seem to be stretching it.
Let me know what you think.

Internet::Research::Tools


Awesome mapping site. Find close-up pictures of your neighborhood from space.
http://www.acme.com/mapper/

1/07/2003

Internet::Research::Link::Internet Archive


http://www.archive.org
"The Internet Archive is building a digital library of Internet sites and other cultural artifacts in digital form. Like a paper library, we provide free access to researchers, historians, scholars, and the general public."

The site contains a valuable archive of snapshots of a huge portion of the web, captured at different points in time.

Partition Manager


Great unrestricted shareware partition manager.
http://www.ranish.com/part/
The site also has a very good disk-partitioning primer.

Windows NT/2K::Boot Diskette for Mirrored Disk Volumes


Problem
I have a mirrored boot/system volume on two different disks. If the primary disk fails, I can't boot.
Solution
Set up a boot diskette and modify its BOOT.INI to use the other disk. The details below may vary depending on the specifics of the machine.
Steps
1. Format a floppy on a W2K (NT) machine.
2. Copy NTLDR, NTDETECT.COM, and BOOT.INI to the diskette. Also copy BOOTSECT.DOS and NTBOOTDD.SYS if they are required (if you are doing this from the machine you want to boot, copy them if they exist).
3. Edit BOOT.INI on the diskette to point to the second disk. For example:
[operating systems]
multi(0)disk(0)rdisk(0)partition(2)\WINNT="Microsoft Windows 2000 Advanced Server" /fastdetect
becomes
[operating systems]
multi(0)disk(0)rdisk(1)partition(2)\WINNT="Microsoft Windows 2000 Advanced Server" /fastdetect

1/05/2003

Network::Security::Wireless::WarDriving


I glimpsed Chad's excellent info about WarDriving today. I want to review his site more later.
He has pictures and details of his setup and a map with wireless networks plotted.
I hope he doesn't drive past my house.
Check it out at http://www.phpmp.com/?page=wardriving
Someday I hope to make available a serious list of links and a list of excellent freeware, but for now here is just a note about a great free system tool I started using recently.

Windows 2000/XP::Defragment Page File


Windows 2000 includes a basic NTFS disk defragmentation tool that was sorely missing from NT, and I use it regularly.
One thing I hate is that the system files remain fragmented.
SysInternals has an extraordinary selection of free utilities for Windows 2000:
http://www.sysinternals.com/ntw2k/utilities.shtml
One of them is PageDefrag v2.20: http://www.sysinternals.com/ntw2k/freeware/pagedefrag.shtml
When W2K boots, it defragments system files, including the page file.

http://www.sysinternals.com has a great list of freeware system tools for Windows 9x through XP. Just browsing there briefly, I found a number of items I wanted to try out.