Training the anti-spam engine

Jared Benson's picture

We've put some automated anti-spam measures into Typophile recently, and the good news is that it is really helping keep the spammers at bay.

However, a lot of you seasoned Typophiles are getting legitimate posts flagged by the spam engine. As frustrating as this is, we're going through an essential period of training the engine to know which comments are legitimate and which are not. Bear with us, and this will soon be behind us.

If you are seeing specific patterns about why your posts are being flagged, please share them below.

Thanks-

charles ellertson's picture

Nick Shinn made a good point in one of his posts -- the new filters don't seem to take into account how long someone has been registered at Typophile, yet most spammers have a quite new registration.

oldnick's picture

Charles,

Thank you for pointing that out. How about considering a very simple rule: anyone who posts more than four times in the first twenty-four hours he or she is registered should be flagged for human inspection, and anyone who posts more than a dozen times in the same time period should be sent to Hell, forthwith...
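
For what it's worth, a rule like that is only a few lines of code. Here is a minimal sketch in Python; the function name, the return labels, and the constants are made up for illustration and are not part of any module Typophile actually runs.

```python
from datetime import datetime, timedelta

# Thresholds taken from the rule proposed above; the names are hypothetical.
FLAG_THRESHOLD = 4     # more than four posts: send to a human for inspection
BLOCK_THRESHOLD = 12   # more than a dozen posts: block outright
WINDOW = timedelta(hours=24)


def classify_new_account(registered_at, post_times):
    """Return 'ok', 'review', or 'block' from posts made in the first 24 hours."""
    cutoff = registered_at + WINDOW
    # Count only posts that fall inside the first day after registration.
    early_posts = sum(1 for t in post_times if registered_at <= t < cutoff)
    if early_posts > BLOCK_THRESHOLD:
        return "block"
    if early_posts > FLAG_THRESHOLD:
        return "review"
    return "ok"
```

Posts made after the first day are ignored entirely, so a long-standing member who has a busy afternoon would never trip this check.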

George Thomas's picture

I've been thinking about this and believe a good solution would be: if the account is less than six months old, the posts have to go through moderation.

JamesM's picture

As others have mentioned (and as I've been saying for a long time), most spam is from new accounts.

Focus on screening posts from new accounts, and give older accounts an automatic pass unless there is a major red flag. On the rare occasions when an old account is used for spam, it can be deleted manually later.
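
As a rough sketch of that idea in Python: the 180-day cutoff is an assumption (roughly the six months suggested earlier in the thread), and `needs_screening` and `red_flag` are hypothetical names, since nobody here has said what the spam module would treat as a red flag.

```python
from datetime import datetime, timedelta

# Assumed cutoff; any account younger than this gets screened.
MIN_ACCOUNT_AGE = timedelta(days=180)


def needs_screening(registered_at, now, red_flag=False):
    """Screen posts from new accounts; older accounts pass unless red-flagged."""
    account_age = now - registered_at
    return red_flag or account_age < MIN_ACCOUNT_AGE
```

The point of the design is that account age is checked first and cheaply, so the heavier spam analysis only has to run on the small share of posts that come from recent registrations.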

Jared Benson's picture

Yes - the antispam modules don't take into account the user account's age, except for catching new accounts as they are created. I'll continue to look into other modules that might dovetail with them and support this function.

aluminum's picture

I'd echo James' thought. It seems most spamming on here is drive-by. No idea on the ease of implementation, but seems that one option would be that all new user posts have to be moderated.

Of course, if the moderation can't happen in a timely manner, that'd frustrate new legitimate users, so that is perhaps a big drawback.

George Thomas's picture

@Jared

Does your filtering module have the capability to establish whitelists? If so that might be the best way.

I agree that moderation is the worst solution because new users will post multiple times, not understanding what is happening.

JamesM's picture

> new users will post multiple times, not
> understanding what is happening

Yes, that can happen, but you can reduce it by explaining the delay clearly to the poster. I've posted at sites where you get an automatic message like: "Your post will appear after it's approved by a moderator (usually within 24 hours)". But posts from whitelisted accounts actually appear immediately; it's only the new member posts that get reviewed.
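
The behavior described above could look something like this in Python. It is only a sketch: `submit_post`, the status labels, and the idea of the whitelist being a plain set of usernames are all assumptions for illustration.

```python
PENDING_MESSAGE = (
    "Your post will appear after it's approved by a moderator "
    "(usually within 24 hours)."
)


def submit_post(username, whitelist):
    """Publish at once for whitelisted users; otherwise queue and explain why."""
    if username in whitelist:
        return ("published", "Your post is now live.")
    # New members get a clear explanation instead of a silent delay.
    return ("pending", PENDING_MESSAGE)
```

The message is what matters: a poster who is told the post is waiting for approval has no reason to submit it again.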

happyalu's picture

I agree with JamesM: As a new user, my first post today got flagged as spam, and that surprised me. If the message had said "waiting for mod approval since this is your first post", it would feel much more welcoming :)

dezcom's picture

Thanks, Jared. It had just been depressingly impossible before you made the new changes. I agree that new posters are almost always the culprits.

Chris Hunt's picture

does it allow users to flag spam?

F Randall Farmer, who's a world authority on community systems, recommends this. he did it for Yahoo Answers, and it was highly effective. the idea being, you don't have to get rid of all spam outright, just slow the spammers down enough to make it uneconomical.

dezcom's picture

Marking the spam is not the problem; it is deleting each and every spam post and blocking the user a billion times that kills you. Spambots are quite automated SOBs.

Chris Hunt's picture

users should remove spam as well. this is from his book, Building Web Reputation Systems:

Yahoo! Answers decided to let users themselves remove content from display on the site because of the staff backlog. Because response time for abusive content complaints averaged 12 hours, most of the potential damage had already been done by the time the offending content was removed. By building a corporate karma system that allowed users to report abusive content, Yahoo! Answers dropped the average amount of time that bad content was displayed to 30 seconds. Sure, customer care staff was still involved with the hardcore problem cases of swastikas, child abuse, and porn spammers, but most abusive content came to be completely policed by users.

With the final iteration, the designers had incorporated all the desired features, giving historically trusted users the power to hide spam and troll-generated content almost instantly while preventing abusive users from hiding content posted by legitimate users. This model was projected to reduce the load on customer care by at least 90% and maybe even as much as 99%. There was little doubt that the worst content would be removed from the site significantly faster than the typical 12+ hour response time.
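
To make the idea concrete, here is a very simplified Python sketch of trust-weighted reporting. This is not from the book and not Yahoo's actual algorithm; the threshold, the trust scores, and the `should_hide` function are all invented for illustration.

```python
# Hypothetical threshold: hide a post once reporters' combined trust reaches it.
HIDE_THRESHOLD = 1.0


def should_hide(reporters, trust):
    """Hide content when the combined trust of distinct reporters is high enough.

    `trust` maps usernames to a score; unknown or new users count for nothing.
    """
    # Use a set so one user reporting repeatedly is only counted once.
    total = sum(trust.get(name, 0.0) for name in set(reporters))
    return total >= HIDE_THRESHOLD
```

With scores like `{"veteran": 1.0, "regular": 0.5}`, a single report from a historically trusted user hides the post at once, while an untrusted account cannot hide anything no matter how often it reports.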

Chris Hunt's picture

could you put me on the whitelist, if there is one? i can't post links.

Theunis de Jong's picture

Let me try :^)

Training the anti-spam engine

[Edit] In reply to Hrant, below, I just pressed "Edit" and got the old familiar "Edit comment" box. Where does one get into a waiting queue?

[Edit #2] Nope. Sorry, Hrant, it seems to work dandy for me!

hrant's picture

Every time I want to edit a post (in any thread) the time I need to wait increases. And I think it might even carry over from previous days... Right now in one thread I'm being told to wait 413 seconds before I can fix a spelling mistake...

Hmmm, it actually seems to affect new posts too. For this one I was asked to wait 1106 seconds (over 18 minutes).

hhp

HVB's picture

I'm guessing that there's a timing trigger for posters who post many messages in a short period of time.

But this was my second post in about five minutes, and no 'please wait'. (Yet) - Herb

And then my second edit in less than a minute. Another theory down the drain :)

Chris Hunt's picture

i can't use the src tag for images at the moment, either.

any idea how long the training or settings change is going to take?

Jared Benson's picture

Yes, there is a whitelist, of sorts. Not sure what this new delay thing is; I've not seen it anywhere in the settings.

@Chris Hunt: Done

(The rest of you in this thread are already on the whitelist.)

JamesM's picture

Several times when I've edited one of my posts, I've gotten a message saying I need to wait a number of seconds. Here's an example from this morning.
