spam evolution

Despite some rather modest protection (like a simple captcha), I still receive spammy comments on this blog every now and again. They’re easily spotted and actually never appear on the website.

There’s obviously an incentive for the spammer to post something as convincing as possible: either you’re taken in and think it’s a genuine comment, or it takes so long to decide whether it’s genuine that you just give up. To that end, I’ve noticed a new generation of comments that simply copy text from somewhere on the web. The text is more readable than a Markov-chain generated blurb and thus more taxing for the blogger to identify. There’s a twist though: usually one word is misspelt, seemingly deliberately. Here is an example:

Hi Louis apparently my honstig company have had a few issues today. As far as I can see, the images are there now. Have they returned for you as well? If not, I can try tweaking a few things and seeing what happens

I wondered why the spelling mistake was introduced. My current, unsubstantiated guess is that it’s a way for the spammer to detect which comments have gone through (presumably by searching the web for the misspelt word) and so identify blogs that are weak on security.

Today I’ve started receiving an even more pernicious kind of spammy comment on my blog: the comments are genuine comments copied from R-related blogs, and thus even more difficult to spot since they seem, at least superficially, somewhat related to the post they’re posted under. Here is an example:

Lattice and ggplot add a lot of value in that they pruocde objects with which you can do things. Also, the whole reason lattice (trellis) was created in the first place was to provide a powerful system that takes care of a lot of tedious things. For example, if you want a histogram conditional on some categorical variable, you’ve got it immediately. Just because it also works in the simple case presented above does not mean it is an equivalent alternative to hist(). I would say that having many options does not make R look like legacy at all. If you need something simple, use something simple (like hist()). If you need something more powerful and flexible, use that.

It threw me at first, because my original post was indeed about ggplot, but the comment was completely off-topic and I got suspicious. I found its origin in a 2009 blog post. Notice that the spelling mistake does not appear in the original (?) comment.

I filed the comment as spam, slightly amused by the attempt, and what do you know? A few hours later, I received another spammy comment, which is exactly the reply to that comment in the original thread.

to whom it may concern I was never in doubt, that havnig graphic objects and conditioning is an advantage (sorry, when I was unclear at this point) but as you already pointed out, there are already two packages which are mostly equivalent from an ordinary user’s perspective.My concern regards havnig many packages in parallel with very much overlap and little structured and coordinated progress.

Again, with added misspelt words. This type of spam definitely requires more time to identify and I guess it’s achieving its purpose. I wonder how widespread this is. One unintended consequence of this might be fewer off-topic comments though!
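Once the original comment has been tracked down, the scrambled words can be picked out mechanically. Here is a minimal Python sketch of that idea; the function names are made up for illustration, and it assumes the original text is already at hand and that the intended word in the example above is "produce". It flags any word in the suspect comment that is missing from the original but uses exactly the same letters as one of the original's words.

```python
import re

def words(text):
    """Lowercase alphabetic tokens of the text, in order."""
    return re.findall(r"[a-z]+", text.lower())

def scrambled_words(suspect, original):
    """Return (scrambled, intended) pairs: words of `suspect` that do not
    occur in `original` but are a rearrangement of one of its words."""
    original_words = words(original)
    known = set(original_words)
    # Index the original's words by their sorted letters (first one wins).
    by_letters = {}
    for w in original_words:
        by_letters.setdefault("".join(sorted(w)), w)
    pairs = []
    for w in words(suspect):
        if w in known:
            continue
        match = by_letters.get("".join(sorted(w)))
        if match is not None:
            pairs.append((w, match))
    return pairs

# Assumed original wording: "produce" is a guess at the unscrambled word.
original = ("Lattice and ggplot add a lot of value in that they produce "
            "objects with which you can do things.")
suspect = ("Lattice and ggplot add a lot of value in that they pruocde "
           "objects with which you can do things.")

print(scrambled_words(suspect, original))
# [('pruocde', 'produce')]
```

This only helps after the source has been found, so it is a way of confirming a suspicion rather than a spam filter in its own right.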


3 Responses to spam evolution

  1. I’ve had the exact same thing happen on my blog about a week and a half ago. I usually blog about ML and Big Data, but in a post about something completely unrelated (Prüfer sequences for compressing trees), a long comment appeared, namedropping so many machine learning techniques in an otherwise incoherent text that I actually thought it was generated from a domain-specific language model. It, too, contained a shuffled-around word (‘comment’ was spelled something like ‘cnotmem’).

    Unfortunately I already deleted it so I can’t show parts of it here. Detecting the posting of a comment by the intentional misspelling is smart. I somehow immediately assumed it was spam and that the misspelling was meant to throw off people searching for the text online to see if it was spam (but it’s so similar that you’d find it anyway, so your explanation makes much more sense).

    It’s too bad the spammers don’t publish papers; I think they might have interesting techniques, or at least data, to share. But this specific technique is easy to filter out if you assume no one would want to plagiarize someone else’s (or his own!) comment: use the kind of “soft” text hashing that SpamAssassin and similar projects employ to identify copies of texts containing minor alterations.

    • CL says:

      ah! It wasn’t just me then.

      I don’t know about the hashing used by SpamAssassin, I’ll have a look at it. I guess that’s the kind of technique used by plagiarism checkers like Turnitin.

  2. Chris says:

    Interesting and fascinating post. I’ve been spambombed this week with a lot of comments that follow this same pattern (a word about six or seven places in is intentionally mangled and can quickly be searched on, as per your assumption). Since I verify my comments manually, some of the comments make me pause for a second, as they almost seem real. I agree – spammers should host a conference and publish some papers. I’m sure a few of them would blow us away with their techniques!
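The “soft” text hashing idea raised in the first comment can be sketched roughly as follows. This is not SpamAssassin's actual implementation, just a generic near-duplicate check in Python: each text is broken into overlapping three-word "shingles" and two texts are compared by the Jaccard similarity of their shingle sets. The function names, the shingle size and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re

def shingles(text, k=3):
    """Set of overlapping k-word sequences from the lowercased text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0  # too short to compare
    return len(sa & sb) / len(sa | sb)

def looks_copied(comment, known_texts, threshold=0.5):
    """True if the comment is a near-copy of any already known text."""
    return any(similarity(comment, t) >= threshold for t in known_texts)

# Assumed original wording: "produce" is a guess at the unscrambled word.
original = ("Lattice and ggplot add a lot of value in that they produce "
            "objects with which you can do things")
suspect = original.replace("produce", "pruocde")

print(similarity(suspect, original))      # 0.7
print(looks_copied(suspect, [original]))  # True
```

A single mangled word only disturbs the three shingles that contain it, so the copy still scores well above an unrelated comment. The catch is that the filter needs a collection of known texts to compare against.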