Scraper Websites Remain An Internet Scourge
Last Saturday morning I logged onto my main email account and discovered an unusually large number of messages, particularly for a weekend morning. A quick survey showed that most were notifications from my two main automotive blogs – The Auto Writer and Auto Trends – that comments were awaiting my approval.
It didn’t take me long to realize that these comments were pingbacks triggered by articles I had written for the two blogs. The articles had been scraped from both sites with their links to my other articles intact. Moreover, each photograph I included was also snagged, meaning that the pictures were hot-linked to my website, a further drain on my resources.
Admittedly, I have let some scrapers slide, figuring it wasn’t worth the hassle of hunting down and finding them in order to register a complaint. However, this time I was furious and decided enough was enough.
Hunting Down Website Scrapers
To find a website scraper, I’ll plug the URL for that website into Domain Tools at www.domaintools.com to see what comes up. Sometimes personal information is blocked because the owner hides behind a domain proxy. Other times, a person’s name isn’t listed, but a contact email is given. If you’re lucky, you may even find the person’s name, phone number, email address, etc. In this situation, all I had available to me was the person’s email address, but that was enough for me to fire off the following email:
It has come to my attention that the owner of scrapersite.com is taking articles from sites that I own and posting these articles to his site. Since the email address yourname@yourmail.com is listed as the contact for the site, you are being contacted.
Please remove at once every article you have taken from http://www.autotrends.org (Auto Trends) and http://www.thearticlewriter.com/autowriter (The Auto Writer). What you have done is illegal and violates international copyright law.
(I then listed examples of articles that had been culled from my sites and concluded my email with the following statement):
You do not have my permission to use my personal material, therefore I expect you to remove the articles in question at once.
I concluded my message with my name and repeated the URLs to my two automotive sites.
I Get A Response
I wasn’t expecting a response, figuring that if this person cared that he scraped articles from my site he’d remove the information and that would be that. Or, if he had no plans to take down my information, then he wouldn’t respond.
I was wrong on both counts.
As it turned out, I got a reply within an hour of sending off my notice. It read as follows:
Hello, this site is only collecting RSS Feed. You know? Mybloglog, Zimbio, Yahoo, blogcatalog do same, they collecting RSS feed. Also, this site give you linkback, not claimed as my article, so its not illegal. But, since u mind, I will remove your feed as soon possible.
Yes I mind! So much so I replied:
It is illegal. You posted each one of my articles on your site in full. The other sites that you mentioned include a snippet, not the entire work. Just because you can read my works with an RSS reader doesn’t mean you can use it on your own site.
Please read up on what constitutes copyright and what constitutes fair use:
http://www.templetons.com/brad/copymyths.html
In addition your country, Pakistan, is a signatory to the Berne Convention which protects people like me from getting their intellectual property taken without their knowledge:
http://en.wikipedia.org/wiki/Berne_Convention_for_the_Protection_of_Literary_and_Artistic_Works
If you remove all of the material (taken) from my two sites I will let this matter go. If not, I’ll be forced to contact the Berne Convention office in Karachi to file a complaint.
I doubt that there is a Berne office in Karachi or anywhere else for that matter. But, I wanted to put a little teeth in my reply to encourage him to follow through. He sent a follow up reply doubting that I would make good on my threats, reiterating that he posted “automatic content” which gave him, in his eyes, the right to republish. I decided to let him have the last word.
Wrapping It Up
In the end my articles were removed so I no longer have an issue with this person. I am glad that I followed through and that he took the proper action. My time was taken up with having to do something I really didn’t want to do, but I needed to protect my intellectual property even if someone else doesn’t understand what that means.
Article scraping remains an internet scourge, but with some diligence on our part we can limit its effect one scraper at a time.

By LarryJackson, June 26, 2009 @ 6:49 am
I have seen this on my own blog, although the ones I see are using only a portion of the article. Most of the time, the website is clearly not very old. I always spam the pings or trackbacks I get from these blogs. As far as I know, I have seen no websites that are using the entire article.
I suppose I do not mind if I get any amount of traffic from those sites, but I wouldn’t like it if I found someone using my blog material to populate their own website. Not only is it theft, but it is also intellectually lazy. If I can write my own material, so should they. I started my blog to share my viewpoint, not the viewpoint of someone else.
Just my honest opinion.
By Matthew C. Keegan, June 26, 2009 @ 7:01 am
Larry, I certainly don’t mind if someone takes an excerpt of my article and posts that information elsewhere with a link back to the original article. Several news sites do this which helps bring traffic my way and builds my authority.
I recently discovered that a person in Greece has 46 articles (and counting) from my The Auto Writer site on his site. I emailed him just as I did with the guy from Pakistan, but he has neither responded nor taken action. I also noticed that he is swiping information from at least five other sites and is using Google ads to generate income. I plan on following up with him this weekend and may contact his domain registrar, GoDaddy, or his web host to see if that does me any good.
By Hobo, June 26, 2009 @ 7:51 am
I’m actually not that bothered with sites taking my feeds… as long as they link back to my originals. It’s the feeds that nick your content and display it as their own that get my goat.
By Tracy, June 26, 2009 @ 7:58 am
I haven’t seen full scraping too much with my articles, which makes sense because I don’t write the kind of thing that people search for; however, my photos get stolen all of the time.
Once, though, we were doing reviews of a TV show and one of those made-for-AdSense sites was posting an excerpt: about 2 inches of content in a page filled with ads. How did I discover this? I was using the AdWords credit I got from my host, noticed people were clicking my ad from there, and had a peek. Gah, I was paying this guy to send people to my site! Good thing it wasn’t money out of my pocket.
Sometimes going through AdSense is the only way to get action. Hit them in the pocketbook.
By Matthew C. Keegan, June 26, 2009 @ 8:05 am
@ Hobo — I don’t mind people taking excerpts and linking it back to my site, but cutting and pasting entire articles and then having those appear ahead of me in the SERPs is what infuriates me the most.
@ Tracy — I may contact AdSense directly as you suggest. I, too, have a hosting credit that I will be using, so it should be interesting to see how much abuse takes place with that too. Yes, I’ve had photos lifted or hotlinked which is another problem, one that professional photographers probably experience a lot.
By Mig, June 26, 2009 @ 8:22 am
Sorry this happened to you, Matt. It happens to all of us, and even to the most popular sites. I wish I knew how to stop these people, but there is no truly effective way, one that gives long-term results.
If he scrapes the content, he will also see “I doubt that there is a Berne office in Karachi or anywhere else for that matter. But, I wanted to put a little teeth in my reply to encourage him to follow through. He sent a follow up reply doubting that I would make good on my threats, reiterating that he posted ‘automatic content’ which gave him, in his eyes, the right to republish.” You could make a complaint to Google and they will deindex his site. Obviously, the man has an MFA (made-for-AdSense) site; therefore, if the site gets no more search engine traffic, this will seriously hurt him. Google complaints are checked manually by Google editors, so he will only suffer from this if your complaint is justified.
But I am amazed that he gets content from two of your sites that are unrelated in the topics they cover! What is his site about anyway (except stealing other people’s work)?
By Matthew C. Keegan, June 26, 2009 @ 8:31 am
Mig, the two sites are related in that they both serve automotive content. One emphasizes trends, the other industry news.
It seems that I should contact Google to report a guy from Greece who is the latest person to lift my stuff. I will wait until the weekend to decide the best course of action, but I want it removed.
As I mentioned in some of my other comments, I don’t mind if someone reposts excerpts especially when they supply to their readers a “Read the rest of the article” link or something like it. When lifted material beats my original stuff in the SERPs, then I get real mad!
By Genesis, June 26, 2009 @ 9:23 am
I have the same problem right now with a scraper. Unfortunately, there is no email address or information to find this person. I wrote to their webhost, but they apparently didn’t care either.
For now I’m leaving it, but I am contemplating switching to partial feeds . . . though that will likely cost me a lot of subscribers.
By Matt Keegan, June 26, 2009 @ 9:34 am
Genesis, it is hard when no contact information is supplied. Worse is a host that doesn’t seem to care, which seems to be standard operating procedure with most shared hosting companies.
I would hate to switch to partial feeds for the same reason you outlined — losing traffic may be a worse penalty than having content lifted from your site.
By Vlad Zablotskyy, June 26, 2009 @ 9:38 am
Hi Matt,
I think scrapers are one of those evils we’ll have to live with. I do, however, share your sentiment. Most of the scrapers I have dealt with generally pull only the first paragraph of my posts. Some of them offer me a link without the nofollow attribute; most of the time I let these slide and even give them a trackback from my blog. But those that give me a “nofollow” link I ignore completely and never give trackbacks from my blogs.
Only once have I noticed a “guru” publishing my entire post without giving me any credit, but keeping my deep links inside the post (another good reason for linking to your older posts).
But there are two plugins that can help you keep the scrapers away. The first is RSS Footer (http://yoast.com/wordpress/rss-footer/), with which you can insert a message such as this: “If you are reading this anywhere other than your e-mail or RSS reader, chances are you are on a website that is stealing content from matthewkeegan.com. Please copy the web address of the page you are reading and contact me at matthewkeegan.com/contact immediately,” or something like it.
My favorite, though, is the Similar Posts (http://rmarsh.com/plugins/similar-posts/) plugin, which allows you to place links to your other posts in the RSS feed for each article. This does not work if they scrape only the first paragraph. But if they scrape your entire post they will end up linking to at least 5 (or as many as you decide) posts of yours. You need another plugin called Post-Plugin Library from the same author for Similar Posts to work.
As far as SERPs go, I found that Google will eventually let your content outrank the scrapers, as it looks to the older version as the likely original.
I would also contact Google, as Mig suggested.
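For readers who are not on WordPress, the feed-footer idea Vlad describes can be sketched in a few lines of Python. This is only an illustration of the concept, not how either plugin is implemented: the function name `add_feed_footer`, its parameters, and the footer wording are all made up for the example.

```python
from html import escape


def add_feed_footer(item_html, title, permalink, related=()):
    """Append an attribution footer (and optional related links) to one feed item.

    Hypothetical helper: `related` is an iterable of (link_text, url) pairs.
    """
    # Link back to the original article so a full-content scrape carries it along.
    footer = (
        f'<p>This article, <a href="{escape(permalink)}">{escape(title)}</a>, '
        "was written for its original site. If you are reading it anywhere other "
        "than your e-mail or feed reader, it may have been copied without "
        "permission.</p>"
    )
    # Optional links to other posts, in the spirit of the Similar Posts idea.
    links = "".join(
        f'<li><a href="{escape(url)}">{escape(text)}</a></li>'
        for text, url in related
    )
    if links:
        footer += f"<p>Related posts:</p><ul>{links}</ul>"
    return item_html + footer
```

A scraper that republishes the whole item then republishes the footer and its links too. As Vlad notes, this does nothing against sites that take only the first paragraph.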
By Matthew C. Keegan, June 26, 2009 @ 9:57 am
Good tips, Vlad! I’ll look into these plugins and see if they do the trick. I appreciate your taking the time to offer an exhaustive (not exhausting) response.
I think I’ll be contacting Google too as it is definitely an AdSense scraper site.
By Vlad Zablotskyy, June 26, 2009 @ 10:17 am
Thanks Matt… I also think scrapers stay away from my blogs due to my spelling and grammar, a technique I would never recommend employing if you don’t have to.
By Matthew C. Keegan, June 26, 2009 @ 10:19 am
Now that is funny, Vlad! But, my two most recent scrapers are a Pakistani and a Greek, so English isn’t standard with them. I think my site being based in the US has something to do with it too, figuring that they can harness American traffic in order to get the AdSense ad income.
By Vlad Zablotskyy, June 26, 2009 @ 10:59 am
I think it is a very unwise and shortsighted “money making” scheme. They would be so much better off simply translating your posts with proper credits. It would not offend you; I am sure you are not trying to rank for a Pakistani or Greek equivalent of “car”. I think Ford and others have plenty of resources to run Pakistani or Greek ad campaigns, and these guys would not have to compete with American marketers. Oh well…
I also forgot to mention that even those splogs/scrapers that offer SEO-friendly links seem to disappear with time, leaving you with a bunch of links to websites that no longer exist.
By Lord Matt, June 26, 2009 @ 11:14 am
I have no problem with sites that take the first paragraph or a summary, especially if they are not storing it but simply displaying the current availability. My fact-o-page project gets the latest and greatest from Technorati for this purpose, but the listing of recent articles changes and the older content is not saved.
When it comes to hot linking (a crime under the Computer Misuse Act in the UK and generally considered theft) there is a good trick you can pull with an .htaccess file. In your case I would test the referer and, if it was the spammy site that was illegally hot linking, silently have your server return an alternative image. This could be an offensive but low-res image (to breach the TOS of the ad network that is making the money for the lazy boy) or an image that says “this page has stolen content”. Alternatively you could 302 them to the biggest image you can find on Flickr.com …
One time my content was stolen and the guy used WordPress with no anti-spam features. I noticed that if I commented and then hit refresh just so… it posted lots of copies. I was very tempted to do something mean.
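The referer test Lord Matt describes would normally live in the web server configuration. Purely to show the decision being made, here is a rough Python sketch of the same logic. It is not an .htaccess rule; the names `BLOCKED_HOSTS`, `is_blocked_referer`, and `choose_image` and the placeholder image path are invented for the example, and scrapersite.com is the stand-in domain from the post.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; scrapersite.com is the placeholder name used in the post.
BLOCKED_HOSTS = {"scrapersite.com"}


def is_blocked_referer(referer, blocked_hosts=BLOCKED_HOSTS):
    """Return True if the Referer header points at a blocked host or a subdomain of one."""
    if not referer:
        return False  # many legitimate requests send no Referer at all
    try:
        host = (urlsplit(referer).hostname or "").lower()
    except ValueError:
        return False  # malformed header: treat it as not blocked
    return any(host == b or host.endswith("." + b) for b in blocked_hosts)


def choose_image(requested_path, referer, placeholder="/images/stolen-content.png"):
    """Serve the placeholder to blocked referers and the real image to everyone else."""
    return placeholder if is_blocked_referer(referer) else requested_path
```

Keep in mind that the Referer header is supplied by the visitor's browser and can be missing or faked, so a check like this deters casual hot linking rather than preventing it outright.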
By Lord Matt, June 26, 2009 @ 11:16 am
I should also mention that a reverse DNS lookup will give you the hosting company 9 times out of 10. More often than not they are most sensitive to a firm but polite email.
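A reverse DNS lookup of the kind Lord Matt mentions can be tried with Python's standard `socket` module. The sketch below is a minimal version under a few assumptions: the helper names `reverse_dns` and `likely_host_domain` are made up for the example, and trimming the result to its last two labels is a naive guess that will be wrong for names such as example.co.uk.

```python
import socket


def reverse_dns(hostname):
    """Resolve a site's hostname to an IP address, then look up that IP's PTR name."""
    ip = socket.gethostbyname(hostname)
    try:
        ptr_name, _aliases, _addresses = socket.gethostbyaddr(ip)
    except (socket.herror, socket.gaierror):
        ptr_name = None  # no reverse record is published for this address
    return ip, ptr_name


def likely_host_domain(ptr_name):
    """Naive guess at the hosting company's domain: the last two labels of the PTR name."""
    if not ptr_name:
        return None
    labels = ptr_name.rstrip(".").split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else ptr_name


# Example use (needs network access; the result depends on the site looked up):
#   ip, ptr = reverse_dns("scrapersite.com")
#   print(ip, ptr, likely_host_domain(ptr))
```

If the lookup returns a name, the domain it ends in usually belongs to the hosting company, which is where a firm but polite email would go.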
By Matthew C. Keegan, June 26, 2009 @ 11:18 am
Good tips, Matt! I believe hot linking is illegal in the US too, but no telling what the rules are in Greece (EU) and Pakistan (Taliban, lol) when it comes to upholding international law.
I did think about visiting his site and clicking on his ads hundreds of times, at a library computer of course. Get Google’s attention and you’ll get action. Hmmm….
By Hobo, June 26, 2009 @ 5:41 pm
Good advice from Vlad – I was about to suggest a similar thing after looking at your feed.
Get those links in and let them scrape away.
By Gordano, June 27, 2009 @ 7:44 am
I agree with Hobo; he has given the right solution to the problem: we should get the links in and then let them scrape away! :)