Parasite Scraping
AKA: Two Wrongs CAN Make a Right
OK....to start, let me state the obvious:
The Intertube is overrun with scraper sites. They are everywhere, and many people are profiting from them. While they may be questionable in practice, they can still be very successful.
As you already know, volume is one of the keys to success in SEO. And scraper sites give you the ability to quickly and easily automate the production of a huge amount of copy for use in your niches. Some implementations of the scraper site are better than others. For example, this one isn't bad: http://www.red78.net/loafers/
And here's a pretty bad one: http://opensource.votio.com/php/forum/Loafers
What sets them apart is simply advertising integration....I gotta wonder if it is converting as well as the first one.
Anyways, this is very interesting and all, but I don't just want to give an overview of generic scraper sites for you guys...I want to talk about something called Parasite Scraping.
You see, all these BlackHatters out there are busting their ass creating scraper sites. They are coming up with the perfect Markov Chains, creating huge synonym databases, scraping old cached websites, scraping Wikipedia, etc etc etc......
Point is, people are dedicating a huge amount of effort and processing power to create scraper sites.
And amongst all this hype for scraper sites, here I am; and I'm thinking, "I hate dedicating effort to pretty much anything I find mundane, especially 'huge' amounts of it."
So, what's a lazy bastard like me to do?......Scrape the Scrapers
Let's dive right in and focus on the crappy one, shall we?
You see, the Crappy One, as we're calling it, has this url: http://opensource.votio.com/php/forum/Loafers
By altering this final term, we can get a whole new set of free content, served up for us and easy to extract! Watch: (By the way, I program in Ruby...it's pretty easy to read and follow along)
It's a simple process to extract the <td class="box"> at the top and bottom of the page. You'd do this with the CSS selector td.box (the XPath equivalent would be //td[@class="box"]).
This box contains tags relevant to the keyword (in this case: Loafers).
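require 'mechanize'

# Mechanize 1.0 and later expose the class as Mechanize; WWW::Mechanize is the old namespace
agent = Mechanize.new
doc = agent.get("http://opensource.votio.com/php/forum/Loafers")
# Grab the text of every <td class="box"> and split it on commas
tags = doc.search("td.box").inner_text
tags = tags.split(',')
tags.each do |tag|
  # Drop the "Tags:" label and tidy up the whitespace
  tag = tag.gsub(/(Tags:)/, '').squeeze(' ').strip
  puts tag
end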
**DEMO REMOVED**
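If you want to poke at the cleanup step without hitting anybody's server, here's a minimal sketch that runs on a plain string. The sample text and the clean_tags name are made up for illustration; it assumes the box text looks like "Tags: a, b, c", which is my guess at the markup, not something you should take as gospel.

```ruby
# Hypothetical helper: turn the raw text of a tag box into a clean list.
# Assumes a "Tags:" label followed by comma-separated tags.
def clean_tags(raw_text)
  raw_text
    .gsub(/Tags:/, '')                      # drop the label
    .split(',')                             # one chunk per tag
    .map { |tag| tag.squeeze(' ').strip }   # collapse runs of spaces, trim the ends
    .reject(&:empty?)                       # skip blanks left by stray commas
    .uniq                                   # the box shows up twice on the page
end

# Made-up sample input, standing in for doc.search("td.box").inner_text
sample = "  Tags: loafers,  penny   loafers , ,shoes,loafers\n"
puts clean_tags(sample)
```

The uniq is there because the same box appears at the top and the bottom of the page, so the raw text will likely contain every tag twice.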
Now let's get ourselves some free RSS links...shall we?
require 'mechanize'

agent = Mechanize.new
doc = agent.get("http://opensource.votio.com/php/forum/Loafers")
# Every feed link on the page carries the "rsslink" class
links = doc.search("a.rsslink")
links.each do |link|
  puts link.inner_text   # the link's label
  puts link[:href]       # the feed URL
end
**DEMO REMOVED**
Annnnnnnnd to wrap things up, let's really get evil:
require 'mechanize'

agent = Mechanize.new
doc = agent.get("http://opensource.votio.com/php/forum/Loafers")
page = doc.search("html").inner_html
page = page.gsub(/^google_ad_client = "pub-[0-9]+";/, 'google_ad_client = "pub-XXXXXXXXXXXXXXXXX";')
puts page
It doesn't take a rocket scientist to tell what this does, **DEMO REMOVED**
So there you have it! Why continue to waste your time, money, and server resources?!? It's a pain in the ass to set up a fleet of scraper sites. All the effort finding sources, creating wiki scrapers, scraping search results, even building templates and hard coding the structure of the site itself!! I have just shown you how to take a simple keyword-based scraper site and republish it under your own AdSense ID, using only 7 lines of code.
I want you guys to keep in mind that those two example websites are just the tip of the iceberg. Like everything else, creativity must be applied here. There are many websites that scrape and Markov content, unlike the example sites that just directly scraped RSS feeds and SERPs. So be on the lookout for good quality scraper sites!
In terms of the ethical issues surrounding parasite scraping....I think this post's slug-line says it best:
Two Wrongs CAN Make a Right
**UPDATE: I removed the demos and links, sorry guys**
--Rob
Website: http://nis.sarup.dk/
Comment:
Isn't it kind of risky to have that kind of site? --> http://opensource.votio.com/php/forum/poker-gambling
Comment:
Yeah, as we're talking about here, it is definitely easy to exploit scraper sites....not even just ones that have the scraped query in the URL through mod_rewrite.....there are other techniques to exploit other kinds of scrapers....but that's fodder for a later post.
There are tons of sites like this if you know how to find them.
Website: http://moneymakerblogs.com
Comment:
Thanks for this post. I have been reading about scraping recently and I'm trying to learn how to do it. This is interesting - scraping the scrapers... I like it. :)
This is a newbie question, but how do you implement the code examples above on a webpage to show the scraped content?
Comment: That is Ruby. It's not as simple to run as just copying and pasting.....It requires its own interpreter to execute the code, and that is not standard on many hosts. Do some research.
Website: http://moneymakerblogs.com
Comment: Thanks for the response, off to my friend google I go :)
Website: http://t4ky.info
Comment:
Good post, hopefully all the newbies will be able to benefit off of everyone else's Wordze accounts.
Hehe.
I just got finished writing my own content generation program that is similar, the first example scraper site is really good. That is a good blackhatter, and the sites are similar to mine (very clean and neat looking template).
I have been pretty sick of the spammiest looking spam sites you can ever imagine coming up in a G search. I'd also be surprised if those sites converted at all, for clicks or for affiliate programs. They just look so shitty.
The secret is to look neat, and have your PPC ads above the fold. Easy.
Comment:
Whenever I write a post in the wee hours of the morning (like this one), I invariably forget to mention some key things....so here's a very quick run-down of the key things you should also think about...
Be anonymous. Proxy. Rotate IPs if possible. Install Tor on your server if you have to.....
You shouldn't do anything half-assed, especially this....don't forget to consider all aspects of your code that might leak your site's IP or other identifying features.
So be slick by remembering to pay attention to how you display the CSS for a page....I usually have my scraper scrape all pages ending in .css and then copy them into <style> tags in the header of the page I am scraping.
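Here's a rough sketch of that CSS trick, assuming you've already fetched each stylesheet and have its text keyed by href. The names inline_stylesheets and css_by_href are made up for illustration, and matching tags with a regex is a shortcut; a real HTML parser would be sturdier.

```ruby
# Hypothetical helper: swap <link rel="stylesheet"> tags for inline <style> blocks.
# css_by_href maps each stylesheet's href to CSS text you fetched earlier.
def inline_stylesheets(html, css_by_href)
  html.gsub(/<link\b[^>]*\brel=["']stylesheet["'][^>]*>/i) do |tag|
    href = tag[/\bhref=["']([^"']+)["']/i, 1]   # pull the href out of the tag
    css = css_by_href[href]
    # Leave the tag alone if we have no CSS on hand for it
    css ? "<style>\n#{css}\n</style>" : tag
  end
end

# Made-up sample page and stylesheet
html = '<head><link rel="stylesheet" href="/main.css"></head>'
puts inline_stylesheets(html, "/main.css" => "body { margin: 0; }")
```

Any stylesheet you didn't fetch stays as a plain link tag, so it's worth checking the output for leftovers before you publish.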
Always filter your html to avoid opening yourself to XSS
Keep in mind that the people you are going to be targeting will generally have no qualms about fucking with you if they catch you. It wouldn't be hard for them to feed your scraper malicious code.
Oh...that reminds me....I didn't proxy those examples.....oh shit..uhhhhhhhhhhhhhh......
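The bluntest way to handle that is to treat every scraped string as hostile and escape it before it touches your page. This is a minimal sketch using ERB::Util.html_escape from Ruby's standard library; the safe_text name and the sample string are made up for illustration. It only covers text you drop into HTML as plain text. If you republish scraped markup as markup, you'd need a proper whitelist sanitizer instead.

```ruby
require 'erb'

# Hypothetical helper: escape scraped text so the browser renders it as text,
# not as markup or script.
def safe_text(scraped)
  ERB::Util.html_escape(scraped)
end

# Made-up example of what a pissed-off scraper owner might feed you
hostile = %q{<script>alert("gotcha")</script>}
puts safe_text(hostile)
# prints: &lt;script&gt;alert(&quot;gotcha&quot;)&lt;/script&gt;
```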
# bus error
Website: http://www.hearder.com
Comment:
You mentioned installing Tor..
Do you know of any sites that give instructions for installing Tor on a Linux server, so that I can run my scripts a "bit" more secretly..
Hoping you can help
Bruce
Comment:
Bruce, my suggestion is that you read all the help files on the Tor website.
There are so many ways to set up an install on Linux, depending on the environment, so I suggest you read the doc files.
Make sure you have libevent installed on your machine (that was a stumbling block for me at first).
Website:
Comment: Another way to avoid detection would be to do your scraping off-site on a test server, injecting content into your own db, then uploading to your own sites. It's also a good work around if you don't have VPSes or your own dedicated servers yet. Can you say WP splogs?
