A Tale of a Computer Genius Turned Data Miner

April 14, 2014

By John Reiley

The term “scraping” covers a broad range of meanings. In internet technology and computing, it often means trouble. Scrapers, as they are known across the web, are people with considerable computing expertise who use automated tools to copy content from other sites, often to the harm of the original publishers. Scrapers may take others’ information intentionally or unintentionally. Some operate alone, while others work in large groups or even build businesses around scraping.

Scraping involves taking pieces of written content, such as articles, blog posts, and electronic journals, and republishing them on another website. It is performed by a third party, often without the consent of the original publisher. Material may be scraped from both public and private websites, and it can be equally problematic in either case.

Is Scraping Good or Bad?
Surprisingly, many people disagree on whether content scraping is helpful or harmful. Some argue that scraping is beneficial in certain areas of the internet, such as social media. Content, they argue, is designed to be shared, and it ultimately benefits from exposure to a wider audience. Proponents add that people and companies gain visibility when their content is duplicated and spread to other sites, beyond the web pages their initial target audience would most likely visit. They also argue that content scraping may increase traffic to a website and improve the search ranking of a blog or news feed. Scraped material often keeps its links to other posts on the original website, which creates free backlinks. Those links can bring more traffic to the site and may, in turn, improve its search ranking and raise the individual’s or company’s online visibility.

While some people make a case for content scraping, others offer compelling reasons why it does more harm than good. Just as content scraping can potentially help a company’s search ranking, it can also hurt it. That happens when the duplicated article ranks higher than the original post. Another drawback of scraping is that it may weaken brand awareness. Search engines, when presented with duplicate material, often have trouble distinguishing the original content from the copies. This might cause headaches for larger firms, but it can be catastrophic for smaller businesses that are trying to build a good reputation and get their brand known.

Crossing the Line
Sometimes, content scrapers blur the line between right and wrong. One such case in the United States involved a programmer and activist named Aaron Swartz. In 2008, Swartz scraped documents from the Public Access to Court Electronic Records (PACER) system, which stores public records filed in the federal court system. PACER normally charges users a per-page fee for electronic access to those files. Swartz avoided the fees by using a free-access trial offered at selected public libraries and running an automated script that downloaded records in bulk to his own storage. Over a period of weeks, not years, he retrieved millions of documents before the trial was shut down. The FBI investigated, but no charges were filed. Swartz defended his actions by saying that the data he retrieved was public information and that he was therefore doing members of the public a favor by making those records available to them without fees.

Protecting Against Content Scraping
Whether you work for yourself or for a small business, you may benefit from knowing how to detect and prevent content scraping. As with many areas of computer maintenance, you can take action against content scraping on your own or with the assistance of an outside firm. On your own, you can detect scraped copies by setting Google Alerts for distinctive phrases from your content and by using Copyscape or another plagiarism detection service to find replicated versions of your work. Adding canonical links to your pages tells search engines which version is the original. Adding a CAPTCHA, which requires a human visitor to complete a challenge such as re-typing a series of letters and numbers, makes it harder for automated scrapers to pull content from your site. CAPTCHAs offer the additional benefit of keeping your site clean by reducing the volume of spam that might otherwise flow freely through its forms and comment sections. Larger firms and agencies often enlist the help of advanced services to detect scraping and stop scrapers in their tracks.
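
To make the detection step more concrete, here is a minimal Python sketch of one common way to measure how much of a page duplicates your own text: compare overlapping five-word sequences (often called shingles) and compute the share the two texts have in common. It is an illustration only, not how Copyscape or any search engine actually works. The function names, the five-word window, and the 0.5 threshold are all assumptions chosen for the example.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap_score(original, candidate, size=5):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a = shingles(original, size)
    b = shingles(candidate, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_scraped(original, candidate, threshold=0.5):
    """Flag the candidate when the overlap meets the (assumed) threshold."""
    return overlap_score(original, candidate) >= threshold


if __name__ == "__main__":
    original = ("Content scraping copies articles from one website and "
                "republishes them on another site without consent")
    copy = original + " read more here"
    unrelated = "The weather today is sunny with a light breeze from the west"

    print(looks_scraped(original, copy))       # near-verbatim copy
    print(looks_scraped(original, unrelated))  # unrelated text
```

In practice you would fetch the suspect page yourself, strip its HTML, and pass the remaining text in as the candidate. A high score is only a signal that a page deserves a closer look, since quoted excerpts and syndicated copies that you approved will also score well.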

Regardless of whether you work for yourself or a corporation, either large or small, you probably conduct a fair amount of business on the web. The internet brings many opportunities for business, and is a great way to gain exposure. The internet also brings threats, including content scraping. You can keep your company safe from electronic harm by taking preventative measures and using advanced services to detect and deflect content scraping.

John Reiley is a tech enthusiast who loves being a part of the ever-growing web community. He is a network specialist currently residing in the US. John has a passion for writing articles and daily columns that serve his audience and educate them about current technological advances.