The Internet’s Sticky Problem: Web Scraping

By Rami Essaid of Distil Networks

Think of it as “The Imitation Game.” Not the 2014 movie about British mathematicians racing against time to crack Nazi codes during World War II, but a battle on the Internet between website operators and those who copy or steal their content.

Rami Essaid
(Photo courtesy of Rami Essaid)

The practitioners of web scraping use Internet bots to gather data from someone else’s website and then copy the content on their own site or use it for a variety of nefarious purposes, such as undercutting a competitor’s promotional pricing, stealing leads or hijacking marketing campaigns.

You’d think web scraping, which is on the rise, would be against the law. But the legal landscape is rife with inconsistencies and inconclusive cases, and varies from country to country.

Scraping has been around almost as long as the web and, in its good form, it’s actually a key component of how the Internet operates. Search engines like Google use “good bots” to index web content and quickly find, say, the latest Grumpy Cat video.

“Bad bots,” however, are designed to fetch content from a website (prices, promotions, offers or information that’s meant to be available only to paid subscribers or authorized business partners) with the intent of using it for purposes outside the site owner’s control.
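One common way a well-behaved crawler differs from a bad bot is that it checks a site's robots.txt file before fetching a page. The following is a minimal sketch of that check using Python's standard urllib.robotparser module. The rules, the bot name "ExampleBot" and the example.com URLs are made up for illustration and do not come from any site mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for an imaginary site (illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /subscribers/
Disallow: /partner-pricing/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt permits user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical URLs: one public page and one subscriber-only page.
candidate_urls = [
    "https://example.com/articles/grumpy-cat",
    "https://example.com/subscribers/report",
]
print(allowed_urls(ROBOTS_TXT, "ExampleBot", candidate_urls))
```

A bad bot simply skips this step, which is why robots.txt on its own is a courtesy signal rather than a defense.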

 

A Shadow Business

Web scraping has become a lucrative business, with high-powered tools such as Diffbot, UiPath and Screen Scraper that go beyond simple data extraction to fill in forms automatically and drive other software to transfer data between applications. Others can imitate human browsing right down to natural pauses and are almost indistinguishable from people.

If you’re not a programmer, web scraping is still within easy reach. Just type “web scraping consultant” into a search engine and you’ll get pages and pages of professional service offerings.

The motivation behind commercial web scraping has always been to gain an easy competitive advantage. Why spend time and money creating your own content when you can take someone else’s for free?

Startups are especially prone to the scraping temptation because it’s a cheap and powerful way to gather data without starting from scratch. Tough luck if the content that’s stolen and repackaged hurts customer trust and the originator’s SEO results (the content is no longer seen as unique by search engines).

 

The Hardest Hit Industries

Though any website is susceptible to web scraping, certain industries are prime targets.

Digital Publishers And Directories

Given that much of their intellectual property is right out in the open, digital publishers are easy prey.

E-commerce

Pricing and product information scraped off a retailer’s site by bots can be fed to an analytics engine, enabling competitors to match prices and products in close to real time – and seconds can make the difference between the retailer keeping a sale and a scraper co-opting it.

Travel

Online travel agencies such as Priceline, Expedia, Trivago and Hipmunk built their meta-search businesses around site scraping, but do so legally. The flipside? Red Label Vacations, the largest independent travel brand in Canada, had bots from unauthorized third parties executing searches on its site and stealing content.

Real Estate

Scrapers use bots to grab content such as property listings and then create derivative products such as lead generation programs and appraisal data.

 

Tell It To The Judge?

If a website operator doesn’t want to be scraped, will the law be supportive? What constitutes fair use and what’s just a free ride?

The answer is very murky.

A variety of laws may apply to unauthorized scraping, including contract, copyright and trespass to chattels laws. (“Trespass to chattels” protects against unauthorized use of someone’s personal property, such as computer servers.)

Much of scraping’s legality depends on how the scraping was done and what the scraped data was used for. For example, if someone scrapes pricing data and then uses that data to determine their own pricing schedule, this likely would be deemed legal. If, on the other hand, the scraper stole content and sold it to others for a profit, a judge would be more likely to rule that a crime was committed.

Two cases in recent years seemed to be landmark victories for scraping protection. In 2009, Facebook won one of the first copyright suits against a web scraper when a US District Court judge in California ruled that the operator of Power.com, a site that aggregated multiple social networks into one site, illegally collected user information from Facebook and displayed it on its own site.

In 2013, a US District Court judge in New York sided with the Associated Press against Meltwater, an electronic news clipping service that included excerpts of AP stories in search results for its clients seeking news coverage based on particular keywords. The judge ruled that AP’s copyrights were infringed and that Meltwater was using AP’s resources and not providing any value back.

Despite those decisions, rulings in other recent cases have been inconsistent, and the world remains largely in legal limbo when it comes to scraping. Data protection laws in Europe have been used successfully to prevent scrapers from what amounts to invasions of privacy, but in the US, scraping still often appears to be considered an acceptable risk in the hypercompetitive world of online business.

 

Fighting the Good Fight

With the legality of web scraping stuck in such a gray area, many companies are reluctant to pony up the legal costs it would take to bring such a case to trial. But some are willing to fight the good fight.

CouponCabin, one of the largest Internet coupon code providers, has sued the owners of at least 10 other coupon websites, accusing them of stealing its exclusive coupons, presenting them as their own and circumventing security measures it put in place.

The case is a vivid illustration of the damage web scraping can do. CouponCabin differentiates itself as a provider of coupons that are guaranteed to work. “How many times have you shopped online and skipped the promo code box because you didn’t have anything to fill it with?” the company asks on its website. “Or, worse, how often do you search for coupon codes only to find a whole bunch of junk?”

CouponCabin has scoured the web and worked with merchant partners for more than 10 years to offer what it calls a one-stop shop for savings. So imagine the company’s alarm when it noticed its content appearing on competing sites.

In its lawsuit, CouponCabin claims the scraping, plus the security measures it was forced to implement, strained its servers and slowed its website by some 300 percent.

I hope more companies take legal action, so that the courts will finally establish a clear legal landscape for scraping.

Where scraping is concerned, imitation is not the sincerest form of flattery. Companies need to know what’s happening with their sites and take immediate steps to block bad actors.
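As a rough illustration of what “knowing what’s happening” can mean in practice, here is a minimal sketch that flags clients making an unusually high number of requests within a short time window. The function name, the log format of (timestamp, client ID) pairs and the thresholds are all assumptions made for this example. Real bot detection relies on many more signals, since sophisticated bots deliberately pace themselves to look human.

```python
from collections import defaultdict, deque


def flag_heavy_clients(requests, window_seconds=60, max_requests=100):
    """Flag clients that exceed max_requests within any sliding time window.

    requests: iterable of (timestamp_in_seconds, client_id) pairs,
    assumed to be sorted by timestamp (a hypothetical log format).
    """
    recent = defaultdict(deque)  # client_id -> timestamps still inside the window
    flagged = set()
    for timestamp, client_id in requests:
        window = recent[client_id]
        window.append(timestamp)
        # Drop timestamps that have aged out of the window.
        while timestamp - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(client_id)
    return flagged


# Made-up traffic: one client hitting the site 10 times a second,
# another loading a page every 10 seconds.
sample_log = sorted(
    [(i * 0.1, "fast-client") for i in range(50)]
    + [(i * 10, "slow-client") for i in range(5)]
)
print(flag_heavy_clients(sample_log, window_seconds=10, max_requests=20))
```

A simple threshold like this catches only the crudest scrapers, but it shows the kind of visibility into traffic that the steps above depend on.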

 

Rami Essaid is CEO and cofounder of Distil Networks, a bot detection and mitigation company.

The views, opinions and positions expressed within this guest post are those of the authors alone and do not represent those of CBS Small Business Pulse or the CBS Corporation. The accuracy, completeness and validity of any statements made within this article are verified solely by the authors.

 
