Web scraping is a technique for extracting data from web pages in an automated way. It works by indexing content, or rather by transforming the information contained in web pages into an intelligible copy that can be exported to other formats, such as spreadsheets.
The agents in charge of this crawling task are the so-called bots or crawlers: programs that automatically navigate through web pages, collecting the data or information present in them.
The types of data that can be obtained vary widely. For example, some tools perform price mapping, i.e., collecting hotel or travel prices for comparison sites. Other techniques, such as SERP scraping, are used to find out the top search engine results for certain keywords.
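As a rough illustration, the minimal sketch below shows what such a price-mapping bot does in practice: it downloads a page and pulls the listed prices out of the HTML. It uses the Python libraries requests and BeautifulSoup, and the URL and the `.price` selector are hypothetical placeholders rather than any real comparison site.

```python
# Minimal price-scraping sketch (requests + BeautifulSoup).
# The URL and the ".price" CSS class are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/hotel-offers", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every element marked with the (assumed) "price" class.
prices = [element.get_text(strip=True) for element in soup.select(".price")]
print(prices)  # e.g. ["129 €", "142 €", ...] -- ready to export to a spreadsheet
```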
Data scraping is used by most large companies. Perhaps the clearest example is Google: where do you think it gets all the information it needs to index websites? Its bots continuously analyze the web to find and classify content by relevance.
Protecting your data from Web Scraping
Data scraping is a practice that continues to raise eyebrows, as some consider it unethical. In many cases it is used to obtain data from other web pages and replicate it on a new one, often through the use of an API, which can amount to copying or duplicating information.
These bots can also be designed to navigate a website automatically and even create fake accounts, which is why many websites show the typical captcha to confirm that you are not a bot.
On the other hand, automatic extraction can create problems for the scraped websites, especially if the crawling is done on a recurring basis. Google Analytics and other web metrics tools also count visits from bots, so if crawlers visit a website continuously, its statistics are distorted by these “low quality” visits and its ranking can suffer.
But all these are moral rather than legal issues. What does the General Data Protection Regulation (GDPR) say?
This regulation establishes new rules on data protection and the prevention of internet crime. It states that the fact that a web page is public, accessible or indexable does not imply, in any way, that its data can be extracted. The technique is only allowed in the following cases:
- The data comes from publicly accessible sources or is collected for a purpose of general public interest.
- The interest of the data controller prevails over the right to data protection.
- The person whose data is collected has given their consent.
Therefore, in the event of a complaint, it must be demonstrated that the information serves the general public interest according to Article 45 of the GDPR, or the controller's interest in collecting the data must be weighed against the data subject's rights.
In addition, web scraping cannot be used to infringe intellectual property law or the right to privacy of individuals, for example through practices such as identity theft.
How can I prevent data scraping?
Web scraping can damage the crawled websites, especially if it is carried out continuously. One of the most direct consequences is that bot traffic distorts visitor metrics such as bounce rate and time per visit, harming how Google perceives the site.
In addition, depending on the data collected, web scraping could be an act of unfair competition or infringement of intellectual property rights. For example, websites that copy content directly from Wikipedia or other websites, or stores that duplicate the product descriptions of others.
Furthermore, a website can also be scraped for other malicious purposes that fall under the scope of the right to privacy, for example, companies that scrape emails, phone numbers, or social network profiles in order to sell them to third parties.
If you want to avoid data scraping on your website, we recommend following these tips:
- Use cookies or JavaScript to verify that the visitor is a real web browser. Most web scrapers do not execute complex JavaScript, so you can insert a JavaScript calculation into the page and check that it has been computed correctly (see the first sketch after this list).
- Introduce captchas to make sure the user is human. This remains a good measure for filtering out robot visitors, although bots have lately become more sophisticated and sometimes manage to bypass them.
- Set limits on requests and connections. You can mitigate scraper traffic by capping the number of requests and connections per visitor, since a human user browses far more slowly than an automated one (see the second sketch after this list).
- Obfuscate or hide data. Web scrapers extract data in text format, so publishing sensitive data as images (or, formerly, Flash) makes it harder to harvest.
- Detect and block known malicious sources. Identify known site scrapers, which may include your competitors, and block their IP addresses.
- Detect and block web scraping tools. Most tools send an identifiable signature, typically in the User-Agent header, which can be used to detect and block them (also covered in the second sketch after this list).
- Constantly update the page's HTML markup. Scrapers are programmed to look for content in specific tags, so frequently changing the markup (introducing spaces, comments, new tags, and so on) can prevent the same scraper from repeating the attack.
- Use fake web content to trap attackers. If you suspect your information is being plagiarized, you can publish fictitious content and monitor who accesses it to identify the scraper.
- State in your legal terms and conditions that web scraping of your site is prohibited.
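The first tip can be sketched as follows, assuming a Python/Flask site; the `/protected` route, the `js_token` cookie and the toy 6 × 7 calculation are illustrative choices, and a real deployment would randomise the challenge per visitor and sign the expected result.

```python
# Sketch of a JavaScript challenge in Flask: clients that never execute
# JavaScript never obtain the cookie, so they only ever see the challenge page.
from flask import Flask, request

app = Flask(__name__)

# Tiny page whose script performs the calculation, stores the result in a
# cookie and reloads. (Hypothetical toy challenge; randomise it in practice.)
CHALLENGE_PAGE = """
<script>
  document.cookie = "js_token=" + (6 * 7) + "; path=/";
  location.reload();
</script>
"""

@app.route("/protected")
def protected():
    # Only a client that ran the script above will present the expected cookie.
    if request.cookies.get("js_token") == "42":
        return "Real content, served only to JavaScript-capable visitors."
    return CHALLENGE_PAGE
```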
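The request limits, IP blocklist and tool-signature tips can likewise be combined in a single server-side hook. The second sketch below again assumes a Flask application that sees real client IPs (no proxy in front); the thresholds, the example IP and the User-Agent signatures are illustrative values, not a definitive rule set.

```python
# Sketch of server-side scraper filtering in Flask: IP blocklist,
# User-Agent signatures of common scraping tools, and a simple rate limit.
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

BLOCKED_IPS = {"203.0.113.50"}          # known scraping sources (example address)
TOOL_SIGNATURES = ("python-requests", "scrapy", "curl", "wget")  # common tool UAs
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30            # illustrative budget per IP and minute
recent_hits = defaultdict(deque)        # ip -> timestamps of recent requests

@app.before_request
def filter_scrapers():
    ip = request.remote_addr

    # 1. Block requests coming from known malicious sources.
    if ip in BLOCKED_IPS:
        abort(403)

    # 2. Block clients whose User-Agent matches a known scraping tool.
    user_agent = (request.headers.get("User-Agent") or "").lower()
    if any(signature in user_agent for signature in TOOL_SIGNATURES):
        abort(403)

    # 3. Rate-limit: reject IPs that exceed the request budget per window.
    now = time.time()
    hits = recent_hits[ip]
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    hits.append(now)
    if len(hits) > MAX_REQUESTS_PER_WINDOW:
        abort(429)

@app.route("/")
def index():
    return "Content served only to visitors that passed the checks."
```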
Preventing scraping attacks is hard because it is increasingly difficult to distinguish scrapers from legitimate users. That is why the companies most exposed to plagiarism of their content, such as online stores, airlines, gambling sites, social networks, or companies whose content is subject to intellectual property, must reinforce the security measures around the content they publish on the Internet. Remember how important it is to keep your data protected online to avoid spam, phishing, and other computer crimes.