Modern Web Scraping
Published 11th September 2019
Web scraping is the automated collection of data from publicly available sources, or, in other words, using bots to gather information on the web that is intended to be read by humans.
Technically, everything that is openly available on the web can be scraped (though do remember that not everything is allowed to be scraped by law). Things that are commonly collected include organisations’ contact details (emails, addresses, phone numbers), product ranges, characteristics and prices – all this data is invaluable for competitive analysis. Bookmakers can collect information about sporting events, odds and outcomes.
Importantly, by using web scraping bookmakers can get answers to the following questions: are my odds in line with the market? And can any of my odds be used in arbitrage betting? Bookmakers can even calculate their own odds based on those from other bookmakers, but all this is subject to the main technical limitation of web scraping: latency.
In performance engineering, the term ‘latency’ means a delay between cause and visible effect. In our case this is a delay between the change of information on the scraped website and the respective change in our informational system.
The tolerable latency value depends on the kind of information we deal with. For static, rarely changing data, such as contact details, it is fine to have a couple of days’ delay. For information about pre-match sporting events odds, a 30-minute delay is quite okay. But when dealing with information about live sporting events, every second counts – a delay of just 10 seconds renders the collected data useless from a bookmaker’s point of view. So we need to understand that these cases demand completely different technical approaches.
The common approach to scraping involves a robot (or ‘web crawler’) walking through the monitored pages one by one, performing an ‘open-scan-close’ cycle for each page and rescanning the whole set of monitored pages in a loop. The minimum achievable latency for this approach equals the time it takes a crawler to re-read the whole set of its pages. Involving more crawlers can reduce the latency, but there will still be a lag of at least several minutes. This is because we cannot reload the pages too often for fear of overloading the scanned website. I will return to this point later.
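The relationship between page count, crawler count and latency can be put into a rough model. The sketch below is an illustration only: the function name, the three seconds per ‘open-scan-close’ cycle and the two-minute minimum reload interval are assumptions made for the example, not measured figures.

```python
import math


def round_robin_latency(pages: int,
                        seconds_per_page: float,
                        crawlers: int = 1,
                        min_reload_interval: float = 0.0) -> float:
    """Estimate worst-case staleness (seconds) for loop-style crawling.

    pages               -- number of monitored pages
    seconds_per_page    -- assumed duration of one open-scan-close cycle
    crawlers            -- robots sharing the page set evenly
    min_reload_interval -- assumed shortest interval at which we allow
                           ourselves to reload the same page
    """
    if pages <= 0 or crawlers <= 0 or seconds_per_page <= 0:
        raise ValueError("pages, crawlers and seconds_per_page must be positive")
    # Each crawler re-reads its own share of pages in a loop.
    pages_per_crawler = math.ceil(pages / crawlers)
    cycle_time = pages_per_crawler * seconds_per_page
    # Adding crawlers cannot push latency below the reload limit.
    return max(cycle_time, min_reload_interval)


# One crawler, 200 pages, 3 s per page: a full pass takes 10 minutes.
print(round_robin_latency(200, 3.0, crawlers=1, min_reload_interval=120.0))
# Twenty crawlers: the reload limit, not the cycle time, sets the floor.
print(round_robin_latency(200, 3.0, crawlers=20, min_reload_interval=120.0))
```

Under these assumed numbers, adding crawlers stops helping once the cycle time drops below the reload interval we impose on ourselves.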
If we want less than a minute’s latency, we need to use another approach: a dedicated crawler for each of the scanned web pages. We open a page for, say, a live sporting event, keep it open, allowing the page to renew itself, and instantly snapshot every change that occurs.
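The ‘snapshot every change’ step can be illustrated with a small comparison routine. This is a minimal sketch, assuming each snapshot of an open page has already been parsed into a dictionary of outcome names and odds; the function name and the sample values are hypothetical, and the browser automation that keeps the page open is left out.

```python
def diff_snapshot(previous: dict, current: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key that changed.

    A key present in only one snapshot is reported with None on the
    other side, so a missing key and an explicit None look the same.
    """
    changes = {}
    for key in previous.keys() | current.keys():
        old = previous.get(key)
        new = current.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Made-up odds for a single event, before and after a page update.
before = {"home": 1.80, "draw": 3.40, "away": 4.50}
after = {"home": 1.75, "draw": 3.40, "away": 4.60}
print(diff_snapshot(before, after))
```

Only the changed entries need to be passed on to the rest of the system, which keeps the processing per update small.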
While a crawler that walks through the pages in turn can be used to monitor 100-200 pages, for real-time web scraping we need one or two crawlers per page. Thus, in terms of computing power, real-time web scraping is several hundred times more expensive.
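The ‘several hundred times’ estimate follows directly from the figures above. As a quick check (the function name is an assumption made for the example):

```python
def crawler_cost_ratio(pages_per_looping_crawler: int,
                       crawlers_per_realtime_page: int) -> int:
    """How many times more crawlers real-time scraping needs per page.

    A looping crawler spends 1/pages_per_looping_crawler of a crawler
    on each page; real-time scraping spends crawlers_per_realtime_page.
    """
    return pages_per_looping_crawler * crawlers_per_realtime_page


# Range taken from the text: 100-200 pages per looping crawler,
# one or two dedicated crawlers per real-time page.
print(crawler_cost_ratio(100, 1))  # lower bound
print(crawler_cost_ratio(200, 2))  # upper bound
```

So the same set of pages needs roughly 100 to 400 times as many crawlers when it is monitored in real time.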
And still there are fundamental restrictions that prevent us from achieving an arbitrarily small latency. These include network round-trip times and the performance of the websites themselves.
At Melbet, we achieve a latency of five seconds or less in 95% of the cases we work on, which isn’t perfect but is generally tolerable for the scraping of live sporting events.
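A figure of this kind is a 95th-percentile latency. The sketch below shows one common way to compute it, the nearest-rank method, from a list of measured delays. The sample values are invented for illustration and are not real measurements; the function name is also an assumption.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Invented latencies in seconds (illustrative only).
latencies = [1.2, 0.8, 2.5, 4.9, 3.1, 1.7, 0.9, 2.2, 6.4, 1.1,
             2.0, 3.3, 1.5, 2.8, 0.7, 4.1, 1.9, 2.4, 3.6, 1.3]

p95 = percentile(latencies, 95)
print(p95, "target met" if p95 <= 5.0 else "target missed")
```

Tracking a percentile rather than an average keeps the occasional very slow update visible without letting it dominate the figure.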
RESPONSIBLE WEB SCRAPING
Cloud providers such as Amazon, Google Cloud or DigitalOcean are the best choice for web scraping, first of all because you can easily scale out your system to hundreds of virtual machines on demand during peak periods and then cut costs by turning off the machines when the load is low. Doing so is necessary since we have many times more live events at weekends than we do on weekdays. Another advantage of the cloud is that, in order to reduce the network round-trip time, you can easily choose a data centre that is geographically close to the servers of the website you are going to scrape. If you are a responsible web scraper, you won’t have any problems with cloud service providers.
What do I mean by ‘responsible’ web scraping? One needs to understand that hosting information in publicly available sources is not free for the site owner. Accessing the web page in your browser might look ‘free’, but each request to the server consumes computing power that the website owner has to pay for. If they find out that a significant percentage of traffic is consumed not by human customers but by robots, they will have every right to fight this by any means, including legal action. This is why we should avoid placing a noticeable extra load on the scraped websites. We get round this by caching the static data, limiting the frequency of page reloading and avoiding simultaneous requests to the website from many robots. Of course, all these measures come at the cost of latency.
Web scraping is as old as the web itself. Although the practice is sometimes associated with illegal things such as email spamming and bank card fraud, we should also remember that Google search itself is a kind of web scraper. In responsible hands, then, web scraping can become a powerful, high-tech tool for your business.
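For readers who want to see what the three ‘responsible’ measures described earlier (caching static data, limiting reload frequency and avoiding simultaneous requests) might look like in code, here is a minimal sketch. The class name, the default intervals and the injected fetch function are all assumptions made for the example; it is not a description of any production system.

```python
import threading
import time
from typing import Callable, Dict, Optional, Tuple


class PoliteFetcher:
    """Wraps a fetch function with a cache, a rate limit and a lock."""

    def __init__(self,
                 fetch: Callable[[str], str],
                 min_interval: float = 2.0,   # assumed gap between requests, s
                 cache_ttl: float = 3600.0,   # assumed lifetime of static data, s
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._fetch = fetch
        self._min_interval = min_interval
        self._cache_ttl = cache_ttl
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()        # one request at a time
        self._last_request: Optional[float] = None
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str, static: bool = False) -> str:
        with self._lock:
            now = self._clock()
            # 1. Serve rarely changing pages from the cache while fresh.
            if static and url in self._cache:
                fetched_at, body = self._cache[url]
                if now - fetched_at < self._cache_ttl:
                    return body
            # 2. Wait until the minimum interval since the last request.
            if self._last_request is not None:
                wait = self._min_interval - (now - self._last_request)
                if wait > 0:
                    self._sleep(wait)
            # 3. The lock guarantees no two requests run simultaneously.
            body = self._fetch(url)
            self._last_request = self._clock()
            if static:
                self._cache[url] = (self._last_request, body)
            return body
```

All crawlers that target the same website would share one such object, so the limit applies to the website as a whole rather than to each robot separately. As noted above, every second spent waiting here is a second added to the latency.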