Why You Probably Don’t Need Javascript With A Scraper

A conversation with Pim, a software engineer from Scrappey, on why you probably don't need JavaScript to scrape your data.
As web developers, we often rely on web drivers like Puppeteer to automate tasks and interact with web pages. However, it's important to remember that we can accomplish many of the same tasks using plain HTTP requests.
In fact, HTTP is the foundation of the modern web, and nearly everything that we do in a web browser can be done using HTTP requests. This includes making API calls, fetching data from a server, and submitting forms.
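For instance, submitting a form, a task often delegated to a web driver, boils down to a plain POST request with a URL-encoded body. Below is a minimal sketch using Node's built-in fetch; the login URL and field names are hypothetical placeholders:

```javascript
// Browsers send <form> data as application/x-www-form-urlencoded,
// which we can reproduce with URLSearchParams.
function encodeForm(fields) {
  return new URLSearchParams(fields).toString();
}

// Hypothetical login endpoint: the URL and field names are assumptions
// for the example, not a real API.
async function submitLogin(username, password) {
  const res = await fetch('https://www.example.com/login', {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: encodeForm({ username, password }),
  });
  return res.status;
}
```

The exact same bytes a browser would send on form submit go over the wire here, without a browser ever starting.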
HTTP requests are the building blocks of the web; almost everything a browser does can be replicated with them.
So why use a web driver at all? One reason is that web drivers allow us to execute JavaScript code and interact with web pages in a way that is not possible with HTTP requests alone. This can be useful for tasks such as rendering complex web pages or testing the behavior of a website.
Requests work fine until things become too difficult. Then it may be worth switching to Puppeteer or using a solution like Scrappey.
However, it's important to consider the trade-offs. HTTP requests are generally faster and more efficient than using a web driver, which requires running a full browser instance. This can be particularly important for tasks that need to be performed quickly, such as web scraping or testing. Additionally, using HTTP requests can be simpler and easier to implement than using a web driver, which requires more complex code and setup.
Of course, there are situations where using a web driver is the best choice. But it's worth considering whether plain HTTP requests could be a suitable alternative for your project. In many cases, they can be a powerful and efficient tool for interacting with the web.
Here is an example of using plain HTTP requests to perform a task that is often associated with web drivers: scraping a web page.
First, let's take a look at an example of using the request module in Node.js (now deprecated, though the approach applies equally to fetch or axios) to make a GET request and parse the HTML of a web page:
```javascript
const request = require('request');
const cheerio = require('cheerio');

request.get('https://www.example.com', (error, response, body) => {
  if (error) {
    console.error(error);
    return;
  }
  const $ = cheerio.load(body);
  const title = $('title').text();
  console.log(title);
});
```
This code sends a GET request to the specified URL, loads the HTML of the page into a cheerio object, and extracts the title of the page using a jQuery-like syntax.
Now, let's compare this to using a Chromium-based web driver like Puppeteer to accomplish the same task:
```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.example.com');
  const title = await page.title();
  console.log(title);
  await browser.close();
})();
```
As you can see, both approaches allow us to scrape the title of a web page, but they go about it in different ways. The request module uses plain HTTP requests to fetch the HTML of the page, while Puppeteer uses a full web browser to navigate to the page and extract the title.
Comparing the two solutions, scrapers built with only HTTP requests in mind tend to break faster and require more maintenance.
As web developers, we often rely on HTTP requests to interact with web pages and perform tasks like scraping or testing. However, there are certain businesses that use cloud services like Cloudflare or Akamai to protect their websites from bots and scraping, which can make it more difficult to use HTTP requests.
To bypass these protections, some developers may turn to using a web driver like Puppeteer, which emulates a real browser and can bypass certain types of blocking or rate limiting. While this can be an effective solution, it comes with its own set of drawbacks. Running a full web browser and emulating user interactions can be resource-intensive and slow, and it may not be suitable for tasks that need to be performed quickly or at scale.
Bypassing services like Cloudflare and Akamai is hard to do at scale. Browser-based solutions require a lot of resources, eventually too many to even be considered profitable.
That's why we've developed a solution that allows us to keep sending HTTP requests, but in a way that is less likely to trigger the protections put in place by cloud services. By carefully constructing our requests and adding appropriate headers and cookies, we can make our requests appear more like those of a real user and avoid being blocked or rate-limited.
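As an illustration, here is a minimal sketch in Node.js of making a plain request look more browser-like. The header values are assumptions for the example; a real setup would rotate and tune them per target:

```javascript
// Build a header set resembling what a desktop browser sends on a
// normal page navigation. Values here are illustrative placeholders.
function browserLikeHeaders(cookies = '') {
  const headers = {
    // A typical desktop Chrome User-Agent string
    'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    // Browsers advertise HTML first when navigating
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
  };
  if (cookies) headers['Cookie'] = cookies;
  return headers;
}

// Fetch a page with the browser-like headers attached.
async function fetchLikeABrowser(url, cookies) {
  const res = await fetch(url, { headers: browserLikeHeaders(cookies) });
  return res.text();
}
```

Headers alone won't defeat every protection, but they remove the most obvious tells that a request came from a script rather than a browser.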
Of course, there are pros and cons to consider when deciding whether to use plain HTTP requests or a web driver. HTTP requests are generally faster and more efficient, simpler and easier to implement, and can be more secure, as they avoid running a full browser instance. However, they may be more susceptible to being blocked or rate-limited by cloud services, and they are limited to making HTTP requests and parsing the resulting HTML, CSS, and JavaScript. On the other hand, web drivers like Puppeteer allow for more advanced interactions with the web page, such as executing JavaScript code or interacting with the DOM, but they are slower and more resource-intensive, and they require more complex code and setup.
Ultimately, the choice between using plain HTTP requests and a web driver will depend on the requirements of your project. Both can be powerful tools for interacting with the web, and it's important to carefully consider your needs and choose the approach that is most appropriate for you.
Check here for a JavaScript example of how scraping looks when using Scrappey to bypass most of these limitations.