How to Scrape the Web with NodeJs

Need Javascript?

If you need to scrape a webpage that requires Javascript to render properly, check out our article on browser automation with NodeJs using either Puppeteer or Selenium.

In recent years, Javascript has climbed the popularity ranks thanks to advancements in NodeJs, and it’s taken the web and the world by storm. There may come a time when you need to scrape a web page and you want to do it in Node. By the end of this article, you will have a full understanding of how to get html data from web pages using NodeJs.

Prerequisites:

  • NodeJs and npm installed on your machine (npm comes with the NodeJs installer). If not, you can follow this great tutorial to get them set up; you can verify both with the commands after this list.
  • Experience programming in Javascript ES6.
  • An interest in web scraping.
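
To verify that both are available, check their versions from the console:

node -v
npm -v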

Now there are a few different ways to make a request to the web in Node. In this article we’ll cover axios and request. First up, axios.

To install axios type:

npm install axios

Now that that’s installed, we can write a simple script.

const axios = require('axios');
const fs = require('fs');

// Request the search results page and save the raw html to disk.
axios.get('https://www.google.com/search?q=gtx+3080+ti')
.then(function (response) {
    fs.writeFile('response.html', response.data, function (err) {
        if (err) {
            return console.log(err);
        }
        console.log("The file was saved!");
    });
})
.catch(function (err) {
    // Log network or http errors instead of leaving the rejection unhandled.
    console.log(err);
});

Now when we run that script, it makes a request to Google querying for the GTX 3080 Ti and saves the response html to the same directory your Javascript file is in.
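
To try it yourself, save the script (I’m assuming a filename of scrape.js here) and run it with node from that directory:

node scrape.js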

Now, what you may notice if you load that html into a browser (by double clicking the file) is that the rendered web page looks a little off. This can happen for a few reasons; the reason here is that Google changes how it formats the html based on the headers we send up. One header in particular is important, and that’s the User-Agent header. This header tells the web server what kind of client is calling it.

Here’s what the new script will look like, and I’ll explain the changes below:

const axios = require('axios');
const fs = require('fs');

// A User-Agent string matching what a real desktop Firefox browser sends.
let agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0';

let headers =
{
    'User-Agent': agent,
    'Host': 'www.google.com',
    'Sec-Fetch-Dest': 'document',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8'
};

// Pass the headers through the request config so the request looks like it came from a browser.
axios.get('https://www.google.com/search?q=gtx+3080+ti',
{
    headers: headers
})
.then(function (response) {
    fs.writeFile('response.html', response.data, function (err) {
        if (err) {
            return console.log(err);
        }
        console.log("The file was saved!");
    });
});

First up is the agent variable, which will be sent as the User-Agent header. It identifies the kind of client calling the web server.

Host is the request header that specifies the host and port number (defaulting to 80 for http and 443 for https when omitted) of the server to which the request is being sent.

Sec-Fetch-Dest tells the web server how the fetched data will be used. In this case, document means it will be displayed in a web browser.

Accept is the request header that indicates which content types, also known as MIME types, the client is able to understand.

By including these headers in the request, we are mimicking what a browser would typically send when requesting the same web page. If you run the script and open the returned html, it should look a lot more like what you would see running the same query through Google.
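
If you want to see the effect of these headers for yourself, one quick sketch (not part of the original script) is to fetch the page twice, once with and once without the browser User-Agent, and compare how much html comes back:

const axios = require('axios');

const url = 'https://www.google.com/search?q=gtx+3080+ti';
const agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0';

// Make both requests, then compare the size of the html in each response.
Promise.all([
    axios.get(url),
    axios.get(url, { headers: { 'User-Agent': agent } })
])
.then(function ([plain, browserLike]) {
    console.log('without User-Agent: ' + plain.data.length + ' characters');
    console.log('with User-Agent: ' + browserLike.data.length + ' characters');
});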

Now we’ll look at another library for making http requests, this one simply called request. This library has been around longer and is used by many applications and libraries, but its author has stated that the project is officially deprecated. That doesn’t mean the library is useless or not worth looking at.

To install this library you’ll want to type into the console:

npm install request

With a few simple modifications we can update our script to use the request library. Here’s the full script and I’ll explain the changes below.

const request = require("request");
const fs = require('fs');

let agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0';

let headers = 
{
    'User-Agent': agent,
    'Host': 'www.google.com',
    'Sec-Fetch-Dest': 'document',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8'
};

// Everything about the request, including the url, the http method, and the
// headers, lives in this single options object.
const options = {
    url: 'https://www.google.com/search?q=gtx+3080+ti',
    method: 'GET',
    headers: headers
};

request(options, function(err, res, body) {
    // If the request itself failed, body will be undefined, so bail out early.
    if (err) {
        return console.log(err);
    }
    fs.writeFile('response.html', body, function(err) {
        if (err) {
            return console.log(err);
        }
        console.log("The file was saved!");
    });
});

As you can see there are a few slight differences. Instead of calling a method-specific function as we did with axios.get, here we define the http method through the options object, which is where everything lives. The request library also uses callbacks instead of Promises, which is something you’ll want to take into consideration when choosing between the two. The values we pass in are the same, and will produce a similar web page request as far as Google is concerned.
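
If you prefer Promises, Node’s built-in util.promisify can wrap the callback style for you. A minimal sketch: note that promisify resolves with only the first success argument, the response object, so the html is read from response.body rather than the separate body parameter:

const request = require('request');
const util = require('util');
const fs = require('fs');

// Wrap request so it returns a Promise instead of taking a callback.
const requestAsync = util.promisify(request);

const options = {
    url: 'https://www.google.com/search?q=gtx+3080+ti',
    method: 'GET'
};

requestAsync(options)
.then(function (response) {
    // promisify hands back the response object; the html lives on response.body.
    fs.writeFile('response.html', response.body, function (err) {
        if (err) {
            return console.log(err);
        }
        console.log("The file was saved!");
    });
})
.catch(function (err) {
    console.log(err);
});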

One problem with the approaches above is that a plain web request like this has no browser engine behind it to run the page’s javascript (which is funny, because we’re programming in javascript right now). For that, you’ll want to read this tutorial to learn how to use libraries like puppeteer or selenium to control a headless browser and get that javascript going.