Node.js has emerged as a powerful option for building web scrapers, offering convenience for both client-side and server-side development. Its extensive catalog of libraries makes web scraping with Node.js a breeze. This article spotlights cheerio and explores its capabilities for efficient web scraping.
Cheerio is a fast and flexible library for parsing and manipulating HTML and XML documents. It implements a subset of jQuery features, which means anyone familiar with jQuery will find themselves at home with the syntax of cheerio. Under the hood, cheerio uses the `parse5` and, optionally, the `htmlparser2` libraries for parsing HTML and XML documents.
In this article, you’ll create a project that uses cheerio and learn how to scrape data from static web pages.
Web Scraping with cheerio
Before you begin this tutorial, make sure you have Node.js installed on your system. If you don’t have it already, you can install it by following the official documentation.
Once you’ve installed Node.js, create a directory called `cheerio-demo` and `cd` into it:
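A minimal sketch of the setup commands, assuming a Unix-like shell:

```shell
mkdir cheerio-demo
cd cheerio-demo
```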
Then initialize an npm project in the directory:
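The `-y` flag accepts npm's default answers for all prompts:

```shell
npm init -y
```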
Install the cheerio and Axios packages:
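Both packages come from the npm registry:

```shell
npm install cheerio axios
```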
Create a file called `index.js`, which is where you’ll be writing the code for this tutorial. Then open this file in your favorite editor to get started.
The first thing you need to do is to import the required modules:
In this tutorial, you’ll scrape the Books to Scrape page, a public sandbox for testing web scrapers. First, you’ll use Axios to make a `GET` request to the web page with the following code:
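A sketch of that request; the callback runs once the HTML has been downloaded:

```javascript
const axios = require('axios');

axios
  .get('https://books.toscrape.com/')
  .then((response) => {
    // response.data is the raw HTML string of the page
    console.log(response.data.length);
  })
  .catch((err) => console.error(err.message));
```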
The `response` object in the callback contains the HTML code of the web page in the `data` property. This HTML needs to be passed to the `load` function of the `cheerio` module. This function returns an instance of `CheerioAPI`, which will be used to access and manipulate the DOM for the rest of the code. Note that the `CheerioAPI` instance is stored in a variable named `$`, which is a nod to the jQuery syntax:
Finding Elements
cheerio supports CSS selectors for selecting elements from the page. If you’ve used jQuery, you’ll find the syntax familiar: pass the CSS selector to the `$()` function. Use this syntax to find and extract information on the first page of the Books to Scrape website.
Visit https://books.toscrape.com/ and open up the developer console. Use the Inspect Element tab to learn more about the HTML structure of the page. In this case, you can see that all the information about the books is contained in `article` tags with the class `product_pod`:
To select the books, you need to use the `article.product_pod` CSS selector like this:
This function returns a list of all the elements that match the selector. You can use the `each` method to iterate over the list:
Inside the loop, you can use the `element` variable to extract the data.
Try to extract the titles of the books on the first page. Going back to the Inspect Element console, you can see how the titles are stored:
You can see that you need to find an `h3`, which is a child of the `element` variable. Inside the `h3`, there is an `a` element that holds the book’s title. You can use the `find` method with a CSS selector to find the children of an element, but first, you need to pass `element` through `$` to convert it into an instance of `Cheerio`:
Now, you can find the `a` inside `titleH3`:
Note: `titleH3` is already an instance of `Cheerio`, so you don’t need to pass it through `$`.
Extracting Text
Once you’ve selected an element, you can get the text of that element using the `text` method.

Modify the previous example to extract the book’s title by calling the `text` method on the result of the `find` method:
The complete code should look like this:
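A sketch of how the full script might look at this point (variable names are illustrative):

```javascript
const axios = require('axios');
const cheerio = require('cheerio');

axios
  .get('https://books.toscrape.com/')
  .then((response) => {
    const $ = cheerio.load(response.data);

    // One article.product_pod per book on the page
    $('article.product_pod').each((index, element) => {
      const titleH3 = $(element).find('h3');
      const title = titleH3.find('a').text();
      console.log(title);
    });
  })
  .catch((err) => console.error(err.message));
```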
Run the code with `node index.js`, and you should see the titles of the books on the page printed to the console.
Navigating the DOM: Finding Children and Siblings
Once you’ve extracted the titles, it’s time to extract the price and availability of each book. Inspect Element reveals that both the price and availability are stored in a `div` with the class `product_price`. You can select this `div` with the `.product_price` CSS selector, but since you’ve already covered CSS selectors, the following will discuss another way to do this:
Note: The `div` is a sibling of the `titleH3` you selected previously. By calling the `next` method on `titleH3`, you can select the next sibling:
You’ve already seen that you can use the `find` method to find the children of an element based on CSS selectors. You can also select all the children with the `children` method and then use the `eq` method to select a particular child. This is equivalent to the `:nth-child` CSS selector.
In this case, the price is the first child of `priceDiv`, and the availability is the second child. This means you can select them with `priceDiv.children().eq(0)` and `priceDiv.children().eq(1)`, respectively. Do that and print the price and availability:
Now, running the code prints the price and availability alongside each title.
Accessing Attributes
So far, you’ve navigated the DOM and extracted text from elements. It’s also possible to extract attributes from an element using cheerio, which is what you’ll do in this section. Here, you’ll extract the rating of books by reading the class list of elements.
The rating of the books has an interesting structure. The ratings are contained in a `p` tag. Each `p` tag has exactly five stars, but the stars are colored using CSS based on the class name of the `p` element. For example, in a `p` with the classes `star-rating` and `Four`, the first four stars are colored yellow, denoting a four-star rating:
To extract the rating of a book, you need to extract the class names of the `p` element. The first step is to find the paragraph containing the rating:
By passing the attribute name to the `attr` method, you can read the attributes of an element. In this case, you need to read the class list, which is demonstrated in the following code:
The class list is in the following form: `star-rating X`, where `X` is one of `One`, `Two`, `Three`, `Four`, and `Five`. This means you need to split the class list on the space and take the second element. The following code does that and converts the textual rating into a numerical rating:
If you put everything together, your code will look like this:
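A sketch of the combined script (variable names and the `ratingMap` helper are illustrative):

```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const ratingMap = { One: 1, Two: 2, Three: 3, Four: 4, Five: 5 };

axios
  .get('https://books.toscrape.com/')
  .then((response) => {
    const $ = cheerio.load(response.data);

    $('article.product_pod').each((index, element) => {
      const titleH3 = $(element).find('h3');
      const title = titleH3.find('a').text();

      // The product_price div is the h3's next sibling
      const priceDiv = titleH3.next();
      const price = priceDiv.children().eq(0).text();
      const availability = priceDiv.children().eq(1).text().trim();

      // The second class name on p.star-rating is the textual rating
      const ratingClass = $(element).find('p.star-rating').attr('class');
      const rating = ratingMap[ratingClass.split(' ')[1]];

      console.log(title, price, availability, rating);
    });
  })
  .catch((err) => console.error(err.message));
```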
Running the code now also prints each book’s rating as a number.
Saving the Data
After scraping the data from the web page, you’d generally want to save it. There are several ways you can do this, such as saving to a file, saving to a database, or feeding it to a data processing pipeline. In this section, you’ll learn the simplest of all—saving data in a CSV file.
To do so, install the `csv-stringify` package from the node-csv project:
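`csv-stringify` can be installed on its own, without the rest of the node-csv packages:

```shell
npm install csv-stringify
```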
In `index.js`, import the `fs` and `csv-stringify` modules:
To write a local file, you need to create a WriteStream
:
Declare the column names, which are added to the CSV file as headers:
Create a stringifier with the column names:
Inside the each
function, you’ll use stringifier
to write the data:
Finally, outside the each
function, you need to write the contents of stringifier
into the writableStream
variable:
At this point, your code should look like this:
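A sketch of the finished script (variable names and the `ratingMap` helper are illustrative):

```javascript
const fs = require('fs');
const axios = require('axios');
const cheerio = require('cheerio');
const { stringify } = require('csv-stringify');

const writableStream = fs.createWriteStream('scraped_data.csv');
const columns = ['title', 'price', 'availability', 'rating'];
const stringifier = stringify({ header: true, columns });

const ratingMap = { One: 1, Two: 2, Three: 3, Four: 4, Five: 5 };

axios
  .get('https://books.toscrape.com/')
  .then((response) => {
    const $ = cheerio.load(response.data);

    $('article.product_pod').each((index, element) => {
      const titleH3 = $(element).find('h3');
      const title = titleH3.find('a').text();

      const priceDiv = titleH3.next();
      const price = priceDiv.children().eq(0).text();
      const availability = priceDiv.children().eq(1).text().trim();

      const ratingClass = $(element).find('p.star-rating').attr('class');
      const rating = ratingMap[ratingClass.split(' ')[1]];

      // Queue one CSV row per book
      stringifier.write([title, price, availability, rating]);
    });

    // Stream the rows into scraped_data.csv
    stringifier.pipe(writableStream);
    stringifier.end();
  })
  .catch((err) => console.error(err.message));
```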
Run the code, and it should create a `scraped_data.csv` file with the scraped data inside.
Conclusion
As you’ve seen here, the cheerio library makes web scraping easy with its jQuery-esque syntax and blazing-fast operation. In this article, you learned how to do the following:
- Load and parse an HTML web page with cheerio
- Find elements with CSS selectors
- Extract data from elements
- Navigate the DOM
- Save scraped data into local file storage
You can find the complete code on GitHub.
However, cheerio is just an HTML parser, so it can’t execute JavaScript code. That means you can’t use it to scrape dynamic web pages and single-page applications. To scrape those, you need to look beyond cheerio to more complex tools like Selenium or Playwright. And that’s where Bright Data comes in. Bright Data’s web scraping solutions include a Selenium Scraping Browser and a Playwright Scraping Browser. To learn more about these products, you can visit our Scraping Browser documentation.