In this guide, you will see:
- What Puppeteer is and what it offers for web scraping
- How to use it with a step-by-step tutorial to extract data from a dynamic site
- Advanced Puppeteer web scraping scenarios
- A comparison between Puppeteer, Playwright, and Selenium
- Its limitations and how to overcome them
Let’s dive in!
What Is Puppeteer?
Puppeteer is a Node.js library that provides a high-level API for automating Chromium-based browsers—and recently, also Firefox. It is widely used for web scraping as a browser automation tool to control headless browsers.
The library allows you to appear like a regular user by simulating user behavior in a controlled browser. Puppeteer also offers a rich API for selecting HTML elements and extracting data from them.
Key features of Puppeteer for web scraping include:
- JavaScript execution for handling dynamic content
- API for simulating user interactions, such as navigating pages, clicking elements, and more
- API for data extraction
- Ability to take screenshots of HTML elements and pages for visual scraping (e.g., competitor sites)
- Headless browser support for faster scraping
- Network request interception
- Extensive customization options like user-agent, proxy settings, and more
In short, Puppeteer lets you interact with pages like a real user, making it perfect for scraping modern, JavaScript-heavy pages.
Web Scraping With Puppeteer: Step-By-Step Example
In this guided section, you will learn how to build a scraping script using Puppeteer. The target site will be a modified version of Quotes to Scrape, which loads quotes dynamically via JavaScript rendering:
This page uses JavaScript to render the quotes after a delay controlled by the delay query parameter. For instance, ?delay=500 sets the delay to 500 ms.
The script will use browser automation to:
- Connect to the target page
- Wait for the quote elements to render
- Scrape data from them
- Export the scraped data to CSV
Time to build your Puppeteer web scraping script!
Step #1: Project Setup
First, make sure you have the latest LTS version of Node.js installed on your machine. If not, download it from the official site and follow the installation steps.
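You can verify the installed version from the terminal:
node -v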
Create a new folder for your JavaScript project and navigate to it in the terminal:
mkdir puppeteer-web-scraper
cd puppeteer-web-scraper
Run the command below to initialize a Node.js project inside the folder:
npm init -y
Next, open your Puppeteer web scraping project in your preferred JavaScript IDE and add an index.js file to it:
Amazing! You are ready to start working with Puppeteer.
Step #2: Puppeteer Installation and Setup
Install Puppeteer by adding the puppeteer
npm package to your project’s dependencies:
npm install puppeteer
During installation, Puppeteer automatically downloads the appropriate version of Chrome.
To get started, import puppeteer in index.js:
const puppeteer = require("puppeteer");
Next, create an async function in which to initialize a browser instance:
const puppeteer = require("puppeteer");
(async () => {
// launch the browser to control and open a new page
const browser = await puppeteer.launch();
const page = await browser.newPage();
// browser automation for web scraping...
// close the browser and release resources
await browser.close();
})();
The launch()
method starts a new browser instance in headless mode by default. Then, newPage()
opens a new tab within that instance. Finally, close()
properly shuts down the browser.
To run the browser in headed mode—useful for debugging and visually tracking the script’s execution—set the headless option to false:
const browser = await puppeteer.launch({
headless: false,
});
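If the automation runs too fast to follow visually, you can also pass the slowMo launch option, which delays each Puppeteer operation by the given number of milliseconds (the value below is just an example):
const browser = await puppeteer.launch({
  headless: false,
  slowMo: 100, // delay each operation by 100 ms (example value)
});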
Wonderful! Your Puppeteer scraping setup is complete.
Step #3: Visit the Target Page
Use the goto()
method to navigate the browser to the target page:
await page.goto("https://quotes.toscrape.com/js-delayed/?delay=1000");
The delay=1000
query parameter instructs the page to load the quotes asynchronously after 1 second.
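By default, goto() resolves when the page fires its load event. If you also want to wait for network activity to settle, you can pass a waitUntil option (optional here, since the script explicitly waits for the quote elements in a later step):
await page.goto("https://quotes.toscrape.com/js-delayed/?delay=1000", {
  waitUntil: "networkidle0", // resolve when there are no network connections for at least 500 ms
});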
Enable headed mode in your script, then set a breakpoint before the close()
method, and run the script in the debugger. You should see something like this:
Perfect! Your Puppeteer web scraping navigation is working perfectly.
Step #4: Inspect the Target Page
Before implementing the scraping logic, it is important to study the site’s structure and the DOM of the target page. To do that, open the website in incognito mode in your browser.
Since you need to extract data from each quote on the page, inspect a quote’s HTML element. Right-click on the first quote and select “Inspect” to open the DevTools panel:
Note that each quote is contained within a .quote HTML element. Each quote includes:
- The quote text inside a .text node.
- The author’s name inside an .author element.
- Multiple .tag elements inside a .tags node.
Good! You now have all the necessary information to do web scraping with Puppeteer.
Step #5: Implement the Scraping Logic
Since you want to scrape multiple quotes, start by initializing an array in which to store the extracted data:
const quotes = [];
Remember that quotes are loaded dynamically via JavaScript. Initially, the page appears empty, and the content is populated later:
To verify that all quotes are loaded before scraping, wait for the .quote
elements to appear:
await page.waitForSelector(".quote");
The waitForSelector() method waits for an element matching the specified selector to appear in the page’s DOM.
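By default, waitForSelector() times out after 30 seconds. If needed, you can tweak that behavior with the visible and timeout options (the values below are only examples):
await page.waitForSelector(".quote", {
  visible: true, // wait until the element is actually visible, not just in the DOM
  timeout: 10000, // fail after 10 seconds instead of the default 30
});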
Now that all quotes are on the page, use $$() to select all elements matching the CSS selector .quote:
const quoteElements = await page.$$(".quote");
Note that $$()
supports both CSS and XPath selectors. Discover more in our guide on XPath vs CSS Selector.
Next, iterate through each quote element:
for (const quoteElement of quoteElements) {
// scraping logic...
}
Inside the loop, extract the quote text, author, and tags:
const textElement = await quoteElement.$(".text");
const text = await textElement.evaluate((el) =>
el.textContent.replace("“", "").replace("”", "")
);
const authorElement = await quoteElement.$(".author");
const author = await authorElement.evaluate((el) => el.textContent);
const tags = await quoteElement.$$eval(".tags .tag", (tagElements) =>
tagElements.map((el) => el.textContent)
);
$() works just like $$(), but it returns only the first element matching the given selector. The snippet above uses the evaluate() method to execute a function within the context of an HTML element. Specifically, it extracts the text content from the selected element. The $$eval() method combines $$() with evaluate() to collect all tag texts into an array.
After extracting the data, store it in an object and add it to the quotes
array:
const quote = {
"text": text,
"author": author,
"tags": tags,
};
quotes.push(quote);
Terrific! The data extraction logic is finalized.
Step #6: Export Data to CSV
While you can write CSV files with Node.js’s built-in fs module, a dedicated library like csv-writer simplifies the process. Add it to your project’s dependencies with:
npm install csv-writer
Then, import it into your index.js
file:
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
Before exporting the data to CSV, flatten each quote into an object containing only string values. Currently, the tags field is an array, so use join() to convert it into a single string:
const flattenQuotes = quotes.map((quote) => {
return {
...quote,
tags: quote.tags.join(";"),
};
});
Now, use the flattened data to write the scraped quotes into a quotes.csv
file:
const csvWriter = createCsvWriter({
path: "quotes.csv",
header: [
{ id: "text", title: "Text" },
{ id: "author", title: "author" },
{ id: "tags", title: "Tags" },
],
});
// write the scraped quotes in a flattened format to the CSV
await csvWriter.writeRecords(flattenQuotes);
Fantastic! You started with quotes from a webpage and now have them neatly organized in a CSV file.
Step #7: Put It All Together
Your final Puppeteer web scraping script should look like this:
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
(async () => {
// launch the browser to control and open a new page
const browser = await puppeteer.launch({
headless: false, // comment out in production
});
const page = await browser.newPage();
// navigate to the target page
await page.goto("https://quotes.toscrape.com/js-delayed/?delay=1000");
// where to store the scraped data
const quotes = [];
// wait for the quote elements to load
await page.waitForSelector(".quote");
// select all quote containers
const quoteElements = await page.$$(".quote");
// extract data from each quote
for (const quoteElement of quoteElements) {
// scraping logic
const textElement = await quoteElement.$(".text");
const text = await textElement.evaluate((el) =>
el.textContent.replace("“", "").replace("”", "")
);
const authorElement = await quoteElement.$(".author");
const author = await authorElement.evaluate((el) => el.textContent);
const tags = await quoteElement.$$eval(".tags .tag", (tagElements) =>
tagElements.map((el) => el.textContent)
);
// populate a new quote with the scraped data
// and add it to the list
const quote = {
text: text,
author: author,
tags: tags,
};
quotes.push(quote);
}
// prepare the scraped data to be exported to CSV
const flattenQuotes = quotes.map((quote) => {
return {
...quote,
tags: quote.tags.join(";"),
};
});
// configurations for the CSV output file
const csvWriter = createCsvWriter({
path: "quotes.csv",
header: [
{ id: "text", title: "Text" },
{ id: "author", title: "author" },
{ id: "tags", title: "Tags" },
],
});
// write the scraped quotes in a flattened format to the CSV
await csvWriter.writeRecords(flattenQuotes);
// close the browser
await browser.close();
})();
Great job! In less than 100 lines of code, you built a script to scrape quotes using Puppeteer.
To run the script, execute this command in your terminal:
node index.js
The script will take some time to run. After a while, a quotes.csv
file will appear in your project’s directory. Open it, and you will see the scraped data:
Et voilà! The output CSV file now contains the same data from the target page, neatly organized in a structured format.
Advanced Web Scraping Techniques Using Puppeteer
Now that you know the basics of web scraping with Puppeteer, you are ready to explore more advanced scenarios!
User Interaction
Over time, web pages have become more complex than ever. They now include advanced navigation and interaction patterns like “Load More” buttons, forms, and dynamic dropdowns.
Learn more in our guide on scraping sites with complex navigation.
Puppeteer is built to handle many of these scenarios with dedicated methods. For instance, you can locate and click a “Load More” button with:
await page.click("#load-more");
The click()
function simulates a real user interaction on the button identified by the #load-more
selector.
Keep in mind that the recommended way to interact with an element is by using a Locator. This defines a strategy for finding elements and performing actions on them. If you already have a locator for the button, simply call click() on it:
const button = page.locator("#load-more");
await button.click();
Note: When using locators, Puppeteer automatically waits for the element to appear in the DOM and reach the right state before performing an action. For example, before clicking, Puppeteer:
- Ensures the element is within the viewport.
- Waits for the element to become visible.
- Waits for the element to be enabled.
- Confirms the element’s bounding box remains stable across two consecutive animation frames.
To fill out and submit a form, you can use:
await page.type('input[name="email"]', "[email protected]");
await page.type('input[name="password"]', "mysecurepassword");
await page.click('button[type="submit"]');
The type() method inputs text into form fields, while click() submits the form by clicking the submit button. Locators offer equivalent methods, such as fill() for text input and click(), as shown below.
Finally, you can select an option from a dropdown with:
await page.select("#country", "USA");
The select()
function picks a dropdown option based on its value. The above snippet will select the “USA” item in the <select>
element identified by the #country
selector.
JavaScript Execution
While standard Puppeteer simulation methods handle most scenarios, they may not cover every interaction. In such cases or when custom interactions are needed, you can run JavaScript directly within the page.
For example, to scroll down the page, you would run the following JavaScript script:
window.scrollTo(0, document.body.scrollHeight);
Puppeteer allows you to execute JavaScript on the page using the evaluate()
method:
await page.evaluate(() => {
window.scrollTo(0, document.body.scrollHeight);
});
This injects the JavaScript code inside the evaluate()
callback and executes it on the page. As a result, the page scrolls to the bottom, mimicking real user behavior. Custom scripts like these are useful for handling pages with complex UI interactions.
User-Agent Spoofing
User-Agent
is one of the most important HTTP headers for web scraping. By default, Puppeteer uses the User-Agent
string of the controlled browser, which usually looks like this:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/132.0.0.0 Safari/537.36
The issue is the presence of the “HeadlessChrome” string. That can signal automation, as real users do not run browsers without a graphical interface—only bots do.
To reduce the risk of detection, you can override the User-Agent
with the setUserAgent()
method:
const customUserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.6834.160 Safari/537.36";
await page.setUserAgent(customUserAgent);
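To double-check that the override is in place, you can read the user agent back from the page context (a quick, optional sanity check):
const uaInPage = await page.evaluate(() => navigator.userAgent);
console.log(uaInPage); // should no longer contain "HeadlessChrome"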
Discover more in our guide on changing and setting the Puppeteer User-Agent.
Request Interception
On most dynamic sites, data is fetched via AJAX requests rather than being embedded in the initial HTML. Puppeteer allows you to intercept these requests and inspect their responses using the request
and response
listeners.
For example, to intercept a request to /api/v1/quotes
and log its response data, use the following code:
// enable request interception on the page
await page.setRequestInterception(true);
// intercept requests
page.on("request", async (request) => {
if (request.url().includes("/api/v1/quotes")) {
// perform some actions...
// continue the request normally
request.continue();
}
});
// intercept responses
page.on("response", async (response) => {
if (response.url().includes("/api/v1/quotes")) {
// log the request's response
const jsonData = await response.json();
console.log("Intercepted data:", jsonData);
}
});
Here, setRequestInterception(true)
enables request interception, while event listeners capture both requests and responses for the targeted URL.
Intercepted responses often contain dynamically loaded data from the server. That means you can intercept them to access that information without waiting for the page to fully render. Additionally, those listeners enable you to modify, delay, or block requests as needed.
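The same mechanism can also be used to block requests you do not need, such as images and fonts, which can noticeably speed up scraping (a sketch; which resource types to skip is up to you):
await page.setRequestInterception(true);
page.on("request", (request) => {
  // abort requests for resources that are not needed for data extraction
  if (["image", "font", "media"].includes(request.resourceType())) {
    request.abort();
  } else {
    request.continue();
  }
});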
Proxy Integration
To protect your IP and identity while using Puppeteer, you can configure your script to route traffic through a proxy server. This helps mask your real IP address and bypass rate limits.
Use the following code to set up a proxy in Puppeteer:
const proxyURL = "http://your-proxy-server.com:PORT";
const browser = await puppeteer.launch({
args: [
`--proxy-server=${proxyURL}`,
],
});
const page = await browser.newPage();
If the proxy requires authentication, you then need to call authenticate():
await page.authenticate({
username: "your-username", // optional proxy username
password: "your-password", // optional proxy password
});
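To confirm that traffic is actually going through the proxy, you can visit an IP-echo service and log the response (a quick check, assuming httpbin.org is reachable from your network):
await page.goto("https://httpbin.org/ip");
const ipInfo = await page.evaluate(() => document.body.innerText);
console.log(ipInfo); // should report the proxy's IP, not your own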
For more details, check out our Puppeteer proxy integration guide.
If you are looking for reliable proxies, remember that Bright Data’s proxy network is used by Fortune 500 companies and over 20,000 customers. This worldwide proxy network includes:
- Datacenter proxies: Over 770,000 datacenter IPs.
- Residential proxies: Over 72M residential IPs in more than 195 countries.
- ISP proxies: Over 700,000 ISP IPs.
- Mobile proxies: Over 7M mobile IPs.
Puppeteer vs Playwright vs Selenium: Scraping Comparison
The two most popular Puppeteer alternatives for web scraping are:
- Playwright: A powerful browser automation library developed by Microsoft. It supports multiple browsers, including Chromium, Firefox, and WebKit, and is designed for both web scraping and testing. Playwright offers built-in waiting mechanisms, a rich automation API, and multi-language support, making it a modern and versatile choice.
- Selenium: A widely used browser automation framework that supports Chrome, Firefox, Internet Explorer, and Safari. It has been around for many years and is commonly used for testing, automation, and web scraping. It supports several programming languages, but its API is less modern, and it is generally slower than Puppeteer and Playwright.
Compare these browser automation solutions in the table below:
Feature | Puppeteer | Playwright | Selenium |
---|---|---|---|
Developed By | Google | Microsoft | Community and Sponsors |
GitHub Stars | 89.6k stars | 69k stars | 31.5k stars |
Browser Support | Chromium-based browsers, Firefox | Chromium-based browsers, Firefox, WebKit | Chromium-based browsers, Firefox, IE, Safari |
Automation API | Rich | Extensive | Focused on the basics |
Developer Experience | Good | Great | Decent |
Documentation | Excellent | Excellent | Adequate |
Programming Languages | JavaScript | JavaScript, Python, C#, Java | Java, Python, C#, Ruby, JavaScript |
Performance | Fast | Fast | Medium/Slow |
Community | Large | Medium | Very large |
For a deeper comparison, check out our dedicated comparison articles.
Puppeteer Limitations and How to Overcome Them
Puppeteer is a powerful tool for web scraping, but it has the following limitations:
- Performance issues: Browsers, even in headless mode, consume significant system resources—in terms of RAM and CPU. Running multiple instances can slow down even large servers and significantly impact performance.
- Instability: The connection with local browsers can break over time, especially during long-running sessions or when handling many tabs.
- Anti-bot mechanisms: The configurations required to automate browsers are slightly different from those of regular users. These differences can trigger anti-bot systems, leading to CAPTCHAs or IP bans. To mitigate that, explore our tutorial on the Puppeteer Stealth plugin. Also, take a look at our tutorial on how to bypass CAPTCHAs with Puppeteer.
Switching between browser automation libraries like Selenium, Playwright, or others will not solve all the above issues. Those challenges are related to the browser itself, not the library. Browsers are resource-heavy, unstable, and require special configurations for automation.
The real solution lies in using a cloud browser optimized for web scraping, such as Puppeteer Web Scraping Browser. This solution offers a scalable browser with built-in anti-bot bypass functionality. It also rotates the exit IP at each request and can handle browser fingerprinting, CAPTCHA resolution, and automated retries for you.
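Cloud browsers like these typically expose a WebSocket endpoint that you can attach to with puppeteer.connect() instead of launching a local browser. Here is a minimal sketch, with a placeholder endpoint:
const puppeteer = require("puppeteer");

(async () => {
  // attach to a remote browser instead of launching a local one
  const browser = await puppeteer.connect({
    browserWSEndpoint: "wss://your-remote-browser-endpoint", // placeholder: use your provider's endpoint
  });
  const page = await browser.newPage();
  await page.goto("https://quotes.toscrape.com/js-delayed/?delay=1000");
  // ...scraping logic as before...
  await browser.close();
})();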
That is all you need for efficient and effective Puppeteer web scraping!
Conclusion
In this article, you learned why Puppeteer is an excellent tool for web scraping, especially when dealing with dynamic sites. Thanks to the step-by-step guide provided, you now know how to use Puppeteer to scrape a JavaScript-dependent site. Additionally, you explored more complex Puppeteer web scraping scenarios.
No matter how well your script is configured, headless browsers are resource-intensive compared to traditional HTML parsers and can still be detected by anti-bot solutions. This results in performance issues and potential blocks. Forget about these problems with a browser built for web scraping, such as Scraping Browser.
Create a free Bright Data account today to test our proxies and scraping solutions!
No credit card required