A Guide to Playwright Web Scraping in 2025

Web scraping with Playwright: A step-by-step guide to extracting data using this powerful tool.

In this blog post, you will learn:

  • What Playwright is
  • The key features it offers for web scraping
  • How to use it to extract data from a dynamic site
  • Advanced Playwright web scraping techniques
  • How it compares to other alternatives
  • Its main limitations and how to overcome them

Let’s dive in!

What Is Playwright?

Playwright is a Node.js library for browser automation developed by Microsoft, designed to control Chromium-based browsers, Firefox, and WebKit. While primarily used for testing, it is also highly effective for scraping JavaScript-heavy sites, SPAs, and any page that renders content dynamically.

Playwright is officially available in JavaScript, Python, C#, and Java, with additional language support provided by the community. This makes its API not only cross-browser but also easily adaptable across programming languages.

Key features that make Playwright a modern browser automation tool for web scraping include headless mode, network interception, and browser context isolation.

Playwright Web Scraping Features

These are some of the most important Playwright features that support web scraping:

  • Playwright Library API: A powerful browser automation API to interact with web pages as in E2E tests.
  • SPAs and PWA-ready: Playwright provides all the necessary features to scrape data from modern JavaScript-based Single Page Applications (SPAs) and Progressive Web Apps (PWAs).
  • Multiple locator methods: Support for various selector syntaxes, including CSS selectors, XPath expressions, element text content, and more.
  • Auto-waiting mechanism: It automatically waits for elements to be visible, stable, and ready for interaction, reducing errors caused by partial data loads.
  • Screenshots: Capture full-page or element-specific screenshots to aid in debugging and visual data extraction.

Web Scraping With Playwright: Step-By-Step Guide

In this guided section, you will learn how to perform web scraping with Playwright. The target site will be the following dynamic page from Scraping Course:

The target page

By inspecting the “Network” tab in DevTools, you can see that this page loads data dynamically via an AJAX request:

The AJAX request made by the page to retrieve data dynamically

Then, the page renders that data in the browser using JavaScript. This makes it a perfect example of a dynamic site that requires a browser automation tool like Playwright for scraping. A simple HTTP client combined with an HTML parser would not be enough.

In detail, the Playwright web scraping script will:

  • Navigate to the target page
  • Wait for the products to load
  • Scrape the required data
  • Export the data to CSV

Time to build it!

Note: Playwright provides the same browser automation API in Python, C#, and Java. While this guide uses JavaScript—the primary language supported by Playwright along with TypeScript—you can easily adapt the script we are going to build to any of the other supported programming languages.

Step #1: Project Setup

First, make sure you have the latest LTS version of Node.js installed on your machine. Otherwise, download it from the official site and follow the wizard.

Create a folder for your JavaScript project and enter it in the terminal:

mkdir playwright-web-scraper
cd playwright-web-scraper

Launch the command below to initialize a Node.js project inside it:

npm init -y

Open the Playwright web scraping project with your favorite JavaScript IDE, and add index.js to it:

Playwright project structure

Wonderful! You are ready to get started with Playwright.

Step #2: Playwright Installation and Setup

Install Playwright by adding the playwright npm package to your project’s dependencies:

npm install playwright

This will take a while, so be patient.

Now, keep in mind that each version of Playwright needs specific versions of browser binaries to work. Install them with:

npx playwright install

The Playwright CLI will download Chromium, Firefox, and WebKit, along with the required system dependencies.
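If you only plan to use Chromium, as in this guide, you can limit the download to that browser:

npx playwright install chromium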

To get started with Playwright, import chromium in index.js:

const { chromium } = require("playwright");

Playwright is often used through Playwright Test, a fully managed end-to-end test runner with additional features. That is not necessary for web scraping.

Instead, using the Playwright library as above is sufficient. This offers unified APIs for launching and interacting with browsers—such as the chromium object in this case.

Then, create an async function in which to launch a browser instance:

const { chromium } = require("playwright");

(async () => {
  // launch a new Chromium browser instance
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // browser automation for web scraping...

  // close the browser and release resources
  await browser.close();
})();

launch() creates a new browser instance, running in headless mode by default. Then, the newPage() method opens a new page within that browser context. Finally, close() is used to properly shut down the browser and free up resources.

To run the browser in headed mode—useful for debugging and visually tracking the script’s actions—set the headless option to false:

const browser = await chromium.launch({
  headless: false,
});

Great! Your Playwright scraping setup is complete.

Step #3: Visit the Target Page

Use the goto() method to instruct the browser to connect to the target page:

await page.goto("https://www.scrapingcourse.com/javascript-rendering");

Configure the script to work in headed mode. Next, add a breakpoint before the close() method and run the script in the debugger. This is what you should see:

The target page loaded by Playwright

Amazing! The Playwright scraping navigation logic works like a charm.
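Tip: If you prefer not to rely on your IDE's debugger, Playwright's built-in Inspector offers a similar pause point. Calling page.pause() suspends the script and opens the Inspector, letting you examine the live page:

// suspend the script and open the Playwright Inspector (requires headed mode)
await page.pause();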

Step #4: Inspect the Target Page

Before writing the scraping logic, you need to familiarize yourself with the site’s structure and the DOM of the target page. To do so, open the website in incognito mode in your browser.

Since you want to extract data from each product on the page, inspect a product’s HTML element. Right-click on the first product and select the “Inspect” option. This will open the DevTools panel below:

The DevTools section on the product element

Take a close look at the HTML of the product element. You will notice that the elements carry data-testid and data-content attributes. These attributes are mostly used for testing and tend to remain stable over time, making them reliable selectors for web scraping.

On this page, notice that all products are contained within [data-testid="product-grid"]. In particular, each product is represented by a [data-content="product-item"] node. Each of these includes:

  • The product URL in [data-testid="product-link"].
  • The product image in [data-testid="product-image"].
  • The product name in [data-content="product-name"].
  • The product price in [data-testid="product-price"].

Perfect! You now have all the necessary information to do web scraping with Playwright.

Step #5: Implement the Scraping Logic

Since you want to scrape a list of products, first initialize an array to store the scraped data:

const products = [];

Remember, the products are loaded dynamically via AJAX. Initially, the HTML elements for all products are present on the page, but they do not contain any data:

The original HTML code of the product grid

To ensure the AJAX request has completed and the data has been rendered, you can wait for the href attribute of the <a> tag inside [data-content="product-item"] to be populated. Here is how you can achieve that:

await page.waitForSelector('[data-content="product-item"] a[href]:not([href=""])');

The waitForSelector() method waits until the specified selector matches an element in the page's DOM.

The page renders all products at once, so waiting for the first one to appear is enough. On more complex pages, you may need to customize this logic to ensure all products are fully loaded before proceeding.
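As an example of such a customization, here is a minimal sketch that waits until a minimum number of product links are populated. The threshold of 12 is a hypothetical value; adjust it to your target page:

// hypothetical example: wait until at least 12 product links are populated
await page.waitForFunction(
  (minCount) =>
    document.querySelectorAll(
      '[data-content="product-item"] a[href]:not([href=""])'
    ).length >= minCount,
  12
);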

Now, use the locator() method to select all product items on the page:

const productItemsLocator = page.locator('[data-content="product-item"]');

The locator() method applies the specified locator strategy and can reference one or more DOM elements. Playwright supports both CSS selectors and XPath expressions, and it auto-detects them if you omit the css= or xpath= prefix.

Note: Locators are more efficient and stable for interacting with the page compared to traditional element selector methods (e.g., $() and $$()), which are currently deprecated.
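As an illustration, the same product items could be selected with an equivalent XPath expression:

// equivalent XPath-based locator for the product items
// (the xpath= prefix is optional for expressions starting with //)
const productItemsXPathLocator = page.locator('xpath=//*[@data-content="product-item"]');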

Use the all() method to access the list of elements in the locator and iterate through them:

for (const productLocator of await productItemsLocator.all()) {
    // scraping logic...
}

Now that you have the current product element, you can extract the required data (URL, image, name, price) as below:

const urlElement = productLocator.locator(
  '[data-testid="product-link"]'
);
const url = await urlElement.getAttribute("href");

const imageElement = productLocator.locator(
  '[data-testid="product-image"]'
);
const image = await imageElement.getAttribute("src");

const nameElement = productLocator.locator(
  '[data-content="product-name"]'
);
const name = await nameElement.textContent();

const priceElement = productLocator.locator(
  '[data-testid="product-price"]'
);
const price = await priceElement.textContent();

As the names suggest:

  • textContent() retrieves the text inside an element;
  • getAttribute() accesses the data within an element’s attribute.

Once you have extracted the data, create an object for the product and add it to the products array:

const product = {
  url,
  image,
  name,
  price,
};
products.push(product);

Terrific! The data extraction logic is finalized.

Step #6: Export to CSV

Node.js can write files out of the box, but a specialized library like csv-writer makes producing CSV output much easier. To add it to your project’s dependencies, run the following command:

npm install csv-writer

Then, import it in your index.js file:

const createCsvWriter = require("csv-writer").createObjectCsvWriter;

Next, use it to populate a CSV file with the scraped data:

const csvWriter = createCsvWriter({
  path: "products.csv",
  header: [
    { id: "url", title: "URL" },
    { id: "image", title: "Image" },
    { id: "name", title: "Name" },
    { id: "price", title: "Price" },
  ],
});

// write the products data to the CSV
await csvWriter.writeRecords(products);

Fantastic! You started with raw data from a webpage, and now you have it neatly organized in a CSV file.

Step #7: Put It All Together

Your final Playwright web scraping script should contain:

const { chromium } = require("playwright");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;

(async () => {
  // launch a new Chromium browser instance
  const browser = await chromium.launch({
    headless: false, // comment out in production
  });
  const page = await browser.newPage();

  // connect to the target page
  await page.goto("https://www.scrapingcourse.com/javascript-rendering");

  // where to store the scraped data
  const products = [];

  // wait for the products to be rendered
  await page.waitForSelector(
    '[data-content="product-item"] a[href]:not([href=""])'
  );

  // get the list of product items
  const productItemsLocator = page.locator('[data-content="product-item"]');

  // iterate over all product elements
  for (const productLocator of await productItemsLocator.all()) {
    // scraping logic
    const urlElement = productLocator.locator(
      '[data-testid="product-link"]'
    );
    const url = await urlElement.getAttribute("href");

    const imageElement = productLocator.locator(
      '[data-testid="product-image"]'
    );
    const image = await imageElement.getAttribute("src");

    const nameElement = productLocator.locator(
      '[data-content="product-name"]'
    );
    const name = await nameElement.textContent();

    const priceElement = productLocator.locator(
      '[data-testid="product-price"]'
    );
    const price = await priceElement.textContent();

    // populate a new product object
    // and add it to the array
    const product = {
      url,
      image,
      name,
      price,
    };
    products.push(product);
  }

  // configurations for the CSV output file
  const csvWriter = createCsvWriter({
    path: "products.csv",
    header: [
      { id: "url", title: "URL" },
      { id: "image", title: "image" },
      { id: "name", title: "name" },
      { id: "price", title: "price" },
    ],
  });

  // write the products data to the CSV
  await csvWriter.writeRecords(products);

  // close the browser and release resources
  await browser.close();
})();

Way to go! In less than 100 lines of code, you just built a script to perform web scraping with Playwright.

Time to test it by launching this command:

node index.js

The script will take a while, then a products.csv file will appear in the project’s directory. Open it and you will see:

The output file with the scraped data

Et voilà! The output CSV file stores the same data as on the target page, but in a structured format.

Playwright Scraping: Advanced Features

You just learned the basics of web scraping using Playwright through a guided tutorial. Still, many more advanced scenarios are possible. Time to explore them!

User Interaction

Over the years, web pages have become more complex than ever. They now feature advanced navigation and interaction patterns such as dynamic dropdowns, “Load More” buttons, and infinite scrolling.

Find out more in our guide on how to scrape sites with complex navigation.

Playwright is designed to handle most of those scenarios through dedicated methods. For example, you can select a “Load More” button and click it with:

await page.locator('button:has-text("Load More")').click();

The click() method simulates a click on the specified element, triggering any associated events just as if a real user had clicked it.

Note: Various checks are performed before interacting with elements. Playwright automatically waits for these conditions to be met before executing an action. If the conditions are not satisfied within the specified timeout, the action fails with a TimeoutError. For example, when calling page.click(), Playwright ensures that:

  • The selector resolves to exactly one element
  • The element is visible
  • The element is stable (not animating or has completed its animation)
  • The element can receive events (not hidden behind other elements)
  • The element is enabled
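These checks must pass within the action’s timeout, which defaults to 30 seconds. You can tune it per action, as in this sketch with a hypothetical selector:

// wait up to 10 seconds for the actionability checks before clicking
await page.locator("#load-more-button").click({ timeout: 10000 });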

Next, you can fill out and submit a form with:

await page.fill('input[name="email"]', "user@example.com");
await page.fill('input[name="password"]', "mysecurepassword");
await page.click('button[type="submit"]');

The fill() method sets the value of an input field, while click() is used to submit the form by clicking the submit button.

Additionally, you can select an option from a dropdown using selectOption():

await page.selectOption("#country", "USA");

selectOption() selects a dropdown item by matching it with the provided value, label, or index. The effect on the page will be the same as a user manually choosing the specified option from the list.
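If you need to match the option by its visible label or position instead of its value, pass an object (reusing the hypothetical #country dropdown from above):

// select by visible label instead of value
await page.selectOption("#country", { label: "United States" });

// or select by zero-based index
await page.selectOption("#country", { index: 3 });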

JavaScript Execution

Some sites require complex interactions that cannot be handled with standard Playwright methods alone. In such cases, you can run custom JavaScript directly on the page to simulate user behavior.

For example, to scroll down the page, you can execute the following script:

window.scrollTo(0, document.body.scrollHeight);

Playwright enables you to run JavaScript code on the page via the evaluate() method:

await page.evaluate(() => {
  window.scrollTo(0, document.body.scrollHeight);
});

This injects and runs JavaScript inside the page, scrolling to the bottom just like a real user would. Custom scripts like this help you handle pages with particularly complex UI behaviors.
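For instance, a common pattern for infinite scrolling is to keep scrolling until the page height stops growing. Here is a minimal sketch, assuming new content loads within a fixed delay:

// scroll until no new content gets loaded
let previousHeight = 0;
while (true) {
  const currentHeight = await page.evaluate(() => document.body.scrollHeight);
  if (currentHeight === previousHeight) {
    break;
  }
  previousHeight = currentHeight;

  // scroll to the bottom and give new content time to load
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(1000); // assumed delay; tune it for your target page
}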

User-Agent Customization

User-Agent is one of the most important HTTP headers for web scraping. By default, Playwright uses the User-Agent string of the controlled browser, which typically looks like this:

"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/133.0.6943.16 Safari/537.36"

The problem is that “HeadlessChrome” in the string can flag your requests as automated because real users do not run a browser without a GUI.

To avoid detection, override the User-Agent like this:

const browser = await chromium.launch();

// set a custom user agent
const customUserAgent =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.6834.160 Safari/537.36";
const context = await browser.newContext({
  userAgent: customUserAgent,
});

// create a new page within the customized context
const page = await context.newPage();

Note that you must set the User-Agent header in a dedicated browser context and use that to open a new page.
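To verify that the override took effect, you can read the User-Agent back from the page:

// print the User-Agent string the page actually sees
const userAgent = await page.evaluate(() => navigator.userAgent);
console.log(userAgent);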

Request Interception

On most dynamic sites, data is retrieved through AJAX requests instead of being directly embedded in the initial HTML. Playwright allows you to intercept these requests and inspect their responses through the Network API.

For example, intercept a request to /ajax/products/json and print its data with the following code:

await page.route("**/ajax/products/json", async (route) => {
  // perform the request and access its response
  const response = await route.fetch();
  const jsonData = await response.json();

  console.log("intercepted data:", jsonData);

  // fulfill the route with the response fetched above,
  // so the request is not sent to the server a second time
  await route.fulfill({ response });
});

route() intercepts any request matching the given URL pattern and gives you access to its response.

Intercepted responses usually contain the data that is dynamically loaded onto the product page. By accessing this data directly from the site’s server, you can retrieve information without waiting for the page to fully render.

With this method, you can also modify the behavior of these requests or even block them entirely.
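For example, blocking resource types you do not need, such as images, is a common way to speed up scraping and save bandwidth:

// block all image requests to reduce bandwidth and speed up loading
await page.route("**/*.{png,jpg,jpeg,webp}", (route) => route.abort());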

Proxy Integration

To protect your IP and identity while using Playwright, you can integrate a proxy server into your script. This helps mask your real IP address and bypass rate limiters.

Use the following code to configure Playwright to route traffic through a proxy:

const browser = await chromium.launch({
  proxy: {
    server: "http://your-proxy-server.com:PORT", // proxy server URL
    username: "your-username", // optional proxy username credential
    password: "your-password", // optional proxy password credential
  },
});

For more guidance, refer to our Playwright proxy integration guide.

If you are seeking reliable proxies, keep in mind that Bright Data’s proxy network is trusted by Fortune 500 companies and over 20,000 customers worldwide.

Playwright Alternatives for Web Scraping

The two most popular Playwright alternatives when it comes to web scraping are:

  • Puppeteer: A JavaScript browser automation library maintained by Google. It is well-suited for tasks such as automated testing, scraping, and generating screenshots. Puppeteer provides a powerful API for controlling headless Chrome, with Firefox support added more recently.
  • Selenium: A widely used browser automation tool that supports multiple browsers, including Chrome, Firefox, Internet Explorer, and Safari. It is used for web scraping, automation, and testing, and has been around for many years. Although it offers broad language support, it is typically slower and less modern than Playwright and Puppeteer.

You can compare these browser automation solutions in the following table:

| Feature | Playwright | Puppeteer | Selenium |
| --- | --- | --- | --- |
| GitHub stars | 69k | 89.6k | 31.5k |
| Developed by | Microsoft | Google | Community and sponsors |
| Browser support | Chromium-based browsers, Firefox, WebKit | Chromium-based browsers, Firefox | Chromium-based browsers, Firefox, IE, WebKit |
| Automation API | Rich | Quite rich | Focused on the basics |
| Developer experience | Great | Good | Decent |
| Documentation | Excellent | Excellent | Decent |
| Programming languages | JavaScript, Python, C#, Java | JavaScript | Java, Python, C#, Ruby, JavaScript |
| Community | Medium | Large | Very large |
| Performance | Fast | Fast | Medium/slow |

For more information, refer to our dedicated comparison articles on these tools.

Biggest Playwright Limitations

Playwright is a feature-rich tool for web scraping, but it comes with a few key limitations:

  • Performance issues: Browsers, even in headless mode, consume significant system resources—like RAM and CPU. Running multiple instances can slow down even high-performance servers and significantly impact efficiency.
  • Instability: The connection with local browsers can become unstable over time, especially in long-running sessions or when managing multiple tabs.
  • Anti-bot systems: Automated browsers have subtle configuration differences compared to real user sessions. These can trigger anti-bot detection, leading to CAPTCHAs, IP bans, and other blocks. To reduce detection risks, explore our tutorial on the Playwright Stealth plugin (sketched briefly after this list) and learn how to bypass CAPTCHAs with Playwright. Yet, those tricks may not be enough.
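As a minimal sketch of the stealth approach, assuming the community playwright-extra and puppeteer-extra-plugin-stealth packages are installed:

const { chromium } = require("playwright-extra");
const stealthPlugin = require("puppeteer-extra-plugin-stealth")();

// register the stealth plugin to mask common automation fingerprints
chromium.use(stealthPlugin);

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  // ...your scraping logic as usual...
  await browser.close();
})();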

Switching between browser automation libraries like Selenium or Puppeteer will not resolve these challenges. The reason is that they stem from the browser itself, not the library. Browsers require high system resources, are prone to detection, and need special configurations for scraping at scale.

The real solution lies in using a cloud-based browser optimized for web scraping, such as Bright Data’s Scraping Browser, which integrates with Playwright. This scalable solution includes built-in anti-bot bypass mechanisms, automatic IP rotation, browser fingerprinting management, CAPTCHA solving, and automated retries, ensuring reliable and efficient data extraction.

That is everything you need for seamless and effective web scraping with Playwright!

Conclusion

In this article, you learned why Playwright is a powerful tool for web scraping, especially when handling dynamic sites. With the step-by-step guide provided here, you now understand how to perform web scraping against JavaScript-rendered content. You also explored more advanced Playwright web scraping scenarios.

However, no matter how well you configure your script, headless browsers consume significant system resources and can still be detected by anti-bot solutions. This can lead to performance issues and blocks. Avoid these issues with a dedicated browser optimized for web scraping, such as Scraping Browser.

Create a free Bright Data account today to explore our proxies and scraping solutions!

No credit card required