In this blog post, you will learn:
- What Playwright is
- The key features it offers for web scraping
- How to use it to extract data from a dynamic site
- Advanced Playwright web scraping techniques
- How it compares to other alternatives
- Its main limitations and how to overcome them
Let’s dive in!
What Is Playwright?
Playwright is a Node.js library for browser automation developed by Microsoft, designed to control Chromium-based browsers, Firefox, and WebKit. While primarily used for testing, it is also highly effective for scraping JavaScript-heavy sites, SPAs, and any page with dynamic content.
Playwright is officially available in JavaScript, Python, C#, and Java, with additional language support provided by the community. This makes its API not only cross-browser but also easily adaptable across programming languages.
Key features that make Playwright a modern browser automation tool for web scraping include headless mode, network interception, and browser context isolation.
Playwright Web Scraping Features
These are some of the most important Playwright features that support web scraping:
- Playwright Library API: A powerful browser automation API to interact with web pages as in E2E tests.
- SPAs and PWA-ready: Playwright provides all the necessary features to scrape data from modern JavaScript-based Single Page Applications (SPAs) and Progressive Web Apps (PWAs).
- Multiple locator methods: Support for various selector syntaxes, including CSS selectors, XPath expressions, element text content, and more.
- Auto-waiting mechanism: It automatically waits for elements to be visible, stable, and ready for interaction, reducing errors caused by partial data loads.
- Screenshots: Capture full-page or element-specific screenshots to aid in debugging and visual data extraction.
Web Scraping With Playwright: Step-By-Step Guide
In this guided section, you will learn how to perform web scraping with Playwright. The target site will be the following dynamic page from Scraping Course:
By inspecting the “Network” tab in DevTools, you can see that this page loads data dynamically via an AJAX request:
Then, the page renders that data in the browser using JavaScript. This makes it a perfect example of a dynamic site that requires a browser automation tool like Playwright for scraping. A simple HTTP client combined with an HTML parser would not be enough.
In detail, the Playwright web scraping script will:
- Navigate to the target page
- Wait for the products to load
- Scrape the required data
- Export the data to CSV
Time to build it!
Note: Playwright provides the same browser automation API in Python, C#, and Java. While this guide uses JavaScript—the primary language supported by Playwright along with TypeScript—you can easily adapt the script we are going to build to any of the other supported programming languages.
Step #1: Project Setup
First, make sure you have the latest LTS version of Node.js installed on your machine. Otherwise, download it from the official site and follow the wizard.
Create a folder for your JavaScript project and enter it in the terminal:
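For example, assuming you call the project folder `playwright-scraper` (the name is arbitrary):

```shell
# create the project folder and enter it
mkdir playwright-scraper
cd playwright-scraper
```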
Launch the command below to initialize a Node.js project inside it:
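The standard way to do that is with `npm init`, accepting the defaults:

```shell
# initialize a new Node.js project with default settings
npm init -y
```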
Open the Playwright web scraping project with your favorite JavaScript IDE, and add an `index.js` file to it:
Wonderful! You are ready to get started with Playwright.
Step #2: Playwright Installation and Setup
Install Playwright by adding the `playwright` npm package to your project’s dependencies:
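That is a single `npm install` command:

```shell
# add the playwright package to the project's dependencies
npm install playwright
```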
This will take a while, so be patient.
Now, keep in mind that each version of Playwright needs specific versions of browser binaries to work. Install them with:
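The Playwright CLI exposes a dedicated command for that:

```shell
# download the browser binaries required by the installed Playwright version
npx playwright install
```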
The Playwright CLI will download Chromium, Firefox, and other required system dependencies for you.
To get started with Playwright, import `chromium` in `index.js`:
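In CommonJS syntax, that looks like:

```javascript
// import the Chromium browser controller from Playwright
const { chromium } = require("playwright");
```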
Playwright is typically initialized in Test mode, which provides a fully managed end-to-end test runner along with additional features. That is not necessary for web scraping.
Instead, using the Playwright library as above is sufficient. This offers unified APIs for launching and interacting with browsers—such as the `chromium` object in this case.
Then, create an `async` function where you can create a browser instance:
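A minimal version of that function might look as follows:

```javascript
const { chromium } = require("playwright");

(async () => {
  // launch a new Chromium instance (headless by default)
  const browser = await chromium.launch();
  // open a new page within the browser context
  const page = await browser.newPage();

  // scraping logic...

  // close the browser and release its resources
  await browser.close();
})();
```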
`launch()` creates a new browser instance, running in headless mode by default. Then, the `newPage()` method opens a new page within that browser context. Finally, `close()` is used to properly shut down the browser and free up resources.
To run the browser in headed mode—useful for debugging and visually tracking the script’s actions—set the `headless` option to `false`:
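That only requires passing an options object to `launch()`:

```javascript
// launch the browser with a visible GUI window
const browser = await chromium.launch({
  headless: false,
});
```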
Great! Your Playwright scraping setup is complete.
Step #3: Visit the Target Page
Use the `goto()` method to instruct the browser to connect to the target page:
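Inside the `async` function from Step #2, that is a single call. Note that the URL below is an assumption based on the ScrapingCourse “JavaScript Rendering” demo page and the AJAX endpoint intercepted later in this guide:

```javascript
// navigate to the target dynamic page
await page.goto("https://www.scrapingcourse.com/javascript-rendering");
```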
Configure the script to work in headed mode. Next, add a breakpoint before the `close()` call and run the script in the debugger. This is what you should be seeing:
Amazing! The Playwright scraping navigation logic works like a charm.
Step #4: Inspect the Target Page
Before writing the scraping logic, you need to familiarize yourself with the site’s structure and the DOM of the target page. To do so, open the website in incognito mode in your browser.
Since you want to extract data from each product on the page, inspect a product’s HTML element. Right-click on the first product and select the “Inspect” option. This will open the DevTools panel below:
Take a close look at the HTML of the product element. You will notice that all elements have a data-testid attribute. These attributes are mostly used for testing and tend to remain stable over time, making them reliable selectors for web scraping.
On this page, notice that all products are contained within the `[data-testid="product-grid"]` element. In particular, each product is represented by a `[data-testid="product-item"]` node. Each of these includes:
- The product URL in `[data-testid="product-link"]`.
- The product image in `[data-testid="product-image"]`.
- The product name in `[data-testid="product-name"]`.
- The product price in `[data-testid="product-price"]`.
Perfect! You now have all the necessary information to do web scraping with Playwright.
Step #5: Implement the Scraping Logic
Since you want to scrape a list of products, first initialize an array where to store the scraped data:
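A plain JavaScript array is enough for that:

```javascript
// where the scraped product objects will be stored
const products = [];
```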
Remember, the products are loaded dynamically via AJAX. Initially, the HTML elements for all products are present on the page, but they do not contain any data:
To ensure the AJAX request has completed and the data has been rendered, you can wait for the `href` attribute of the `<a>` tag inside `[data-testid="product-item"]` to be populated. Here is how you can achieve that:
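One way to express that condition, assuming the `data-testid` attributes described in Step #4, is a CSS selector matching a product link with a non-empty `href`:

```javascript
// wait until the first product link gets a non-empty href,
// which signals that the AJAX data has been rendered
await page.waitForSelector(
  '[data-testid="product-item"] a[href]:not([href=""])'
);
```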
The `waitForSelector()` method waits for the specified selector to match an element in the page’s DOM.
The page renders all products at once, so waiting for the first one to appear is enough. On more complex pages, you may need to customize this logic to ensure all products are fully loaded before proceeding.
Now, use the `locator()` method to select all product items on the page:
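Assuming the product-item selector identified in Step #4, that looks like:

```javascript
// reference all product cards on the page
const productElements = page.locator('[data-testid="product-item"]');
```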
The `locator()` method applies the specified locator strategy and can reference one or more DOM elements. Playwright supports both CSS selectors and XPath expressions, and it auto-detects them if you omit the `css=` or `xpath=` prefix.
Note: Locators are more efficient and stable for interacting with the page compared to traditional element selector methods (e.g., `$()` and `$$()`), which are currently deprecated.
Use the `all()` method to access the list of elements in the locator and iterate through them:
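Since `all()` returns a promise of an element array, the loop looks like:

```javascript
// iterate over all product elements resolved by the locator
for (const productElement of await productElements.all()) {
  // data extraction logic on the single product...
}
```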
Now that you have the current product element, you can extract the required data (URL, image, name, price) as below:
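Inside the loop body, the extraction logic might look like this (the nested `data-testid` selectors are assumptions based on Step #4):

```javascript
// extract the product URL from the link element
const url = await productElement
  .locator('[data-testid="product-link"]')
  .getAttribute("href");
// extract the product image URL
const image = await productElement
  .locator('[data-testid="product-image"]')
  .getAttribute("src");
// extract the product name and price as text
const name = await productElement
  .locator('[data-testid="product-name"]')
  .textContent();
const price = await productElement
  .locator('[data-testid="product-price"]')
  .textContent();
```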
As the names suggest:
- `textContent()` retrieves the text inside an element.
- `getAttribute()` accesses the data within an element’s attribute.
Once you have extracted the data, create an object for the product and add it to the `products` array:
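For example, still inside the loop:

```javascript
// populate a new object with the scraped data
const product = {
  url,
  image,
  name: name.trim(), // strip leading/trailing whitespace
  price: price.trim(),
};
// add it to the list of scraped products
products.push(product);
```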
Terrific! The data extraction logic is finalized.
Step #6: Export to CSV
Node.js provides the tools to work with CSV files, but using a specialized library like `csv-writer` makes the process much easier. To add it to your project’s dependencies, run the following command:
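```shell
# add the csv-writer package to the project's dependencies
npm install csv-writer
```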
Then, import it in your `index.js` file:
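```javascript
// import the object-based CSV writer factory from csv-writer
const { createObjectCsvWriter } = require("csv-writer");
```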
Next, use it to populate a CSV file with the scraped data:
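Using `csv-writer`’s object API, the export step might look as follows (the output file name and column titles are a reasonable assumption):

```javascript
// configure the CSV output file and its header
const csvWriter = createObjectCsvWriter({
  path: "products.csv",
  header: [
    { id: "url", title: "URL" },
    { id: "image", title: "Image" },
    { id: "name", title: "Name" },
    { id: "price", title: "Price" },
  ],
});
// write all scraped products to the CSV file
await csvWriter.writeRecords(products);
```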
Fantastic! You started with raw data from a webpage, and now you have it neatly organized in a CSV file.
Step #7: Put It All Together
Your final Playwright web scraping script should contain:
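Assembled from the snippets above, the complete script might look like this. Keep in mind that the target URL and the `data-testid` selectors are assumptions based on the page structure described in Step #4:

```javascript
const { chromium } = require("playwright");
const { createObjectCsvWriter } = require("csv-writer");

(async () => {
  // launch a headless Chromium instance and open a new page
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // navigate to the target dynamic page
  await page.goto("https://www.scrapingcourse.com/javascript-rendering");

  // wait for the product links to be populated by the AJAX call
  await page.waitForSelector(
    '[data-testid="product-item"] a[href]:not([href=""])'
  );

  // where the scraped product objects will be stored
  const products = [];

  // select all product elements and iterate over them
  const productElements = page.locator('[data-testid="product-item"]');
  for (const productElement of await productElements.all()) {
    // data extraction logic
    const url = await productElement
      .locator('[data-testid="product-link"]')
      .getAttribute("href");
    const image = await productElement
      .locator('[data-testid="product-image"]')
      .getAttribute("src");
    const name = await productElement
      .locator('[data-testid="product-name"]')
      .textContent();
    const price = await productElement
      .locator('[data-testid="product-price"]')
      .textContent();

    // populate a new product object and add it to the list
    products.push({ url, image, name: name.trim(), price: price.trim() });
  }

  // export the scraped data to CSV
  const csvWriter = createObjectCsvWriter({
    path: "products.csv",
    header: [
      { id: "url", title: "URL" },
      { id: "image", title: "Image" },
      { id: "name", title: "Name" },
      { id: "price", title: "Price" },
    ],
  });
  await csvWriter.writeRecords(products);

  // close the browser and release its resources
  await browser.close();
})();
```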
Way to go! In less than 100 lines of code, you just built a script to perform web scraping with Playwright.
Time to test it by launching this command:
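```shell
node index.js
```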
The script will take a while, then a `products.csv` file will appear in the project’s directory. Open it and you will see:
Et voilà! The output CSV file stores the same data as on the target page, but in a structured format.
Playwright Scraping: Advanced Features
You just learned the basics of web scraping using Playwright through a guided tutorial. Still, many more advanced scenarios are possible. Time to explore them!
User Interaction
Over the years, web pages have become more complex than ever. They now feature advanced navigation and interaction patterns such as dynamic dropdowns, “Load More” buttons, and infinite scrolling.
Find out more in our guide on how to scrape sites with complex navigation.
Playwright is designed to handle most of those scenarios through dedicated methods. For example, you can select a “Load More” button and click it with:
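A sketch of that interaction, where the button selector is a hypothetical example:

```javascript
// select the "Load More" button and click it
await page.locator("button#load-more-btn").click();
```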
The `click()` method simulates a click on the specified element, triggering any associated events just as if a real user had clicked it.
Note: Various checks are performed before interacting with elements. Playwright automatically waits for these conditions to be met before executing an action. If the conditions are not satisfied within the specified `timeout`, the action fails with a `TimeoutError`. For example, when calling `page.click()`, Playwright ensures that:
- The selector resolves to exactly one element
- The element is visible
- The element is stable (not animating or has completed its animation)
- The element can receive events (not hidden behind other elements)
- The element is enabled
Next, you can fill out and submit a form with:
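For instance, a login form could be handled as below. All selectors and values are hypothetical examples:

```javascript
// fill out the input fields of the form
await page.locator("#username").fill("johndoe");
await page.locator("#password").fill("s3cret-p4ssw0rd");
// submit the form by clicking the submit button
await page.locator('button[type="submit"]').click();
```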
The `fill()` method sets the value of an input field, while `click()` is used to submit the form by clicking the submit button.
Additionally, you can select an option from a dropdown using `selectOption()`:
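For example, where the `select` element and the option value are hypothetical:

```javascript
// choose the option with value "usd" from the currency dropdown
await page.locator("select#currency").selectOption("usd");
```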
`selectOption()` selects a dropdown item by matching it with the provided value, label, or index. The effect on the page will be the same as a user manually choosing the specified option from the list.
JavaScript Script Execution
Some sites require complex interactions that cannot be handled with standard Playwright methods alone. In such cases, you can run custom JavaScript directly on the page to simulate user behavior.
For example, Playwright enables you to launch JavaScript scripts on the page via the `evaluate()` method. You can use it to scroll down to the bottom of the page as follows:
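```javascript
// run custom JavaScript in the page context
// to scroll to the bottom of the page
await page.evaluate(() => {
  window.scrollTo(0, document.body.scrollHeight);
});
```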
This injects and runs JavaScript inside the page, scrolling to the bottom just like a real user would. Custom scripts like that help you interact with pages that require particularly complex UI behaviors.
User-Agent Customization
`User-Agent` is one of the most important HTTP headers for web scraping. By default, Playwright uses the `User-Agent` string of the controlled browser, which typically looks like this:
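The exact version numbers depend on the bundled browser, but the string below is an illustrative example:

```
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/124.0.0.0 Safari/537.36
```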
The problem is that “HeadlessChrome” in the string can flag your requests as automated because real users do not run a browser without a GUI.
To avoid detection, override the `User-Agent` like this:
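A minimal sketch, where the custom `User-Agent` string is an illustrative example of a realistic desktop browser:

```javascript
// create a browser context with a custom User-Agent
const context = await browser.newContext({
  userAgent:
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
});
// open a new page within that context
const page = await context.newPage();
```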
Note that you must set the `User-Agent` header in a dedicated browser context and use that context to open a new page.
Request Interception
On most dynamic sites, data is retrieved through AJAX requests instead of being directly embedded in the initial HTML. Playwright allows you to intercept these requests and inspect their responses through the Network API.
For example, intercept a request to `/ajax/products/json` and print its data with the following code:
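One possible sketch, using Playwright’s `route()` API. The route must be registered before navigating to the page:

```javascript
// intercept requests whose URL matches the AJAX endpoint
await page.route("**/ajax/products/json*", async (route) => {
  // perform the original request and wait for its response
  const response = await route.fetch();
  // print the JSON payload returned by the server
  console.log(await response.json());
  // forward the response to the page unchanged
  await route.fulfill({ response });
});
```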
`route()` intercepts the AJAX request and enables you to access its response.
Intercepted responses usually contain the data that is dynamically loaded onto the product page. By accessing this data directly from the site’s server, you can retrieve information without waiting for the page to fully render.
With this method, you can also modify the behavior of these requests or even block them entirely.
Proxy Integration
To protect your IP and identity while using Playwright, you can integrate a proxy server into your script. This helps mask your real IP address and bypass rate limiters.
Use the following code to configure Playwright to route traffic through a proxy:
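For example, via the `proxy` option of `launch()`. The proxy URL and credentials below are placeholders:

```javascript
// route all browser traffic through the specified proxy
const browser = await chromium.launch({
  proxy: {
    server: "http://proxy.example.com:8080",
    username: "your-username", // optional credentials
    password: "your-password",
  },
});
```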
For more guidance, refer to our Playwright proxy integration guide.
If you are seeking reliable proxies, keep in mind that Bright Data’s proxy network is trusted by Fortune 500 companies and over 20,000 customers worldwide. This extensive network includes:
- Datacenter proxies: Over 770,000 datacenter IPs.
- Residential proxies: Over 72M residential IPs in more than 195 countries.
- ISP proxies: Over 700,000 ISP IPs.
- Mobile proxies: Over 7M mobile IPs.
Playwright Alternatives for Web Scraping
The two most popular Playwright alternatives when it comes to web scraping are:
- Puppeteer: A JavaScript browser automation library maintained by Google. It is well-suited for tasks such as automated testing, scraping, and generating screenshots. Puppeteer provides a powerful API for controlling headless Chrome, with Firefox support that has been added recently.
- Selenium: A widely-used browser automation tool that supports multiple browsers, including Chrome, Firefox, Internet Explorer, and Safari. It is used for web scraping, automation, and testing, and has been around for many years. Although it offers broad language support, it is typically slower and less modern compared to Playwright and Puppeteer.
You can compare these browser automation solutions in the following table:
| Feature | Playwright | Puppeteer | Selenium |
| --- | --- | --- | --- |
| GitHub Stars | 69k stars | 89.6k stars | 31.5k stars |
| Developed By | Microsoft | Google | Community and Sponsors |
| Browser Support | Chromium-based browsers, Firefox, WebKit | Chromium-based browsers, Firefox | Chromium-based browsers, Firefox, IE, WebKit |
| Automation API | Rich | Quite rich | Focused on the basics |
| Developer Experience | Great | Good | Decent |
| Documentation | Excellent | Excellent | Decent |
| Programming Languages | JavaScript, Python, C#, Java | JavaScript | Java, Python, C#, Ruby, JavaScript |
| Community | Medium | Large | Very large |
| Performance | Fast | Fast | Medium/Slow |
For more information, refer to these comparison articles:
Biggest Playwright Limitations
Playwright is a feature-rich tool for web scraping, but it comes with a few key limitations:
- Performance issues: Browsers, even in headless mode, consume significant system resources—like RAM and CPU. Running multiple instances can slow down even high-performance servers and significantly impact efficiency.
- Instability: The connection with local browsers can become unstable over time, especially in long-running sessions or when managing multiple tabs.
- Anti-Bot systems: Automated browsers have subtle configuration differences compared to real user sessions. These can trigger anti-bot detection, leading to CAPTCHAs, IP bans, and other blocks. To reduce detection risks, explore our tutorial on the Playwright Stealth plugin and learn how to bypass CAPTCHAs with Playwright. Yet, those tricks may not be enough.
Switching between browser automation libraries like Selenium or Puppeteer will not resolve these challenges. The reason is that they stem from the browser itself, not the library. Browsers require high system resources, are prone to detection, and need special configurations for scraping at scale.
The real solution lies in using a cloud-based browser optimized for web scraping, such as Bright Data’s Scraping Browser. This scalable solution includes built-in anti-bot bypass mechanisms, automatic IP rotation, browser fingerprinting management, CAPTCHA solving, and automated retries—ensuring reliable and efficient data extraction.
That is everything you need for seamless and effective web scraping with Playwright!
Conclusion
In this article, you learned why Playwright is a powerful tool for web scraping, especially when handling dynamic sites. With the step-by-step guide provided here, you now understand how to perform web scraping against JavaScript-rendered content. You also explored more advanced Playwright web scraping scenarios.
However, no matter how well you configure your script, headless browsers consume significant system resources and can still be detected by anti-bot solutions. This can lead to performance issues and blocks. Avoid these issues with a dedicated browser optimized for web scraping, such as Scraping Browser.
Create a free Bright Data account today to explore our proxies and scraping solutions!
No credit card required