GitHub Datasets

GitHub datasets provide a dynamic source of data that fuels innovation, enabling businesses and researchers to extract valuable insights.

Get dataset
  • Hundreds of thousands of records available
  • Tap into all major data points on GitHub
  • Free GitHub data samples for download

GitHub dataset sample

The GitHub repository dataset provides essential insights into the world of open-source software. With comprehensive information on coding languages, repository sizes, and user contributions, this dataset allows users to delve into the intricacies of software development.

Popular GitHub datasets

GitHub repository

The GitHub repository dataset includes major data points such as URL, ID, code language, number of lines, user name, user URL, size, size unit, number of issues, and more.

GitHub repository Rust code

The GitHub repository Rust code subset includes major data points such as URL, ID, code language, number of lines, user name, user URL, size, size unit, number of issues, and more.

GitHub repository 100+ lines

The GitHub repository 100+ lines subset includes major data points such as URL, ID, code language, number of lines, user name, user URL, size, size unit, number of issues, and more.
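As a rough illustration of how the two subsets relate to the full repository dataset, each one can be thought of as a filter over the same records. The sketch below is not the provider's schema: the class RepoRecord, its field names, and the reading of "100+ lines" as "at least 100 lines" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RepoRecord:
    # Hypothetical field names modeled on the data points listed above;
    # the real dataset keys may differ.
    url: str
    repo_id: str
    code_language: str
    number_of_lines: int


def rust_subset(records: Iterable[RepoRecord]) -> list[RepoRecord]:
    """Keep records whose code language is Rust (case-insensitive match)."""
    return [r for r in records if r.code_language.lower() == "rust"]


def min_lines_subset(
    records: Iterable[RepoRecord], min_lines: int = 100
) -> list[RepoRecord]:
    """Keep records with at least `min_lines` lines (assumed meaning of "100+")."""
    return [r for r in records if r.number_of_lines >= min_lines]


# Made-up sample records, for illustration only.
sample = [
    RepoRecord("https://example.com/a", "1", "Rust", 250),
    RepoRecord("https://example.com/b", "2", "Python", 80),
    RepoRecord("https://example.com/c", "3", "Rust", 40),
]
```

Applying rust_subset(sample) keeps the first and third records, while min_lines_subset(sample) keeps only the first.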

Datasets Pricing

Refresh rate
Dataset sizes: 200K, 500K, 1M, 5M, 20M
Complete dataset: 3TB
  • Clean and validated
  • Refreshed monthly
  • JSON/CSV/Parquet

GitHub datasets tailored to your needs

Get easy-to-use, well-structured datasets for any use case

Data subscription

Subscribe to access datasets at a significantly reduced cost.

File output formats

JSON, NDJSON, JSON Lines, CSV, Parquet. Optional .gz compression.
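For readers who have not worked with the line-delimited formats, here is a minimal sketch of reading a JSON Lines (NDJSON) file, with or without .gz compression, using only the Python standard library. The function name read_jsonl and the convention of detecting compression from the file extension are assumptions for this example, not part of any delivery specification.

```python
import gzip
import json
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSON Lines file.

    Assumption: compressed files are recognized by a ".gz" suffix.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

Because the function is a generator, records are parsed one at a time, which keeps memory use flat even for large files. A call such as read_jsonl("github_sample.jsonl.gz") would iterate over a delivered file; that file name is hypothetical.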

Flexible delivery

Snowflake, Amazon S3 bucket, Google Cloud, Azure, and SFTP.

Scalable data

Scale without worrying about infrastructure, proxy servers, or blocks.

Cost savings

Customize any dataset using filters and formatting options.

Code maintenance

Datasets are updated to keep pace with changes in website structure.

Simplified integrations

Benefit from integrations with Snowflake and AWS.

24/7 support

A dedicated team of data professionals is here to help.

Leaders in compliance

Data is ethically obtained and compliant with all privacy laws.

Get structured and reliable GitHub data

We’ll provide the data while you focus on the rest

High-volume web data

With our unblocking capabilities and round-the-clock IP rotation, we ensure access to all data points on a website.

Data for immediate use

Every stage of the data collection process is validated, so the data is ready to use on delivery.

Automated data flow

Create custom schedules to automate data delivery and watch the data flow seamlessly into your storage.

How companies use GitHub datasets

Developer activity

Use GitHub datasets to track the progress and health of open-source projects. Data points such as commit histories, pull requests, and issue discussions provide insight into project momentum and developer engagement. Businesses can use the data to identify potential collaborations or keep up with technological trends.

Community involvement

Assess the popularity and community support of open-source projects by analyzing GitHub datasets that include star and fork counts. These metrics help businesses gauge the interest and potential reliability of projects, informing decisions on which technologies to adopt or contribute to.
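A simple way to act on these metrics is to rank repositories by star count and use fork count as a tie-breaker. The sketch below assumes records are plain dictionaries with "stars" and "forks" keys; those key names, the function rank_by_popularity, and the sample values are hypothetical and chosen only for illustration.

```python
from typing import Iterable


def rank_by_popularity(repos: Iterable[dict], top_n: int = 3) -> list[dict]:
    """Sort repositories by stars (descending), breaking ties by forks."""
    return sorted(
        repos,
        # Missing counts are treated as zero rather than raising an error.
        key=lambda r: (r.get("stars", 0), r.get("forks", 0)),
        reverse=True,
    )[:top_n]


# Made-up records, for illustration only.
sample_repos = [
    {"url": "https://example.com/a", "stars": 120, "forks": 10},
    {"url": "https://example.com/b", "stars": 300, "forks": 5},
    {"url": "https://example.com/c", "stars": 120, "forks": 40},
]

top_two = rank_by_popularity(sample_repos, top_n=2)
```

Raw star and fork counts are only a first-pass signal; weighting them by repository age or recent activity is a natural refinement once those fields are available.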

Improve engagement

Leverage publicly accessible GitHub user profile data to cultivate advocacy and engagement within the open-source community. By identifying and connecting with users who actively star and contribute to repositories in your domain, you can build a network of advocates who can amplify your projects and drive collaborative development.

GitHub Dataset FAQs

The GitHub dataset includes a range of data points, among them URL, ID, code, code language, number of lines, user name, user URL, size, size unit, size number, number of projects, number of forks, number of stars, and more.

You can get updates to your GitHub dataset on a daily, weekly, monthly, or custom basis.

You can purchase a GitHub subset that includes only the data points you need. Purchasing a subset reduces the cost substantially.

Dataset formats are JSON, NDJSON, JSON Lines, CSV, or Parquet. Optionally, files can be compressed to .gz.

If you don’t want to purchase a dataset, you can start scraping GitHub data using our GitHub Scraper API.

You can request sample data to evaluate the quality and relevance of the information before committing to a full dataset.

You can request specific data points from the GitHub dataset, so you receive only the information your projects require.

The GitHub dataset supports API integration, so you can feed the data into your CRM, analytics tools, or other systems you use.

Get your GitHub dataset today.