Extract GitHub Repository Data Without Hitting Rate Limits

GitHub's REST API is powerful but has aggressive rate limits: 60 requests per hour without a token, 5,000 with one. If you're doing any serious data extraction — searching repos, pulling contributor lists, exporting stargazers — you'll hit those limits fast.

I built a scraper that handles this properly. Here's what I learned and how you can extract GitHub data at scale without getting blocked.

The Rate Limit Problem

GitHub's API returns a 403 Forbidden or 429 Too Many Requests once you exceed your quota, with X-RateLimit-Remaining set to 0. The response headers tell you exactly when your limit resets:

X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1719324000
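
To make those headers concrete, here is a minimal sketch that turns them into usable values. The function name parse_rate_limit is hypothetical, and it assumes a plain dict of headers with exactly this capitalization (httpx header objects are case-insensitive, a plain dict is not).

```python
from datetime import datetime, timezone
from typing import Mapping


def parse_rate_limit(headers: Mapping[str, str]) -> dict:
    """Pull quota details out of GitHub's X-RateLimit-* response headers."""
    limit = int(headers.get("X-RateLimit-Limit", 0))
    remaining = int(headers.get("X-RateLimit-Remaining", 0))
    reset_epoch = int(headers.get("X-RateLimit-Reset", 0))
    return {
        "limit": limit,
        "remaining": remaining,
        # The reset value is a Unix timestamp (seconds since the epoch, UTC).
        "resets_at": datetime.fromtimestamp(reset_epoch, tz=timezone.utc),
    }


info = parse_rate_limit({
    "X-RateLimit-Limit": "60",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1719324000",
})
print(info["remaining"], info["resets_at"].isoformat())
```

The reset value is an absolute time, not a duration, so the wait is always the difference between it and the current clock.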

Without a personal access token, 60 requests per hour is nothing. A single search query, plus paginating through the results and fetching user profiles, can burn through that in minutes.

With a token (free to generate at github.com/settings/tokens), you get 5,000/hr — much more workable, but still requires careful request management.

How the Scraper Handles It

The GitHub Scraper I built handles rate limits with:

Retry with backoff: when GitHub rejects a request for exceeding the quota, it reads the X-RateLimit-Reset header and waits until the limit resets
Request budgeting: tracks remaining requests and slows down before hitting zero
Pagination via Link headers: GitHub returns the next page's URL in a Link: <url>; rel="next" header, so the scraper follows that URL instead of building page numbers by hand
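
The Link-header step can be sketched with a small parser. This is an illustration rather than the scraper's actual code: next_page_url is a hypothetical helper, and the regex assumes the simple <url>; rel="name" form shown above.

```python
import re
from typing import Optional

# Matches one entry of a Link header: <url>; rel="name"
_LINK_PART = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the URL tagged rel="next", or None on the last page."""
    if not link_header:
        return None
    for url, rel in _LINK_PART.findall(link_header):
        if rel == "next":
            return url
    return None


header = (
    '<https://api.github.com/repos/fastapi/fastapi/stargazers?page=2>; rel="next", '
    '<https://api.github.com/repos/fastapi/fastapi/stargazers?page=50>; rel="last"'
)
print(next_page_url(header))
```

A fetch loop keeps requesting the returned URL until the helper yields None. With httpx, the same information is also available pre-parsed as resp.links.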

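# Simplified version of the retry logic
import asyncio
import time

import httpx

client = httpx.AsyncClient()

async def request_with_retry(url, headers):
    resp = await client.get(url, headers=headers)
    # An exhausted quota shows up as 429, or as 403 with X-RateLimit-Remaining: 0
    exhausted = resp.headers.get("X-RateLimit-Remaining") == "0"
    if resp.status_code == 429 or (resp.status_code == 403 and exhausted):
        reset_time = int(resp.headers.get("X-RateLimit-Reset", 0))
        wait = max(reset_time - time.time(), 1)  # seconds until reset, at least 1
        await asyncio.sleep(wait)
        return await client.get(url, headers=headers)  # retry once
    return resp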
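
Request budgeting can be sketched along the same lines. The policy below (spread the remaining requests evenly over the time left in the window) is an assumption chosen for illustration, not necessarily the one the scraper uses, and pacing_delay is a hypothetical name.

```python
import time
from typing import Optional


def pacing_delay(remaining: int, reset_epoch: float, now: Optional[float] = None) -> float:
    """Seconds to pause before the next request so the quota lasts until reset."""
    now = time.time() if now is None else now
    seconds_left = max(reset_epoch - now, 0.0)
    if remaining <= 0:
        # Nothing left: wait out the window (at least one second).
        return max(seconds_left, 1.0)
    # Spread the remaining requests evenly across the rest of the window.
    return seconds_left / remaining


# 10 requests left and 100 seconds until reset: pause 10 seconds between calls.
print(pacing_delay(remaining=10, reset_epoch=1100.0, now=1000.0))
```

Feeding it the X-RateLimit-Remaining and X-RateLimit-Reset values from each response keeps the delay near zero while the quota is healthy and stretches it as the quota runs down.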

What You Can Extract

The scraper supports 6 modes. Four of them are shown here:

Search repos — Find repos by keyword, language, and star count:

{"scrapeType":"search_repos","searchQuery":"web scraping language:python stars:>100","maxItems":50}

Scrape profiles — Get user details including email, company, location:

{"scrapeType":"profiles","usernames":["torvalds","gvanrossum"],"enrichProfiles":true}

Export contributors — Who's building a project:

{"scrapeType":"contributors","repos":["scrapy/scrapy","microsoft/playwright"]}

Stargazer export — Everyone who starred a repo (great for developer lead gen):

{"scrapeType":"stargazers","repos":["fastapi/fastapi"]}
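
For orientation, here is how these four modes could map onto GitHub's public REST endpoints. The endpoints themselves are real; the mapping is an assumption about what each mode calls, and endpoint_for is a hypothetical helper, not part of the scraper.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def endpoint_for(scrape_type: str, target: str, per_page: int = 100) -> str:
    """Return the first-page REST URL for one target of the given mode."""
    if scrape_type == "search_repos":
        # target is the raw search string, e.g. "language:python stars:>100"
        return f"{API_ROOT}/search/repositories?" + urlencode({"q": target, "per_page": per_page})
    if scrape_type == "profiles":
        # target is a username; this endpoint returns a single object, no paging
        return f"{API_ROOT}/users/{target}"
    page = urlencode({"per_page": per_page})
    if scrape_type == "contributors":
        # target is "owner/repo"
        return f"{API_ROOT}/repos/{target}/contributors?{page}"
    if scrape_type == "stargazers":
        return f"{API_ROOT}/repos/{target}/stargazers?{page}"
    raise ValueError(f"unsupported scrapeType: {scrape_type}")


print(endpoint_for("search_repos", "web scraping language:python stars:>100"))
print(endpoint_for("contributors", "scrapy/scrapy"))
```

Asking for 100 items per page, the maximum GitHub allows, keeps the number of requests per export as low as possible.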

Example: Finding Top Python Scraping Libraries

Here's a real search result:

{"full_name":"nicoleahmed/Scrapling","stars":66129,"forks":2840,"language":"Python","topics":["scraping","web-scraping","python"],"license":"BSD-3-Clause","description":"Undetected, lightweight, and adaptive web scraping...","url":"https://github.com/nicoleahmed/Scrapling"}

You get stars, forks, language, topics, license, and description — all the metadata you need for research or comparison.
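
As a small illustration of the comparison use case, the sketch below filters result records by star count and license. It assumes the field names from the sample output above; filter_repos is a hypothetical helper.

```python
from typing import Iterable, List, Mapping, Set


def filter_repos(records: Iterable[Mapping], min_stars: int, licenses: Set[str]) -> List[Mapping]:
    """Keep repos with at least min_stars whose license is in the allowed set."""
    return [
        r for r in records
        if r.get("stars", 0) >= min_stars and r.get("license") in licenses
    ]


# The record from the sample output above, trimmed to the fields used here.
results = [
    {"full_name": "nicoleahmed/Scrapling", "stars": 66129, "forks": 2840,
     "language": "Python", "license": "BSD-3-Clause"},
]
permissive = filter_repos(results, min_stars=1000, licenses={"MIT", "BSD-3-Clause", "Apache-2.0"})
print([r["full_name"] for r in permissive])
```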

Profile Enrichment

The enrichProfiles option fetches full user details for every username. This is especially useful for contributor and stargazer exports:

{"username":"torvalds","name":"Linus Torvalds","email":"torvalds@linux-foundation.org","company":"Linux Foundation","location":"Portland, OR","followers":228000,"public_repos":16,"bio":""}

Note: email and company are only available if the user has made them public on their profile.
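
In practice that means downstream code should expect gaps. The sketch below assumes non-public fields arrive as null or an empty string (GitHub's own API returns null for a hidden email); the second record is a made-up placeholder, and contactable is a hypothetical helper.

```python
from typing import Iterable, List, Mapping


def contactable(profiles: Iterable[Mapping]) -> List[Mapping]:
    """Keep only profiles whose email field is present and non-empty."""
    return [p for p in profiles if p.get("email")]


profiles = [
    {"username": "torvalds", "email": "torvalds@linux-foundation.org", "company": "Linux Foundation"},
    # Hypothetical record for a user who keeps these fields private.
    {"username": "example-user", "email": None, "company": None},
]
print([p["username"] for p in contactable(profiles)])
```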

Token or No Token?

The scraper works both ways:

Without token: 60 req/hr, good for small extractions (< 50 items)
With token: 5,000 req/hr, needed for bulk exports. The token input field is marked as secret so it's never logged or exposed

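Generate a token at github.com/settings/tokens. Reading public data does not require any scopes: a classic token with every scope box left unchecked is enough, and public_repo (which also grants write access to public repos) is not needed.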
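
If you call GitHub directly, the token goes into the Authorization header. A minimal sketch, assuming the token lives in an environment variable named GITHUB_TOKEN (my choice, not something the scraper requires) and using a placeholder User-Agent string:

```python
import os
from typing import Dict, Optional


def github_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Build request headers; the token is optional and only raises the quota."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
        # Placeholder value; GitHub rejects requests that send no User-Agent.
        "User-Agent": "github-scraper-example",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


headers = github_headers(os.environ.get("GITHUB_TOKEN"))
print("authenticated" if "Authorization" in headers else "anonymous (60 req/hr)")
```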

Try It

Run it on Apify: github-scraper

The default input searches for "web scraping language:python stars:>100" and returns the top results — you can see the output format immediately.

Or call it via API:

curl -X POST "https://api.apify.com/v2/acts/ambitious_door~github-scraper/runs" \
  -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"scrapeType": "search_repos", "searchQuery": "machine learning stars:>1000", "maxItems": 20}'
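
The same call in Python, as a sketch that uses only the standard library so it runs without extra installs. It mirrors the curl command above one to one; build_run_request is a hypothetical helper, and the network call is left commented out so nothing is sent until you add a real token.

```python
import json
import urllib.request

RUNS_URL = "https://api.apify.com/v2/acts/ambitious_door~github-scraper/runs"


def build_run_request(apify_token: str, actor_input: dict) -> urllib.request.Request:
    """Prepare the POST that starts an actor run with the given input."""
    return urllib.request.Request(
        RUNS_URL,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {apify_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_run_request(
    "YOUR_APIFY_TOKEN",
    {"scrapeType": "search_repos", "searchQuery": "machine learning stars:>1000", "maxItems": 20},
)
# Uncomment to actually start the run:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```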

Built with Python and httpx. Source on GitHub.