GitHub's REST API is powerful but has aggressive rate limits: 60 requests per hour without a token, 5,000 with one. If you're doing any serious data extraction — searching repos, pulling contributor lists, exporting stargazers — you'll hit those limits fast.
I built a scraper that handles this properly. Here's what I learned and how you can extract GitHub data at scale without getting blocked.
The Rate Limit Problem
GitHub's API returns a 403 Forbidden or 429 Too Many Requests once you exceed your quota. The response headers tell you exactly when your limit resets (X-RateLimit-Reset is a Unix timestamp in seconds):
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1719324000
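To make the headers concrete, here is a minimal sketch of turning them into a wait time. The function name `seconds_until_reset` and the `now` parameter are illustrative assumptions, not part of the scraper, and a plain dict stands in for the case-insensitive header mapping an HTTP client would give you.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to wait before the quota resets, or 0.0 if requests remain."""
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = int(headers.get("X-RateLimit-Reset", 0))  # Unix epoch seconds
    return float(max(reset_at - now, 0))


example = {
    "X-RateLimit-Limit": "60",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1719324000",
}
# Pretend the current time is 10 minutes before the reset.
print(seconds_until_reset(example, now=1719323400))  # 600.0
```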
Without a personal access token, 60 requests per hour is nothing. A single search query + paginating through results + fetching user profiles can burn through that in minutes.
With a token (free to generate at github.com/settings/tokens), you get 5,000/hr — much more workable, but still requires careful request management.
How the Scraper Handles It
The GitHub Scraper I built handles rate limits with:
- Retry with backoff: when it hits a rate-limit response, it reads the X-RateLimit-Reset header and waits until the limit resets
- Request budgeting: tracks remaining requests and slows down before hitting zero
- Pagination via Link headers: GitHub returns Link: <url>; rel="next" headers, so the scraper follows those instead of building page numbers itself
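# Simplified version of the retry logic
import asyncio
import time

async def request_with_retry(client, url, headers):
    # client is assumed to be an async HTTP client such as httpx.AsyncClient
    resp = await client.get(url, headers=headers)
    # An exhausted quota shows up as 403 or 429 with X-RateLimit-Remaining: 0
    if resp.status_code in (403, 429) and resp.headers.get("X-RateLimit-Remaining") == "0":
        reset_time = int(resp.headers.get("X-RateLimit-Reset", 0))
        # Sleep until the reset timestamp, but always at least 1 second
        wait = max(reset_time - time.time(), 1)
        await asyncio.sleep(wait)
        return await client.get(url, headers=headers)
    return resp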
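The other two items in the list, request budgeting and Link-header pagination, can be sketched as two small pure functions. This is one possible approach, not necessarily how the scraper implements them: `budget_delay`, `next_page_url`, and the `reserve` safety margin are hypothetical names, and the regex only covers the simple `<url>; rel="..."` form that GitHub sends.

```python
import re


def budget_delay(remaining, reset_at, now, reserve=5):
    """Spread the remaining requests evenly over the time left in the window.

    `reserve` is a hypothetical safety margin of requests that are never spent.
    """
    time_left = max(reset_at - now, 0)
    usable = remaining - reserve
    if usable <= 0:
        return float(time_left)  # budget exhausted: wait for the reset
    return time_left / usable


_LINK_RE = re.compile(r'<([^>]+)>;\s*rel="([^"]+)"')


def next_page_url(link_header):
    """Return the rel="next" URL from a Link header, or None on the last page."""
    if not link_header:
        return None
    for url, rel in _LINK_RE.findall(link_header):
        if rel == "next":
            return url
    return None


link = ('<https://api.github.com/repositories/1/stargazers?page=2>; rel="next", '
        '<https://api.github.com/repositories/1/stargazers?page=9>; rel="last"')
print(next_page_url(link))
print(budget_delay(remaining=105, reset_at=1000, now=900))  # 1.0 second between requests
```

A caller would sleep for `budget_delay(...)` seconds between requests and keep fetching while `next_page_url(...)` returns a URL.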
What You Can Extract
The scraper supports 6 modes. Four of them are shown here:
Search repos — Find repos by keyword, language, and star count:
{"scrapeType":"search_repos","searchQuery":"web scraping language:python stars:>100","maxItems":50}
Scrape profiles — Get user details including email, company, location:
{"scrapeType":"profiles","usernames":["torvalds","gvanrossum"],"enrichProfiles":true}
Export contributors — Who's building a project:
{"scrapeType":"contributors","repos":["scrapy/scrapy","microsoft/playwright"]}
Stargazer export — Everyone who starred a repo (great for developer lead gen):
{"scrapeType":"stargazers","repos":["fastapi/fastapi"]}
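For orientation, a search_repos input like the one above could map onto GitHub's repository search endpoint roughly as follows. This is a sketch under assumptions: `build_search_url` is a hypothetical helper, and the scraper's real request construction may differ. The endpoint and the `q`, `per_page`, and `page` parameters are GitHub's own.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_search_url(search_query, max_items, page=1):
    """Build a GitHub repository search URL for one page of results."""
    per_page = min(max_items, 100)  # GitHub caps per_page at 100
    params = {"q": search_query, "per_page": per_page, "page": page}
    return "https://api.github.com/search/repositories?" + urlencode(params)


url = build_search_url("web scraping language:python stars:>100", max_items=50)
print(url)

# Round-trip check: the query survives URL encoding unchanged.
print(parse_qs(urlparse(url).query)["q"][0])
```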
Example: Finding Top Python Scraping Libraries
Here's a real search result:
{"full_name":"nicoleahmed/Scrapling","stars":66129,"forks":2840,"language":"Python","topics":["scraping","web-scraping","python"],"license":"BSD-3-Clause","description":"Undetected, lightweight, and adaptive web scraping...","url":"https://github.com/nicoleahmed/Scrapling"}
You get stars, forks, language, topics, license, and description — all the metadata you need for research or comparison.
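If you are curious how a raw GitHub API item relates to a record like this, the mapping is probably close to the sketch below. `to_repo_record` is a hypothetical name and the sample item is made up. The input field names (`stargazers_count`, `forks_count`, `html_url`, `license.spdx_id`) are the ones GitHub's REST API returns for repositories.

```python
def to_repo_record(item):
    """Flatten a GitHub repository object into the compact output shape."""
    license_info = item.get("license") or {}  # license is null for unlicensed repos
    return {
        "full_name": item["full_name"],
        "stars": item.get("stargazers_count", 0),
        "forks": item.get("forks_count", 0),
        "language": item.get("language"),
        "topics": item.get("topics", []),
        "license": license_info.get("spdx_id"),
        "description": item.get("description"),
        "url": item.get("html_url"),
    }


# Made-up sample item, trimmed to the fields used above.
sample = {
    "full_name": "example/repo",
    "stargazers_count": 120,
    "forks_count": 7,
    "language": "Python",
    "topics": ["scraping"],
    "license": None,
    "description": "Demo repository",
    "html_url": "https://github.com/example/repo",
}
print(to_repo_record(sample))
```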
Profile Enrichment
The enrichProfiles option fetches full user details for every username. This is especially useful for contributor and stargazer exports:
{"username":"torvalds","name":"Linus Torvalds","email":"torvalds@linux-foundation.org","company":"Linux Foundation","location":"Portland, OR","followers":228000,"public_repos":16,"bio":""}
Note: email and company are only available if the user has made them public on their profile.
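Enrichment is not free in terms of quota: fetching full details means roughly one extra API request per username. The sketch below estimates that cost. `enrichment_plan` is a hypothetical helper, and it assumes one call to GitHub's `/users/{username}` endpoint per unique user, which may not match the scraper's internals exactly.

```python
import math


def enrichment_plan(usernames, hourly_limit):
    """Return the profile URLs to fetch and how many rate-limit windows they need."""
    unique = list(dict.fromkeys(usernames))  # drop duplicates, keep order
    urls = [f"https://api.github.com/users/{name}" for name in unique]
    windows = math.ceil(len(urls) / hourly_limit) if urls else 0
    return urls, windows


urls, windows = enrichment_plan(["torvalds", "gvanrossum", "torvalds"], hourly_limit=60)
print(urls)
print(windows)  # 1: two unique profiles fit in a single 60-request window
```

Under these assumptions, 130 unique stargazers without a token would span three hourly windows, which is where a token starts to pay off.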
Token or No Token?
The scraper works both ways:
- Without token: 60 req/hr, good for small extractions (< 50 items)
- With token: 5,000 req/hr, needed for bulk exports. The token input field is marked as secret so it's never logged or exposed
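Generate a token at github.com/settings/tokens. For reading public data, a classic token with no scopes selected is enough (the public_repo scope additionally grants write access to public repositories, which you don't need here).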
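If you call the GitHub API yourself, the only difference between the two modes is one header. A minimal sketch, assuming the token lives in a `GITHUB_TOKEN` environment variable (that variable name and the `github_headers` helper are illustrative choices, not something the scraper requires):

```python
import os


def github_headers(token=None):
    """Build request headers; the Authorization header is added only when a token is given."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


# Falls back to unauthenticated (60 req/hr) when the variable is unset.
headers = github_headers(os.environ.get("GITHUB_TOKEN"))
print(sorted(headers))
```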
Try It
Run it on Apify: github-scraper
The default input searches for "web scraping language:python stars:>100" and returns the top results — you can see the output format immediately.
Or call it via API:
curl -X POST "https://api.apify.com/v2/acts/ambitious_door~github-scraper/runs" \
  -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"scrapeType": "search_repos", "searchQuery": "machine learning stars:>1000", "maxItems": 20}'
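The same call can be made from Python with only the standard library. This sketch mirrors the curl command above: the URL, headers, and input are copied from it, while `build_run_request` is a hypothetical helper name. It only builds the request, so nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import json
import urllib.request

ACTOR_RUNS_URL = "https://api.apify.com/v2/acts/ambitious_door~github-scraper/runs"


def build_run_request(apify_token, actor_input):
    """Build (but do not send) the POST request that starts an actor run."""
    return urllib.request.Request(
        ACTOR_RUNS_URL,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {apify_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_run_request(
    "YOUR_APIFY_TOKEN",
    {"scrapeType": "search_repos", "searchQuery": "machine learning stars:>1000", "maxItems": 20},
)
print(request.get_method(), request.full_url)
# To actually start the run: urllib.request.urlopen(request)
```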