Substack has quietly become one of the most valuable datasets on the internet. Thousands of newsletters publish original research, market analysis, investigative journalism, and niche expertise — and almost none of it is indexed properly by search engines. If you need to monitor a space, track authors, or build a dataset of newsletter content, scraping Substack is a legitimate and powerful tool.
This guide covers every practical approach in 2026: the undocumented JSON API, sitemap-based crawling, rate limiting, and managed alternatives.
## The Undocumented Substack JSON API
Substack does not publish an official API, but every newsletter exposes a consistent set of JSON endpoints that power their web UI. These are public — no authentication required for public newsletters.
### Fetching Posts

Every Substack publication has this endpoint:

```
GET https://{publication}.substack.com/api/v1/posts?limit=12&offset=0
```
Parameters:

- `limit` — number of posts per page (max 12 in practice, though the field accepts higher values)
- `offset` — pagination cursor
Example response structure:

```json
[
  {
    "id": 123456789,
    "title": "My Newsletter Post",
    "slug": "my-newsletter-post",
    "post_date": "2026-03-15T10:00:00.000Z",
    "description": "Post subtitle or preview...",
    "canonical_url": "https://example.substack.com/p/my-newsletter-post",
    "audience": "everyone",
    "paywall_type": "regular",
    "reactions": {"❤": 142},
    "comment_count": 23,
    "author": {
      "id": 9876543,
      "name": "Jane Author",
      "handle": "janeauthor",
      "photo_url": "https://..."
    }
  }
]
```
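The `audience` field in this payload is useful for triage: `"everyone"` marks publicly readable posts, while other values indicate subscriber-only content. A minimal helper for splitting a page of results (a sketch over the post dicts shown above; the exact set of non-public `audience` values is not documented):

```python
def split_by_access(posts: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split API post dicts into (free, gated) lists based on the
    'audience' field; 'everyone' means publicly readable."""
    free = [p for p in posts if p.get("audience") == "everyone"]
    gated = [p for p in posts if p.get("audience") != "everyone"]
    return free, gated
```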
### Paginating Through All Posts

Here is a complete Python scraper that pages through every post on a publication:

```python
import requests
import time
from typing import Generator


def fetch_all_posts(publication: str) -> Generator[dict, None, None]:
    """
    Paginate through all posts for a Substack publication.

    Args:
        publication: The subdomain, e.g. "platformer" for platformer.substack.com
    """
    base_url = f"https://{publication}.substack.com/api/v1/posts"
    offset = 0
    limit = 12
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)"
    })

    while True:
        resp = session.get(
            base_url,
            params={"limit": limit, "offset": offset},
            timeout=10,
        )
        resp.raise_for_status()
        posts = resp.json()
        if not posts:
            break
        yield from posts
        if len(posts) < limit:
            # Last page
            break
        offset += limit
        time.sleep(1.0)  # Be polite


# Usage
for post in fetch_all_posts("platformer"):
    print(post["title"], post["post_date"])
```
### Fetching Subscriber Counts and Publication Metadata

The `/api/v1/publication` endpoint returns metadata including subscriber count (when the author has made it public):

```
GET https://{publication}.substack.com/api/v1/publication
```
```python
def fetch_publication_info(publication: str) -> dict:
    url = f"https://{publication}.substack.com/api/v1/publication"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json()


info = fetch_publication_info("platformer")
print(f"Name: {info['name']}")
print(f"Subscriber count: {info.get('subscriber_count', 'hidden')}")
print(f"Description: {info.get('description', '')}")
print(f"Custom domain: {info.get('custom_domain', 'none')}")
```
Key fields in the response:

- `subscriber_count` — integer, only present when the author has enabled public subscriber display
- `paid_subscriber_count` — paid subscribers (also optional)
- `author_id`, `name`, `subdomain`
- `theme_var_background_pop` — the publication color theme (yes, this is in there)
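When `custom_domain` is set, the publication serves its content there rather than on the `*.substack.com` subdomain, which matters when you construct post or sitemap URLs later. A small helper under the assumption (inferred, not documented) that the field holds a bare hostname:

```python
def publication_base_url(info: dict) -> str:
    """Return the base URL for a publication, preferring its custom
    domain (assumed to be a bare hostname) when one is configured."""
    domain = info.get("custom_domain")
    if domain:
        return f"https://{domain}"
    return f"https://{info['subdomain']}.substack.com"
```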
### Searching Across All Substacks

There is a cross-platform search endpoint:

```
GET https://substack.com/api/v1/search/substacks?query={term}&limit=24
```
This searches newsletter names and descriptions globally. Useful for building a list of publications in a niche before scraping them individually.
```python
def search_substacks(query: str, max_results: int = 100) -> list[dict]:
    results = []
    offset = 0
    limit = 24
    while len(results) < max_results:
        resp = requests.get(
            "https://substack.com/api/v1/search/substacks",
            params={"query": query, "limit": limit, "offset": offset},
            timeout=10,
        )
        resp.raise_for_status()
        data = resp.json()
        publications = data.get("substacks", [])
        if not publications:
            break
        results.extend(publications)
        offset += limit
        time.sleep(0.8)
    return results[:max_results]


# Find all AI newsletters
for pub in search_substacks("artificial intelligence", max_results=50):
    print(pub["subdomain"], "-", pub.get("name"))
```
## Sitemap + BeautifulSoup: The Fallback Approach

If a publication uses a custom domain (not `*.substack.com`) or the JSON API returns unexpected results, the sitemap approach is more reliable.

Every Substack publication generates an XML sitemap:

```
https://{publication}.substack.com/sitemap.xml
```
```python
import requests
import time
from bs4 import BeautifulSoup


def fetch_post_urls_from_sitemap(publication: str) -> list[str]:
    sitemap_url = f"https://{publication}.substack.com/sitemap.xml"
    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()
    # The "xml" parser requires lxml (pip install lxml)
    soup = BeautifulSoup(resp.text, "xml")
    urls = [loc.text for loc in soup.find_all("loc")]
    # Filter to only post URLs (exclude /about, /archive, etc.)
    post_urls = [u for u in urls if "/p/" in u]
    return post_urls


def scrape_post_content(url: str) -> dict:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Extract Open Graph metadata — reliable across Substack versions
    title = soup.find("meta", property="og:title")
    description = soup.find("meta", property="og:description")
    author = soup.find("meta", attrs={"name": "author"})
    # Post body
    content_div = soup.find("div", class_="available-content")
    return {
        "url": url,
        "title": title["content"] if title else None,
        "description": description["content"] if description else None,
        "author": author["content"] if author else None,
        "content": content_div.get_text() if content_div else None,
    }


# Usage
for url in fetch_post_urls_from_sitemap("platformer"):
    post = scrape_post_content(url)
    print(post["title"])
    time.sleep(1.5)
```
The sitemap approach works even for paywalled publications — you still get titles, descriptions, and metadata from the public portion, even if the body is gated.
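Since HTML scraping only yields the public preview of gated posts, it helps to flag those records in your dataset. One rough heuristic, sketched with an arbitrary length threshold, operating on the dicts returned by `scrape_post_content` above:

```python
def looks_gated(post: dict, min_chars: int = 500) -> bool:
    """Heuristic: a post whose scraped body is missing or very short
    is probably paywalled, with only the preview publicly visible."""
    content = post.get("content") or ""
    return len(content.strip()) < min_chars
```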
## Rate Limiting Strategies
Substack does not aggressively block scrapers, but responsible scraping matters both ethically and practically.
Safe defaults:
- 1 request/second for API endpoints
- 1.5–2 seconds between HTML page fetches
- Max 500 posts per session before a 10-minute break
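Those defaults can be centralized in one small object instead of scattering `time.sleep` calls through your code. A sketch (the class name and structure are my own, mirroring the numbers above):

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between requests, plus a long pause
    after every `burst_limit` requests (the 500-posts-then-break rule)."""

    def __init__(self, delay: float = 1.0, burst_limit: int = 500,
                 burst_pause: float = 600.0):
        self.delay = delay
        self.burst_limit = burst_limit
        self.burst_pause = burst_pause
        self.count = 0
        self.last = 0.0

    def wait(self) -> None:
        """Block until it is polite to send the next request."""
        elapsed = time.monotonic() - self.last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.count += 1
        if self.count % self.burst_limit == 0:
            time.sleep(self.burst_pause)
        self.last = time.monotonic()
```

Call `throttle.wait()` immediately before each `session.get(...)` in the scrapers above.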
Handling 429 responses:

```python
import random
import time

import requests


def get_with_backoff(session: requests.Session, url: str, **kwargs) -> requests.Response:
    """GET with exponential backoff on rate limit."""
    delay = 1.0
    for attempt in range(5):
        resp = session.get(url, **kwargs)
        if resp.status_code == 429:
            retry_after = int(resp.headers.get("Retry-After", delay * 2))
            print(f"Rate limited. Waiting {retry_after}s...")
            time.sleep(retry_after + random.uniform(0, 1))
            delay *= 2
            continue
        resp.raise_for_status()
        return resp
    raise Exception(f"Failed after retries: {url}")
```
Rotating User-Agent strings:

```python
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

session.headers["User-Agent"] = random.choice(USER_AGENTS)
```
## Hosted Alternative: The Data Collector API

If you do not want to maintain your own scraper, The Data Collector provides a free hosted Substack search endpoint at https://frog03-20494.wykr.es.

Get a free API key instantly:

```shell
curl -X POST https://frog03-20494.wykr.es/api/register \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com"}'
```
You get 100 free API calls — enough to prototype a research tool or monitor a niche of newsletters without managing any infrastructure. No credit card required, no waitlist.
## Managed Scraping: Apify Actors
For production workloads — thousands of publications, scheduled runs, structured exports to CSV/JSON/Google Sheets — managed actors are the better choice.
The cryptosignals collection on Apify includes Substack scrapers that handle pagination, rate limiting, and proxy rotation automatically. Define your target publications, set a schedule, and get clean structured data without managing infrastructure.
Apify's free tier covers light usage; paid plans start at $49/month for heavier workloads.
## Putting It All Together

A complete research workflow for mapping a newsletter niche, using the `search_substacks`, `fetch_publication_info`, and `fetch_all_posts` functions defined earlier:

```python
import itertools
import json
import time


def map_newsletter_niche(topic: str, output_file: str = "newsletters.json"):
    """Discover and profile newsletters in a topic area."""
    print(f"Searching for '{topic}' newsletters...")
    publications = search_substacks(topic, max_results=50)

    results = []
    for pub in publications:
        subdomain = pub["subdomain"]
        print(f"Profiling {subdomain}...")
        try:
            info = fetch_publication_info(subdomain)
            # islice stops the generator after 5 posts instead of
            # paging through the entire archive
            recent_posts = list(itertools.islice(fetch_all_posts(subdomain), 5))
            results.append({
                "subdomain": subdomain,
                "name": info.get("name"),
                "subscriber_count": info.get("subscriber_count"),
                "description": info.get("description"),
                "recent_posts": [
                    {"title": p["title"], "date": p["post_date"]}
                    for p in recent_posts
                ],
            })
        except Exception as e:
            print(f"  Failed for {subdomain}: {e}")
        time.sleep(1.0)

    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)
    print(f"Done. {len(results)} newsletters saved to {output_file}")


map_newsletter_niche("machine learning")
```
## Key Takeaways

- The `/api/v1/posts` and `/api/v1/publication` endpoints are public, consistent, and well-structured — use them first
- Subscriber counts are only available when the author has enabled public display (roughly 40–60% of large newsletters do this)
- The sitemap + BeautifulSoup approach is slower but more robust for custom domains and HTML-level metadata extraction
- Stay under 1 req/sec on JSON endpoints, 0.5 req/sec on HTML — Substack is lenient but not infinitely so
- For hosted search without infrastructure overhead, try The Data Collector API at https://frog03-20494.wykr.es (100 free calls, instant key)
- For scheduled, large-scale production runs, Apify actors at https://apify.com/cryptosignals handle the heavy lifting
Happy scraping — responsibly.