How to Scrape Goodreads Data in 2026: Complete Guide


Why Scrape Goodreads?

Goodreads is the world's largest book community — 150+ million members, 3.5 billion books catalogued, and millions of reviews. Whether you're building a book recommendation engine, analyzing reading trends, tracking author performance, or researching the publishing market, Goodreads data is invaluable.

But Goodreads has no public API (they shut it down in December 2020). That means scraping is the only way to access structured book data at scale.

In this guide, I'll show you how to scrape Goodreads books, reviews, and author data using Python — including a ready-to-use solution that handles anti-bot detection, pagination, and data formatting.

What Data Can You Extract from Goodreads?

Here's what's available:

  • Book details: title, author, ISBN/ISBN-13, publisher, publication date, page count, edition, format
  • Ratings & reviews: average rating, total ratings, total reviews, star distribution, individual review text
  • Author info: name, bio, follower count, book count
  • Genres & shelves: genre tags, popular shelves, reading lists
  • Search results: keyword search, genre browsing, bestseller lists
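To keep your pipeline honest about which of these fields you actually rely on, it can help to pin the record shape down in code. Here's a minimal sketch as a Python TypedDict — the field names are assumptions for illustration and may differ from the scraper's actual output keys:

```python
from typing import List, TypedDict

class BookRecord(TypedDict, total=False):
    """Sketch of a scraped Goodreads book record (field names assumed)."""
    title: str
    author: str
    isbn: str
    isbn13: str
    publisher: str
    publishDate: str      # ISO date string, e.g. "2021-05-04"
    pages: int
    rating: float         # average rating, 0-5
    ratingsCount: int
    reviewsCount: int
    genres: List[str]
    description: str
    url: str

# total=False means every field is optional, which matches scraping
# reality: any given page may be missing publisher, ISBN, etc.
book: BookRecord = {"title": "Dune", "rating": 4.27}
```

Using `total=False` is deliberate: scraped records are rarely complete, so downstream code should treat every field as optional.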

Method 1: DIY with Python + BeautifulSoup

You can scrape Goodreads yourself with requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup
import json

def scrape_goodreads_book(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # These class names change whenever Goodreads ships a redesign,
    # so guard against missing elements instead of crashing.
    def text_of(selector):
        node = soup.select_one(selector)
        return node.text.strip() if node else None

    title = text_of("h1.Text__title1")
    author = text_of("span.ContributorLink__name")
    rating = text_of("div.RatingStatistics__rating")

    return {
        "title": title,
        "author": author,
        "rating": float(rating) if rating else None,
        "url": url
    }

# Example usage
book = scrape_goodreads_book("https://www.goodreads.com/book/show/5907.The_Hobbit")
print(json.dumps(book, indent=2))

The problem: Goodreads uses heavy JavaScript rendering, dynamic class names, and aggressive rate limiting. Your DIY scraper will break within weeks as selectors change, and you'll get blocked after a few hundred requests.

Method 2: Using the Apify Goodreads Scraper (Recommended)

A more reliable approach is using a managed scraper that handles anti-bot detection, retries, and proxy rotation automatically.

The Goodreads Scraper on Apify extracts structured book data with zero configuration:

from apify_client import ApifyClient

# Initialize the Apify client
client = ApifyClient("YOUR_APIFY_TOKEN")

# Configure the scraper
run_input = {
    "searchTerms": ["science fiction 2026"],
    "maxResults": 50,
    "includeReviews": True
}

# Run the actor
run = client.actor("cryptosignals/goodreads-scraper").call(run_input=run_input)

# Fetch results
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['title']} by {item['author']}: {item['rating']}/5 ({item['ratingsCount']} ratings)")

Install the client first:

pip install apify-client

Sample Output

Here's what the structured output looks like:

{
  "title": "Project Hail Mary",
  "author": "Andy Weir",
  "rating": 4.52,
  "ratingsCount": 1245678,
  "reviewsCount": 89432,
  "isbn": "0593135202",
  "isbn13": "9780593135204",
  "pages": 496,
  "publisher": "Ballantine Books",
  "publishDate": "2021-05-04",
  "genres": ["Science Fiction", "Fiction", "Audiobook", "Space"],
  "description": "Ryland Grace is the sole survivor on a desperate...",
  "url": "https://www.goodreads.com/book/show/54493401"
}

Clean, structured JSON with every field you need — no parsing HTML, no broken selectors, no proxy management.

Use Cases for Goodreads Data

1. Book Recommendation Engines

Scrape ratings, genres, and review sentiment to build collaborative filtering models. Combine with user shelf data to find "readers who liked X also liked Y" patterns.
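The "readers who liked X also liked Y" idea boils down to co-occurrence counting: books that appear together on many readers' shelves are likely related. Here's a minimal sketch — the shelf data below is made up for illustration, and in practice it would come from scraped shelf or review data:

```python
from collections import Counter
from itertools import combinations

# Each inner list represents one reader's shelf (made-up sample data).
shelves = [
    ["The Hobbit", "Dune", "Project Hail Mary"],
    ["Dune", "Project Hail Mary", "Foundation"],
    ["The Hobbit", "Foundation"],
]

# Count how often each pair of books is shelved together.
pair_counts = Counter()
for shelf in shelves:
    for a, b in combinations(sorted(set(shelf)), 2):
        pair_counts[(a, b)] += 1

def also_liked(title, n=3):
    """Rank books most often shelved together with `title`."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == title:
            scores[b] += count
        elif b == title:
            scores[a] += count
    return [t for t, _ in scores.most_common(n)]

print(also_liked("Dune"))
```

Real systems normalize these counts (e.g. cosine similarity over a reader-book matrix) so popular books don't dominate, but raw co-occurrence is a reasonable first baseline.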

2. Publishing Market Research

Track which genres are trending, which debut authors are gaining traction, and what publication formats (hardcover vs. ebook vs. audio) are growing. Invaluable for publishers and literary agents.

3. Author Analytics

Monitor an author's rating trajectory over time, track review sentiment, compare performance across titles. Useful for marketing teams and self-published authors.

4. Academic Research

Study reading trends, cultural preferences across regions, or the impact of book-to-film adaptations on ratings. Goodreads data has been used in hundreds of published papers.

5. Competitive Intelligence for Booksellers

Track competitor titles' performance, identify underserved niches, and optimize inventory based on real reader demand rather than publisher push.

Cost Comparison: Goodreads Data Sources

| Method | Cost | Reliability | Speed |
| --- | --- | --- | --- |
| DIY scraper | Free (your time) | Low — breaks often | Slow — rate limited |
| Goodreads API | Dead (shut down 2020) | N/A | N/A |
| Apify Goodreads Scraper | $0.01/result, first 100 free | High — maintained | Fast — parallel |
| Data brokers | $200-500/dataset | Medium | One-time dump |
| Manual collection | Free | High | Extremely slow |

At $0.01 per result with the first 100 free, scraping 1,000 books costs $9 (900 paid results). That's roughly the price of a single paperback.
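The pricing is simple enough to sanity-check in a couple of lines (per-result price and free allowance here are taken from the table above):

```python
def run_cost(results, price_per_result=0.01, free_results=100):
    """Cost of one scraper run: only results beyond the free tier are billed."""
    return max(0, results - free_results) * price_per_result

print(run_cost(1000))  # 900 paid results -> 9.0
print(run_cost(50))    # fully within the free tier -> 0.0
```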

Advanced: Scraping Goodreads Reviews at Scale

Reviews are the most valuable Goodreads data for NLP and sentiment analysis. Here's how to extract them:

from apify_client import ApifyClient
import pandas as pd

client = ApifyClient("YOUR_APIFY_TOKEN")

# Scrape reviews for a specific book
run_input = {
    "bookUrls": ["https://www.goodreads.com/book/show/5907.The_Hobbit"],
    "includeReviews": True,
    "maxReviews": 500
}

run = client.actor("cryptosignals/goodreads-scraper").call(run_input=run_input)

# Load into pandas for analysis
results = list(client.dataset(run["defaultDatasetId"]).iterate_items())
df = pd.DataFrame(results)

# Basic rating breakdown
print(f"Average rating: {df['rating'].mean():.2f}")
print(f"5-star reviews: {len(df[df['rating']==5])}")
print(f"1-star reviews: {len(df[df['rating']==1])}")
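If you'd rather not pull in pandas, the star distribution can be computed with the standard library alone. The review records below are made up for illustration; real ones would come from a scraper run:

```python
from collections import Counter

# Made-up sample review records (real data would come from the scraper).
reviews = [
    {"rating": 5, "text": "Loved it"},
    {"rating": 5, "text": "A masterpiece"},
    {"rating": 3, "text": "Fine"},
    {"rating": 1, "text": "Not for me"},
]

# Count reviews per star level, then print shares from 5 stars down to 1.
dist = Counter(r["rating"] for r in reviews)
total = len(reviews)
for stars in range(5, 0, -1):
    count = dist.get(stars, 0)
    print(f"{stars} stars: {count} ({count / total:.0%})")
```

This distribution is the usual starting point before heavier NLP: a book with a U-shaped (polarized) distribution reads very differently from one with the same average but a tight peak.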

Tips for Scraping Goodreads Effectively

  1. Start with search terms, not URLs. The scraper can find books by keyword, which is faster than collecting individual book URLs.

  2. Use the free tier to test. Every run includes 100 free results — enough to validate your data pipeline before committing.

  3. Export to CSV for spreadsheets. Apify lets you download results as CSV, JSON, or Excel directly from the dashboard.

  4. Schedule recurring scrapes. Set up daily or weekly runs to track how ratings and review counts change over time.

  5. Respect the platform. Don't scrape faster than necessary. The managed scraper handles rate limiting automatically.
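For tip 3, you can also export locally instead of downloading from the dashboard. Here's a sketch using the standard-library `csv` module — the sample record is made up, and in practice `items` would be the list returned by `iterate_items()`:

```python
import csv

def save_books_csv(items, path):
    """Write a list of dicts to CSV, using the first item's keys as columns.

    extrasaction="ignore" silently drops fields that later items have
    but the first item lacks, which is common with scraped records.
    """
    if not items:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=list(items[0].keys()), extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(items)

# Made-up sample record for illustration.
save_books_csv(
    [{"title": "The Hobbit", "author": "J.R.R. Tolkien", "rating": 4.29}],
    "books.csv",
)
```

Keying the columns off the first item is a simplification; for production exports you'd want the union of keys across all items.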

Getting Started

  1. Create a free Apify account
  2. Go to the Goodreads Scraper
  3. Enter your search terms or book URLs
  4. Click Start and get structured data in minutes

No credit card needed for the free tier. First 100 results per run are always free.


Built by CryptoSignals on Apify. Have questions or feature requests? Open an issue on the actor page.

