WordPress search works well when your content lives inside posts and pages.
But if important information is inside PDF files, WordPress often misses it completely.
A visitor may search for a phrase that clearly exists inside a PDF manual, report, form, or policy document, yet WordPress returns no results.
The PDF is uploaded.
The content exists.
But search still cannot find it.
So why does this happen?
WordPress Search Does Not Read PDF Content
Default WordPress search mainly checks these columns of the wp_posts table:
- post_title
- post_content
- post_excerpt
When you publish a post, the content is stored in post_content, so WordPress can search it.
But when you upload a PDF, WordPress treats it as a media attachment. It creates a post of type attachment and stores the file title, URL, MIME type, upload date, and some metadata.
What it does not do by default is open the PDF, read the text inside it, and save that text into post_content.
So this kind of situation is common:
PDF file: employee-handbook.pdf
Text inside PDF: "Remote work requests must be approved by HR"
WordPress searchable content: filename and attachment data only
Search query: "remote work requests"
Result: no match
The text is inside the file, but not inside the database fields WordPress normally searches.
That is the main reason PDF search fails.
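To make the mismatch concrete, here is a minimal sketch in plain PHP that imitates the default behaviour: every query word has to appear in the title, excerpt, or content. The function name `simulate_default_search()` and the sample attachment row are assumptions for illustration only; WordPress itself does this in SQL with LIKE clauses.

```php
<?php
// Simplified stand-in for default search: every query word must appear
// somewhere in post_title, post_excerpt, or post_content.
function simulate_default_search(array $post, string $query): bool
{
    $haystack = $post['post_title'] . ' ' . $post['post_excerpt'] . ' ' . $post['post_content'];
    $words = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === []) {
        return false; // nothing to search for
    }
    foreach ($words as $word) {
        // stripos() is case-insensitive, like a LIKE comparison in most collations
        if (stripos($haystack, $word) === false) {
            return false;
        }
    }
    return true;
}

// Hypothetical attachment row for employee-handbook.pdf:
// the text inside the PDF never reaches these fields.
$attachment = [
    'post_title'   => 'employee-handbook',
    'post_excerpt' => '',
    'post_content' => '',
];

var_dump(simulate_default_search($attachment, 'remote work requests')); // bool(false)
var_dump(simulate_default_search($attachment, 'handbook'));             // bool(true)
```

The second call matches only because the word happens to be in the title, which is why a search by filename can appear to work while a search by content does not.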
Not All PDFs Are the Same
Before fixing PDF search, you need to understand one important difference.
There are two common types of PDFs:
- Text-based PDFs
- Scanned PDFs
They may look the same in the browser, but technically they behave very differently.
Text-Based PDFs
A text-based PDF contains real selectable text.
If you can open a PDF, select a sentence, copy it, and paste it into a text editor, it probably has a text layer.
These PDFs are usually exported from tools like Word, Google Docs, InDesign, or reporting software.
For these files, a plugin can use a PDF parser to extract the text.
In PHP, one common library is smalot/pdfparser.
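use Smalot\PdfParser\Parser;
$parser = new Parser();
try {
// Read the PDF and pull out its text layer
$pdf = $parser->parseFile($file_path);
$text = $pdf->getText();
} catch (\Exception $e) {
// Encrypted or malformed PDFs can throw; treat them as having no text
$text = '';
}
if (trim($text) !== '') {
// Store extracted text for search
}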
After extraction, the text needs to be stored somewhere searchable.
For small use cases, post meta may work. But meta values have no full-text index, so for better control and performance a custom database table is usually better.
Example:
CREATE TABLE wp_pdf_search_index (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
attachment_id BIGINT UNSIGNED NOT NULL,
extracted_text LONGTEXT NOT NULL,
indexed_at DATETIME NOT NULL,
FULLTEXT KEY pdf_text_index (extracted_text)
);
Then the plugin can search the extracted PDF content:
SELECT attachment_id
FROM wp_pdf_search_index
WHERE MATCH(extracted_text)
AGAINST ('remote work requests' IN NATURAL LANGUAGE MODE);
So the PDF itself is not searched directly every time.
Instead, the plugin extracts the PDF text once, stores it, and searches the stored index.
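The extract-once, search-many idea can be sketched as two small functions. This is an illustration under assumptions, not the plugin's actual code: `pdf_index_store()` and `pdf_index_search()` are hypothetical names, and `$db` stands for any object that behaves like WordPress's global `$wpdb` (only `delete()`, `insert()`, `prepare()`, and `get_col()` are used). Because the example table has no unique key on attachment_id, the sketch deletes the old row before inserting the new one.

```php
<?php
// Save (or refresh) the extracted text for one attachment.
function pdf_index_store($db, string $table, int $attachment_id, string $text): void
{
    // Keep one row per attachment: remove any stale row first.
    $db->delete($table, ['attachment_id' => $attachment_id], ['%d']);
    $db->insert(
        $table,
        [
            'attachment_id'  => $attachment_id,
            'extracted_text' => $text,
            'indexed_at'     => gmdate('Y-m-d H:i:s'), // UTC timestamp
        ],
        ['%d', '%s', '%s']
    );
}

// Return the IDs of attachments whose stored text matches the search term.
function pdf_index_search($db, string $table, string $term): array
{
    // prepare() escapes the user's term; the table name is assumed to be
    // trusted plugin configuration, not user input.
    $sql = $db->prepare(
        "SELECT attachment_id FROM {$table}
         WHERE MATCH(extracted_text) AGAINST (%s IN NATURAL LANGUAGE MODE)",
        $term
    );
    return array_map('intval', (array) $db->get_col($sql));
}
```

Inside WordPress this would be called with `global $wpdb;` and a table name such as `$wpdb->prefix . 'pdf_search_index'`, so the index follows the site's table prefix instead of a hardcoded `wp_`.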
Showing PDF Results in WordPress Search
Once PDF content is indexed, the next step is showing matching PDF files in the search results.
A plugin can hook into WordPress search using pre_get_posts, or it can provide a separate PDF search form.
For example, a shortcode-based PDF search form can be placed on a resource page:
[webequipe_pdf_search_form]
[webequipe_pdf_search_form placeholder="Search PDFs…" button_text="Search" results_per_page="10"]
This is useful for documentation pages, school notice sections, report libraries, member portals, or product manual pages.
The Scanned PDF Problem
Text extraction works only when the PDF contains real text.
But many PDFs are scanned documents.
Examples:
- Old reports
- Signed forms
- Scanned notices
- Paper manuals
- Archive documents
- Government or legal documents
A scanned PDF may look readable to a human, but technically each page is just an image.
So when a parser runs this:
$text = $pdf->getText();
It may return:
''
There are visible words on the page, but no machine-readable text layer.
This is where normal PDF parsing stops.
OCR Is Needed for Scanned PDFs
For scanned PDFs, the solution is OCR.
OCR means Optical Character Recognition. It reads text from images and converts it into machine-readable text.
The flow looks like this:
Scanned PDF → page images → OCR → detected text → search index
Google Cloud Vision API is one common OCR option. It can read pixel-based document images and return detected text.
A simplified flow:
$image_path = '/tmp/page-1.png';
$ocr_text = run_google_vision_ocr($image_path);
if (! empty($ocr_text)) {
index_pdf_text($attachment_id, $ocr_text);
}
The key idea is simple:
Text-based PDFs need parsing.
Scanned PDFs need OCR first.
If a search plugin treats both the same way, scanned PDFs will usually fail.
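The `run_google_vision_ocr()` helper in the simplified flow is a placeholder. As a rough sketch of what sits behind it, the two functions below build the JSON body for the Vision REST endpoint `https://vision.googleapis.com/v1/images:annotate` and read the detected text from its response. The function names are assumptions, and sending the request (for example with `wp_remote_post()`) and authentication are deliberately left out.

```php
<?php
// Build the JSON request body for one page image.
function build_vision_ocr_request(string $image_bytes): string
{
    return json_encode([
        'requests' => [[
            // The image is sent inline as base64
            'image'    => ['content' => base64_encode($image_bytes)],
            // DOCUMENT_TEXT_DETECTION is the feature aimed at dense document text
            'features' => [['type' => 'DOCUMENT_TEXT_DETECTION']],
        ]],
    ]);
}

// Pull the detected text out of the raw JSON response.
// Returns an empty string when nothing usable came back.
function extract_vision_ocr_text(string $response_json): string
{
    $data = json_decode($response_json, true);
    if (!is_array($data)) {
        return '';
    }
    return trim((string) ($data['responses'][0]['fullTextAnnotation']['text'] ?? ''));
}

// Usage with a hypothetical response body:
$sample = '{"responses":[{"fullTextAnnotation":{"text":"Remote work requests must be approved by HR\n"}}]}';
echo extract_vision_ocr_text($sample), PHP_EOL;
```

Keeping request building and response parsing separate from the HTTP call makes both halves easy to test without network access or API credentials.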
A Simple Detection Strategy
Most users do not know whether their PDF is text-based or scanned.
So the plugin should decide automatically.
A practical approach is:
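$text = extract_pdf_text($file_path);
// mb_strlen counts characters, not bytes; 100 is a starting threshold to tune
if (mb_strlen(trim($text)) > 100) {
// Enough real text: index it directly
index_pdf_text($attachment_id, $text);
} else {
// Empty or near-empty result: probably a scan, so hand it to OCR
queue_pdf_for_ocr($attachment_id);
}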
First, try normal extraction.
If enough text is found, index it.
If the result is empty or too short, send the file to OCR.
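One refinement worth considering: a single whole-file threshold can misjudge PDFs that mix typed and scanned pages. The sketch below applies the same check page by page. The function name `pages_needing_ocr()` and the default of 100 characters per page are assumptions to tune; with smalot/pdfparser the page texts could come from looping over `$pdf->getPages()` and calling `getText()` on each page.

```php
<?php
// Given one extracted string per page, return the 1-based numbers of the
// pages that look scanned (too little real text).
function pages_needing_ocr(array $page_texts, int $min_chars_per_page = 100): array
{
    $needs_ocr = [];
    $page_number = 0;
    foreach ($page_texts as $text) {
        $page_number++;
        // Count only letters and digits, so whitespace and stray symbols
        // cannot push a scanned page over the threshold.
        // (If the text is not valid UTF-8 the count is false, which also
        // sends the page to OCR.)
        $useful = preg_match_all('/[\p{L}\p{N}]/u', (string) $text);
        if ($useful < $min_chars_per_page) {
            $needs_ocr[] = $page_number;
        }
    }
    return $needs_ocr;
}

// Usage: page 1 is typed, pages 2 and 3 came back (almost) empty.
$typed = str_repeat('Remote work requests must be approved by HR. ', 5);
print_r(pages_needing_ocr([$typed, '', "  \n "]));
```

An empty result means normal extraction was enough for the whole file; otherwise only the listed pages need to go through OCR, which keeps OCR usage down.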
For large PDFs or OCR processing, this should run in the background. Otherwise, uploads can become slow and PHP timeouts can happen.
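A better flow is:
Upload PDF → add a job to a queue → extract text or run OCR in the background → save the result to the index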
In WordPress, this can be handled with WP-Cron, Action Scheduler, or a custom queue system.
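A minimal sketch of the queueing step, assuming a hypothetical hook name `webequipe_pdf_ocr_job`: it prefers Action Scheduler when that library is loaded, falls back to a one-off WP-Cron event, and does nothing outside WordPress. The `function_exists()` guards are there so the file can also be loaded on its own.

```php
<?php
const PDF_OCR_HOOK = 'webequipe_pdf_ocr_job'; // hypothetical hook name

// Hand the attachment to a background worker and report which queue took it.
function queue_pdf_for_ocr(int $attachment_id): string
{
    if (function_exists('as_enqueue_async_action')) {
        // Action Scheduler: runs as soon as its queue is processed
        as_enqueue_async_action(PDF_OCR_HOOK, [$attachment_id]);
        return 'action-scheduler';
    }
    if (function_exists('wp_schedule_single_event')) {
        // WP-Cron: one-off event, fired on a later page load
        wp_schedule_single_event(time(), PDF_OCR_HOOK, [$attachment_id]);
        return 'wp-cron';
    }
    return 'none'; // not running inside WordPress
}

if (function_exists('add_action')) {
    // The worker runs outside the upload request, so a slow OCR call
    // cannot time out the upload itself.
    add_action(PDF_OCR_HOOK, function ($attachment_id) {
        $attachment_id = (int) $attachment_id;
        // Render pages to images, run OCR, and store the text here.
    });
}
```

WP-Cron only fires when the site receives traffic, so on low-traffic sites a real system cron or Action Scheduler tends to be the more predictable choice.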
Final Thoughts
WordPress search ignores PDFs because PDF content is not stored in the database fields WordPress normally searches.
To fix it, you need a separate indexing pipeline.
For text-based PDFs, extract the text and store it in a searchable index.
For scanned PDFs, run OCR first, then store the detected text.
That is the technical foundation of PDF search in WordPress.
We built this into WebEquipe PDF Search. The free version helps index text-based PDFs, and the Pro version adds OCR support for scanned PDFs.
But the bigger idea is simple:
Uploading a PDF does not automatically make its content searchable.
If your WordPress site depends on PDFs, those files need to be treated as part of your search index.