When you run a video aggregator that pulls trending content from nine Asia-Pacific regions every four hours, search becomes the place where users either find what they want or leave for YouTube directly. At TopVideoHub we aggregate clips across Japanese, Korean, Mandarin, Vietnamese, Thai, and English titles in the same index, and our original SQLite FTS5 setup with a CJK tokenizer worked well for exact substring matches but collapsed the moment a user typed blakpink instead of blackpink, or searched for Aimyon when our database stored あいみょん.
This post walks through the migration we did from SQLite FTS5 to OpenSearch for typo-tolerant search, while keeping FTS5 as a fallback. Running on a budget LiteSpeed host means we could not just throw a five-node Elasticsearch cluster at the problem — we had to be careful about heap pressure, indexing latency, and how Cloudflare caches the search responses.
Why FTS5 was not enough for fuzzy CJK queries
SQLite FTS5 with the unicode61 tokenizer plus a CJK bigram tokenizer handles tokenization across scripts well enough. Our migration looked like this:
CREATE VIRTUAL TABLE videos_fts USING fts5(
title,
channel_name,
description,
tokenize = 'unicode61 remove_diacritics 2',
content = 'videos',
content_rowid = 'id'
);
This gave us fast prefix and substring matching. The problem is that FTS5 does not implement fuzziness in the Damerau-Levenshtein sense. If a user searched for blakpink, FTS5 returned zero results because the trigram split (bla, lak, akp, kpi, pin, ink) shared almost no overlap with the indexed blackpink tokens.
We tried compensating with edge n-grams stored as a denormalized column, then with manual misspelling synonym dictionaries. Both approaches worked for narrow cases but exploded index size and required hand-maintained mappings for every new artist or trending phrase. For CJK content, n-grams of length 1-2 produced massive recall but garbage precision — searching for 新 returned 40,000 plus videos because the character appears in roughly every fourth Chinese title.
The decision point was clear. We needed an engine that supports proper fuzziness with edit distance, language-aware analyzers, and the ability to weight matches across multiple fields. OpenSearch fit the bill, especially because we could run a single-node deployment on a small VPS we already had provisioned for blog ingestion. Self-hosted, no managed-service bill, and the OpenSearch 2.x line ships with the analyzers we needed out of the box.
Index design for multi-language content
The first real engineering decision was the index mapping. CJK languages do not use whitespace as a token boundary, so the standard analyzer is useless for Japanese, Chinese, and Korean. OpenSearch ships with the ICU analysis plugin, and there are well-maintained kuromoji (Japanese), nori (Korean), and smartcn (Chinese) plugins.
Rather than maintain one analyzer per language and a router on the application side, we used the multi-fields pattern. The same source text is indexed under several analyzers and we search them with multi_match. The query engine picks the best match per shard, and we score-weight them on the application side.
Here is the mapping we settled on after about two weeks of tuning:
{"settings":{"number_of_shards":1,"number_of_replicas":0,"analysis":{"analyzer":{"title_standard":{"type":"custom","tokenizer":"icu_tokenizer","filter":["icu_folding","lowercase"]},"title_edge_ngram":{"type":"custom","tokenizer":"icu_tokenizer","filter":["icu_folding","lowercase","edge_ngram_filter"]},"title_cjk_bigram":{"type":"custom","tokenizer":"icu_tokenizer","filter":["cjk_bigram","lowercase"]}},"filter":{"edge_ngram_filter":{"type":"edge_ngram","min_gram":2,"max_gram":15}}}},"mappings":{"properties":{"video_id":{"type":"keyword"},"title":{"type":"text","analyzer":"title_standard","fields":{"edge":{"type":"text","analyzer":"title_edge_ngram","search_analyzer":"title_standard"},"cjk":{"type":"text","analyzer":"title_cjk_bigram"},"keyword":{"type":"keyword","ignore_above":256}}},"channel_name":{"type":"text","analyzer":"title_standard"},"region":{"type":"keyword"},"category_id":{"type":"integer"},"published_at":{"type":"date"},"view_count":{"type":"long"},"duration_seconds":{"type":"integer"}}}}
A few design choices worth explaining:
- We dropped
number_of_replicasto zero because a single-node deployment cannot replicate, and the default of one leaves the cluster in yellow status forever and breaks health-check scripts. -
icu_foldinghandles diacritic stripping for Vietnamese (hóabecomeshoa) and width normalization for fullwidth Latin (HELLObecomeshello). - The
cjk_bigramfilter splits CJK characters into overlapping pairs, so日本語indexes as日本,本語. This is the standard approach when you do not want to ship and version a heavy morphological dictionary. -
search_analyzerdiffers fromanalyzeron theedgesubfield. We index with edge n-grams but search without them. Otherwise every search term would also explode into n-grams on the query side, and the relevance scoring would be unusable.
Indexing pipeline from PHP 8.4
Our existing fetch cron pulls from the YouTube Data API every 2-7 hours depending on the site, normalizes the payload, and writes to SQLite. The OpenSearch indexing hooks into the same path as a fire-and-forget bulk step at the end of each fetch cycle. If OpenSearch is unavailable, we log and move on — SQLite remains the source of truth.
<?php
declare(strict_types=1);
namespace App\Search;
final class OpenSearchIndexer
{
private const ENDPOINT = 'http://127.0.0.1:9200';
private const INDEX = 'videos_v3';
private const BATCH_SIZE = 200;
public function __construct(
private readonly \PDO $db,
private readonly \App\Logger $log,
) {}
public function indexBatch(array $videoIds): int
{
if ($videoIds === []) {
return 0;
}
$placeholders = implode(',', array_fill(0, count($videoIds), '?'));
$stmt = $this->db->prepare(
"SELECT id, video_id, title, channel_name, region,
category_id, published_at, view_count, duration_seconds
FROM videos WHERE id IN ($placeholders)"
);
$stmt->execute($videoIds);
$bulk = '';
$count = 0;
while ($row = $stmt->fetch(\PDO::FETCH_ASSOC)) {
$bulk .= json_encode([
'index' => [
'_index' => self::INDEX,
'_id' => $row['video_id'],
],
], JSON_THROW_ON_ERROR) . "\n";
$bulk .= json_encode([
'video_id' => $row['video_id'],
'title' => $row['title'],
'channel_name' => $row['channel_name'],
'region' => $row['region'],
'category_id' => (int)$row['category_id'],
'published_at' => $row['published_at'],
'view_count' => (int)$row['view_count'],
'duration_seconds' => (int)$row['duration_seconds'],
], JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE) . "\n";
$count++;
}
$response = $this->postBulk($bulk);
if ($response === null) {
$this->log->warn('opensearch.bulk.failed', ['count' => $count]);
return 0;
}
if (!empty($response['errors'])) {
$this->logBulkErrors($response['items']);
}
return $count;
}
private function postBulk(string $payload): ?array
{
$ch = curl_init(self::ENDPOINT . '/_bulk');
curl_setopt_array($ch, [
CURLOPT_POST => true,
CURLOPT_POSTFIELDS => $payload,
CURLOPT_HTTPHEADER => ['Content-Type: application/x-ndjson'],
CURLOPT_RETURNTRANSFER => true,
CURLOPT_TIMEOUT => 8,
CURLOPT_CONNECTTIMEOUT => 2,
]);
$body = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($code !== 200 || $body === false) {
return null;
}
return json_decode($body, true, 512, JSON_THROW_ON_ERROR);
}
private function logBulkErrors(array $items): void
{
$failures = [];
foreach ($items as $item) {
$op = $item['index'] ?? $item['create'] ?? null;
if ($op === null || ($op['status'] ?? 200) < 400) {
continue;
}
$failures[] = [
'id' => $op['_id'] ?? null,
'status' => $op['status'],
'error' => $op['error']['reason'] ?? 'unknown',
];
}
if ($failures !== []) {
$this->log->warn('opensearch.bulk.items_failed', $failures);
}
}
}
The eight-second timeout matters. PHP-FPM under LiteSpeed has a 180-second ceiling, but if OpenSearch is overloaded we want to surface failures fast, not hold the worker. We batch in 200-document chunks because smaller batches give us better backpressure when JVM heap is tight, and the bulk endpoint copes well at this size.
One thing we learned the hard way: do not use the external versioning mode unless you have a real monotonic version. We tried using published_at as the version field, but YouTube backfills publishedAt sometimes when a video gets re-uploaded under the same ID, and the new value was occasionally smaller than the old one. Documents stopped updating silently. Default internal versioning is the safe choice for a feed-style index.
Building the search query
The interesting part is the query side. A user types something — possibly with typos, possibly in CJK, possibly mixed (BTS dynamite live). We want fuzzy matching for Latin terms, exact bigram matching for CJK, and edge n-gram matching for prefix-as-you-type behavior. All three need to score together so the best result wins.
We use a single multi_match with best_fields type, plus a fuzziness parameter that only kicks in for Latin scripts.
<?php
declare(strict_types=1);
namespace App\Search;
final class OpenSearchQuery
{
private const ENDPOINT = 'http://127.0.0.1:9200';
private const INDEX = 'videos_v3';
public function search(string $term, int $limit = 24, ?string $region = null): array
{
$term = trim($term);
if ($term === '') {
return [];
}
$isCjk = preg_match('/[\x{3000}-\x{9FFF}\x{AC00}-\x{D7AF}]/u', $term) === 1;
$fuzziness = $isCjk ? '0' : 'AUTO:4,7';
$query = [
'size' => $limit,
'query' => [
'function_score' => [
'query' => [
'bool' => [
'should' => [
[
'multi_match' => [
'query' => $term,
'fields' => ['title^3', 'channel_name^1.5'],
'type' => 'best_fields',
'fuzziness' => $fuzziness,
'prefix_length' => 1,
'max_expansions' => 30,
],
],
[
'match' => [
'title.edge' => [
'query' => $term,
'boost' => 2.0,
],
],
],
[
'match' => [
'title.cjk' => [
'query' => $term,
'boost' => $isCjk ? 4.0 : 0.5,
],
],
],
],
'minimum_should_match' => 1,
'filter' => $this->buildFilters($region),
],
],
'functions' => [
[
'gauss' => [
'published_at' => [
'origin' => 'now',
'scale' => '30d',
'decay' => 0.5,
],
],
],
[
'field_value_factor' => [
'field' => 'view_count',
'modifier' => 'log1p',
'factor' => 0.2,
'missing' => 0,
],
],
],
'score_mode' => 'sum',
'boost_mode' => 'multiply',
],
],
];
return $this->execute($query);
}
private function buildFilters(?string $region): array
{
$filters = [
['range' => ['duration_seconds' => ['gte' => 30, 'lte' => 1800]]],
];
if ($region !== null) {
$filters[] = ['term' => ['region' => $region]];
}
return $filters;
}
private function execute(array $query): array
{
$ch = curl_init(self::ENDPOINT . '/' . self::INDEX . '/_search');
curl_setopt_array($ch, [
CURLOPT_POST => true,
CURLOPT_POSTFIELDS => json_encode($query, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR),
CURLOPT_HTTPHEADER => ['Content-Type: application/json'],
CURLOPT_RETURNTRANSFER => true,
CURLOPT_TIMEOUT => 2,
CURLOPT_CONNECTTIMEOUT => 1,
]);
$body = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($code !== 200 || $body === false) {
throw new SearchUnavailable('OpenSearch returned ' . $code);
}
$decoded = json_decode($body, true, 512, JSON_THROW_ON_ERROR);
return array_map(
fn(array $hit) => ['id' => $hit['_id'], 'score' => $hit['_score']] + $hit['_source'],
$decoded['hits']['hits'] ?? []
);
}
}
The key decisions here:
-
fuzziness: AUTO:4,7means terms of length 1-3 match exactly, length 4-6 allow one edit, length 7 and up allow two edits. For CJK we disable fuzziness entirely because edit distance on logographic characters returns nonsense —日本and本日are one transposition apart but mean different things. -
prefix_length: 1forces the first character to match, which dramatically reduces the candidate set for short queries that would otherwise expand into thousands of fuzzy matches. -
max_expansions: 30caps how many terms each fuzzy query can expand to, protecting us from a pathological single-letter search melting the heap. -
function_scoredecays results by publish date with a 30-day Gaussian half-life and boosts by view count (log1pkeeps mega-viral videos from completely dominating). - The CJK subfield is boosted to 4.0 for CJK queries and 0.5 otherwise. This stops
k-popfrom accidentally matching every Korean clip through romanization noise. - The filter on
duration_secondsexcludes Shorts (under 30s) and full episodes (over 30min), which is a TopVideoHub-specific UX choice.
Circuit breaker and FTS5 fallback
OpenSearch processes crash. The JVM does long GC pauses. The VPS reboots. In a search UX, going from fuzzy results to no results because the daemon died for 90 seconds is unacceptable. We wrapped the OpenSearch query in a circuit breaker that falls back to SQLite FTS5 on repeated failure.
The pattern is straightforward, but state has to be process-local because we are on PHP — no shared memory unless we want to involve APCu or Redis. A tiny JSON file in /tmp is enough because LiteSpeed pins the PHP worker pool to a single host.
<?php
declare(strict_types=1);
namespace App\Search;
final class SearchService
{
private const CIRCUIT_FILE = '/tmp/opensearch_circuit.json';
private const FAILURE_THRESHOLD = 3;
private const COOLDOWN_SECONDS = 60;
public function __construct(
private readonly OpenSearchQuery $primary,
private readonly Fts5Search $fallback,
private readonly \App\Logger $log,
) {}
public function search(string $term, int $limit = 24, ?string $region = null): array
{
if ($this->circuitOpen()) {
return $this->fallback->search($term, $limit, $region);
}
try {
$results = $this->primary->search($term, $limit, $region);
$this->recordSuccess();
return $results;
} catch (SearchUnavailable $e) {
$this->recordFailure();
$this->log->warn('search.fallback', ['error' => $e->getMessage()]);
return $this->fallback->search($term, $limit, $region);
}
}
private function circuitOpen(): bool
{
$state = $this->readState();
if ($state['failures'] < self::FAILURE_THRESHOLD) {
return false;
}
return (time() - $state['last_failure']) < self::COOLDOWN_SECONDS;
}
private function recordSuccess(): void
{
$this->writeState(['failures' => 0, 'last_failure' => 0]);
}
private function recordFailure(): void
{
$state = $this->readState();
$this->writeState([
'failures' => $state['failures'] + 1,
'last_failure' => time(),
]);
}
private function readState(): array
{
if (!is_file(self::CIRCUIT_FILE)) {
return ['failures' => 0, 'last_failure' => 0];
}
$raw = @file_get_contents(self::CIRCUIT_FILE);
if ($raw === false) {
return ['failures' => 0, 'last_failure' => 0];
}
return json_decode($raw, true) ?: ['failures' => 0, 'last_failure' => 0];
}
private function writeState(array $state): void
{
@file_put_contents(self::CIRCUIT_FILE, json_encode($state), LOCK_EX);
}
}
After three consecutive failures we stop hitting OpenSearch for 60 seconds and route everything to FTS5. After the cooldown we let one request through; if it succeeds, the counter resets. The fallback FTS5 path returns visibly fewer results for typo-heavy queries, but it always returns something, which is the important property.
Cloudflare and LiteSpeed caching of search responses
The search endpoint gets hammered — roughly 8% of pageviews. We could not let every search query reach origin. But search results are user-typed input, so caching the wrong response is a real risk if Vary is wrong.
The approach was three layers:
- Set
Cache-Control: public, max-age=600, stale-while-revalidate=1800on search responses with a non-empty query string. - Cache key includes the normalized query (lowercased, trimmed, NFKC) plus the region cookie.
- LiteSpeed
<IfModule LiteSpeed>block addsCacheLookup public onfor/search, so LSCache kicks in before PHP even boots.
Cloudflare Free does not cache HTML by default. We added a Cache Rule for URI Path starts with "/search" to set Cache Level to Cache Everything with TTL 600. Combined with Vary: Cookie scoped to only the region cookie, this works without leaking sessions across users.
The gotcha was personalization: logged-in users see watch-later markers on result cards, and those would poison the cache. We solved this by rendering the user-specific overlay client-side from a tiny /api/watchlater.json endpoint that the cache layer skips. Nothing about the cached HTML body is user-specific.
Numbers that mattered
After two weeks of running OpenSearch alongside FTS5 with traffic split 90/10 to OpenSearch, the results were:
- Search result CTR went from 31% to 48%, mostly driven by previously-empty result pages now returning content.
- Median search latency: SQLite FTS5 14ms, OpenSearch 22ms, with the Cloudflare layer absorbing roughly 73% of traffic at near-zero ms.
- Zero-result rate dropped from 19% to 4%. The remaining 4% is genuinely unindexable input (single characters, pasted URLs).
- OpenSearch steady-state RAM: 1.4 GB heap, 600 MB OS cache. Indexing about 12,000 new videos per day across nine regions.
The fuzzy queries that drove the most improvement, in rough order of impact:
- Misspellings of artist names (
blakpink,newgens,itzyvariants). - Diacritic-stripped Vietnamese (
hoa minzymatchingHòa Minzy). - Hyphen and space normalization (
k popmatchingk-pop,bts vmatchingBTS-V). - Romanized CJK once we added a romaji subfield (
aimyonmatchingあいみょん), which is a separate post.
Conclusion
OpenSearch was the right tool for the typo-tolerance problem, but the migration was as much about the surrounding pieces as the engine itself. The circuit breaker that protects PHP workers from a sick OpenSearch process, the LSCache plus Cloudflare cache layer that absorbs most of the search traffic before queries even hit Lucene, and the per-field analyzer strategy for CJK content are what made it production-viable on a budget LiteSpeed host. SQLite FTS5 still has a job as the fallback and as the source of truth for exact-substring features like channel-scoped search, but for free-form user queries with typos, edit-distance fuzziness is non-negotiable. If you are running a similar multi-language aggregator and your search box still goes straight to FTS5 or LIKE queries, the migration is doable in a couple of weeks with one engineer. Start with the index mapping, build the indexing pipeline, then layer the circuit breaker before cutting traffic over.