How We Built a 200ms Image Moderation API on Cheap CPUs Using YOLOv8 and ONNX

Moderating user-generated content (UGC) is a necessity for almost any modern web application. But if you rely on managed services like Amazon Rekognition or Google Cloud Vision, per-image pricing means your API bill grows with every upload, and it can get eye-watering fast.

Self-hosting is the obvious alternative, but running heavy PyTorch or TensorFlow models on GPU-enabled servers adds cost and operational overhead that most indie projects can't justify.

I wanted to solve this, so I spent the last few months building SafeVision: a real-time, CPU-optimized image moderation API that responds in under 200ms on a basic VPS.

Here is the exact architecture and optimization stack I used to make it happen.

The Architecture: Object Detection + Scene Classification

A single model isn't enough to keep false positives down, so we implemented a dual-model consensus engine:

YOLOv8 Object Detector (Hawk Model): Specialized in identifying specific threat objects like weapons, blades, and blood. It returns precise bounding boxes.
EfficientNet Classifier: Evaluates the overall scene context (NSFW, violence, gore). This context helps prevent, for example, a surgery photo from being flagged as a crime scene just because the detector found blood.
Decision Engine: Merges results from both models based on dynamic threshold rules to make the final "allow" or "block" decision.
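
To make the decision step concrete, here is a minimal sketch of what a consensus rule of this kind could look like. It is an illustration, not SafeVision's actual rule set: the `Detection` class, the `decide` function, and every threshold value are hypothetical. The idea is that a detection has to clear a strict bar on its own, and a relaxed bar when the scene classifier also sees risk.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical thresholds, chosen only for illustration.
STRICT_DETECTION = 0.90   # detector confidence needed when the scene looks benign
RELAXED_DETECTION = 0.50  # detector confidence needed when the scene looks risky
SCENE_CONFIRM = 0.60      # scene score at which the detector bar is relaxed
SCENE_ALONE = 0.95        # scene score that blocks even without any detection


@dataclass
class Detection:
    """One object found by the detector (bounding box omitted for brevity)."""
    label: str
    confidence: float


def decide(detections: List[Detection], scene_scores: Dict[str, float]) -> str:
    """Merge detector and classifier outputs into "allow" or "block"."""
    # Highest unsafe-category score reported by the scene classifier.
    scene_risk = max(scene_scores.values(), default=0.0)

    # A very confident scene classifier can block on its own.
    if scene_risk >= SCENE_ALONE:
        return "block"

    # "Dynamic" threshold: the bar for detections depends on the scene context.
    required = RELAXED_DETECTION if scene_risk >= SCENE_CONFIRM else STRICT_DETECTION
    if any(d.confidence >= required for d in detections):
        return "block"

    return "allow"
```

With these made-up numbers, a 0.70 "blood" detection in a scene scored 0.20 for gore is allowed, while the same detection in a scene scored 0.80 for violence is blocked.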

The Optimization: Porting to ONNX Runtime

Running PyTorch models directly on standard CPU servers often results in poor latency (frequently more than 1.5 seconds per image). To bring that down, we did the following:

ONNX Conversion: We converted our trained YOLOv8 and EfficientNet models to ONNX format.
CPU Execution Provider: Running inference through ONNX Runtime's CPU execution provider reduced the memory footprint by 70% and cut inference time to 150-200ms.
Model Preloading & Caching: Weights are loaded into memory once on startup and the loaded models are reused for every request, so incoming requests never pay for filesystem I/O or model initialization.
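
The load-once idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the SafeVision source: `SessionCache`, `_ort_factory`, and the model file names in the comments are hypothetical. The only ONNX Runtime call used is `onnxruntime.InferenceSession(path, providers=["CPUExecutionProvider"])`.

```python
from typing import Any, Callable, Dict


def _ort_factory(model_path: str) -> Any:
    """Create a CPU-only ONNX Runtime session for one model file."""
    # Imported here so this module can be loaded without onnxruntime installed.
    import onnxruntime as ort

    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


class SessionCache:
    """Keeps one inference session per model path for the life of the process."""

    def __init__(self, factory: Callable[[str], Any] = _ort_factory) -> None:
        self._factory = factory
        self._sessions: Dict[str, Any] = {}

    def get(self, model_path: str) -> Any:
        # Build the session on first use, then always return the same object.
        if model_path not in self._sessions:
            self._sessions[model_path] = self._factory(model_path)
        return self._sessions[model_path]

    def preload(self, *model_paths: str) -> None:
        # Call once at startup so no request pays the model-loading cost.
        for path in model_paths:
            self.get(path)


# Example startup code (file names are placeholders):
#   cache = SessionCache()
#   cache.preload("hawk.onnx", "scene.onnx")
#   session = cache.get("hawk.onnx")
```

Passing the factory in as a parameter is a design choice for this sketch: it lets the caching logic be tested with a stub instead of real model files.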

The API and Client-Side Blurring

We built the backend using FastAPI for its asynchronous request handling. Instead of doing heavy image manipulation on the server, the API returns the bounding boxes of the flagged objects, and the client blurs those regions itself:

{
  "safe": false,
  "categories": [
    {
      "type": "weapon",
      "confidence": 0.94,
      "box": {"x": 120, "y": 80, "width": 250, "height": 180}
    }
  ],
  "latency_ms": 180
}
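
On the client side, a response like the one above has to be turned into regions to blur. Here is a rough sketch of that step in Python. It assumes that `x` and `y` are the top-left corner of the box in pixels, which the response does not state explicitly, and the `blur_regions` helper is hypothetical. It returns `(left, top, right, bottom)` rectangles clamped to the image size, which is the shape most imaging libraries expect for a crop.

```python
import json
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def blur_regions(response_text: str, image_width: int, image_height: int) -> List[Region]:
    """Convert the API's bounding boxes into clamped rectangles to blur."""
    payload = json.loads(response_text)
    regions: List[Region] = []

    for category in payload.get("categories", []):
        box = category.get("box")
        if box is None:
            continue  # nothing to blur for categories without a box

        # Assumption: (x, y) is the top-left corner, in pixels.
        left = max(0, box["x"])
        top = max(0, box["y"])
        right = min(image_width, box["x"] + box["width"])
        bottom = min(image_height, box["y"] + box["height"])

        # Skip boxes that fall entirely outside the image.
        if right > left and bottom > top:
            regions.append((left, top, right, bottom))

    return regions
```

Each rectangle can then be passed to whatever blur the client already uses, for example a canvas filter in the browser or a crop-blur-paste step in an imaging library.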

Give it a try!

I have just launched SafeVision on Product Hunt and opened a free Developer Sandbox (1,000 monthly scans).

Check the live demo here: SafeVision

Product Hunt Launch: If you want to support a solo developer building open alternatives, check us out on Product Hunt: Product Hunt

I would love to hear your feedback on the API latency or how we can optimize ONNX inference even further!