If you're starting a computer vision project today, you're facing a genuinely difficult choice: the object detection landscape has exploded with options. YOLOv5, YOLOv8, YOLOv9, RT-DETR, EfficientDet, DINO... the benchmarks all look impressive, and everyone claims theirs is the best.
Here's a practical guide based on real deployments, not benchmark tables.
The Question You Should Ask First
Before comparing models, clarify your constraints:
- Where does inference run? Edge device (mobile, embedded), on-premise server, or cloud?
- What's the latency requirement? Real-time preview (< 100ms) vs. batch processing (seconds acceptable)?
- What's your dataset size? Hundreds of images vs. tens of thousands?
- What accuracy floor do you need? 90%? 99%? Industry-specific tolerances?
The "best" model is always relative to these constraints. A model that wins on COCO benchmarks might be useless for your specific task.
YOLOv5: Still a Solid Choice
YOLOv5 (Ultralytics, 2020) remains one of the most widely deployed detection models in production — not because it's the newest or theoretically best, but because of practical reasons:
Strengths:
- Mature, battle-tested codebase
- Excellent documentation and community
- Clean PyTorch implementation — easy to debug
- Straightforward TFLite/ONNX export
- Multiple size variants (n/s/m/l/x) for different hardware targets
Weaknesses:
- Not the most accurate on small objects without tuning
- Slower than YOLOv8 on equivalent hardware
- No built-in rotated bounding box support
Best for: Production deployments where stability matters, especially if you need on-device inference and have time to fine-tune.
YOLOv8: The Modern Default
YOLOv8 (Ultralytics, 2023) is what we reach for first on new projects. It improves on YOLOv5 in almost every measurable way.
Strengths:
- ~5-15% mAP improvement on standard benchmarks
- Better small object detection (anchor-free architecture)
- Faster inference at equivalent model sizes
- Cleaner training API
- Better out-of-the-box results with less tuning
Weaknesses:
- Slightly more complex TFLite export (the export pipeline occasionally hits quantization bugs)
- Less battle-tested than YOLOv5 in production (though rapidly catching up)
Best for: New projects where you want the best speed/accuracy tradeoff without experimental risk.
EfficientDet: When Accuracy Is Non-Negotiable
EfficientDet (Google, 2020) achieves higher accuracy than YOLO variants at a significant cost in speed. On high-end hardware without latency constraints, it often wins.
Strengths:
- State-of-the-art accuracy on many benchmarks
- Good scaling properties (D0-D7 variants)
- Native TensorFlow/TFLite support
Weaknesses:
- 3-5x slower than YOLOv8 at equivalent accuracy tiers
- More complex training setup
- Harder to debug
Best for: High-accuracy requirements where inference runs on powerful hardware (server-side GPU, not mobile).
MobileNetSSD: The Edge-First Option
If you're deploying on very constrained hardware (microcontrollers, cheap Android devices, Raspberry Pi), MobileNetSSD variants are worth considering:
- Model size: 5-20 MB
- Inference on mid-range Android: 20-50ms
- Accuracy: lower, but acceptable for simple detection tasks
Best for: Inference on extremely constrained hardware where YOLOv5n is still too slow.
RT-DETR and DINO: The Transformer Wave
Transformer-based detectors (RT-DETR, DINO, Grounding DINO) achieve impressive accuracy, especially for complex scenes with many objects. They're also useful for zero-shot and open-vocabulary detection.
The practical problem: they're expensive. Inference times are 5-20x higher than YOLO variants, and they require significantly more GPU memory. Unless you have a specific need for open-vocabulary detection or complex reasoning, they're overkill for most production deployments today.
A Decision Framework
Here's how we actually choose:
On-device (mobile/embedded)?
├── Very constrained (< 100ms on cheap device): MobileNetSSD
└── Mid-range device: YOLOv5s INT8 or YOLOv8n INT8
Server-side inference?
├── Need best accuracy: EfficientDet-D3+ or YOLOv8l
├── Need low latency + good accuracy: YOLOv8m
└── Batch processing, latency not critical: RT-DETR / DINO
New project, no special constraints?
└── Start with YOLOv8s — tune from there
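The tree above can be sketched as a small helper function. The branch structure mirrors the tree exactly; the function name and parameter names are illustrative, and the returned strings are starting points, not the only valid picks at each branch:

```python
def pick_detector(target, latency_budget_ms=None, accuracy_first=False):
    """Rough first-pass model choice mirroring the decision tree above.

    `target` is "edge" or "server"; anything else falls through to the
    no-special-constraints default. Thresholds are illustrative.
    """
    if target == "edge":
        if latency_budget_ms is not None and latency_budget_ms < 100:
            return "MobileNetSSD"                 # very constrained hardware
        return "YOLOv5s INT8 / YOLOv8n INT8"      # mid-range device
    if target == "server":
        if accuracy_first:
            return "EfficientDet-D3+ / YOLOv8l"   # accuracy over latency
        if latency_budget_ms is None:
            return "RT-DETR / DINO"               # batch, latency not critical
        return "YOLOv8m"                          # low latency + good accuracy
    return "YOLOv8s"                              # default starting point
```

The point of writing it down as code is that it forces the constraint questions from the top of this article to be answered explicitly before any model is chosen.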
What Benchmark Tables Don't Tell You
A few things you'll only learn by deploying:
Domain gap is the biggest variable. A model that scores 55 mAP on COCO might score 72 mAP on your specific domain after fine-tuning — and the ordering between models can flip completely. Always benchmark on your data.
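"Benchmark on your data" can be as simple as a recall check against your own labeled boxes. Here is a minimal sketch; `iou` and `detection_recall` are illustrative names, not a library API, and this is deliberately simpler than COCO mAP (greedy matching, no confidence ranking, no per-class breakdown):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def detection_recall(predictions, ground_truth, iou_thresh=0.5):
    """Fraction of ground-truth boxes matched by some prediction.

    Greedy matching: each prediction can satisfy at most one box.
    """
    matched, used = 0, set()
    for gt in ground_truth:
        for i, pred in enumerate(predictions):
            if i not in used and iou(pred, gt) >= iou_thresh:
                used.add(i)
                matched += 1
                break
    return matched / len(ground_truth) if ground_truth else 1.0
```

Run this per model candidate on a held-out slice of your own data; if the ordering disagrees with the COCO leaderboard, trust this number.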
Quantization affects models differently. INT8 quantization reduces YOLOv5 accuracy by ~2%, but might reduce a poorly-designed model by 8-10%. Test this before committing to a model for mobile deployment.
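A simple go/no-go check makes this concrete: evaluate the FP32 and INT8 exports on the same validation set and gate on the relative drop. The function name and the 3% default budget are assumptions for illustration; pick a budget from your own accuracy floor:

```python
def quantization_drop_ok(fp32_map, int8_map, max_drop_pct=3.0):
    """Return True if the INT8 accuracy loss stays within budget.

    mAP values are fractions (e.g. 0.55); max_drop_pct is the relative
    drop you are willing to accept (3.0 is an illustrative default).
    """
    drop_pct = 100.0 * (fp32_map - int8_map) / fp32_map
    return drop_pct <= max_drop_pct
```

If a candidate model fails this check, quantization-aware training may recover some of the loss, but it is often cheaper to pick an architecture that quantizes gracefully in the first place.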
Inference framework matters. The same model running in TFLite vs. ONNX Runtime vs. Core ML can have dramatically different performance characteristics on the same hardware.
Production monitoring reveals what benchmarks hide. In production, you'll see edge cases: unusual lighting, occlusion, orientation. Track confidence distributions over time. If you start seeing more low-confidence predictions, your data distribution has shifted.
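Tracking confidence distributions does not require heavy tooling. A minimal sketch, assuming you can tap each detection's confidence score as it is produced; the class name, the 0.3 low-confidence cutoff, and the 2x-baseline alarm factor are all illustrative and should be calibrated on your own traffic:

```python
from collections import deque

class ConfidenceDriftMonitor:
    """Flag when the share of low-confidence detections in a sliding
    window rises well above the rate observed at deploy time."""

    def __init__(self, baseline_low_rate, low_conf=0.3, window=1000, factor=2.0):
        self.baseline = baseline_low_rate  # low-confidence rate at deploy time
        self.low_conf = low_conf           # "low confidence" cutoff
        self.factor = factor               # alarm at factor x baseline
        self.scores = deque(maxlen=window) # most recent confidences only

    def observe(self, confidence):
        self.scores.append(confidence)

    def drifted(self):
        if not self.scores:
            return False
        low_rate = sum(s < self.low_conf for s in self.scores) / len(self.scores)
        return low_rate > self.factor * self.baseline
```

When `drifted()` fires, that is usually the cue to sample recent inputs for labeling: the data distribution has moved and the model needs fresh fine-tuning data.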
Our Practical Recommendation
For most projects we encounter: start with YOLOv8s, fine-tune on your domain data, and measure what you actually care about (not mAP, but your domain-specific accuracy metric). If you need mobile deployment, export to TFLite INT8 and measure latency on your actual target device.
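"Measure latency on your actual target device" deserves a tiny harness of its own, because single-shot timings lie. A sketch, where `infer_fn` is a placeholder for whatever runs your exported model (e.g. a TFLite interpreter invocation):

```python
import statistics
import time

def measure_latency(infer_fn, sample, warmup=10, runs=100):
    """Report median and p95 latency in milliseconds for a callable.

    Warmup runs let caches, thread pools, and delegate initialization
    settle before anything is recorded.
    """
    for _ in range(warmup):
        infer_fn(sample)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer_fn(sample)
        times.append((time.perf_counter() - t0) * 1000.0)
    times.sort()
    return {
        "median_ms": statistics.median(times),
        "p95_ms": times[int(0.95 * len(times)) - 1],
    }
```

Watch the p95, not just the median: on mobile, thermal throttling and scheduler noise show up in the tail first, and that tail is what your users feel.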
Don't switch architectures chasing benchmark points. The right model is the one that works for your deployment environment and passes your accuracy requirements — not the one at the top of some paper's comparison table.
Have a computer vision project where model selection is a key question? Reach out — we're happy to discuss your constraints.