There's a gap between "it works in my Jupyter notebook" and "it works reliably in production." That gap is where most ML projects get stuck.
This article covers the practical steps for taking a trained PyTorch model and deploying it as a production-ready inference system serving an Android app. It's the workflow we use repeatedly at AIVerse.
The Stack
- Training: Python, PyTorch, YOLOv5/YOLOv8
- Conversion: ONNX, TensorFlow, TFLite
- Serving (optional): Flask/FastAPI on a Linux server
- Mobile runtime: Android (Java/Kotlin) with TFLite Android library
Phase 1: Clean Up Your Training Artifacts
Before converting anything, your trained model needs to be reproducible and clean:
Checkpoint hygiene:
- Save the final model weights (best.pt, not just last.pt)
- Save your training configuration (hyperparameters, augmentation settings)
- Record your validation metrics — you'll need these as your baseline to verify conversion accuracy
Validate on held-out data before converting: Run inference on a representative sample of your production data (not your validation set) and check that results match your expectations. If there are surprises here, fix them before converting — they'll be harder to debug in TFLite.
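A minimal sketch of what that baseline check can look like once you've recorded your metrics — the metric names and tolerance here are illustrative, not from any particular pipeline:

```python
# Sketch: a simple baseline check you can rerun after every conversion step.
# The metric names and the 0.02 tolerance are illustrative placeholders —
# use whatever your validation pipeline actually reports.
def within_baseline(baseline: dict, current: dict, tolerance: float = 0.02) -> list:
    """Return the names of metrics that regressed by more than `tolerance`."""
    regressions = []
    for name, base_value in baseline.items():
        if current.get(name, 0.0) < base_value - tolerance:
            regressions.append(name)
    return regressions

baseline = {"mAP50": 0.91, "mean_confidence": 0.78}
after_conversion = {"mAP50": 0.90, "mean_confidence": 0.77}
print(within_baseline(baseline, after_conversion))  # small drift within tolerance
```

Save the baseline dict alongside the checkpoint so the same numbers are available at every conversion step.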
Phase 2: Converting PyTorch → TFLite
The conversion chain for YOLOv5/YOLOv8 to TFLite:
PyTorch (.pt) → ONNX (.onnx) → TensorFlow SavedModel → TFLite (.tflite)
Step 1: Export to ONNX
# For YOLOv5:
python export.py --weights best.pt --include onnx --opset 12
# For YOLOv8:
from ultralytics import YOLO
model = YOLO('best.pt')
model.export(format='onnx', opset=12)
Use opset 12 — it has the best compatibility across TensorFlow conversion tools.
Step 2: ONNX → TensorFlow SavedModel
pip install onnx-tf
python -c "
import onnx
from onnx_tf.backend import prepare
model = onnx.load('best.onnx')
tf_rep = prepare(model)
tf_rep.export_graph('saved_model')
"
Step 3: SavedModel → TFLite
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model')
# For INT8 quantization (recommended for mobile):
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen # see below
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.float32 # keep output as float
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
f.write(tflite_model)
The representative_dataset_gen function is critical for INT8 quantization — it provides sample inputs so the converter can calibrate the quantization ranges:
def representative_dataset_gen():
for image_path in sample_images[:100]: # 100 samples is usually enough
img = preprocess_image(image_path) # same preprocessing as training
yield [img]
Verify accuracy after each step. Run the same 50-100 test images through PyTorch, ONNX, SavedModel, and TFLite and compare outputs. Each conversion step can introduce subtle differences.
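The comparison itself can be a small helper like this — here synthetic arrays stand in for the real outputs you'd collect from the PyTorch model and each converted runtime:

```python
# Sketch: quantify drift between two stages of the conversion chain.
# In practice `ref` comes from the PyTorch model and `test` from the
# ONNX/SavedModel/TFLite runtime on the same preprocessed image batch.
import numpy as np

def output_drift(ref: np.ndarray, test: np.ndarray) -> dict:
    """Summarize how far two model outputs diverge."""
    diff = np.abs(ref - test)
    return {
        "max_abs_diff": float(diff.max()),
        "mean_abs_diff": float(diff.mean()),
    }

# Synthetic stand-ins shaped like a YOLOv5 640x640 output:
ref = np.random.default_rng(0).normal(size=(1, 25200, 85)).astype(np.float32)
test = ref + np.float32(1e-4)  # small quantization-like perturbation
drift = output_drift(ref, test)
assert drift["max_abs_diff"] < 1e-3  # tune the threshold per model; INT8 drifts more
```

FP32 conversions should agree almost exactly; for INT8 models, compare final detections (boxes after NMS) rather than raw tensors, since small per-element drift is expected.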
Phase 3: The Android Integration
Add the dependency:
// build.gradle
dependencies {
    implementation 'org.tensorflow:tensorflow-lite:2.14.0'
    implementation 'org.tensorflow:tensorflow-lite-support:0.4.4' // provides FileUtil used below
    implementation 'org.tensorflow:tensorflow-lite-gpu:2.14.0' // optional GPU delegate
}
Load and run the model:
class MeterDetector(context: Context) {
private val interpreter: Interpreter
init {
val model = FileUtil.loadMappedFile(context, "model.tflite")
val options = Interpreter.Options().apply {
numThreads = 4
// Optional: GPU delegate for faster inference
// addDelegate(GpuDelegate())
}
interpreter = Interpreter(model, options)
}
fun detect(bitmap: Bitmap): List<Detection> {
val input = preprocessBitmap(bitmap) // resize to 640x640, normalize
val output = Array(1) { Array(25200) { FloatArray(85) } } // YOLOv5 640x640: 25200 boxes x (4 box + 1 objectness + 80 classes); use 5 + your class count
interpreter.run(input, output)
return postprocess(output) // NMS, confidence filtering
}
}
The preprocessing step is critical — it must exactly match what your training pipeline did:
private fun preprocessBitmap(bitmap: Bitmap): ByteBuffer {
val resized = Bitmap.createScaledBitmap(bitmap, 640, 640, true)
val buffer = ByteBuffer.allocateDirect(1 * 640 * 640 * 3 * 4) // 4 bytes per float
buffer.order(ByteOrder.nativeOrder())
val pixels = IntArray(640 * 640)
resized.getPixels(pixels, 0, 640, 0, 0, 640, 640)
for (pixel in pixels) {
buffer.putFloat(((pixel shr 16) and 0xFF) / 255.0f) // R
buffer.putFloat(((pixel shr 8) and 0xFF) / 255.0f) // G
buffer.putFloat((pixel and 0xFF) / 255.0f) // B
}
return buffer
}
If you get unexpectedly poor results on device, preprocessing mismatch is the most common cause. Print the first 10 pixel values from both your Python preprocessing and Android preprocessing and compare them.
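The Python half of that comparison can be as small as this sketch — a synthetic image stands in for your real decode-and-resize step, with one known pixel so the comparison is unambiguous:

```python
# Sketch: dump the first few preprocessed values on the Python side and
# compare them by eye against the same dump from the Android ByteBuffer.
# A synthetic 640x640 image stands in for your real decode/resize step.
import numpy as np

def preprocess(rgb: np.ndarray) -> np.ndarray:
    """Must mirror the Kotlin code: 640x640, scale to [0, 1], R-G-B channel order."""
    return (rgb.astype(np.float32) / 255.0).reshape(-1)

rgb = np.zeros((640, 640, 3), dtype=np.uint8)
rgb[0, 0] = [255, 128, 0]  # one known pixel makes mismatches obvious
flat = preprocess(rgb)
print(flat[:10])  # compare against the first 10 floats written into the ByteBuffer
```

If the Android dump differs, the usual suspects are channel order (BGR vs RGB), normalization (0-255 vs 0-1 vs mean/std), and resize interpolation.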
Phase 4: The Flask API (Optional)
For cases where on-device inference isn't feasible (model too large, device too slow, or shared inference across multiple clients), a Flask API works well:
from flask import Flask, request, jsonify
import torch
from PIL import Image
import io
app = Flask(__name__)
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
model.eval()
@app.route('/detect', methods=['POST'])
def detect():
if 'image' not in request.files:
return jsonify({'error': 'No image'}), 400
image_bytes = request.files['image'].read()
image = Image.open(io.BytesIO(image_bytes)).convert('RGB')  # RGBA/grayscale uploads break inference otherwise
results = model(image)
detections = results.pandas().xyxy[0].to_dict(orient='records')
return jsonify({'detections': detections})
Deploy this behind Nginx with Gunicorn. Use a process manager (systemd or supervisor) to keep it running.
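A starting-point Gunicorn config (Gunicorn config files are plain Python) — the worker count and timeouts here are illustrative defaults, not tuned values:

```python
# gunicorn.conf.py — sketch of a starting configuration for the Flask app above.
# All numbers are illustrative starting points; tune against your hardware.
bind = "127.0.0.1:8000"    # Nginx proxies to this address
workers = 2                # each worker loads its own copy of the model into memory
timeout = 120              # image upload + inference can exceed the 30 s default
max_requests = 1000        # recycle workers periodically to bound memory growth
max_requests_jitter = 50   # stagger recycling so workers don't restart together
```

Run it with `gunicorn -c gunicorn.conf.py app:app` (assuming the Flask file is named app.py). Keep `workers` low: unlike a typical web app, each worker holds a full model in RAM.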
Phase 5: Monitoring in Production
This is the part most tutorials skip. Your model will degrade over time as real-world data drifts from your training distribution.
What to log:
- Inference latency (p50, p95, p99)
- Confidence score distributions — if mean confidence drops, something has changed
- Input image statistics (mean brightness, contrast) — useful for detecting camera changes
- Error rate from downstream validation (if you have ground truth from re-checks)
Simple anomaly detection: Track a rolling 7-day average of mean confidence. If it drops by more than 10% from baseline, investigate. This catches most distribution shifts before they become user-facing problems.
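The rolling check described above fits in a few lines — this sketch assumes one logged mean-confidence value per day, with the 7-day window and 10% threshold taken from the text:

```python
# Sketch: rolling-confidence drift detector. Assumes one mean-confidence
# value logged per day; window and threshold match the rule of thumb above.
from collections import deque

class ConfidenceMonitor:
    def __init__(self, baseline: float, window: int = 7, drop_threshold: float = 0.10):
        self.baseline = baseline
        self.drop_threshold = drop_threshold
        self.values = deque(maxlen=window)  # keeps only the last `window` days

    def record(self, daily_mean_confidence: float) -> bool:
        """Record a day's mean confidence; return True if drift should be investigated."""
        self.values.append(daily_mean_confidence)
        rolling = sum(self.values) / len(self.values)
        return rolling < self.baseline * (1 - self.drop_threshold)

monitor = ConfidenceMonitor(baseline=0.80)
assert not monitor.record(0.79)  # normal day, rolling mean near baseline
assert monitor.record(0.60)      # rolling mean 0.695 is >10% below 0.80
```

Wire the `True` case to whatever alerting you already have; the point is that the check is cheap enough to run on every log rollup.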
Model versioning:
Keep your TFLite models versioned (model_v1.tflite, model_v2.tflite). When you update the model, run both in parallel for a week and compare outputs. Only fully switch over when you're confident the new model performs better.
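The parallel-run comparison can start as crudely as this — the sketch assumes each model's output per image has been reduced to a list of predicted class IDs; a real comparison would also match boxes by IoU:

```python
# Sketch: shadow-run agreement between two model versions. Assumes each
# model's detections per image were reduced to a list of class IDs; a real
# comparison would also match bounding boxes by IoU and compare confidences.
def agreement_rate(v1_results: list, v2_results: list) -> float:
    """Fraction of images where both versions predict the same set of classes."""
    agree = sum(sorted(a) == sorted(b) for a, b in zip(v1_results, v2_results))
    return agree / len(v1_results)

v1 = [[0, 1], [2], [0]]  # model_v1 class IDs per image
v2 = [[1, 0], [2], [1]]  # model_v2 class IDs per image
print(agreement_rate(v1, v2))  # two of three images agree
```

Images where the versions disagree are exactly the ones worth manual review before switching over.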
Common Failure Points
From our experience, here's where things typically go wrong:
- Preprocessing mismatch — the #1 cause of "it works in Python but not on Android"
- Output postprocessing — NMS parameters that work in training don't always work in production lighting conditions
- Model quantization errors — some operators don't quantize cleanly; test thoroughly
- Memory leaks — Bitmap objects in Android must be recycled explicitly
- Thermal throttling — continuous inference on mobile devices causes CPU/GPU throttling after ~10 minutes; test sustained inference, not just peak performance
Stuck at the prototype-to-production transition on your ML project? Contact us — this is exactly the kind of engineering challenge we solve.