There's a gap between "it works in my Jupyter notebook" and "it works reliably in production." That gap is where most ML projects get stuck.
This article covers the practical steps for taking a trained PyTorch model and deploying it as a production-ready inference system serving an Android app. It's the workflow we use repeatedly at AIVerse.
The Stack
- Training: Python, PyTorch, YOLOv5/YOLOv8
- Conversion: ONNX, TensorFlow, TFLite
- Serving (optional): Flask/FastAPI on a Linux server
- Mobile runtime: Android (Java/Kotlin) with TFLite Android library
Phase 1: Clean Up Your Training Artifacts
Before converting anything, your trained model needs to be reproducible and clean:
Checkpoint hygiene:
- Save the final model weights (best.pt, not just last.pt)
- Save your training configuration (hyperparameters, augmentation settings)
- Record your validation metrics — you'll need these as your baseline to verify conversion accuracy
Validate on held-out data before converting: Run inference on a representative sample of your production data (not your validation set) and check that results match your expectations. If there are surprises here, fix them before converting — they'll be harder to debug in TFLite.
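A minimal sketch of what that baseline check can look like once you've recorded your metrics — the metric names and tolerance here are illustrative, not from any particular pipeline:

```python
# Sketch: a simple baseline check you can rerun after every conversion step.
# The metric names and the 0.02 tolerance are illustrative placeholders —
# use whatever your validation pipeline actually reports.
def within_baseline(baseline: dict, current: dict, tolerance: float = 0.02) -> list:
    """Return the names of metrics that regressed by more than `tolerance`."""
    regressions = []
    for name, base_value in baseline.items():
        if current.get(name, 0.0) < base_value - tolerance:
            regressions.append(name)
    return regressions

baseline = {"mAP50": 0.91, "mean_confidence": 0.78}
after_conversion = {"mAP50": 0.90, "mean_confidence": 0.77}
print(within_baseline(baseline, after_conversion))  # small drift within tolerance
```

Save the baseline dict alongside the checkpoint so the same numbers are available at every conversion step.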
Phase 2: Converting PyTorch → TFLite
The conversion chain for YOLOv5/YOLOv8 to TFLite:
PyTorch (.pt) → ONNX (.onnx) → TensorFlow SavedModel → TFLite (.tflite)
Step 1: Export to ONNX
# For YOLOv5:
python export.py --weights best.pt --include onnx --opset 12
# For YOLOv8:
from ultralytics import YOLO
model = YOLO('best.pt')
model.export(format='onnx', opset=12)
Use opset 12 — it has the best compatibility across TensorFlow conversion tools.
Step 2: ONNX → TensorFlow SavedModel
pip install onnx-tf
python -c "
import onnx
from onnx_tf.backend import prepare
model = onnx.load('best.onnx')
tf_rep = prepare(model)
tf_rep.export_graph('saved_model')
"
Step 3: SavedModel → TFLite
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model')
# For INT8 quantization (recommended for mobile):
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen # see below
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.float32 # keep output as float
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
f.write(tflite_model)
The representative_dataset_gen function is critical for INT8 quantization — it provides sample inputs so the converter can calibrate the quantization ranges:
def representative_dataset_gen():
for image_path in sample_images[:100]: # 100 samples is usually enough
img = preprocess_image(image_path) # same preprocessing as training
yield [img]
Verify accuracy after each step. Run the same 50-100 test images through PyTorch, ONNX, SavedModel, and TFLite and compare outputs. Each conversion step can introduce subtle differences.
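The comparison itself can be a small helper like this — here synthetic arrays stand in for the real outputs you'd collect from the PyTorch model and each converted runtime:

```python
# Sketch: quantify drift between two stages of the conversion chain.
# In practice `ref` comes from the PyTorch model and `test` from the
# ONNX/SavedModel/TFLite runtime on the same preprocessed image batch.
import numpy as np

def output_drift(ref: np.ndarray, test: np.ndarray) -> dict:
    """Summarize how far two model outputs diverge."""
    diff = np.abs(ref - test)
    return {
        "max_abs_diff": float(diff.max()),
        "mean_abs_diff": float(diff.mean()),
    }

# Synthetic stand-ins shaped like a YOLOv5 640x640 output:
ref = np.random.default_rng(0).normal(size=(1, 25200, 85)).astype(np.float32)
test = ref + np.float32(1e-4)  # small quantization-like perturbation
drift = output_drift(ref, test)
assert drift["max_abs_diff"] < 1e-3  # tune the threshold per model; INT8 drifts more
```

FP32 conversions should agree almost exactly; for INT8 models, compare final detections (boxes after NMS) rather than raw tensors, since small per-element drift is expected.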
Phase 3: The Android Integration
Add the dependency:
// build.gradle
dependencies {
    implementation 'org.tensorflow:tensorflow-lite:2.14.0'
    implementation 'org.tensorflow:tensorflow-lite-support:0.4.4' // provides FileUtil used below
    implementation 'org.tensorflow:tensorflow-lite-gpu:2.14.0' // optional GPU delegate
}
Load and run the model:
class MeterDetector(context: Context) {
private val interpreter: Interpreter
init {
val model = FileUtil.loadMappedFile(context, "model.tflite")
val options = Interpreter.Options().apply {
numThreads = 4
// Optional: GPU delegate for faster inference
// addDelegate(GpuDelegate())
}
interpreter = Interpreter(model, options)
}
fun detect(bitmap: Bitmap): List<Detection> {
val input = preprocessBitmap(bitmap) // resize to 640x640, normalize
val output = Array(1) { Array(25200) { FloatArray(85) } } // YOLOv5 640x640: 25200 boxes x (4 box + 1 objectness + 80 classes); use 5 + your class count
interpreter.run(input, output)
return postprocess(output) // NMS, confidence filtering
}
}
The preprocessing step is critical — it must exactly match what your training pipeline did:
private fun preprocessBitmap(bitmap: Bitmap): ByteBuffer {
val resized = Bitmap.createScaledBitmap(bitmap, 640, 640, true)
val buffer = ByteBuffer.allocateDirect(1 * 640 * 640 * 3 * 4) // 4 bytes per float
buffer.order(ByteOrder.nativeOrder())
val pixels = IntArray(640 * 640)
resized.getPixels(pixels, 0, 640, 0, 0, 640, 640)
for (pixel in pixels) {
buffer.putFloat(((pixel shr 16) and 0xFF) / 255.0f) // R
buffer.putFloat(((pixel shr 8) and 0xFF) / 255.0f) // G
buffer.putFloat((pixel and 0xFF) / 255.0f) // B
}
return buffer
}
If you get unexpectedly poor results on device, preprocessing mismatch is the most common cause. Print the first 10 pixel values from both your Python preprocessing and Android preprocessing and compare them.
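The Python half of that comparison can be as small as this sketch — a synthetic image stands in for your real decode-and-resize step, with one known pixel so the comparison is unambiguous:

```python
# Sketch: dump the first few preprocessed values on the Python side and
# compare them by eye against the same dump from the Android ByteBuffer.
# A synthetic 640x640 image stands in for your real decode/resize step.
import numpy as np

def preprocess(rgb: np.ndarray) -> np.ndarray:
    """Must mirror the Kotlin code: 640x640, scale to [0, 1], R-G-B channel order."""
    return (rgb.astype(np.float32) / 255.0).reshape(-1)

rgb = np.zeros((640, 640, 3), dtype=np.uint8)
rgb[0, 0] = [255, 128, 0]  # one known pixel makes mismatches obvious
flat = preprocess(rgb)
print(flat[:10])  # compare against the first 10 floats written into the ByteBuffer
```

If the Android dump differs, the usual suspects are channel order (BGR vs RGB), normalization (0-255 vs 0-1 vs mean/std), and resize interpolation.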
Phase 4: The Flask API (Optional)
For cases where on-device inference isn't feasible (model too large, device too slow, or shared inference across multiple clients), a Flask API works well:
from flask import Flask, request, jsonify
import torch
from PIL import Image
import io
app = Flask(__name__)
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
model.eval()
@app.route('/detect', methods=['POST'])
def detect():
if 'image' not in request.files:
return jsonify({'error': 'No image'}), 400
image_bytes = request.files['image'].read()
image = Image.open(io.BytesIO(image_bytes)).convert('RGB')  # RGBA/grayscale uploads break inference otherwise
results = model(image)
detections = results.pandas().xyxy[0].to_dict(orient='records')
return jsonify({'detections': detections})
Deploy this behind Nginx with Gunicorn. Use a process manager (systemd or supervisor) to keep it running.
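A starting-point Gunicorn config (Gunicorn config files are plain Python) — the worker count and timeouts here are illustrative defaults, not tuned values:

```python
# gunicorn.conf.py — sketch of a starting configuration for the Flask app above.
# All numbers are illustrative starting points; tune against your hardware.
bind = "127.0.0.1:8000"    # Nginx proxies to this address
workers = 2                # each worker loads its own copy of the model into memory
timeout = 120              # image upload + inference can exceed the 30 s default
max_requests = 1000        # recycle workers periodically to bound memory growth
max_requests_jitter = 50   # stagger recycling so workers don't restart together
```

Run it with `gunicorn -c gunicorn.conf.py app:app` (assuming the Flask file is named app.py). Keep `workers` low: unlike a typical web app, each worker holds a full model in RAM.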
Phase 5: Monitoring in Production
This is the part most tutorials skip. Your model will degrade over time as real-world data drifts from your training distribution.
What to log:
- Inference latency (p50, p95, p99)
- Confidence score distributions — if mean confidence drops, something has changed
- Input image statistics (mean brightness, contrast) — useful for detecting camera changes
- Error rate from downstream validation (if you have ground truth from re-checks)
Simple anomaly detection: Track a rolling 7-day average of mean confidence. If it drops by more than 10% from baseline, investigate. This catches most distribution shifts before they become user-facing problems.
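The rolling check described above fits in a few lines — this sketch assumes one logged mean-confidence value per day, with the 7-day window and 10% threshold taken from the text:

```python
# Sketch: rolling-confidence drift detector. Assumes one mean-confidence
# value logged per day; window and threshold match the rule of thumb above.
from collections import deque

class ConfidenceMonitor:
    def __init__(self, baseline: float, window: int = 7, drop_threshold: float = 0.10):
        self.baseline = baseline
        self.drop_threshold = drop_threshold
        self.values = deque(maxlen=window)  # keeps only the last `window` days

    def record(self, daily_mean_confidence: float) -> bool:
        """Record a day's mean confidence; return True if drift should be investigated."""
        self.values.append(daily_mean_confidence)
        rolling = sum(self.values) / len(self.values)
        return rolling < self.baseline * (1 - self.drop_threshold)

monitor = ConfidenceMonitor(baseline=0.80)
assert not monitor.record(0.79)  # normal day, rolling mean near baseline
assert monitor.record(0.60)      # rolling mean 0.695 is >10% below 0.80
```

Wire the `True` case to whatever alerting you already have; the point is that the check is cheap enough to run on every log rollup.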
Model versioning:
Keep your TFLite models versioned (model_v1.tflite, model_v2.tflite). When you update the model, run both in parallel for a week and compare outputs. Only fully switch over when you're confident the new model performs better.
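The parallel-run comparison can start as crudely as this — the sketch assumes each model's output per image has been reduced to a list of predicted class IDs; a real comparison would also match boxes by IoU:

```python
# Sketch: shadow-run agreement between two model versions. Assumes each
# model's detections per image were reduced to a list of class IDs; a real
# comparison would also match bounding boxes by IoU and compare confidences.
def agreement_rate(v1_results: list, v2_results: list) -> float:
    """Fraction of images where both versions predict the same set of classes."""
    agree = sum(sorted(a) == sorted(b) for a, b in zip(v1_results, v2_results))
    return agree / len(v1_results)

v1 = [[0, 1], [2], [0]]  # model_v1 class IDs per image
v2 = [[1, 0], [2], [1]]  # model_v2 class IDs per image
print(agreement_rate(v1, v2))  # two of three images agree
```

Images where the versions disagree are exactly the ones worth manual review before switching over.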
Common Failure Points
From our experience, here's where things typically go wrong:
- Preprocessing mismatch — the #1 cause of "it works in Python but not on Android"
- Output postprocessing — NMS parameters that work in training don't always work in production lighting conditions
- Model quantization errors — some operators don't quantize cleanly; test thoroughly
- Memory leaks — Bitmap objects in Android must be recycled explicitly
- Thermal throttling — continuous inference on mobile devices causes CPU/GPU throttling after ~10 minutes; test sustained inference, not just peak performance
Stuck at the prototype-to-production transition on your ML project? Contact us — this is exactly the kind of engineering challenge we solve.