Pre-call brief · For Son Nguyen · Inspected.com · 2026-05-07

Son — here's how I'd scope the CV/multimodal contract before our call.

You're hiring a senior CV/multimodal lead for compliance flagging of visual media (image and video) against reference docs. I read the spec, mapped a path, and put my honest read here so the call can confirm a direction instead of running discovery. If any of this is wrong for your reality, the call corrects it fast. If it's right, we have a head start.

PJ Zonis · SideGuy Solutions
Encinitas operator · runs the operator-translation layer · works with senior CV/multimodal devs · 858-461-8054
⚡ TL;DR · 30-second read

The bounded prototype is doable in 4-6 weeks (under your 8-12 estimate) if we pick the right architecture upfront and build eval discipline alongside the model — not after. Two viable architectures (single-tower VLM fine-tune vs hybrid pipeline). My lean: hybrid pipeline for compliance-flagging because the doc-side reasoning is brittle inside a VLM. Honest 80/20 read: the model is the easy 80%; eval rigor is the 20% that kills. Senior multimodal lead range: $15-30K for a 4-6 week scoped engagement.

1 · The use case, as I read it

If anything below is off, that's the first 5 minutes of the call.
  • Inputs: visual media (image + video) + reference docs (format unknown — likely a mix of structured policy + PDF guidelines).
  • Output: a "flag" decision — whether the media violates compliance rules defined in the reference docs.
  • The flag: probably hybrid — strict rules where they exist (e.g., "no firearms in frame"), model judgment where the rules are interpretive ("brand-safe context"). Worth confirming.
  • Data: proprietary, real, already exists. Labeling status is the first scoping question — labeled / partially labeled / unlabeled changes the architecture choice.
  • End state: production-ready prototype. Not an exploratory notebook — an actual deployed inference path with monitoring + a human-review loop.
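To make the hybrid flag idea concrete, here is a minimal sketch, assuming strict rules are checked first and a model score handles the interpretive cases. The rule table, the label names, and the 0.5 threshold are hypothetical placeholders, not anything taken from your spec.

```python
from dataclasses import dataclass

# Hypothetical strict rules: detector labels that always trigger a flag.
STRICT_RULES = {"firearm": "no firearms in frame"}


@dataclass
class FlagDecision:
    flagged: bool
    source: str  # "strict_rule", "model_judgment", or "none"
    reason: str


def decide_flag(detected_labels, judgment_score, threshold=0.5):
    """Strict rules win first; interpretive cases fall back to a model score."""
    for label in sorted(detected_labels):
        if label in STRICT_RULES:
            return FlagDecision(True, "strict_rule", STRICT_RULES[label])
    if judgment_score >= threshold:
        reason = f"score {judgment_score:.2f} >= {threshold:.2f}"
        return FlagDecision(True, "model_judgment", reason)
    return FlagDecision(False, "none", "no rule hit and score below threshold")


if __name__ == "__main__":
    print(decide_flag({"firearm", "person"}, judgment_score=0.10))
    print(decide_flag({"person"}, judgment_score=0.72))
```

Recording the source of each decision keeps strict-rule flags explainable even when the interpretive side is a model score.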

2 · Two architecture options

Both ship to a working prototype. The choice depends on how interpretive the compliance rules are.

Option A · Single-tower vision-language model fine-tune

Take a strong open-weight VLM (Qwen2-VL, Llama 3.2 Vision, InternVL, or LLaVA-OneVision depending on license + your inference hardware), fine-tune end-to-end on (media + doc context) → flag pairs.

  • Best when: compliance rules are interpretive, fuzzy, or change frequently. The model learns the doctrine.
  • Pros: single inference call. Easy to deploy. Strong on visual-textual reasoning out of the box.
  • Cons: opaque decisions — hard to explain *why* it flagged. Brittle on edge cases not in training data. Fine-tune cost scales with media volume.
  • Honest: if the docs are short + the rules are vibe-y, this works.

3 · The 5-layer prototype path

Sequential, not waterfall — layer 4 starts in week 2, not week 6.
Layer 1 · Week 1

Data strategy

Audit the proprietary data. Define the eval slice before touching a model — what's the held-out test set? What's the rate of confirmed-flag examples? Identify gaps. Decide labeling strategy if labels are partial.
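A minimal sketch of carving out the held-out set, assuming each example is an (id, is_flag) pair and that we want the test set to keep the same confirmed-flag rate as the full data. The function names and the 20% test fraction are illustrative assumptions. For video, frames from the same clip should be grouped before splitting so they never land on both sides.

```python
import random
from collections import defaultdict


def stratified_holdout(examples, test_fraction=0.2, seed=0):
    """Split (id, is_flag) pairs so the held-out set keeps the flag rate."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for ex_id, is_flag in examples:
        by_label[is_flag].append(ex_id)
    train, test = [], []
    for label, ids in by_label.items():
        rng.shuffle(ids)
        n_test = max(1, round(len(ids) * test_fraction))
        test += [(i, label) for i in ids[:n_test]]
        train += [(i, label) for i in ids[n_test:]]
    return train, test


def flag_rate(examples):
    """Share of examples that are confirmed flags."""
    return sum(1 for _, is_flag in examples if is_flag) / len(examples)


if __name__ == "__main__":
    # Synthetic ids only: every 10th example is a confirmed flag.
    data = [(i, i % 10 == 0) for i in range(1000)]
    train, test = stratified_holdout(data)
    print(len(train), len(test), flag_rate(test))
```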

Layer 2 · Week 1-2

Model selection

Run a 3-model bake-off on the eval slice before committing. Open-weight VLMs vs hybrid components. Pick based on accuracy, latency, deployability, and cost-at-inference — not just paper benchmarks.
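One way the bake-off decision could be made mechanical, as a sketch: treat latency and cost as hard budgets, then take the most accurate survivor. The `Candidate` fields, the budget values, and the candidate names in the usage are hypothetical and would come from your infra constraints and the real eval slice.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float        # measured on the held-out eval slice
    p95_latency_ms: float  # measured on the target inference hardware
    cost_per_1k: float     # inference cost per 1,000 items, USD


def pick_model(candidates, max_latency_ms, max_cost_per_1k):
    """Drop candidates that break a hard budget, then take the most accurate."""
    viable = [
        c for c in candidates
        if c.p95_latency_ms <= max_latency_ms and c.cost_per_1k <= max_cost_per_1k
    ]
    if not viable:
        return None  # nothing fits: revisit the budgets or the shortlist
    return max(viable, key=lambda c: c.accuracy)


if __name__ == "__main__":
    # Placeholder numbers purely to show the mechanics.
    shortlist = [
        Candidate("model_a", 0.92, 900.0, 4.0),
        Candidate("model_b", 0.89, 300.0, 1.5),
        Candidate("model_c", 0.85, 120.0, 0.5),
    ]
    print(pick_model(shortlist, max_latency_ms=500, max_cost_per_1k=2.0))
```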

Layer 3 · Week 2-4

Fine-tuning + decision layer

Fine-tune the chosen architecture on real data. Build the decision layer (hybrid path) or specialize the VLM (single-tower). Continuous eval against the held-out set as we train.

Layer 4 · Week 2-5

Eval discipline

Build the eval harness in parallel — confusion matrix, per-rule slice metrics, false-positive cost vs false-negative cost. This is where most multimodal projects quietly fail. Eval discipline is the deliverable, not the afterthought.
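A minimal sketch of the core of such a harness, assuming each evaluated item is recorded as (rule, y_true, y_pred) with boolean labels. The rule names in the usage are hypothetical.

```python
from collections import defaultdict


def confusion(records):
    """Count tp/fp/fn/tn over (rule, y_true, y_pred) records."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for _, y_true, y_pred in records:
        if y_pred and y_true:
            counts["tp"] += 1
        elif y_pred and not y_true:
            counts["fp"] += 1
        elif y_true and not y_pred:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def precision_recall(c):
    """Precision and recall from a confusion dict; 0.0 when undefined."""
    flagged = c["tp"] + c["fp"]
    actual = c["tp"] + c["fn"]
    precision = c["tp"] / flagged if flagged else 0.0
    recall = c["tp"] / actual if actual else 0.0
    return precision, recall


def per_rule_metrics(records):
    """Slice the same metrics by compliance rule."""
    by_rule = defaultdict(list)
    for rec in records:
        by_rule[rec[0]].append(rec)
    return {rule: precision_recall(confusion(recs)) for rule, recs in by_rule.items()}


if __name__ == "__main__":
    records = [
        ("firearms", True, True),
        ("firearms", True, False),
        ("brand_safety", False, True),
        ("brand_safety", True, True),
    ]
    print(confusion(records))
    print(per_rule_metrics(records))
```

The per-rule slice matters because a decent overall number can hide one rule with poor recall.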

Layer 5 · Week 4-6

Deploy

Inference path on lean infra (your stated preference). Human-review loop for confidence-edge flags. Monitoring on production traffic. Documentation of failure modes + retrain trigger criteria.
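The human-review loop can be sketched as a simple three-way router, assuming the model emits a violation score in [0, 1]. The `low` and `high` cutoffs here are hypothetical and would be tuned on the held-out set against reviewer capacity.

```python
def route(score, low=0.3, high=0.8):
    """Pass confident scores straight through; queue the uncertain band for a person."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= high:
        return "auto_flag"
    if score <= low:
        return "auto_pass"
    return "human_review"


if __name__ == "__main__":
    for s in (0.05, 0.55, 0.93):
        print(s, route(s))
```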

4 · Honest 80/20: where the 20% kills

The model is the easy part. Here's what eats most prototypes.
⚠ The 20% that kills multimodal compliance prototypes

Eval rigor — and the cost asymmetry of false positives vs false negatives

Most multimodal compliance work over-indexes on overall accuracy. Compliance flagging cares about the COSTS of being wrong in each direction:

  • False positive (flagged but compliant) → human reviewer time + creator-experience hit
  • False negative (missed violation) → compliance failure, possibly regulatory exposure

The right model is the one that minimizes weighted error against your actual cost function — not the one with highest F1 on a generic benchmark. This is where eval rigor pays for itself. Defining the cost weights up-front is a 1-hour conversation with whoever owns compliance — and it changes which model you ship.
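To show how the cost weights change what ships, here is a small sketch that picks the decision threshold minimizing total weighted error. The scores, labels, and cost values are made-up illustration data; the real weights come out of that conversation with whoever owns compliance.

```python
def total_cost(scores, labels, threshold, fp_cost, fn_cost):
    """Sum the cost of every wrong decision at a given threshold."""
    cost = 0.0
    for score, is_violation in zip(scores, labels):
        flagged = score >= threshold
        if flagged and not is_violation:
            cost += fp_cost  # reviewer time, creator-experience hit
        elif is_violation and not flagged:
            cost += fn_cost  # missed violation
    return cost


def best_threshold(scores, labels, fp_cost, fn_cost):
    """Try each observed score (plus 'flag nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, fp_cost, fn_cost),
    )


if __name__ == "__main__":
    scores = [0.1, 0.4, 0.6, 0.9]
    labels = [False, True, False, True]
    # Missed violations 10x worse than false alarms: flag aggressively.
    print(best_threshold(scores, labels, fp_cost=1, fn_cost=10))  # 0.4
    # False alarms 10x worse than misses: flag conservatively.
    print(best_threshold(scores, labels, fp_cost=10, fn_cost=1))  # 0.9
```

Same model, same scores, different operating point. The same comparison can be run across candidate models to see which one is cheapest under your actual weights.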

The other quiet killer: data drift on production media. The flagging system that worked in Q1 quietly degrades by Q3 as the input distribution shifts (new content formats, new violation patterns). Building the retrain trigger into the prototype = production-ready. Skipping it = "production-ready" in name only.
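One common way to implement a retrain trigger is a Population Stability Index (PSI) check on the model's score distribution. This is a sketch of that technique, not a claim about what your pipeline needs: it assumes scores in [0, 1], and the 0.25 limit is a widely used rule of thumb that should be treated as a starting assumption.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two sets of scores in [0, 1]."""
    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int(v * bins), bins - 1)  # a score of 1.0 goes in the last bin
            counts[idx] += 1
        # Floor empty bins at eps so the log term stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref = proportions(reference)
    cur = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def should_retrain(reference, current, limit=0.25):
    """Trigger when production scores have drifted past the limit."""
    return psi(reference, current) > limit


if __name__ == "__main__":
    q1_scores = [i / 100 for i in range(100)]        # synthetic baseline
    q3_scores = [0.9 + i / 1000 for i in range(100)]  # synthetic shifted traffic
    print(should_retrain(q1_scores, q1_scores))
    print(should_retrain(q1_scores, q3_scores))
```

Score drift is only a proxy. It catches shifts in the input distribution, but confirming an accuracy drop still takes a periodically labeled sample.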

5 · Timeline + cost honest read

Your spec said 8-12 weeks. Here's why I think 4-6 is doable — and what trades.
  • Realistic timeline: 4-6 weeks. If data is already labeled + decision is single-architecture upfront. Add 2 wks if labeling.
  • Senior multimodal lead: $15-30K. Range based on 4-6 week engagement, scope-defined contract. Hourly $200-300, ~80-100 hrs.
  • Compute/infra: $1-3K. Fine-tune jobs on H100 hours + inference deployment. Lean per your spec.
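As a quick check on the lead-cost range, the hourly rate and hours quoted above multiply out as follows. The function name is just for illustration.

```python
def engagement_range(rate_low, rate_high, hours_low, hours_high):
    """Cheapest case is low rate x low hours; priciest is high rate x high hours."""
    return rate_low * hours_low, rate_high * hours_high


if __name__ == "__main__":
    low, high = engagement_range(200, 300, 80, 100)
    print(f"${low:,} to ${high:,}")  # $16,000 to $30,000
```

That lands close to the quoted $15-30K range.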

What trades for the 4-6 vs 8-12 estimate

The 8-12 week budget probably assumes a multi-architecture exploration phase. If we commit to the hybrid pipeline (Option B) on the call, we skip 2-4 weeks of architecture bake-off and ship to working prototype faster. The trade: the prototype carries less evidence that single-tower would not have done better. We can run a side-by-side validation in week 5 if it matters; it adds 1 week.

If the data labeling story is messier than the spec implies, add 2 weeks to layer 1. Everything else stays.

Mon/Tues call · Solana Beach office · Async-ready

I sent the same to son@inspected.com so we have both rails. Toss me 2-3 windows that fit your day and I'll lock the best one. If you can send 1-2 sample media frames + the reference doc format ahead of time, I come in fully scoped.
