AI Vision - v1
The First Two Weeks
Recently, I have been working with Dustin on some prototyping and creating a proof of concept (POC) for an AI Vision product. It has been a whirlwind two weeks, and I wanted to share a bit of the process, how it went, and some lessons learned.
A Little on Autonomy
We were given great autonomy as to how we worked.
We were able to pick the following:
languages
editors
AI coding assistants
cloud platform
the idea for the product
The two of us worked very fluidly, sometimes separately and sometimes pair-programming. Initially, each of us conducted our own investigation and experimentation.
We were not alone, though.
We had another teammate, experienced with AI, who acted as a product owner. They helped with stand-ups, managed items on the GitHub project, and obtained clarifications on our questions so we had direction on solving the right problems. They also connected us with other subject matter experts.
We were expected to have a daily 30-minute stand-up, a larger group weekly stand-up, and a weekly demo.
We consulted other colleagues within our company whenever we needed a subject matter expert to guide us, holding huddles with at least seven different people at various times.
Our First Demo
The first week’s demo came quickly, after only two days, but we managed to have something to show.
Our first demo was a container running a multimodal vision-language model. We provided a pre-defined prompt and an image, and the model returned a textual analysis of the image.
Our Tech Stack
Early on, I decided to use a tech stack that was unfamiliar to me. This effort was a prime opportunity to try out and learn new things. The decision made me uncomfortable at times, especially since I was still setting up my basically fresh MacBook for development. However, if you are not uncomfortable at times, you may not actually be pushing yourself enough to learn new things. Eventually, as things settled in and my setup came together, I reached a good balance between intrigue/learning and being overwhelmed/uncomfortable.
Our tech stack:
VS Code
Azure
Ollama
Node
I had spent the last decade developing with Scala and targeting AWS cloud, so this was a completely new experience for me.
I had also wanted to learn Claude Code and AgentOS. I had seen a demo of spec-driven development with multiple sub-agents before this effort. Using this AI assistant and framework appeared to be the best way to produce good results quickly.
My Experiments with Edge Computing
I began investigating edge-based model inference by training an Ultralytics YOLO model on my local laptop, using Python and open-source datasets found on Kaggle.
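For context, a minimal sketch of what that local training setup looked like, assuming the Ultralytics Python package and a hypothetical dataset.yaml describing the Kaggle data in YOLO format:

```python
from ultralytics import YOLO  # pip install ultralytics

# Start from a pretrained nano checkpoint and fine-tune it on the local dataset.
# "dataset.yaml" is a placeholder path describing images/labels in YOLO format.
model = YOLO("yolov8n.pt")
model.train(
    data="dataset.yaml",
    epochs=50,
    imgsz=640,
    batch=8,        # keep the batch small to stay within limited GPU/CPU memory
    device="cpu",   # no CUDA on an Intel MacBook; use a GPU index where available
)
```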
Challenges with Training Locally
Lack of GPU Support/Unsupported CPUs
I discovered my laptop was not up to the task. I was able to train a model on approximately 1.4k images, which took over 20 hours and only reached 30-40% accuracy, which is not stellar. Next, I started training a model on approximately 4.6k images, assuming more images in the dataset would produce better outcomes. That was the breaking point for my computer, as training was estimated to take over 20 days. I doubt my hardware was being targeted correctly. I have an older 2019 Intel-based MacBook without an NVIDIA graphics card, so using CUDA to make efficient use of my GPU was not feasible. My graphics card also had only 4GB of memory, which is pretty small by today’s standards. I found out later that there was a newer release with better support for my hardware, but by then I was finished trying to get things to work locally and had moved on.
Out of Memory (OOM) Errors
Training would produce OOMs by overwhelming my graphics card’s memory. I’d have to tune things and restart from scratch.
Fine-Grained Tuning
Later, I learned you can cut off training early if your model is no longer improving. Additionally, with some extra coding/configuration, you can attempt to auto-resume from the last.pt or best.pt checkpoint files.
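A minimal sketch of both ideas with the Ultralytics API; the checkpoint path is the default one Ultralytics writes during training, and the dataset file is a placeholder:

```python
import os

from ultralytics import YOLO

LAST = "runs/detect/train/weights/last.pt"  # checkpoint Ultralytics writes while training

if os.path.exists(LAST):
    # Pick an interrupted run back up from the last checkpoint.
    YOLO(LAST).train(resume=True)
else:
    # Fresh run; patience stops training early once validation metrics stop improving.
    YOLO("yolov8n.pt").train(data="dataset.yaml", epochs=100, patience=20)
```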
Pretrained Models
I ended up using pre-trained models for Ultralytics YOLOv8 and YOLOv10 (nano and small models) provided in the open-source dataset’s releases. The YOLOv8 models had higher confidence (around 80-90%), but the YOLOv10 models (around 70-80% confidence) handled edge cases much better. Claude made it easy to set all of this up. I ended up abandoning the train-on-my-own-laptop route, as my teammate on the project was getting better results from a cloud multimodal LLM (Qwen3-VL with 235 billion parameters). The performance of the Qwen model makes sense: the larger your model is, the better results you’ll generally get. The YOLO nano/small models would run well on edge devices, such as phones, tablets, or Raspberry Pis. It takes more work to get models to run inference on edge devices because they have GPU/CPU/memory constraints. My edge models also returned bounding boxes and confidences, which is image object detection rather than the image understanding you get from the cloud model. In the cloud, you can scale to whatever GPU/CPU you want, though scaling usually incurs additional costs.
Image Understanding
Prompt: How many people are present, and are they all wearing stocking caps?
Input: Normalized base64 encoded image
Output: There are two people present. A man and a woman. Both of them are wearing stocking caps.
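A minimal sketch of that kind of image-understanding request against a locally running Ollama server; the model name and image path are placeholders, and any vision-capable model pulled into Ollama would work:

```python
import base64

import requests

# Read the image and encode it as base64, as Ollama expects for vision prompts.
with open("people.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:11434/api/generate",  # default local Ollama endpoint
    json={
        "model": "llava",  # placeholder: any vision-capable model pulled into Ollama
        "prompt": "How many people are present, and are they all wearing stocking caps?",
        "images": [image_b64],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```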
Object Detection
Prompt: N/A
Input: Normalized base64 encoded image
Output:
Person: Bounding Box (5,10,50,100), Confidence 90%
Person: Bounding Box (55,10,50,100), Confidence 93%
Stocking Cap: Bounding Box (5,10,40,40), Confidence 89%
Stocking Cap: Bounding Box (55,10,35,35), Confidence 91%
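For comparison, a rough sketch of producing that kind of output with a pretrained Ultralytics model. The stock yolov8n.pt checkpoint only knows the COCO classes (for example, person); the custom-class models I actually used came from the open-source dataset’s releases:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")             # pretrained nano model
results = model("people.jpg", conf=0.5)

for box in results[0].boxes:
    label = model.names[int(box.cls)]
    x, y, w, h = box.xywh[0].tolist()  # box center x/y plus width/height
    print(f"{label}: Bounding Box ({x:.0f},{y:.0f},{w:.0f},{h:.0f}), "
          f"Confidence {float(box.conf) * 100:.0f}%")
```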
Our Second Demo
For our second demo, we had a fully functional application deployed on Azure. We had implemented simple evaluations to measure and graph the confusion matrix (false positives/negatives, true positives/negatives). We also calculated accuracy, precision, recall, and F1 score. These metrics tell you how well each version of your system (model, prompt, application logic) is performing over time. We also had a misbehavior gallery where you could view and review false positives/negatives with a subject matter expert to understand the deficits/limitations of your system and discuss potential improvements.
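The metric math itself is simple; here is a small sketch of how those numbers fall out of the confusion-matrix counts (the counts below are made up for illustration):

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only.
print(confusion_metrics(tp=90, fp=5, tn=80, fn=10))
```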
Additional Work
We created a GitHub Actions CI/CD pipeline to deploy our app, using Bicep for IaC, and added auth around our app/APIs.
We then added validation against our golden dataset after every CI/CD deployment to prevent regressions.
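A hedged sketch of what that post-deployment check could look like; the endpoint URL, golden dataset format, and threshold are all hypothetical, and a non-zero exit code is what fails the pipeline step:

```python
import json
import sys

import requests

GOLDEN_PATH = "golden_dataset.json"  # hypothetical: [{"image": "<base64>", "expected": "..."}]
ENDPOINT = "https://example.azurewebsites.net/api/analyze"  # placeholder URL
THRESHOLD = 0.90                     # minimum acceptable accuracy before the step fails

with open(GOLDEN_PATH) as f:
    cases = json.load(f)

# Call the deployed API for each golden case and count correct answers.
correct = sum(
    requests.post(ENDPOINT, json={"image": c["image"]}, timeout=60).json().get("label")
    == c["expected"]
    for c in cases
)

accuracy = correct / len(cases)
print(f"Golden dataset accuracy: {accuracy:.2%}")
sys.exit(0 if accuracy >= THRESHOLD else 1)  # non-zero exit fails the CI/CD step
```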
Lessons Learned
Training
There were some gaps in these datasets. When the image being evaluated (at inference) was overly complicated, the model, which hadn’t been trained on those complex cases, would produce false positives/negatives.
In the first phase of standing up a new AI vision project, it’s advisable to:
Have a controlled environment when creating a dataset.
Cover all the edge cases.
Create a golden dataset of primary cases, and evaluate that the model doesn’t regress over time.
Training your own models can be complex in terms of hardware, config, and auto-resume. It’s ill-advised to train models on your laptop unless you have some specialized heavy-duty hardware.
Evaluating/Inference
For better results, control the environment for the evaluation/inference on images:
Limit the number of people visible.
Don’t have overlapping people.
Don’t have people hidden/occluded by other things.
Don’t have people very distant or cut off in the image.
Have your training set cover different orientations, or have the people being evaluated in the same standard orientation as the training set.
Overly complicated or noisy scenery in the background confuses the models.
Standardize your input images to a common resolution/compression to get consistent results; a small normalization sketch follows this list.
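A minimal normalization sketch using Pillow (the target size and JPEG quality are illustrative), producing the kind of normalized base64 string shown in the earlier examples:

```python
import base64
import io

from PIL import Image  # pip install pillow


def normalize_image(path: str, max_size=(640, 640), quality: int = 85) -> str:
    """Resize and re-encode an image so every inference call sees a
    consistent resolution/compression, then return it as base64."""
    img = Image.open(path).convert("RGB")
    img.thumbnail(max_size)                       # fit within max_size, keep aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode("utf-8")


image_b64 = normalize_image("people.jpg")
```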
Model Size
Edge models are smaller and more appropriate for devices like phones/tablets/Raspberry Pis to perform inference locally; however, it’s harder to get good results.
Larger cloud models produce better results; however, they require making an API call.
Summary
The last two weeks have been super fun, and I’ve learned so many new things. Lean Techniques deserves a special thanks for this unique opportunity to learn and grow; kudos and thank you. I’m excited and looking forward to the next demo and learning opportunity to create awesome products.

