Object detection has become the backbone of many important products: safety systems that track hands near dangerous machinery, retail analytics counting products and people, autonomous vehicles, warehouse robots, ergonomic assessment tools, and more. Traditionally, those systems all shared one big assumption: you decide up front which objects matter, hard-code that label set, and then spend a great deal of time, money, and human effort annotating data for those classes.
Vision-language models (VLMs) and open-vocabulary object detectors (OVDs) eliminate this assumption. Instead of baking labels into the weights, you pass them in as prompts: “red mug”, “overhead luggage bin”, “safety helmet”, “tablet on the desk.” And surprisingly, the best of these models now match or even beat strong closed-set detectors without ever seeing the target dataset’s labels.
In my day job, I work on real-time, on-device computer vision for ergonomics and workplace safety: think iPads or iPhones checking posture, reach, and PPE in warehouses and aircraft cabins. For a long time, every new “Can we detect X?” request meant another round of data collection, labeling, and retraining. When we started experimenting with open-vocabulary detectors, the workflow flipped: we could prompt for new concepts, see whether the signals looked promising in real video, and only then decide whether it was worth investing in a dedicated closed-set model.
This article walks through how detection evolved from hand-crafted features to CNNs, why fixed label sets hurt in enterprise settings, how open-vocabulary detectors stack up against closed-set models on COCO and LVIS, and a practical pattern for combining the two in production.
Object detection tries to answer two questions for each image (or video frame): what objects are present, and where exactly is each one (usually as a bounding box)?
Unlike plain image classification (one label per image), detection says “two people, one laptop, one chair, one cup” with coordinates. That’s what makes it useful for the kinds of products above: safety monitoring, retail analytics, robotics, and ergonomic assessment.
In traditional pipelines, the object catalog (your label set) is fixed: for example, the 80 COCO classes or the 1,203 LVIS classes. Adding “blue cardboard box”, “broken pallet”, or a specific SKU later is where things start to hurt.
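To make that concrete, here is a minimal closed-set sketch using torchvision’s pretrained Faster R-CNN (the image path is a placeholder). The label catalog ships inside the weights, so anything you did not train for simply cannot be predicted.

```python
# Minimal closed-set detection sketch with torchvision's pretrained Faster R-CNN.
# The label space is frozen at training time: the model can only ever output
# names that already live in `categories`.
import torch
from PIL import Image
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_Weights,
    fasterrcnn_resnet50_fpn,
)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()
categories = weights.meta["categories"]            # the hard-coded label catalog
print(f"{len(categories)} classes baked into the weights")

image = Image.open("office.jpg").convert("RGB")    # placeholder image path
batch = [weights.transforms()(image)]

with torch.no_grad():
    (prediction,) = model(batch)

for box, label, score in zip(prediction["boxes"], prediction["labels"], prediction["scores"]):
    if score > 0.5:
        # "broken pallet" or a specific SKU can never show up here.
        print(categories[label], round(score.item(), 2), box.tolist())
```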
Before deep learning, detectors used hand-crafted features like HOG (Histograms of Oriented Gradients) and part-based models. You’d slide a window over the image, compute features, and run a classifier.
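Code from that era looked roughly like the sketch below, using the HOG + linear SVM pedestrian detector that still ships with OpenCV (the image path is a placeholder): one hand-crafted feature set, one classifier, one class.

```python
# Classical sliding-window detection: OpenCV's built-in HOG + linear SVM
# pedestrian detector. Hand-crafted features, a single "person" class.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("street.jpg")  # placeholder image path

# detectMultiScale slides the detection window across an image pyramid.
rects, scores = hog.detectMultiScale(
    image, winStride=(8, 8), padding=(8, 8), scale=1.05
)

for (x, y, w, h), score in zip(rects, scores):
    print(f"person at ({x}, {y}, {w}, {h}), score {float(score):.2f}")
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```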
Two representative classical systems on PASCAL VOC 2007: DPM voc-release5 (no context) at 33.7% mAP, and Regionlets at 41.7% mAP.
VOC 2007 has 5,011 train+val images and 4,952 test images (9,963 total).
Then came CNNs: Fast R-CNN (VGG16) reached 70.0% mAP and Faster R-CNN (VGG16) 73.2% mAP on the same VOC 2007 test set.
The 07+12 setup uses VOC 2007 trainval (5,011 images) + VOC 2012 trainval, giving about 16.5k training images.
So on the same dataset, going from hand-crafted to CNNs roughly doubled performance:
| Dataset | Model | # training images (VOC) | mAP @ 0.5 |
|----|----|----|----|
| VOC 2007 test | DPM voc-release5 (no context) | 5,011 (VOC07 trainval) | 33.7% |
| VOC 2007 test | Regionlets | 5,011 (VOC07 trainval) | 41.7% |
| VOC 2007 test | Fast R-CNN (VGG16, 07+12) | ≈16.5k (VOC07+12) | 70.0% |
| VOC 2007 test | Faster R-CNN (VGG16, 07+12) | ≈16.5k (VOC07+12) | 73.2% |
That’s the story we’ve been telling for a decade: deep learning crushed classical detection.
But all of these are closed-set: you pick a fixed label list, and the model can’t recognize anything outside it.
Closed-set detectors (Faster R-CNN, YOLO, etc.) are great if your label set is known up front, changes rarely, and is well covered by annotated training data.
In practice, especially in enterprise settings, the label set keeps changing: new products, new parts, new hazards, new customer requests, and every change restarts the collect-label-retrain loop.
Technically, closed-set detectors are optimized for one label space: the classification head is a fixed N-way classifier, so adding class N+1 means changing the head, collecting and labeling new data, and retraining.
This is where open-vocabulary detectors and vision-language models become interesting.
Open-vocabulary detectors combine two ideas: a detector that proposes candidate regions, and a text encoder (in the spirit of CLIP-style vision-language pre-training) that embeds arbitrary category names or phrases.
Instead of learning a classifier over a fixed set of one-hot labels, the detector learns to align region features and text embeddings in a shared space. At inference time, you can pass any string: “steel toe boot”, “forklift”, “wrench”, “coffee stain”, and the model scores regions against those text prompts.
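A rough way to picture the difference, as a conceptual sketch rather than any particular model’s actual code: the closed-set head scores regions against a fixed weight matrix, while the open-vocabulary head scores them against text embeddings computed on the fly from whatever prompts you pass in.

```python
# Conceptual sketch only, not any specific model's implementation.
import torch
import torch.nn.functional as F

region_features = torch.randn(100, 512)       # 100 candidate regions, 512-d features

# Closed-set: a fixed classifier matrix, one row per hard-coded class.
num_classes = 80
closed_head = torch.nn.Linear(512, num_classes)
closed_scores = closed_head(region_features)  # shape (100, 80), and it can never be anything else

# Open-vocabulary: embed arbitrary prompt strings with a text encoder
# (stubbed out here), then score regions by cosine similarity.
def embed_text(prompts):                      # stand-in for a real CLIP-style text encoder
    return torch.randn(len(prompts), 512)

prompts = ["steel toe boot", "forklift", "coffee stain"]
text_embeddings = F.normalize(embed_text(prompts), dim=-1)
region_embeddings = F.normalize(region_features, dim=-1)
open_scores = region_embeddings @ text_embeddings.T  # shape (100, 3): one column per prompt,
                                                     # and the prompts can change on every call
```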
Examples include Grounding DINO (and its 1.5 Edge variant), YOLO-World, and the open-vocabulary-friendly Detic, all of which show up in the benchmarks below.
These models are usually pre-trained on millions of image–text pairs from the web, then sometimes fine-tuned on detection datasets with large vocabularies.
In the side-by-side image below, the open-vocabulary Grounding DINO model is prompted with fine-grained phrases like “armrests,” “mesh backrest,” “seat cushion,” and “chair,” and it correctly identifies each region, not just the overall object. This works because Grounding DINO connects image regions with text prompts during inference, enabling it to recognize categories that weren’t in its original training list. In contrast, the closed-set Fast R-CNN model is trained on a fixed set of categories (such as those in the PASCAL VOC or COCO label space), so it can only detect the broader “chair” class and misses the finer parts. This highlights the real-world advantage of promptable detectors: they can adapt to exactly what you ask for without retraining, while still maintaining practical performance. It also shows why open-vocabulary models are so promising for dynamic environments where new items, parts, or hazards appear regularly.
Promptable vs. closed-set detection on the same scene. Grounding DINO (left) identifies armrests, mesh backrest, seat cushion, and the overall chair; Fast R-CNN (right) detects only the chair. Photo © 2025 Balaji Sundareshan; original photo by the author.
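If you want to try this kind of prompting yourself, here is a minimal sketch assuming the Hugging Face transformers port of Grounding DINO (the IDEA-Research/grounding-dino-tiny checkpoint). Argument names have shifted slightly between library versions, so treat it as the general shape rather than copy-paste-exact code.

```python
# Zero-shot, promptable detection sketch with the transformers port of Grounding DINO.
# Prompts are lowercase phrases separated by periods; the image path is a placeholder.
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).eval()

image = Image.open("office_chair.jpg").convert("RGB")   # placeholder image path
text = "a chair. armrests. mesh backrest. seat cushion."

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Post-processing arguments may differ slightly depending on your transformers version.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)

for box, score, label in zip(results[0]["boxes"], results[0]["scores"], results[0]["labels"]):
    print(label, round(float(score), 2), [round(float(v), 1) for v in box])
```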
Let’s look at COCO 2017, the standard 80-class detection benchmark. COCO train2017 has about 118k training images and 5k val images.
A strong closed-set baseline: EfficientDet-D7, trained on the full train2017 set, reaches 52.2 AP on COCO test-dev.
Now compare that to Grounding DINO: zero-shot, with no COCO training images at all, it reports 52.5 AP, and after fine-tuning on COCO it reaches 63.0 AP.
| Dataset | Model | # training images from COCO | AP@[0.5:0.95] |
|----|----|----|----|
| COCO 2017 test-dev | EfficientDet-D7 (closed-set) | 118k (train2017) | 52.2 AP |
| COCO det. (zero-shot) | Grounding DINO (open-vocab, zero-shot) | 0 (no COCO data) | 52.5 AP |
| COCO det. (supervised) | Grounding DINO (fine-tuned) | 118k (train2017) | 63.0 AP |
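The AP@[0.5:0.95] column is the standard COCO box metric, averaged over IoU thresholds from 0.5 to 0.95. For reference, this is roughly how such a number is computed locally with pycocotools, assuming you already have the val2017 annotations and your model’s predictions exported in COCO’s JSON result format (both file paths below are placeholders):

```python
# Sketch of the standard COCO box-AP evaluation with pycocotools.
# Both file paths are placeholders: ground-truth annotations plus your
# model's detections in COCO result format.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections_val2017.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # the first printed line is AP @ IoU=0.50:0.95
```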
You can fairly say:
An open-vocabulary detector, trained on other data, matches a COCO-specific SOTA detector on COCO, and then beats it once you fine-tune.
That’s a strong argument for reusability: with OVDs, you get decent performance on new domains without painstaking dataset-specific labeling.
In our own experiments on an office ergonomics product, we’ve seen a similar pattern: a promptable detector gets us to a usable baseline quickly, and a small fine-tuned model does the heavy lifting in production.
COCO has 80 classes. LVIS v1.0 is more realistic for enterprise: ~100k train images, ~20k val, and 1,203 categories with a long-tailed distribution.
The Copy-Paste paper benchmarks strong instance segmentation and detection models on LVIS v1.0. With EfficientNet-B7 NAS-FPN and a two-stage training scheme, they report 41.6 box AP.
Another line of work, Detic, hits 41.7 mAP on the standard LVIS benchmark across all classes, using LVIS annotations plus additional image-level labels.
Two representative OVDs: YOLO-World, which reaches 35.4 AP zero-shot on LVIS v1.0, and Grounding DINO 1.5 Edge, which reports 36.2 AP zero-shot on LVIS-minival.
These models use no LVIS training images; they rely on large-scale pre-training with grounding annotations and text labels, and are then evaluated on LVIS as a new domain.
| Dataset / split | Model | # training images from LVIS | AP (box) |
|----|----|----|----|
| LVIS v1.0 (val) | Eff-B7 NAS-FPN + Copy-Paste (closed-set) | 100k (LVIS train) | 41.6 AP |
| LVIS v1.0 (all classes) | Detic (open-vocab-friendly, LVIS-trained) | 100k (LVIS train) | 41.7 mAP |
| LVIS v1.0 (zero-shot) | YOLO-World (open-vocab, zero-shot) | 0 (no LVIS data) | 35.4 AP |
| LVIS-minival (zero-shot) | Grounding DINO 1.5 Edge (open-vocab, edge-optimized) | 0 (no LVIS data) | 36.2 AP |
The takeaway worth emphasizing:
On LVIS, the best open-vocabulary detectors reach ~35–36 AP in pure zero-shot mode, not far behind strong closed-set models in the low-40s AP that use 100k fully annotated training images.
That’s a powerful trade-off story for enterprises: ~10k+ human hours of annotation vs zero LVIS labels for a ~5–6 AP gap.
In one of our internal pilots, we used an open-vocab model to sweep through a few hundred hours of warehouse video with prompts like “forklift”, “ladder”, and “cardboard boxes on the floor.” The raw detections were noisy, but they gave our annotators a huge head start: instead of hunting for rare events manually, they were editing candidate boxes. That distilled into a compact closed-set model we could actually ship on edge hardware, and it only existed because the open-vocab model gave us a cheap way to explore the long tail.
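The sweep itself needs very little machinery. Here is a rough sketch of that pre-labeling loop, with a hypothetical detect(frame, prompts) function standing in for whichever open-vocabulary model you run, and candidate boxes dumped to JSON for annotators to edit:

```python
# Sketch of an open-vocab pre-labeling sweep over video.
# `detect(frame, prompts)` is a hypothetical wrapper around your chosen
# open-vocabulary detector; it returns [(label, score, [x1, y1, x2, y2]), ...].
import json
import cv2

PROMPTS = ["forklift", "ladder", "cardboard boxes on the floor"]
SAMPLE_EVERY_N_FRAMES = 30          # ~1 fps at 30 fps video keeps the sweep cheap
MIN_SCORE = 0.3                     # loose threshold: annotators delete boxes, they don't hunt

def sweep(video_path, detect, out_path="candidates.json"):
    candidates = []
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % SAMPLE_EVERY_N_FRAMES == 0:
            for label, score, box in detect(frame, PROMPTS):
                if score >= MIN_SCORE:
                    candidates.append(
                        {"frame": frame_idx, "label": label,
                         "score": round(float(score), 3), "box": box}
                    )
        frame_idx += 1
    cap.release()
    with open(out_path, "w") as f:
        json.dump(candidates, f, indent=2)
    return candidates
```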
Open-vocabulary detectors aren’t magic. They introduce new problems: they are larger and slower than the compact closed-set models you would normally ship on-device, their scores shift with how you phrase the prompts, and confidence thresholds are harder to calibrate across prompts.
So while OVDs are amazing for exploration, prototyping, querying, and rare-class detection, you might not always want to ship them directly to every edge device.
A pattern that makes sense for many enterprises: use a promptable, open-vocabulary model for exploration, querying, and pre-labeling, then distill the classes you actually need into a compact closed-set model for deployment, as sketched below.
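As an illustration of that distillation step, here is a short sketch assuming Ultralytics YOLO as the compact student model and a hypothetical dataset.yaml built from the annotator-edited auto-labels:

```python
# Sketch of the "distill into a compact closed-set model" step, assuming
# Ultralytics YOLO as the student and a hypothetical dataset.yaml that points
# at the annotator-edited auto-labels from the open-vocab sweep.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                        # small pretrained checkpoint as the starting point
model.train(data="dataset.yaml", epochs=100, imgsz=640)

# Export for on-device inference (e.g., Core ML for iPads and iPhones).
model.export(format="coreml")
```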
This gives you the best of both worlds: the flexibility of prompts during discovery and prototyping, and the speed and predictability of a small closed-set model in production.
If you zoom out over the last 15 years: hand-crafted detectors gave way to CNNs that roughly doubled accuracy on PASCAL VOC, and now open-vocabulary detectors match COCO-tuned models without seeing a single COCO label.
The story is simple:
If you’re building enterprise systems, it’s a good time to start treating prompts as the new label files and vision-language detectors as your first stop for exploration, before you commit to yet another closed-set training cycle.

