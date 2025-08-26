Abstract and 1 Introduction

Datasets: We conduct our experiments using the ScanNet200 [38] and Replica [40] datasets. Our analysis on ScanNet200 is based on its validation set, comprising 312 scenes. For the 3D instance segmentation task, we utilize the 200 predefined categories from the ScanNet200 annotations. ScanNet200 labels are categorized into three subsets—head (66 categories), common (68 categories), and tail (66 categories)—based on the frequency of labeled points in the training set. This categorization allows us to evaluate our method’s performance across the long-tail distribution, underscoring ScanNet200 as a suitable evaluation dataset. Additionally, to assess the generalizability of our approach, we conduct experiments on the Replica dataset, which has 48 categories. For the metrics, we follow the evaluation methodology in ScanNet [9] and report the average precision (AP) at two mask overlap thresholds: 50% and 25%, as well as the average across the overlap range of [0.5:0.95:0.05].
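
As a hedged illustration of the metric, the overlap range [0.5:0.95:0.05] corresponds to ten overlap thresholds, and the averaged AP is the mean of the per-threshold AP values. The sketch below is not the official ScanNet evaluation script; the helper names and the `ap_at` values are hypothetical and purely illustrative, not results from this work.

```python
def overlap_thresholds(start=0.5, stop=0.95, step=0.05):
    """Return the overlap thresholds 0.50, 0.55, ..., 0.95 (stop inclusive)."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def mean_ap(ap_at):
    """Average the per-threshold AP values over the whole overlap range."""
    thresholds = overlap_thresholds()
    return sum(ap_at[t] for t in thresholds) / len(thresholds)


thresholds = overlap_thresholds()
# Hypothetical per-threshold AP values (assumption, for illustration only).
ap_at = {t: 1.0 - t for t in thresholds}
summary = mean_ap(ap_at)
```

AP at the 50% and 25% thresholds is reported separately from this average; only thresholds from 50% upward enter the mean.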

Implementation details: We use RGB-depth pairs from the ScanNet200 and Replica datasets, processing every 10th frame for ScanNet200 and all frames for Replica, the same settings as OpenMask3D, for a fair comparison. To create LG label maps, we use the YOLO-World [7] extra-large model for its real-time capability and high zero-shot performance. For 3D proposals, we use Mask3D [39] and filter its proposals with non-maximum suppression, similar to Open3DIS [34]; we avoid DBSCAN [11] to prevent inference slowdowns. All experiments run on a single NVIDIA A100 40GB GPU.
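
The proposal filtering step can be pictured with a minimal greedy non-maximum suppression over 3D masks. This is a sketch under stated assumptions, not the implementation used here: masks are represented as sets of point indices, and the function names and the 0.5 IoU threshold are hypothetical, since this section does not specify the NMS variant or its threshold.

```python
def mask_iou(a, b):
    """IoU of two masks, each given as a set of point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mask_nms(masks, scores, iou_threshold=0.5):
    """Greedy NMS: visit masks by descending score and keep a mask only if
    it does not overlap an already kept mask by more than the threshold."""
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Toy proposals (assumption): the first two overlap heavily (IoU 0.6).
masks = [{0, 1, 2, 3}, {1, 2, 3, 4}, {10, 11}]
scores = [0.9, 0.8, 0.7]
kept = mask_nms(masks, scores)
```

In this toy case the lower-scoring duplicate is dropped and the disjoint third proposal survives.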


5.1 Results analysis

Open-Vocabulary 3D instance segmentation on ScanNet200: We compare our method’s performance against other approaches on the ScanNet200 dataset in Table 1. We indicate whether each method uses 2D instances to generate 3D proposals and whether SAM is used for labeling the 3D masks. Using proposals from only a 3D instance segmentation network, our method achieves state-of-the-art performance compared to methods from two settings: (i) 3D mask proposals from only a 3D instance segmentation network, and (ii) a combination of 3D mask proposals from a 3D network and 2D instances from a 2D segmentation model. Additionally, our method is ∼16× faster than the state-of-the-art Open3DIS.

Generalizability test: To test the generalizability of our method, we use the 3D proposal network pre-trained on the ScanNet200 training set and evaluate on the Replica dataset; the results are shown in Table 2. Our method shows competitive performance with a ∼11× speedup over state-of-the-art models [34], which use CLIP features.

Performance with given 3D masks: We further compare our method against existing methods on prompting 3D proposals from text when ground-truth 3D masks are given, and report the results in Table 5. For Mask3D, oracle masks are assigned class predictions from matched predicted masks using Hungarian matching. For open-vocabulary methods, we use ground-truth masks as input proposals. The results show that MVPDist can outperform CLIP-based approaches in retrieving the correct 3D proposal masks from text prompts, due to the high zero-shot performance achieved by state-of-the-art open-vocabulary 2D object detectors.


Ablation over Replica dataset: To test the ability of object detectors to replace SAM for generating crops for 3D CLIP feature aggregation, we conduct three experiments with different object detectors on the Replica dataset, using ground-truth 3D mask proposals to compare our method’s labeling ability against others. The results are in Table 3, rows R2 to R4, with row R1 showing results from the OpenMask3D [42] base code. We generate class-agnostic bounding boxes with an object detector, then assign the highest-IoU bounding box to each 3D instance as a crop, selecting the most visible views. For 3D CLIP feature aggregation, we follow OpenMask3D’s approach, aggregating features from multiple levels and views. These experiments show that YOLO-World can generate crops nearly as good as SAM’s but with significant speed improvements. Row R5 demonstrates YOLO-World’s better zero-shot performance using our proposed Multi-View Prompt Distribution with LG label maps. Row R6 shows speed improvements from using the GPU for 3D mask visibility computation.
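
The box-to-instance assignment used for the crops can be sketched as follows. This is a minimal illustration under an assumption this section does not spell out: each 3D instance is summarized in a given view by the 2D bounding box of its projected points, and detector boxes are compared against that box. The function names and the toy boxes are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_crop_box(instance_box, detector_boxes):
    """Return (index, iou) of the detector box with the highest IoU against
    the projected instance box, or (None, 0.0) if nothing overlaps."""
    best_index, best_iou = None, 0.0
    for index, box in enumerate(detector_boxes):
        iou = box_iou(instance_box, box)
        if iou > best_iou:
            best_index, best_iou = index, iou
    return best_index, best_iou


# Toy data (assumption): one projected instance box, three detector boxes.
instance_box = (10, 10, 50, 50)
detector_boxes = [(0, 0, 20, 20), (12, 8, 52, 48), (100, 100, 120, 120)]
index, iou = best_crop_box(instance_box, detector_boxes)
```

Here the second detector box is selected as the crop, since it covers most of the projected instance.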

Top K analysis: Table 4 shows that naively using YOLO-World with only the single highest-visibility label map per 3D mask proposal gives sub-optimal results, and that using the top-K label maps gives better predictions. The distribution over multiple frames provides a better estimate, since YOLO-World is expected to misclassify an object in some views while labeling it correctly in others. To be effective, this approach assumes that YOLO-World predicts the correct class for the same 3D object in multiple views.
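
The effect of K can be illustrated with a small sketch. It is a simplification under stated assumptions, not the exact MVPDist formulation: each view is a hypothetical `(visibility, labels)` pair, where `labels` are the class names read from that view's label map at the pixels covered by the projected 3D mask, and pixels without any detection are assumed to be already discarded.

```python
from collections import Counter


def top_k_label_distribution(views, k):
    """Pool labels from the k most visible views into a class distribution."""
    ranked = sorted(views, key=lambda view: view[0], reverse=True)[:k]
    counts = Counter(label for _, labels in ranked for label in labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def predict_class(views, k):
    """Return the most probable class under the pooled distribution."""
    dist = top_k_label_distribution(views, k)
    return max(dist, key=dist.get) if dist else None


# Toy views (assumption): the most visible view is mostly mislabeled.
views = [
    (0.95, ["sofa"] * 8 + ["chair"] * 2),
    (0.90, ["chair"] * 9 + ["sofa"] * 1),
    (0.85, ["chair"] * 10),
    (0.20, ["table"] * 10),
]
single_view = predict_class(views, k=1)
multi_view = predict_class(views, k=3)
dist = top_k_label_distribution(views, k=3)
```

With K = 1 the prediction follows the mislabeled view, while K = 3 recovers the class that dominates across views, which mirrors the argument above.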

High-Granularity (HG) vs. Low-Granularity (LG): Table 4 shows that using SAM to generate HG label maps slightly reduces mAP and slows down inference by ∼5×. This is because the projection of a 3D instance into 2D already holds 2D instance information, as shown in Figure 3, so SAM adds only redundancy to the prediction.

Qualitative results: We show qualitative results on the Replica dataset in Figure 4 and compare them with Open3DIS. Open3DIS performs well at recalling novel geometries, such as small objects, that 3D proposal networks like Mask3D generally fail to capture. However, this comes at the cost of very low precision due to redundant masks with different class predictions.

Limitations: To reach high speed, our method uses only a 3D proposal network for proposal generation. Other proposal generation methods [32, 34] fuse 2D instance masks from a 2D instance segmentation method to generate rich 3D proposals, even for very small objects, which 3D proposal networks like Mask3D [39] generally overlook due to low resolution in 3D. Thus, fast 2D instance segmentation models like FastSAM [56] could be used to generate 3D proposals from the 2D images, which might further improve the performance of our method.


6 Conclusion

We present Open-YOLO 3D, a novel and efficient open-vocabulary 3D instance segmentation method, which makes use of open-vocabulary 2D object detectors instead of heavy segmentation models. Our approach leverages a 2D object detector for class-labeled bounding boxes and a 3D instance segmentation network for class-agnostic masks. We propose to use MVPDist, generated from multi-view low-granularity label maps, to match text prompts to class-agnostic 3D masks. Our proposed method outperforms existing techniques, with gains in both mAP and inference speed. These results point to a new direction toward more efficient open-vocabulary 3D instance segmentation models.
