Open‑YOLO 3D uses 2D object detection instead of heavy SAM/CLIP for open‑vocabulary 3D segmentation, achieving SOTA results with up to 16× faster inference.

No SAM, No CLIP, No Problem: How Open‑YOLO 3D Segments Faster

:::info Authors:

(1) Mohamed El Amine Boudjoghra, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) (mohamed.boudjoghra@mbzuai.ac.ae);

(2) Angela Dai, Technical University of Munich (TUM) (angela.dai@tum.de);

(3) Jean Lahoud, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) (jean.lahoud@mbzuai.ac.ae);

(4) Hisham Cholakkal, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) (hisham.cholakkal@mbzuai.ac.ae);

(5) Rao Muhammad Anwer, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Aalto University (rao.anwer@mbzuai.ac.ae);

(6) Salman Khan, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Australian National University (salman.khan@mbzuai.ac.ae);

(7) Fahad Shahbaz Khan, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Australian National University (fahad.khan@mbzuai.ac.ae).

:::

Abstract and 1 Introduction

  2. Related works
  3. Preliminaries
  4. Method: Open-YOLO 3D
  5. Experiments
  6. Conclusion and References

A. Appendix

Abstract

Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D CLIP features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only result in redundancy that unnecessarily increases the inference time. We empirically find that a 2D object detector matches text prompts to 3D masks both more accurately and faster. We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to ∼16× speedup compared to the best existing method in the literature. On the ScanNet200 validation set, our Open-YOLO 3D achieves a mean average precision (mAP) of 24.7% while operating at 22 seconds per scene. Code and model are available at github.com/aminebdj/OpenYOLO3D.

1 Introduction

3D instance segmentation is a computer vision task that involves the prediction of masks for individual objects in a 3D point cloud scene. It holds significant importance in fields like robotics and augmented reality. Due to its diverse applications, this task has garnered increasing attention in recent years. Researchers have long focused on methods that typically operate within a closed-set framework, limiting their ability to recognize objects not present in the training data. This constraint poses challenges, particularly when novel objects must be identified or categorized in unfamiliar environments. Recent methods [34, 42] address the problem of novel class segmentation, but they suffer from slow inference that ranges from 5 minutes for small scenes to 10 minutes for large scenes due to their reliance on computationally heavy foundation models like SAM [23] and CLIP [55], along with heavy computation for lifting 2D CLIP features to 3D.

Figure 1: Open-vocabulary 3D instance segmentation with our Open-YOLO 3D. The proposed Open-YOLO 3D is capable of segmenting objects in a zero-shot manner. Here, we show the output for a ScanNet200 [38] scene with various prompts, where our model yields improved performance compared to the recent Open3DIS [34]. We show zoomed-in images of hidden predicted instances in the colored boxes. Additional results are in Figure 4 and the supplementary material.

Open-vocabulary 3D instance segmentation is important for robotics tasks such as material handling, where the robot is expected to perform operations from text-based instructions like moving specific products, loading and unloading goods, and inventory management, while making decisions quickly. Although state-of-the-art open-vocabulary 3D instance segmentation methods show high promise in terms of generalizability to novel objects, their inference still takes minutes per scene due to their reliance on heavy foundation models such as SAM. Motivated by recent advances in 2D object detection [7], we look into an alternative approach that leverages fast object detectors instead of utilizing computationally expensive foundation models.

This paper proposes a novel open-vocabulary 3D instance segmentation method, named Open-YOLO 3D, that utilizes efficient, joint 2D-3D reasoning, using 2D bounding box predictions to replace computationally heavy segmentation models. We employ an open-vocabulary 2D object detector to generate bounding boxes with their class labels for all frames corresponding to the 3D scene. In parallel, we utilize a 3D instance segmentation network to generate class-agnostic 3D instance masks for the point clouds, which proves to be much faster than 3D proposal generation methods from 2D instances [34, 32]. Unlike recent methods [42, 34], which use SAM and CLIP to lift 2D CLIP features to 3D for prompting the 3D mask proposals, we propose an alternative approach that relies on the bounding box predictions from 2D object detectors, which prove to be significantly faster than CLIP-based methods. We utilize the predicted bounding boxes in all RGB frames corresponding to the point cloud scene to construct a Low Granularity (LG) label map for every frame. One LG label map is a two-dimensional array with the same height and width as the RGB frame, with the bounding box areas replaced by their predicted class label. Next, we use intrinsic and extrinsic parameters to project the point cloud scene onto the respective LG label maps with top-k visibility for final class prediction. We present an example output of our method in Figure 1. Our contributions are as follows:

• We introduce a 2D object detection-based approach for open-vocabulary labeling of 3D instances, which greatly improves the efficiency compared to 2D segmentation approaches.

• We propose a novel approach to scoring 3D mask proposals using only bounding boxes from 2D object detectors.

• Our Open-YOLO 3D achieves superior performance on two benchmarks, while being considerably faster than existing methods in the literature. On the ScanNet200 validation set, our Open-YOLO 3D achieves an absolute gain of 2.3% at mAP50 while being ∼16× faster compared to the recent Open3DIS [34].
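
To make the LG label map idea from the introduction concrete, the following is a minimal NumPy sketch, not the authors' implementation. It paints bounding boxes into a per-frame label array, projects the points of one class-agnostic 3D instance into each frame with a pinhole camera model, and takes a majority vote over the frames in which the instance is most visible. All function names, the `(x0, y0, x1, y1)` box format, the "larger boxes painted first" ordering, the visible-point count used as the visibility score, and the majority vote are assumptions made for illustration; occlusion handling is omitted.

```python
import numpy as np

def build_lg_label_map(height, width, boxes, labels, background=-1):
    """Paint each box (x0, y0, x1, y1) with its class id.

    Assumption: larger boxes are painted first so smaller boxes stay on top.
    """
    label_map = np.full((height, width), background, dtype=np.int64)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    for i in sorted(range(len(boxes)), key=lambda i: -area(boxes[i])):
        x0, y0, x1, y1 = boxes[i]
        label_map[y0:y1, x0:x1] = labels[i]
    return label_map

def project_points(points_world, K, world_to_cam, height, width):
    """Pinhole projection. Returns pixel coords (N, 2) and an in-frame mask (N,)."""
    homo = np.hstack([points_world, np.ones((len(points_world), 1))])
    cam = (world_to_cam @ homo.T).T[:, :3]       # points in the camera frame
    z = cam[:, 2]
    in_front = z > 1e-6                          # drop points behind the camera
    safe_z = np.where(in_front, z, 1.0)          # avoid division by zero
    uv = (K @ cam.T).T
    u = np.floor(uv[:, 0] / safe_z).astype(np.int64)
    v = np.floor(uv[:, 1] / safe_z).astype(np.int64)
    visible = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return np.stack([u, v], axis=1), visible

def label_instance(points_world, frames, top_k=2, background=-1):
    """Label one 3D instance from the top_k frames where it is most visible.

    frames: list of (label_map, K, world_to_cam) tuples.
    Assumption: visibility = number of instance points landing inside the frame.
    """
    per_frame = []
    for label_map, K, world_to_cam in frames:
        h, w = label_map.shape
        uv, visible = project_points(points_world, K, world_to_cam, h, w)
        frame_labels = label_map[uv[visible, 1], uv[visible, 0]]
        per_frame.append((int(visible.sum()), frame_labels))
    per_frame.sort(key=lambda t: -t[0])          # most visible frames first
    votes = np.concatenate([lab for _, lab in per_frame[:top_k]])
    votes = votes[votes != background]           # ignore pixels outside any box
    if votes.size == 0:
        return background
    values, counts = np.unique(votes, return_counts=True)
    return int(values[np.argmax(counts)])

# Toy usage: one instance seen by two identical cameras; only frame A has a box.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
points = np.array([[0.0, 0.0, 2.0], [0.05, 0.0, 2.0], [0.0, 0.05, 2.0]])
frame_a = build_lg_label_map(48, 64, [(20, 10, 45, 40)], [3])
frame_b = build_lg_label_map(48, 64, [], [])
predicted = label_instance(points, [(frame_a, K, pose), (frame_b, K, pose)])
```

In this toy setup all three points fall inside the single box of frame A, so the instance receives class id 3. The sketch only needs array indexing per frame, which is the intuition behind replacing per-mask SAM and CLIP inference with detector boxes.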

:::info This paper is available on arXiv under the CC BY-NC-SA 4.0 Deed (Attribution-NonCommercial-ShareAlike 4.0 International) license.

:::
