Open‑YOLO 3D uses 2D object detection instead of heavy SAM/CLIP for open‑vocabulary 3D segmentation, achieving SOTA results with up to 16× faster inference.

No SAM, No CLIP, No Problem: How Open‑YOLO 3D Segments Faster

:::info Authors:

(1) Mohamed El Amine Boudjoghra, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) (mohamed.boudjoghra@mbzuai.ac.ae);

(2) Angela Dai, Technical University of Munich (TUM) (angela.dai@tum.de);

(3) Jean Lahoud, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) (jean.lahoud@mbzuai.ac.ae);

(4) Hisham Cholakkal, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) (hisham.cholakkal@mbzuai.ac.ae);

(5) Rao Muhammad Anwer, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Aalto University (rao.anwer@mbzuai.ac.ae);

(6) Salman Khan, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Australian National University (salman.khan@mbzuai.ac.ae);

(7) Fahad Shahbaz Khan, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Australian National University (fahad.khan@mbzuai.ac.ae).

:::

Abstract and 1 Introduction

  2. Related works
  3. Preliminaries
  4. Method: Open-YOLO 3D
  5. Experiments
  6. Conclusion and References

A. Appendix

Abstract

Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D CLIP features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, their applicability is hampered in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only result in redundancy that unnecessarily increases the inference time. We empirically find that a 2D object detector matches text prompts to 3D masks both more accurately and faster. We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to ∼16× speedup compared to the best existing method in the literature. On the ScanNet200 validation set, our Open-YOLO 3D achieves a mean average precision (mAP) of 24.7% while operating at 22 seconds per scene. Code and model are available at github.com/aminebdj/OpenYOLO3D.

1 Introduction

3D instance segmentation is a computer vision task that involves the prediction of masks for individual objects in a 3D point cloud scene. It holds significant importance in fields like robotics and augmented reality. Due to its diverse applications, this task has garnered increasing attention in recent years. Researchers have long focused on methods that typically operate within a closed-set framework, limiting their ability to recognize objects not present in the training data. This constraint poses challenges, particularly when novel objects must be identified or categorized in unfamiliar environments. Recent methods [34, 42] address the problem of novel class segmentation, but they suffer from slow inference that ranges from 5 minutes for small scenes to 10 minutes for large scenes due to their reliance on computationally heavy foundation models like SAM [23] and CLIP [55], along with heavy computation for lifting 2D CLIP features to 3D.

Figure 1: Open-vocabulary 3D instance segmentation with our Open-YOLO 3D. The proposed Open-YOLO 3D is capable of segmenting objects in a zero-shot manner. Here, we show the output for a ScanNet200 [38] scene with various prompts, where our model yields improved performance compared to the recent Open3DIS [34]. We show zoomed-in images of hidden predicted instances in the colored boxes. Additional results are in Figure 4 and the supplementary material.

Open-vocabulary 3D instance segmentation is important for robotics tasks such as material handling, where the robot is expected to perform operations from text-based instructions, like moving specific products, loading and unloading goods, and inventory management, while being fast in the decision-making process. Although state-of-the-art open-vocabulary 3D instance segmentation methods show high promise in terms of generalizability to novel objects, their inference still takes minutes due to their reliance on heavy foundation models such as SAM. Motivated by recent advances in 2D object detection [7], we look into an alternative approach that leverages fast object detectors instead of utilizing computationally expensive foundation models.

This paper proposes a novel open-vocabulary 3D instance segmentation method, named Open-YOLO 3D, that utilizes efficient, joint 2D-3D reasoning, using 2D bounding box predictions to replace computationally heavy segmentation models. We employ an open-vocabulary 2D object detector to generate bounding boxes with their class labels for all frames corresponding to the 3D scene; in parallel, we utilize a 3D instance segmentation network to generate class-agnostic 3D instance masks for the point clouds, which proves to be much faster than 3D proposal generation methods from 2D instances [34, 32]. Unlike recent methods [42, 34], which use SAM and CLIP to lift 2D CLIP features to 3D for prompting the 3D mask proposals, we propose an alternative approach that relies on the bounding box predictions from 2D object detectors, which prove to be significantly faster than CLIP-based methods. We utilize the predicted bounding boxes in all RGB frames corresponding to the point cloud scene to construct a Low Granularity (LG) label map for every frame. Each LG label map is a two-dimensional array with the same height and width as the RGB frame, with the bounding box areas replaced by their predicted class label. Next, we use the camera intrinsic and extrinsic parameters to project the point cloud scene onto the respective LG label maps, using top-k visibility for the final class prediction. We present an example output of our method in Figure 1. Our contributions are as follows:

• We introduce a 2D object detection-based approach for open-vocabulary labeling of 3D instances, which greatly improves the efficiency compared to 2D segmentation approaches.

• We propose a novel approach to scoring 3D mask proposals using only bounding boxes from 2D object detectors.

• Our Open-YOLO 3D achieves superior performance on two benchmarks, while being considerably faster than existing methods in the literature. On the ScanNet200 validation set, our Open-YOLO 3D achieves an absolute gain of 2.3% at mAP50 while being ∼16× faster compared to the recent Open3DIS [34].
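
The labeling pipeline described in the introduction (LG label maps, projection, top-k visibility) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the `BACKGROUND` value, the box tuple layout, the pinhole camera model, and the rule of painting larger boxes first so that smaller boxes overwrite them are all assumptions made for illustration. "Visibility" is approximated here as the fraction of an instance's points that land inside the image, ignoring occlusion, and the final label is a majority vote over the k most visible frames.

```python
from collections import Counter

BACKGROUND = -1  # assumed label for pixels not covered by any box


def build_lg_label_map(height, width, boxes):
    """Paint each box (x0, y0, x1, y1, class_id) into an H x W label grid.

    Assumption: larger boxes are painted first, so smaller boxes win overlaps.
    """
    label_map = [[BACKGROUND] * width for _ in range(height)]
    by_area = sorted(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]), reverse=True)
    for x0, y0, x1, y1, class_id in by_area:
        for v in range(max(0, y0), min(height, y1)):
            for u in range(max(0, x0), min(width, x1)):
                label_map[v][u] = class_id
    return label_map


def project_point(point_world, world_to_cam, fx, fy, cx, cy):
    """Project a 3D point to pixel (u, v) with a pinhole model.

    world_to_cam is a 3x4 extrinsic matrix [R | t]; returns None if the
    point lies behind the camera.
    """
    x, y, z = point_world
    xc, yc, zc = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in world_to_cam)
    if zc <= 0:
        return None
    return int(round(fx * xc / zc + cx)), int(round(fy * yc / zc + cy))


def label_instance(points, frames, k=2):
    """Vote a class for one 3D instance (non-empty list of points).

    Each frame is a dict with hypothetical keys 'label_map', 'world_to_cam',
    and 'intrinsics' (fx, fy, cx, cy).
    """
    per_frame = []
    for frame in frames:
        label_map = frame["label_map"]
        h, w = len(label_map), len(label_map[0])
        labels = []
        for p in points:
            uv = project_point(p, frame["world_to_cam"], *frame["intrinsics"])
            if uv is not None and 0 <= uv[0] < w and 0 <= uv[1] < h:
                labels.append(label_map[uv[1]][uv[0]])
        # Simplified visibility: share of points inside the image.
        per_frame.append((len(labels) / len(points), labels))
    per_frame.sort(key=lambda item: item[0], reverse=True)
    votes = Counter()
    for _, labels in per_frame[:k]:
        votes.update(label for label in labels if label != BACKGROUND)
    return votes.most_common(1)[0][0] if votes else BACKGROUND


# Toy scene: one frame sees the instance, the other does not.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
shifted = [[1, 0, 0, 100], [0, 1, 0, 0], [0, 0, 1, 0]]
lg_map = build_lg_label_map(8, 8, [(2, 2, 6, 6, 7), (0, 0, 8, 8, 3)])
frames = [
    {"label_map": lg_map, "world_to_cam": identity, "intrinsics": (4, 4, 4, 4)},
    {"label_map": lg_map, "world_to_cam": shifted, "intrinsics": (4, 4, 4, 4)},
]
instance_points = [(0, 0, 2), (0.5, 0, 2), (-0.5, 0.5, 2)]
predicted_class = label_instance(instance_points, frames, k=1)  # → 7
```

Because only integer class IDs are written into the label maps and read back at projected pixel locations, no per-pixel mask or feature embedding has to be computed, which is the source of the efficiency argument above.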

:::info This paper is available on arXiv under the CC BY-NC-SA 4.0 Deed (Attribution-NonCommercial-ShareAlike 4.0 International) license.

:::
