Open‑YOLO 3D uses 2D object detection instead of heavy SAM/CLIP for open‑vocabulary 3D segmentation, achieving SOTA results with up to 16× faster inference.

No SAM, No CLIP, No Problem: How Open‑YOLO 3D Segments Faster


:::info Authors:

(1) Mohamed El Amine Boudjoghra, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) (mohamed.boudjoghra@mbzuai.ac.ae);

(2) Angela Dai, Technical University of Munich (TUM) (angela.dai@tum.de);

(3) Jean Lahoud, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) (jean.lahoud@mbzuai.ac.ae);

(4) Hisham Cholakkal, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) (hisham.cholakkal@mbzuai.ac.ae);

(5) Rao Muhammad Anwer, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Aalto University (rao.anwer@mbzuai.ac.ae);

(6) Salman Khan, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Australian National University (salman.khan@mbzuai.ac.ae);

(7) Fahad Shahbaz Khan, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Australian National University (fahad.khan@mbzuai.ac.ae).

:::

Abstract and 1 Introduction

  1. Related works
  2. Preliminaries
  3. Method: Open-YOLO 3D
  4. Experiments
  5. Conclusion and References

A. Appendix

Abstract

Recent works on open-vocabulary 3D instance segmentation show strong promise but come at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D CLIP features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only result in redundancy that unnecessarily increases the inference time. We empirically find that better performance in matching text prompts to 3D masks can be achieved, and in a faster fashion, with a 2D object detector. We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to ∼16× speedup compared to the best existing method in the literature. On the ScanNet200 val. set, our Open-YOLO 3D achieves a mean average precision (mAP) of 24.7% while operating at 22 seconds per scene. Code and models are available at github.com/aminebdj/OpenYOLO3D.

\

1 Introduction

3D instance segmentation is a computer vision task that involves the prediction of masks for individual objects in a 3D point cloud scene. It holds significant importance in fields like robotics and augmented reality. Due to its diverse applications, this task has garnered increasing attention in recent years. Researchers have long focused on methods that typically operate within a closed-set framework, limiting their ability to recognize objects not present in the training data. This constraint poses challenges, particularly when novel objects must be identified or categorized in unfamiliar environments. Recent methods [34, 42] address the problem of novel class segmentation, but they suffer from slow inference, ranging from 5 minutes for small scenes to 10 minutes for large scenes, due to their reliance on computationally heavy foundation models like SAM [23] and CLIP [55], along with the heavy computation required to lift 2D CLIP features into 3D.

\ Figure 1: Open-vocabulary 3D instance segmentation with our Open-YOLO 3D. The proposed Open-YOLO 3D is capable of segmenting objects in a zero-shot manner. Here, we show the output for a ScanNet200 [38] scene with various prompts, where our model yields improved performance compared to the recent Open3DIS [34]. We show zoomed-in images of hidden predicted instances in the colored boxes. Additional results are in Figure 4 and the supplementary material.

\ Open-vocabulary 3D instance segmentation is important for robotics tasks such as material handling, where a robot is expected to carry out operations from text-based instructions, such as moving specific products, loading and unloading goods, and managing inventory, while remaining fast in its decision-making. Although state-of-the-art open-vocabulary 3D instance segmentation methods show high promise in terms of generalizability to novel objects, they still require minutes of inference time due to their reliance on heavy foundation models such as SAM. Motivated by recent advances in 2D object detection [7], we look into an alternative approach that leverages fast object detectors instead of computationally expensive foundation models.

\ This paper proposes a novel open-vocabulary 3D instance segmentation method, named Open-YOLO 3D, that utilizes efficient joint 2D-3D reasoning, using 2D bounding box predictions in place of computationally heavy segmentation models. We employ an open-vocabulary 2D object detector to generate bounding boxes with their class labels for all frames corresponding to the 3D scene; in parallel, we utilize a 3D instance segmentation network to generate 3D class-agnostic instance masks for the point clouds, which proves to be much faster than methods that generate 3D proposals from 2D instances [34, 32]. Unlike recent methods [42, 34], which use SAM and CLIP to lift 2D CLIP features to 3D for prompting the 3D mask proposals, we propose an alternative approach that relies on the bounding box predictions from 2D object detectors, which proves to be significantly faster than CLIP-based methods. We utilize the predicted bounding boxes in all RGB frames corresponding to the point cloud scene to construct a Low Granularity (LG) label map for every frame. An LG label map is a two-dimensional array with the same height and width as the RGB frame, in which the bounding box areas are replaced by their predicted class labels. Next, we use the intrinsic and extrinsic camera parameters to project the point cloud scene onto its respective LG label maps with top-k visibility for the final class prediction. A simplified sketch of this labeling pipeline is given after the contribution list below. We present an example output of our method in Figure 1. Our contributions are the following:

\ • We introduce a 2D object detection-based approach for open-vocabulary labeling of 3D instances, which greatly improves the efficiency compared to 2D segmentation approaches.

\ • We propose a novel approach to scoring 3D mask proposals using only bounding boxes from 2D object detectors.

\ • Our Open-YOLO 3D achieves superior performance on two benchmarks while being considerably faster than existing methods in the literature. On the ScanNet200 val. set, our Open-YOLO 3D achieves an absolute gain of 2.3% at mAP50 while being ∼16× faster compared to the recent Open3DIS [34].
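\ To make the labeling pipeline described above concrete, the sketch below illustrates, under our own naming assumptions, how a Low Granularity (LG) label map could be built from 2D detections and how a class-agnostic 3D instance could be assigned a class by projecting its points onto those maps and voting. The function names (`build_lg_label_map`, `project_points`, `vote_instance_label`) and the plain majority vote are illustrative, not the authors' implementation; in particular, the paper's restriction of voting to the top-k most visible frames and its box-based proposal scoring are omitted here for brevity.

```python
import numpy as np

def build_lg_label_map(boxes, labels, height, width):
    """Build a Low Granularity (LG) label map: an (H, W) array in which each
    detected bounding box region is filled with its predicted class id.
    Pixels not covered by any box keep the sentinel value -1.
    NOTE: names and the box-overwrite order are illustrative assumptions."""
    label_map = np.full((height, width), -1, dtype=np.int32)
    for (x1, y1, x2, y2), cls in zip(boxes, labels):
        label_map[int(y1):int(y2), int(x1):int(x2)] = cls
    return label_map

def project_points(points, intrinsics, world_to_cam):
    """Project Nx3 world-space points to pixel coordinates and depth using a
    3x3 intrinsic matrix and a 4x4 world-to-camera extrinsic matrix."""
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)  # (N, 4)
    cam = (world_to_cam @ pts_h.T).T[:, :3]                                  # camera-space XYZ
    uvw = (intrinsics @ cam.T).T                                             # (N, 3)
    depth = uvw[:, 2]
    uv = uvw[:, :2] / np.clip(depth[:, None], 1e-6, None)                    # perspective divide
    return uv, depth

def vote_instance_label(instance_points, label_maps, intrinsics, extrinsics):
    """Assign a class to one 3D instance by projecting its points onto every
    frame's LG label map and taking a majority vote over valid hits.
    (Top-k visibility frame selection is omitted in this sketch.)"""
    votes = []
    for label_map, K, T in zip(label_maps, intrinsics, extrinsics):
        h, w = label_map.shape
        uv, depth = project_points(instance_points, K, T)
        u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
        valid = (depth > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hits = label_map[v[valid], u[valid]]
        votes.extend(hits[hits >= 0].tolist())
    return int(np.bincount(votes).argmax()) if votes else -1
```

In this simplified form, each frame contributes one vote per visible projected point; an occlusion check and the per-instance top-k frame selection described in the paper would further filter which frames are allowed to vote.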

\

:::info This paper is available on arxiv under the CC BY-NC-SA 4.0 Deed (Attribution-NonCommercial-ShareAlike 4.0 International) license.

:::

\
