3DIML is a new framework that uses implicit scene representations to segment 3D instances quickly and accurately. In contrast to previous NeRF-based techniques that require time-consuming optimization and intricate losses, 3DIML uses a two-phase approach (InstanceMap and InstanceLift) to lift 2D instance masks into consistent 3D label fields. Its modular pipeline greatly accelerates training and inference, achieving a speedup of up to 24× while preserving high-quality segmentation. With the addition of the InstanceLoc module for near real-time instance localization, 3DIML offers a scalable, plug-and-play solution for fast 3D scene understanding in robotics and computer vision applications.

Solving 3D Segmentation’s Biggest Bottleneck

2025/10/24 23:33

:::info Authors:

(1) George Tang, Massachusetts Institute of Technology;

(2) Krishna Murthy Jatavallabhula, Massachusetts Institute of Technology;

(3) Antonio Torralba, Massachusetts Institute of Technology.

:::

Abstract and I. Introduction

II. Background

III. Method

IV. Experiments

V. Conclusion and References

Fig. 1: Our approach, 3DIML, learns an implicit representation of a scene as a composition of object instances. It does so by lifting view-inconsistent 2D instance labels from off-the-shelf 2D segmentation models (such as Segment Anything) into view-consistent 3D instance labels. The images above show results for the in-the-wild scan postdoc office generated using 3DIML, composed of InstanceMap (left) and InstanceLift. InstanceLoc (right) is then used to refine the results. Each identified 3D label is shown in a different color. Notice how thin and partially occluded objects are accurately delineated across the sequence.

Abstract: We tackle the problem of learning an implicit scene representation for 3D instance segmentation from a sequence of posed RGB images. To this end, we introduce 3DIML, a novel framework that efficiently learns a label field that can be rendered from novel viewpoints to produce view-consistent instance segmentation masks. 3DIML significantly improves upon the training and inference runtimes of existing methods based on implicit scene representations. In contrast to prior art, which optimizes a neural field in a self-supervised manner and requires complicated training procedures and loss function design, 3DIML uses a two-phase process. The first phase, InstanceMap, takes as input 2D segmentation masks of the image sequence generated by a frontend instance segmentation model and associates corresponding masks across images to 3D labels. These almost view-consistent pseudolabel masks are then used in the second phase, InstanceLift, to supervise the training of a neural label field, which interpolates regions missed by InstanceMap and resolves ambiguities. Additionally, we introduce InstanceLoc, which enables near real-time localization of instance masks given a trained label field and an off-the-shelf image segmentation model by fusing outputs from both. We evaluate 3DIML on sequences from the Replica and ScanNet datasets and demonstrate its effectiveness under mild assumptions on the image sequences. We achieve a large practical speedup over existing implicit scene representation methods with comparable quality, showcasing 3DIML's potential to facilitate faster and more effective 3D scene understanding.

I. INTRODUCTION

Intelligent agents require scene understanding at the object level to effectively carry out context-specific actions such as navigation and manipulation. While segmenting objects from images has seen remarkable progress with scalable models trained on internet-scale datasets [1], [2], extending such capabilities to the 3D setting remains challenging.

In this work, we tackle the problem of learning a 3D scene representation from posed 2D images that factorizes the underlying scene into its set of constituent objects. Existing approaches have focused on training class-agnostic 3D segmentation models [3], [4], which require large amounts of annotated 3D data and operate directly over explicit 3D scene representations (e.g., point clouds). An alternate class of approaches [5], [6] has instead proposed to directly lift segmentation masks from off-the-shelf instance segmentation models into implicit 3D representations, such as neural radiance fields (NeRF) [7], enabling them to render 3D-consistent instance masks from novel viewpoints.

However, these neural field-based approaches have remained notoriously difficult to optimize, with [5] and [6] taking several hours to optimize for low-to-mid resolution images (e.g., 300 × 640). In particular, Panoptic Lifting [5] scales cubically with the number of objects in the scene, preventing it from being applied to scenes with hundreds of objects, while Contrastive Lifting [6] requires a complicated, multi-stage training procedure, hindering its practicality for robotics applications.

To this end, we propose 3DIML, an efficient technique to learn 3D-consistent instance segmentation from posed RGB images. 3DIML comprises two phases: InstanceMap and InstanceLift. Given view-inconsistent 2D instance masks extracted from the RGB sequence using a frontend instance segmentation model [2], InstanceMap produces a sequence of view-consistent instance masks. To do so, we first associate masks across frames using keypoint matches between similar pairs of images. We then use these potentially noisy associations to supervise a neural label field, InstanceLift, which exploits 3D structure to interpolate missing labels and resolve ambiguities. Unlike prior work, which requires multi-stage training and additional loss function engineering, we use a single rendering loss for instance label supervision, enabling the training process to converge significantly faster. The total runtime of 3DIML, including InstanceMap, is 10-20 minutes, as opposed to 3-6 hours for prior art.
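
The association step described above can be illustrated with a minimal sketch. This is not the authors' InstanceMap implementation: the function name `associate_masks`, the `min_votes` threshold, and the simple majority-vote rule are all assumptions made for illustration. It shows only the general idea of letting keypoint matches between two frames vote on which 2D instance IDs correspond.

```python
from collections import Counter, defaultdict


def associate_masks(labels_a, labels_b, matches, min_votes=2):
    """Map each instance ID in frame B to a frame-A ID via keypoint votes.

    labels_a, labels_b: 2D lists of integer instance IDs (0 = unlabeled).
    matches: list of ((xa, ya), (xb, yb)) pixel correspondences.
    min_votes: hypothetical threshold that discards weakly supported links.
    """
    votes = defaultdict(Counter)
    for (xa, ya), (xb, yb) in matches:
        id_a = labels_a[ya][xa]
        id_b = labels_b[yb][xb]
        # Only matches that land on labeled pixels in both frames can vote.
        if id_a != 0 and id_b != 0:
            votes[id_b][id_a] += 1

    mapping = {}
    for id_b, counter in votes.items():
        id_a, count = counter.most_common(1)[0]
        if count >= min_votes:
            mapping[id_b] = id_a
    return mapping


# Toy example: two 2x4 label images with view-inconsistent IDs.
labels_a = [
    [1, 1, 0, 2],
    [1, 1, 0, 2],
]
labels_b = [
    [0, 7, 7, 9],
    [0, 7, 7, 9],
]
matches = [
    ((0, 0), (1, 0)),
    ((1, 1), (2, 1)),
    ((3, 0), (3, 0)),
    ((3, 1), (3, 1)),
    ((2, 0), (0, 0)),  # falls on unlabeled pixels, so it is ignored
]
mapping = associate_masks(labels_a, labels_b, matches)
print(mapping)
```

Associations produced this way can still be noisy, which is consistent with the text's point that the label field is then trained to interpolate missing labels and resolve ambiguities.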

In addition, we devise InstanceLoc, a fast localization pipeline that takes in a novel view and localizes all instances segmented in that image (using a fast instance segmentation model [8]) by sparsely querying the label field and fusing the label predictions with the extracted image regions. Finally, 3DIML is highly modular, and we can easily swap components of our method for more performant ones as they become available.
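
As a rough illustration of fusing sparse label-field queries with image regions, the sketch below samples a few pixels per 2D region and assigns the region the most frequent label returned. Everything here is assumed for illustration: `localize_instances`, `samples_per_region`, and the callable `query_label_field` (a stand-in for a trained label field) are hypothetical names, and majority voting is one plausible fusion rule rather than the one the paper necessarily uses.

```python
import random
from collections import Counter


def localize_instances(regions, query_label_field, samples_per_region=5, seed=0):
    """Assign one 3D instance label to each 2D region via sparse queries.

    regions: dict mapping region ID -> list of (x, y) pixels in that region,
        e.g. produced by a fast 2D segmentation model.
    query_label_field: callable (x, y) -> instance label; a hypothetical
        stand-in for rendering the trained label field at one pixel.
    """
    rng = random.Random(seed)
    fused = {}
    for region_id, pixels in regions.items():
        if not pixels:
            continue  # nothing to sample in an empty region
        k = min(samples_per_region, len(pixels))
        sampled = rng.sample(pixels, k)
        labels = Counter(query_label_field(x, y) for x, y in sampled)
        fused[region_id] = labels.most_common(1)[0][0]
    return fused


def toy_label_field(x, y):
    # Toy field: left half of the image is instance 3, right half is instance 5.
    return 3 if x < 4 else 5


regions = {
    "a": [(x, y) for x in range(0, 4) for y in range(4)],
    "b": [(x, y) for x in range(4, 8) for y in range(4)],
}
fused = localize_instances(regions, toy_label_field)
print(fused)
```

The appeal of this kind of design is that the number of field queries grows with the number of regions rather than the number of pixels, while the 2D model supplies the mask boundaries.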

To summarize, our contributions are:

• An efficient neural field learning approach that factorizes a 3D scene into its constituent objects

• A fast instance localization algorithm that fuses sparse queries to the trained label field with performant image instance segmentation models to generate 3D-consistent instance segmentation masks

• An overall practical runtime improvement of 14-24× over prior art, benchmarked on a single GPU (NVIDIA RTX 3090)

II. BACKGROUND

2D segmentation: The prevalence of vision transformer architectures and the increasing scale of image datasets have resulted in a series of state-of-the-art image segmentation models. Panoptic and Contrastive Lifting both lift panoptic segmentation masks produced by Mask2Former [1] to 3D by learning a neural field. Towards open-set segmentation, Segment Anything (SAM) [2] achieves unprecedented performance by training on a billion masks over 11 million images. HQ-SAM [9] improves upon SAM for fine-grained masks. FastSAM [8] distills SAM into a CNN architecture and achieves similar performance while being orders of magnitude faster. In this work, we use GroundedSAM [10], [11], which refines SAM to produce object-level, as opposed to part-level, segmentation masks.

Neural fields for 3D instance segmentation: NeRFs are implicit scene representations that can accurately encode complex geometry, semantics, and other modalities, as well as resolve viewpoint-inconsistent supervision [12]. Panoptic lifting [5] constructs semantic and instance branches on an efficient variant of NeRF, TensoRF [13], utilizing a Hungarian matching loss function to assign learned instance masks to surrogate object IDs given reference view-inconsistent masks. This scales poorly with an increasing number of objects (owing to the cubic complexity of Hungarian matching). Contrastive lifting [6] addresses this by instead employing contrastive learning on scene features, with positive and negative relations determined by whether or not they project onto the same mask. In addition, contrastive lifting requires a slow-fast clustering-based loss for stable training; it is faster than panoptic lifting but requires multiple stages of training, leading to slow convergence. Concurrently with our work, Instance-NeRF [14] directly learns a label field, but bases its mask association on using NeRF-RPN [15] to detect objects in a NeRF. Our approach, in contrast, allows scaling to very high image resolutions while requiring only a small number (40-60) of neural field queries to render segmentation masks.
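
To make the matching problem concrete, the sketch below solves a tiny assignment between predicted instance slots and reference masks. It is an illustration only, not Panoptic Lifting's loss: the cost matrix is made up (it could stand for something like 1 minus mask overlap), and the solver is a brute-force search over permutations rather than the Hungarian algorithm. The Hungarian algorithm finds the same optimum in cubic time, which is the scaling behavior the text refers to; brute force is factorial and only usable on toy inputs.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (assignment, total_cost) minimizing the summed matching cost.

    cost[i][j]: hypothetical cost of matching predicted slot i to reference
    mask j. assignment[i] is the reference index chosen for slot i.
    Brute force for illustration; not the Hungarian algorithm.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Toy 3x3 cost matrix (values are invented for the example).
cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.3],
]
assignment, total = best_assignment(cost)
print(assignment, round(total, 3))
```

Because a matching of this kind has to be recomputed against view-inconsistent reference masks during training, its cost grows quickly with the number of objects in the scene.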

Structure from Motion: During mask association in InstanceMap, we take inspiration from scalable 3D reconstruction pipelines such as hLoc [16]: we first use visual descriptors to match image viewpoints, then apply keypoint matching as a preliminary step for mask association. We utilize LoFTR [17] for keypoint extraction and matching.
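
The first stage of this retrieve-then-match pattern can be sketched as follows. The descriptors here are invented toy vectors, and `top_k_pairs` is a hypothetical helper; in a real pipeline the vectors would come from a learned global image descriptor, and each selected pair would then be handed to a keypoint matcher such as LoFTR.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_pairs(descriptors, k=1):
    """For each image index, list the k most similar other images."""
    pairs = {}
    for i, d_i in enumerate(descriptors):
        scored = [
            (cosine(d_i, d_j), j)
            for j, d_j in enumerate(descriptors)
            if j != i
        ]
        # Highest similarity first; ties broken by lower image index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        pairs[i] = [j for _, j in scored[:k]]
    return pairs


# Toy global descriptors: images 0 and 1 look alike, as do images 2 and 3.
descriptors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
]
pairs = top_k_pairs(descriptors, k=1)
print(pairs)
```

Restricting keypoint matching to retrieved pairs avoids running the matcher on every pair of frames in the sequence.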


:::info This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

:::
