“We manufacture time by making robots useful.”
LAS VEGAS, Jan. 6, 2026 /PRNewswire/ — At Sharpa, our mission is to build robots that are useful in our daily lives. We believe robots should help us handle repetitive, tedious, and sometimes dangerous work, so that we have the choice to do what we enjoy – climbing mountains, making artwork, or spending time with family. By returning more of people's time to them, we believe we are manufacturing time for mankind.
While significant progress has been made on locomotion (think dancing, boxing, or playing soccer), it has not meaningfully freed people's time: robots' hands remain clumsy, and people must stay in the loop to compensate for their mistakes. Robotics has not yet manufactured time for mankind. From a productivity perspective, the key focus for robotics should be delivering manipulation capabilities on par with humans. Yet, despite advances in locomotion, dexterous autonomy remains a distant reality.
The Challenges
Three challenges persist in the current manipulation paradigm.
1. “Tactileless is the new blindness.” – Conventional manipulation policies are trained on data that contains only motion trajectories, with neither force nor tactile feedback. In contrast, humans rely heavily on tactile sensing, and often very little visual feedback, to accomplish most manual tasks. Think about buttoning a shirt, inserting a key into a lock, or grabbing a plastic cup.
2. “90% of the effort is in the last millimeter when interacting with objects.” – The current computation architecture for fine manipulation usually involves a unified Vision-Language-Action (VLA) model. It is suitable for initiating directional motion prior to contact with an object, but it lacks the high-frequency, force- and tactile-based feedback loop necessary to make contact successful, whether that means tightening the grasp, sliding the object, or quickly moving the fingers from one object to another. The “Last Millimeter Challenge” cannot be tackled by a model that does not adjust its control loop to pre- and post-contact situations (a minimal illustration follows this list).
3. “The Data Drought” is a bottleneck to model training – not only are high-quality data scarce, they are also expensive to acquire. Nothing beats online text, images, and videos for scale, as LLM training has proven over the past couple of years. Therefore, only data that is both widely available and anthropomorphic – mimicking the interaction abilities of human hands in terms of dexterity and object handling – is useful. In most cases the two conditions are not met simultaneously (e.g., simulation data is widely available but not sufficiently anthropomorphic, while robot teleoperation data is high quality but difficult to scale).
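To make the pre-/post-contact distinction in challenge 2 concrete, here is a minimal, hypothetical sketch of a contact-gated controller. It is our illustration, not Sharpa's implementation; the policy interfaces, sensor shapes, and threshold are all assumptions.

```python
# Illustrative only: a controller that switches from a slow, vision-driven
# policy to a fast, tactile-driven loop once contact is detected.
# All names, signatures, and thresholds are hypothetical.
import numpy as np

CONTACT_THRESHOLD_N = 0.2  # assumed fingertip force threshold, in newtons

def in_contact(tactile: np.ndarray) -> bool:
    """Declare contact when any fingertip force exceeds the threshold."""
    return bool(np.any(tactile > CONTACT_THRESHOLD_N))

def control_step(vision_policy, tactile_policy, image, tactile, proprio):
    """One control tick.

    Pre-contact: a coarse, vision-based action suffices.
    Post-contact: only a high-frequency tactile loop can correct grasp
    force and finger placement in the "last millimeter".
    """
    if in_contact(tactile):
        return tactile_policy(tactile, proprio)  # fast, fine-grained loop
    return vision_policy(image, proprio)         # slow, coarse approach
```

A model that runs only the vision branch, at a single fixed rate, has no way to react to slip or excess grasp force between its own ticks, which is the failure mode described above.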
We developed CraftNet to solve these problems. It is an end-to-end, hierarchical VTLA (Vision-Tactile-Language-Action) model for fine manipulation, with native anthropomorphic last-millimeter interaction.
CraftNet
We built a manipulation architecture inspired by the System 1/System 2 theory of cognition and incorporated the “last millimeter” interaction control described above. It consists of three systems:
System 2 (the Reasoning Brain) – a Vision-Language Model (VLM) with the following functions:
1. It understands human instructions and decomposes complex tasks into sequential steps or sub-tasks.
2. It perceives the environment and plans actions accordingly.
3. It provides high-level reasoning, logical processing, decision-making, and long-horizon planning, with sufficient generalization for humanoid robots to operate across diverse scenarios.
System 2 is an open-source model that serves as the interface between robots and the operators they assist (e.g., restaurant managers, cleaning shift managers). It is relatively slow (~1 Hz).
System 1 (the Motion Brain) – a foundation model responsible for motion planning and coarse action control. It ensures that the robot approaches the object in an optimal way prior to contact. It is relatively fast, at around 10 Hz. Most application-related training is performed on System 1, using public- or private-domain training data.
System 0 (the Interaction Brain) – a very high-frequency model responsible for instantaneous interaction via fine motor control. It operates at ~100 Hz. System 0 processes tactile feedback in real time to continuously readjust hand and finger positions throughout the interaction with an object.
CraftNet is the combination of System 0 and System 1, forming a VTLA (Vision-Tactile-Language-Action) model that translates high-level, multimodal inputs into continuous, fine-grained actions executed by dexterous humanoid robots, at a level of precision and speed previously achievable only by humans.
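As a rough structural sketch, the three systems can be pictured as the interfaces below. The class and method names are our own shorthand for the description above, not Sharpa's API; only the roles and rates come from the text.

```python
# Structural sketch of the three-system hierarchy described above.
# Class and method names are illustrative shorthand, not Sharpa's API.
from dataclasses import dataclass

@dataclass
class SubTask:
    instruction: str  # one step of a decomposed task, in natural language

class ReasoningBrain:  # System 2: open-source VLM, ~1 Hz
    def plan(self, user_instruction: str, scene) -> list[SubTask]:
        """Decompose a complex task into sequential sub-tasks."""
        raise NotImplementedError

class MotionBrain:  # System 1: foundation model for coarse motion, ~10 Hz
    def coarse_action(self, subtask: SubTask, scene):
        """Generate upper-body motion, a pre-contact hand pose, and a
        latent physical intent to hand off to System 0."""
        raise NotImplementedError

class InteractionBrain:  # System 0: fine motor control, ~100 Hz
    def fine_action(self, intent, tactile, proprio):
        """Continuously readjust fingers using tactile and
        proprioceptive feedback during contact."""
        raise NotImplementedError
```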
The Flow
System 2 supplies semantic intent and/or explicit language instructions to System 1 within CraftNet.
System 1 generates coarse motions, including upper-body movements and pre-contact hand poses. The latent physical intent and/or coarse actions are then passed to System 0, along with tactile and proprioceptive signals (e.g. force and torque).
System 0 produces human-level, fine-grained actions in the last millimeters of contact, enabling robust performance in complex manipulation tasks. System 0 also feeds state vectors back to System 1, allowing it to adjust the latent physical intent or coarse actions when fine-grained execution fails during manipulation.
System 1 returns the “progress vector” to System 2, continuously updating task status and maintaining a global view of the execution process.
Inference across the three systems occurs at distinct frequencies: approximately 1 Hz for System 2, 10 Hz for System 1, and 100 Hz for System 0. This asynchronous design allows temporal decoupling: supporting deep understanding and long-horizon planning in the Reasoning Brain, fast processing and coarse trajectory generation in the Motion Brain, and instantaneous, high-dynamic fine-motor control in the Interaction Brain.
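A minimal sketch of this temporal decoupling follows, assuming a simple shared blackboard and three independent fixed-rate loops. The 1/10/100 Hz figures come from the text; the blackboard pattern and every interface below are assumptions.

```python
# Sketch of asynchronous multi-rate execution: three loops ticking at
# ~1 Hz, ~10 Hz, and ~100 Hz, exchanging data through shared slots.
import asyncio

class Blackboard:
    subtask = None   # System 2 -> System 1: semantic intent / instruction
    intent = None    # System 1 -> System 0: latent physical intent
    feedback = None  # System 0 -> System 1: state vector
    progress = None  # System 1 -> System 2: progress vector

async def run_loop(hz: float, step_fn):
    """Tick step_fn at (approximately) a fixed rate, forever."""
    while True:
        step_fn()
        await asyncio.sleep(1.0 / hz)

async def main(bb: Blackboard, system2, system1, system0, sensors):
    def step2():  # ~1 Hz: long-horizon planning and task tracking
        bb.subtask = system2.plan(bb.progress)

    def step1():  # ~10 Hz: coarse motion; reports progress upward
        bb.intent, bb.progress = system1.act(bb.subtask, bb.feedback)

    def step0():  # ~100 Hz: fine-grained contact control
        bb.feedback = system0.act(bb.intent, sensors.tactile(),
                                  sensors.proprio())

    await asyncio.gather(run_loop(1, step2),
                         run_loop(10, step1),
                         run_loop(100, step0))
```

Because each loop sleeps and ticks on its own schedule, a slow planning step in System 2 never stalls the 100 Hz contact loop, which is the point of the asynchronous design.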
The “Midas Touch for Data”
CraftNet enhances the utility of different types of training data by transforming widely available video data into high-quality interaction data enriched with tactile sensing.
For Real-World Data, CraftNet actively controls the interaction process during teleoperation, while humans remain in the loop to accomplish the full task. System 1 and System 0 are then jointly trained and fine-tuned on this high-quality teleoperation data in the post-training phase, making subsequent training faster and more effective.
For Simulation Data, the CraftNet model, placed in the simulation loop, refines simulated trajectories with realistic post-contact behavior and corrects unrealistic force and compliance patterns.
For Public Internet Data, System 2 leverages an open-source VLM pre-trained on large-scale Internet data (mainly text and images). CraftNet enhances this Internet data with tactile information so that System 1 can be properly pre-trained.
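The release does not specify how this enrichment is implemented. As a conceptual sketch, one could imagine replaying motion-only trajectories through a physics simulator with a CraftNet-style model in the loop, so that each step acquires tactile labels and realistic post-contact corrections. Every name below is hypothetical.

```python
# Conceptual sketch of the "Midas Touch" idea: augment motion-only
# trajectories (e.g., from video or simulation) with tactile labels and
# refined actions. All function names and interfaces are hypothetical.

def enrich_trajectory(trajectory, craftnet, simulator):
    """Replay a motion-only trajectory and record tactile observations.

    trajectory: list of (image, coarse_action) pairs with no touch data.
    Returns the same steps augmented with tactile signals and the
    fine-grained actions a System 0-style model would take.
    """
    enriched = []
    for image, coarse_action in trajectory:
        tactile = simulator.apply(coarse_action)               # contact response
        fine_action = craftnet.refine(coarse_action, tactile)  # last-millimeter fix
        enriched.append(
            {"image": image, "tactile": tactile, "action": fine_action}
        )
    return enriched
```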
Conclusion
The transition from a robot that can dance to a robot that can do tasks marks the most significant shift in modern robotics. By solving the “Tactile Blindness” that has plagued conventional models, CraftNet moves beyond simple motion trajectories and enters the realm of physical intelligence.
By integrating the Interaction Brain, we can give robots the “common sense” of touch. This high-frequency autonomy ensures that robots not only mimic human movement but also make contact with objects successfully.
Every manual task delegated to a CraftNet-powered robot is a deposit into the human “time bank.” When a robot can sense the friction of a delicate glass, feel the tension in a snagged zipper, or navigate the “last millimeter” of a complex assembly without human intervention, it ceases to be a novelty and becomes a productive force.
View original content to download multimedia: https://www.prnewswire.com/news-releases/sharpa-announces-craftnet—a-hierarchical-vtla-model-for-fine-manipulation-302653876.html
SOURCE Sharpa


