How Does TRELLIS 2 Work: Architecture & Technology Explained (2026)

Last updated: May 3, 2026

TRELLIS 2 generates production-ready 3D models from a single image in about 3 seconds at 512³ resolution on an NVIDIA H100. But what happens between uploading a photo and downloading a GLB file? This article breaks down the architecture behind Microsoft Research's 3D generation model, from the O-Voxel representation to the three-stage diffusion pipeline, without requiring a machine learning background.

The Core Problem: How Do You Represent 3D in an AI Model?

Before understanding TRELLIS 2, it helps to understand why 3D generation is harder than 2D image generation.

A 2D image is a grid of pixels — simple, uniform, and easy for neural networks to process. But 3D objects are fundamentally different. A 3D model has:

Irregular geometry: surfaces that curve, fold, and intersect
Hollow interiors: the inside of a cup, the back of a mask
Open surfaces: a cape, a leaf, a piece of fabric
Materials: color, metalness, roughness — all varying across the surface

Traditional 3D representations force trade-offs. Neural Radiance Fields (NeRF) produce photorealistic results but are slow to render. Signed Distance Fields (SDF) handle geometry well but struggle with open surfaces. Polygon meshes are fast but hard for AI to generate directly.

TRELLIS 2 introduces a new representation called O-Voxel, designed to handle all four at once: irregular geometry, hollow interiors, open surfaces, and spatially varying materials.

O-Voxel: The Foundation of TRELLIS 2

O-Voxel (Omni-Voxel) is a field-free sparse voxel representation. Let's break that down.

Feature	What It Stores
v (vertex position)	Precise vertex coordinates inside the voxel
delta (edge crossing)	Binary flags indicating which edges the surface crosses
gamma (split weight)	Handles multi-surface intersections within one voxel
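
To make the table concrete, here is a minimal sketch of how a sparse grid of such voxels could be held in memory. It is an illustration only, not the TRELLIS 2 data layout: the class names `OVoxelCell` and `SparseOVoxelGrid`, the choice of three edge flags per voxel, and the normalization to a unit cube are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Coord = Tuple[int, int, int]


@dataclass(frozen=True)
class OVoxelCell:
    """One occupied voxel. Field names follow the table; the layout is hypothetical."""
    v: Tuple[float, float, float]   # vertex position inside the voxel, each in [0, 1]
    delta: Tuple[bool, bool, bool]  # assumed: one crossing flag per owned edge (x, y, z)
    gamma: float                    # split weight

    def __post_init__(self) -> None:
        if not all(0.0 <= c <= 1.0 for c in self.v):
            raise ValueError("v must lie inside the unit voxel")


class SparseOVoxelGrid:
    """Stores only occupied voxels, keyed by integer grid coordinate."""

    def __init__(self, resolution: int) -> None:
        self.resolution = resolution
        self.cells: Dict[Coord, OVoxelCell] = {}

    def set(self, coord: Coord, cell: OVoxelCell) -> None:
        if not all(0 <= c < self.resolution for c in coord):
            raise IndexError("coordinate outside the grid")
        self.cells[coord] = cell

    def world_vertex(self, coord: Coord) -> Tuple[float, float, float]:
        """Vertex position in a unit cube: (grid index + local offset) / resolution."""
        cell = self.cells[coord]
        return tuple((i + o) / self.resolution for i, o in zip(coord, cell.v))

    def occupancy(self) -> float:
        """Fraction of the dense grid that is actually stored."""
        return len(self.cells) / self.resolution ** 3


grid = SparseOVoxelGrid(resolution=4)
grid.set((1, 2, 3), OVoxelCell(v=(0.5, 0.25, 0.0), delta=(True, False, False), gamma=0.5))
print(grid.world_vertex((1, 2, 3)))
print(grid.occupancy())
```

The dictionary is what makes the grid sparse: empty space costs nothing, which matters because a surface touches only a small fraction of the voxels in a dense cube.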

SC-VAE	Encodes/Decodes	Conditioned On
Shape SC-VAE	Geometry (vertex positions, edge crossings)	Input image
Material SC-VAE	PBR materials (color, metalness, roughness)	Input image + shape latents

Parameter	Value
Hidden width	1,536
Transformer blocks	30
Attention heads	12
MLP width	8,192
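
These four numbers are enough for a back-of-the-envelope parameter count. The sketch below assumes each transformer block contains one self-attention layer, one cross-attention layer (for the image condition), and a two-layer MLP, and it ignores biases, normalization, embeddings, and any modulation layers. Those structural assumptions are ours, not something the table states, so treat the result as a rough estimate.

```python
HIDDEN_WIDTH = 1536
NUM_BLOCKS = 30
NUM_HEADS = 12
MLP_WIDTH = 8192


def head_dim(hidden: int, heads: int) -> int:
    """Per-head width; the hidden width must split evenly across heads."""
    if hidden % heads != 0:
        raise ValueError("hidden width must be divisible by the number of heads")
    return hidden // heads


def block_params(hidden: int, mlp: int, cross_attention: bool = True) -> int:
    """Weight count for one block (assumed layout, biases and norms ignored)."""
    attention = 4 * hidden * hidden          # Q, K, V and output projections
    attention_layers = 2 if cross_attention else 1
    feed_forward = 2 * hidden * mlp          # up projection and down projection
    return attention_layers * attention + feed_forward


total = NUM_BLOCKS * block_params(HIDDEN_WIDTH, MLP_WIDTH)
print(head_dim(HIDDEN_WIDTH, NUM_HEADS))     # 128
print(f"{total / 1e9:.2f}B parameters")      # 1.32B parameters
```

Under these assumptions the estimate lands close to the 1.3B per-model figure listed in the specifications summary, which suggests the table describes a single DiT rather than the whole system.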

Channel	Property	Range
c	Base Color (RGB)	0–1 per channel
m	Metallic	0–1
r	Roughness	0–1
alpha	Opacity	0–1
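
As a small illustration of the table, the sketch below packs one material sample into a typed record and clamps every channel to its 0–1 range. The `PBRSample` class and the six-value channel order (three color values, then metallic, roughness, opacity) are hypothetical choices for this example, not the model's actual output format.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


def _clamp01(x: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return min(1.0, max(0.0, x))


@dataclass(frozen=True)
class PBRSample:
    base_color: Tuple[float, float, float]  # c: RGB, each 0-1
    metallic: float                         # m
    roughness: float                        # r
    opacity: float                          # alpha

    @classmethod
    def from_raw(cls, raw: Sequence[float]) -> "PBRSample":
        """Build a sample from six raw values in the assumed order c(3), m, r, alpha."""
        if len(raw) != 6:
            raise ValueError("expected 6 channels: R, G, B, metallic, roughness, opacity")
        values = [_clamp01(x) for x in raw]
        return cls(tuple(values[:3]), values[3], values[4], values[5])


sample = PBRSample.from_raw([1.2, 0.5, -0.1, 0.0, 0.7, 1.0])
print(sample)
```

Clamping matters because a generative model can overshoot slightly, and renderers generally expect these channels to stay within range.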

Resolution	Total Time	Shape Stage	Material Stage	Hardware
512³	~3 seconds	~2s	~1s	NVIDIA H100
1024³	~17 seconds	~10s	~7s	NVIDIA H100
1536³	~60 seconds	~35s	~25s	NVIDIA H100
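
One way to read this table is to ask how quickly total time grows with resolution. The short script below checks that the two stage times add up to each total, then fits a scaling exponent between neighboring rows. The timings are the approximate values from the table, so the exponents are only indicative; the helper names are ours.

```python
import math

# (shape_seconds, material_seconds, total_seconds), copied from the table above
TIMINGS = {
    512: (2.0, 1.0, 3.0),
    1024: (10.0, 7.0, 17.0),
    1536: (35.0, 25.0, 60.0),
}


def stages_add_up(timings: dict) -> bool:
    """True when shape time plus material time equals the listed total in every row."""
    return all(shape + material == total for shape, material, total in timings.values())


def scaling_exponent(res_a: int, res_b: int, timings: dict) -> float:
    """Exponent p such that total time grows like resolution**p between two rows."""
    time_a = timings[res_a][2]
    time_b = timings[res_b][2]
    return math.log(time_b / time_a) / math.log(res_b / res_a)


print(stages_add_up(TIMINGS))
print(round(scaling_exponent(512, 1024, TIMINGS), 2))
print(round(scaling_exponent(1024, 1536, TIMINGS), 2))
```

A dense cubic grid would scale with an exponent of 3. The first step comes in at roughly 2.5, while the second is closer to 3, so these rounded figures do not pin down a single scaling law.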

Field-Free: No Implicit Decoding Needed

Handles Complex Topology

SC-VAE: 16x Spatial Compression

How SC-VAE Works

Two Separate SC-VAEs

The DiT Backbone: Flow-Matching Diffusion Transformer

Architecture Specs

Key Techniques

Flow Matching with Rectified Flow

The Three-Stage Generation Pipeline

Stage 1: Sparse Structure Generation

Stage 2: Geometry Generation

Stage 3: Material Generation

How the Image Conditions Generation

Performance Across Resolutions

O-Voxel vs SLAT: What Changed from v1

PBR Texture Generation: Materials in 3D Space

How It Works

Standalone Texture Generation

Multi-Format Output

FlexGEMM: Custom Sparse Convolution Backend

Technical Specifications Summary

Try TRELLIS 2 Yourself

Aspect	TRELLIS v1 (SLAT)	TRELLIS 2 (O-Voxel)
Representation	Structured latent (SLAT)	Field-free sparse voxel
Compression	4x spatial downsampling	16x spatial downsampling
Mesh decoding	Via neural field query	Direct Dual Contouring (< 100ms)
Topology	Closed surfaces only	Open, non-manifold, internal
Textures	Multi-view / UV-based	Native 3D material inference
Material type	Basic color	Full PBR (color, metallic, roughness, opacity)
Parameters	~2 billion	~4 billion
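
The compression row is easier to appreciate with numbers. The sketch below computes the side length and dense cell count of the latent grid for both downsampling factors at 1024³, then compares the 16x grid with the roughly 9,600 tokens quoted in the specifications summary. It assumes tokens correspond to occupied cells of the downsampled grid, which is our reading rather than a stated fact.

```python
def latent_side(resolution: int, factor: int) -> int:
    """Side length of the latent grid after spatial downsampling by `factor`."""
    if resolution % factor != 0:
        raise ValueError("resolution must be divisible by the downsampling factor")
    return resolution // factor


RESOLUTION = 1024
for factor in (4, 16):
    side = latent_side(RESOLUTION, factor)
    print(f"{factor}x: {side}^3 = {side ** 3:,} dense cells")

TOKENS_AT_1024 = 9_600  # approximate figure from the specifications summary
dense_cells_16x = latent_side(RESOLUTION, 16) ** 3
occupancy = TOKENS_AT_1024 / dense_cells_16x
print(f"occupied fraction at 16x: {occupancy:.1%}")
```

Going from 4x to 16x shrinks the dense latent grid by a factor of 64, and sparsity cuts it further: only a few percent of the 64³ cells need a token at all.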

Format	Extension	How It's Derived
Mesh	`.glb`	Dual Contouring from O-Voxel features
3D Gaussian Splatting	`.ply`	Gaussians positioned at voxel centers
NeRF	`.npz`	Neural field fitted to O-Voxel data
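
For the mesh row, the connectivity step of textbook Dual Contouring can be sketched in a few lines: every crossed grid edge produces one quad that joins the vertices of the four voxels sharing that edge. This is a generic illustration, not the TRELLIS 2 decoder. It assumes each voxel owns the three edges leaving its minimum corner, ignores face orientation, and does not use the split weight.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]
Cell = Tuple[Tuple[float, float, float], Tuple[bool, bool, bool]]  # (v, delta)

# For an edge along each axis, offsets of the four voxels sharing it, in ring order.
EDGE_NEIGHBORS = {
    0: [(0, 0, 0), (0, -1, 0), (0, -1, -1), (0, 0, -1)],  # edge along x
    1: [(0, 0, 0), (0, 0, -1), (-1, 0, -1), (-1, 0, 0)],  # edge along y
    2: [(0, 0, 0), (-1, 0, 0), (-1, -1, 0), (0, -1, 0)],  # edge along z
}


def extract_quads(cells: Dict[Coord, Cell]):
    """Return (vertices, quads); each quad holds four indices into `vertices`."""
    order = sorted(cells)
    index = {coord: n for n, coord in enumerate(order)}
    # One vertex per occupied voxel: grid coordinate plus local offset v.
    vertices = [tuple(c + o for c, o in zip(coord, cells[coord][0])) for coord in order]
    quads: List[Tuple[int, int, int, int]] = []
    for coord in order:
        _, delta = cells[coord]
        for axis, crossed in enumerate(delta):
            if not crossed:
                continue
            ring = [tuple(c + d for c, d in zip(coord, off)) for off in EDGE_NEIGHBORS[axis]]
            if all(r in index for r in ring):  # skip edges with a missing neighbor
                quads.append(tuple(index[r] for r in ring))
    return vertices, quads


center = (0.5, 0.5, 0.5)
no_edges = (False, False, False)
cells = {
    (0, 0, 0): (center, no_edges),
    (0, 1, 0): (center, no_edges),
    (1, 0, 0): (center, no_edges),
    (1, 1, 0): (center, (False, False, True)),  # surface crosses this voxel's z edge
}
vertices, quads = extract_quads(cells)
print(quads)
```

Because the vertex positions and edge flags are stored explicitly, this step is a table lookup rather than a neural field query, which is consistent with the fast mesh decoding the article describes.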

Specification	Value
Total parameters	~4 billion
DiT models	3 × 1.3B
SC-VAE parameters	~800M (shape + material)
Max resolution	1536³ voxels
Compression ratio	16x (1024³ → ~9,600 tokens)
Image encoder	DINOv3-L
Flow matching	Rectified Flow, logitNormal(1,1)
License	MIT
Training data	500,000+ diverse 3D objects
Conference	CVPR 2025
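
The flow matching row can be unpacked with a short sketch of how a rectified-flow training example is typically built. Here `logitNormal(1,1)` is read as "draw a normal sample with mean 1 and standard deviation 1, then pass it through a sigmoid", and time runs from clean data at t=0 to pure noise at t=1. Both the reading and the time convention are assumptions for illustration; the function names are ours.

```python
import math
import random
from typing import List


def sample_timestep(rng: random.Random, mean: float = 1.0, std: float = 1.0) -> float:
    """Logit-normal sample in (0, 1): sigmoid of a normal draw."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))


def interpolate(data: List[float], noise: List[float], t: float) -> List[float]:
    """Rectified flow path: a straight line from data (t=0) to noise (t=1)."""
    return [(1.0 - t) * d + t * n for d, n in zip(data, noise)]


def target_velocity(data: List[float], noise: List[float]) -> List[float]:
    """Constant velocity along the straight path; the network is trained to predict it."""
    return [n - d for d, n in zip(data, noise)]


rng = random.Random(0)
data = [0.0, 2.0]
noise = [rng.gauss(0.0, 1.0) for _ in data]
t = sample_timestep(rng)
print(t, interpolate(data, noise, t), target_velocity(data, noise))
```

With a mean of 1, the sampled timesteps lean toward the noisy end of the path, so under this reading training spends more effort on the harder, high-noise part of the trajectory.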

Approach	Local Install	Our Platform
Setup time	2-4 hours	0 minutes
GPU required	NVIDIA 16GB+ VRAM	None
Technical knowledge	Python, CUDA	None
Max resolution	Limited by your GPU	Up to 1536³