
Last updated: May 3, 2026
TRELLIS 2 generates production-ready 3D models from a single image in about 3 seconds at 512³ resolution on an NVIDIA H100. But what happens between uploading a photo and downloading a GLB file? This article breaks down the architecture behind Microsoft Research's 3D generation model, from the O-Voxel representation to the three-stage diffusion pipeline, without requiring a machine learning background.
Before understanding TRELLIS 2, it helps to understand why 3D generation is harder than 2D image generation.
A 2D image is a grid of pixels — simple, uniform, and easy for neural networks to process. But 3D objects are fundamentally different. A 3D model has:
Traditional 3D representations force trade-offs. Neural Radiance Fields (NeRF) produce photorealistic results but are slow to render. Signed Distance Fields (SDF) handle geometry well but struggle with open surfaces. Polygon meshes are fast but hard for AI to generate directly.
TRELLIS 2 introduces a new representation called O-Voxel that is designed to address these trade-offs together.
O-Voxel (Omni-Voxel) is a field-free sparse voxel representation. Let's break that down.
A 3D object occupies a tiny fraction of the space around it. Instead of storing information for every point in a 3D grid, O-Voxel only stores data for occupied voxels — the ones that contain part of the object. This makes the representation compact and efficient.
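To make the sparsity argument concrete, here is a minimal sketch that voxelizes a hollow sphere surface and compares the number of occupied voxels with the size of the full grid. The function name, the 32³ resolution, and the sphere radius are illustrative assumptions, not part of TRELLIS 2.

```python
# Illustrative only: count how many voxels a thin spherical surface touches.
def sphere_shell_voxels(resolution: int, radius: float = 0.4) -> set[tuple[int, int, int]]:
    """Return indices of voxels whose centre lies within half a voxel of the sphere surface."""
    occupied = set()
    half_cell = 0.5 / resolution
    for i in range(resolution):
        for j in range(resolution):
            for k in range(resolution):
                # Voxel centre in a unit cube centred on the origin.
                x = (i + 0.5) / resolution - 0.5
                y = (j + 0.5) / resolution - 0.5
                z = (k + 0.5) / resolution - 0.5
                dist = (x * x + y * y + z * z) ** 0.5
                if abs(dist - radius) <= half_cell:
                    occupied.add((i, j, k))
    return occupied


resolution = 32
shell = sphere_shell_voxels(resolution)
dense_cells = resolution ** 3
fill_ratio = len(shell) / dense_cells
print(f"{len(shell)} of {dense_cells} voxels occupied ({fill_ratio:.1%})")
```

Because a surface grows with the square of the resolution while the grid grows with the cube, the occupied fraction shrinks further as resolution increases.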
Previous methods like NeRF and SDF use implicit neural fields — you query a neural network at any 3D point and it returns a value. This is flexible but slow because every query requires a forward pass through a network.
O-Voxel stores geometry data directly on each voxel, eliminating the need for an implicit field decoder. Each occupied voxel contains:
| Feature | What It Stores |
|---|---|
| v (vertex position) | Precise vertex coordinates inside the voxel |
| delta (edge crossing) | Binary flags indicating which edges the surface crosses |
| gamma (split weight) | Handles multi-surface intersections within one voxel |
This data is sufficient to reconstruct a mesh using a technique called Dual Contouring — and it's fast. Converting O-Voxel to mesh takes under 100 milliseconds with CUDA acceleration.
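The following sketch shows the general Dual Contouring idea on a sparse voxel dictionary. It assumes a simplified layout in which each voxel stores one vertex and three crossing flags for the edges leaving its minimum corner along x, y, and z; the class and function names are hypothetical, and the real O-Voxel layout and CUDA kernels may differ.

```python
from dataclasses import dataclass


@dataclass
class OVoxel:
    vertex: tuple[float, float, float]  # v: one surface vertex inside the voxel
    crossing: tuple[bool, bool, bool]   # delta: surface crosses the x/y/z edge from the min corner
    split_weight: float = 0.5           # gamma: stored, but unused in this sketch


# For an edge along each axis, index offsets of the four voxels that share it.
EDGE_NEIGHBOURS = {
    0: [(0, 0, 0), (0, -1, 0), (0, -1, -1), (0, 0, -1)],  # x-aligned edge
    1: [(0, 0, 0), (0, 0, -1), (-1, 0, -1), (-1, 0, 0)],  # y-aligned edge
    2: [(0, 0, 0), (-1, 0, 0), (-1, -1, 0), (0, -1, 0)],  # z-aligned edge
}


def dual_contour_quads(voxels: dict[tuple[int, int, int], OVoxel]) -> list[list[tuple[float, float, float]]]:
    """Emit one quad per crossed edge by joining the vertices of the voxels around it."""
    quads = []
    for (i, j, k), voxel in voxels.items():
        for axis, crossed in enumerate(voxel.crossing):
            if not crossed:
                continue
            ring = [(i + di, j + dj, k + dk) for di, dj, dk in EDGE_NEIGHBOURS[axis]]
            # Skip edges on the boundary of the stored region (winding order is ignored here).
            if all(idx in voxels for idx in ring):
                quads.append([voxels[idx].vertex for idx in ring])
    return quads


# Toy input: a 3x3 patch of voxels cut by the horizontal plane z = 1.5.
patch = {
    (i, j, 1): OVoxel(vertex=(i + 0.5, j + 0.5, 1.5), crossing=(False, False, True))
    for i in range(3)
    for j in range(3)
}
quads = dual_contour_quads(patch)
```

No neural network is queried at any point: the mesh comes from dictionary lookups alone, which is why this step is cheap and parallelizes well.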
Most 3D representations only handle closed, manifold surfaces (like a sphere). O-Voxel's Flexible Dual Grid formulation also handles open surfaces, non-manifold geometry, and internal structures.
This is a significant advantage over TRELLIS v1's SLAT representation, which had limited topology support.
Generating 3D at high resolution requires a lot of data. A 1024³ voxel grid contains over a billion potential points. TRELLIS 2 uses a Sparse Compression VAE (SC-VAE) to compress this down to approximately 9,600 latent tokens — a 16x spatial compression ratio.
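The numbers above can be checked with a few lines of arithmetic. This sketch assumes the 16x figure applies along each axis, which is one reading of "spatial compression"; the variable names are illustrative.

```python
full_resolution = 1024
downsample_per_axis = 16   # assumption: the 16x factor applies along each axis
latent_tokens = 9_600      # approximate token count quoted in the text

dense_voxels = full_resolution ** 3
latent_resolution = full_resolution // downsample_per_axis
dense_latent_cells = latent_resolution ** 3
occupied_fraction = latent_tokens / dense_latent_cells

print(dense_voxels)                 # 1073741824
print(latent_resolution)            # 64
print(dense_latent_cells)           # 262144
print(f"{occupied_fraction:.1%}")   # 3.7%
```

Under that assumption, the latent grid is 64³, and the roughly 9,600 tokens correspond to only the small fraction of latent cells that the object's surface actually touches.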
SC-VAE is a sparse convolutional U-Net with approximately 800 million parameters (354M encoder + 474M decoder). It includes three key innovations:
1. Sparse Residual Autoencoding
Non-parametric skip connections that preserve fine geometric details even at high compression ratios. Unlike learned skip connections, these don't introduce extra parameters or risk overfitting.
2. Early-Pruning Upsampler
During decoding (upsampling), the model predicts which child voxels are actually occupied. It skips computation on empty voxels entirely — this dramatically reduces memory and compute during generation.
3. Optimized Residual Block
Uses ConvNeXt-style design — a single convolution followed by a wide pointwise MLP. More efficient than traditional residual blocks while maintaining expressiveness.
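The early-pruning idea from the second item can be sketched without any neural network. In this hypothetical example, `keep_child` stands in for the learned occupancy prediction, and each parent voxel is split into its 8 children at double resolution.

```python
from typing import Callable

Index = tuple[int, int, int]


def upsample_with_pruning(parents: set[Index], keep_child: Callable[[Index], bool]) -> set[Index]:
    """Split each parent voxel into 8 children and keep only those predicted occupied."""
    children = set()
    for (i, j, k) in parents:
        for di in (0, 1):
            for dj in (0, 1):
                for dk in (0, 1):
                    child = (2 * i + di, 2 * j + dj, 2 * k + dk)
                    if keep_child(child):  # stand-in for the learned occupancy prediction
                        children.add(child)
    return children


# Toy predictor (hypothetical): the surface lies on the layer z == 4.
parents = {(x, y, 2) for x in range(4) for y in range(4)}
children = upsample_with_pruning(parents, lambda child: child[2] == 4)
unpruned_count = 8 * len(parents)
```

In this toy case half of the candidate children are discarded before any further layers would run on them, which is the source of the memory and compute savings.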
TRELLIS 2 trains two independent SC-VAEs:
| SC-VAE | Encodes/Decodes | Conditioned On |
|---|---|---|
| Shape SC-VAE | Geometry (vertex positions, edge crossings) | Input image |
| Material SC-VAE | PBR materials (color, metalness, roughness) | Input image + shape latents |
The material SC-VAE is conditioned on the shape output, ensuring textures align correctly with geometry.
At the heart of TRELLIS 2 are three Diffusion Transformers (DiT), one for each generation stage. Each has approximately 1.3 billion parameters (roughly 4 billion across all three).
| Parameter | Value |
|---|---|
| Hidden width | 1,536 |
| Transformer blocks | 30 |
| Attention heads | 12 |
| MLP width | 8,192 |
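As a rough sanity check, the table's values are consistent with the quoted 1.3 billion parameters per DiT. The estimate below is a back-of-the-envelope sketch: it assumes each block contains one self-attention layer, one cross-attention layer, and a two-layer MLP, and it ignores biases, normalization, embeddings, and modulation weights.

```python
hidden = 1536
blocks = 30
heads = 12
mlp_width = 8192

head_dim = hidden // heads                 # per-head dimension
attn_params = 4 * hidden * hidden          # Q, K, V, and output projections
mlp_params = 2 * hidden * mlp_width        # up and down projections
per_block = 2 * attn_params + mlp_params   # assumption: self-attention + cross-attention
total = blocks * per_block

print(head_dim, f"{total / 1e9:.2f}B")     # 128 1.32B
```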
RoPE (Rotary Position Embedding): Enables the model to generalize across resolutions. It is trained at 512³ and 1024³, but can extrapolate to 1536³ at inference time.
QK-Norm: Applies RMSNorm to the Query and Key vectors in attention, preventing attention score explosion and improving training stability.
AdaLN-Single: Adaptive Layer Normalization injects conditioning signals (image features, timestep) through a shared MLP, reducing parameter count compared to per-layer modulation.
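The first two components can be illustrated in plain Python. This is a simplified 1D sketch rather than the model's actual implementation: the real model presumably applies RoPE over 3D voxel coordinates, and the RMSNorm here omits the learned scale. It shows two properties: normalized vectors have unit root-mean-square regardless of their input magnitude, and the rotated dot product depends only on the relative offset between positions.

```python
import math


def rope(vec: list[float], position: float, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs of `vec` by position-dependent angles (1D RoPE)."""
    dim = len(vec)
    out = []
    for pair in range(dim // 2):
        angle = position * base ** (-2 * pair / dim)
        x, y = vec[2 * pair], vec[2 * pair + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


def rms_norm(vec: list[float], eps: float = 1e-6) -> list[float]:
    """Scale `vec` to unit root-mean-square (learned gain omitted)."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    return [x / math.sqrt(mean_sq + eps) for x in vec]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# QK-Norm: both vectors end up at the same scale, whatever their raw magnitude.
q = rms_norm([0.3, -1.2, 4.0, 0.5])
k = rms_norm([100.0, 20.0, -50.0, 7.0])

# RoPE: the score depends on the offset (2 in both cases), not the absolute position.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 105), rope(k, 103))
```

The relative-position property is what makes extrapolation to larger grids plausible: a pair of voxels a fixed distance apart produces the same attention score wherever it sits.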
TRELLIS 2 uses Rectified Flow, a flow-matching formulation that maps noise to data along straight paths. Compared to traditional DDPM diffusion, these straighter trajectories can generally be integrated in fewer sampling steps.
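A toy sketch of the straight-path idea follows. It assumes the convention that t = 0 is noise and t = 1 is data (some implementations reverse this), and it uses an oracle velocity in place of a trained network, so it only illustrates the geometry of the paths.

```python
def interpolate(noise: list[float], data: list[float], t: float) -> list[float]:
    """Point on the straight line between a noise sample (t=0) and a data sample (t=1)."""
    return [(1 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise: list[float], data: list[float]) -> list[float]:
    """Training target for the network: the constant velocity along that line."""
    return [d - n for n, d in zip(noise, data)]


def euler_sample(noise: list[float], velocity_fn, steps: int) -> list[float]:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for step in range(steps):
        velocity = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


noise = [0.5, -1.0, 2.0]
data = [1.0, 1.0, 1.0]


def oracle(x, t):
    # Hypothetical stand-in for the trained DiT: returns the exact target velocity.
    return target_velocity(noise, data)


one_step = euler_sample(noise, oracle, steps=1)
ten_steps = euler_sample(noise, oracle, steps=10)
midpoint = interpolate(noise, data, 0.5)
```

With a perfectly straight path, one Euler step and ten Euler steps land on the same point. A trained network only approximates this, so real samplers still use multiple steps.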
Here's what happens when you upload an image to TRELLIS 2:
Stage 1. Input: an image (or text prompt). Output: a binary voxel occupancy grid.
The first DiT model predicts which voxels in 3D space are occupied. Think of this as "sketching the outline" — the model determines the rough shape and extent of the object without any fine detail.
Stage 2. Input: occupancy grid + image features. Output: geometric latent representations.
The second DiT generates geometry latents for each occupied voxel. The Shape SC-VAE decodes these into O-Voxel features (vertex positions, edge crossings, split weights), which are then converted to a mesh via Dual Contouring.
Stage 3. Input: shape latents + image features. Output: material latent representations.
The third DiT generates material latents conditioned on the shape output. The Material SC-VAE decodes these into PBR material properties for each voxel:
| Channel | Property | Range |
|---|---|---|
| c | Base Color (RGB) | 0–1 per channel |
| m | Metallic | 0–1 |
| r | Roughness | 0–1 |
| alpha | Opacity | 0–1 |
Materials are generated directly in 3D space, not via UV mapping or multi-view rendering. This means textures are seamless and consistent from every angle.
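One way to picture per-voxel materials is as a small record attached to each occupied voxel index. The sketch below is a hypothetical container, not TRELLIS 2's actual data structure; it only encodes the four channels and the 0 to 1 ranges from the table.

```python
from dataclasses import dataclass


def clamp01(value: float) -> float:
    """Clamp a decoded value into the valid 0..1 range."""
    return min(1.0, max(0.0, value))


@dataclass(frozen=True)
class VoxelMaterial:
    base_color: tuple[float, float, float]  # c: RGB
    metallic: float                         # m
    roughness: float                        # r
    opacity: float                          # alpha

    @classmethod
    def from_raw(cls, c, m, r, alpha) -> "VoxelMaterial":
        """Build a material from raw decoder outputs, clamping every channel."""
        return cls(tuple(clamp01(x) for x in c), clamp01(m), clamp01(r), clamp01(alpha))


# Materials are keyed by voxel index, so no UV coordinates are involved.
materials = {(3, 1, 4): VoxelMaterial.from_raw((1.2, 0.5, -0.1), 0.0, 0.8, 1.0)}
```

Because the lookup key is a 3D position rather than a 2D texture coordinate, there are no UV seams to stitch.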
The input image is processed by a DINOv3-L encoder, which extracts visual features. These features are injected into each DiT model via cross-attention — the same mechanism used in Stable Diffusion for text conditioning.
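Cross-attention itself is a small computation. The single-head sketch below omits the learned projection matrices and uses hand-picked toy vectors, so it illustrates the mechanism rather than the model's layer: each voxel token (query) takes a weighted average of image features (values), with weights given by a softmax over query-key similarity.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head attention: queries come from voxel tokens, keys/values from image tokens."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


voxel_queries = [[1.0, 0.0], [0.0, 1.0]]
image_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
image_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
outputs, weights = cross_attention(voxel_queries, image_keys, image_values)
```

Here the first voxel token attends most strongly to the first image token and the second to the second, because those are the keys they align with.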
| Resolution | Total Time | Shape | Material | Hardware |
|---|---|---|---|---|
| 512³ | ~3 seconds | ~2s | ~1s | NVIDIA H100 |
| 1024³ | ~17 seconds | ~10s | ~7s | NVIDIA H100 |
| 1536³ | ~60 seconds | ~35s | ~25s | NVIDIA H100 |
The model uses progressive training — starting at 512³, then scaling to 1024³. The 1536³ mode uses test-time scaling through cascaded inference, applying the model iteratively to refine the output.
| Aspect | TRELLIS v1 (SLAT) | TRELLIS 2 (O-Voxel) |
|---|---|---|
| Representation | Structured latent (SLAT) | Field-free sparse voxel |
| Compression | 4x spatial downsampling | 16x spatial downsampling |
| Mesh decoding | Via neural field query | Direct Dual Contouring (< 100ms) |
| Topology | Closed surfaces only | Open, non-manifold, internal |
| Textures | Multi-view / UV-based | Native 3D material inference |
| Material type | Basic color | Full PBR (color, metallic, roughness, opacity) |
| Parameters | ~2 billion | 4 billion |
The move from SLAT to O-Voxel is the single biggest architectural change. By eliminating implicit fields and storing geometry directly, TRELLIS 2 achieves faster decoding, better topology handling, and higher visual quality.
TRELLIS 2's material generation is notable because it operates entirely in 3D, not through traditional UV mapping or multi-view synthesis.
Stage 3 can run independently. Given an existing 3D shape, you can generate new textures for it. This enables:
The same O-Voxel representation can be decoded into multiple output formats:
| Format | Extension | How It's Derived |
|---|---|---|
| Mesh | .glb | Dual Contouring from O-Voxel features |
| 3D Gaussian Splatting | .ply | Gaussians positioned at voxel centers |
| NeRF | .npz | Neural field fitted to O-Voxel data |
GLB export supports configurable mesh decimation (target polygon count), texture resolution up to 4096×4096, and optional WebP texture compression.
TRELLIS 2 includes a custom sparse convolution backend called FlexGEMM, implemented in Triton:
This custom backend is part of why TRELLIS 2 achieves 3-second generation at 512³ resolution — the sparse operations that would bottleneck a standard implementation are heavily optimized.
| Specification | Value |
|---|---|
| Total parameters | ~4 billion |
| DiT models | 3 × 1.3B |
| SC-VAE parameters | ~800M (shape + material) |
| Max resolution | 1536³ voxels |
| Compression ratio | 16x (1024³ → ~9,600 tokens) |
| Image encoder | DINOv3-L |
| Flow matching | Rectified Flow, logitNormal(1,1) |
| License | MIT |
| Training data | 500,000+ diverse 3D objects |
| Conference | CVPR 2025 |
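The "logitNormal(1,1)" entry refers to how training timesteps are sampled. A minimal sketch, assuming the two numbers are the mean and standard deviation of a Gaussian that is then passed through a sigmoid (the usual definition of a logit-normal distribution); the function name is illustrative.

```python
import math
import random


def sample_logit_normal(rng: random.Random, mean: float = 1.0, std: float = 1.0) -> float:
    """Draw a timestep in (0, 1) by squashing a Gaussian sample through a sigmoid."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))


rng = random.Random(0)
timesteps = [sample_logit_normal(rng) for _ in range(10_000)]
average = sum(timesteps) / len(timesteps)
```

A positive mean shifts samples toward t = 1, so training spends more time on one end of the noise-to-data path than on the other. Which end that is depends on the time convention the implementation uses.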
Understanding the architecture is one thing — seeing it in action is another. Our platform runs TRELLIS 2 on optimized hardware so you can generate 3D models without setting up anything locally.
| Approach | Local Install | Our Platform |
|---|---|---|
| Setup time | 2-4 hours | 0 minutes |
| GPU required | NVIDIA 16GB+ VRAM | None |
| Technical knowledge | Python, CUDA | None |
| Max resolution | Limited by your GPU | Up to 1536³ |