How Does TRELLIS 2 Work: Architecture & Technology Explained (2026)
2026/05/03
9 min read
By Trellis2 Team

Deep dive into Microsoft TRELLIS 2's technical architecture. Learn how O-Voxel, SC-VAE, and flow-matching DiT work together to generate 3D models from images in about 3 seconds.

Last updated: May 3, 2026

TRELLIS 2 generates production-ready 3D models from a single image in about 3 seconds at 512³ resolution on an NVIDIA H100. But what happens between uploading a photo and downloading a GLB file? This article breaks down the architecture behind Microsoft Research's 3D generation model, from the O-Voxel representation to the three-stage diffusion pipeline, without requiring a machine learning background.

The Core Problem: How Do You Represent 3D in an AI Model?

Before understanding TRELLIS 2, it helps to understand why 3D generation is harder than 2D image generation.

A 2D image is a grid of pixels — simple, uniform, and easy for neural networks to process. But 3D objects are fundamentally different. A 3D model has:

  • Irregular geometry: surfaces that curve, fold, and intersect
  • Hollow interiors: the inside of a cup, the back of a mask
  • Open surfaces: a cape, a leaf, a piece of fabric
  • Materials: color, metalness, roughness — all varying across the surface

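Traditional 3D representations force trade-offs. Neural Radiance Fields (NeRF) produce photorealistic results but are slow to render, because every pixel requires many network queries along a ray. Signed Distance Fields (SDF) handle geometry well but struggle with open surfaces, because a signed distance needs a well-defined inside and outside. Polygon meshes render quickly but are hard for AI to generate directly, since vertex and face counts vary from object to object.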

TRELLIS 2 introduces a new representation called O-Voxel that is designed to address all of these problems at once.

O-Voxel: The Foundation of TRELLIS 2

O-Voxel (Omni-Voxel) is a field-free sparse voxel representation. Let's break that down.

Sparse: Only Store What Matters

A 3D object's surface occupies a tiny fraction of the space around it. Instead of storing information for every point in a 3D grid, O-Voxel only stores data for occupied voxels: the ones that contain part of the object. This makes the representation compact and efficient.
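
To make the saving concrete, here is a minimal sketch in plain Python. The 64³ grid and the hollow sphere are illustrative assumptions, not values from TRELLIS 2. The script keeps only the voxels the sphere's surface passes through and compares that count with the size of the dense grid.

```python
# Toy illustration of sparse voxel storage (grid size and shape are assumptions).
N = 64                      # grid resolution per axis
center = (N - 1) / 2.0
radius = 24.0

occupied = set()            # sparse storage: only the (i, j, k) of surface voxels
for i in range(N):
    for j in range(N):
        for k in range(N):
            d = ((i - center) ** 2 + (j - center) ** 2 + (k - center) ** 2) ** 0.5
            if abs(d - radius) < 0.5:   # voxel lies on the sphere's shell
                occupied.add((i, j, k))

dense_cells = N ** 3
fill_ratio = len(occupied) / dense_cells
print(f"dense cells: {dense_cells}, occupied: {len(occupied)}, ratio: {fill_ratio:.2%}")
```

In this toy case only a few percent of the grid is occupied. The gap widens at higher resolutions, because a surface grows with the square of the resolution while the grid grows with its cube.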

Field-Free: No Implicit Decoding Needed

Previous methods such as NeRF and neural SDFs use implicit neural fields: you query a neural network at any 3D point and it returns a value. This is flexible but slow, because every query requires a forward pass through a network.

O-Voxel stores geometry data directly on each voxel, eliminating the need for an implicit field decoder. Each occupied voxel contains:

Feature | What It Stores
v (vertex position) | Precise vertex coordinates inside the voxel
delta (edge crossing) | Binary flags indicating which edges the surface crosses
gamma (split weight) | Handles multi-surface intersections within one voxel

This data is sufficient to reconstruct a mesh using a technique called Dual Contouring — and it's fast. Converting O-Voxel to mesh takes under 100 milliseconds with CUDA acceleration.
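
The following is a simplified sketch of how Dual Contouring can turn such per-voxel data into faces. It is not the TRELLIS 2 implementation. It assumes three edge flags per voxel (one per axis, for the edges leaving the voxel's minimum corner), ignores gamma, and does not orient the faces. The function names and data layout are hypothetical.

```python
def dual_contour_quads(coords, delta):
    """Build quad faces from sparse voxels.

    coords: list of (i, j, k) integer indices of occupied voxels.
    delta:  3 booleans per voxel; flag `a` says the surface crosses the grid
            edge that leaves the voxel's minimum corner along axis `a`
            (this per-axis layout is an assumption for illustration).
    Returns a list of quads, each a list of 4 indices into `coords`.
    """
    index = {tuple(c): n for n, c in enumerate(coords)}
    quads = []
    for c, flags in zip(coords, delta):
        for axis in range(3):
            if not flags[axis]:
                continue
            u, w = [a for a in range(3) if a != axis]
            ring = []
            # The four voxels that share this edge, visited in cyclic order.
            for du, dw in ((0, 0), (-1, 0), (-1, -1), (0, -1)):
                n = list(c)
                n[u] += du
                n[w] += dw
                ring.append(index.get(tuple(n)))
            if None not in ring:          # skip edges with a missing neighbour
                quads.append(ring)
    return quads


def vertex_positions(coords, v, voxel_size=1.0):
    """World-space vertex per voxel: voxel index plus in-voxel offset v in [0, 1]."""
    return [tuple((c[a] + off[a]) * voxel_size for a in range(3))
            for c, off in zip(coords, v)]


# Four voxels around one x-aligned edge; only the first voxel flags a crossing.
coords = [(0, 1, 1), (0, 0, 1), (0, 0, 0), (0, 1, 0)]
delta = [(True, False, False)] + [(False, False, False)] * 3
v = [(0.5, 0.5, 0.5)] * 4
quads = dual_contour_quads(coords, delta)
verts = vertex_positions(coords, v)
print(quads, verts[0])
```

Each crossed edge yields one quad connecting the vertices of the four voxels around it, so the whole mesh falls out of table lookups with no network queries. That is why this step maps well to a GPU.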

Handles Complex Topology

Many 3D representations, especially those built on signed distance fields, only handle closed, manifold surfaces (like a sphere). O-Voxel's Flexible Dual Grid formulation handles:

  • Open surfaces — capes, leaves, fabric
  • Non-manifold geometry — where multiple surfaces meet at an edge
  • Internal cavities — hollow objects with enclosed empty space

This is a significant advantage over TRELLIS v1's SLAT representation, which had limited topology support.

SC-VAE: 16x Spatial Compression

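Generating 3D at high resolution requires a lot of data. A 1024³ voxel grid contains over a billion potential points. TRELLIS 2 uses a Sparse Compression VAE (SC-VAE) with 16x spatial downsampling to compress an asset at this resolution to approximately 9,600 latent tokens. The 16x figure applies per axis: a 1024³ grid maps to a 64³ latent grid, and only the occupied latent cells become tokens.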
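
The arithmetic behind these figures can be checked in a few lines. This sketch assumes the 16x factor applies along each axis and treats the 9,600-token count as the approximate figure quoted above.

```python
resolution = 1024          # voxels per axis (from the text)
downsample = 16            # spatial downsampling per axis (assumed per-axis)
latent_tokens = 9_600      # approximate token count quoted in the text

dense_voxels = resolution ** 3              # cells in the full-resolution grid
latent_side = resolution // downsample      # latent grid size per axis
latent_cells = latent_side ** 3             # cells in the latent grid
occupancy = latent_tokens / latent_cells    # share of latent cells that hold a token

print(f"dense voxels:  {dense_voxels:,}")
print(f"latent grid:   {latent_side}^3 = {latent_cells:,} cells")
print(f"occupied share: {occupancy:.1%}")
```

Under these assumptions, fewer than 4% of the latent cells carry a token. Sparsity therefore contributes as much to the small token count as the downsampling itself.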

How SC-VAE Works

SC-VAE is a sparse convolutional U-Net with approximately 800 million parameters (354M encoder + 474M decoder). It includes three key innovations:

1. Sparse Residual Autoencoding

Non-parametric skip connections that preserve fine geometric details even at high compression ratios. Unlike learned skip connections, these don't introduce extra parameters or risk overfitting.

2. Early-Pruning Upsampler

During decoding (upsampling), the model predicts which child voxels are actually occupied. It skips computation on empty voxels entirely — this dramatically reduces memory and compute during generation.
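
As a rough illustration of the pruning idea (not the actual SC-VAE code), the sketch below doubles the resolution of a sparse voxel set. Each parent voxel has 8 potential children, and a boolean mask decides which ones are kept. The mask is hard-coded here as a stand-in for the model's occupancy prediction.

```python
# Offsets of the 8 children of a voxel when the resolution doubles.
CHILD_OFFSETS = [(dx, dy, dz) for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]


def upsample_with_pruning(parents, child_occupancy):
    """Expand parents into children, keeping only those marked occupied.

    parents:         list of (i, j, k) coarse voxel indices.
    child_occupancy: 8 booleans per parent (hypothetical stand-in for the
                     occupancy predicted by the decoder).
    """
    children = []
    for (i, j, k), mask in zip(parents, child_occupancy):
        for (dx, dy, dz), keep in zip(CHILD_OFFSETS, mask):
            if keep:  # empty children are dropped before any further computation
                children.append((2 * i + dx, 2 * j + dy, 2 * k + dz))
    return children


parents = [(0, 0, 0), (1, 0, 0)]
mask = [[True] + [False] * 7, [False] * 7 + [True]]
children = upsample_with_pruning(parents, mask)
print(children)
```

Without pruning, every upsampling step would multiply the voxel count by 8. With it, the count tracks the surface rather than the volume.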

3. Optimized Residual Block

Uses ConvNeXt-style design — a single convolution followed by a wide pointwise MLP. More efficient than traditional residual blocks while maintaining expressiveness.

Two Separate SC-VAEs

TRELLIS 2 trains two independent SC-VAEs:

SC-VAE | Encodes/Decodes | Conditioned On
Shape SC-VAE | Geometry (vertex positions, edge crossings) | Input image
Material SC-VAE | PBR materials (color, metalness, roughness) | Input image + shape latents

The material SC-VAE is conditioned on the shape output, ensuring textures align correctly with geometry.

The DiT Backbone: Flow-Matching Diffusion Transformer

At the heart of TRELLIS 2 are three Diffusion Transformers (DiT), one for each generation stage. Each has approximately 1.3 billion parameters, or roughly 4 billion in total across all three.

Architecture Specs

Parameter | Value
Hidden width | 1,536
Transformer blocks | 30
Attention heads | 12
MLP width | 8,192
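
These specs can be sanity-checked against the quoted 1.3 billion parameters with a back-of-the-envelope count. The estimate below assumes each block contains one self-attention layer, one cross-attention layer, and one two-layer MLP. It ignores biases, normalization, embeddings, and modulation layers. The block layout is an assumption, not something the specs state.

```python
hidden = 1536
blocks = 30
heads = 12
mlp_width = 8192

head_dim = hidden // heads                    # channels per attention head
attn_params = 4 * hidden * hidden             # Q, K, V and output projections
mlp_params = 2 * hidden * mlp_width           # up- and down-projection
# Assumption: self-attention + cross-attention + one MLP per block.
per_block = 2 * attn_params + mlp_params
total = blocks * per_block

print(f"head dim: {head_dim}")
print(f"per block: {per_block:,}")
print(f"all blocks: {total:,}")
```

Under these assumptions the blocks alone come to about 1.32 billion weights, which is consistent with the approximate per-model figure.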

Key Techniques

RoPE (Rotary Position Embedding) — Enables the model to generalize across resolutions. Trained at 512³ and 1024³, but can extrapolate to 1536³ at inference time.
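
The property that makes this possible is that attention scores under RoPE depend only on relative position. The sketch below shows a one-dimensional version in plain Python. How TRELLIS 2 extends it to three spatial axes is not described here, so treat this as the generic technique rather than the model's exact variant.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    out = []
    for p in range(d // 2):
        theta = pos * base ** (-2.0 * p / d)   # lower frequency for later pairs
        x, y = vec[2 * p], vec[2 * p + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
near = dot(rope(q, 5), rope(k, 2))            # positions 5 and 2
shifted = dot(rope(q, 105), rope(k, 102))     # same offset, moved by 100
print(near, shifted)
```

The two scores agree up to floating-point error. Positions never seen during training therefore still produce familiar relative offsets.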

QK-Norm — Applies RMSNorm normalization to Query and Key vectors in attention, preventing attention score explosion and improving training stability.
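
A small plain-Python sketch shows why this bounds the attention scores. It omits the learned gain that RMSNorm usually carries, and the input values are made up to be deliberately large.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Scale `vec` so its root-mean-square is about 1 (learned gain omitted)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def attention_logit(q, k, qk_norm=True):
    """Scaled dot-product score for one query/key pair."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


q = [500.0, -300.0, 800.0, 100.0]   # deliberately large activations
k = [400.0, -200.0, 900.0, 50.0]
raw = attention_logit(q, k, qk_norm=False)
normed = attention_logit(q, k, qk_norm=True)
print(raw, normed)
```

After normalization each vector has length close to sqrt(d), so the scaled score cannot exceed sqrt(d) in magnitude, no matter how large the raw activations grow.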

AdaLN-Single — Adaptive Layer Normalization injects conditioning signals (image features, timestep) through a shared MLP, reducing parameter count compared to per-layer modulation.
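
The modulation step itself is simple. The sketch below assumes the common scale-and-shift form of adaptive layer normalization. In a real model the shift and scale vectors would come from the MLP applied to the conditioning signal, and here they are hard-coded for illustration.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize to zero mean and unit variance, with no learned affine."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_modulate(x, shift, scale):
    """Normalize x, then apply a conditioning-dependent scale and shift."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]


x = [1.0, 2.0, 3.0, 4.0]
# Hypothetical conditioning outputs; zero scale and shift leave the norm unchanged.
identity = adaln_modulate(x, shift=[0.0] * 4, scale=[0.0] * 4)
modulated = adaln_modulate(x, shift=[0.5] * 4, scale=[1.0] * 4)
print(identity, modulated)
```

Because one set of modulation vectors can be computed once and reused by every block, the conditioning path adds few parameters.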

Flow Matching with Rectified Flow

TRELLIS 2 uses Rectified Flow — a flow-matching formulation that maps noise to data along straight paths. Compared to traditional DDPM diffusion:

  • Fewer sampling steps for comparable quality (12 steps vs 50+), which shortens inference because every step costs at least one forward pass through the DiT
  • More efficient training with logitNormal(1,1) time sampling
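
The pieces named above fit in a short sketch: logit-normal time sampling, the straight-line training pair, and a fixed-step Euler sampler. It assumes the convention that t=0 is data and t=1 is noise. A toy velocity field for a one-point dataset stands in for the DiT, so the numbers say nothing about real model quality.

```python
import math
import random


def sample_t(rng, mean=1.0, std=1.0):
    """Logit-normal time sample: sigmoid of a Gaussian draw, so t lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))


def training_pair(x0, noise, t):
    """Point on the straight path and the constant velocity the model regresses."""
    x_t = [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]
    velocity = [n - a for a, n in zip(x0, noise)]
    return x_t, velocity


def euler_sample(velocity_fn, noise, steps=12):
    """Integrate from t=1 (noise) to t=0 (data) with fixed-size Euler steps."""
    x, dt = list(noise), 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


target = [2.0, -1.0]                      # toy "dataset" with a single point


def toy_velocity(x, t):
    """Exact velocity field for the one-point dataset (stand-in for the DiT)."""
    return [(xi - ti) / t for xi, ti in zip(x, target)]


rng = random.Random(0)
t = sample_t(rng)
x_t, vel = training_pair(target, [0.0, 0.0], 0.25)
result = euler_sample(toy_velocity, noise=[rng.gauss(0, 1), rng.gauss(0, 1)])
print(t, x_t, vel, result)
```

With an exact velocity field, 12 Euler steps land on the target up to rounding. A learned field is only approximately straight, which is why real samplers still need more than one step.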

The Three-Stage Generation Pipeline

Here's what happens when you upload an image to TRELLIS 2:

Stage 1: Sparse Structure Generation

Input: An image (or text prompt)
Output: Binary voxel occupancy grid

The first DiT model predicts which voxels in 3D space are occupied. Think of this as "sketching the outline" — the model determines the rough shape and extent of the object without any fine detail.

Stage 2: Geometry Generation

Input: Occupancy grid + image features
Output: Geometric latent representations

The second DiT generates geometry latents for each occupied voxel. The Shape SC-VAE decodes these into O-Voxel features (vertex positions, edge crossings, split weights), which are then converted to a mesh via Dual Contouring.

Stage 3: Material Generation

Input: Shape latents + image features
Output: Material latent representations

The third DiT generates material latents conditioned on the shape output. The Material SC-VAE decodes these into PBR material properties for each voxel:

Channel | Property | Range
c | Base Color (RGB) | 0–1 per channel
m | Metallic | 0–1
r | Roughness | 0–1
alpha | Opacity | 0–1

Materials are generated directly in 3D space, not via UV mapping or multi-view rendering. This means textures are seamless and consistent from every angle.
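
The data flow through the three stages can be summarized as a short function. Every name here is hypothetical. The three models are passed in as plain callables, and the dummy stand-ins exist only so the sketch runs end to end. It does not reflect the actual TRELLIS 2 API.

```python
def generate_asset(image_features, structure_model, geometry_model, material_model):
    """Chain the three stages; each *_model argument is a hypothetical callable."""
    # Stage 1: which voxels are occupied?
    occupied = structure_model(image_features)
    # Stage 2: one geometry latent per occupied voxel.
    shape_latents = geometry_model(occupied, image_features)
    # Stage 3: one material latent per occupied voxel, conditioned on the shape.
    material_latents = material_model(shape_latents, image_features)
    return {"voxels": occupied, "shape": shape_latents, "material": material_latents}


# Dummy stand-ins so the data flow can be exercised without any model.
def dummy_structure(features):
    return [(0, 0, 0), (0, 0, 1)]


def dummy_geometry(voxels, features):
    return {v: [0.0] * 4 for v in voxels}


def dummy_material(shape_latents, features):
    return {v: [0.5] * 4 for v in shape_latents}


asset = generate_asset([0.1, 0.2], dummy_structure, dummy_geometry, dummy_material)
print(sorted(asset["material"]))
```

The point of the structure is that each stage only adds information on voxels the previous stage already selected, so geometry and materials share the same sparse coordinates.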

How the Image Conditions Generation

The input image is processed by a DINOv3-L encoder, which extracts visual features. These features are injected into each DiT model via cross-attention — the same mechanism used in Stable Diffusion for text conditioning.
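
Cross-attention can be sketched in a few lines: a query from the 3D side scores every image token, and the output is a weighted mix of image values. This single-head version omits the learned projections and uses made-up two-dimensional tokens, so it only illustrates the mechanism.

```python
import math


def softmax(scores):
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """One query (a voxel token) attends over image tokens (keys and values)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


voxel_token = [1.0, 0.0]
image_keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
image_values = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
out, weights = cross_attention(voxel_token, image_keys, image_values)
print(out, weights)
```

The voxel token draws most of its update from the image token whose key points the same way, which is how each part of the 3D object can pick up the image regions relevant to it.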

Performance Across Resolutions

Resolution | Total Time | Shape | Material | Hardware
512³ | ~3 seconds | ~2s | ~1s | NVIDIA H100
1024³ | ~17 seconds | ~10s | ~7s | NVIDIA H100
1536³ | ~60 seconds | ~35s | ~25s | NVIDIA H100

The model uses progressive training — starting at 512³, then scaling to 1024³. The 1536³ mode uses test-time scaling through cascaded inference, applying the model iteratively to refine the output.

O-Voxel vs SLAT: What Changed from v1

Aspect | TRELLIS v1 (SLAT) | TRELLIS 2 (O-Voxel)
Representation | Structured latent (SLAT) | Field-free sparse voxel
Compression | 4x spatial downsampling | 16x spatial downsampling
Mesh decoding | Via neural field query | Direct Dual Contouring (< 100ms)
Topology | Closed surfaces only | Open, non-manifold, internal
Textures | Multi-view / UV-based | Native 3D material inference
Material type | Basic color | Full PBR (color, metallic, roughness, opacity)
Parameters | ~2 billion | 4 billion

The move from SLAT to O-Voxel is the single biggest architectural change. By eliminating implicit fields and storing geometry directly, TRELLIS 2 achieves faster decoding, better topology handling, and higher visual quality.

PBR Texture Generation: Materials in 3D Space

TRELLIS 2's material generation is notable because it operates entirely in 3D, not through traditional UV mapping or multi-view synthesis.

How It Works

  1. The shape has already been generated (Stages 1-2)
  2. Shape latent representations are concatenated with image features
  3. The material DiT generates material latents for each occupied voxel
  4. The Material SC-VAE decodes these into PBR properties
  5. A Split-sum renderer (from nvdiffrec) ensures physical accuracy under varying lighting

Standalone Texture Generation

Stage 3 can run independently. Given an existing 3D shape, you can generate new textures for it. This enables:

  • Texture variations: Generate multiple texture sets for the same model
  • Re-texturing: Apply new materials to existing 3D assets
  • Style transfer: Generate textures in different artistic styles from the same geometry

Multi-Format Output

The same O-Voxel representation can be decoded into multiple output formats:

Format | Extension | How It's Derived
Mesh | .glb | Dual Contouring from O-Voxel features
3D Gaussian Splatting | .ply | Gaussians positioned at voxel centers
NeRF | .npz | Neural field fitted to O-Voxel data

GLB export supports configurable mesh decimation (target polygon count), texture resolution up to 4096×4096, and optional WebP texture compression.

FlexGEMM: Custom Sparse Convolution Backend

TRELLIS 2 includes a custom sparse convolution backend called FlexGEMM, implemented in Triton:

  • Masked Implicit GEMM strategy for efficient sparse operations
  • Gray code ordering for SIMD optimization
  • Split-K technique for parallel reduction
  • Up to 2x faster than existing sparse convolution libraries

This custom backend is part of why TRELLIS 2 achieves 3-second generation at 512³ resolution — the sparse operations that would bottleneck a standard implementation are heavily optimized.
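
Of the techniques listed, Split-K is the easiest to show in isolation. The sketch below illustrates the idea in plain Python with small integer matrices. It is not the Triton kernel and says nothing about FlexGEMM's actual tiling. The shared K dimension is cut into chunks, each chunk is multiplied independently, and the partial results are summed.

```python
def matmul(a, b):
    """Reference product of an (M x K) and a (K x N) matrix given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matmul_split_k(a, b, splits=2):
    """Split the shared K dimension into chunks, multiply each chunk
    independently, then reduce (sum) the partial results."""
    k_total = len(b)
    chunk = (k_total + splits - 1) // splits
    m, n = len(a), len(b[0])
    result = [[0] * n for _ in range(m)]
    for start in range(0, k_total, chunk):
        stop = min(start + chunk, k_total)
        a_part = [row[start:stop] for row in a]
        b_part = b[start:stop]
        partial = matmul(a_part, b_part)      # on a GPU these chunks run in parallel
        for i in range(m):
            for j in range(n):
                result[i][j] += partial[i][j]
    return result


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [2, 0], [0, 2]]
print(matmul(a, b), matmul_split_k(a, b, splits=2))
```

Splitting along K helps when the output matrix is small but the reduction is long: it creates more independent work items to keep the GPU busy, at the cost of a final summation.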

Technical Specifications Summary

Specification | Value
Total parameters | ~4 billion
DiT models | 3 × 1.3B
SC-VAE parameters | ~800M (shape + material)
Max resolution | 1536³ voxels
Compression ratio | 16x (1024³ → ~9,600 tokens)
Image encoder | DINOv3-L
Flow matching | Rectified Flow, logitNormal(1,1)
License | MIT
Training data | 500,000+ diverse 3D objects
Conference | CVPR 2025

Try TRELLIS 2 Yourself

Understanding the architecture is one thing — seeing it in action is another. Our platform runs TRELLIS 2 on optimized hardware so you can generate 3D models without setting up anything locally.

Approach | Local Install | Our Platform
Setup time | 2-4 hours | 0 minutes
GPU required | NVIDIA 16GB+ VRAM | None
Technical knowledge | Python, CUDA | None
Max resolution | Limited by your GPU | Up to 1536³

Related articles:

  • What is TRELLIS 3D?: overview of the model and its capabilities
  • How to Use TRELLIS 2: step-by-step usage guide
  • How to Install TRELLIS 2: local setup instructions
  • How to Test TRELLIS 2: quality benchmarks and comparisons