How to Install IndexTTS2 on Mac (Apple Silicon M1/M2/M3/M4 Guide)

February 28, 2026 · Guide · 10 min read

IndexTTS2 is one of the best open-source text-to-speech models for voice cloning, but the official setup assumes you have an NVIDIA GPU with CUDA. If you're on a Mac with Apple Silicon, you'll hit a few walls.

This guide walks you through installing IndexTTS2 on Mac (M1, M2, M3, M4), getting it running in CPU mode, and fixing the critical bugs that make the output sound like static.

Setup time: ~30 min
Disk space: ~15 GB
RAM: 16 GB+
Generation speed: ~80 s per 9 s of audio

Table of Contents

  1. Prerequisites
  2. Installation
  3. Download Model Checkpoints
  4. Critical Fix: The torchaudio.save Bug
  5. Additional Fixes
  6. Preparing a Voice Sample
  7. Running Inference
  8. MPS vs CPU: What Works
  9. Performance Benchmarks
  10. Troubleshooting
  11. FAQ

1. Prerequisites

Before starting, you'll need:

Install Python 3.11

IndexTTS2 works best with Python 3.11. Python 3.12+ may cause issues with numba and other dependencies.

brew install [email protected] git-lfs
git lfs install

2. Installation

Step 1: Clone the Repository

cd ~/projects  # or wherever you keep projects
git clone https://github.com/IndexTeam/IndexTTS-2.git indexTTS2
cd indexTTS2

Step 2: Create a Virtual Environment

python3.11 -m venv ~/indextts2-venv
source ~/indextts2-venv/bin/activate

Step 3: Install Dependencies

pip install --upgrade pip
pip install -e .

This will install PyTorch (with MPS support), transformers, BigVGAN, and all other dependencies. On Apple Silicon, PyTorch will automatically include MPS backend support.

3. Download Model Checkpoints

IndexTTS2 needs several model files. The main checkpoints should download automatically on first run, but the Qwen emotion model needs manual download.

Step 4: Download Qwen Emotion Model

This is the most commonly missed step. Without the emotion model, the pipeline feeds garbage data to the vocoder.

# Download from HuggingFace
pip install huggingface_hub

python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    repo_id='IndexTeam/IndexTTS-2',
    filename='qwen0.6bemo4-merge/model.safetensors',
    local_dir='./checkpoints'
)
print('Qwen emotion model downloaded!')
"

Verify the file exists:

ls -lh checkpoints/qwen0.6bemo4-merge/model.safetensors
# Should be ~1.1 GB
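If you script the setup, a quick existence check saves a failed first run. This is a hypothetical helper, not part of IndexTTS2; the paths are the ones used in this guide:

```python
# Sanity check: confirm the checkpoint files this guide expects are present.
import os

REQUIRED = [
    "checkpoints/config.yaml",
    "checkpoints/qwen0.6bemo4-merge/model.safetensors",
]

def missing_checkpoints(paths=REQUIRED):
    """Return the required files that are absent from disk."""
    return [p for p in paths if not os.path.exists(p)]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All checkpoints present.")
```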

4. Critical Fix: The torchaudio.save Bug

๐Ÿ› This Bug Affects ALL Non-CUDA Setups

IndexTTS2's save pipeline uses torchaudio.save(path, wav.type(torch.int16), sr), which silently rescales int16 data to maximum volume. Every sample gets pushed to ±32767, destroying the audio. Your output will sound like static, distorted noise, or heavily clipped speech.

This is not a BigVGAN issue or an MPS issue: the model generates perfect audio, but the save function corrupts it. Read our full analysis.

✅ The Fix

Open indextts/infer_v2.py and find this line (around line 708):

# FIND this line:
torchaudio.save(output_path, wav.type(torch.int16), sampling_rate)

# REPLACE with:
torchaudio.save(output_path, (wav / 32767.0).clamp(-1.0, 1.0).float(), sampling_rate, bits_per_sample=16)

This saves properly normalized float32 audio instead of broken int16.
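To confirm the fix took, you can measure how much of the output sits at digital full scale. This is a small stdlib sketch (a hypothetical check, not part of IndexTTS2): a healthy recording reports a fraction near zero, while the broken save pushes most samples to ±32767.

```python
# clipping_check.py - estimate how much of a 16-bit WAV sits at full scale.
import array
import wave

def clipped_fraction(path):
    """Fraction of samples at or beyond +/-32767 (digital full scale)."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expected 16-bit PCM"
        samples = array.array("h", w.readframes(w.getnframes()))
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= 32767)
    return clipped / len(samples)

if __name__ == "__main__":
    print(f"{clipped_fraction('output.wav'):.1%} of samples at full scale")
```

If the reported fraction is more than a few percent on a speech clip, the broken int16 save path is probably still in place.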

5. Additional Fixes

Fix 1: Qwen Emotion Model Guard

If the emotion model fails to load for any reason, the whole pipeline crashes. Add a try/except guard.

In indextts/infer_v2.py, find:

self.qwen_emo = QwenEmotion(os.path.join(self.model_dir, self.cfg.qwen_emo_path))

Replace with:

try:
    self.qwen_emo = QwenEmotion(os.path.join(self.model_dir, self.cfg.qwen_emo_path))
except Exception as e:
    print(f">> QwenEmotion failed: {e}")
    self.qwen_emo = None

Fix 2: BigVGAN Output Normalization

This is a safety net for rare cases where BigVGAN outputs values slightly outside [-1, 1]. Add the following before the torch.clamp(32767 * wav, ...) line:

# Add BEFORE the torch.clamp line:
wav_max = wav.abs().max()
if wav_max > 1.0:
    wav = wav / wav_max
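The same idea in plain Python, for illustration only (the real code operates on a torch tensor): scale down by the peak only when it exceeds full scale, so in-range audio passes through untouched.

```python
def peak_normalize(samples):
    """Scale samples into [-1, 1] only if the peak exceeds full scale."""
    peak = max(abs(s) for s in samples)
    if peak > 1.0:
        return [s / peak for s in samples]
    return list(samples)

# An overshooting frame gets pulled back to a peak of exactly 1.0:
print(peak_normalize([0.5, -1.25]))  # [0.4, -1.0]
```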

6. Preparing a Voice Sample

Your voice reference is the most important input. IndexTTS2 clones the voice characteristics from this sample.

Convert any audio to the right format with ffmpeg:

brew install ffmpeg

# Convert and extract a 10-second clip
ffmpeg -i input.mp3 -ss 30 -t 10 -ar 22050 -ac 1 -acodec pcm_s16le voice_sample.wav

Check the sample quality:

python3 -c "
import librosa, numpy as np
y, sr = librosa.load('voice_sample.wav', sr=None)
rms = np.sqrt(np.mean(y**2))
print(f'SR: {sr}, Duration: {len(y)/sr:.1f}s, RMS: {rms:.4f}, Peak: {np.abs(y).max():.4f}')
if rms < 0.05: print('⚠️  Too quiet - consider amplifying')
if rms > 0.3: print('⚠️  Very loud - may clip')
if sr != 22050: print('⚠️  Sample rate should be 22050 Hz')
"

7. Running Inference

Step 5: Generate Speech

source ~/indextts2-venv/bin/activate
cd ~/projects/indexTTS2

python3 -c "
import torch
# Force CPU mode on Apple Silicon (recommended for 16GB Macs)
torch.backends.mps.is_available = lambda: False

from indextts.infer_v2 import IndexTTS2

# Load model
tts = IndexTTS2(model_dir='./checkpoints', cfg_path='./checkpoints/config.yaml')

# Generate speech
tts.infer(
    spk_audio_prompt='voice_sample.wav',
    text='Hello, this is a test of IndexTTS2 running on Apple Silicon.',
    output_path='output.wav',
    verbose=True
)
print('Done! Check output.wav')
"

First run will be slower as PyTorch compiles operations. Subsequent runs are faster.

8. MPS vs CPU: What Works

Apple's Metal Performance Shaders (MPS) can accelerate some components but not all. Here's what we tested on a Mac Mini M4 (16GB):

Component         | MPS        | CPU | Notes
GPT (token gen)   | ✅         | ✅  | Works on both
s2mel (diffusion) | ⚠️         | ✅  | MPS OOM with CFG on 16GB
BigVGAN (vocoder) | ✅         | ✅  | Works on both, 4x faster on MPS
Full pipeline     | ❌ on 16GB | ✅  | CPU recommended for 16GB

โš ๏ธ 16GB Macs: Use CPU Mode

The s2mel diffusion transformer with classifier-free guidance (CFG=0.7) allocates ~14.5GB on MPS. On 16GB Macs, this causes OOM crashes or system-wide freezes. Force CPU mode with torch.backends.mps.is_available = lambda: False.

32GB+ Macs may be able to use full MPS; test with PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to allow unlimited allocation.

9. Performance Benchmarks

Tested on Mac Mini M4 (16GB), CPU mode, default settings (CFG=0.7, 25 diffusion steps):

Metric                 | Value
Audio output           | ~9 seconds
Total inference time   | ~80 seconds
RTF (real-time factor) | ~8.6x
GPT token generation   | ~40s
s2mel diffusion        | ~22s
BigVGAN vocoder        | ~9s
RAM usage              | ~4.5 GB

Not real-time, but perfectly usable for batch generation. A 5-minute audio script takes about 45 minutes to generate.
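For planning batch jobs, the measured real-time factor gives a quick wall-clock estimate. A back-of-envelope sketch using the ~8.6x figure from the table above:

```python
def estimate_wall_clock(audio_seconds, rtf=8.6):
    """Rough generation time in seconds, from the real-time factor."""
    return audio_seconds * rtf

# A 5-minute (300 s) script at RTF ~8.6:
minutes = estimate_wall_clock(300) / 60
print(f"~{minutes:.0f} minutes")  # ~43 minutes
```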

10. Troubleshooting

Output sounds like static / heavily distorted

Apply the torchaudio.save fix. This is the #1 cause of bad audio on non-CUDA setups.

OOM crash or system freeze on Apple Silicon

Force CPU mode: torch.backends.mps.is_available = lambda: False

"QwenEmotion" error on startup

Download the emotion model weights โ€” see Step 4.

Very slow first run

Normal. PyTorch compiles operations on first execution. BigVGAN also downloads its weights (~100MB) from NVIDIA's HuggingFace repo on first use.

Disk full errors with MPS

MPS caches compiled graph files to disk. Ensure at least 5GB free space. Clear the cache: rm -rf ~/Library/Caches/com.apple.metal/

Output voice doesn't match reference

Check your voice sample: 10+ seconds, 22050 Hz, clear speech, no background noise. Shorter or noisy samples produce poor clones.

"numba" or "llvmlite" installation errors

Use Python 3.11 specifically: python3.11 -m venv ~/indextts2-venv


FAQ

Can IndexTTS2 run on Mac without an NVIDIA GPU?

Yes. IndexTTS2 runs on Apple Silicon Macs in CPU mode with identical output quality to CUDA. Generation takes roughly 8-9x the audio duration (about 80 seconds for 9 seconds of audio on an M4). MPS acceleration works partially but requires more than 16GB of unified memory for full quality settings.

What Python version does IndexTTS2 need on Mac?

Python 3.11 is recommended. Python 3.12+ can cause compatibility issues with numba and llvmlite. Install via Homebrew: brew install [email protected].

Why does IndexTTS2 produce static audio on Mac?

The most common cause is a bug in the save pipeline. torchaudio.save() with int16 data auto-normalizes values to maximum amplitude, destroying dynamic range and producing clipped static. The fix is to save as float32 scaled to [-1, 1] with bits_per_sample=16. See our detailed analysis.

How much disk space does IndexTTS2 need?

About 12-15GB total: main model checkpoints (~8GB), Qwen emotion model (~1.1GB), BigVGAN weights (~100MB, auto-downloaded), plus the Python environment and HuggingFace cache.

Does IndexTTS2 support MPS on Apple Silicon?

Partially. GPT and BigVGAN run on MPS. The s2mel diffusion transformer with classifier-free guidance (CFG=0.7) requires ~14.5GB MPS memory โ€” too much for 16GB Macs. CPU mode is recommended for 16GB machines. 32GB+ Macs may use full MPS with PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0.

How do I clone a voice with IndexTTS2 on Mac?

Provide a 10+ second WAV file of the target voice as the spk_audio_prompt parameter. The sample should be clear speech at 22050 Hz with no background noise. IndexTTS2 clones the voice characteristics and generates new speech in that voice.