How to Install IndexTTS2 on Mac (Apple Silicon M1/M2/M3/M4 Guide)
IndexTTS2 is one of the best open-source text-to-speech models for voice cloning, but the official setup assumes you have an NVIDIA GPU with CUDA. If you're on a Mac with Apple Silicon, you'll hit a few walls.
This guide walks you through installing IndexTTS2 on Mac (M1, M2, M3, M4), getting it running in CPU mode, and fixing the critical bugs that make the output sound like static.
1. Prerequisites
Before starting, you'll need:
- Mac with Apple Silicon (M1, M2, M3, or M4)
- 16 GB RAM minimum (32 GB recommended for MPS acceleration)
- 15 GB free disk space for model checkpoints + environment
- Homebrew installed: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Git and Git LFS
Install Python 3.11
IndexTTS2 works best with Python 3.11. Python 3.12+ may cause issues with numba and other dependencies.
brew install [email protected] git-lfs
git lfs install
2. Installation
Step 1: Clone the Repository
cd ~/projects # or wherever you keep projects
git clone https://github.com/IndexTeam/IndexTTS-2.git indexTTS2
cd indexTTS2
Step 2: Create a Virtual Environment
python3.11 -m venv ~/indextts2-venv
source ~/indextts2-venv/bin/activate
Step 3: Install Dependencies
pip install --upgrade pip
pip install -e .
This will install PyTorch (with MPS support), transformers, BigVGAN, and all other dependencies. On Apple Silicon, PyTorch will automatically include MPS backend support.
3. Download Model Checkpoints
IndexTTS2 needs several model files. The main checkpoints should download automatically on first run, but the Qwen emotion model needs manual download.
Step 4: Download Qwen Emotion Model
This is the most commonly missed step. Without the emotion model, the pipeline feeds garbage data to the vocoder.
# Download from HuggingFace
pip install huggingface_hub
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    repo_id='IndexTeam/IndexTTS-2',
    filename='qwen0.6bemo4-merge/model.safetensors',
    local_dir='./checkpoints'
)
print('Qwen emotion model downloaded!')
"
Verify the file exists:
ls -lh checkpoints/qwen0.6bemo4-merge/model.safetensors
# Should be ~1.1 GB
4. Critical Fix: The torchaudio.save Bug
🚨 This Bug Affects ALL Non-CUDA Setups
IndexTTS2's save pipeline uses torchaudio.save(path, wav.type(torch.int16), sr), which silently rescales int16 data to maximum volume. Every sample gets pushed to ±32767, destroying the audio. Your output will sound like static, distorted noise, or heavily clipped speech.
This is not a BigVGAN issue or an MPS issue: the model generates perfect audio, but the save function corrupts it. Read our full analysis.
✅ The Fix
Open indextts/infer_v2.py and find this line (around line 708):
# FIND this line:
torchaudio.save(output_path, wav.type(torch.int16), sampling_rate)
# REPLACE with:
torchaudio.save(output_path, (wav / 32767.0).clamp(-1.0, 1.0).float(), sampling_rate, bits_per_sample=16)
This saves properly normalized float32 audio instead of broken int16.
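To see what the fix is doing, here is the same normalization in isolation, with made-up sample values and numpy standing in for the tensor math:

```python
import numpy as np

# The model's wav tensor arrives pre-scaled to the int16 range (roughly
# [-32767, 32767]); the values below are made up for illustration.
wav = np.array([-32767.0, -16000.0, 0.0, 16000.0, 33000.0])

# Old path: wav.type(torch.int16) hands torchaudio integer data, which it
# rescales to full amplitude, pushing every sample toward +/-32767.
# Fixed path: divide back down to [-1, 1] floats and clamp the stragglers,
# so torchaudio quantizes to 16-bit without touching the dynamics.
normalized = np.clip(wav / 32767.0, -1.0, 1.0).astype(np.float32)

print(normalized)  # relative amplitudes preserved, all values in [-1, 1]
```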
5. Additional Fixes
Fix 1: Qwen Emotion Model Guard
If the emotion model fails to load for any reason, the whole pipeline crashes. Add a try/except guard.
In indextts/infer_v2.py, find:
self.qwen_emo = QwenEmotion(os.path.join(self.model_dir, self.cfg.qwen_emo_path))
Replace with:
try:
    self.qwen_emo = QwenEmotion(os.path.join(self.model_dir, self.cfg.qwen_emo_path))
except Exception as e:
    print(f">> QwenEmotion failed: {e}")
    self.qwen_emo = None
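With the guard in place, self.qwen_emo may be None at inference time, so any code that calls it needs a fallback. A minimal sketch of that pattern (the function name, the inference method, and the 8-value neutral vector are illustrative assumptions, not IndexTTS2's actual internals):

```python
# Hypothetical downstream guard: fall back to a neutral emotion vector when
# the Qwen model failed to load. All names here are illustrative.
def get_emotion_vector(qwen_emo, text, neutral=(0.0,) * 8):
    if qwen_emo is None:
        return list(neutral)          # neutral fallback keeps the pipeline alive
    return qwen_emo.inference(text)   # normal path when the model loaded

print(get_emotion_vector(None, 'Hello there'))
```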
Fix 2: BigVGAN Output Normalization
Safety net for rare cases where BigVGAN outputs values slightly outside [-1, 1]. Add before the torch.clamp(32767 * wav line:
# Add BEFORE the torch.clamp line:
wav_max = wav.abs().max()
if wav_max > 1.0:
    wav = wav / wav_max
6. Preparing a Voice Sample
Your voice reference is the most important input. IndexTTS2 clones the voice characteristics from this sample.
- Duration: 10 seconds minimum (longer is better)
- Format: WAV, 22050 Hz, 16-bit, mono
- Content: Clear speech, no background noise or music
- Volume: RMS around 0.08-0.12 (not too quiet, not clipping)
Convert any audio to the right format with ffmpeg:
brew install ffmpeg
# Convert and extract a 10-second clip
ffmpeg -i input.mp3 -ss 30 -t 10 -ar 22050 -ac 1 -acodec pcm_s16le voice_sample.wav
Check the sample quality:
python3 -c "
import librosa, numpy as np
y, sr = librosa.load('voice_sample.wav', sr=None)
rms = np.sqrt(np.mean(y**2))
print(f'SR: {sr}, Duration: {len(y)/sr:.1f}s, RMS: {rms:.4f}, Peak: {np.abs(y).max():.4f}')
if rms < 0.05: print('⚠️ Too quiet, consider amplifying')
if rms > 0.3: print('⚠️ Very loud, may clip')
if sr != 22050: print('⚠️ Sample rate should be 22050 Hz')
"
7. Running Inference
Step 5: Generate Speech
source ~/indextts2-venv/bin/activate
cd ~/projects/indexTTS2
python3 -c "
import torch
# Force CPU mode on Apple Silicon (recommended for 16GB Macs)
torch.backends.mps.is_available = lambda: False
from indextts.infer_v2 import IndexTTS2
# Load model
tts = IndexTTS2(model_dir='./checkpoints', cfg_path='./checkpoints/config.yaml')
# Generate speech
tts.infer(
spk_audio_prompt='voice_sample.wav',
text='Hello, this is a test of IndexTTS2 running on Apple Silicon.',
output_path='output.wav',
verbose=True
)
print('Done! Check output.wav')
"
First run will be slower as PyTorch compiles operations. Subsequent runs are faster.
8. MPS vs CPU: What Works
Apple's Metal Performance Shaders (MPS) can accelerate some components but not all. Here's what we tested on a Mac Mini M4 (16GB):
| Component | MPS | CPU | Notes |
|---|---|---|---|
| GPT (token gen) | ✅ | ✅ | Works on both |
| s2mel (diffusion) | ⚠️ | ✅ | MPS OOM with CFG on 16GB |
| BigVGAN (vocoder) | ✅ | ✅ | Works on both, 4x faster on MPS |
| Full pipeline | ❌ 16GB | ✅ | CPU recommended for 16GB |
⚠️ 16GB Macs: Use CPU Mode
The s2mel diffusion transformer with classifier-free guidance (CFG=0.7) allocates ~14.5GB on MPS. On 16GB Macs, this causes OOM crashes or system-wide freezes. Force CPU mode with torch.backends.mps.is_available = lambda: False.
32GB+ Macs may be able to use full MPS; test with PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to allow unlimited allocation.
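The decision above can be captured in a small helper. The ~14.5 GB figure is the one measured in this guide; the 8 GB headroom threshold is our own assumption, so treat this as a sketch rather than an official policy:

```python
# Illustrative device selection for IndexTTS2 on Apple Silicon: the s2mel
# diffusion step needs roughly 14.5 GB on MPS, so fall back to CPU unless
# there is comfortable headroom (8 GB here is an assumed margin for the OS
# and other apps, not a measured constant).
def pick_device(unified_memory_gb, mps_available=True, s2mel_mps_gb=14.5):
    headroom = unified_memory_gb - s2mel_mps_gb
    if mps_available and headroom >= 8:
        return "mps"
    return "cpu"

print(pick_device(16))   # cpu  (1.5 GB headroom is not enough)
print(pick_device(32))   # mps  (17.5 GB headroom)
```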
9. Performance Benchmarks
Tested on Mac Mini M4 (16GB), CPU mode, default settings (CFG=0.7, 25 diffusion steps):
| Metric | Value |
|---|---|
| Audio output | ~9 seconds |
| Total inference time | ~80 seconds |
| RTF (real-time factor) | ~8.6x |
| GPT token generation | ~40s |
| s2mel diffusion | ~22s |
| BigVGAN vocoder | ~9s |
| RAM usage | ~4.5 GB |
Not real-time, but perfectly usable for batch generation. A 5-minute audio script takes about 45 minutes to generate.
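The RTF makes generation-time estimates simple arithmetic; a quick sketch (actual times will vary with hardware, text length, and settings):

```python
# Estimate wall-clock generation time from the real-time factor measured
# above (~8.6x on an M4 Mini in CPU mode): time = audio length * RTF.
def estimated_minutes(audio_seconds, rtf=8.6):
    return audio_seconds * rtf / 60

print(round(estimated_minutes(9), 1))       # 1.3  -> a 9 s clip takes ~80 s
print(round(estimated_minutes(5 * 60), 1))  # 43.0 -> 5 min of audio takes ~43 min
```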
10. Troubleshooting
Output sounds like static / heavily distorted
Apply the torchaudio.save fix. This is the #1 cause of bad audio on non-CUDA setups.
OOM crash or system freeze on Apple Silicon
Force CPU mode: torch.backends.mps.is_available = lambda: False
"QwenEmotion" error on startup
Download the emotion model weights; see Step 4.
Very slow first run
Normal. PyTorch compiles operations on first execution. BigVGAN also downloads its weights (~100MB) from NVIDIA's HuggingFace repo on first use.
Disk full errors with MPS
MPS caches compiled graph files to disk. Ensure at least 5GB free space. Clear the cache: rm -rf ~/Library/Caches/com.apple.metal/
Output voice doesn't match reference
Check your voice sample: 10+ seconds, 22050 Hz, clear speech, no background noise. Shorter or noisy samples produce poor clones.
"numba" or "llvmlite" installation errors
Use Python 3.11 specifically: python3.11 -m venv ~/indextts2-venv
Skip the Setup: Generate Scripts Instantly
exoCreate generates production-ready audio scripts with AI-powered personas. Write the script in minutes, then run it through IndexTTS2 for voice cloning, or use any TTS engine you prefer.
Try exoCreate Free →
FAQ
Can IndexTTS2 run on Mac without an NVIDIA GPU?
Yes. IndexTTS2 runs on Apple Silicon Macs in CPU mode with identical output quality to CUDA. Performance is approximately 8-9x real-time (about 80 seconds to generate 9 seconds of audio on an M4). MPS acceleration works partially but requires more than 16GB of unified memory for full quality settings.
What Python version does IndexTTS2 need on Mac?
Python 3.11 is recommended. Python 3.12+ can cause compatibility issues with numba and llvmlite. Install via Homebrew: brew install [email protected].
Why does IndexTTS2 produce static audio on Mac?
The most common cause is a bug in the save pipeline. torchaudio.save() with int16 data auto-normalizes values to maximum amplitude, destroying dynamic range and producing clipped static. The fix is to save as float32 scaled to [-1, 1] with bits_per_sample=16. See our detailed analysis.
How much disk space does IndexTTS2 need?
About 12-15GB total: main model checkpoints (~8GB), Qwen emotion model (~1.1GB), BigVGAN weights (~100MB, auto-downloaded), plus the Python environment and HuggingFace cache.
Does IndexTTS2 support MPS on Apple Silicon?
Partially. GPT and BigVGAN run on MPS. The s2mel diffusion transformer with classifier-free guidance (CFG=0.7) requires ~14.5GB of MPS memory, too much for 16GB Macs. CPU mode is recommended for 16GB machines. 32GB+ Macs may use full MPS with PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0.
How do I clone a voice with IndexTTS2 on Mac?
Provide a 10+ second WAV file of the target voice as the spk_audio_prompt parameter. The sample should be clear speech at 22050 Hz with no background noise. IndexTTS2 clones the voice characteristics and generates new speech in that voice.