How We Fixed IndexTTS2 on Apple Silicon (The torchaudio.save Bug)

February 28, 2026 · Dev Log · 6 min read

We spent hours debugging IndexTTS2 on a Mac Mini M4. Every single output sounded like static — choppy, distorted, unusable. We blamed BigVGAN. We blamed MPS. We tried CPU mode, different voice samples, different diffusion steps. Nothing worked.

The model was producing perfect audio the entire time. The bug was in one line of the save pipeline.

The Symptoms

If you're running IndexTTS2 on Apple Silicon (or honestly, any non-CUDA setup) and getting audio that sounds like harsh static — choppy, distorted, maxed-out noise — you probably have this bug.

The numbers tell the story: 99.96% of samples clipped, 0.0 dB of dynamic range, and a one-line root cause.

The Root Cause

In indextts/infer_v2.py, the audio save pipeline does this:

# The original code (BROKEN)
wav = torch.clamp(32767 * wav, -32767.0, 32767.0)
torchaudio.save(output_path, wav.type(torch.int16), sampling_rate)

This looks reasonable. BigVGAN outputs float audio in [-1, 1], you scale it to int16 range, cast to int16, and save. Right?

🐛 The Bug

torchaudio.save() with torch.int16 input auto-normalizes the data to fill the entire [-32768, 32767] range. A value of 5000 becomes 32767. A value of -5000 becomes -32768. Every sample gets pushed to maximum amplitude.

Your 0.09 RMS audio becomes 0.99 RMS. Your carefully shaped waveform becomes a clipped wall of noise.
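To see how badly this hurts dynamic range, here's a minimal pure-Python sketch (no torch required; the test waveform and `rms` helper are illustrative, not from the IndexTTS2 codebase) of what forcing every non-zero sample to full scale does to a quiet signal:

```python
import math

def rms(samples):
    """Root-mean-square level of a list of floats in [-1, 1]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# A quiet 440 Hz sine at 22050 Hz sample rate, ~0.09 RMS
quiet = [0.13 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(2205)]

# What the buggy pipeline effectively does: every non-zero sample
# lands at full scale (+/-1.0 after converting back to float)
clipped = [math.copysign(1.0, s) if s != 0.0 else 0.0 for s in quiet]

print(round(rms(quiet), 2))    # 0.09
print(round(rms(clipped), 2))  # 1.0
```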

Here's the proof:

# Saving int16 with torchaudio — it rescales!
>>> import torch, torchaudio
>>> import soundfile as sf
>>> wav = torch.tensor([[0.0, 5000.0, -5000.0, 20000.0, -20000.0]])
>>> wav_int16 = wav.type(torch.int16)
>>> torchaudio.save('/tmp/test.wav', wav_int16, 22050)
>>> data, sr = sf.read('/tmp/test.wav', dtype='int16')
>>> data
array([     0,  32767, -32768,  32767, -32768], dtype=int16)

Every value that isn't zero gets slammed to ±32767. This is torchaudio "helpfully" normalizing your audio.

The Fix

✅ The Fix (2 seconds)

Instead of casting to int16, save as float32 in the [-1, 1] range with an explicit bit depth:

# The fix — save as normalized float with explicit bit depth
torchaudio.save(
    output_path,
    (wav / 32767.0).clamp(-1.0, 1.0).float(),
    sampling_rate,
    bits_per_sample=16
)

That's it. Same file format, same quality, but torchaudio now knows the data is already properly scaled.
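The key step is mapping int16-scale floats back into [-1, 1] before handing them to torchaudio. In isolation, the scaling looks like this (pure Python; `to_unit_range` is a hypothetical helper for illustration, not part of torchaudio or IndexTTS2):

```python
def to_unit_range(int16_scale_samples):
    """Map samples on the int16 scale back to [-1.0, 1.0],
    clamping anything that strays outside the valid range."""
    return [max(-1.0, min(1.0, s / 32767.0)) for s in int16_scale_samples]

scaled = to_unit_range([0.0, 5000.0, -5000.0, 32767.0, 40000.0])
print([round(s, 3) for s in scaled])  # [0.0, 0.153, -0.153, 1.0, 1.0]
```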

Two More Fixes You'll Need

1. Missing Qwen Emotion Model

IndexTTS2's emotion system uses a Qwen 0.6B model. If the weights aren't downloaded (they're separate from the main checkpoints), the pipeline feeds garbage data to BigVGAN. Add a guard:

# In __init__, wrap QwenEmotion loading
try:
    self.qwen_emo = QwenEmotion(
        os.path.join(self.model_dir, self.cfg.qwen_emo_path)
    )
except Exception as e:
    print(f"QwenEmotion failed: {e}")
    self.qwen_emo = None

Download the emotion model weights from the IndexTTS-2 HuggingFace repo — specifically the qwen0.6bemo4-merge/model.safetensors file (~1.1GB).

2. BigVGAN Output Normalization

Safety net for edge cases where BigVGAN outputs values slightly outside [-1, 1]:

# Before the int16 scaling line
wav_max = wav.abs().max()
if wav_max > 1.0:
    wav = wav / wav_max

Running on Apple Silicon: What Actually Works

We tested on a Mac Mini M4 with 16GB unified memory. MPS runs out of memory on the s2mel diffusion transformer once CFG is enabled, while CPU mode completes every run at roughly 8-9x real-time (about 80 seconds for 9 seconds of audio).

Our recommendation: Use CPU mode with the default settings (CFG=0.7, 25 diffusion steps). It's slower but produces the same quality as CUDA. For production batches, run overnight.

# Force CPU mode on Apple Silicon
import torch
torch.backends.mps.is_available = lambda: False

from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(model_dir='./checkpoints', cfg_path='./checkpoints/config.yaml')

tts.infer(
    spk_audio_prompt='voice_sample.wav',
    text='Your text here.',
    output_path='output.wav',
    verbose=True
)

Voice Sample Tips

Your reference audio matters more than any hyperparameter:

- At least 10 seconds of clear speech at 22050 Hz — samples of 5 seconds or less produce noticeably worse cloning quality
- An RMS level around 0.08-0.12
- No background noise or music

The Debugging Lesson

Always verify the saved file, not just the model output tensor.

We spent hours looking at BigVGAN output tensors, diffusion step counts, MPS vs CPU precision, voice sample quality — everything upstream. The model was perfect. The save function was silently destroying the audio on every single run.

If your model outputs look correct in the debugger but the WAV file sounds wrong, check your save pipeline. torchaudio.save() with int16 data is the most common silent killer.
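One cheap guard we could have used from the start: re-read the saved file and compare its RMS against the tensor you passed in. A sketch (pure Python; `save_looks_broken` and the 3x threshold are our own rule of thumb, not part of any library):

```python
import math

def rms(samples):
    """Root-mean-square level of a list of floats."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def save_looks_broken(tensor_samples, file_samples, ratio=3.0):
    """The re-read file's RMS should sit close to the in-memory tensor's
    RMS; a several-fold jump means the save path rescaled your audio."""
    return rms(file_samples) > ratio * rms(tensor_samples)

quiet = [0.09, -0.09, 0.05, -0.05]      # in-memory model output
slammed = [1.0, -1.0, 1.0, -1.0]        # what the buggy save wrote to disk

print(save_looks_broken(quiet, quiet))    # False — healthy pipeline
print(save_looks_broken(quiet, slammed))  # True — the int16 rescale bug
```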

Building AI Audio Content?

exoCreate is the AI script generator built for audio creators — erotic audio, ASMR, hypnosis, podcasts. Write scripts with persona-aware AI, then bring them to life with TTS.

Try exoCreate Free →

FAQ

Why does IndexTTS2 produce static or garbage audio on Mac?

The most common cause is torchaudio.save() rescaling int16 data. When you cast float audio to torch.int16 and save with torchaudio, it auto-normalizes the values to fill the entire [-32768, 32767] range, destroying dynamic range and producing clipped, static-sounding audio. The fix is to save as float32 scaled to [-1, 1] with bits_per_sample=16.

Can IndexTTS2 run on Apple Silicon without a CUDA GPU?

Yes. IndexTTS2 runs in CPU mode on Apple Silicon Macs with identical output quality. Performance is roughly 8-9x real-time factor (80 seconds for 9 seconds of audio on an M4). The s2mel diffusion transformer with CFG requires too much memory for MPS on 16GB Macs, but CPU mode with the original settings (CFG=0.7, 25 steps) works perfectly.

Do I need the Qwen emotion model for IndexTTS2?

The emotion model improves output quality significantly. Without it, the pipeline may feed garbage data to BigVGAN. Download qwen0.6bemo4-merge/model.safetensors (~1.1GB) from the IndexTTS-2 HuggingFace repo and place it in your checkpoints directory.

How long should my voice reference sample be?

At least 10 seconds of clear speech at 22050 Hz. Shorter samples (5 seconds or less) produce noticeably worse voice cloning quality. The sample should have an RMS level around 0.08-0.12 with no background noise or music.
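Those numbers are easy to check before you ever run inference. A quick pure-Python sanity check (the `reference_ok` helper and thresholds mirror the figures above; they're our rule of thumb, not an official API):

```python
import math

def reference_ok(samples, sample_rate=22050, min_seconds=10.0,
                 rms_range=(0.08, 0.12)):
    """True if the reference clip is long enough and sits in a healthy
    loudness band (RMS 0.08-0.12), per the guidance above."""
    duration = len(samples) / sample_rate
    level = math.sqrt(sum(s * s for s in samples) / len(samples))
    return duration >= min_seconds and rms_range[0] <= level <= rms_range[1]

# 12 seconds of a 0.14-amplitude sine ≈ 0.099 RMS — passes
good = [0.14 * math.sin(2 * math.pi * 220 * n / 22050)
        for n in range(12 * 22050)]
print(reference_ok(good))             # True

# 5-second slice — too short, fails regardless of level
print(reference_ok(good[:5 * 22050]))  # False
```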