How We Fixed IndexTTS2 on Apple Silicon (The torchaudio.save Bug)
We spent hours debugging IndexTTS2 on a Mac Mini M4. Every single output sounded like static — choppy, distorted, unusable. We blamed BigVGAN. We blamed MPS. We tried CPU mode, different voice samples, different diffusion steps. Nothing worked.
The model was producing perfect audio the entire time. The bug was in one line of the save pipeline.
The Symptoms
If you're running IndexTTS2 on Apple Silicon (or honestly, any non-CUDA setup) and getting audio that sounds like:
- Static or white noise
- Heavily distorted/clipped speech
- "Choppy" output that's technically speech but sounds terrible
- Waveforms that look like solid blocks instead of audio
You probably have this bug.
The Root Cause
In indextts/infer_v2.py, the audio save pipeline does this:
```python
# The original code (BROKEN)
wav = torch.clamp(32767 * wav, -32767.0, 32767.0)
torchaudio.save(output_path, wav.type(torch.int16), sampling_rate)
```
This looks reasonable: BigVGAN outputs float audio in [-1, 1], so you scale it to the int16 range, cast to int16, and save. Right?
🐛 The Bug
torchaudio.save() with torch.int16 input auto-normalizes the data to fill the entire [-32768, 32767] range. A value of 5000 becomes 32767. A value of -5000 becomes -32768. Every sample gets pushed to maximum amplitude.
Your 0.09 RMS audio becomes 0.99 RMS. Your carefully shaped waveform becomes a clipped wall of noise.
Here's the proof:
```python
# Saving int16 with torchaudio — it rescales!
>>> import torch, torchaudio
>>> import soundfile as sf
>>> wav = torch.tensor([[0.0, 5000.0, -5000.0, 20000.0, -20000.0]])
>>> wav_int16 = wav.type(torch.int16)
>>> torchaudio.save('/tmp/test.wav', wav_int16, 22050)
>>> data, sr = sf.read('/tmp/test.wav', dtype='int16')
>>> data
array([     0,  32767, -32768,  32767, -32768], dtype=int16)
```
Every value that isn't zero gets slammed to ±32767. This is torchaudio "helpfully" normalizing your audio.
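Those numbers also explain the RMS jump. One way to picture the failure mode, as a sketch of the effective behavior rather than torchaudio's actual implementation: each sample is treated as if it were an already-normalized float in [-1, 1], so anything outside that range clips to full scale:

```python
import math

def as_saved(value):
    # Sketch of the failure mode (not torchaudio's real code): the sample
    # is treated as an already-normalized float in [-1, 1], so anything
    # outside that range clips to full scale
    return max(-1.0, min(1.0, float(value)))

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# A sine at a healthy speech level: ~0.09 RMS of int16 full scale
speech = [4170 * math.sin(2 * math.pi * i / 100) for i in range(1000)]

print(round(rms(speech) / 32767, 2))                 # → 0.09
print(round(rms([as_saved(s) for s in speech]), 2))  # → 0.99
```

Nearly every sample of real speech is nonzero, so nearly every sample lands at full scale: exactly the 0.09-to-0.99 RMS jump described above.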
The Fix
✅ The Fix (2 seconds)
Save as float32 in [-1, 1] range with explicit bit depth, instead of casting to int16:
```python
# The fix — save as normalized float with explicit bit depth
torchaudio.save(
    output_path,
    (wav / 32767.0).clamp(-1.0, 1.0).float(),
    sampling_rate,
    bits_per_sample=16,
)
```
That's it. Same file format, same quality, but torchaudio now knows the data is already properly scaled.
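If you want an independent check of what actually lands in a 16-bit file, you can bypass torchaudio entirely and write PCM with the standard library's wave module, which never rescales. A quick round-trip sanity check:

```python
import io
import struct
import wave

samples = [0, 5000, -5000, 20000, -20000]  # already in int16 range

# Write 16-bit mono PCM with the stdlib; wave does no rescaling
buf = io.BytesIO()
with wave.open(buf, 'wb') as f:
    f.setnchannels(1)
    f.setsampwidth(2)        # 2 bytes per sample = 16-bit
    f.setframerate(22050)
    f.writeframes(struct.pack('<5h', *samples))

# Read the frames back and confirm the values survived untouched
buf.seek(0)
with wave.open(buf, 'rb') as f:
    data = struct.unpack('<5h', f.readframes(5))

print(list(data))  # → [0, 5000, -5000, 20000, -20000]
```

Running the same read-back check against the file torchaudio wrote is what exposes the rescaling.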
Two More Fixes You'll Need
1. Missing Qwen Emotion Model
IndexTTS2's emotion system uses a Qwen 0.6B model. If the weights aren't downloaded (they're separate from the main checkpoints), the pipeline feeds garbage data to BigVGAN. Add a guard:
```python
# In __init__, wrap QwenEmotion loading
try:
    self.qwen_emo = QwenEmotion(
        os.path.join(self.model_dir, self.cfg.qwen_emo_path)
    )
except Exception as e:
    print(f"QwenEmotion failed: {e}")
    self.qwen_emo = None
```
Download the emotion model weights from the IndexTTS-2 HuggingFace repo — specifically the qwen0.6bemo4-merge/model.safetensors file (~1.1GB).
2. BigVGAN Output Normalization
Safety net for edge cases where BigVGAN outputs values slightly outside [-1, 1]:
```python
# Before the int16 scaling line
wav_max = wav.abs().max()
if wav_max > 1.0:
    wav = wav / wav_max
```
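A quick numeric check of what that guard does, using a plain-Python stand-in for the tensor math above:

```python
def safe_normalize(samples):
    # Mirrors the guard above: rescale only when the peak exceeds 1.0,
    # so in-range audio passes through untouched
    peak = max(abs(s) for s in samples)
    return [s / peak for s in samples] if peak > 1.0 else samples

out = safe_normalize([0.5, -1.2, 0.8])   # peak 1.2 → rescaled into range
print(max(abs(s) for s in out))          # → 1.0
print(safe_normalize([0.5, -0.9]))       # → [0.5, -0.9] (unchanged)
```

Because it only divides when the peak actually exceeds 1.0, well-behaved BigVGAN output is passed through bit-for-bit.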
Running on Apple Silicon: What Actually Works
We tested on a Mac Mini M4 with 16GB unified memory. Here's what we found:
- CPU mode works. All components (GPT, s2mel diffusion, BigVGAN) run correctly on CPU. Performance: ~80 seconds for 9 seconds of audio (RTF ~8.6x).
- MPS partially works. GPT and BigVGAN run fine on MPS. But the s2mel diffusion transformer with CFG (classifier-free guidance) at 0.7 needs ~14.5GB of MPS memory — too much for 16GB Macs.
- MPS without CFG fits but sounds worse. Setting inference_cfg_rate=0.0 halves memory usage and runs on MPS, but CFG is critical for quality.
- Full MPS with PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 crashed the machine (swap death).
Our recommendation: Use CPU mode with the default settings (CFG=0.7, 25 diffusion steps). It's slower but produces the same quality as CUDA. For production batches, run overnight.
```python
# Force CPU mode on Apple Silicon
import torch
torch.backends.mps.is_available = lambda: False

from indextts.infer_v2 import IndexTTS2

tts = IndexTTS2(model_dir='./checkpoints', cfg_path='./checkpoints/config.yaml')
tts.infer(
    spk_audio_prompt='voice_sample.wav',
    text='Your text here.',
    output_path='output.wav',
    verbose=True,
)
```
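For planning those overnight batches, the measured real-time factor of ~8.6x makes wall-clock estimates trivial. A minimal sketch:

```python
def batch_eta_minutes(audio_seconds, rtf=8.6):
    """Rough CPU-mode wall-clock estimate at a given real-time factor."""
    return audio_seconds * rtf / 60

print(round(batch_eta_minutes(9), 1))     # the 9 s test clip → 1.3
print(round(batch_eta_minutes(30 * 60)))  # a 30-minute script → 258
```

So a half-hour script is roughly a four-hour CPU run on the M4: comfortably an overnight job.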
Voice Sample Tips
Your reference audio matters more than any hyperparameter:
- Use 10+ seconds of clear speech (not 5s — the model needs more data to clone well)
- RMS around 0.08-0.12 is ideal. Too quiet (< 0.05) gives weak output
- Clean recordings only — no background music, no reverb, no compression artifacts
- 22050 Hz sample rate — match the model's native rate
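To check whether a clip actually sits in that RMS window before using it as a reference, a small stdlib-only helper (demonstrated on a synthetic sine standing in for a real decoded clip):

```python
import math

def rms_fraction(int16_samples):
    """RMS of int16 samples, expressed as a fraction of full scale."""
    n = len(int16_samples)
    return math.sqrt(sum(s * s for s in int16_samples) / n) / 32767

# Synthetic stand-in for one second of decoded reference audio at 22050 Hz
clip = [int(4300 * math.sin(2 * math.pi * i / 220)) for i in range(22050)]

level = rms_fraction(clip)
print(0.08 <= level <= 0.12)  # → True (inside the suggested window)
```

Feed it the decoded samples of your actual reference file (e.g. read via the wave module) to decide whether the clip needs gain adjustment before cloning.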
The Debugging Lesson
Always verify the saved file, not just the model output tensor.
We spent hours looking at BigVGAN output tensors, diffusion step counts, MPS vs CPU precision, voice sample quality — everything upstream. The model was perfect. The save function was silently destroying the audio on every single run.
If your model outputs look correct in the debugger but the WAV file sounds wrong, check your save pipeline. torchaudio.save() with int16 data is the most common silent killer.
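A cheap regression guard would have caught this bug on the first run: read the saved file back and compare its RMS to the level you intended to write. A sketch using only the stdlib, with a known-level sine written in place of real model output:

```python
import io
import math
import struct
import wave

def wav_rms(fileobj):
    """RMS of a 16-bit mono WAV, as a fraction of full scale."""
    with wave.open(fileobj, 'rb') as f:
        n = f.getnframes()
        samples = struct.unpack(f'<{n}h', f.readframes(n))
    return math.sqrt(sum(s * s for s in samples) / n) / 32767

def save_matches(expected_rms, fileobj, tolerance=0.1):
    """True if the file's level is within tolerance of what we meant to write."""
    return abs(wav_rms(fileobj) - expected_rms) <= tolerance * expected_rms

# Fixture: write a sine at ~0.09 RMS of full scale with the stdlib
buf = io.BytesIO()
pcm = [int(4170 * math.sin(2 * math.pi * i / 100)) for i in range(1000)]
with wave.open(buf, 'wb') as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(22050)
    f.writeframes(struct.pack(f'<{len(pcm)}h', *pcm))

buf.seek(0)
ok = save_matches(0.09, buf)
print(ok)  # → True
```

Pointing the same check at a file produced by the broken int16 path would report a wildly inflated RMS and fail immediately.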
Building AI Audio Content?
exoCreate is the AI script generator built for audio creators — erotic audio, ASMR, hypnosis, podcasts. Write scripts with persona-aware AI, then bring them to life with TTS.
Try exoCreate Free →
FAQ
Why does IndexTTS2 produce static or garbage audio on Mac?
The most common cause is torchaudio.save() rescaling int16 data. When you cast float audio to torch.int16 and save with torchaudio, it auto-normalizes the values to fill the entire [-32768, 32767] range, destroying dynamic range and producing clipped, static-sounding audio. The fix is to save as float32 scaled to [-1, 1] with bits_per_sample=16.
Can IndexTTS2 run on Apple Silicon without a CUDA GPU?
Yes. IndexTTS2 runs in CPU mode on Apple Silicon Macs with identical output quality. Performance is roughly 8-9x real-time factor (80 seconds for 9 seconds of audio on an M4). The s2mel diffusion transformer with CFG requires too much memory for MPS on 16GB Macs, but CPU mode with the original settings (CFG=0.7, 25 steps) works perfectly.
Do I need the Qwen emotion model for IndexTTS2?
The emotion model improves output quality significantly. Without it, the pipeline may feed garbage data to BigVGAN. Download qwen0.6bemo4-merge/model.safetensors (~1.1GB) from the IndexTTS-2 HuggingFace repo and place it in your checkpoints directory.
How long should my voice reference sample be?
At least 10 seconds of clear speech at 22050 Hz. Shorter samples (5 seconds or less) produce noticeably worse voice cloning quality. The sample should have an RMS level around 0.08-0.12 with no background noise or music.