How We Fixed IndexTTS2 on Apple Silicon (The torchaudio.save Bug)
We spent hours debugging IndexTTS2 on a Mac Mini M4. Every single output sounded like static — choppy, distorted, unusable. We blamed BigVGAN. We blamed MPS. We tried CPU mode, different voice samples, different diffusion steps. Nothing worked.
The model was producing perfect audio the entire time. The bug was in one line of the save pipeline.
The Symptoms
If you're running IndexTTS2 on Apple Silicon (or honestly, any non-CUDA setup) and getting audio that sounds like:
- Static or white noise
- Heavily distorted/clipped speech
- "Choppy" output that's technically speech but sounds terrible
- Waveforms that look like solid blocks instead of audio
You probably have this bug.
The Root Cause
In indextts/infer_v2.py, the audio save pipeline does this:
```python
# The original code (BROKEN)
wav = torch.clamp(32767 * wav, -32767.0, 32767.0)
torchaudio.save(output_path, wav.type(torch.int16), sampling_rate)
```
This looks reasonable: BigVGAN outputs float audio in [-1, 1], so you scale it to the int16 range, cast to int16, and save. Right?
🐛 The Bug
torchaudio.save() with torch.int16 input auto-normalizes the data to fill the entire [-32768, 32767] range. A value of 5000 becomes 32767. A value of -5000 becomes -32768. Every sample gets pushed to maximum amplitude.
Your 0.09 RMS audio becomes 0.99 RMS. Your carefully shaped waveform becomes a clipped wall of noise.
Here's the proof:
```python
# Saving int16 with torchaudio — it rescales!
>>> import torch, torchaudio
>>> import soundfile as sf
>>> wav = torch.tensor([[0.0, 5000.0, -5000.0, 20000.0, -20000.0]])
>>> wav_int16 = wav.type(torch.int16)
>>> torchaudio.save('/tmp/test.wav', wav_int16, 22050)
>>> data, sr = sf.read('/tmp/test.wav', dtype='int16')
>>> data
array([     0,  32767, -32768,  32767, -32768], dtype=int16)
```
Every value that isn't zero gets slammed to ±32767. This is torchaudio "helpfully" normalizing your audio.
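Those numbers also explain the RMS jump. One way to picture the failure mode, as a sketch of the effective behavior rather than torchaudio's actual implementation: each sample is treated as if it were an already-normalized float in [-1, 1], so anything outside that range clips to full scale:

```python
import math

def as_saved(value):
    # Sketch of the failure mode (not torchaudio's real code): the sample
    # is treated as an already-normalized float in [-1, 1], so anything
    # outside that range clips to full scale
    return max(-1.0, min(1.0, float(value)))

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# A sine at a healthy speech level: ~0.09 RMS of int16 full scale
speech = [4170 * math.sin(2 * math.pi * i / 100) for i in range(1000)]

print(round(rms(speech) / 32767, 2))                 # → 0.09
print(round(rms([as_saved(s) for s in speech]), 2))  # → 0.99
```

Nearly every sample of real speech is nonzero, so nearly every sample lands at full scale: exactly the 0.09-to-0.99 RMS jump described above.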
The Fix
✅ The Fix (2 seconds)
Save as float32 in [-1, 1] range with explicit bit depth, instead of casting to int16:
```python
# The fix — save as normalized float with explicit bit depth
torchaudio.save(
    output_path,
    (wav / 32767.0).clamp(-1.0, 1.0).float(),
    sampling_rate,
    bits_per_sample=16,
)
```
That's it. Same file format, same quality, but torchaudio now knows the data is already properly scaled.
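If you want an independent check of what actually lands in a 16-bit file, you can bypass torchaudio entirely and write PCM with the standard library's wave module, which never rescales. A quick round-trip sanity check:

```python
import io
import struct
import wave

samples = [0, 5000, -5000, 20000, -20000]  # already in int16 range

# Write 16-bit mono PCM with the stdlib; wave does no rescaling
buf = io.BytesIO()
with wave.open(buf, 'wb') as f:
    f.setnchannels(1)
    f.setsampwidth(2)        # 2 bytes per sample = 16-bit
    f.setframerate(22050)
    f.writeframes(struct.pack('<5h', *samples))

# Read the frames back and confirm the values survived untouched
buf.seek(0)
with wave.open(buf, 'rb') as f:
    data = struct.unpack('<5h', f.readframes(5))

print(list(data))  # → [0, 5000, -5000, 20000, -20000]
```

Running the same read-back check against the file torchaudio wrote is what exposes the rescaling.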
Two More Fixes You'll Need
1. Missing Qwen Emotion Model
IndexTTS2's emotion system uses a Qwen 0.6B model. If the weights aren't downloaded (they're separate from the main checkpoints), the pipeline feeds garbage data to BigVGAN. Add a guard:
```python
# In __init__, wrap QwenEmotion loading
try:
    self.qwen_emo = QwenEmotion(
        os.path.join(self.model_dir, self.cfg.qwen_emo_path)
    )
except Exception as e:
    print(f"QwenEmotion failed: {e}")
    self.qwen_emo = None
```
Download the emotion model weights from the IndexTTS-2 HuggingFace repo — specifically the qwen0.6bemo4-merge/model.safetensors file (~1.1GB).
2. BigVGAN Output Normalization
Safety net for edge cases where BigVGAN outputs values slightly outside [-1, 1]:
```python
# Before the int16 scaling line
wav_max = wav.abs().max()
if wav_max > 1.0:
    wav = wav / wav_max
```
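A quick numeric check of what that guard does, using a plain-Python stand-in for the tensor math above:

```python
def safe_normalize(samples):
    # Mirrors the guard above: rescale only when the peak exceeds 1.0,
    # so in-range audio passes through untouched
    peak = max(abs(s) for s in samples)
    return [s / peak for s in samples] if peak > 1.0 else samples

out = safe_normalize([0.5, -1.2, 0.8])   # peak 1.2 → rescaled into range
print(max(abs(s) for s in out))          # → 1.0
print(safe_normalize([0.5, -0.9]))       # → [0.5, -0.9] (unchanged)
```

Because it only divides when the peak actually exceeds 1.0, well-behaved BigVGAN output is passed through bit-for-bit.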
Running on Apple Silicon: What Actually Works
We tested on a Mac Mini M4 with 16GB unified memory. Here's what we found:
- CPU mode works. All components (GPT, s2mel diffusion, BigVGAN) run correctly on CPU. Performance: ~80 seconds for 9 seconds of audio (RTF ~8.6x).
- MPS partially works. GPT and BigVGAN run fine on MPS. But the s2mel diffusion transformer with CFG (classifier-free guidance) at 0.7 needs ~14.5GB of MPS memory — too much for 16GB Macs.
- MPS without CFG fits but sounds worse. Setting inference_cfg_rate=0.0 halves memory usage and runs on MPS, but CFG is critical for quality.
- Full MPS with PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 crashed the machine (swap death).
Our recommendation: Use CPU mode with the default settings (CFG=0.7, 25 diffusion steps). It's slower but produces the same quality as CUDA. For production batches, run overnight.
```python
# Force CPU mode on Apple Silicon
import torch
torch.backends.mps.is_available = lambda: False

from indextts.infer_v2 import IndexTTS2

tts = IndexTTS2(model_dir='./checkpoints', cfg_path='./checkpoints/config.yaml')
tts.infer(
    spk_audio_prompt='voice_sample.wav',
    text='Your text here.',
    output_path='output.wav',
    verbose=True,
)
```
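For planning those overnight batches, the measured real-time factor of ~8.6x makes wall-clock estimates trivial. A minimal sketch:

```python
def batch_eta_minutes(audio_seconds, rtf=8.6):
    """Rough CPU-mode wall-clock estimate at a given real-time factor."""
    return audio_seconds * rtf / 60

print(round(batch_eta_minutes(9), 1))     # the 9 s test clip → 1.3
print(round(batch_eta_minutes(30 * 60)))  # a 30-minute script → 258
```

So a half-hour script is roughly a four-hour CPU run on the M4: comfortably an overnight job.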
Voice Sample Tips
Your reference audio matters more than any hyperparameter:
- Use 10+ seconds of clear speech (not 5s — the model needs more data to clone well)
- RMS around 0.08-0.12 is ideal. Too quiet (< 0.05) gives weak output
- Clean recordings only — no background music, no reverb, no compression artifacts
- 22050 Hz sample rate — match the model's native rate
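To check whether a clip actually sits in that RMS window before using it as a reference, a small stdlib-only helper (demonstrated on a synthetic sine standing in for a real decoded clip):

```python
import math

def rms_fraction(int16_samples):
    """RMS of int16 samples, expressed as a fraction of full scale."""
    n = len(int16_samples)
    return math.sqrt(sum(s * s for s in int16_samples) / n) / 32767

# Synthetic stand-in for one second of decoded reference audio at 22050 Hz
clip = [int(4300 * math.sin(2 * math.pi * i / 220)) for i in range(22050)]

level = rms_fraction(clip)
print(0.08 <= level <= 0.12)  # → True (inside the suggested window)
```

Feed it the decoded samples of your actual reference file (e.g. read via the wave module) to decide whether the clip needs gain adjustment before cloning.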
The Debugging Lesson
Always verify the saved file, not just the model output tensor.
We spent hours looking at BigVGAN output tensors, diffusion step counts, MPS vs CPU precision, voice sample quality — everything upstream. The model was perfect. The save function was silently destroying the audio on every single run.
If your model outputs look correct in the debugger but the WAV file sounds wrong, check your save pipeline. torchaudio.save() with int16 data is the most common silent killer.
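A cheap regression guard would have caught this bug on the first run: read the saved file back and compare its RMS to the level you intended to write. A sketch using only the stdlib, with a known-level sine written in place of real model output:

```python
import io
import math
import struct
import wave

def wav_rms(fileobj):
    """RMS of a 16-bit mono WAV, as a fraction of full scale."""
    with wave.open(fileobj, 'rb') as f:
        n = f.getnframes()
        samples = struct.unpack(f'<{n}h', f.readframes(n))
    return math.sqrt(sum(s * s for s in samples) / n) / 32767

def save_matches(expected_rms, fileobj, tolerance=0.1):
    """True if the file's level is within tolerance of what we meant to write."""
    return abs(wav_rms(fileobj) - expected_rms) <= tolerance * expected_rms

# Fixture: write a sine at ~0.09 RMS of full scale with the stdlib
buf = io.BytesIO()
pcm = [int(4170 * math.sin(2 * math.pi * i / 100)) for i in range(1000)]
with wave.open(buf, 'wb') as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(22050)
    f.writeframes(struct.pack(f'<{len(pcm)}h', *pcm))

buf.seek(0)
ok = save_matches(0.09, buf)
print(ok)  # → True
```

Pointing the same check at a file produced by the broken int16 path would report a wildly inflated RMS and fail immediately.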
Building AI Audio Content?
exoCreate is the AI script generator built for audio creators — erotic audio, ASMR, hypnosis, podcasts. Write scripts with persona-aware AI, then bring them to life with TTS.
Try exoCreate Free →
FAQ
Why does IndexTTS2 produce static or garbage audio on Mac?
The most common cause is torchaudio.save() rescaling int16 data. When you cast float audio to torch.int16 and save with torchaudio, it auto-normalizes the values to fill the entire [-32768, 32767] range, destroying dynamic range and producing clipped, static-sounding audio. The fix is to save as float32 scaled to [-1, 1] with bits_per_sample=16.
Can IndexTTS2 run on Apple Silicon without a CUDA GPU?
Yes. IndexTTS2 runs in CPU mode on Apple Silicon Macs with identical output quality. Performance is roughly 8-9x real-time factor (80 seconds for 9 seconds of audio on an M4). The s2mel diffusion transformer with CFG requires too much memory for MPS on 16GB Macs, but CPU mode with the original settings (CFG=0.7, 25 steps) works perfectly.
Do I need the Qwen emotion model for IndexTTS2?
The emotion model improves output quality significantly. Without it, the pipeline may feed garbage data to BigVGAN. Download qwen0.6bemo4-merge/model.safetensors (~1.1GB) from the IndexTTS-2 HuggingFace repo and place it in your checkpoints directory.
How long should my voice reference sample be?
At least 10 seconds of clear speech at 22050 Hz. Shorter samples (5 seconds or less) produce noticeably worse voice cloning quality. The sample should have an RMS level around 0.08-0.12 with no background noise or music.