Odunolaoluwa Shadrack Jenrola

F5-LoRA: Low Rank Adaptation for Text to Speech

GitHub repository here

F5-TTS is a ~500M-parameter non-autoregressive text-to-speech model built around a diffusion transformer. It drops several components you’d normally expect in NAR diffusion TTS systems, such as the duration predictor, the text encoder, and explicit phoneme alignment. Despite its minimal design, it performs surprisingly well and beats many models in its parameter class.

The model is trained with a speech-infilling objective: you provide text plus a masked region of audio, and it learns to reconstruct the missing segment. At inference time, the official implementation uses classifier-free guidance to balance the influence of unmasked audio on the reconstruction, which is especially important for voice cloning.
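As a hedged sketch of that guidance step (the function name and scale value here are illustrative, not the official F5-TTS API), classifier-free guidance amounts to extrapolating from the model's unconditional prediction toward its conditional one:

```python
import torch

def cfg_combine(cond_pred: torch.Tensor,
                uncond_pred: torch.Tensor,
                cfg_scale: float = 2.0) -> torch.Tensor:
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one. cfg_scale = 1.0 recovers
    # the plain conditional prediction; larger values push the output
    # harder toward the conditioning (text + reference audio).
    return uncond_pred + cfg_scale * (cond_pred - uncond_pred)
```

At each diffusion step the model is run twice (with and without conditioning) and the two outputs are combined this way.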

Out of the box, F5-TTS does excellent voice cloning. It struggles only with edge cases such as emotional prosody, conversational delivery, or certain accents, including African-accented speech.

Low-Rank Adaptation (LoRA) enables efficient fine-tuning of large models by freezing the base weights and injecting small, trainable low-rank matrices into specific modules.

Formally, for a linear projection

$W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}},$

LoRA replaces the weight update with a decomposed low-rank form

$\Delta W = BA,$

where

$A \in \mathbb{R}^{r \times d_{\text{in}}}, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r}.$

During training, the effective projection becomes

$W' = W + \alpha \cdot \Delta W,$

where α is a scaling factor (in common implementations, the LoRA alpha hyperparameter divided by the rank r), and only A and B are updated. This keeps compute and memory extremely low. The injected weights are collectively called an adapter. Since only the adapters are trained, you can maintain multiple adapters for different voices or effects without touching the main model. Training is fast and cheap because the adapters are tiny. In PyTorch, a single Linear layer with LoRA weights attached to it looks like this:

import copy
import math
import torch
from torch.nn import functional as F
from torch import nn

class LoraLinear(nn.Module):
    def __init__(self, layer: nn.Linear, r: int = 4, alpha: int = 8):
        super().__init__()
        self.base = copy.deepcopy(layer)
        self.base.requires_grad_(False)  # freeze the original weights
        self.scale = alpha / r

        dtype = layer.weight.dtype
        device = layer.weight.device

        # A gets a (scaled-down) Kaiming init; B starts at zero so the
        # adapter contributes nothing at initialization.
        self.lora_A = nn.Parameter(torch.empty(r, layer.in_features, dtype=dtype, device=device))
        self.lora_B = nn.Parameter(torch.zeros(layer.out_features, r, dtype=dtype, device=device))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.lora_A.data *= 0.01

    def forward(self, x):
        output = self.base(x)
        # Low-rank delta: (d_out x r) @ (r x d_in), scaled by alpha / r.
        delta = (self.lora_B @ self.lora_A) * self.scale
        delta = delta.to(dtype=x.dtype, device=x.device)
        return output + F.linear(x, delta)

So, regular nn.Linear layers are replaced with LoraLinear, with the original linear layer's weights frozen and stored as self.base.
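A minimal sketch of that replacement pass (a hypothetical helper, not code from the repo) might recursively walk the model and wrap each nn.Linear it finds; the wrapper class, e.g. the LoraLinear above, is passed in:

```python
import torch
from torch import nn

def inject_lora(model: nn.Module, wrapper, r: int = 4, alpha: int = 8) -> nn.Module:
    # Recursively replace every nn.Linear child with a LoRA wrapper
    # (e.g. the LoraLinear class above), leaving other modules intact.
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, wrapper(child, r=r, alpha=alpha))
        else:
            inject_lora(child, wrapper, r=r, alpha=alpha)
    return model
```

In practice you would also filter by module name here, to skip layers you don't want adapted (such as the vocoder).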

The goal is to enable LoRA-based fine-tuning of F5-TTS. To get this working, I developed F5-LoRA, an open-source repository for training LoRA adapters that produce custom voice effects and high-quality voice cloning on top of F5-TTS. I reused pieces of the original implementation, but kept things separate, since the upstream repo's structure wasn't a convenient place to integrate adapter logic cleanly.

Instead of relying on existing LoRA libraries such as PEFT, I wrote a lightweight LoRA manager. It handles initializing LoRA parameters, merging/unmerging adapters, and dynamically swapping between different adapter sets. The manager is intentionally simple and targets every linear layer in the diffusion transformer: attention projections, MLP layers, gate projections, and timestep embeddings. Essentially all nn.Linear modules except those in the vocoder. You can check it out in the code repo.
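Merging and unmerging in a manager like this boils down to adding or subtracting the scaled low-rank product from the frozen weight. A hedged sketch (function names are mine, not the repo's):

```python
import torch
from torch import nn

@torch.no_grad()
def merge_adapter(base: nn.Linear, lora_A: torch.Tensor,
                  lora_B: torch.Tensor, scale: float) -> None:
    # Fold the low-rank delta into the frozen weight: W <- W + scale * (B @ A).
    # After merging, inference runs at the base model's speed with no
    # extra matmuls per forward pass.
    base.weight += scale * (lora_B @ lora_A)

@torch.no_grad()
def unmerge_adapter(base: nn.Linear, lora_A: torch.Tensor,
                    lora_B: torch.Tensor, scale: float) -> None:
    # Undo a previous merge, restoring the original base weight exactly
    # (up to floating-point error), so another adapter can be swapped in.
    base.weight -= scale * (lora_B @ lora_A)
```

Swapping adapters is then just unmerge-old, merge-new, with no copy of the full model weights required.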

I also replaced Accelerate with PyTorch Lightning, which I find more comfortable for managing training workflows and experiment structure. See the README for quickstart examples.

Here are some examples of finetuned adapters from F5-LoRA:

  1. Voice Effects
    1. Reference Audio: drive link
    2. Output Audio: drive link
  2. Voice Cloning
    1. Reference Audio: drive link
    2. Output Audio: drive link