# wavlm-vocoder-french

**News (March 2026):** This work was accepted at JEP 2026.
This repository hosts the main checkpoint associated with the paper, together with evaluation results, usage instructions, and links to the public demo and code.

A research checkpoint for French speech reconstruction from frozen WavLM representations, released as the companion model to our paper on WavLM-to-Audio vocoding for French speech, accepted at JEP 2026.

🔗 Code: github.com/hi-paris/wavlm-vocoder-french
🎧 Demo: hi-paris.github.io/wavlm2audio-demo


## Associated paper

WavLM-to-Audio Vocoding in French: Layer Ablation Study and Adversarial Supervision for Continuous Voice Conversion
Nassima Ould Ouali, Awais Hussain Sani, Reda Dehak, Eric Moulines
Accepted at JEP 2026

This paper studies:

  • layer-wise WavLM conditioning
  • learned weighted fusion of upper WavLM layers
  • adversarial versus non-adversarial training
  • the feasibility of faithful French speech reconstruction from frozen self-supervised speech representations

## Overview

This repository presents a WavLM-to-Audio neural vocoder for French speech. The system reconstructs waveform audio from frozen internal representations extracted from WavLM-Base+, using a learnable layer fusion mechanism, a convolutional adapter, and a HiFi-GAN-style neural generator.

This work is positioned as a foundational stage-1 component: before performing transformations in a continuous self-supervised latent space for voice conversion, it is necessary to ensure that the latent representation can first be decoded back into faithful audio.


## Why this project matters

Modern voice conversion increasingly relies on continuous latent representations extracted from large self-supervised speech models such as Wav2Vec 2.0, HuBERT, and WavLM. However, reliable waveform reconstruction from such representations remains a technical challenge, especially for French speech.

This repository documents a French WavLM-to-Audio vocoder developed within a broader research effort on modular voice conversion. It provides a reconstructive decoder that can serve as a basis for future continuous voice conversion systems operating in the WavLM latent space.


## Main idea

The architecture follows a three-stage pipeline:

  1. Frozen WavLM-Base+ extracts internal speech representations
  2. Layer fusion + convolutional adapter combines and adapts the selected WavLM layers
  3. A HiFi-GAN-style generator reconstructs the waveform at 16 kHz

The model explores:

  • fixed last-layer conditioning
  • averaging of the last N WavLM layers
  • learned weighted fusion of the last N WavLM layers
  • training with and without adversarial supervision

## Selected demo checkpoint

The main checkpoint released in this repository is `checkpoint_step180000.pt`.

It corresponds to the main public demonstration checkpoint used for the JEP 2026 communication material and qualitative listening examples.

The quantitative results reported below summarize the evaluation setting used in the accepted paper on a short French held-out reconstruction benchmark.


## Training data

The model was trained on cleaned French speech data built from public corpora:

| Corpus | Duration |
|---|---|
| SIWIS | 10.9 h |
| M-AILABS French | 160.7 h |
| Common Voice French | 66.7 h |
| **Total** | **238.3 h** |

The data preparation pipeline includes mono conversion, resampling to 16 kHz, amplitude normalization, duration filtering, silence filtering, acoustic quality control, and manual inspection.
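A few of the cleaning steps above can be sketched as plain array operations. This is an illustrative sketch only; the function names and thresholds are assumptions, not the repository's actual pipeline (resampling and acoustic quality control are omitted).

```python
import numpy as np

SR = 16_000  # target sample rate after resampling

def to_mono(x: np.ndarray) -> np.ndarray:
    """Mono conversion: average channels if the signal is (T, C)."""
    return x.mean(axis=1) if x.ndim == 2 else x

def peak_normalize(x: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Amplitude normalization: scale so max |amplitude| equals `peak`."""
    m = np.abs(x).max()
    return x * (peak / m) if m > 0 else x

def keep_duration(x: np.ndarray, lo: float = 1.5, hi: float = 5.0) -> bool:
    """Duration filter (bounds chosen to match the 1.5-5 s evaluation setting)."""
    d = len(x) / SR
    return lo <= d <= hi

stereo = np.random.default_rng(1).standard_normal((3 * SR, 2))  # fake 3 s clip
mono = peak_normalize(to_mono(stereo))
print(mono.shape, round(float(np.abs(mono).max()), 2), keep_duration(mono))
```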


## Model configuration

| Parameter | Value |
|---|---|
| WavLM backbone | `microsoft/wavlm-base-plus` |
| Hidden dimension | 256 |
| Number of adapter layers | 6 |
| Kernel size | 7 |
| Weighted WavLM layers | True |
| Snake activation | False |
| Sample rate | 16 kHz |
| Segment length | 16,000 samples |
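The table above maps naturally onto a small config object. The field names below are hypothetical (the repository's actual configuration schema may differ); the values are taken from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VocoderConfig:
    """Hypothetical config mirroring the table above."""
    wavlm_backbone: str = "microsoft/wavlm-base-plus"
    hidden_dim: int = 256
    num_adapter_layers: int = 6
    kernel_size: int = 7
    weighted_wavlm_layers: bool = True
    snake_activation: bool = False
    sample_rate: int = 16_000
    segment_length: int = 16_000  # in samples

cfg = VocoderConfig()
# 16,000 samples at 16 kHz corresponds to 1-second training segments.
print(cfg.segment_length / cfg.sample_rate)  # -> 1.0
```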

## Evaluation summary

The model is evaluated in a short-audio reconstruction setting on a stratified subset of 15 test utterances covering unseen speakers, variable durations (1.5 s to 5 s), and diverse phonetic content.

These results correspond to the evaluation setting reported in the accepted JEP 2026 paper.

### Impact of adversarial supervision

| Model | MCD ↓ | Mel-L1 ↓ | PESQ ↑ | STOI ↑ | V/UV F1 ↑ | F0 RMSE ↓ | F0 Corr ↑ |
|---|---|---|---|---|---|---|---|
| Without GAN | 9.72 | 1.55 | 1.11 | 0.74 | 0.878 | 10.1 | 0.83 |
| + MPD/MSD + FM | 8.43 | 1.17 | 1.28 | 0.86 | 0.932 | 7.7 | 0.96 |
| Relative gain | -13.3% | -24.5% | +15.3% | +16.2% | +6.1% | -23.8% | +15.7% |

Adversarial supervision is critical for improving spectral quality, intelligibility, and prosodic fidelity in WavLM-to-Audio reconstruction.
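The "Relative gain" row follows directly from the two model rows as `(with_gan - without_gan) / without_gan`. The snippet below recomputes it from the tabulated values; results match the table up to rounding of the source figures.

```python
# Metric values copied from the evaluation table above.
without_gan = {"MCD": 9.72, "Mel-L1": 1.55, "PESQ": 1.11, "STOI": 0.74,
               "V/UV F1": 0.878, "F0 RMSE": 10.1, "F0 Corr": 0.83}
with_gan = {"MCD": 8.43, "Mel-L1": 1.17, "PESQ": 1.28, "STOI": 0.86,
            "V/UV F1": 0.932, "F0 RMSE": 7.7, "F0 Corr": 0.96}

# Relative gain in percent; negative is better for MCD, Mel-L1, F0 RMSE.
gains = {k: 100 * (with_gan[k] - without_gan[k]) / without_gan[k]
         for k in without_gan}
for k, g in gains.items():
    print(f"{k}: {g:+.1f}%")
```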


## Inference

```bash
# Clone the codebase
git clone https://github.com/hi-paris/wavlm-vocoder-french.git
cd wavlm-vocoder-french
pip install -e .

# Run inference
python scripts/infer.py \
  --checkpoint checkpoint_step180000.pt \
  --input_dir /path/to/audio \
  --output_dir /path/to/output \
  --num_samples 10
```

### How to use this checkpoint

1. Clone the associated codebase (link above)
2. Place `checkpoint_step180000.pt` in your working directory
3. Run the inference script on short French waveform inputs

### Recommended input conditions

- Mono French speech audio at 16 kHz (or audio resampleable to 16 kHz)
- Short utterances (1.5 s to 5 s), for best consistency with the evaluation setting

### Expected output

The inference script generates reconstructed waveforms saved as `{stem}_output.wav` in the specified output directory.
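The naming convention can be sketched with `pathlib`; the helper below is hypothetical (the actual script may construct paths differently), but it follows the stated `{stem}_output.wav` pattern.

```python
from pathlib import Path

def output_path(input_wav: str, output_dir: str) -> Path:
    """Map an input file `name.wav` to `<output_dir>/name_output.wav`."""
    stem = Path(input_wav).stem
    return Path(output_dir) / f"{stem}_output.wav"

print(output_path("/path/to/audio/utt001.wav", "/path/to/output"))
# -> /path/to/output/utt001_output.wav
```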

---

## Intended use

- Research on French speech reconstruction from self-supervised latent representations
- Stage-1 decoding experiments for future continuous voice conversion
- Analysis of WavLM latent representations for speech generation
- Demonstration of a French WavLM-based vocoder

---

## Limitations

This model is **not** a full voice conversion system. It is a **reconstructive neural decoder** designed as a foundational component.

- No explicit speaker identity conversion
- No explicit style transfer
- No prosody control module
- Evaluation primarily on short audio segments
- Stage-1 system only β€” not a complete end-to-end VC pipeline

---

## Project context

This model corresponds to the reconstructive decoding stage required before future experiments on continuous latent-space transformation and controllable voice conversion. The associated paper studies layer-wise WavLM conditioning, learned weighted layer fusion, the effect of adversarial supervision, and the feasibility of faithful French speech reconstruction from frozen WavLM representations.

---

## Citation
```bibtex
@misc{ouldouali2026wavlm2audiofr,
  title={WavLM-to-Audio Vocoding in French: Layer Ablation Study and Adversarial Supervision for Continuous Voice Conversion},
  author={Nassima Ould Ouali and Awais Hussain Sani and Reda Dehak and Eric Moulines},
  year={2026},
  note={Accepted at JEP 2026}
}
```
