# wavlm-vocoder-french
**News** (March 2026): This work was accepted at JEP 2026.
This repository hosts the main research checkpoint for French speech reconstruction from frozen WavLM representations, released as the companion model of our JEP 2026 paper on WavLM-to-Audio vocoding for French speech. It also provides evaluation results, usage instructions, and links to the public demo and code.
- Code: [github.com/hi-paris/wavlm-vocoder-french](https://github.com/hi-paris/wavlm-vocoder-french)
- Demo: [hi-paris.github.io/wavlm2audio-demo](https://hi-paris.github.io/wavlm2audio-demo)
## Associated paper
**WavLM-to-Audio Vocoding in French: Layer Ablation Study and Adversarial Supervision for Continuous Voice Conversion**
Nassima Ould Ouali, Awais Hussain Sani, Reda Dehak, Eric Moulines
Accepted at JEP 2026
This paper studies:
- layer-wise WavLM conditioning
- learned weighted fusion of upper WavLM layers
- adversarial versus non-adversarial training
- the feasibility of faithful French speech reconstruction from frozen self-supervised speech representations
## Overview
This repository presents a WavLM-to-Audio neural vocoder for French speech. The system reconstructs waveform audio from frozen internal representations extracted from WavLM-Base+, using a learnable layer fusion mechanism, a convolutional adapter, and a HiFi-GAN-style neural generator.
This work is positioned as a foundational stage-1 component: before performing transformations in a continuous self-supervised latent space for voice conversion, it is necessary to ensure that the latent representation can first be decoded back into faithful audio.
## Why this project matters
Modern voice conversion increasingly relies on continuous latent representations extracted from large self-supervised speech models such as Wav2Vec 2.0, HuBERT, and WavLM. However, reliable waveform reconstruction from such representations remains a technical challenge, especially for French speech.
This repository documents a French WavLM-to-Audio vocoder developed within a broader research effort on modular voice conversion. It provides a reconstructive decoder that can serve as a basis for future continuous voice conversion systems operating in the WavLM latent space.
## Main idea
The architecture follows a three-stage pipeline:
- Frozen WavLM-Base+ extracts internal speech representations
- Layer fusion + convolutional adapter combines and adapts the selected WavLM layers
- A HiFi-GAN-style generator reconstructs the waveform at 16 kHz
The model explores:
- fixed last-layer conditioning
- averaging of the last N WavLM layers
- learned weighted fusion of the last N WavLM layers
- training with and without adversarial supervision
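The learned weighted fusion above can be sketched as a softmax-weighted convex combination of the upper WavLM layer outputs. This is a minimal NumPy sketch of the forward pass only; the function name and shapes are illustrative, and in the actual model the fusion logits would be trainable parameters.

```python
import numpy as np

def fuse_layers(hidden_states: np.ndarray, logits: np.ndarray) -> np.ndarray:
    """Softmax-weighted sum over the last N WavLM layers.

    hidden_states: (num_layers, frames, dim) stack of layer outputs.
    logits: (num_layers,) unnormalized fusion weights (learned in training).
    """
    w = np.exp(logits - logits.max())
    w = w / w.sum()                                # softmax -> convex combination
    return np.tensordot(w, hidden_states, axes=1)  # -> (frames, dim)

# Toy example: 4 upper layers, 10 frames, 8-dim features
rng = np.random.default_rng(0)
layers = rng.standard_normal((4, 10, 8))
fused = fuse_layers(layers, np.zeros(4))  # equal logits reduce to a plain average
```

With equal logits this reduces to the "averaging of the last N layers" baseline, which makes the learned-fusion variant a strict generalization of simple averaging.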
## Selected demo checkpoint

The main checkpoint released in this repository is `checkpoint_step180000.pt`. It is the public demonstration checkpoint used for the JEP 2026 communication material and the qualitative listening examples.
The quantitative results reported below summarize the evaluation setting used in the accepted paper on a short French held-out reconstruction benchmark.
## Training data
The model was trained on cleaned French speech data built from public corpora:
| Corpus | Duration |
|---|---|
| SIWIS | 10.9 h |
| M-AILABS French | 160.7 h |
| Common Voice French | 66.7 h |
| **Total** | **238.3 h** |
The data preparation pipeline includes mono conversion, resampling to 16 kHz, amplitude normalization, duration filtering, silence filtering, acoustic quality control, and manual inspection.
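A few of the preparation steps above (mono mixdown, amplitude normalization, duration filtering) can be sketched in NumPy. This is an illustrative sketch only: the function name and the duration thresholds are assumptions, and the resampling, silence filtering, and quality-control steps are omitted.

```python
import numpy as np

SR = 16_000  # target sample rate after resampling

def preprocess(wav: np.ndarray, min_s: float = 1.5, max_s: float = 5.0):
    """Illustrative cleanup: mono mixdown, peak normalization, duration filter.

    Returns None when the clip is filtered out by the duration check.
    """
    if wav.ndim == 2:                   # (channels, samples) -> mono
        wav = wav.mean(axis=0)
    peak = np.abs(wav).max()
    if peak > 0:
        wav = wav / peak * 0.95         # amplitude normalization with headroom
    duration = wav.shape[-1] / SR
    if not (min_s <= duration <= max_s):
        return None                     # too short or too long
    return wav.astype(np.float32)

clip = np.ones((2, 2 * SR), dtype=np.float32)  # stereo, 2 s
out = preprocess(clip)
```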
## Model configuration
| Parameter | Value |
|---|---|
| WavLM backbone | microsoft/wavlm-base-plus |
| Hidden dimension | 256 |
| Number of adapter layers | 6 |
| Kernel size | 7 |
| Weighted WavLM layers | True |
| Snake activation | False |
| Sample rate | 16 kHz |
| Segment length | 16000 samples |
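The table above maps directly onto a configuration object. This is a hypothetical sketch of such a container (the class and field names are not taken from the repository), useful mainly to make the units explicit:

```python
from dataclasses import dataclass

@dataclass
class VocoderConfig:
    """Hypothetical config container mirroring the table above."""
    wavlm_backbone: str = "microsoft/wavlm-base-plus"
    hidden_dim: int = 256
    num_adapter_layers: int = 6
    kernel_size: int = 7
    weighted_wavlm_layers: bool = True
    snake_activation: bool = False
    sample_rate: int = 16_000   # Hz
    segment_length: int = 16_000  # samples, i.e. 1 s at 16 kHz

cfg = VocoderConfig()
```

Note that the training segment length (16000 samples) equals exactly one second of audio at the 16 kHz sample rate.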
## Evaluation summary
The model is evaluated in a short-audio reconstruction setting on a stratified subset of 15 test utterances covering unseen speakers, variable durations (1.5 s to 5 s), and diverse phonetic content.
These results correspond to the evaluation setting reported in the accepted JEP 2026 paper.
### Impact of adversarial supervision
| Model | MCD ↓ | Mel-L1 ↓ | PESQ ↑ | STOI ↑ | V/UV F1 ↑ | F0 RMSE ↓ | F0 Corr ↑ |
|---|---|---|---|---|---|---|---|
| Without GAN | 9.72 | 1.55 | 1.11 | 0.74 | 0.878 | 10.1 | 0.83 |
| + MPD/MSD + FM | 8.43 | 1.17 | 1.28 | 0.86 | 0.932 | 7.7 | 0.96 |
| Relative gain | -13.3% | -24.5% | +15.3% | +16.2% | +6.1% | -23.8% | +15.7% |
Adversarial supervision is critical for improving spectral quality, intelligibility, and prosodic fidelity in WavLM-to-Audio reconstruction.
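The "relative gain" row of the table can be checked by recomputing it from the two model rows, since it is just the percentage change from the non-adversarial to the adversarial model:

```python
# Relative gain = 100 * (with_gan - without_gan) / without_gan, per metric
without_gan = {"MCD": 9.72, "Mel-L1": 1.55, "PESQ": 1.11, "STOI": 0.74}
with_gan    = {"MCD": 8.43, "Mel-L1": 1.17, "PESQ": 1.28, "STOI": 0.86}

gains = {
    metric: 100.0 * (with_gan[metric] - without_gan[metric]) / without_gan[metric]
    for metric in without_gan
}
# Rounded to one decimal: MCD -13.3 %, Mel-L1 -24.5 %, PESQ +15.3 %, STOI +16.2 %
```

Negative percentages on error metrics (MCD, Mel-L1) and positive ones on quality metrics (PESQ, STOI) both indicate improvement from adversarial supervision.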
## Inference

```bash
# Clone the codebase
git clone https://github.com/hi-paris/wavlm-vocoder-french.git
cd wavlm-vocoder-french
pip install -e .

# Run inference
python scripts/infer.py \
    --checkpoint checkpoint_step180000.pt \
    --input_dir /path/to/audio \
    --output_dir /path/to/output \
    --num_samples 10
```
### How to use this checkpoint
1. Clone the associated codebase (link above)
2. Place `checkpoint_step180000.pt` in your working directory
3. Run the inference script on short French waveform inputs
### Recommended input conditions
- French speech, mono, at 16 kHz (or audio resampleable to 16 kHz)
- Short utterances (1.5 s to 5 s), for best consistency with the evaluation setting
### Expected output
The inference script generates reconstructed waveforms saved as `{stem}_output.wav` in the specified output directory.
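The `{stem}_output.wav` naming convention can be expressed with `pathlib`. This is a small illustrative helper, not code from the repository:

```python
from pathlib import Path

def output_path(input_wav: str, output_dir: str) -> Path:
    """Mirror the script's naming convention: {stem}_output.wav."""
    stem = Path(input_wav).stem          # filename without extension
    return Path(output_dir) / f"{stem}_output.wav"

p = output_path("/path/to/audio/utt001.wav", "/path/to/output")
```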
---
## Intended use
- Research on French speech reconstruction from self-supervised latent representations
- Stage-1 decoding experiments for future continuous voice conversion
- Analysis of WavLM latent representations for speech generation
- Demonstration of a French WavLM-based vocoder
---
## Limitations
This model is **not** a full voice conversion system. It is a **reconstructive neural decoder** designed as a foundational component.
- No explicit speaker identity conversion
- No explicit style transfer
- No prosody control module
- Evaluation primarily on short audio segments
- Stage-1 system only, not a complete end-to-end VC pipeline
---
## Project context
This model corresponds to the reconstructive decoding stage required before future experiments on continuous latent-space transformation and controllable voice conversion. The associated paper studies layer-wise WavLM conditioning, learned weighted layer fusion, the effect of adversarial supervision, and the feasibility of faithful French speech reconstruction from frozen WavLM representations.
---
## Citation
```bibtex
@misc{ouldouali2026wavlm2audiofr,
  title={WavLM-to-Audio Vocoding in French: Layer Ablation Study and Adversarial Supervision for Continuous Voice Conversion},
  author={Nassima Ould Ouali and Awais Hussain Sani and Reda Dehak and Eric Moulines},
  year={2026},
  note={Accepted at JEP 2026}
}
```

---

## Evaluation results

Self-reported results on the stratified short French test segments:

- WORLD-MCD: 8.430
- Log-Mel L1: 1.170
- PESQ: 1.280
- STOI: 0.860
- F0 RMSE: 7.700
- F0 Correlation: 0.960
- V/UV F1: 0.932