feat: code samples and expanded README.md

Some audio files were also added for verification.

Files changed (7)

.gitattributes +1 -0
README.md +58 -23
audio-samples/entrato-it.wav +3 -0
audio-samples/italiens-fr.wav +3 -0
audio-samples/tsenkher-fr.wav +3 -0
main.py +76 -0
tokenizer/vocab.json → vocab.json +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+*.wav filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -60,14 +60,29 @@ It is a phonemization model, that works both for French and Italian.
 Given an audio file, it will output the words heard using [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet).
 It does not use a language model, so it is unlikely to force the audio onto existing words.
 The training was conducted as a part of the [NCCR Evolving Language](https://evolvinglanguage.ch/) group,
 a Swiss research institute on language.
-## Usage
 Currently, everything is managed through PyTorch.
 ```python
 import json
 import torch
@@ -77,7 +92,7 @@ import transformers
 import phoneme_recognizer
 # Load the model with weights
-with open("tokenizer/vocab.json", "r") as file:
     phonemes_dict = json.load(file)
 model = phoneme_recognizer.PhonemeRecognizer(phonemes_dict=phonemes_dict)
@@ -85,23 +100,29 @@ checkpoint = torch.load("model.pth")
 model.load_state_dict(checkpoint)
 # Prepare the input data
-feature_extractor = transformers.AutoFeatureExtractor.from_pretrained(
-    "microsoft/wavlm-base-plus"
-)
 SAMPLING_RATE = 16_000
-audio_array, frequency = torchaudio.load("file.wav")
 if frequency != SAMPLING_RATE:
     raise ValueError(f"Input audio frequency should be {SAMPLING_RATE} Hz, it is {frequency} Hz.")
-inputs = feature_extractor(audio_array, SAMPLING_RATE)
 inputs["language"] = "fr"  # or "it"
 # Do inference
 with torch.no_grad():
     logits = model(**inputs)
-prediction = model.classify_to_phonemes(logits.unsqueeze(0))
-print("Final phonemes are:", "".join(prediction[0]))
 ```
 ## Intended public
@@ -110,25 +131,18 @@ This model was mainly thought for clinicians that need audio transcriptions on a
 As the training was conducted on adult voices, it has the same speech recognition biases as "normal" adult voices,
 which means it corrects accents as long as they are widespread.
-## Model architecture
-The model contains WavLM Base+, with a linear classifier on top.
-This linear classifier has the following input:
-- The first input is the language (0 for French, 1 for Italian).
-- The next 768 are the raw outputs of WavLM Base+.
-To get phonemes from this output, you can simply use an arg max and map the indices over
-`vocab.json`.
-## Dataset generation
 The dataset was adapted from Common Voice 17.0, French + Italian versions.
-To get an API representation of the sentences, a phonemizer was used.
 The language of each sample (either French or Italian) was also saved as a dataset feature.
-## Training procedure
 Only the training split of Common Voice 17.0 is used during training.
@@ -143,11 +157,23 @@ For the second phase of training, we unfreeze the transformer.
 We start the same training procedure, a tri-state linear warm-up from scratch.
 At the time of writing, the model only completed a single epoch of training.
-## Results
 The results are measured in Phoneme Error Rate, PER for short.
 Using the validation set of Common Voice 17.0, we achieve a PER below 13%.
 ## Related works
 The model was created as a successor, and an extension, to [Cnam-LMSSC/wav2vec2-french-phonemizer](https://huggingface.co/Cnam-LMSSC/wav2vec2-french-phonemizer).
@@ -159,3 +185,12 @@ Not the same kind of measurement.
 On the previous model, PER is measured on the training set (with a risk of overfitting),
 while our PER is on some data the model never saw.
 For reference, we achieved 2% PER on the training set with 100 epochs, yet it was still 18% PER on the validation set.

 Given an audio file, it will output the words heard using [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet).
 It does not use a language model, so it is unlikely to force the audio onto existing words.
+## Model Details
+- Developed by: HugoFara
+- Funded by: [NCCR Evolving Language](https://evolvinglanguage.ch/)
 The training was conducted as a part of the [NCCR Evolving Language](https://evolvinglanguage.ch/) group,
 a Swiss research institute on language.
+## Uses
+The model works with French and Italian audio.
 Currently, everything is managed through PyTorch.
+Let's transcribe this audio:
+!["Sa capitale est Tsenkher"](audio-samples/tsenkher-fr.wav)
+You can use the following code.
 ```python
+"""
+Simple demonstration.
+See main.py for a more complete demonstration.
+"""
 import json
 import torch
 import torchaudio
 import transformers
 import phoneme_recognizer
 # Load the model with weights
+with open("vocab.json", "r") as file:
     phonemes_dict = json.load(file)
 model = phoneme_recognizer.PhonemeRecognizer(phonemes_dict=phonemes_dict)
 checkpoint = torch.load("model.pth")
 model.load_state_dict(checkpoint)
 # Prepare the input data
 SAMPLING_RATE = 16_000
+audio_array, frequency = torchaudio.load("audio-samples/tsenkher-fr.wav")
 if frequency != SAMPLING_RATE:
     raise ValueError(f"Input audio frequency should be {SAMPLING_RATE} Hz, it is {frequency} Hz.")
+feature_extractor = transformers.AutoFeatureExtractor.from_pretrained(
+    "microsoft/wavlm-base-plus"
+)
+inputs = feature_extractor(
+    audio_array.squeeze(),
+    sampling_rate=SAMPLING_RATE,
+    padding=True,
+    return_tensors="pt",
+)
 inputs["language"] = "fr"  # or "it"
 # Do inference
 with torch.no_grad():
     logits = model(**inputs)
+prediction = model.classify_to_phonemes(logits)[0]
+print("Final phonemes are:", "".join(prediction))
+# Should output: "sakapitalɛtsɑ̃kɛʁ"
 ```
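The snippet above rejects any file whose sampling rate is not 16 kHz. As a minimal sketch, the same check can be run before loading anything with `torchaudio`, using only Python's standard `wave` module to read the file header. The helper name `check_wav` and the synthetic demo file are hypothetical and not part of this repository; the helper only handles uncompressed PCM WAV files.

```python
import os
import tempfile
import wave

SAMPLING_RATE = 16_000  # rate expected by the README example


def check_wav(path, expected_rate=SAMPLING_RATE):
    """Return (sample_rate, channels, duration_s); raise if the rate differs.

    Hypothetical helper: it reads only the WAV header, not the samples.
    """
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / rate
    if rate != expected_rate:
        raise ValueError(
            f"Input audio frequency should be {expected_rate} Hz, it is {rate} Hz."
        )
    return rate, channels, duration


# Demo on a synthetic one-second silent mono file (not a repository sample).
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 16-bit samples
    wav_file.setframerate(SAMPLING_RATE)
    wav_file.writeframes(b"\x00\x00" * SAMPLING_RATE)

print(check_wav(demo_path))
```

If the repository was cloned without Git LFS, the `.wav` files are small text pointers rather than audio, and `wave.open` raises `wave.Error` on them, which makes this check a quick way to spot a missing LFS download.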
 ## Intended public
 As the training was conducted on adult voices, it has the same speech recognition biases as "normal" adult voices,
 which means it corrects accents as long as they are widespread.
+Do not use this model for any harmful purpose.
+## Training Details
+### Training Data
 The dataset was adapted from Common Voice 17.0, French + Italian versions.
+To get an IPA representation of the sentences, a text-based phonemizer was used:
+[charsiu/g2p_multilingual_byT5_small_100](https://huggingface.co/charsiu/g2p_multilingual_byT5_small_100).
 The language of each sample (either French or Italian) was also saved as a dataset feature.
+### Training Procedure
 Only the training split of Common Voice 17.0 is used during training.
 We start the same training procedure, a tri-state linear warm-up from scratch.
 At the time of writing, the model only completed a single epoch of training.
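The text does not spell out the warm-up schedule. One common reading of a three-phase schedule is: linear warm-up to a peak learning rate, a constant hold, then a decay. The sketch below assumes that reading only; the function name `tri_stage_lr`, the phase ratios, the peak learning rate, and the linear decay shape are all illustrative assumptions, not the values used to train this model.

```python
def tri_stage_lr(step, total_steps, peak_lr=1e-4,
                 warmup_ratio=0.1, hold_ratio=0.4, final_scale=0.05):
    """Learning rate at `step` for an assumed three-phase schedule.

    Phase 1: linear warm-up from 0 to peak_lr.
    Phase 2: constant at peak_lr.
    Phase 3: linear decay from peak_lr to peak_lr * final_scale.
    All default values are placeholders chosen for illustration.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    hold_steps = int(total_steps * hold_ratio)
    decay_steps = max(total_steps - warmup_steps - hold_steps, 1)

    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    if step < warmup_steps + hold_steps:
        return peak_lr
    progress = min((step - warmup_steps - hold_steps) / decay_steps, 1.0)
    return peak_lr * (1.0 - (1.0 - final_scale) * progress)


# Print the learning rate at a few points of a 1000-step run.
for step in (0, 50, 100, 499, 750, 1000):
    print(step, tri_stage_lr(step, total_steps=1000))
```

Restarting "from scratch" for the second training phase would then mean calling the schedule again with `step` reset to zero once the transformer is unfrozen.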
+## Evaluation
 The results are measured in Phoneme Error Rate, PER for short.
 Using the validation set of Common Voice 17.0, we achieve a PER below 13%.
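For readers unfamiliar with the metric, here is a minimal sketch of how a Phoneme Error Rate can be computed: the edit distance (substitutions, insertions, deletions) between predicted and reference phoneme sequences, divided by the total reference length. The function names are hypothetical, and the corpus-level aggregation (summing edits over all utterances before dividing) is an assumption; the source does not say how its figure was aggregated.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences of phoneme symbols."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_symbol != hyp_symbol)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edits divided by total reference length (assumed aggregation)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_length = sum(len(r) for r in references)
    return total_edits / total_length


# Toy data: one deletion in the first pair, one substitution in the second.
references = [["s", "a", "k", "a"], ["p", "i", "t", "a", "l"]]
hypotheses = [["s", "a", "k"], ["p", "i", "d", "a", "l"]]
print(f"PER: {phoneme_error_rate(references, hypotheses):.3f}")
```

The sequences are lists of phoneme symbols rather than raw strings because some IPA symbols, such as nasalized vowels, span several Unicode code points and would otherwise be counted as more than one phoneme.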
+## Technical Specifications
+The model contains WavLM Base+, with a linear classifier on top.
+This linear classifier has the following input:
+- The first input is the language (0 for French, 1 for Italian).
+- The next 768 are the raw outputs of WavLM Base+.
+To get phonemes from this output, you can simply use an arg max and map the indices over
+`vocab.json`.
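The two steps described above can be sketched in plain Python, without PyTorch, to make the shapes explicit: each frame handed to the classifier has 1 + 768 = 769 values, and decoding is a per-frame arg max followed by a vocabulary lookup. The function names and the toy vocabulary are hypothetical, and the sketch assumes `vocab.json` maps each phoneme to its index. Any post-processing that `classify_to_phonemes` may apply (for example collapsing repeats or dropping padding symbols) is not shown, because the source does not describe it.

```python
def build_classifier_input(frame_features, language):
    """Prepend the language flag (0 = French, 1 = Italian) to one frame."""
    language_flag = {"fr": 0.0, "it": 1.0}[language]
    return [language_flag] + list(frame_features)


def greedy_decode(frame_logits, phonemes_dict):
    """Arg max over each frame, then map the indices back to symbols.

    Assumes phonemes_dict maps phoneme -> index, as a vocab.json usually does.
    """
    index_to_phoneme = {index: phoneme for phoneme, index in phonemes_dict.items()}
    decoded = []
    for logits in frame_logits:
        best_index = max(range(len(logits)), key=logits.__getitem__)
        decoded.append(index_to_phoneme[best_index])
    return decoded


# Toy example: a 3-symbol vocabulary and two frames of made-up logits.
toy_vocab = {"[PAD]": 0, "s": 1, "a": 2}
toy_logits = [[0.1, 2.0, 0.3], [0.0, 0.5, 1.5]]
print(greedy_decode(toy_logits, toy_vocab))

# One frame of 768 placeholder features becomes a 769-value classifier input.
frame = build_classifier_input([0.5] * 768, "it")
print(len(frame), frame[0])
```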
 ## Related works
 The model was created as a successor, and an extension, to [Cnam-LMSSC/wav2vec2-french-phonemizer](https://huggingface.co/Cnam-LMSSC/wav2vec2-french-phonemizer).
 On the previous model, PER is measured on the training set (with a risk of overfitting),
 while our PER is on some data the model never saw.
 For reference, we achieved 2% PER on the training set with 100 epochs, yet it was still 18% PER on the validation set.
+See also this very good multilingual version: [ASR-Project/Multilingual-PR](https://github.com/ASR-project/Multilingual-PR).
+## Todo list
+- [ ] Data augmentation to finish the model training.
+- [ ] Cleaner dataset with a better phonemizer.
+- [ ] More powerful model using WavLM Large.
+- [ ] More evaluation results.

audio-samples/entrato-it.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1408499d421d1326eefd2ca003326426b508823d1e65505a8758b05cf5213a45
+size 116814

audio-samples/italiens-fr.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bea76537859290d430c65d9f54ac6ac42359c05d55d7f4825a99c43c75090c09
+size 140622

audio-samples/tsenkher-fr.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:efd264ba0cc41b4e4109253ac00800c675f3a65b4a6ee9d7c1590970dc8d2423
+size 115278
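The three audio files are stored as Git LFS pointers: small text files that record the SHA-256 digest and byte size of the real content. As a minimal sketch, a downloaded sample could be checked against its pointer as follows. The function names are hypothetical; the pointer text is the one shown above for `tsenkher-fr.wav`, and the final check runs on synthetic bytes because the real audio is not part of this diff.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Check raw file bytes against the size and SHA-256 digest of a pointer."""
    if pointer["algorithm"] != "sha256":
        raise ValueError(f"Unsupported hash algorithm: {pointer['algorithm']}")
    same_size = len(data) == pointer["size"]
    same_digest = hashlib.sha256(data).hexdigest() == pointer["digest"]
    return same_size and same_digest


# Pointer content copied from the diff for audio-samples/tsenkher-fr.wav.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:efd264ba0cc41b4e4109253ac00800c675f3a65b4a6ee9d7c1590970dc8d2423
size 115278
"""
pointer = parse_lfs_pointer(pointer_text)
print(pointer["size"], pointer["digest"][:12])

# Synthetic round trip: build a pointer for known bytes and verify them.
payload = b"not real audio"
fake_pointer = {
    "version": pointer["version"],
    "algorithm": "sha256",
    "digest": hashlib.sha256(payload).hexdigest(),
    "size": len(payload),
}
print(matches_pointer(payload, fake_pointer))
```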

main.py ADDED

	@@ -0,0 +1,76 @@

+"""Just a demo code to use the model."""
+import json
+import torch
+import torchaudio
+import transformers
+import phoneme_recognizer
+# Load the model with weights
+with open("vocab.json", "r") as file:
+    phonemes_dict = json.load(file)
+model = phoneme_recognizer.PhonemeRecognizer(phonemes_dict=phonemes_dict)
+checkpoint = torch.load("model.pth", map_location="cpu")
+model.load_state_dict(checkpoint)
+# Prepare the input data
+SAMPLING_RATE = 16_000
+audio_files = [
+    {
+        "path": "audio-samples/tsenkher-fr.wav",
+        "language": "fr",
+        "text": "Sa capitale est Tsenkher."
+    },
+    {
+        "path": "audio-samples/italiens-fr.wav",
+        "language": "fr",
+        "text": "Les Italiens ont été les premiers à réagir."
+    },
+    {
+        "path": "audio-samples/entrato-it.wav",
+        "language": "it",
+        "text": "Ma nessuno può esservi entrato!"
+    }
+]
+feature_extractor = transformers.AutoFeatureExtractor.from_pretrained(
+    "microsoft/wavlm-base-plus"
+)
+audio_arrays = []
+for audio in audio_files:
+    audio_array, frequency = torchaudio.load(audio["path"])
+    if frequency != SAMPLING_RATE:
+        raise ValueError(
+            f"Input audio frequency should be {SAMPLING_RATE} Hz, it is {frequency} Hz."
+        )
+    audio_arrays.append(audio_array[0].numpy())
+inputs = feature_extractor(
+    audio_arrays,
+    sampling_rate=SAMPLING_RATE,
+    padding=True,
+    return_tensors="pt",
+)
+inputs["language"] = [row["language"] for row in audio_files]  # "fr" or "it"
+# Do inference
+with torch.no_grad():
+    logits = model(**inputs)
+predictions = model.classify_to_phonemes(logits)
+column_length = 34
+print(
+    "Input file".center(column_length),
+    "Predicted phonemes".center(column_length),
+    "Original text".center(column_length),
+    sep=" | "
+)
+for file, prediction in zip(audio_files, predictions):
+    print(
+        file["path"].center(column_length),
+        "".join(prediction).center(column_length),
+        file["text"],
+        sep=" | "
+    )

tokenizer/vocab.json → vocab.json RENAMED

File without changes