hugofara committed
Commit f6be2ea · 1 Parent(s): 025252d

feat: code samples and expanded README.md

Some audio files were also added for verification.

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.wav filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -60,14 +60,29 @@ It is a phonemization model, that works both for French and Italian.
Given an audio file, it will output the words heard using [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet).
It does not use a language model, so it has a low likelihood of trying to map an audio on existing words.

+ ## Model Details
+
+ - Developed by: HugoFara
+ - Funded by: [NCCR Evolving Language](https://evolvinglanguage.ch/)
+
The training was conducted as a part of the [NCCR Evolving Language](https://evolvinglanguage.ch/) group,
a Swiss research institute on language.

- ## Usage
+ ## Uses

+ The model works with French and Italian audio.
Currently, everything is managed through PyTorch.
+ Let's transcribe this audio:
+
+ !["Sa capitale est Tsenkher"](audio-samples/tsenkher-fr.wav)
+
+ You can use the following code.

```python
+ """
+ Simple demonstration.
+ See main.py for a more complete demonstration.
+ """
import json

import torch
@@ -77,7 +92,7 @@ import transformers
import phoneme_recognizer

# Load the model with weights
- with open("tokenizer/vocab.json", "r") as file:
+ with open("vocab.json", "r") as file:
    phonemes_dict = json.load(file)

model = phoneme_recognizer.PhonemeRecognizer(phonemes_dict=phonemes_dict)
@@ -85,23 +100,29 @@ checkpoint = torch.load("model.pth")
model.load_state_dict(checkpoint)

# Prepare the input data
- feature_extractor = transformers.AutoFeatureExtractor.from_pretrained(
-     "microsoft/wavlm-base-plus"
- )
SAMPLING_RATE = 16_000

- audio_array, frequency = torchaudio.load("file.wav")
+ audio_array, frequency = torchaudio.load("audio-samples/tsenkher-fr.wav")
if frequency != SAMPLING_RATE:
    raise ValueError(f"Input audio frequency should be {SAMPLING_RATE} Hz, it is {frequency} Hz.")
- inputs = feature_extractor(audio_array, SAMPLING_RATE)
+ feature_extractor = transformers.AutoFeatureExtractor.from_pretrained(
+     "microsoft/wavlm-base-plus"
+ )
+ inputs = feature_extractor(
+     audio_array.squeeze(),
+     sampling_rate=SAMPLING_RATE,
+     padding=True,
+     return_tensors="pt",
+ )
inputs["language"] = "fr" # or "it"

# Do inference
with torch.no_grad():
    logits = model(**inputs)

- prediction = model.classify_to_phonemes(logits.unsqueeze(0))
- print("Final phonemes are:", "".join(prediction[0]))
+ prediction = model.classify_to_phonemes(logits)[0]
+ print("Final phonemes are:", "".join(prediction))
+ # Should output: "sakapitalɛtsɑ̃kɛʁ"
```

## Intended public
@@ -110,25 +131,18 @@ This model was mainly thought for clinicians that need audio transcriptions on a
As the training was conducted on adult voices, it has the same speech recognition biases as "normal" adult voices,
which means it corrects accents as long as they are widespread.

- ## Model architecture
-
- The model contains WavLM Base+, with a linear classifier on top.
-
- This linear classifier has the following input:
-
- - The first input is the language (0 for French, 1 for Italian).
- - The next 768 are the raw outputs of WavLM Base+.
+ Do not use this model for any harmful purpose.

- To get phonemes from this output, you can simply use an arg max and map the indices over
- `vocab.json`.
+ ## Training Details

- ## Dataset generation
+ ### Training Data

The dataset was adapted from Common Voice 17.0, French + Italian versions.
- To get an API representation of the sentences, a phonemizer was used.
+ To get an IPA representation of the sentences, a text phonemizer was used:
+ [charsiu/g2p_multilingual_byT5_small_100](https://huggingface.co/charsiu/g2p_multilingual_byT5_small_100).
The language of each sample (either French or Italian) was also saved as a dataset feature.

- ## Training procedure
+ ### Training Procedure

Only the training split of Common Voice 17.0 is used during training.

@@ -143,11 +157,23 @@ For the second phase of training, we unfreeze the transformer.
We start the same training procedure, a tri-state linear warm-up from scratch.
At the time of writing, the model only completed a single epoch of training.

- ## Results
+ ## Evaluation

The results are measured in Phoneme Error Rate (PER for short).
Using the validation set of Common Voice 17.0, we achieve less than 13% PER.

+ ## Technical Specifications
+
+ The model contains WavLM Base+, with a linear classifier on top.
+
+ This linear classifier has the following input:
+
+ - The first input is the language (0 for French, 1 for Italian).
+ - The next 768 are the raw outputs of WavLM Base+.
+
+ To get phonemes from this output, you can simply use an arg max and map the indices over
+ `vocab.json`.
+
## Related works

The model was created as a successor, and an extension, to [Cnam-LMSSC/wav2vec2-french-phonemizer](https://huggingface.co/Cnam-LMSSC/wav2vec2-french-phonemizer).
@@ -159,3 +185,12 @@ Not the same kind of measurement.
On the previous model, PER is measured on the training set (with a risk of overfitting),
while our PER is on some data the model never saw.
For reference, we achieved 2% PER on the training set with 100 epochs, yet it was still 18% PER on the validation set.
+
+ See also this very good multilingual version: [ASR-Project/Multilingual-PR](https://github.com/ASR-project/Multilingual-PR).
+
+ ## Todo list
+
+ - [ ] Data augmentation to finish the model training.
+ - [ ] Cleaner dataset with a better phonemizer.
+ - [ ] More powerful model using WavLM Large.
+ - [ ] More evaluation results.
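The new "Technical Specifications" section describes the classifier head only in prose. As a concrete illustration, here is a minimal sketch of that description; the internals of `phoneme_recognizer.PhonemeRecognizer` are not part of this commit, so the names `PhonemeHeadSketch` and `decode_sketch` are hypothetical, and `vocab.json` is assumed to map each phoneme to an integer index.

```python
import json

import torch
import transformers


class PhonemeHeadSketch(torch.nn.Module):
    """Hypothetical re-creation of the described head: WavLM Base+ plus a linear layer."""

    def __init__(self, vocab_size: int, hidden_size: int = 768):
        super().__init__()
        self.wavlm = transformers.WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
        # Input 0 is the language flag (0 = French, 1 = Italian);
        # inputs 1..768 are the raw WavLM Base+ hidden states.
        self.classifier = torch.nn.Linear(1 + hidden_size, vocab_size)

    def forward(self, input_values: torch.Tensor, language: torch.Tensor) -> torch.Tensor:
        # (batch, frames, 768) hidden states from the transformer.
        hidden = self.wavlm(input_values).last_hidden_state
        # Broadcast the per-utterance language flag over every frame.
        flags = language.float().view(-1, 1, 1).expand(hidden.shape[0], hidden.shape[1], 1)
        return self.classifier(torch.cat([flags, hidden], dim=-1))


def decode_sketch(frame_logits: torch.Tensor, vocab_path: str = "vocab.json") -> list:
    """Arg max per frame, then map indices back through vocab.json (assumed phoneme -> index)."""
    with open(vocab_path, "r") as file:
        vocab = json.load(file)
    index_to_phoneme = {index: phoneme for phoneme, index in vocab.items()}
    return [index_to_phoneme[i] for i in frame_logits.argmax(dim=-1).tolist()]
```

Note that the README decodes with a plain arg max; the real `classify_to_phonemes` may additionally collapse repeated frames, which this sketch does not do.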
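The "Evaluation" section reports results in Phoneme Error Rate without defining it. PER is the edit (Levenshtein) distance between the predicted and reference phoneme sequences, normalized by the reference length; here is a small self-contained sketch (not from this commit) of how it is typically computed:

```python
def phoneme_error_rate(reference: list, hypothesis: list) -> float:
    """Levenshtein distance between phoneme sequences, normalized by the reference length."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_phoneme in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_phoneme in enumerate(hypothesis, start=1):
            current_row.append(min(
                previous_row[j] + 1,                                 # deletion
                current_row[j - 1] + 1,                              # insertion
                previous_row[j - 1] + (ref_phoneme != hyp_phoneme),  # substitution
            ))
        previous_row = current_row
    return previous_row[-1] / max(len(reference), 1)


# One substitution (u -> y) over 6 reference phonemes: PER = 1/6 ≈ 0.17.
print(phoneme_error_rate(
    ["b", "ɔ", "n", "ʒ", "u", "ʁ"],
    ["b", "ɔ", "n", "ʒ", "y", "ʁ"],
))
```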
audio-samples/entrato-it.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1408499d421d1326eefd2ca003326426b508823d1e65505a8758b05cf5213a45
+ size 116814
audio-samples/italiens-fr.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bea76537859290d430c65d9f54ac6ac42359c05d55d7f4825a99c43c75090c09
+ size 140622
audio-samples/tsenkher-fr.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:efd264ba0cc41b4e4109253ac00800c675f3a65b4a6ee9d7c1590970dc8d2423
+ size 115278
main.py ADDED
@@ -0,0 +1,76 @@
+ """Just a demo code to use the model."""
+ import json
+
+ import torch
+ import torchaudio
+ import transformers
+
+ import phoneme_recognizer
+
+ # Load the model with weights
+ with open("vocab.json", "r") as file:
+     phonemes_dict = json.load(file)
+
+ model = phoneme_recognizer.PhonemeRecognizer(phonemes_dict=phonemes_dict)
+ checkpoint = torch.load("model.pth", map_location="cpu")
+ model.load_state_dict(checkpoint)
+
+ # Prepare the input data
+ SAMPLING_RATE = 16_000
+
+ audio_files = [
+     {
+         "path": "audio-samples/tsenkher-fr.wav",
+         "language": "fr",
+         "text": "Sa capitale est Tsenkher."
+     },
+     {
+         "path": "audio-samples/italiens-fr.wav",
+         "language": "fr",
+         "text": "Les Italiens ont été les premiers à réagir."
+     },
+     {
+         "path": "audio-samples/entrato-it.wav",
+         "language": "it",
+         "text": "Ma nessuno può esservi entrato!"
+     }
+ ]
+ feature_extractor = transformers.AutoFeatureExtractor.from_pretrained(
+     "microsoft/wavlm-base-plus"
+ )
+ audio_arrays = []
+ for audio in audio_files:
+     audio_array, frequency = torchaudio.load(audio["path"])
+     if frequency != SAMPLING_RATE:
+         raise ValueError(
+             f"Input audio frequency should be {SAMPLING_RATE} Hz, it is {frequency} Hz."
+         )
+     audio_arrays.append(audio_array[0].numpy())
+
+ inputs = feature_extractor(
+     audio_arrays,
+     sampling_rate=SAMPLING_RATE,
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs["language"] = [row["language"] for row in audio_files] # "fr" or "it"
+
+ # Do inference
+ with torch.no_grad():
+     logits = model(**inputs)
+
+ predictions = model.classify_to_phonemes(logits)
+ column_length = 34
+ print(
+     "Input file".center(column_length),
+     "Predicted phonemes".center(column_length),
+     "Original text".center(column_length),
+     sep=" | "
+ )
+ for file, prediction in zip(audio_files, predictions):
+     print(
+         file["path"].center(column_length),
+         "".join(prediction).center(column_length),
+         file["text"],
+         sep=" | "
+     )
tokenizer/vocab.json → vocab.json RENAMED
File without changes
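One last note on the "Training Procedure" section: the "tri-state linear warm-up" is never spelled out. Assuming it resembles the usual tri-stage schedule for fine-tuning speech models (linear warm-up, hold, linear decay), it could look like the sketch below; the phase ratios and step counts are illustrative assumptions, not values from this repository.

```python
import torch


def tri_stage_scale(step: int, total_steps: int, warmup_ratio: float = 0.1,
                    hold_ratio: float = 0.4) -> float:
    """Learning-rate scale for warm-up / hold / linear decay. Ratios are assumptions."""
    warmup_steps = int(total_steps * warmup_ratio)
    hold_steps = int(total_steps * hold_ratio)
    if step < warmup_steps:
        return step / max(warmup_steps, 1)          # linear warm-up
    if step < warmup_steps + hold_steps:
        return 1.0                                  # hold at the peak learning rate
    decay_steps = max(total_steps - warmup_steps - hold_steps, 1)
    return max(0.0, (total_steps - step) / decay_steps)  # linear decay to zero


# Usage with any optimizer; the dummy parameter and step count are only for illustration.
optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: tri_stage_scale(step, total_steps=10_000)
)
```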