Subtitle AI

Service
Machine Learning
Brief Summary

Subtitle AI is a project focused on improving existing audio-to-text machine learning models through transfer learning and hyperparameter optimization.

Created
December 23, 2023
Tag
Featured

Overview

We used the pre-existing Wav2Vec 2.0 audio-to-text model developed by Facebook (now Meta), applying transfer learning and hyperparameter tuning to improve its performance on noisy audio.

Approach

We took the pre-trained Wav2Vec2 model, trained it on our data, and fine-tuned its hyperparameters to optimize performance. Wav2Vec2 is a speech model pre-trained on 16 kHz audio that accepts a float array corresponding to the raw waveform of the speech signal. We preprocessed all incoming audio with a noise reduction library before passing it to the model, then decoded the model's predictions with the tokenizer to obtain the transcription. Our dataset was the clean test split of the LibriSpeech ASR corpus.
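
A minimal sketch of this pipeline is shown below. The checkpoint name (facebook/wav2vec2-base-960h) and the noisereduce library are illustrative assumptions rather than the exact components used.

```python
# Sketch of the transcription pipeline: noise reduction on the raw waveform,
# Wav2Vec2 inference, and tokenizer decoding. Checkpoint and noise-reduction
# library are assumptions for illustration, not confirmed project code.
import torch
import noisereduce as nr
from datasets import load_dataset
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# LibriSpeech clean test split; each example carries a 16 kHz waveform.
librispeech = load_dataset("librispeech_asr", "clean", split="test")
sample = librispeech[0]
waveform = sample["audio"]["array"]
sample_rate = sample["audio"]["sampling_rate"]  # 16 kHz

# Reduce background noise before feeding the raw waveform to the model.
denoised = nr.reduce_noise(y=waveform, sr=sample_rate)

# Convert the float array into model inputs and run inference.
inputs = processor(denoised, sampling_rate=sample_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy-decode the logits and let the tokenizer produce the transcription.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```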

Outcomes

Because the Wav2Vec2 model has already been extensively fine-tuned, changes to its hyperparameter configuration result in only minimal performance differences. This implies that further fine-tuning of the model itself has a much greater impact on performance than adjusting individual hyperparameters. We achieved an average word error rate of 4.13% on the full set of testing data.
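
For reference, the word error rate compares the model's transcriptions against the ground-truth transcripts of the test set. The sketch below uses the jiwer library as one possible implementation; this is an assumption, not necessarily the tool used here.

```python
# Sketch of the WER measurement over collected transcripts; jiwer is an
# assumption, any standard WER implementation works the same way.
import jiwer

# Ground-truth transcripts and model outputs gathered over the test split
# (placeholder strings here for illustration).
references = ["mister quilter is the apostle of the middle classes"]
hypotheses = ["mister quilter is the apostle of the middle classes"]

wer = jiwer.wer(references, hypotheses)
print(f"Average WER: {wer:.2%}")
```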

Given more time with this project, we would further optimize the performance of the Wav2Vec2 model. We would start from an earlier Wav2Vec2 checkpoint (either the 10-minute or 100-hour fine-tuned model), split our dataset into training, validation, and testing subsets, and perform downstream training, adjusting the learning rate, vocabulary size, and other hyperparameters to improve overall performance and reduce the word error rate. A rough outline of that setup is sketched below.
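
The sketch below outlines that planned downstream setup. The checkpoint name, split choices, and hyperparameter values are illustrative placeholders rather than finalized settings.

```python
# Sketch of the planned downstream fine-tuning; checkpoint, splits, and
# hyperparameter values are illustrative assumptions, not project results.
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, TrainingArguments

# Start from an earlier, lightly fine-tuned checkpoint (here the 100-hour model).
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-100h")

# LibriSpeech already provides train/validation/test splits for the clean subset.
librispeech = load_dataset("librispeech_asr", "clean")
train_split = librispeech["train.100"]
val_split = librispeech["validation"]
test_split = librispeech["test"]

# Candidate hyperparameters to adjust during fine-tuning.
training_args = TrainingArguments(
    output_dir="wav2vec2-subtitle-ai",
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=10,
    warmup_steps=500,
)

# Feature extraction, the CTC data collator, and the Trainer itself are omitted
# here; they would consume train_split / val_split and report WER on test_split.
```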