The Usage of Machine Learning to Restore Speech & Language from Wax Cylinder and Early Disc Formats
The Ambientscape Machine Learning Speech-to-Text Model in 2025
In the 4th principle of our 'Usage of Artificial Intelligence Principles and Framework', we highlighted the importance of transparency when using AI technology.
The Ambientscape Project is currently training a structured Machine Learning prediction model that converts spoken audio from wax cylinders and early disc recordings into text, using an open source speech model based on the Wav2vec 2.0 encoder. This encoder was first released by Facebook, where it was pre-trained with a self-supervised objective on 60,000+ hours of read audiobooks from the LibriVox Project.
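As an illustration of this audio-to-text pipeline, here is a minimal sketch using the pretrained Wav2vec 2.0 ASR bundle that ships with torchaudio (the PyTorch tutorial by Moto Hira linked below follows the same pattern). The file name is hypothetical and the bundle is an assumed stand-in for the project's own checkpoint.

```python
import torch
import torchaudio

# Pretrained Wav2vec 2.0 bundle fine-tuned for ASR on LibriSpeech (assumed stand-in
# for the project's own checkpoint).
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()

# Hypothetical digitised transfer of a cylinder recording.
waveform, sample_rate = torchaudio.load("cylinder_transfer.wav")
waveform = waveform.mean(dim=0, keepdim=True)            # mix down to mono
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    emission, _ = model(waveform)                        # frame-wise label log-probabilities

# Greedy CTC decoding: best label per frame, collapse repeats, drop the blank token.
labels = bundle.get_labels()                             # ('-', '|', 'E', 'T', ...)
indices = torch.unique_consecutive(emission[0].argmax(dim=-1))
transcript = "".join(labels[i] for i in indices if labels[i] != "-").replace("|", " ")
print(transcript)
```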
Aim
Through our use of Wav2vec 2.0 we aim to expand and fine-tune the model so that it can help restore historical spoken audio by interpreting difficult-to-decipher speech from record formats such as wax cylinders, transcription discs and 78s, particularly cylinders containing endangered languages, and so that it can produce more accurate speech-to-text output from digitised audio data.
The Encoder
The Ambientscape Project uses an open source ASR (Automatic Speech Recognition) variant of the Wav2vec 2.0 encoder, adapted from 'Wav2vec2-large-robust-ft-libri-960h' (the accompanying paper is listed under Related Links below). This allows us to tailor the code accordingly and to develop the training further with more appropriate, domain-specific datasets and training data for better output.
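As a minimal sketch of how such a checkpoint can be loaded and nudged further with labelled audio, the example below assumes the Hugging Face copy of the checkpoint ('facebook/wav2vec2-large-robust-ft-libri-960h') and uses placeholder data; it is not the project's actual training script.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Assumed Hugging Face mirror of the checkpoint named in the text.
checkpoint = "facebook/wav2vec2-large-robust-ft-libri-960h"
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)
model.freeze_feature_encoder()        # keep the convolutional front end fixed while fine-tuning

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One illustrative training step on a single (audio, transcript) pair;
# in practice this loops over a dataset of digitised transfers and reference text.
audio = torch.randn(16000 * 5)        # placeholder for 5 seconds of 16 kHz audio
reference = "EXAMPLE TRANSCRIPT"      # placeholder reference transcription

inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer(reference, return_tensors="pt").input_ids

loss = model(input_values=inputs.input_values, labels=labels).loss   # CTC loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```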
Wav2Vec (Technical Definition)
Wav2Vec is a framework for self-supervised learning of representations from raw audio data. The Wav2vec model is an encoder that converts audio features into a sequence of probability distributions (expressed as negative log-likelihoods) over labels.
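To make that definition concrete, the short sketch below uses assumed values and shows only shapes: each frame of audio yields one log-probability distribution over the label vocabulary, and the CTC objective minimises the corresponding negative log-likelihood of the reference transcript.

```python
import torch

# Assumed values: roughly 10 s of 16 kHz audio and a 29-symbol character vocabulary.
batch, frames, num_labels = 1, 499, 29
emission = torch.log_softmax(torch.randn(batch, frames, num_labels), dim=-1)

print(emission.shape)              # torch.Size([1, 499, 29]) -> one distribution per frame
print(emission[0, 0].exp().sum())  # ~1.0: each frame is a valid probability distribution
# The negative of these values is the negative log-likelihood the CTC loss works with.
```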
Choosing the Encoder
The reason for using Wav2vec 2.0 is simple: we felt it had the right balance of accuracy and speed to work best with our equipment.
Challenges
Encoding audio from old records can present challenges involving audio characteristics that the current model is not accustomed to, which is why the training is so important. These include limited dynamic range, noise, fragmented audio, speed variations and erratic pitch change. Some of these differences can be reduced in the editing process, but this is time-consuming and will always be limited by the recording itself. It makes more sense to train the model to understand the wax cylinder and disc formats by training it with suitable audio data.
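Below is a minimal sketch (an assumed approach, not the project's actual pipeline) of how clean training audio could be deliberately degraded so the model encounters speed drift, reduced dynamic range and surface noise of the kind described above; the file name is hypothetical.

```python
import torch
import torchaudio

def degrade(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    # Speed/pitch drift: play back slightly fast or slow by resampling around the true rate.
    factor = float(torch.empty(1).uniform_(0.95, 1.05))
    drifted = torchaudio.functional.resample(waveform, sample_rate, int(sample_rate * factor))

    # Reduced dynamic range: soft-clip the signal into a narrower amplitude band.
    compressed = torch.tanh(3.0 * drifted) / 3.0

    # Surface noise: add broadband hiss at roughly a 10 dB signal-to-noise ratio.
    noise = torch.randn_like(compressed)
    scale = compressed.norm() / (noise.norm() * (10 ** (10.0 / 20)))
    return compressed + scale * noise

waveform, sr = torchaudio.load("clean_training_clip.wav")   # hypothetical clean source
augmented = degrade(waveform, sr)
```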
What about Deep Learning?
It could be argued that Deep Learning techniques would be preferable to our Machine Learning methods, since early audio speech samples may align more closely with unstructured data, where neural networks (interconnected nodes in a structure resembling the human brain) could potentially produce more precise results. This may be something we incorporate at a later date.
Future Prospect
The methods employed in this machine learning training should enable a restoration process in which more detailed information about speech can be gathered, enhancing the preservation of language through text in ways that would not be possible with traditional archiving techniques.
The model will be tested with recordings from the Ambientscape Archive.
Equipment used for Machine Learning
Computer: decommissioned, recycled 24-core Xeon server with up to 512 GB RAM and an Nvidia RTX 3090 GPU.
Software: Windows 10 running PyTorch through the Anaconda Python distribution.
(Core count, memory and GPU usage are strictly regulated according to the training task at hand; a single computer is all that we require, limiting power usage and energy dissipation.)
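As one example of this kind of regulation, the sketch below shows generic PyTorch settings (assumed values, not the project's actual configuration) that cap CPU threads and GPU memory for a single training process.

```python
import torch

# Assumed caps for a shared single-machine setup; adjusted per training task.
torch.set_num_threads(8)                                   # use 8 of the 24 Xeon cores
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    # Restrict this process to roughly half of the RTX 3090's memory.
    torch.cuda.set_per_process_memory_fraction(0.5, device=device)
```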
'Processing for M.L. and Deep Learning is known to consume a high level of energy; going forward we will be looking for other, more sustainable ways to use less power during the training process.'
Related Links:
The Readings
Embracing Artificial Intelligence for Preserving Dying Languages
Toward a Realistic Model of Speech Processing in the Brain (PDF)
Wav2vec 2.0 - Learning the Structure of Speech from Raw Audio
Wav2vec 2.0 Framework for Self-Supervised Learning (Neurosys)
PyTorch: Speech Recognition with Wav2Vec2 - Author Moto Hira
Revitalizing Endangered Languages - A.I.-Powered Languages
Robust Wav2vec 2.0: Analyzing Domain Shift in Pre-Training (paper)
Source Codes
CoEDL/Elpis - Software for Creating Speech Recognition Models
Wav2vec 2.0 GitHub - Open Source Code Repository
ML Framework
PyTorch - An Open Source Machine Learning Framework
A.I. Laboratory
Alan Turing Institute - Data Science Institute at the British Library
Google AI - Artificial Intelligence Company and Research Facility
Meta AI - An Artificial Intelligence Academic Research Laboratory
A.I. Translators
An Automatic Te Reo Māori Transcription Tool for Audio & Video
OB Translate: Nigerian MT/AI Assistance Platform for Languages
Google Woolaroo: Preserving Languages with Machine Learning
Ethical Guidelines
Ambientscape - Usage of Artificial Intelligence: The Ten Principles
Collection of Four Ethical Guidelines on Artificial Intelligence (PDF)
Understanding Artificial Intelligence Ethics and Safety - gov.uk
Accelerating Revitalisation of Te Reo Māori Webinar: AI for Good