Speech recognition is a subfield of computational linguistics focused on enabling
computers to accurately transcribe spoken words into text. It supports digital transformation
across education, industry, healthcare, and emerging IoT and ML applications. Research in
this field is rapidly advancing as scientists endeavor to broaden computers' abilities in processing
spoken language. Feature extraction, one of the steps in the process, transforms raw audio
into machine-readable data for analysis and is vital for machine learning and pattern recognition
tasks. This paper presents a groundbreaking advancement in speech recognition with the
introduction of the wav2vec 2.0 model, a self-supervised feature extractor. Departing
from conventional supervised methods, this model achieves superior performance by initially
learning representations from unlabeled speech audio and subsequently fine-tuning on
transcribed speech. By masking the latent space and solving a contrastive task, the model
efficiently learns contextualized representations, demonstrating remarkable adaptability on the
Libri-Light dataset. Even with minimal labeled data, wav2vec 2.0 outperforms previous cutting-edge
semi-supervised approaches, showcasing its potential for robust speech recognition in
scenarios with limited labeled data, a significant breakthrough for the broader accessibility of
speech recognition technology.
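
The latent space masking and contrastive task mentioned above can be illustrated with a small, dependency-free sketch. It is a simplified illustration under stated assumptions, not the wav2vec 2.0 implementation: the function names (`sample_span_mask`, `contrastive_loss`), the mask parameters, and the temperature of 0.1 are hypothetical choices made for this example, plain float vectors stand in for the model's learned target representations, and every other masked time step is used as a distractor rather than a sampled subset.

```python
import math
import random


def sample_span_mask(num_steps, start_prob, span_len, rng):
    """Mark each step as a span start with probability start_prob,
    then mask span_len consecutive steps from every start (spans may overlap)."""
    mask = [False] * num_steps
    for t in range(num_steps):
        if rng.random() < start_prob:
            for u in range(t, min(t + span_len, num_steps)):
                mask[u] = True
    return mask


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def contrastive_loss(context, targets, mask, temperature=0.1):
    """Average negative log-probability of picking the true target for each
    masked step, with the targets of all other masked steps as distractors."""
    idx = [t for t, masked in enumerate(mask) if masked]
    if not idx:
        return 0.0
    total = 0.0
    for t in idx:
        logits = [cosine(context[t], targets[u]) / temperature for u in idx]
        true_logit = cosine(context[t], targets[t]) / temperature
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - true_logit
    return total / len(idx)


if __name__ == "__main__":
    rng = random.Random(0)
    targets = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(20)]
    noise = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(20)]
    mask = sample_span_mask(20, start_prob=0.2, span_len=3, rng=rng)
    print("context equals targets:", contrastive_loss(targets, targets, mask))
    print("context is random noise:", contrastive_loss(noise, targets, mask))
```

When the context vectors at masked positions identify their own targets, the loss should be close to zero. Uninformative context vectors should give a loss near the logarithm of the number of candidates. Training the context network to reduce this loss is what pushes it toward contextualized representations.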