User Guide
Project Overview
What this system is and why it was built
The Arabic Audio Intelligence System is a deep-learning-powered platform that converts spoken Arabic audio into searchable, analysable, and summarised text. It was built as an academic deliverable addressing the real-world challenge of processing Arabic speech, a language with complex morphology, optional diacritics, and dozens of spoken dialects, entirely with open-source neural networks.
Rather than a single-purpose transcription tool, the system implements a fully integrated pipeline: raw audio goes in, and structured knowledge (transcripts, speaker timelines, emotional states, summaries, indexed notes) comes out. Every component is independently accessible through a module-driven interface. The system supports both fully local inference (no internet required after initial model download) and a cloud-accelerated mode via the OpenAI Whisper API for instant transcription.
The project satisfies every item in the course specification, including all optional advanced tasks (speaker diarization, emotion detection, text summarization, and a voice chat messenger with search capabilities). Four ASR models are available: OpenAI Whisper API (cloud), Whisper Local (HuggingFace), Wav2Vec2 (HuggingFace), and a custom CNN+LSTM trained from scratch.
| Requirement | File / Module | Status |
|---|---|---|
| Speech-to-Text CNN+LSTM (from scratch) | src/models/cnn_lstm.py + notebooks/02_cnn_lstm_training.ipynb | ✓ |
| Pre-trained ASR: Whisper (local) | src/models/whisper_asr.py | ✓ |
| Pre-trained ASR: Whisper API (cloud) | src/models/whisper_api.py | ✓ |
| Pre-trained ASR: Wav2Vec 2.0 | src/models/wav2vec2_asr.py | ✓ |
| WER / CER Evaluation | src/evaluate.py | ✓ |
| Voice Notes Search Engine | src/search_engine.py | ✓ |
| Speaker Diarization (advanced) | src/speaker_diarization.py | ✓ |
| Emotion Detection (advanced) | src/emotion_detection.py | ✓ |
| Text Summarization (advanced) | src/summarizer.py | ✓ |
| Voice Chat Messenger (advanced) | frontend/src/app/chat/ | ✓ |
| Live Demo Interface | Next.js 16 + FastAPI | ✓ |
System Architecture
How data flows from audio input to structured output
The architecture is split into two independent layers: a Python FastAPI backend at port 8000 that handles all ML computation, and a Next.js 16 frontend at port 3000 that renders the interface. The two communicate via a stateless REST API over HTTP, meaning the backend can be hosted on a GPU server anywhere while the frontend can be served from a CDN.
Audio Input (.wav / .mp3 / .flac / .ogg)
                    │
                    ▼
┌───────────────────────────────────────┐
│ Audio Preprocessing                   │
│ Resample to 16kHz mono                │
│ Segment if > 30s (Whisper chunking)   │
│ MFCC / Log-Mel extraction             │
└───────────────────┬───────────────────┘
                    │
          ┌─────────┴──────────┐
          ▼                    ▼
    ┌────────────┐  ┌──────────────────────┐
    │ CNN + LSTM │  │ Whisper / Wav2Vec2   │
    │ (custom)   │  │ (HuggingFace)        │
    └─────┬──────┘  └──────────┬───────────┘
          └─────────┬──────────┘
                    ▼
         ┌──────────────────────┐
         │ Text Transcript      │
         └──────────┬───────────┘
                    │
       ┌────────────┼─────────────┐
       ▼            ▼             ▼
  ┌──────────┐ ┌────────┐ ┌────────────────┐
  │Summarizer│ │Emotion │ │Speaker Diarize │
  │mT5/TF-IDF│ │Wav2Vec2│ │PyAnnote/ECAPA  │
  └──────────┘ └────────┘ └────────────────┘
       │            │             │
       └────────────┴─────────────┘
                    │
                    ▼
         ┌─────────────────────┐
         │ Search Engine       │
         │ FAISS + sentence-   │
         │ transformers        │
         └─────────────────────┘
Backend API surface
| Endpoint | Method | Input | Returns |
|---|---|---|---|
| /api/health | GET | (none) | { status: 'ok' } |
| /api/transcribe | POST | file (audio), model | { transcript, language, duration } |
| /api/diarize | POST | file (audio), num_speakers | { segments: [{speaker, start, end}] } |
| /api/emotion | POST | file (audio) | { emotion, scores: {happy,angry,…} } |
| /api/summarize | POST | text, max_length, method | { summary, method_used } |
| /api/search | POST | query, search_type | { results: [{text, score, id}] } |
| /api/search/add | POST | transcript, file? | { id, message } |
| /api/search/stats | GET | (none) | { total_notes, index_size } |
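As a rough illustration of how a script might call one of the endpoints above, the sketch below builds a POST request for /api/summarize using only the Python standard library. Several things here are assumptions rather than documented behaviour: the base URL (the default port named in this guide), a JSON-encoded request body (the table lists the fields but not their encoding), the "extractive" value for method, and the helper names build_summarize_request and send, which are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default backend port from this guide


def build_summarize_request(text, max_length=150, method="extractive", base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/summarize.

    Assumes the endpoint accepts a JSON body; the guide lists the fields
    (text, max_length, method) but not how they are encoded.
    """
    payload = {"text": text, "max_length": max_length, "method": method}
    return urllib.request.Request(
        url=f"{base_url}/api/summarize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the JSON reply (needs a running backend)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect before any network traffic happens.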
Frontend routing
Each feature is a Next.js App Router page under src/app/. All API calls are centralised in src/lib/api.ts. The dynamic side navbar uses Framer Motion layout animations to transition between a horizontal bottom dock (home page) and a vertical sidebar (all other pages), animated as a fluid spring, not a CSS snap.
Model Zoo
Every neural network used: architecture, rationale, performance
OpenAI Whisper API (Cloud)
The fastest option: sends audio to OpenAI's servers for transcription using the latest Whisper large model. No GPU or local model download needed. Requires an OPENAI_API_KEY in the .env file. The API handles any audio format and returns Arabic text directly. Ideal for quick demos or machines without GPUs.
OpenAI Whisper (Local)
Whisper is an encoder-decoder Transformer trained on 680,000 hours of multilingual audio. The audio encoder is a convolutional stem followed by a stack of Transformer blocks that produces a context-rich embedding of the log-mel spectrogram. The decoder auto-regressively generates BPE tokens using cross-attention over those encoder outputs. It natively supports Arabic and handles diacritics, code-switching, and noise robustly. We use whisper-small (244M parameters) locally; the model auto-downloads from HuggingFace on first use.
Wav2Vec 2.0: Arabic fine-tune
Wav2Vec 2.0 (Meta AI) learns speech representations directly from raw waveforms. Its convolutional feature extractor downsamples the audio, and a 24-layer Transformer encoder is pre-trained with a masked contrastive loss (like BERT for speech). A linear CTC head is then fine-tuned on labelled data. We load facebook/wav2vec2-large-xlsr-53-arabic, the XLSR-53 model fine-tuned specifically on the Arabic Common Voice split. Because CTC decoding is non-autoregressive it is roughly 2× faster than Whisper at inference.
CNN + LSTM: Custom Academic Model
Built from scratch to satisfy the course requirement for a deep-learning model designed and trained by students. It operates on MFCC features extracted from audio. Two 2D convolutional layers with batch normalisation extract spectro-temporal patterns from the feature map. The output is reshaped and fed into two bidirectional LSTM layers that model temporal context across frames. A fully-connected layer followed by a softmax over the character vocabulary produces per-frame posteriors; CTC loss handles alignment-free training.
Input waveform → librosa MFCC → (128, T) feature map
↓
Conv2D(32, 3×3, padding=1) → BN → ReLU
Conv2D(64, 3×3, padding=1) → BN → ReLU → MaxPool2D(2,2)
↓
Reshape → (T//2, 64 × 64) [batch × time × features]
↓
BiLSTM(256, num_layers=2, dropout=0.3)
↓
Linear(vocab_size) → log_softmax
↓
CTC Loss ← reference transcript
The CNN+LSTM model is the academic experiment. Expect WER of 35–45% depending on training epochs and dataset quality. Whisper or Wav2Vec2 should be selected for production-quality transcriptions.
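The tensor shapes in the layer stack above can be checked with plain arithmetic. The sketch below traces them for one clip; the function name cnn_lstm_shapes and the example frame count are illustrative, and the layer sizes are the ones listed in this guide.

```python
def cnn_lstm_shapes(n_mfcc=128, frames=200, hidden=256, vocab_size=50):
    """Trace per-clip tensor shapes through the layers listed above."""
    shapes = [("input", (1, n_mfcc, frames))]  # (channels, freq, time)
    # 3x3 convolutions with padding=1 keep the freq and time sizes unchanged.
    shapes.append(("conv1", (32, n_mfcc, frames)))
    shapes.append(("conv2", (64, n_mfcc, frames)))
    # MaxPool2D(2, 2) halves both axes (floor division).
    freq, time = n_mfcc // 2, frames // 2
    shapes.append(("pool", (64, freq, time)))
    # Channels and frequency bins are flattened into one vector per time step.
    shapes.append(("reshape", (time, 64 * freq)))
    # A bidirectional LSTM concatenates forward and backward hidden states.
    shapes.append(("bilstm", (time, 2 * hidden)))
    shapes.append(("linear", (time, vocab_size)))
    return shapes


for name, shape in cnn_lstm_shapes():
    print(f"{name:8s} {shape}")
```

With 128 MFCC coefficients the reshape step yields 64 × 64 = 4096 features per time step, and the time axis is halved by the pooling layer, so the lengths passed to the CTC loss should refer to the pooled frame count rather than the original one.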
PyAnnote Audio: Speaker Diarization
PyAnnote is a speaker diarization toolkit built on PyTorch. The pyannote/speaker-diarization-3.1 pipeline chains voice-activity detection (a segmentation model), speaker embedding extraction (ECAPA-TDNN d-vectors), and spectral clustering into a single end-to-end pipeline. It outputs a speaker-labeled timeline with RTTM-compatible start/end timestamps.
Wav2Vec 2.0: Emotion Classification
A Wav2Vec 2.0 base checkpoint fine-tuned for speech emotion recognition. A mean-pooling layer followed by a 4-class linear head is trained on prosodic emotional datasets. The four classes are happy, angry, neutral, sad. The model outputs a probability distribution over all four, making it interpretable beyond a single predicted label.
mT5 / TF-IDF: Summarization
Two modes are offered. Extractive summarization uses TF-IDF sentence scoring: sentences are ranked by weighted term frequency and the top-N are returned verbatim. This mode is fast and needs no GPU. Abstractive summarization runs a multilingual sequence-to-sequence model (Helsinki-NLP or an Arabic T5 variant) to generate a free-form condensed paraphrase of the input text.
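The extractive mode can be sketched in a few lines of standard-library Python. This is a minimal illustration of TF-IDF sentence ranking, not the code in src/summarizer.py: the sentence splitter, whitespace tokenisation, the smoothed IDF formula, and the name extractive_summary are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def extractive_summary(text, top_n=2):
    """Return the top_n highest-scoring sentences, kept in their original order."""
    # Naive split on Latin and Arabic sentence punctuation (an assumption).
    sentences = [s.strip() for s in re.split(r"[.!?؟]+", text) if s.strip()]
    tokenised = [s.lower().split() for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))

    def score(tokens):
        tf = Counter(tokens)
        # Smoothed IDF; averaging stops long sentences winning on length alone.
        weights = [tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf]
        return sum(weights) / len(tokens)

    ranked = sorted(range(n), key=lambda i: score(tokenised[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]
```

Sentences made of rare terms score highest, which is why repeated filler sentences tend to drop out of the summary first.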
CNN+LSTM Model Setup
How to get the checkpoint running, from GitHub to the backend
The CNN+LSTM model is a custom-trained checkpoint that must be present at outputs/checkpoints/best_model.pt before selecting it in the Transcribe page. This section covers three paths: using the pre-trained checkpoint from the repo, training your own, or fixing the backend if you see name 'T' is not defined.
Option 1: Use the checkpoint already in the repo (recommended)
The checkpoint is committed to the repository via Git LFS. After cloning, it will be at outputs/checkpoints/best_model.pt (~55 MB). If LFS was not installed when you cloned, run the commands below to pull it down:
# 1. Install Git LFS (one-time):
git lfs install
# 2. Pull the checkpoint (if you already cloned without LFS):
git lfs pull
# 3. Verify the file exists and is the correct size:
ls -lh outputs/checkpoints/best_model.pt
# expected: 55MB
If Git LFS is not available, you can also download the checkpoint directly from the GitHub release assets at github.com/Moaz2010/arabic-asr-project/releases and place the file at outputs/checkpoints/best_model.pt manually.
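A common failure after cloning without LFS is that best_model.pt exists but is only a small text pointer rather than the real weights. The sketch below is one way to tell the cases apart; the function name checkpoint_status is hypothetical, and the check relies on the first line that Git LFS writes into its pointer files.

```python
from pathlib import Path

# Git LFS pointer files are small text stubs whose first line is this string.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def checkpoint_status(path="outputs/checkpoints/best_model.pt"):
    """Classify the checkpoint file as 'missing', 'lfs-pointer', or 'ok'."""
    file = Path(path)
    if not file.is_file():
        return "missing"
    with file.open("rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    # An un-pulled LFS file starts with the pointer header instead of binary data.
    if head == LFS_POINTER_PREFIX:
        return "lfs-pointer"
    return "ok"
```

A result of "lfs-pointer" suggests running git lfs pull; "missing" suggests the manual download route described above.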
Option 2: Set up from GitHub (full fresh install)
# Prerequisites: Python 3.9+, Node 18+, git
# 1. Clone the repository
git clone https://github.com/Moaz2010/arabic-asr-project.git
cd arabic-asr-project
# 2. Create and activate a virtual environment
python -m venv venv
venv\Scripts\activate          # Windows
# source venv/bin/activate     # macOS / Linux
# 3. Install Python dependencies
pip install -r requirements.txt
pip install -r backend/requirements.txt
# 4. Start the backend (auto-finds a free port 8000–8019)
python backend/main.py
# → Uvicorn running on http://0.0.0.0:8001 (or whichever port is free)
# → Copy the NEXT_PUBLIC_API_URL value it prints
# 5. Set the frontend env variable (use the port printed in step 4)
#    Windows path shown; on macOS / Linux write to frontend/.env.local
echo NEXT_PUBLIC_API_URL=http://localhost:8001 > frontend\.env.local
# 6. Install frontend deps and start Next.js
cd frontend
npm install
npm run dev
# → http://localhost:3000
# 7. Open the Transcribe page, select CNN+LSTM, upload a WAV file
Connecting the config
The backend loads the checkpoint path from configs/config.yaml. If your checkpoint is saved elsewhere, update this file before starting the backend:
# configs/config.yaml
models:
  cnn_lstm:
    checkpoint: "outputs/checkpoints/best_model.pt"
    vocab_size: 50
    n_mfcc: 128
    hidden_size: 256
    num_layers: 2
Troubleshooting: "name 'T' is not defined"
This error means the backend is running a stale cached version of audio_utils.py that still relies on the old alias T for torchaudio.transforms (import torchaudio.transforms as T). The fix has already been applied in the current codebase; you only need to make sure Python reloads the module from disk.
| Cause | Fix |
|---|---|
| Stale __pycache__ .pyc files | Delete all __pycache__ folders and restart: rmdir /s /q src\__pycache__ (Windows) or find . -name __pycache__ -exec rm -rf {} + (Linux/Mac) |
| Old backend process still running | Kill the process occupying the port (check Task Manager or netstat -ano \| findstr 8000), then restart backend/main.py |
| Running the wrong Python version | Make sure you run python3.13 (or your venv's Python), not a system python that has an old .pyc cached |
# Quick fix: clear all pycache and restart
# Windows PowerShell:
Get-ChildItem -Recurse -Filter __pycache__ | Remove-Item -Recurse -Force
# macOS / Linux:
find . -type d -name __pycache__ -exec rm -rf {} +
# Then restart:
python backend/main.py
After applying the fix, CNN+LSTM imports cleanly and inference completes in ~15–20 seconds on CPU. The output for silent or very short clips will be an empty string; this is expected. Upload a real Arabic speech clip for meaningful output.
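If you prefer a single cross-platform command over the shell variants, the same cleanup can be done from Python. This is a small sketch using only the standard library; the function name clear_pycache is hypothetical and is not part of the project.

```python
import shutil
from pathlib import Path


def clear_pycache(root="."):
    """Delete every __pycache__ directory under root; return how many were removed."""
    # Build the list first so we never iterate into a directory we just deleted.
    targets = [p for p in Path(root).rglob("__pycache__") if p.is_dir()]
    removed = 0
    for target in targets:
        if target.exists():  # may already be gone if nested inside another target
            shutil.rmtree(target)
            removed += 1
    return removed
```

Run it from the repository root, then restart backend/main.py so the modules are recompiled from the current sources.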
Option 3: Train your own checkpoint
See the Training Notebook section of this guide (Training Notebook tab in the sidebar). The short path:
# 1. Open the training notebook in Colab or VS Code:
jupyter lab notebooks/02_cnn_lstm_training.ipynb
# 2. Add your Mozilla Data Collective API key in Colab Secrets:
#    Name: MDC_API_KEY, value: your key from datacollective.mozillafoundation.org
# 3. Run all cells (≈35 min per epoch on a Colab T4 GPU)
#    The notebook saves checkpoints/best_model.pt on every val-WER improvement
# 4. Copy the trained checkpoint to the expected path:
cp checkpoints/best_model.pt outputs/checkpoints/best_model.pt
# 5. Restart the backend; it will auto-load the new checkpoint
python backend/main.py
Datasets
Training data sources, sizes, and characteristics
Arabic ASR is data-starved compared to English. All available high-quality labelled Arabic speech resources total a few hundred hours, versus millions of hours for English. The following datasets are used or referenced in this project:
| Dataset | Size | Domain / Dialect | Usage in this project |
|---|---|---|---|
| Mozilla Common Voice (Arabic) | ~25h validated | MSA + some dialects | Primary training set for CNN+LSTM; Wav2Vec2 fine-tune |
| Arabic Speech Corpus (Nawar Halabi) | 1.8h studio | MSA broadcast | Evaluation + pronunciation reference |
| MASC (hirundo-io/MASC) | Multi-domain | MSA | Keyword spotting evaluation |
| EJUST Dataset (restricted) | University internal | Egyptian Arabic | Dialectal robustness testing |
| IEMOCAP (English reference) | 12h scripted | N/A | Emotion detection training baseline |
Mozilla Common Voice Arabic is gated behind the Mozilla Data Collective. To access it: create a free account at datacollective.mozillafoundation.org, request access to dataset cmn2g7uu701fqo1072r5na25l, then generate an API key from your account dashboard. In Google Colab open the Secrets panel (lock icon in the left sidebar), add a secret named MDC_API_KEY, and paste your key. The notebook reads it with userdata.get("MDC_API_KEY") and uses it as a Bearer token automatically.
You do not need to download any dataset to use the inference pipeline. Models are fine-tuned checkpoints served from HuggingFace and downloaded automatically on first use.
Data preprocessing
Every audio file is resampled to 16 kHz mono before any model sees it. For the CNN+LSTM training pipeline, 128 MFCC coefficients are extracted per frame (hop = 512, window = 2048). Feature maps are normalised to zero-mean unit-variance per sample. During training, SpecAugment is applied: random time masks (T=50 frames) and frequency masks (F=40 bins) are zeroed out to improve noise robustness.
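The SpecAugment step described above can be illustrated with NumPy. This is a simplified sketch rather than the training code itself: it assumes a (frequency, time) feature map, draws each mask width uniformly up to the stated maximum, and applies two masks of each kind; the function name spec_augment is hypothetical.

```python
import numpy as np


def spec_augment(features, max_time=50, max_freq=40, n_masks=2, rng=None):
    """Zero out random time and frequency bands of a (freq, time) feature map."""
    if rng is None:
        rng = np.random.default_rng()
    out = features.copy()  # leave the caller's array untouched
    n_freq, n_time = out.shape
    for _ in range(n_masks):
        # Frequency mask: width in [0, max_freq], clipped to the map height.
        width = int(rng.integers(0, min(max_freq, n_freq) + 1))
        start = int(rng.integers(0, n_freq - width + 1))
        out[start:start + width, :] = 0.0
        # Time mask: width in [0, max_time], clipped to the clip length.
        width = int(rng.integers(0, min(max_time, n_time) + 1))
        start = int(rng.integers(0, n_time - width + 1))
        out[:, start:start + width] = 0.0
    return out
```

Because the features are normalised to zero mean beforehand, filling a masked band with zeros is equivalent to replacing it with the average value for that sample.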
Audio Processing
From raw waveform to neural network input
Neural models cannot consume raw audio bytes directly; they need structured numerical representations. Three representations are used across the different models:
MFCC (CNN+LSTM)
Mel-Frequency Cepstral Coefficients approximate the human auditory system. A Short-Time Fourier Transform (STFT) extracts frequency content over overlapping frames, the frequency axis is warped to the perceptual Mel scale, log energies are computed, and finally a Discrete Cosine Transform produces compact coefficients. The result is a 2D feature map (frequency × time) that the convolutional layers treat like an image.
import librosa, numpy as np
y, sr = librosa.load("audio.wav", sr=16000, mono=True)
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=128, n_fft=2048, hop_length=512
)  # shape: (128, T)
# Normalise per sample
mfcc = (mfcc - mfcc.mean()) / (mfcc.std() + 1e-8)
Log-Mel Spectrogram (Whisper)
Whisper uses 80 Mel filter-bank channels with 25 ms frames shifted by 10 ms. The spectrogram is computed, log-scaled, normalised to [−1, 1], and chunked into 30-second blocks. The convolutional stem of Whisper's encoder processes these chunks in parallel before Transformer layers attend across frames.
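The millisecond figures above translate into sample and frame counts as follows. The sketch is plain arithmetic at the 16 kHz sample rate used throughout this guide; the function name whisper_frame_layout is illustrative.

```python
SAMPLE_RATE = 16_000  # Hz, the rate every clip is resampled to


def whisper_frame_layout(window_ms=25, hop_ms=10, chunk_s=30, sample_rate=SAMPLE_RATE):
    """Convert window, hop, and chunk durations into sample and frame counts."""
    n_fft = sample_rate * window_ms // 1000     # samples per analysis window
    hop_length = sample_rate * hop_ms // 1000   # samples between frame starts
    n_samples = sample_rate * chunk_s           # samples in one chunk
    n_frames = n_samples // hop_length          # spectrogram frames per chunk
    return {
        "n_fft": n_fft,
        "hop_length": hop_length,
        "n_samples": n_samples,
        "n_frames": n_frames,
    }


print(whisper_frame_layout())
# {'n_fft': 400, 'hop_length': 160, 'n_samples': 480000, 'n_frames': 3000}
```

Each 30-second block therefore becomes an 80 × 3000 log-Mel matrix before it reaches the encoder.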
Raw Waveform (Wav2Vec 2.0)
Wav2Vec 2.0 consumes normalised 16kHz waveform values directly. Its first stage, a stack of temporal 1D convolutions, acts as a learned feature extractor that replaces hand-crafted MFCCs. This end-to-end approach captures fine-grained acoustic detail that fixed feature extractors may discard.
Training the CNN+LSTM
Loss function, hyperparameters, and the training loop
The custom model is trained in src/train.py and a complete Colab-ready walkthrough is provided in notebooks/02_cnn_lstm_training.ipynb. The notebook covers data loading, feature extraction, model construction, training loop, and evaluation in sequential cells.
CTC Loss: no forced alignment needed
CTC (Connectionist Temporal Classification) allows training on (audio, transcript) pairs without frame-level phoneme annotations. It marginalises over every possible alignment between the output sequence and the target label by summing probabilities across all valid paths (including repetitions and blank tokens). This makes large-scale ASR training practical.
import torch.nn as nn
criterion = nn.CTCLoss(blank=0, zero_infinity=True)
# log_probs: (T, N, C), i.e. time × batch × vocab_size
# targets, input_lengths, target_lengths
loss = criterion(log_probs, targets, input_lengths, target_lengths)
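To make the "sum over all valid alignments" idea concrete, the sketch below computes the CTC probability of a target by brute force: it enumerates every frame-level path, collapses repeats and blanks, and adds up the paths that reduce to the target. This is for intuition on tiny inputs only; real implementations such as nn.CTCLoss use dynamic programming instead. The function names are hypothetical.

```python
import itertools


def collapse(path, blank=0):
    """Apply the CTC rule: merge repeated symbols, then drop blanks."""
    out = []
    previous = None
    for symbol in path:
        if symbol != previous and symbol != blank:
            out.append(symbol)
        previous = symbol
    return tuple(out)


def ctc_prob_bruteforce(probs, target, blank=0):
    """Sum the probability of every path that collapses to target.

    probs[t][c] is the probability of symbol c at frame t.
    Cost grows as C**T, so this is only usable for toy examples.
    """
    n_frames, n_symbols = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(n_symbols), repeat=n_frames):
        if collapse(path, blank) == tuple(target):
            p = 1.0
            for t, symbol in enumerate(path):
                p *= probs[t][symbol]
            total += p
    return total


# Two frames, two symbols (0 = blank, 1 = "a"), uniform probabilities.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(ctc_prob_bruteforce(uniform, [1]))  # 0.75
```

In the example, three of the four possible paths (blank-a, a-blank, a-a) collapse to the single character, so the target probability is 0.75; the CTC loss is the negative log of that value.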
Configuration
| Parameter | Value / Strategy |
|---|---|
| Optimiser | AdamW, lr = 3e-4, weight_decay = 1e-4 |
| LR Schedule | OneCycleLR: 10% warmup → cosine anneal |
| Batch size | 16 per step + gradient accumulation × 4 = effective 64 |
| Epochs | 30 max with early stopping (patience = 5, monitor val-WER) |
| Dropout | 0.3 in LSTM, 0.1 in FC head |
| Gradient clipping | max_norm = 5.0 (prevents LSTM exploding gradients) |
| SpecAugment | Time mask T = 50 frames, Frequency mask F = 40 bins, 2 masks each |
| Vocab | Arabic character set + blank + space ≈ 50 tokens |
Training requires a CUDA GPU with at least 8 GB VRAM. On CPU, a single epoch over Common Voice Arabic takes several hours. Use the provided Google Colab notebook for free cloud GPU access.
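The early-stopping rule in the configuration table (patience = 5, monitored on validation WER) can be expressed as a small helper. This is a sketch of the general technique, not the code in src/train.py; the class name and its interface are assumptions.

```python
class EarlyStopping:
    """Stop training when a lower-is-better metric stops improving."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, value):
        """Record one epoch's metric; return (improved, should_stop)."""
        if value < self.best:
            self.best = value
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


# Example with made-up validation WER values and a short patience.
stopper = EarlyStopping(patience=2)
for epoch, val_wer in enumerate([0.90, 0.80, 0.85, 0.81], start=1):
    improved, should_stop = stopper.step(val_wer)
    if improved:
        pass  # this is where a best_model.pt checkpoint would be saved
    if should_stop:
        break
```

The improved flag doubles as the trigger for saving the best checkpoint, which matches the "save on val-WER improvement" behaviour described in this guide.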
Evaluation Metrics
How transcription quality is measured objectively
Word Error Rate (WER)
WER is the primary ASR metric. It computes the minimum edit distance between the model's hypothesis and the ground-truth reference at the word level, then normalises by the total number of reference words. Lower is better; perfect transcription = 0%.
WER = (S + D + I) / N
S = Substitutions (wrong word predicted)
D = Deletions (reference word missing from hypothesis)
I = Insertions (extra word in hypothesis)
N = Total reference words
Example:
Reference: أنا أُحِبُ العُلوم كَثيرًا
Hypothesis: أنا أحب العلوم
→ 3 errors (S, S, D)
WER = 3/4 = 75%
Character Error Rate (CER)
CER applies the same formula character-by-character. For Arabic, CER is often more informative than WER because a single morphological suffix difference counts as one word substitution in WER but only a few character errors in CER, giving credit for partially-correct words.
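Both metrics reduce to a Levenshtein distance divided by the reference length. The sketch below is a minimal standard-library version for illustration; it is not taken from src/evaluate.py, it splits words on whitespace, and it counts spaces as characters in CER, all of which are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("a b c d", "a x y"))  # 0.75
print(cer("abcd", "abxd"))      # 0.25
```

The first call mirrors the worked example: two substitutions and one deletion against four reference words give 3/4.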
Benchmark comparison
| Model | WER (Arabic Common Voice) | CER | Inference time |
|---|---|---|---|
| Whisper API (cloud) | ≈ 8% | ≈ 3% | < 1s (cloud GPU) |
| Whisper-small (local) | ≈ 12% | ≈ 4% | 15-30s on CPU |
| Wav2Vec2 XLSR-53 Arabic | ≈ 18% | ≈ 6% | 20-40s on CPU |
| CNN+LSTM (from scratch) | ~95%* | ≈ 82% | ~10s on CPU |
* CNN+LSTM WER reflects training on ~25h of data for 50 epochs. The high WER is the expected academic result, demonstrating why pre-trained models like Whisper (680,000h training) massively outperform small-data training-from-scratch approaches. This gap IS the educational insight.
Feature Modules
What each page in the app does, technically
Transcribe
/api/transcribe: whisper_api / whisper / wav2vec2 / cnn_lstm
Accepts a raw audio file upload. Four models are available: Whisper API (cloud, fastest), Whisper Local (best offline accuracy), Wav2Vec2 (fast CTC), and CNN+LSTM (custom experiment). Whisper uses auto-regressive beam search; Wav2Vec2 and CNN+LSTM use greedy CTC decoding. The result can be copied, downloaded, or forwarded to the Summarize or Search modules.
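Greedy CTC decoding, as used by the Wav2Vec2 and CNN+LSTM paths, takes the best token at each frame, merges consecutive repeats, and removes blanks. The sketch below shows the idea on plain Python lists; the toy vocabulary and the function name greedy_ctc_decode are illustrative only.

```python
def greedy_ctc_decode(frame_scores, vocab, blank=0):
    """Pick the best token per frame, collapse repeats, then drop blanks."""
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    decoded = []
    previous = blank
    for index in best_path:
        # Emit a token only when it changes and is not the blank symbol.
        if index != previous and index != blank:
            decoded.append(vocab[index])
        previous = index
    return "".join(decoded)
```

A blank between two identical tokens is what allows doubled letters to survive, and an all-blank path decodes to an empty string, which is the behaviour you see on silent clips.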
Voice Search
/api/search: FAISS + sentence-transformers
Transcripts (and text notes) are embedded using a multilingual sentence-transformer and stored in a FAISS flat-L2 index on disk. Keyword mode runs TF-IDF BM25-style exact matching; Neural mode performs cosine similarity over the dense embeddings. Results are returned ordered by relevance score.
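The neural mode can be pictured as a nearest-neighbour lookup over unit-length vectors. The sketch below uses NumPy as a stand-in for FAISS and hand-written vectors as a stand-in for sentence-transformer embeddings, so both the data and the function names are assumptions made for illustration.

```python
import numpy as np


def normalise(vectors):
    """Scale each row to unit length so dot products equal cosine similarity."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)


def search(index_vectors, query_vector, k=3):
    """Return (row, cosine score) pairs for the k most similar stored vectors."""
    index = normalise(index_vectors)
    query = normalise([query_vector])[0]
    scores = index @ query
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]
```

For unit-length vectors the squared L2 distance equals 2 − 2·cos, so a flat L2 index returns the same ranking as cosine similarity provided the embeddings are normalised before they are added.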
Speaker Diarization
/api/diarize: pyannote/speaker-diarization-3.1
The PyAnnote pipeline runs VAD to detect speech segments, extracts ECAPA-TDNN d-vectors per segment, and clusters them with spectral clustering. Output is a speaker-labelled timeline rendered as colour-coded horizontal bands. Set num_speakers explicitly if known for best accuracy.
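Once you have the segments list returned by /api/diarize, simple post-processing is straightforward. As one hedged example, the sketch below totals the speaking time per speaker; it assumes each segment is a dictionary with speaker, start, and end keys (times in seconds), and the speaker labels shown are illustrative.

```python
from collections import defaultdict


def speaking_time(segments):
    """Sum the duration of each speaker's segments, in seconds."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


# Illustrative response shape; the labels are placeholders.
example = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5},
    {"speaker": "SPEAKER_01", "start": 2.5, "end": 4.0},
    {"speaker": "SPEAKER_00", "start": 4.0, "end": 5.0},
]
print(speaking_time(example))  # {'SPEAKER_00': 3.5, 'SPEAKER_01': 1.5}
```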
Emotion Detection
/api/emotion: Wav2Vec2 4-class emotion classifier
A Wav2Vec2 model with a pooling + classification head outputs probabilities for happy, angry, neutral, and sad. Prosodic features (pitch, energy, rate) are captured implicitly by the Transformer encoder from raw audio. Short clips (3–10 seconds) give the most reliable results.
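The step from the classification head's raw outputs to the probability distribution returned by the endpoint is a softmax. The sketch below shows that conversion with the standard library; the class order in EMOTIONS and the function name emotion_scores are assumptions, since the guide does not state the index-to-label mapping.

```python
import math

# Assumption: label order of the 4-class head; not specified in this guide.
EMOTIONS = ("happy", "angry", "neutral", "sad")


def emotion_scores(logits):
    """Turn four raw logits into an { emotion, scores } dictionary."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scores = {label: e / total for label, e in zip(EMOTIONS, exps)}
    return {"emotion": max(scores, key=scores.get), "scores": scores}
```

Returning the full distribution rather than only the top label lets the interface show how close the runner-up classes were.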
Summarize
/api/summarize: TF-IDF extractive / mT5 abstractive
Two modes: extractive selects the highest-scoring sentences from the input without rewriting (fast, GPU-free); abstractive generates a condensed paraphrase using a seq-to-seq model. The recommended workflow is Transcribe → copy transcript → Summarize for speech-to-note automation.
Voice Chat + Search
MediaRecorder → /api/transcribe + /api/search/add
The browser MediaRecorder API captures voice messages as WebM/Opus blobs, which are sent to /api/transcribe. The resulting transcript is displayed in the chat bubble and simultaneously indexed via /api/search/add. The inline search bar queries the accumulated index and highlights matches in the message history.
Why Arabic ASR is Hard
The linguistic and technical challenges this system addresses
Arabic is a root-and-pattern language. Grammatical information (tense, person, number, gender, definiteness) is encoded as inflectional patterns applied to three- or four-letter roots, producing a vast number of unique surface forms. A single word like وسيكتبونها (wa-sa-yaktubu-na-hā, "and they will write it") encodes five pieces of information in one token. This dramatically inflates the effective vocabulary, making any statistical model harder to train compared to analytic languages like English.
Diacritics typically omitted
Written Arabic omits short vowel marks (ḥarakāt) in most everyday text. The word كتب is ambiguous: kataba (he wrote), kutub (books), kutiba (it was written). Models must infer the correct reading entirely from context, a challenge that compounds with the morphological ambiguity above.
Diglossia and dialectal variation
The Arabic-speaking world spans 22 countries whose spoken dialects differ substantially in phonology, lexicon, and grammar. Egyptian, Levantine, Gulf, and Maghrebi Arabic are all linguistically Arabic, yet in some respects they differ from one another as much as Portuguese differs from Spanish. Most labelled datasets focus on Modern Standard Arabic (MSA), which is the formal written register but rarely spoken in casual conversation, creating a systematic domain mismatch for real-world deployment.
Data scarcity
As of 2025 the best publicly available labelled Arabic speech corpus (Mozilla Common Voice Arabic) contains roughly 25 hours of validated recordings. English ASR models are trained on datasets that are 4 to 5 orders of magnitude larger. This scarcity is the primary reason transfer learning from massively multilingual models (Whisper, XLSR) dramatically outperforms training-from-scratch approaches on limited data, and is the key academic motivation for including both the custom CNN+LSTM and the pre-trained models in this project.
Whisper's strength comes from 680,000 hours of multilingual web audio; Arabic phonology was learned implicitly from a huge variety of sources, giving it robustness that a 25-hour fine-tune alone could never achieve.
Training Notebook
How to run 02_cnn_lstm_training.ipynb end-to-end
The notebook at notebooks/02_cnn_lstm_training.ipynb is a self-contained walkthrough of the custom CNN+LSTM model, from raw dataset to trained checkpoint with evaluation metrics. It is designed to run in Google Colab (free T4 GPU) or any local Jupyter environment with CUDA available.
Environment setup
The first cells install everything needed and mount Google Drive if running in Colab. No manual package management is required beyond running the cells in order.
# If running locally, activate your virtual environment first:
source venv/bin/activate       # macOS/Linux
venv\Scripts\activate          # Windows
# Then launch Jupyter:
jupyter lab notebooks/02_cnn_lstm_training.ipynb
# Or open in VS Code: right-click → Open With → Jupyter Notebook
Cell-by-cell structure
| Cell group | What it does |
|---|---|
| 1: Imports & Config | Installs librosa, torch, torchaudio. Defines SAMPLE_RATE=16000, N_MFCC=128, BATCH_SIZE=16. |
| 2: Dataset Loading | Downloads Mozilla Common Voice Arabic (dataset ID cmn2g7uu701fqo1072r5na25l) via the MDC API using MDC_API_KEY from Colab Secrets. Builds train/val/test splits (80/10/10). |
| 3: Feature Extraction | Converts every audio clip to a normalised MFCC tensor. Applies SpecAugment augmentation online. |
| 4: Model Definition | Defines CnnLstmASR, 2× Conv2D → reshape → 2× BiLSTM → Linear(vocab). Prints parameter count. |
| 5: Training Loop | AdamW + OneCycleLR. CTC loss. Saves best checkpoint to checkpoints/best_model.pt on val-WER improvement. |
| 6: Inference Test | Loads best checkpoint. Runs greedy CTC decoding on 5 test samples. Prints hypothesis vs. reference. |
| 7: Metrics | Computes WER and CER over the full test split. Plots a confusion matrix over the top-20 most common characters. |
| 8: Export | Exports the trained model to ONNX for fast CPU inference via backend/main.py. |
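The 80/10/10 split built in cell group 2 can be reproduced with a seeded shuffle so that every run sees the same partitions. The sketch below is one simple way to do it with the standard library; the function name split_dataset and the seed value are assumptions, not the notebook's actual code.

```python
import random


def split_dataset(items, train=0.8, val=0.1, seed=42):
    """Shuffle deterministically and cut into train / val / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # private RNG, global state untouched
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains becomes the test set
    )
```

Fixing the seed matters for evaluation: WER figures from different training runs are only comparable if the test split is identical each time.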
Expected runtime
On a Colab T4 GPU, one full epoch over the Common Voice Arabic validated set (≈25h audio) takes roughly 35–45 minutes. The notebook defaults to 5 epochs for a quick demo run. Set MAX_EPOCHS = 30 and enable early stopping for production-quality training. A pre-trained checkpoint is included in checkpoints/ so the notebook can also be run in evaluation-only mode by skipping cells 4–5.
Connecting the checkpoint to the backend
Once training is complete, copy the saved checkpoint path into configs/config.yaml under the cnn_lstm.checkpoint key. The FastAPI backend reads this config at startup and loads the model automatically; no code changes required.
# configs/config.yaml
models:
  cnn_lstm:
    checkpoint: "checkpoints/best_model.pt"
    vocab_size: 50
    n_mfcc: 128
    hidden_size: 256
    num_layers: 2
If you want to skip training entirely and just observe the inference pipeline, no extra step is needed: the backend falls back to Whisper automatically when the CNN+LSTM checkpoint is missing. Set the model selector to "CNN+LSTM" only after training completes.
Deliverables
What is included, where to find it, and how to verify each item
Every deliverable is present in the repository. The table below maps each item to its exact location. The demo interface (deliverable 6) can be started in under two minutes; see the Local Setup tab for the full command sequence.
| Deliverable | Location | Notes |
|---|---|---|
| Source code | src/ + backend/ + frontend/src/ | Python ML modules in src/, FastAPI server in backend/, Next.js UI in frontend/ |
| Dataset description | deliverables/02_dataset_description.md | Mozilla Common Voice Arabic + 4 supporting datasets, full statistics |
| System architecture | deliverables/03_system_architecture.md | End-to-end pipeline diagram, component breakdown, API surface |
| Experiments | deliverables/04_experiments.md | CNN+LSTM training runs, hyperparameter sweeps, loss curves |
| Evaluation results | deliverables/05_evaluation_results.md | WER/CER per model, benchmark comparison, confusion analysis |
| Demo interface | frontend/ (Next.js 16) | 7-page app: Transcribe, Search, Speakers, Emotion, Summarize, Chat, Guide |
| Bonus: Speaker Diarization | src/speaker_diarization.py | PyAnnote pipeline, ECAPA-TDNN d-vectors, spectral clustering |
| Bonus: Emotion Detection | src/emotion_detection.py | Wav2Vec2 fine-tune, 4 classes, probability distribution output |
| Bonus: Summarization | src/summarizer.py | Extractive TF-IDF + abstractive mT5 modes |
| Bonus: Voice Chat Messenger | frontend/src/app/chat/ | MediaRecorder → transcribe → index → searchable history |
Verifying the demo runs correctly
With the backend running on port 8000 and the frontend on port 3000, the indicator dot in the side-nav will turn green. The same status is reflected as ONLINE in the home screen header. If the dot is red, the most common causes are: backend not started, a missing Python dependency, or a firewall blocking the port.
Running the full pipeline end-to-end
# Terminal 1: Backend
cd arabic-asr-project
source venv/bin/activate
python backend/main.py
# → Uvicorn running on http://0.0.0.0:8000
# Terminal 2: Frontend
cd arabic-asr-project/frontend
npm run dev
# → Next.js ready on http://localhost:3000
# Verify API is healthy:
curl http://localhost:8000/api/health
# → {"status":"ok","models_loaded":["whisper","wav2vec2","cnn_lstm"]}
Task coverage at a glance
| Required Task | Method / Model | Evaluation |
|---|---|---|
| Speech-to-Text (CNN+LSTM from scratch) | 2D CNN → BiLSTM → CTC | WER + CER, notebook cell 7 |
| Speech-to-Text (Whisper API cloud) | OpenAI cloud, latest large model | WER ≈ 8%, instant on cloud GPU |
| Speech-to-Text (pre-trained Whisper) | small (244M, enc-dec Transformer) | WER ≈ 12%, benchmark table |
| Speech-to-Text (pre-trained Wav2Vec2) | XLSR-53 fine-tune, CTC | WER ≈ 18%, benchmark table |
| WER metric | src/evaluate.py | Formula + code, evaluation section |
| Voice notes search engine | FAISS + sentence-transformers | Semantic + keyword modes, search module |
| Speaker Identification (advanced) | PyAnnote 3.1, ECAPA-TDNN | Diarization Error Rate (DER) |
| Emotion Detection (advanced) | Wav2Vec2 4-class classifier | Per-class accuracy + confusion matrix |
| Summarization (advanced) | TF-IDF extractive + mT5 | ROUGE-L score |
| Demo Interface | Next.js 16 + FastAPI | Live at localhost:3000 |
The recommended verification order is: start backend → confirm green dot → upload a short Arabic WAV file to Transcribe → copy the result to Summarize → archive it → run a Search query. This single flow exercises the three compulsory tasks (ASR, transcript, search) and two optional ones (summarize + index) in under 90 seconds.