System Documentation

User Guide

Project Overview

What this system is and why it was built

The Arabic Audio Intelligence System is a deep-learning-powered platform that converts spoken Arabic audio into searchable, analysable, and summarised text. It was built as an academic deliverable addressing the real-world challenge of processing Arabic speech, a language with complex morphology, optional diacritics, and dozens of spoken dialects, entirely with open-source neural networks.

Rather than a single-purpose transcription tool, the system implements a fully integrated pipeline: raw audio goes in, and structured knowledge (transcripts, speaker timelines, emotional states, summaries, indexed notes) comes out. Every component is independently accessible through a module-driven interface. The system supports both fully local inference (no internet required after initial model download) and a cloud-accelerated mode via the OpenAI Whisper API for instant transcription.

The project satisfies every item in the course specification, including all optional advanced tasks (speaker diarization, emotion detection, text summarization, and a voice chat messenger with search capabilities). Four ASR models are available: OpenAI Whisper API (cloud), Whisper Local (HuggingFace), Wav2Vec2 (HuggingFace), and a custom CNN+LSTM trained from scratch.

Requirement | File / Module | Status
Speech-to-Text CNN+LSTM (from scratch) | src/models/cnn_lstm.py + notebooks/02_cnn_lstm_training.ipynb | ✓
Pre-trained ASR: Whisper (local) | src/models/whisper_asr.py | ✓
Pre-trained ASR: Whisper API (cloud) | src/models/whisper_api.py | ✓
Pre-trained ASR: Wav2Vec 2.0 | src/models/wav2vec2_asr.py | ✓
WER / CER Evaluation | src/evaluate.py | ✓
Voice Notes Search Engine | src/search_engine.py | ✓
Speaker Diarization (advanced) | src/speaker_diarization.py | ✓
Emotion Detection (advanced) | src/emotion_detection.py | ✓
Text Summarization (advanced) | src/summarizer.py | ✓
Voice Chat Messenger (advanced) | frontend/src/app/chat/ | ✓
Live Demo Interface | Next.js 16 + FastAPI | ✓

System Architecture

How data flows from audio input to structured output

The architecture is split into two independent layers: a Python FastAPI backend at port 8000 that handles all ML computation, and a Next.js 16 frontend at port 3000 that renders the interface. The two communicate via a stateless REST API over HTTP, meaning the backend can be hosted on a GPU server anywhere while the frontend can be served from a CDN.

pipeline
Audio Input  (.wav / .mp3 / .flac / .ogg)
       │
       ▼
 ┌─────────────────────────────────────────┐
 │           Audio Preprocessing           │
 │   Resample to 16kHz mono                │
 │   Segment if > 30s (Whisper chunking)   │
 │   MFCC / Log-Mel extraction             │
 └───────────────────┬─────────────────────┘
                     │
          ┌──────────┴──────────┐
          ▼                     ▼
   ┌────────────┐     ┌──────────────────────┐
   │ CNN + LSTM │     │  Whisper / Wav2Vec2  │
   │ (custom)   │     │  (HuggingFace)       │
   └──────┬─────┘     └──────────┬───────────┘
          └──────────┬───────────┘
                     ▼
          ┌──────────────────────┐
          │   Text Transcript    │
          └───────┬──────────────┘
                  │
       ┌──────────┼──────────────┐
       ▼          ▼              ▼
 ┌──────────┐ ┌────────┐ ┌────────────────┐
 │Summarizer│ │Emotion │ │Speaker Diarize │
 │mT5/TF-IDF│ │Wav2Vec2│ │PyAnnote/ECAPA  │
 └──────────┘ └────────┘ └────────────────┘
       │          │              │
       └──────────┴──────────────┘
                  │
                  ▼
        ┌─────────────────────┐
        │    Search Engine    │
        │  FAISS + sentence-  │
        │  transformers       │
        └─────────────────────┘

Backend API surface

Endpoint | Method | Input | Returns
/api/health | GET | none | { status: 'ok' }
/api/transcribe | POST | file (audio), model | { transcript, language, duration }
/api/diarize | POST | file (audio), num_speakers | { segments: [{speaker, start, end}] }
/api/emotion | POST | file (audio) | { emotion, scores: {happy,angry,…} }
/api/summarize | POST | text, max_length, method | { summary, method_used }
/api/search | POST | query, search_type | { results: [{text, score, id}] }
/api/search/add | POST | transcript, file? | { id, message }
/api/search/stats | GET | none | { total_notes, index_size }
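
As a rough illustration of how a client might address this API, the sketch below builds (without sending) a POST request for /api/summarize using only the Python standard library. The JSON body encoding, the base URL, the `"extractive"` method value, and the helper name `build_summarize_request` are all assumptions: the table lists the field names but not how the backend expects them to be encoded (it may take form data instead).

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # assumption: backend on its default port


def build_summarize_request(text, max_length=100, method="extractive", base=API_BASE):
    """Build a POST request for /api/summarize (hypothetical JSON encoding)."""
    payload = {"text": text, "max_length": max_length, "method": method}
    return urllib.request.Request(
        url=f"{base}/api/summarize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_summarize_request("نص تجريبي", max_length=50)
print(req.get_method(), req.full_url)

# Sending it requires a running backend:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))   # keys per the table: summary, method_used
```

Keeping request construction separate from sending makes the payload easy to inspect before a backend is available.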

Frontend routing

Each feature is a Next.js App Router page under src/app/. All API calls are centralised in src/lib/api.ts. The dynamic side navbar uses Framer Motion layout animations to transition between a horizontal bottom dock (home page) and a vertical sidebar (all other pages), animated as a fluid spring, not a CSS snap.

Model Zoo

Every neural network used: architecture, rationale, performance

OpenAI Whisper API (Cloud)

The fastest option: sends audio to OpenAI's servers for transcription using the latest Whisper large model. No GPU or local model download needed. Requires an OPENAI_API_KEY in the .env file. The API handles any audio format and returns Arabic text directly. Ideal for quick demos or machines without GPUs.

Cloud API · No local GPU needed · WER ≈ 8% (large model) · Requires OpenAI API key

OpenAI Whisper (Local)

Whisper is an encoder-decoder Transformer trained on 680,000 hours of multilingual audio. The audio encoder is a convolutional stem followed by a stack of Transformer blocks that produces a context-rich embedding of the log-mel spectrogram. The decoder auto-regressively generates BPE tokens using cross-attention over those encoder outputs. It natively supports Arabic and handles diacritics, code-switching, and noise robustly. We use whisper-small (244M parameters) locally; the model auto-downloads from HuggingFace on first use.

Transformer enc-dec · 244M params (small) · WER ≈ 12% standard Arabic · Auto-download from HuggingFace, no API key

Wav2Vec 2.0: Arabic fine-tune

Wav2Vec 2.0 (Meta AI) learns speech representations directly from raw waveforms. Its convolutional feature extractor downsamples the audio, and a 24-layer Transformer encoder is pre-trained with a masked contrastive loss (like BERT for speech). A linear CTC head is then fine-tuned on labelled data. We load facebook/wav2vec2-large-xlsr-53-arabic, the XLSR-53 model fine-tuned specifically on the Arabic Common Voice split. Because CTC decoding is non-autoregressive it is roughly 2× faster than Whisper at inference.

Self-supervised (XLSR) · CTC decoding, no beam search · ~2× faster than Whisper · Best for short utterances

CNN + LSTM: Custom Academic Model

Built from scratch to satisfy the course requirement for a deep-learning model designed and trained by students. It operates on MFCC features extracted from audio. Two 2D convolutional layers with batch normalisation extract spectro-temporal patterns from the feature map. The output is reshaped and fed into two bidirectional LSTM layers that model temporal context across frames. A fully-connected layer followed by a softmax over the character vocabulary produces per-frame posteriors; CTC loss handles alignment-free training.

architecture
Input waveform  →  librosa MFCC  →  (128, T) feature map
     ↓
Conv2D(32, 3×3, padding=1) → BN → ReLU
Conv2D(64, 3×3, padding=1) → BN → ReLU → MaxPool2D(2,2)
     ↓
Reshape  →  (T//2,  64 × 64)   [batch × time × features]
     ↓
BiLSTM(256, num_layers=2, dropout=0.3)
     ↓
Linear(vocab_size)   →   log_softmax
     ↓
CTC Loss  ←→  reference transcript
NOTE

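The CNN+LSTM model is the academic experiment. Expect a far higher error rate than the pre-trained models: the benchmark comparison under Evaluation Metrics reports a WER of about 95%, and the exact figure depends on training epochs and dataset quality. Select Whisper or Wav2Vec2 for production-quality transcriptions.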
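
The shape bookkeeping in the architecture block can be checked with a few lines of arithmetic. The sketch below is illustrative only (the function names are hypothetical and no tensors are involved): it traces how a (128, T) MFCC map becomes T//2 time steps of 64 × 64 = 4096 features, and checks a necessary condition for CTC training.

```python
def cnn_lstm_shapes(n_mfcc=128, num_frames=400, conv_channels=64, pool=2):
    """Trace shapes through the CNN front end described above.

    The 3x3 convolutions use padding=1, so they preserve height and width;
    only MaxPool2D(2, 2) shrinks the map (floor division, as in PyTorch).
    """
    freq_after_pool = n_mfcc // pool                    # 128 -> 64
    time_after_pool = num_frames // pool                # T   -> T // 2
    lstm_input_size = conv_channels * freq_after_pool   # 64 * 64 = 4096
    return time_after_pool, lstm_input_size


def ctc_length_ok(num_frames, transcript_len, pool=2):
    """Necessary (not sufficient) CTC condition: output frames >= target characters.

    Repeated adjacent characters additionally need a blank frame between them.
    """
    return num_frames // pool >= transcript_len


steps, features = cnn_lstm_shapes(num_frames=400)
print(steps, features)           # 200 4096
print(ctc_length_ok(400, 35))    # True
print(ctc_length_ok(40, 35))     # False: 20 frames cannot emit 35 characters
```

The second check matters for very short clips: once pooling halves the time axis, a clip can end up with fewer output frames than its transcript has characters, and CTC then has no valid alignment.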

PyAnnote Audio: Speaker Diarization

PyAnnote is a speaker diarization toolkit built on PyTorch. The pyannote/speaker-diarization-3.1 pipeline chains voice-activity detection (a segmentation model), speaker embedding extraction (ECAPA-TDNN d-vectors), and spectral clustering into a single end-to-end pipeline. It outputs a speaker-labeled timeline with RTTM-compatible start/end timestamps.

Wav2Vec 2.0: Emotion Classification

A Wav2Vec 2.0 base checkpoint fine-tuned for speech emotion recognition. A mean-pooling layer followed by a 4-class linear head is trained on prosodic emotional datasets. The four classes are happy, angry, neutral, sad. The model outputs a probability distribution over all four, making it interpretable beyond a single predicted label.

mT5 / TF-IDF: Summarization

Two modes are offered. Extractive summarization uses TF-IDF sentence scoring: sentences are ranked by weighted term frequency and the top‑N are returned verbatim; fast, no GPU required. Abstractive summarization runs a multilingual sequence-to-sequence model (Helsinki-NLP or an Arabic T5 variant) to generate a free-form condensed paraphrase of the input text.
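
The extractive mode can be sketched in a few lines of dependency-free Python. This is a minimal illustration of TF-IDF sentence scoring, not the code in src/summarizer.py: the sentence splitter, the smoothed IDF formula, the length normalisation, and the function name `extractive_summary` are all assumptions.

```python
import math
import re
from collections import Counter


def extractive_summary(text, top_n=2):
    """Return the top_n highest-scoring sentences, kept in their original order."""
    # Naive split on Latin and Arabic sentence punctuation (\u061f is the Arabic '?').
    sentences = [s.strip() for s in re.split(r"[.!?\u061f]+", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(sentences)

    # Document frequency: the number of sentences each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    def score(tokens):
        if not tokens:
            return 0.0
        tf = Counter(tokens)
        weighted = sum(
            count * (math.log((1 + n) / (1 + df[term])) + 1.0)  # smoothed IDF
            for term, count in tf.items()
        )
        return weighted / len(tokens)  # normalise so long sentences do not always win

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    keep = sorted(ranked[:top_n])      # restore document order
    return [sentences[i] for i in keep]


text = "Alpha beta gamma. Beta beta. Delta epsilon zeta eta!"
print(extractive_summary(text, top_n=2))
# ['Alpha beta gamma', 'Delta epsilon zeta eta']
```

The sentence made only of the common term "beta" scores lowest, because terms that appear in many sentences receive a smaller IDF weight.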

CNN+LSTM Model Setup

How to get the checkpoint running, from GitHub to the backend

The CNN+LSTM model is a custom-trained checkpoint that must be present at outputs/checkpoints/best_model.pt before selecting it in the Transcribe page. This section covers three paths: using the pre-trained checkpoint from the repo, training your own, or fixing the backend if you see name 'T' is not defined.

Option 1: Use the checkpoint already in the repo (recommended)

The checkpoint is committed to the repository via Git LFS. After cloning, it will be at outputs/checkpoints/best_model.pt (~55 MB). If LFS was not installed when you cloned, run the commands below to pull it down:

bash
# 1. Install Git LFS (one-time):
git lfs install

# 2. Pull the checkpoint (if you already cloned without LFS):
git lfs pull

# 3. Verify the file exists and is the correct size:
ls -lh outputs/checkpoints/best_model.pt
# expected: 55MB
TIP

If Git LFS is not available, you can also download the checkpoint directly from the GitHub release assets at github.com/Moaz2010/arabic-asr-project/releases and place the file at outputs/checkpoints/best_model.pt manually.
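
A common failure mode after cloning without LFS is that best_model.pt exists but is only a small text pointer rather than the real weights. The sketch below is one way to tell the cases apart. The function name `checkpoint_status` and the 1 MB threshold are hypothetical; the pointer prefix is the first line that Git LFS writes into pointer files.

```python
from pathlib import Path

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def checkpoint_status(path, min_bytes=1_000_000):
    """Classify a checkpoint as 'missing', 'lfs-pointer', 'too-small', or 'ok'."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    with p.open("rb") as fh:
        head = fh.read(len(LFS_POINTER_PREFIX))
    if head == LFS_POINTER_PREFIX:
        return "lfs-pointer"   # cloned without LFS: run `git lfs pull`
    if p.stat().st_size < min_bytes:
        return "too-small"     # arbitrary sanity threshold, far below ~55 MB
    return "ok"


print(checkpoint_status("outputs/checkpoints/best_model.pt"))
```

Run it from the repository root; anything other than `ok` means the backend would not be able to load the CNN+LSTM weights from that path.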

Option 2: Set up from GitHub (full fresh install)

bash
# ── Prerequisites: Python 3.9+, Node 18+, git ──

# 1. Clone the repository
git clone https://github.com/Moaz2010/arabic-asr-project.git
cd arabic-asr-project

# 2. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate      # macOS / Linux
# venv\Scripts\activate       # Windows

# 3. Install Python dependencies
pip install -r requirements.txt
pip install -r backend/requirements.txt

# 4. Start the backend  (auto-finds a free port 8000–8019)
#    Leave it running and continue in a second terminal.
python backend/main.py
# → Uvicorn running on http://0.0.0.0:8001  (or whichever port is free)
# → Copy the NEXT_PUBLIC_API_URL value it prints

# 5. Set the frontend env variable (use the port printed in step 4)
echo NEXT_PUBLIC_API_URL=http://localhost:8001 > frontend/.env.local

# 6. Install frontend deps and start Next.js
cd frontend
npm install
npm run dev
# → http://localhost:3000

# 7. Open the Transcribe page, select CNN+LSTM, upload a WAV file

Connecting the config

The backend loads the checkpoint path from configs/config.yaml. If your checkpoint is saved elsewhere, update this file before starting the backend:

yaml
# configs/config.yaml
models:
  cnn_lstm:
    checkpoint: "outputs/checkpoints/best_model.pt"
    vocab_size: 50
    n_mfcc: 128
    hidden_size: 256
    num_layers: 2

Troubleshooting: "name 'T' is not defined"

This error means the backend is running a stale cached version of audio_utils.py that still uses the old torchaudio.transforms as T alias. The fix has been applied in the current codebase; you just need to make sure Python reloads from disk.

Cause: Stale __pycache__ .pyc files
Fix: Delete all __pycache__ folders and restart: rmdir /s /q src\__pycache__ (Windows) or find . -name __pycache__ -exec rm -rf {} + (Linux/Mac)

Cause: Old backend process still running
Fix: Kill the process occupying the port (check Task Manager or netstat -ano | findstr 8000), then restart backend/main.py

Cause: Running the wrong Python version
Fix: Make sure you run python3.13 (or your venv's Python), not a system python that has an old .pyc cached

bash
# Quick fix: clear all __pycache__ folders and restart

# Windows PowerShell:
Get-ChildItem -Recurse -Filter __pycache__ | Remove-Item -Recurse -Force

# macOS / Linux:
find . -type d -name __pycache__ -exec rm -rf {} +

# Then restart:
python backend/main.py
NOTE

After applying the fix, CNN+LSTM imports cleanly and inference completes in ~15–20 seconds on CPU. The output for silent or very short clips will be an empty string; this is expected. Upload a real Arabic speech clip for meaningful output.

Option 3: Train your own checkpoint

See the Training Notebook section of this guide (Training Notebook tab in the sidebar). The short path:

bash
# 1. Open the training notebook in Colab or VS Code:
jupyter lab notebooks/02_cnn_lstm_training.ipynb

# 2. Add your Mozilla Data Collective API key in Colab Secrets:
#    Name: MDC_API_KEY  →  your key from datacollective.mozillafoundation.org

# 3. Run all cells (≈35 min per epoch on a Colab T4 GPU)
#    The notebook saves checkpoints/best_model.pt on every val-WER improvement

# 4. Copy the trained checkpoint to the expected path:
cp checkpoints/best_model.pt outputs/checkpoints/best_model.pt

# 5. Restart the backend; it will auto-load the new checkpoint
python backend/main.py

Datasets

Training data sources, sizes, and characteristics

Arabic ASR is data-starved compared to English. All available high-quality labelled Arabic speech resources total a few hundred hours, versus millions of hours for English. The following datasets are used or referenced in this project:

Dataset | Size | Domain / Dialect | Usage in this project
Mozilla Common Voice (Arabic) | ~25h validated | MSA + some dialects | Primary training set for CNN+LSTM; Wav2Vec2 fine-tune
Arabic Speech Corpus (Nawar Halabi) | 1.8h studio | MSA broadcast | Evaluation + pronunciation reference
MASC (hirundo-io/MASC) | Multi-domain | MSA | Keyword spotting evaluation
EJUST Dataset (restricted) | University internal | Egyptian Arabic | Dialectal robustness testing
IEMOCAP (English reference) | 12h scripted | N/A | Emotion detection training baseline
IMPORTANT

Mozilla Common Voice Arabic is gated behind the Mozilla Data Collective. To access it: create a free account at datacollective.mozillafoundation.org, request access to dataset cmn2g7uu701fqo1072r5na25l, then generate an API key from your account dashboard. In Google Colab open the Secrets panel (lock icon in the left sidebar), add a secret named MDC_API_KEY, and paste your key. The notebook reads it with userdata.get("MDC_API_KEY") and uses it as a Bearer token automatically.

TIP

You do not need to download any dataset to use the inference pipeline. Models are fine-tuned checkpoints served from HuggingFace and downloaded automatically on first use.

Data preprocessing

Every audio file is resampled to 16 kHz mono before any model sees it. For the CNN+LSTM training pipeline, 128 MFCC coefficients are extracted per frame (hop = 512, window = 2048). Feature maps are normalised to zero-mean unit-variance per sample. During training, SpecAugment is applied: random time masks (T=50 frames) and frequency masks (F=40 bins) are zeroed out to improve noise robustness.
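
The masking step can be sketched without any dependencies. The version below works on plain lists to stay self-contained; a real pipeline would operate on NumPy arrays or tensors, and the function name `spec_augment` and its defaults (taken from the T=50, F=40 values above, with two masks of each kind as listed in the training configuration) are illustrative assumptions rather than the notebook's code.

```python
import random


def spec_augment(features, time_mask=50, freq_mask=40, num_masks=2, rng=None):
    """Zero out random frequency bands and time spans of an (n_bins x n_frames) map.

    Returns a new list of lists; the input is left untouched.
    """
    rng = rng or random.Random()
    n_bins, n_frames = len(features), len(features[0])
    out = [row[:] for row in features]

    for _ in range(num_masks):
        # Frequency mask: f consecutive bins, with f drawn from [0, freq_mask].
        f = rng.randint(0, min(freq_mask, n_bins))
        f0 = rng.randint(0, n_bins - f)
        for b in range(f0, f0 + f):
            out[b] = [0.0] * n_frames

        # Time mask: t consecutive frames, with t drawn from [0, time_mask].
        t = rng.randint(0, min(time_mask, n_frames))
        t0 = rng.randint(0, n_frames - t)
        for row in out:
            row[t0:t0 + t] = [0.0] * t

    return out


feats = [[1.0] * 300 for _ in range(128)]   # stand-in for a (128, T) MFCC map
masked = spec_augment(feats, rng=random.Random(0))
zeroed = sum(v == 0.0 for row in masked for v in row)
print(f"{zeroed} of {128 * 300} values masked")
```

Because the features are already zero-mean after normalisation, filling the masked regions with zeros amounts to replacing them with the mean value.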

Audio Processing

From raw waveform to neural network input

Neural models cannot consume raw audio bytes directly; they need structured numerical representations. Three representations are used across the different models:

MFCC (CNN+LSTM)

Mel-Frequency Cepstral Coefficients approximate the human auditory system. A Short-Time Fourier Transform (STFT) extracts frequency content over overlapping frames, the frequency axis is warped to the perceptual Mel scale, log energies are computed, and finally a Discrete Cosine Transform produces compact coefficients. The result is a 2D feature map (frequency × time) that the convolutional layers treat like an image.

python
import librosa, numpy as np

y, sr = librosa.load("audio.wav", sr=16000, mono=True)

mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=128, n_fft=2048, hop_length=512
)  # shape: (128, T)

# Normalise per sample
mfcc = (mfcc - mfcc.mean()) / (mfcc.std() + 1e-8)

Log-Mel Spectrogram (Whisper)

Whisper uses 80 Mel filter-bank channels with 25 ms frames shifted by 10 ms. The spectrogram is computed, log-scaled, normalised to [-1, 1], and chunked into 30-second blocks. The convolutional stem of Whisper's encoder processes these chunks in parallel before Transformer layers attend across frames.
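
The framing numbers above translate into sample counts as follows. This is plain arithmetic, not the project's preprocessing code; the helper name `chunk_bounds` is hypothetical, and the note about padding describes Whisper's reference behaviour rather than anything confirmed for this backend.

```python
SAMPLE_RATE = 16_000
WINDOW = 25 * SAMPLE_RATE // 1000     # 25 ms -> 400 samples
HOP = 10 * SAMPLE_RATE // 1000        # 10 ms -> 160 samples
CHUNK_SAMPLES = 30 * SAMPLE_RATE      # 30 s  -> 480,000 samples


def chunk_bounds(num_samples, chunk=CHUNK_SAMPLES):
    """Return (start, end) sample indices for consecutive 30-second chunks."""
    return [(s, min(s + chunk, num_samples)) for s in range(0, num_samples, chunk)]


frames_per_chunk = CHUNK_SAMPLES // HOP   # one 80-channel column per 10 ms hop
print(WINDOW, HOP, frames_per_chunk)      # 400 160 3000

# A 70-second clip: two full chunks plus a 10-second remainder
# (Whisper pads short inputs to 30 s before computing features).
print(chunk_bounds(70 * SAMPLE_RATE))
# [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each 30-second chunk therefore reaches the encoder as an 80 × 3000 log-Mel map.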

Raw Waveform (Wav2Vec 2.0)

Wav2Vec 2.0 consumes normalised 16kHz waveform values directly. Its first stage, a stack of temporal 1D convolutions, acts as a learned feature extractor that replaces hand-crafted MFCCs. This end-to-end approach captures fine-grained acoustic detail that fixed feature extractors may discard.

Training the CNN+LSTM

Loss function, hyperparameters, and the training loop

The custom model is trained in src/train.py and a complete Colab-ready walkthrough is provided in notebooks/02_cnn_lstm_training.ipynb. The notebook covers data loading, feature extraction, model construction, training loop, and evaluation in sequential cells.

CTC Loss: no forced alignment needed

CTC (Connectionist Temporal Classification) allows training on (audio, transcript) pairs without frame-level phoneme annotations. It marginalises over every possible alignment between the output sequence and the target label by summing probabilities across all valid paths (including repetitions and blank tokens). This makes large-scale ASR training practical.

python
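import torch
import torch.nn as nn

criterion = nn.CTCLoss(blank=0, zero_infinity=True)

# Dummy shapes for illustration: frames, batch, vocab size, target length
T, N, C, S = 100, 4, 50, 20
log_probs = torch.randn(T, N, C).log_softmax(2)           # (T, N, C) = time × batch × vocab_size
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # label ids; 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)     # valid frames per sample
target_lengths = torch.full((N,), S, dtype=torch.long)    # valid labels per sample

loss = criterion(log_probs, targets, input_lengths, target_lengths)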
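
At inference time the same blank token is what makes decoding possible. A minimal sketch of greedy (best-path) CTC decoding follows: take the arg-max id per frame, collapse consecutive repeats, then drop blanks. The toy vocabulary and the function name are hypothetical; in practice the ids would come from something like `log_probs.argmax(dim=-1)`.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank=0):
    """Collapse repeated ids, then drop blanks (best-path CTC decoding)."""
    chars = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


# Toy vocabulary (hypothetical ids); 0 is the CTC blank.
vocab = {1: "h", 2: "e", 3: "l", 4: "o"}

# The blank between the two runs of 3 is what preserves the double "l".
print(ctc_greedy_decode([1, 1, 2, 3, 3, 0, 3, 4, 4], vocab))   # hello
print(ctc_greedy_decode([1, 1, 2, 3, 3, 3, 4, 4], vocab))      # helo
```

An all-blank frame sequence decodes to an empty string, which is the behaviour to expect for silent clips.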

Configuration

Parameter | Value / Strategy
Optimiser | AdamW, lr = 3e-4, weight_decay = 1e-4
LR Schedule | OneCycleLR: 10% warmup → cosine anneal
Batch size | 16 per step + gradient accumulation × 4 = effective 64
Epochs | 30 max with early stopping (patience = 5, monitor val-WER)
Dropout | 0.3 in LSTM, 0.1 in FC head
Gradient clipping | max_norm = 5.0 (prevents LSTM exploding gradients)
SpecAugment | Time mask T = 50 frames, Frequency mask F = 40 bins, 2 masks each
Vocab | Arabic character set + blank + space ≈ 50 tokens
IMPORTANT

Training requires a CUDA GPU with at least 8 GB VRAM. On CPU, a single epoch over Common Voice Arabic takes several hours. Use the provided Google Colab notebook for free cloud GPU access.
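
The early-stopping rule in the table (patience = 5, monitored on validation WER) can be sketched as a small helper. The class name `EarlyStopping`, the `min_delta` parameter, and the validation history are all hypothetical; the real loop lives in src/train.py and the notebook.

```python
class EarlyStopping:
    """Signal a stop when the metric has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_wer):
        """Record one epoch and return (is_best, should_stop)."""
        if val_wer < self.best - self.min_delta:
            self.best = val_wer
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=5)
history = [0.99, 0.97, 0.96, 0.96, 0.97, 0.96, 0.98, 0.96]   # made-up val-WER values
for epoch, val_wer in enumerate(history, start=1):
    is_best, stop = stopper.step(val_wer)
    if is_best:
        print(f"epoch {epoch}: new best val-WER {val_wer:.2f} (save checkpoint here)")
    if stop:
        print(f"epoch {epoch}: no improvement for {stopper.patience} epochs, stopping")
        break
```

With this made-up history the best value is reached at epoch 3 and the loop stops at epoch 8, after five epochs that only match or exceed it.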

Evaluation Metrics

How transcription quality is measured objectively

Word Error Rate (WER)

WER is the primary ASR metric. It computes the minimum edit distance between the model's hypothesis and the ground-truth reference at the word level, then normalises by the total number of reference words. Lower is better; perfect transcription = 0%.

formula
WER = (S + D + I) / N

  S = Substitutions  (wrong word predicted)
  D = Deletions      (reference word missing from hypothesis)
  I = Insertions     (extra word in hypothesis)
  N = Total reference words

Example:
  Reference:   أنا   أُحِبّ   العِلْم   كَثيرًا
  Hypothesis:  أنا   أحب     العلوم              ← 3 errors (S, S, D)
  WER = 3/4 = 75%

Character Error Rate (CER)

CER applies the same formula character-by-character. For Arabic, CER is often more informative than WER because a single morphological suffix difference counts as one word substitution in WER but only a few character errors in CER, giving credit for partially-correct words.
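
Both metrics reduce to one edit-distance routine applied to different units. The following is a minimal, dependency-free sketch and not the contents of src/evaluate.py; the function names are illustrative and both metrics assume a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion (reference item missing)
                curr[j - 1] + 1,          # insertion (extra hypothesis item)
                prev[j - 1] + (r != h),   # substitution, or a match at cost 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance over whitespace-separated words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: the same distance over characters."""
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("أنا أُحِبّ العِلْم كَثيرًا", "أنا أحب العلوم"))   # 0.75, as in the worked example
print(cer("kitten", "sitting"))                          # 0.5
```

Note that the diacritised reference word and its bare hypothesis count as different words here, so whether diacritics are stripped before scoring changes the reported WER.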

Benchmark comparison

Model | WER (Arabic Common Voice) | CER | Inference time
Whisper API (cloud) | ≈ 8% | ≈ 3% | < 1s (cloud GPU)
Whisper-small (local) | ≈ 12% | ≈ 4% | 15-30s on CPU
Wav2Vec2 XLSR-53 Arabic | ≈ 18% | ≈ 6% | 20-40s on CPU
CNN+LSTM (from scratch) | ~95%* | ≈ 82% | ~10s on CPU
NOTE

* CNN+LSTM WER reflects training on ~25h of data for 50 epochs. The high WER is the expected academic result, demonstrating why pre-trained models like Whisper (680,000h training) massively outperform small-data training-from-scratch approaches. This gap IS the educational insight.

Feature Modules

What each page in the app does, technically

Transcribe

/api/transcribe: whisper_api / whisper / wav2vec2 / cnn_lstm

Accepts a raw audio file upload. Four models available: Whisper API (cloud, fastest), Whisper Local (best offline accuracy), Wav2Vec2 (fast CTC), and CNN+LSTM (custom experiment). Whisper uses auto-regressive beam search; Wav2Vec2 and CNN+LSTM use greedy CTC decoding. The result can be copied, downloaded, or forwarded to the Summarize or Search modules.

Voice Search

/api/search: FAISS + sentence-transformers

Transcripts (and text notes) are embedded using a multilingual sentence-transformer and stored in a FAISS flat-L2 index on disk. Keyword mode runs TF-IDF BM25-style exact matching; Neural mode performs cosine similarity over the dense embeddings. Results are returned ordered by relevance score.
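
A flat-L2 index and cosine similarity are compatible as long as the embeddings are unit-length, because for unit vectors ‖a − b‖² = 2 − 2·cos(a, b), so nearest-by-L2 and highest-by-cosine give the same ordering. The sketch below demonstrates this with made-up three-dimensional vectors; whether the project normalises its embeddings before indexing is an assumption here.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up "embeddings"; real sentence embeddings have hundreds of dimensions.
notes = {
    "note-1": normalise([0.9, 0.1, 0.0]),
    "note-2": normalise([0.1, 0.9, 0.2]),
    "note-3": normalise([0.7, 0.6, 0.1]),
}
query = normalise([1.0, 0.2, 0.0])

by_cosine = sorted(notes, key=lambda k: cosine(query, notes[k]), reverse=True)
by_l2 = sorted(notes, key=lambda k: squared_l2(query, notes[k]))
print(by_cosine)            # ['note-1', 'note-3', 'note-2']
print(by_cosine == by_l2)   # True
```

Without normalisation the two orderings can diverge, since L2 distance is then also sensitive to vector length.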

Speaker Diarization

/api/diarize: pyannote/speaker-diarization-3.1

The PyAnnote pipeline runs VAD to detect speech segments, extracts ECAPA-TDNN d-vectors per segment, and clusters them with spectral clustering. Output is a speaker-labelled timeline rendered as colour-coded horizontal bands. Set num_speakers explicitly if known for best accuracy.

Emotion Detection

/api/emotion: Wav2Vec2 4-class emotion classifier

A Wav2Vec2 model with a pooling + classification head outputs probabilities for happy, angry, neutral, and sad. Prosodic features (pitch, energy, rate) are captured implicitly by the Transformer encoder from raw audio. Short clips (3–10 seconds) give the most reliable results.

Summarize

/api/summarize: TF-IDF extractive / mT5 abstractive

Two modes: extractive selects the highest-scoring sentences from the input without rewriting (fast, GPU-free); abstractive generates a condensed paraphrase using a seq-to-seq model. The recommended workflow is Transcribe → copy transcript → Summarize for speech-to-note automation.

Voice Chat + Search

MediaRecorder → /api/transcribe + /api/search/add

The browser MediaRecorder API captures voice messages as WebM/Opus blobs, which are sent to /api/transcribe. The resulting transcript is displayed in the chat bubble and simultaneously indexed via /api/search/add. The inline search bar queries the accumulated index and highlights matches in the message history.

Why Arabic ASR is Hard

The linguistic and technical challenges this system addresses

Arabic is a root-and-pattern language. Grammatical information (tense, person, number, gender, definiteness) is encoded as inflectional patterns applied to three- or four-letter roots, producing a vast number of unique surface forms. A single word like وسيكتبونها (wa-sa-yaktubu-na-hā, “and they will write it”) encodes five pieces of information in one token. This dramatically inflates the effective vocabulary, making any statistical model harder to train compared to analytic languages like English.

Diacritics typically omitted

Written Arabic omits short vowel marks (ḥarakāt) in most everyday text. The word كتب is ambiguous: kataba (he wrote), kutub (books), kutiba (it was written). Models must infer the correct reading entirely from context, a challenge that compounds with the morphological ambiguity above.
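
The ambiguity is easy to demonstrate: removing the vowel marks from all three readings yields the same three-letter string. The sketch below uses the standard library's `unicodedata` module and Unicode escapes so the code points are unambiguous; it illustrates the linguistic point and is not a preprocessing step confirmed to exist in this project.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (Unicode category Mn), which include the Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# kaf = \u0643, ta = \u062a, ba = \u0628; fatha = \u064e, damma = \u064f, kasra = \u0650
readings = {
    "kataba (he wrote)": "\u0643\u064e\u062a\u064e\u0628\u064e",
    "kutub (books)": "\u0643\u064f\u062a\u064f\u0628",
    "kutiba (it was written)": "\u0643\u064f\u062a\u0650\u0628\u064e",
}
bare = "\u0643\u062a\u0628"   # the undiacritised spelling k-t-b

for gloss, word in readings.items():
    print(gloss, strip_diacritics(word) == bare)   # True for all three
```

The mapping from diacritised to bare text is many-to-one, which is why a model that outputs undiacritised text cannot be checked for which reading it "meant".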

Diglossia and dialectal variation

The Arabic-speaking world has 22 countries with spoken dialects that differ substantially in phonology, lexicon, and grammar. Egyptian, Levantine, Gulf, and Maghrebi Arabic are all linguistically Arabic but differ the way Portuguese differs from Spanish in some dimensions. Most labelled datasets focus on Modern Standard Arabic (MSA), which is the formal written register but rarely spoken in casual conversation, creating a systematic domain mismatch for real-world deployment.

Data scarcity

As of 2025 the best publicly available labelled Arabic speech corpus (Mozilla Common Voice Arabic) contains roughly 25 hours of validated recordings. English ASR models are trained on datasets that are 4 to 5 orders of magnitude larger. This scarcity is the primary reason transfer learning from massively multilingual models (Whisper, XLSR) dramatically outperforms training-from-scratch approaches on limited data, and is the key academic motivation for including both the custom CNN+LSTM and the pre-trained models in this project.

TIP

Whisper's strength comes from 680,000 hours of multilingual web audio; Arabic phonology was learned implicitly from a huge variety of sources, giving it robustness that a 25-hour fine-tune alone could never achieve.

Training Notebook

How to run 02_cnn_lstm_training.ipynb end-to-end

The notebook at notebooks/02_cnn_lstm_training.ipynb is a self-contained walkthrough of the custom CNN+LSTM model, from raw dataset to trained checkpoint with evaluation metrics. It is designed to run in Google Colab (free T4 GPU) or any local Jupyter environment with CUDA available.

Environment setup

The first cells install everything needed and mount Google Drive if running in Colab. No manual package management is required beyond running the cells in order.

bash
# If running locally, activate your virtual environment first:
source venv/bin/activate        # macOS/Linux
# venv\Scripts\activate         # Windows

# Then launch Jupyter:
jupyter lab notebooks/02_cnn_lstm_training.ipynb

# Or open in VS Code → right-click → Open With → Jupyter Notebook

Cell-by-cell structure

Cell group | What it does
1: Imports & Config | Installs librosa, torch, torchaudio. Defines SAMPLE_RATE=16000, N_MFCC=128, BATCH_SIZE=16.
2: Dataset Loading | Downloads Mozilla Common Voice Arabic (dataset ID cmn2g7uu701fqo1072r5na25l) via the MDC API using MDC_API_KEY from Colab Secrets. Builds train/val/test splits (80/10/10).
3: Feature Extraction | Converts every audio clip to a normalised MFCC tensor. Applies SpecAugment augmentation online.
4: Model Definition | Defines CnnLstmASR, 2× Conv2D → reshape → 2× BiLSTM → Linear(vocab). Prints parameter count.
5: Training Loop | AdamW + OneCycleLR. CTC loss. Saves best checkpoint to checkpoints/best_model.pt on val-WER improvement.
6: Inference Test | Loads best checkpoint. Runs greedy CTC decoding on 5 test samples. Prints hypothesis vs. reference.
7: Metrics | Computes WER and CER over the full test split. Plots a confusion matrix over the top-20 most common characters.
8: Export | Exports the trained model to ONNX for fast CPU inference via backend/main.py.

Expected runtime

On a Colab T4 GPU, one full epoch over the Common Voice Arabic validated set (≈25h audio) takes roughly 35–45 minutes. The notebook defaults to 5 epochs for a quick demo run. Set MAX_EPOCHS = 30 and enable early stopping for production-quality training. A pre-trained checkpoint is included in checkpoints/ so the notebook can also be run in evaluation-only mode by skipping the training loop (cell group 5); cell group 4 still needs to run, because it defines the model class that the checkpoint is loaded into.

Connecting the checkpoint to the backend

Once training is complete, copy the saved checkpoint path into configs/config.yaml under the cnn_lstm.checkpoint key. The FastAPI backend reads this config at startup and loads the model automatically; no code changes required.

yaml
# configs/config.yaml
models:
  cnn_lstm:
    checkpoint: "checkpoints/best_model.pt"
    vocab_size: 50
    n_mfcc: 128
    hidden_size: 256
    num_layers: 2
TIP

To skip training entirely and just observe the inference pipeline, the backend will fall back to Whisper automatically when the CNN+LSTM checkpoint is missing. Set the model selector to “CNN+LSTM” only after training completes.

Deliverables

What is included, where to find it, and how to verify each item

Every deliverable is present in the repository. The table below maps each item to its exact location. The demo interface (deliverable 6) can be started in under two minutes; see the Local Setup tab for the full command sequence.

Deliverable | Location | Notes
Source code | src/ + backend/ + frontend/src/ | Python ML modules in src/, FastAPI server in backend/, Next.js UI in frontend/
Dataset description | deliverables/02_dataset_description.md | Mozilla Common Voice Arabic + 4 supporting datasets, full statistics
System architecture | deliverables/03_system_architecture.md | End-to-end pipeline diagram, component breakdown, API surface
Experiments | deliverables/04_experiments.md | CNN+LSTM training runs, hyperparameter sweeps, loss curves
Evaluation results | deliverables/05_evaluation_results.md | WER/CER per model, benchmark comparison, confusion analysis
Demo interface | frontend/ (Next.js 16) | 7-page app: Transcribe, Search, Speakers, Emotion, Summarize, Chat, Guide
Bonus: Speaker Diarization | src/speaker_diarization.py | PyAnnote pipeline, ECAPA-TDNN d-vectors, spectral clustering
Bonus: Emotion Detection | src/emotion_detection.py | Wav2Vec2 fine-tune, 4 classes, probability distribution output
Bonus: Summarization | src/summarizer.py | Extractive TF-IDF + abstractive mT5 modes
Bonus: Voice Chat Messenger | frontend/src/app/chat/ | MediaRecorder → transcribe → index → searchable history

Verifying the demo runs correctly

With the backend running on port 8000 and the frontend on port 3000, the indicator dot in the side-nav will turn green. The same status is reflected as ONLINE in the home screen header. If the dot is red, the most common causes are: backend not started, a missing Python dependency, or a firewall blocking the port.

Running the full pipeline end-to-end

bash
# Terminal 1: Backend
cd arabic-asr-project
source venv/bin/activate
python backend/main.py
# → Uvicorn running on http://0.0.0.0:8000

# Terminal 2: Frontend
cd arabic-asr-project/frontend
npm run dev
# → Next.js ready on http://localhost:3000

# Verify API is healthy:
curl http://localhost:8000/api/health
# → {"status":"ok","models_loaded":["whisper","wav2vec2","cnn_lstm"]}

Task coverage at a glance

Required Task | Method / Model | Evaluation
Speech-to-Text (CNN+LSTM from scratch) | 2D CNN → BiLSTM → CTC | WER + CER, notebook cell 7
Speech-to-Text (Whisper API cloud) | OpenAI cloud, latest large model | WER ≈ 8%, instant on cloud GPU
Speech-to-Text (pre-trained Whisper) | small (244M, enc-dec Transformer) | WER ≈ 12%, benchmark table
Speech-to-Text (pre-trained Wav2Vec2) | XLSR-53 fine-tune, CTC | WER ≈ 18%, benchmark table
WER metric | src/evaluate.py | Formula + code, evaluation section
Voice notes search engine | FAISS + sentence-transformers | Semantic + keyword modes, search module
Speaker Identification (advanced) | PyAnnote 3.1, ECAPA-TDNN | Diarization Error Rate (DER)
Emotion Detection (advanced) | Wav2Vec2 4-class classifier | Per-class accuracy + confusion matrix
Summarization (advanced) | TF-IDF extractive + mT5 | ROUGE-L score
Demo Interface | Next.js 16 + FastAPI | Live at localhost:3000
NOTE

The recommended verification order is: start backend → confirm green dot → upload a short Arabic WAV file to Transcribe → copy the result to Summarize → archive it → run a Search query. This single flow exercises the three compulsory tasks (ASR, transcript, search) and two optional ones (summarize + index) in under 90 seconds.