User Guide
Project Overview
What this system is and why it was built
The Arabic Audio Intelligence System is a deep-learning-powered platform that converts spoken Arabic audio into searchable, analysable, and summarised text. It was built as an academic deliverable addressing the real-world challenge of processing Arabic speech (a language with complex morphology, optional diacritics, and dozens of spoken dialects) entirely with open-source neural networks.
Rather than a single-purpose transcription tool, the system implements a fully integrated pipeline: raw audio goes in, and structured knowledge (transcripts, speaker timelines, emotional states, summaries, indexed notes) comes out. Every component is independently accessible through a module-driven interface. No external paid APIs are used; all inference is self-hosted.
The project satisfies every item in the course specification, including all optional advanced tasks (speaker diarization, emotion detection, text summarization, and a voice chat messenger with search capabilities).
| Requirement | File / Module | Status |
|---|---|---|
| Speech-to-Text CNN+LSTM (from scratch) | src/models/cnn_lstm.py + notebooks/02_cnn_lstm_training.ipynb | ✓ |
| Pre-trained ASR: Whisper | src/models/whisper_asr.py | ✓ |
| Pre-trained ASR: Wav2Vec 2.0 | src/models/wav2vec2_asr.py | ✓ |
| WER / CER Evaluation | src/evaluate.py | ✓ |
| Voice Notes Search Engine | src/search_engine.py | ✓ |
| Speaker Diarization (advanced) | src/speaker_diarization.py | ✓ |
| Emotion Detection (advanced) | src/emotion_detection.py | ✓ |
| Text Summarization (advanced) | src/summarizer.py | ✓ |
| Voice Chat Messenger (advanced) | frontend/src/app/chat/ | ✓ |
| Live Demo Interface | Next.js 16 + FastAPI | ✓ |
System Architecture
How data flows from audio input to structured output
The architecture is split into two independent layers: a Python FastAPI backend at port 8000 that handles all ML computation, and a Next.js 16 frontend at port 3000 that renders the interface. The two communicate via a stateless REST API over HTTP, meaning the backend can be hosted on a GPU server anywhere while the frontend can be served from a CDN.
Audio Input (.wav / .mp3 / .flac / .ogg)
│
▼
┌─────────────────────────────────────────┐
│ Audio Preprocessing │
│ Resample to 16kHz mono │
│ Segment if > 30s (Whisper chunking) │
│ MFCC / Log-Mel extraction │
└───────────────────┬─────────────────────┘
│
┌──────────┴──────────┐
▼ ▼
┌────────────┐ ┌──────────────────────┐
│ CNN + LSTM │ │ Whisper / Wav2Vec2 │
│ (custom) │ │ (HuggingFace) │
└──────┬─────┘ └──────────┬───────────┘
└──────────┬───────────┘
▼
┌──────────────────────┐
│ Text Transcript │
└───────┬──────────────┘
│
┌──────────┼──────────────┐
▼ ▼ ▼
┌──────────┐ ┌────────┐ ┌────────────────┐
│Summarizer│ │Emotion │ │Speaker Diarize │
│mT5/TF-IDF│ │Wav2Vec2│ │PyAnnote/ECAPA │
└──────────┘ └────────┘ └────────────────┘
│ │ │
└──────────┴──────────────┘
│
▼
┌─────────────────────┐
│ Search Engine │
│ FAISS + sentence- │
│ transformers │
└─────────────────────┘

Backend API surface
| Endpoint | Method | Input | Returns |
|---|---|---|---|
| /api/health | GET | — | { status: 'ok' } |
| /api/transcribe | POST | file (audio), model | { transcript, language, duration } |
| /api/diarize | POST | file (audio), num_speakers | { segments: [{speaker, start, end}] } |
| /api/emotion | POST | file (audio) | { emotion, scores: {happy,angry,…} } |
| /api/summarize | POST | text, max_length, method | { summary, method_used } |
| /api/search | POST | query, search_type | { results: [{text, score, id}] } |
| /api/search/add | POST | transcript, file? | { id, message } |
| /api/search/stats | GET | — | { total_notes, index_size } |
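For quick smoke tests without the frontend, any HTTP client works. A minimal Python sketch, assuming the backend runs on localhost:8000; the model field value "whisper" is an assumption based on the model names used elsewhere in this guide:

import requests

BASE = "http://localhost:8000"

# Health check: should return {"status": "ok"}
print(requests.get(f"{BASE}/api/health").json())

# Transcribe a local audio file
with open("sample.wav", "rb") as f:
    resp = requests.post(f"{BASE}/api/transcribe",
                         files={"file": f},
                         data={"model": "whisper"})
print(resp.json()["transcript"])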
Frontend routing
Each feature is a Next.js App Router page under src/app/. All API calls are centralised in src/lib/api.ts. The dynamic side navbar uses Framer Motion layout animations to transition between a horizontal bottom dock (home page) and a vertical sidebar (all other pages), animated as a fluid spring, not a CSS snap.
Model Zoo
Every neural network used: architecture, rationale, performance
OpenAI Whisper
Whisper is an encoder–decoder Transformer trained on 680,000 hours of multilingual audio. The audio encoder is a convolutional stem followed by a stack of Transformer blocks that produces a context-rich embedding of the log-mel spectrogram. The decoder auto-regressively generates BPE tokens using cross-attention over those encoder outputs. It natively supports Arabic and handles diacritics, code-switching, and noise robustly. We use whisper-medium (769M parameters) as the default; smaller variants are also available.
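As an illustration of how such a checkpoint is loaded (the project's wrapper lives in src/models/whisper_asr.py; this sketch uses the HuggingFace pipeline API and is not the project's exact code):

from transformers import pipeline

# Downloads openai/whisper-medium (~3 GB) on first use.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-medium",
    chunk_length_s=30,  # chunked decoding for audio longer than 30 s
    generate_kwargs={"language": "arabic", "task": "transcribe"},
)
print(asr("audio.wav")["text"])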
Wav2Vec 2.0: Arabic fine-tune
Wav2Vec 2.0 (Meta AI) learns speech representations directly from raw waveforms. Its convolutional feature extractor downsamples the audio, and a 24-layer Transformer encoder is pre-trained with a masked contrastive loss (like BERT for speech). A linear CTC head is then fine-tuned on labelled data. We load facebook/wav2vec2-large-xlsr-53-arabic, the XLSR-53 model fine-tuned specifically on the Arabic Common Voice split. Because CTC decoding is non-autoregressive, inference is roughly 2× faster than Whisper.
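A sketch of greedy CTC inference with this checkpoint (illustrative; the project's wrapper is src/models/wav2vec2_asr.py):

import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

name = "facebook/wav2vec2-large-xlsr-53-arabic"
processor = Wav2Vec2Processor.from_pretrained(name)
model = Wav2Vec2ForCTC.from_pretrained(name).eval()

y, _ = librosa.load("audio.wav", sr=16000, mono=True)
inputs = processor(y, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits  # (1, frames, vocab)
ids = logits.argmax(dim=-1)                     # greedy CTC path
print(processor.batch_decode(ids)[0])           # collapses repeats and blanks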
CNN + LSTM: Custom Academic Model
Built from scratch to satisfy the course requirement for a deep-learning model designed and trained by students. It operates on MFCC features extracted from audio. Two 2D convolutional layers with batch normalisation extract spectro-temporal patterns from the feature map. The output is reshaped and fed into two bidirectional LSTM layers that model temporal context across frames. A fully-connected layer followed by a softmax over the character vocabulary produces per-frame posteriors; CTC loss handles alignment-free training.
Input waveform → librosa MFCC → (128, T) feature map
↓
Conv2D(32, 3×3, padding=1) → BN → ReLU
Conv2D(64, 3×3, padding=1) → BN → ReLU → MaxPool2D(2,2)
↓
Reshape → (T//2, 64 × 64) [batch × time × features]
↓
BiLSTM(256, num_layers=2, dropout=0.3)
↓
Linear(vocab_size) → log_softmax
↓
CTC Loss ←→ reference transcript

The CNN+LSTM model is the academic experiment. Expect WER of 35–45% depending on training epochs and dataset quality. Whisper or Wav2Vec2 should be selected for production-quality transcriptions.
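A condensed PyTorch sketch of this stack (illustrative; the real implementation in src/models/cnn_lstm.py may differ in details):

import torch
import torch.nn as nn

class CnnLstmASR(nn.Module):
    def __init__(self, n_mfcc=128, vocab_size=50, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2, 2),  # halves both frequency and time
        )
        self.lstm = nn.LSTM(
            input_size=64 * (n_mfcc // 2),  # channels × pooled feature bins
            hidden_size=hidden, num_layers=2,
            bidirectional=True, dropout=0.3, batch_first=True,
        )
        self.fc = nn.Linear(2 * hidden, vocab_size)

    def forward(self, x):                      # x: (batch, 1, n_mfcc, frames)
        f = self.conv(x)                       # (batch, 64, n_mfcc//2, frames//2)
        f = f.permute(0, 3, 1, 2).flatten(2)   # (batch, frames//2, features)
        out, _ = self.lstm(f)
        return self.fc(out).log_softmax(-1)    # per-frame posteriors for CTC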
PyAnnote Audio: Speaker Diarization
PyAnnote is a speaker diarization toolkit built on PyTorch. The pyannote/speaker-diarization-3.1 pipeline chains voice-activity detection (a segmentation model), speaker embedding extraction (ECAPA-TDNN d-vectors), and spectral clustering into a single end-to-end pipeline. It outputs a speaker-labeled timeline with RTTM-compatible start/end timestamps.
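Loading and running the pipeline takes a few lines (a sketch; pyannote models require a free HuggingFace token that has accepted the model's gating terms):

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # your HuggingFace access token
)
diarization = pipeline("audio.wav", num_speakers=2)  # num_speakers is optional
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s to {turn.end:.1f}s")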
Wav2Vec 2.0: Emotion Classification
A Wav2Vec 2.0 base checkpoint fine-tuned for speech emotion recognition. A mean-pooling layer followed by a 4-class linear head is trained on prosodic emotional datasets. The four classes are happy, angry, neutral, sad. The model outputs a probability distribution over all four, making it interpretable beyond a single predicted label.
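The project's own checkpoint is loaded in src/emotion_detection.py; as a runnable stand-in, superb/wav2vec2-base-superb-er is a comparable public 4-class (neutral / happy / angry / sad) model:

import torch
import librosa
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

name = "superb/wav2vec2-base-superb-er"  # public stand-in, not the project checkpoint
fe = AutoFeatureExtractor.from_pretrained(name)
model = AutoModelForAudioClassification.from_pretrained(name).eval()

y, _ = librosa.load("clip.wav", sr=16000, mono=True)
inputs = fe(y, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(-1)[0]
print({model.config.id2label[i]: round(p.item(), 3) for i, p in enumerate(probs)})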
mT5 / TF-IDF: Summarization
Two modes are offered. Extractive summarization uses TF-IDF sentence scoring: sentences are ranked by weighted term frequency and the top‑N are returned verbatim; fast, no GPU required. Abstractive summarization runs a multilingual sequence-to-sequence model (Helsinki-NLP or an Arabic T5 variant) to generate a free-form condensed paraphrase of the input text.
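The extractive mode reduces to a few lines of scikit-learn (a naive sketch; the sentence splitting here is simplistic, and src/summarizer.py may handle Arabic punctuation more carefully):

from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(text: str, top_n: int = 3) -> str:
    # Score each sentence by the sum of its TF-IDF term weights,
    # then return the top-N sentences in their original order.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    weights = TfidfVectorizer().fit_transform(sentences).sum(axis=1).A1
    top = sorted(sorted(range(len(sentences)), key=lambda i: -weights[i])[:top_n])
    return ". ".join(sentences[i] for i in top) + "."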
Datasets
Training data sources, sizes, and characteristics
Arabic ASR is data-starved compared to English. All available high-quality labelled Arabic speech resources total a few hundred hours, versus millions of hours for English. The following datasets are used or referenced in this project:
| Dataset | Size | Domain / Dialect | Usage in this project |
|---|---|---|---|
| Mozilla Common Voice (Arabic) | ~25h validated | MSA + some dialects | Primary training set for CNN+LSTM; Wav2Vec2 fine-tune |
| Arabic Speech Corpus (Nawar Halabi) | 1.8h studio | MSA broadcast | Evaluation + pronunciation reference |
| MASC (hirundo-io/MASC) | — | Multi-domain MSA | Keyword spotting evaluation |
| EJUST Dataset (restricted) | University internal | Egyptian Arabic | Dialectal robustness testing |
| IEMOCAP (English reference) | 12h scripted | N/A | Emotion detection training baseline |
Mozilla Common Voice Arabic is gated behind the Mozilla Data Collective. To access it: create a free account at datacollective.mozillafoundation.org, request access to dataset cmn2g7uu701fqo1072r5na25l, then generate an API key from your account dashboard. In Google Colab open the Secrets panel (lock icon in the left sidebar), add a secret named MDC_API_KEY, and paste your key. The notebook reads it with userdata.get("MDC_API_KEY") and uses it as a Bearer token automatically.
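Inside the notebook, the key retrieval amounts to the following (the dataset-download endpoint itself is notebook-internal and not reproduced here):

import requests
from google.colab import userdata  # Colab-only module

api_key = userdata.get("MDC_API_KEY")  # reads the Colab secret
headers = {"Authorization": f"Bearer {api_key}"}
# The notebook passes these headers to its download requests, e.g.:
# requests.get(download_url, headers=headers)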
You do not need to download any dataset to use the inference pipeline. Models are fine-tuned checkpoints served from HuggingFace and downloaded automatically on first use.
Data preprocessing
Every audio file is resampled to 16 kHz mono before any model sees it. For the CNN+LSTM training pipeline, 128 MFCC coefficients are extracted per frame (hop = 512, window = 2048). Feature maps are normalised to zero-mean unit-variance per sample. During training, SpecAugment is applied: random time masks (T=50 frames) and frequency masks (F=40 bins) are zeroed out to improve noise robustness.
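With torchaudio, the masking step looks roughly like this (mask sizes from this section; applying frequency masks to MFCC rows follows the pipeline described above):

import torch
import torchaudio.transforms as T

time_mask = T.TimeMasking(time_mask_param=50)       # up to 50 frames zeroed
freq_mask = T.FrequencyMasking(freq_mask_param=40)  # up to 40 bins zeroed

feats = torch.randn(1, 128, 400)  # (batch, n_mfcc, frames), already normalised
for _ in range(2):                # two masks of each kind per sample
    feats = time_mask(feats)
    feats = freq_mask(feats)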
Audio Processing
From raw waveform to neural network input
Neural models cannot consume raw audio bytes directly; they need structured numerical representations. Three representations are used across the different models:
MFCC (CNN+LSTM)
Mel-Frequency Cepstral Coefficients approximate the human auditory system. A Short-Time Fourier Transform (STFT) extracts frequency content over overlapping frames, the frequency axis is warped to the perceptual Mel scale, log energies are computed, and finally a Discrete Cosine Transform produces compact coefficients. The result is a 2D feature map (frequency × time) that the convolutional layers treat like an image.
import librosa, numpy as np
y, sr = librosa.load("audio.wav", sr=16000, mono=True)
mfcc = librosa.feature.mfcc(
y=y, sr=sr, n_mfcc=128, n_fft=2048, hop_length=512
) # shape: (128, T)
# Normalise per sample
mfcc = (mfcc - mfcc.mean()) / (mfcc.std() + 1e-8)

Log-Mel Spectrogram (Whisper)
Whisper uses 80 Mel filter-bank channels with 25ms frames shifted by 10ms. The spectrogram is computed, log-scaled, normalised to [–1, 1], and chunked into 30-second blocks. The convolutional stem of Whisper's encoder processes these chunks in parallel before Transformer layers attend across frames.
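One way to reproduce this preprocessing is with the HuggingFace feature extractor (a sketch, not the project's exact code):

import librosa
from transformers import WhisperFeatureExtractor

fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-medium")
y, _ = librosa.load("audio.wav", sr=16000, mono=True)
features = fe(y, sampling_rate=16000, return_tensors="pt").input_features
print(features.shape)  # (1, 80, 3000): 80 mel channels × one 30-second chunk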
Raw Waveform (Wav2Vec 2.0)
Wav2Vec 2.0 consumes normalised 16kHz waveform values directly. Its first stage, a stack of temporal 1D convolutions, acts as a learned feature extractor that replaces hand-crafted MFCCs. This end-to-end approach captures fine-grained acoustic detail that fixed feature extractors may discard.
Training the CNN+LSTM
Loss function, hyperparameters, and the training loop
The custom model is trained in src/train.py and a complete Colab-ready walkthrough is provided in notebooks/02_cnn_lstm_training.ipynb. The notebook covers data loading, feature extraction, model construction, training loop, and evaluation in sequential cells.
CTC Loss: no forced alignment needed
CTC (Connectionist Temporal Classification) allows training on (audio, transcript) pairs without frame-level phoneme annotations. It marginalises over every possible alignment between the output sequence and the target label by summing probabilities across all valid paths (including repetitions and blank tokens). This makes large-scale ASR training practical.
import torch.nn as nn

criterion = nn.CTCLoss(blank=0, zero_infinity=True)

# log_probs: (T, N, C) — time × batch × vocab_size
# targets, input_lengths, target_lengths describe the padded batch
loss = criterion(log_probs, targets, input_lengths, target_lengths)
Configuration
| Parameter | Value / Strategy |
|---|---|
| Optimiser | AdamW, lr = 3e-4, weight_decay = 1e-4 |
| LR Schedule | OneCycleLR: 10% warmup → cosine anneal |
| Batch size | 16 on-disk + gradient accumulation × 4 = effective 64 |
| Epochs | 30 max with early stopping (patience = 5, monitor val-WER) |
| Dropout | 0.3 in LSTM, 0.1 in FC head |
| Gradient clipping | max_norm = 5.0 (prevents LSTM exploding gradients) |
| SpecAugment | Time mask T = 50 frames, Frequency mask F = 40 bins, 2 masks each |
| Vocab | Arabic character set + blank + space ≈ 50 tokens |
Training requires a CUDA GPU with at least 8 GB VRAM. On CPU, a single epoch over Common Voice Arabic takes several hours. Use the provided Google Colab notebook for free cloud GPU access.
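Putting these settings together, the core update step reduces to the following sketch (model, train_loader, and the CTC criterion from above are assumed defined; validation and early stopping are omitted):

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

EPOCHS, ACCUM = 30, 4  # batch 16 × accumulation 4 = effective 64
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
scheduler = OneCycleLR(optimizer, max_lr=3e-4, pct_start=0.1,
                       total_steps=EPOCHS * len(train_loader) // ACCUM)

for step, (feats, targets, in_lens, tgt_lens) in enumerate(train_loader):
    log_probs = model(feats).permute(1, 0, 2)  # CTCLoss expects (T, N, C)
    # in_lens must be the post-pooling frame counts (T//2 per sample)
    loss = criterion(log_probs, targets, in_lens, tgt_lens) / ACCUM
    loss.backward()
    if (step + 1) % ACCUM == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()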
Evaluation Metrics
How transcription quality is measured objectively
Word Error Rate (WER)
WER is the primary ASR metric. It computes the minimum edit distance between the model's hypothesis and the ground-truth reference at the word level, then normalises by the total number of reference words. Lower is better; perfect transcription = 0%.
WER = (S + D + I) / N

S = Substitutions (wrong word predicted)
D = Deletions (reference word missing from hypothesis)
I = Insertions (extra word in hypothesis)
N = Total reference words

Example:
Reference:  أنا أُحِبّ العِلْم كَثيرًا
Hypothesis: أنا أحب العلوم
3 errors (S, S, D) → WER = 3/4 = 75%
Character Error Rate (CER)
CER applies the same formula character-by-character. For Arabic, CER is often more informative than WER because a single morphological suffix difference counts as one word substitution in WER but only a few character errors in CER, giving credit for partially-correct words.
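src/evaluate.py implements both metrics; as a quick cross-check, the jiwer library (pip install jiwer) reproduces the worked example above:

import jiwer

reference  = "أنا أُحِبّ العِلْم كَثيرًا"
hypothesis = "أنا أحب العلوم"

print(jiwer.wer(reference, hypothesis))  # 0.75: 3 edits over 4 reference words
print(jiwer.cer(reference, hypothesis))  # the same edit distance per character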
Benchmark comparison
| Model | WER (Arabic Common Voice) | CER | RT Factor |
|---|---|---|---|
| Whisper-medium | ≈ 12% | ≈ 4% | 0.3× (fast) |
| Wav2Vec2 XLSR-53 Arabic | ≈ 18% | ≈ 6% | 0.1× (very fast) |
| CNN+LSTM (from scratch) | 35–45%* | ≈ 15% | 0.2× |
* CNN+LSTM WER is approximate. It depends on training duration, dataset size, and whether test audio matches the acoustic domain of training data.
Feature Modules
What each page in the app does, technically
Transcribe
/api/transcribe: whisper_asr / wav2vec2_asr / cnn_lstm
Accepts a raw audio file upload. The selected model runs inference and returns an Arabic transcript. Whisper uses auto-regressive beam search; Wav2Vec2 and CNN+LSTM use greedy CTC decoding. The result can be copied, downloaded, or forwarded to the Summarize or Search modules.
Voice Search
/api/search: FAISS + sentence-transformers
Transcripts (and text notes) are embedded using a multilingual sentence-transformer and stored in a FAISS flat-L2 index on disk. Keyword mode runs TF-IDF BM25-style exact matching; Neural mode performs cosine similarity over the dense embeddings. Results are returned ordered by relevance score.
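The core of the neural mode is only a few lines (an illustrative sketch; the exact sentence-transformer checkpoint is an assumption, and src/search_engine.py adds persistence and keyword search on top):

import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
notes = ["ملاحظة صوتية عن الاجتماع", "تذكير بموعد التسليم"]
emb = encoder.encode(notes, convert_to_numpy=True)

index = faiss.IndexFlatL2(emb.shape[1])  # flat L2 index, as described above
index.add(emb)

query = encoder.encode(["موعد الاجتماع"], convert_to_numpy=True)
dists, ids = index.search(query, k=2)    # lower L2 distance = more similar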
Speaker Diarization
/api/diarize: pyannote/speaker-diarization-3.1
The PyAnnote pipeline runs VAD to detect speech segments, extracts ECAPA-TDNN d-vectors per segment, and clusters them with spectral clustering. Output is a speaker-labelled timeline rendered as colour-coded horizontal bands. Set num_speakers explicitly if known for best accuracy.
Emotion Detection
/api/emotion: Wav2Vec2 4-class emotion classifier
A Wav2Vec2 model with a pooling + classification head outputs probabilities for happy, angry, neutral, and sad. Prosodic features (pitch, energy, rate) are captured implicitly by the Transformer encoder from raw audio. Short clips (3–10 seconds) give the most reliable results.
Summarize
/api/summarize: TF-IDF extractive / mT5 abstractive
Two modes: extractive selects the highest-scoring sentences from the input without rewriting (fast, GPU-free); abstractive generates a condensed paraphrase using a seq-to-seq model. The recommended workflow is Transcribe → copy transcript → Summarize for speech-to-note automation.
Voice Chat + Search
MediaRecorder → /api/transcribe + /api/search/add
The browser MediaRecorder API captures voice messages as WebM/Opus blobs, which are sent to /api/transcribe. The resulting transcript is displayed in the chat bubble and simultaneously indexed via /api/search/add. The inline search bar queries the accumulated index and highlights matches in the message history.
Why Arabic ASR is Hard
The linguistic and technical challenges this system addresses
Arabic is a root-and-pattern language. Grammatical information (tense, person, number, gender, definiteness) is encoded as inflectional patterns applied to three- or four-letter roots, producing a vast number of unique surface forms. A single word like وسيكتبونها (wa-sa-yaktubu-na-hā, “and they will write it”) encodes five pieces of information in one token. This dramatically inflates the effective vocabulary, making any statistical model harder to train compared to analytic languages like English.
Diacritics typically omitted
Written Arabic omits short vowel marks (ḥarakāt) in most everyday text. The word كتب is ambiguous: kataba (he wrote), kutub (books), kutiba (it was written). Models must infer the correct reading entirely from context, a challenge that compounds with the morphological ambiguity above.
Diglossia and dialectal variation
The Arabic-speaking world spans 22 countries whose spoken dialects differ substantially in phonology, lexicon, and grammar. Egyptian, Levantine, Gulf, and Maghrebi Arabic are all linguistically Arabic, yet in some dimensions they differ from one another as much as Portuguese differs from Spanish. Most labelled datasets focus on Modern Standard Arabic (MSA), the formal written register that is rarely spoken in casual conversation, creating a systematic domain mismatch for real-world deployment.
Data scarcity
As of 2025 the best publicly available labelled Arabic speech corpus (Mozilla Common Voice Arabic) contains roughly 25 hours of validated recordings. English ASR models are trained on datasets that are 4 to 5 orders of magnitude larger. This scarcity is the primary reason transfer learning from massively multilingual models (Whisper, XLSR) dramatically outperforms training-from-scratch approaches on limited data, and is the key academic motivation for including both the custom CNN+LSTM and the pre-trained models in this project.
Whisper's strength comes from 680,000 hours of multilingual web audio; Arabic phonology was learned implicitly from a huge variety of sources, giving it robustness that a 25-hour fine-tune alone could never achieve.
Training Notebook
How to run 02_cnn_lstm_training.ipynb end-to-end
The notebook at notebooks/02_cnn_lstm_training.ipynb is a self-contained walkthrough of the custom CNN+LSTM model, from raw dataset to trained checkpoint with evaluation metrics. It is designed to run in Google Colab (free T4 GPU) or any local Jupyter environment with CUDA available.
Environment setup
The first cells install everything needed and mount Google Drive if running in Colab. No manual package management is required beyond running the cells in order.
# If running locally, activate your virtual environment first:
source venv/bin/activate        # macOS/Linux
venv\Scripts\activate           # Windows

# Then launch Jupyter:
jupyter lab notebooks/02_cnn_lstm_training.ipynb
# Or open in VS Code → right-click → Open With → Jupyter Notebook
Cell-by-cell structure
| Cell group | What it does |
|---|---|
| 1: Imports & Config | Installs librosa, torch, torchaudio. Defines SAMPLE_RATE=16000, N_MFCC=128, BATCH_SIZE=16. |
| 2: Dataset Loading | Downloads Mozilla Common Voice Arabic (dataset ID cmn2g7uu701fqo1072r5na25l) via the MDC API using MDC_API_KEY from Colab Secrets. Builds train/val/test splits (80/10/10). |
| 3: Feature Extraction | Converts every audio clip to a normalised MFCC tensor. Applies SpecAugment augmentation online. |
| 4: Model Definition | Defines CnnLstmASR, 2× Conv2D → reshape → 2× BiLSTM → Linear(vocab). Prints parameter count. |
| 5: Training Loop | AdamW + OneCycleLR. CTC loss. Saves best checkpoint to checkpoints/best_model.pt on val-WER improvement. |
| 6: Inference Test | Loads best checkpoint. Runs greedy CTC decoding on 5 test samples. Prints hypothesis vs. reference. |
| 7: Metrics | Computes WER and CER over the full test split. Plots a confusion matrix over the top-20 most common characters. |
| 8: Export | Exports the trained model to ONNX for fast CPU inference via backend/main.py. |
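Cell 8's export step can be as small as the following sketch (input shape, axis names, and dynamic axes are assumptions):

import torch

dummy = torch.randn(1, 1, 128, 400)  # (batch, channel, n_mfcc, frames)
torch.onnx.export(
    model, dummy, "checkpoints/cnn_lstm.onnx",
    input_names=["mfcc"], output_names=["log_probs"],
    dynamic_axes={"mfcc": {0: "batch", 3: "time"},
                  "log_probs": {0: "batch", 1: "time"}},
)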
Expected runtime
On a Colab T4 GPU, one full epoch over the Common Voice Arabic validated set (≈25h audio) takes roughly 35–45 minutes. The notebook defaults to 5 epochs for a quick demo run. Set MAX_EPOCHS = 30 and enable early stopping for production-quality training. A pre-trained checkpoint is included in checkpoints/ so the notebook can also be run in evaluation-only mode by skipping cells 4–5.
Connecting the checkpoint to the backend
Once training is complete, copy the saved checkpoint path into configs/config.yaml under the cnn_lstm.checkpoint key. The FastAPI backend reads this config at startup and loads the model automatically; no code changes required.
# configs/config.yaml
models:
  cnn_lstm:
    checkpoint: "checkpoints/best_model.pt"
    vocab_size: 50
    n_mfcc: 128
    hidden_size: 256
    num_layers: 2

If you want to skip training entirely and just observe the inference pipeline, note that the backend falls back to Whisper automatically when the CNN+LSTM checkpoint is missing. Set the model selector to “CNN+LSTM” only after training completes.
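On the backend side, consuming this config can look like the following sketch (it reuses the illustrative CnnLstmASR class from the Model Zoo section and assumes the checkpoint is a plain state_dict; the real loader lives in backend/main.py):

import torch
import yaml

with open("configs/config.yaml") as f:
    cfg = yaml.safe_load(f)["models"]["cnn_lstm"]

# Hypothetical loader mirroring the described startup behaviour.
model = CnnLstmASR(n_mfcc=cfg["n_mfcc"], vocab_size=cfg["vocab_size"],
                   hidden=cfg["hidden_size"])
model.load_state_dict(torch.load(cfg["checkpoint"], map_location="cpu"))
model.eval()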
Deliverables
What is included, where to find it, and how to verify each item
Every deliverable is present in the repository. The table below maps each item to its exact location. The demo interface (deliverable 6) can be started in under two minutes; see the Local Setup tab for the full command sequence.
| Deliverable | Location | Notes |
|---|---|---|
| Source code | src/ + backend/ + frontend/src/ | Python ML modules in src/, FastAPI server in backend/, Next.js UI in frontend/ |
| Dataset description | deliverables/02_dataset_description.md | Mozilla Common Voice Arabic + 4 supporting datasets, full statistics |
| System architecture | deliverables/03_system_architecture.md | End-to-end pipeline diagram, component breakdown, API surface |
| Experiments | deliverables/04_experiments.md | CNN+LSTM training runs, hyperparameter sweeps, loss curves |
| Evaluation results | deliverables/05_evaluation_results.md | WER/CER per model, benchmark comparison, confusion analysis |
| Demo interface | frontend/ (Next.js 16) | 7-page app: Transcribe, Search, Speakers, Emotion, Summarize, Chat, Guide |
| Bonus: Speaker Diarization | src/speaker_diarization.py | PyAnnote pipeline, ECAPA-TDNN d-vectors, spectral clustering |
| Bonus: Emotion Detection | src/emotion_detection.py | Wav2Vec2 fine-tune, 4 classes, probability distribution output |
| Bonus: Summarization | src/summarizer.py | Extractive TF-IDF + abstractive mT5 modes |
| Bonus: Voice Chat Messenger | frontend/src/app/chat/ | MediaRecorder → transcribe → index → searchable history |
Verifying the demo runs correctly
With the backend running on port 8000 and the frontend on port 3000, the indicator dot in the side-nav will turn green. The same status is reflected as ONLINE in the home screen header. If the dot is red, the most common causes are: backend not started, a missing Python dependency, or a firewall blocking the port.
Running the full pipeline end-to-end
# Terminal 1 — Backend
cd arabic-asr-project
source venv/bin/activate
python backend/main.py
# → Uvicorn running on http://0.0.0.0:8000
# Terminal 2 — Frontend
cd arabic-asr-project/frontend
npm run dev
# → Next.js ready on http://localhost:3000
# Verify API is healthy:
curl http://localhost:8000/api/health
# → {"status":"ok","models_loaded":["whisper","wav2vec2","cnn_lstm"]}Task coverage at a glance
| Required Task | Method / Model | Evaluation |
|---|---|---|
| Speech-to-Text (CNN+LSTM from scratch) | 2D CNN → BiLSTM → CTC | WER + CER, notebook cell 7 |
| Speech-to-Text (pre-trained Whisper) | medium (769M, enc-dec Transformer) | WER ≈ 12%, benchmark table |
| Speech-to-Text (pre-trained Wav2Vec2) | XLSR-53 fine-tune, CTC | WER ≈ 18%, benchmark table |
| WER metric | src/evaluate.py | Formula + code, evaluation section |
| Voice notes search engine | FAISS + sentence-transformers | Semantic + keyword modes, search module |
| Speaker Identification (advanced) | PyAnnote 3.1, ECAPA-TDNN | Diarization Error Rate (DER) |
| Emotion Detection (advanced) | Wav2Vec2 4-class classifier | Per-class accuracy + confusion matrix |
| Summarization (advanced) | TF-IDF extractive + mT5 | ROUGE-L score |
| Demo Interface | Next.js 16 + FastAPI | Live at localhost:3000 |
The recommended verification order is: start backend → confirm green dot → upload a short Arabic WAV file to Transcribe → copy the result to Summarize → archive it → run a Search query. This single flow exercises the three compulsory tasks (ASR, transcript, search) and two optional ones (summarize + index) in under 90 seconds.