System Documentation

User Guide

Project Overview

What this system is and why it was built

The Arabic Audio Intelligence System is a deep-learning-powered platform that converts spoken Arabic audio into searchable, analysable, and summarised text. It was built as an academic deliverable addressing the real-world challenge of processing Arabic speech, a language with complex morphology, optional diacritics, and dozens of spoken dialects, entirely with open-source neural networks.

Rather than a single-purpose transcription tool, the system implements a fully integrated pipeline: raw audio goes in, and structured knowledge (transcripts, speaker timelines, emotional states, summaries, indexed notes) comes out. Every component is independently accessible through a module-driven interface. No external paid APIs are used; all inference is self-hosted.

The project satisfies every item in the course specification, including all optional advanced tasks (speaker diarization, emotion detection, text summarization, and a voice chat messenger with search capabilities).

| Requirement | File / Module | Status |
|---|---|---|
| Speech-to-Text CNN+LSTM (from scratch) | src/models/cnn_lstm.py + notebooks/02_cnn_lstm_training.ipynb | Complete |
| Pre-trained ASR: Whisper | src/models/whisper_asr.py | Complete |
| Pre-trained ASR: Wav2Vec 2.0 | src/models/wav2vec2_asr.py | Complete |
| WER / CER Evaluation | src/evaluate.py | Complete |
| Voice Notes Search Engine | src/search_engine.py | Complete |
| Speaker Diarization (advanced) | src/speaker_diarization.py | Complete |
| Emotion Detection (advanced) | src/emotion_detection.py | Complete |
| Text Summarization (advanced) | src/summarizer.py | Complete |
| Voice Chat Messenger (advanced) | frontend/src/app/chat/ | Complete |
| Live Demo Interface | Next.js 16 + FastAPI | Complete |

System Architecture

How data flows from audio input to structured output

The architecture is split into two independent layers: a Python FastAPI backend at port 8000 that handles all ML computation, and a Next.js 16 frontend at port 3000 that renders the interface. The two communicate via a stateless REST API over HTTP, meaning the backend can be hosted on a GPU server anywhere while the frontend can be served from a CDN.

pipeline
Audio Input  (.wav / .mp3 / .flac / .ogg)
       │
       ▼
 ┌─────────────────────────────────────────┐
 │           Audio Preprocessing           │
 │   Resample to 16kHz mono                │
 │   Segment if > 30s (Whisper chunking)   │
 │   MFCC / Log-Mel extraction             │
 └───────────────────┬─────────────────────┘
                     │
          ┌──────────┴──────────┐
          ▼                     ▼
   ┌────────────┐     ┌──────────────────────┐
   │ CNN + LSTM │     │  Whisper / Wav2Vec2  │
   │ (custom)   │     │  (HuggingFace)       │
   └──────┬─────┘     └──────────┬───────────┘
          └──────────┬───────────┘
                     ▼
          ┌──────────────────────┐
          │   Text Transcript    │
          └───────┬──────────────┘
                  │
       ┌──────────┼──────────────┐
       ▼          ▼              ▼
 ┌──────────┐ ┌────────┐ ┌────────────────┐
 │Summarizer│ │Emotion │ │Speaker Diarize │
 │mT5/TF-IDF│ │Wav2Vec2│ │PyAnnote/ECAPA  │
 └──────────┘ └────────┘ └────────────────┘
       │          │              │
       └──────────┴──────────────┘
                  │
                  ▼
        ┌─────────────────────┐
        │    Search Engine    │
        │  FAISS + sentence-  │
        │  transformers       │
        └─────────────────────┘

Backend API surface

| Endpoint | Method | Input | Returns |
|---|---|---|---|
| /api/health | GET | — | { status: 'ok' } |
| /api/transcribe | POST | file (audio), model | { transcript, language, duration } |
| /api/diarize | POST | file (audio), num_speakers | { segments: [{speaker, start, end}] } |
| /api/emotion | POST | file (audio) | { emotion, scores: {happy, angry, …} } |
| /api/summarize | POST | text, max_length, method | { summary, method_used } |
| /api/search | POST | query, search_type | { results: [{text, score, id}] } |
| /api/search/add | POST | transcript, file? | { id, message } |
| /api/search/stats | GET | — | { total_notes, index_size } |

Frontend routing

Each feature is a Next.js App Router page under src/app/. All API calls are centralised in src/lib/api.ts. The dynamic side navbar uses Framer Motion layout animations to transition between a horizontal bottom dock (home page) and a vertical sidebar (all other pages), animated as a fluid spring, not a CSS snap.

Model Zoo

Every neural network used: architecture, rationale, performance

OpenAI Whisper

Whisper is an encoder–decoder Transformer trained on 680,000 hours of multilingual audio. The audio encoder is a convolutional stem followed by a stack of Transformer blocks that produces a context-rich embedding of the log-mel spectrogram. The decoder auto-regressively generates BPE tokens using cross-attention over those encoder outputs. It natively supports Arabic and handles diacritics, code-switching, and noise robustly. We use whisper-medium (769M parameters) as the default; smaller variants are also available.

| Architecture | Parameters | Accuracy | Availability |
|---|---|---|---|
| Transformer encoder–decoder | 769M (medium) | WER ≈ 12% on standard Arabic | Auto-download from HuggingFace |

Wav2Vec 2.0: Arabic fine-tune

Wav2Vec 2.0 (Meta AI) learns speech representations directly from raw waveforms. Its convolutional feature extractor downsamples the audio, and a 24-layer Transformer encoder is pre-trained with a masked contrastive loss (like BERT for speech). A linear CTC head is then fine-tuned on labelled data. We load facebook/wav2vec2-large-xlsr-53-arabic, the XLSR-53 model fine-tuned specifically on the Arabic Common Voice split. Because CTC decoding is non-autoregressive, it is roughly 2× faster than Whisper at inference.

| Pre-training | Decoding | Speed | Best suited for |
|---|---|---|---|
| Self-supervised (XLSR-53) | CTC, no beam search | ~2× faster than Whisper | Short utterances |

CNN + LSTM: Custom Academic Model

Built from scratch to satisfy the course requirement for a deep-learning model designed and trained by students. It operates on MFCC features extracted from audio. Two 2D convolutional layers with batch normalisation extract spectro-temporal patterns from the feature map. The output is reshaped and fed into two bidirectional LSTM layers that model temporal context across frames. A fully-connected layer followed by a softmax over the character vocabulary produces per-frame posteriors; CTC loss handles alignment-free training.

architecture
Input waveform  →  librosa MFCC  →  (128, T) feature map
     ↓
Conv2D(32, 3×3, padding=1) → BN → ReLU
Conv2D(64, 3×3, padding=1) → BN → ReLU → MaxPool2D(2,2)
     ↓
Reshape  →  (T//2,  64 × 64)   [batch × time × features]
     ↓
BiLSTM(256, num_layers=2, dropout=0.3)
     ↓
Linear(vocab_size)   →   log_softmax
     ↓
CTC Loss  ←→  reference transcript
NOTE

The CNN+LSTM model is the academic experiment. Expect WER of 35–45% depending on training epochs and dataset quality. Whisper or Wav2Vec2 should be selected for production-quality transcriptions.
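The architecture diagram above can be sketched in PyTorch. The class name CnnLstmASR and the layer sizes follow the notebook and configs/config.yaml, but the exact layer details here are illustrative, not a copy of the project's code:

```python
import torch
import torch.nn as nn

class CnnLstmASR(nn.Module):
    """Sketch of the CNN+LSTM CTC model: 2x Conv2D -> BiLSTM -> Linear."""

    def __init__(self, n_mfcc=128, hidden_size=256, vocab_size=50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2, 2),  # halves both frequency and time
        )
        # after pooling: 64 channels x (n_mfcc // 2) frequency bins per frame
        self.lstm = nn.LSTM(
            64 * (n_mfcc // 2), hidden_size, num_layers=2,
            batch_first=True, bidirectional=True, dropout=0.3,
        )
        self.fc = nn.Linear(hidden_size * 2, vocab_size)

    def forward(self, x):                     # x: (batch, 1, n_mfcc, T)
        z = self.conv(x)                      # (batch, 64, n_mfcc//2, T//2)
        z = z.permute(0, 3, 1, 2).flatten(2)  # (batch, T//2, 64 * n_mfcc//2)
        z, _ = self.lstm(z)
        return self.fc(z).log_softmax(-1)     # per-frame log-probabilities

model = CnnLstmASR()
out = model(torch.randn(2, 1, 128, 100))
print(out.shape)  # torch.Size([2, 50, 50])
```

A 100-frame MFCC map comes out as 50 time steps of log-probabilities over the ~50-token character vocabulary, ready for CTC loss.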

PyAnnote Audio: Speaker Diarization

PyAnnote is a speaker diarization toolkit built on PyTorch. The pyannote/speaker-diarization-3.1 pipeline chains voice-activity detection (a segmentation model), speaker-embedding extraction (ECAPA-TDNN), and spectral clustering into a single end-to-end workflow. It outputs a speaker-labelled timeline with RTTM-compatible start/end timestamps.

Wav2Vec 2.0: Emotion Classification

A Wav2Vec 2.0 base checkpoint fine-tuned for speech emotion recognition. A mean-pooling layer followed by a 4-class linear head is trained on prosodic emotional datasets. The four classes are happy, angry, neutral, sad. The model outputs a probability distribution over all four, making it interpretable beyond a single predicted label.

mT5 / TF-IDF: Summarization

Two modes are offered. Extractive summarization uses TF-IDF sentence scoring: sentences are ranked by weighted term frequency and the top‑N are returned verbatim; fast, no GPU required. Abstractive summarization runs a multilingual sequence-to-sequence model (Helsinki-NLP or an Arabic T5 variant) to generate a free-form condensed paraphrase of the input text.
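The extractive mode described above can be sketched in pure Python. This is an illustration of TF-IDF sentence scoring, not the code in src/summarizer.py; the function name and scoring details are assumptions:

```python
import math
import re
from collections import Counter

def extractive_summary(text, top_n=2):
    """Rank sentences by mean TF-IDF of their terms; return top-N verbatim."""
    sentences = [s.strip() for s in re.split(r"[.!?]\s*", text) if s.strip()]
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # document frequency: in how many sentences each term appears
    df = Counter(t for doc in docs for t in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # mean TF-IDF over the sentence's unique terms
        score = sum((tf[t] / len(doc)) * math.log(n / df[t]) for t in tf) / len(tf)
        scores.append(score)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(ranked)]  # keep original order
```

Sentences built from distinctive terms (high IDF) rise to the top; sentences made of words shared by every other sentence score near zero.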

Datasets

Training data sources, sizes, and characteristics

Arabic ASR is data-starved compared to English. All available high-quality labelled Arabic speech resources total a few hundred hours, versus millions of hours for English. The following datasets are used or referenced in this project:

| Dataset | Size | Domain / Dialect | Usage in this project |
|---|---|---|---|
| Mozilla Common Voice (Arabic) | ~25h validated | MSA + some dialects | Primary training set for CNN+LSTM; Wav2Vec2 fine-tune |
| Arabic Speech Corpus (Nawar Halabi) | 1.8h studio | MSA broadcast | Evaluation + pronunciation reference |
| MASC (hirundo-io/MASC) | Multi-domain | MSA | Keyword spotting evaluation |
| EJUST Dataset (restricted) | University internal | Egyptian Arabic | Dialectal robustness testing |
| IEMOCAP (English reference) | 12h scripted | N/A | Emotion detection training baseline |
IMPORTANT

Mozilla Common Voice Arabic is gated behind the Mozilla Data Collective. To access it: create a free account at datacollective.mozillafoundation.org, request access to dataset cmn2g7uu701fqo1072r5na25l, then generate an API key from your account dashboard. In Google Colab open the Secrets panel (lock icon in the left sidebar), add a secret named MDC_API_KEY, and paste your key. The notebook reads it with userdata.get("MDC_API_KEY") and uses it as a Bearer token automatically.

TIP

You do not need to download any dataset to use the inference pipeline. Models are fine-tuned checkpoints served from HuggingFace and downloaded automatically on first use.

Data preprocessing

Every audio file is resampled to 16 kHz mono before any model sees it. For the CNN+LSTM training pipeline, 128 MFCC coefficients are extracted per frame (hop = 512, window = 2048). Feature maps are normalised to zero-mean unit-variance per sample. During training, SpecAugment is applied: random time masks (T=50 frames) and frequency masks (F=40 bins) are zeroed out to improve noise robustness.
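The SpecAugment step can be sketched with numpy using the mask sizes quoted above (the helper name and random-placement details are illustrative):

```python
import numpy as np

def spec_augment(feat, time_mask=50, freq_mask=40, n_masks=2, rng=None):
    """Zero out random time and frequency stripes of an (F, T) feature map."""
    if rng is None:
        rng = np.random.default_rng()
    feat = feat.copy()
    n_freq, n_time = feat.shape
    for _ in range(n_masks):
        t0 = rng.integers(0, max(1, n_time - time_mask))
        feat[:, t0:t0 + time_mask] = 0.0   # time mask: T consecutive frames
        f0 = rng.integers(0, max(1, n_freq - freq_mask))
        feat[f0:f0 + freq_mask, :] = 0.0   # frequency mask: F consecutive bins
    return feat

mfcc = np.random.randn(128, 400)  # (n_mfcc, frames)
aug = spec_augment(mfcc)
```

Masking is applied online during training only, so each epoch sees differently corrupted versions of the same clip.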

Audio Processing

From raw waveform to neural network input

Neural models cannot consume raw audio bytes directly; they need structured numerical representations. Three representations are used across the different models:

MFCC (CNN+LSTM)

Mel-Frequency Cepstral Coefficients approximate the human auditory system. A Short-Time Fourier Transform (STFT) extracts frequency content over overlapping frames, the frequency axis is warped to the perceptual Mel scale, log energies are computed, and finally a Discrete Cosine Transform produces compact coefficients. The result is a 2D feature map (frequency × time) that the convolutional layers treat like an image.

python
import librosa, numpy as np

y, sr = librosa.load("audio.wav", sr=16000, mono=True)

mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=128, n_fft=2048, hop_length=512
)  # shape: (128, T)

# Normalise per sample
mfcc = (mfcc - mfcc.mean()) / (mfcc.std() + 1e-8)

Log-Mel Spectrogram (Whisper)

Whisper uses 80 Mel filter-bank channels with 25ms frames shifted by 10ms. The spectrogram is computed, log-scaled, normalised to [–1, 1], and chunked into 30-second blocks. The convolutional stem of Whisper's encoder processes these chunks in parallel before Transformer layers attend across frames.

Raw Waveform (Wav2Vec 2.0)

Wav2Vec 2.0 consumes normalised 16kHz waveform values directly. Its first stage, a stack of temporal 1D convolutions, acts as a learned feature extractor that replaces hand-crafted MFCCs. This end-to-end approach captures fine-grained acoustic detail that fixed feature extractors may discard.

Training the CNN+LSTM

Loss function, hyperparameters, and the training loop

The custom model is trained in src/train.py and a complete Colab-ready walkthrough is provided in notebooks/02_cnn_lstm_training.ipynb. The notebook covers data loading, feature extraction, model construction, training loop, and evaluation in sequential cells.

CTC Loss: no forced alignment needed

CTC (Connectionist Temporal Classification) allows training on (audio, transcript) pairs without frame-level phoneme annotations. It marginalises over every possible alignment between the output sequence and the target label by summing probabilities across all valid paths (including repetitions and blank tokens). This makes large-scale ASR training practical.

python
import torch
import torch.nn as nn

criterion = nn.CTCLoss(blank=0, zero_infinity=True)

# log_probs: (T, N, C) — time × batch × vocab_size
log_probs = torch.randn(50, 16, 50).log_softmax(2)
targets = torch.randint(1, 50, (16, 30))         # label ids; blank (0) excluded
input_lengths = torch.full((16,), 50)
target_lengths = torch.randint(10, 31, (16,))

loss = criterion(log_probs, targets, input_lengths, target_lengths)

Configuration

| Parameter | Value / Strategy |
|---|---|
| Optimiser | AdamW, lr = 3e-4, weight_decay = 1e-4 |
| LR Schedule | OneCycleLR: 10% warmup → cosine anneal |
| Batch size | 16 on-disk + gradient accumulation × 4 = effective 64 |
| Epochs | 30 max with early stopping (patience = 5, monitor val-WER) |
| Dropout | 0.3 in LSTM, 0.1 in FC head |
| Gradient clipping | max_norm = 5.0 (prevents LSTM exploding gradients) |
| SpecAugment | Time mask T = 50 frames, frequency mask F = 40 bins, 2 masks each |
| Vocab | Arabic character set + blank + space ≈ 50 tokens |
IMPORTANT

Training requires a CUDA GPU with at least 8 GB VRAM. On CPU, a single epoch over Common Voice Arabic takes several hours. Use the provided Google Colab notebook for free cloud GPU access.
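The early-stopping rule from the configuration table (patience = 5, monitoring validation WER) amounts to a few lines of bookkeeping. A minimal sketch, with an assumed class name rather than the project's actual implementation:

```python
class EarlyStopping:
    """Stop training when the monitored metric stops improving (lower is better)."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_wer):
        """Call once per epoch; returns True when training should stop."""
        if val_wer < self.best:
            self.best = val_wer     # improvement: this is where the best checkpoint is saved
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With patience 5, a run whose validation WER plateaus after epoch 3 terminates at epoch 8 rather than burning the full 30-epoch budget.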

Evaluation Metrics

How transcription quality is measured objectively

Word Error Rate (WER)

WER is the primary ASR metric. It computes the minimum edit distance between the model's hypothesis and the ground-truth reference at the word level, then normalises by the total number of reference words. Lower is better; perfect transcription = 0%.

formula
WER = (S + D + I) / N

  S = Substitutions  (wrong word predicted)
  D = Deletions      (reference word missing from hypothesis)
  I = Insertions     (extra word in hypothesis)
  N = Total reference words

Example —
  Reference:   أنا   أُحِبّ   العِلْم   كَثيرًا
  Hypothesis:  أنا   أحب     العلوم              ← 3 errors (S, S, D)
  WER = 3/4 = 75%

Character Error Rate (CER)

CER applies the same formula character by character. For Arabic, CER is often more informative than WER because a single morphological suffix difference counts as a whole word substitution in WER but only a few character errors in CER, giving credit for partially correct words.
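Both metrics reduce to the same Levenshtein edit-distance computation, applied to word lists for WER and character lists for CER. A self-contained sketch (function names are illustrative; src/evaluate.py holds the project's implementation):

```python
def edit_distance(ref, hyp):
    """Minimum substitutions + deletions + insertions between two sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,         # deletion (reference item missing)
                d[j - 1] + 1,     # insertion (extra hypothesis item)
                prev + (r != h),  # substitution, or free match
            )
    return d[len(hyp)]

def wer(ref, hyp):
    """Word Error Rate: (S + D + I) / N over words."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character Error Rate: same formula over characters."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

Note that both can exceed 100% when the hypothesis contains many insertions, which is why WER is an error rate rather than an accuracy.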

Benchmark comparison

| Model | WER (Arabic Common Voice) | CER | RT Factor |
|---|---|---|---|
| Whisper-medium | ≈ 12% | ≈ 4% | 0.3× (fast) |
| Wav2Vec2 XLSR-53 Arabic | ≈ 18% | ≈ 6% | 0.1× (very fast) |
| CNN+LSTM (from scratch) | 35–45%* | ≈ 15% | 0.2× |
NOTE

* CNN+LSTM WER is approximate. It depends on training duration, dataset size, and whether test audio matches the acoustic domain of training data.

Feature Modules

What each page in the app does, technically

Transcribe

/api/transcribe: whisper_asr / wav2vec2_asr / cnn_lstm

Accepts a raw audio file upload. The selected model runs inference and returns an Arabic transcript. Whisper uses auto-regressive beam search; Wav2Vec2 and CNN+LSTM use greedy CTC decoding. The result can be copied, downloaded, or forwarded to the Summarize or Search modules.
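Greedy CTC decoding, as used by the Wav2Vec2 and CNN+LSTM paths, is simple enough to show in full: take the argmax token per frame, collapse consecutive repeats, then drop blanks. A sketch with an assumed function name:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeats, then remove CTC blank tokens."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev:        # collapse repeated frames ("aa" -> "a")
            if t != blank:   # drop the blank separator
                out.append(t)
        prev = t
    return out

# e.g. per-frame argmax [0, 5, 5, 0, 3, 3, 3, 0, 0, 5] decodes to [5, 3, 5]
```

The blank between the two 5s is what lets CTC emit genuinely doubled characters; without it, "5 5" would always collapse to a single token.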

Voice Search

/api/search: FAISS + sentence-transformers

Transcripts (and text notes) are embedded using a multilingual sentence-transformer and stored in a FAISS flat-L2 index on disk. Keyword mode runs TF-IDF-weighted (BM25-style) term matching; Neural mode performs cosine similarity over the dense embeddings. Results are returned ordered by relevance score.
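The neural mode's ranking step can be illustrated with plain numpy in place of FAISS and sentence-transformers (the function name and toy 2-D embeddings are assumptions for the sketch):

```python
import numpy as np

def cosine_search(query_vec, index, top_k=3):
    """Rank stored embeddings by cosine similarity to the query vector."""
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_n = query_vec / np.linalg.norm(query_vec)
    scores = index_n @ query_n                 # cosine similarity per note
    order = np.argsort(-scores)[:top_k]        # highest first
    return [(int(i), float(scores[i])) for i in order]

index = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 stored "notes"
results = cosine_search(np.array([1.0, 0.1]), index, top_k=2)
```

In production the same idea runs at scale: FAISS stores the (normalised) sentence-transformer vectors and returns the nearest neighbours without the brute-force matrix product.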

Speaker Diarization

/api/diarize: pyannote/speaker-diarization-3.1

The PyAnnote pipeline runs VAD to detect speech segments, extracts an ECAPA-TDNN speaker embedding per segment, and groups them with spectral clustering. Output is a speaker-labelled timeline rendered as colour-coded horizontal bands. Set num_speakers explicitly if known for best accuracy.

Emotion Detection

/api/emotion: Wav2Vec2 4-class emotion classifier

A Wav2Vec2 model with a pooling + classification head outputs probabilities for happy, angry, neutral, and sad. Prosodic features (pitch, energy, rate) are captured implicitly by the Transformer encoder from raw audio. Short clips (3–10 seconds) give the most reliable results.
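The classification head produces one logit per class; a softmax turns these into the probability distribution returned by /api/emotion. A sketch with illustrative logit values (not real model output):

```python
import numpy as np

def emotion_distribution(logits, labels=("happy", "angry", "neutral", "sad")):
    """Softmax over 4-class logits -> interpretable probability distribution."""
    z = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    probs = z / z.sum()
    return dict(zip(labels, probs.round(3)))

dist = emotion_distribution(np.array([2.0, 0.1, 0.5, -1.0]))
```

Returning the full distribution rather than just the argmax lets the UI flag uncertain predictions, e.g. a 0.45 / 0.40 happy-neutral split.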

Summarize

/api/summarize: TF-IDF extractive / mT5 abstractive

Two modes: extractive selects the highest-scoring sentences from the input without rewriting (fast, GPU-free); abstractive generates a condensed paraphrase using a seq-to-seq model. The recommended workflow is Transcribe → copy transcript → Summarize for speech-to-note automation.

Voice Chat + Search

MediaRecorder → /api/transcribe + /api/search/add

The browser MediaRecorder API captures voice messages as WebM/Opus blobs, which are sent to /api/transcribe. The resulting transcript is displayed in the chat bubble and simultaneously indexed via /api/search/add. The inline search bar queries the accumulated index and highlights matches in the message history.

Why Arabic ASR is Hard

The linguistic and technical challenges this system addresses

Arabic is a root-and-pattern language. Grammatical information (tense, person, number, gender, definiteness) is encoded as inflectional patterns applied to three- or four-letter roots, producing a vast number of unique surface forms. A single word like وسيكتبونها (wa-sa-yaktubu-na-hā, “and they will write it”) encodes five pieces of information in one token. This dramatically inflates the effective vocabulary, making any statistical model harder to train compared to analytic languages like English.

Diacritics typically omitted

Written Arabic omits short vowel marks (ḥarakāt) in most everyday text. The word كتب is ambiguous: kataba (he wrote), kutub (books), kutiba (it was written). Models must infer the correct reading entirely from context, a challenge that compounds with the morphological ambiguity above.

Diglossia and dialectal variation

The Arabic-speaking world spans 22 countries with spoken dialects that differ substantially in phonology, lexicon, and grammar. Egyptian, Levantine, Gulf, and Maghrebi Arabic are all linguistically Arabic, yet along some dimensions they differ from one another roughly as Portuguese differs from Spanish. Most labelled datasets focus on Modern Standard Arabic (MSA), the formal written register that is rarely spoken in casual conversation, creating a systematic domain mismatch for real-world deployment.

Data scarcity

As of 2025 the best publicly available labelled Arabic speech corpus (Mozilla Common Voice Arabic) contains roughly 25 hours of validated recordings. English ASR models are trained on datasets that are 4 to 5 orders of magnitude larger. This scarcity is the primary reason transfer learning from massively multilingual models (Whisper, XLSR) dramatically outperforms training-from-scratch approaches on limited data, and is the key academic motivation for including both the custom CNN+LSTM and the pre-trained models in this project.

TIP

Whisper's strength comes from 680,000 hours of multilingual web audio; Arabic phonology was learned implicitly from a huge variety of sources, giving it robustness that a 25-hour fine-tune alone could never achieve.

Training Notebook

How to run 02_cnn_lstm_training.ipynb end-to-end

The notebook at notebooks/02_cnn_lstm_training.ipynb is a self-contained walkthrough of the custom CNN+LSTM model, from raw dataset to trained checkpoint with evaluation metrics. It is designed to run in Google Colab (free T4 GPU) or any local Jupyter environment with CUDA available.

Environment setup

The first cells install everything needed and mount Google Drive if running in Colab. No manual package management is required beyond running the cells in order.

bash
# If running locally, activate your virtual environment first:
source venv/bin/activate        # macOS/Linux
venv\Scripts\activate           # Windows

# Then launch Jupyter:
jupyter lab notebooks/02_cnn_lstm_training.ipynb

# Or open in VS Code → right-click → Open With → Jupyter Notebook

Cell-by-cell structure

| Cell group | What it does |
|---|---|
| 1: Imports & Config | Installs librosa, torch, torchaudio. Defines SAMPLE_RATE=16000, N_MFCC=128, BATCH_SIZE=16. |
| 2: Dataset Loading | Downloads Mozilla Common Voice Arabic (dataset ID cmn2g7uu701fqo1072r5na25l) via the MDC API using MDC_API_KEY from Colab Secrets. Builds train/val/test splits (80/10/10). |
| 3: Feature Extraction | Converts every audio clip to a normalised MFCC tensor. Applies SpecAugment augmentation online. |
| 4: Model Definition | Defines CnnLstmASR: 2× Conv2D → reshape → 2× BiLSTM → Linear(vocab). Prints parameter count. |
| 5: Training Loop | AdamW + OneCycleLR. CTC loss. Saves best checkpoint to checkpoints/best_model.pt on val-WER improvement. |
| 6: Inference Test | Loads best checkpoint. Runs greedy CTC decoding on 5 test samples. Prints hypothesis vs. reference. |
| 7: Metrics | Computes WER and CER over the full test split. Plots a confusion matrix over the top-20 most common characters. |
| 8: Export | Exports the trained model to ONNX for fast CPU inference via backend/main.py. |

Expected runtime

On a Colab T4 GPU, one full epoch over the Common Voice Arabic validated set (≈25h audio) takes roughly 35–45 minutes. The notebook defaults to 5 epochs for a quick demo run. Set MAX_EPOCHS = 30 and enable early stopping for production-quality training. A pre-trained checkpoint is included in checkpoints/ so the notebook can also be run in evaluation-only mode by skipping cells 4–5.

Connecting the checkpoint to the backend

Once training is complete, copy the saved checkpoint path into configs/config.yaml under the cnn_lstm.checkpoint key. The FastAPI backend reads this config at startup and loads the model automatically; no code changes required.

yaml
# configs/config.yaml
models:
  cnn_lstm:
    checkpoint: "checkpoints/best_model.pt"
    vocab_size: 50
    n_mfcc: 128
    hidden_size: 256
    num_layers: 2
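A sketch of how a backend might read this key at startup, assuming PyYAML. The config is inlined as a string here so the example is self-contained; the real server would open configs/config.yaml from disk:

```python
import yaml

# Inline copy of configs/config.yaml for illustration only.
CONFIG_TEXT = """
models:
  cnn_lstm:
    checkpoint: "checkpoints/best_model.pt"
    vocab_size: 50
    n_mfcc: 128
    hidden_size: 256
    num_layers: 2
"""

config = yaml.safe_load(CONFIG_TEXT)
cnn_cfg = config["models"]["cnn_lstm"]  # dict of model hyperparameters
```

Keeping hyperparameters in the config rather than in code is what allows the checkpoint swap to require no code changes.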
TIP

To skip training entirely and just observe the inference pipeline, the backend will fall back to Whisper automatically when the CNN+LSTM checkpoint is missing. Set the model selector to “CNN+LSTM” only after training completes.

Deliverables

What is included, where to find it, and how to verify each item

Every deliverable is present in the repository. The table below maps each item to its exact location. The demo interface (deliverable 6) can be started in under two minutes; see the Local Setup tab for the full command sequence.

| Deliverable | Location | Notes |
|---|---|---|
| Source code | src/ + backend/ + frontend/src/ | Python ML modules in src/, FastAPI server in backend/, Next.js UI in frontend/ |
| Dataset description | deliverables/02_dataset_description.md | Mozilla Common Voice Arabic + 4 supporting datasets, full statistics |
| System architecture | deliverables/03_system_architecture.md | End-to-end pipeline diagram, component breakdown, API surface |
| Experiments | deliverables/04_experiments.md | CNN+LSTM training runs, hyperparameter sweeps, loss curves |
| Evaluation results | deliverables/05_evaluation_results.md | WER/CER per model, benchmark comparison, confusion analysis |
| Demo interface | frontend/ (Next.js 16) | 7-page app: Transcribe, Search, Speakers, Emotion, Summarize, Chat, Guide |
| Bonus: Speaker Diarization | src/speaker_diarization.py | PyAnnote pipeline, ECAPA-TDNN embeddings, spectral clustering |
| Bonus: Emotion Detection | src/emotion_detection.py | Wav2Vec2 fine-tune, 4 classes, probability distribution output |
| Bonus: Summarization | src/summarizer.py | Extractive TF-IDF + abstractive mT5 modes |
| Bonus: Voice Chat Messenger | frontend/src/app/chat/ | MediaRecorder → transcribe → index → searchable history |

Verifying the demo runs correctly

With the backend running on port 8000 and the frontend on port 3000, the indicator dot in the side-nav will turn green. The same status is reflected as ONLINE in the home screen header. If the dot is red, the most common causes are: backend not started, a missing Python dependency, or a firewall blocking the port.

Running the full pipeline end-to-end

bash
# Terminal 1 — Backend
cd arabic-asr-project
source venv/bin/activate
python backend/main.py
# → Uvicorn running on http://0.0.0.0:8000

# Terminal 2 — Frontend
cd arabic-asr-project/frontend
npm run dev
# → Next.js ready on http://localhost:3000

# Verify API is healthy:
curl http://localhost:8000/api/health
# → {"status":"ok","models_loaded":["whisper","wav2vec2","cnn_lstm"]}

Task coverage at a glance

| Required Task | Method / Model | Evaluation |
|---|---|---|
| Speech-to-Text (CNN+LSTM from scratch) | 2D CNN → BiLSTM → CTC | WER + CER, notebook cell 7 |
| Speech-to-Text (pre-trained Whisper) | medium (769M, enc-dec Transformer) | WER ≈ 12%, benchmark table |
| Speech-to-Text (pre-trained Wav2Vec2) | XLSR-53 fine-tune, CTC | WER ≈ 18%, benchmark table |
| WER metric | src/evaluate.py | Formula + code, evaluation section |
| Voice notes search engine | FAISS + sentence-transformers | Semantic + keyword modes, search module |
| Speaker Identification (advanced) | PyAnnote 3.1, ECAPA-TDNN | Diarization Error Rate (DER) |
| Emotion Detection (advanced) | Wav2Vec2 4-class classifier | Per-class accuracy + confusion matrix |
| Summarization (advanced) | TF-IDF extractive + mT5 | ROUGE-L score |
| Demo Interface | Next.js 16 + FastAPI | Live at localhost:3000 |
NOTE

The recommended verification order is: start backend → confirm green dot → upload a short Arabic WAV file to Transcribe → copy the result to Summarize → archive it → run a Search query. This single flow exercises the three compulsory tasks (ASR, transcript, search) and two optional ones (summarize + index) in under 90 seconds.