Dynamic Nepali Sign Language Recognition
CNN + LSTM Model for Gesture-Based Communication

Project Overview
- Objective: Build an AI system to dynamically recognize Nepali Sign Language (NSL) gestures and bridge the communication gap between deaf-mute and hearing-speaking individuals.
- Tools & Technologies: Python, TensorFlow, InceptionV3, LSTM, FFmpeg, NumPy, Google Colab
Problem Statement
Many deaf-mute individuals in Nepal lack early exposure to Nepali Sign Language (NSL), making communication with the general public challenging. There is a shortage of interpreters, and most families lack NSL proficiency. This project aims to create a system that can dynamically translate NSL gestures into text/voice and vice versa to bridge this communication gap.
Dataset
Due to the lack of publicly available NSL datasets, a custom dataset was created for five NSL signs: father, food, promise, tea, and wife. Each sign was performed 20 times under different lighting and background conditions, giving 80 videos per class (66 for training, 14 for testing). Data augmentation then expanded the dataset to 1,650 videos in total (330 per class); the augmentation operations are sketched below.
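The following is a minimal sketch of the augmentation operations (flipping, rotation, contrast manipulation) written with OpenCV; the rotation angle and contrast factor are not stated in the project, so the values here are illustrative assumptions.

```python
import cv2

def augment_frame(frame, angle=10.0, contrast=1.2):
    """Return flipped, rotated, and contrast-adjusted variants of a single frame.

    angle and contrast are illustrative values; the project's actual
    augmentation parameters are not specified.
    """
    h, w = frame.shape[:2]
    flipped = cv2.flip(frame, 1)                                     # horizontal flip
    rot_mat = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)    # rotate about the centre
    rotated = cv2.warpAffine(frame, rot_mat, (w, h))
    contrasted = cv2.convertScaleAbs(frame, alpha=contrast, beta=0)  # scale pixel intensities
    return flipped, rotated, contrasted
```

Applying the same transform to every frame of a clip yields an additional augmented copy of that video.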
Preprocessing
- Augmented videos using flipping, rotation, and contrast manipulation.
- Extracted frames using FFmpeg and generated a metadata CSV.
- Downsampled each video to 40 frames to reduce overfitting.
- Extracted features using InceptionV3, producing a 40x2048 feature matrix per video (see the sketch below).
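A minimal sketch of the downsampling and feature-extraction steps, assuming frames have already been dumped to disk with FFmpeg (e.g. `ffmpeg -i clip.mp4 frames/%04d.jpg`); the function names and resizing details are illustrative.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

SEQ_LEN = 40      # frames kept per video after downsampling
FEAT_DIM = 2048   # size of InceptionV3's pooled feature vector

# InceptionV3 pretrained on ImageNet; global average pooling gives one 2048-d vector per frame.
feature_extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def downsample(frames, seq_len=SEQ_LEN):
    """Pick seq_len evenly spaced frames from a full video."""
    idx = np.linspace(0, len(frames) - 1, seq_len).astype(int)
    return [frames[i] for i in idx]

def video_to_features(frames):
    """Convert a list of HxWx3 frames into a (40, 2048) feature matrix."""
    frames = downsample(frames)
    batch = np.stack([tf.image.resize(f, (299, 299)).numpy() for f in frames])
    batch = preprocess_input(batch)                      # scale inputs to [-1, 1]
    return feature_extractor.predict(batch, verbose=0)   # shape: (SEQ_LEN, FEAT_DIM)
```

Each video is thereby reduced to a 40x2048 matrix, which is the input the LSTM layers in the next section consume.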
Methodology
- Feature Extraction: Used InceptionV3 (pretrained on ImageNet) to extract spatial features from video frames.
- Temporal Modeling: Passed features to LSTM layers to model sequential dependencies.
- Model Architecture (sketched in code after this list):
  - Two LSTM layers (256 and 128 units)
  - Dense + ReLU + Dropout layers
  - Final Dense layer with softmax for classification
- Training Configuration:
  - Loss function: categorical crossentropy
  - Optimizer: Adam (lr=1e-5, decay=1e-5)
  - Metrics: accuracy, top-k accuracy
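A minimal Keras sketch matching the architecture and training configuration above; the Dense width and dropout rate are not given in the text, so the values here are assumptions.

```python
from tensorflow.keras import layers, models, optimizers

NUM_CLASSES = 5  # father, food, promise, tea, wife

model = models.Sequential([
    layers.Input(shape=(40, 2048)),           # 40 frames x 2048 InceptionV3 features
    layers.LSTM(256, return_sequences=True),  # first LSTM layer
    layers.LSTM(128),                         # second LSTM layer
    layers.Dense(64, activation="relu"),      # width of 64 is an assumption
    layers.Dropout(0.5),                      # dropout rate of 0.5 is an assumption
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    loss="categorical_crossentropy",
    # decay=1e-5 as reported; older tf.keras optimizers accept a `decay` argument,
    # newer versions express it as a learning-rate schedule instead.
    optimizer=optimizers.Adam(learning_rate=1e-5),
    metrics=["accuracy", "top_k_categorical_accuracy"],
)
```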
Experimental Results
- Trained for 100 epochs on Google Colab with GPU support (training call sketched after this list)
- Initial model with 66 videos/class → 98% training accuracy, 20% validation accuracy (overfitting)
- Augmented dataset to 330 videos/class → 89% training accuracy, 59% validation accuracy
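An illustrative training and evaluation call matching the 100-epoch setup; the array names and batch size are assumptions.

```python
# train_features/val_features: (N, 40, 2048) arrays; labels are one-hot over the 5 classes.
history = model.fit(
    train_features, train_labels,
    validation_data=(val_features, val_labels),
    epochs=100,
    batch_size=8,   # batch size is not stated in the text; 8 is an assumption
)
val_loss, val_acc, val_top_k = model.evaluate(val_features, val_labels)
```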
Conclusion
The system effectively classifies dynamic Nepali Sign Language gestures using a CNN-LSTM architecture. Despite limitations such as the small dataset size and constrained hardware resources, the model showed promising results. This work demonstrates the potential of deep learning for sign language recognition and can be extended to more gestures in the future.
Key Features
- Real-time sign recognition using video sequences
- Dynamic sentence generation from detected gestures
- Bi-directional conversion: Text/Voice to NSL and NSL to Text/Voice