Dynamic Nepali Sign Language Recognition
CNN + LSTM Model for Gesture-Based Communication

Project Overview
- Objective: Build an AI system to dynamically recognize Nepali Sign Language (NSL) gestures and bridge the communication gap between deaf-mute and hearing-speaking individuals.
- Tools & Technologies: Python, TensorFlow, InceptionV3, LSTM, FFmpeg, NumPy, Google Colab
Problem Statement
Many deaf-mute individuals in Nepal lack early exposure to Nepali Sign Language (NSL), making communication with the general public challenging. There is a shortage of interpreters, and most families lack NSL proficiency. This project aims to create a system that can dynamically translate NSL gestures into text/voice and vice versa to bridge this communication gap.
Dataset
Due to the lack of publicly available NSL datasets, a custom dataset was created for five NSL signs: father, food, promise, tea, and wife. Each sign was performed 20 times under different lighting and background conditions, giving 80 videos per class (66 for training, 14 for testing). Data augmentation then expanded the dataset to 1,650 videos in total (330 per class); the augmentation operations are sketched below.
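The following is a minimal sketch of the augmentation operations (flipping, rotation, contrast manipulation) written with OpenCV; the rotation angle and contrast factor are not stated in the project, so the values here are illustrative assumptions.

```python
import cv2

def augment_frame(frame, angle=10.0, contrast=1.2):
    """Return flipped, rotated, and contrast-adjusted variants of a single frame.

    angle and contrast are illustrative values; the project's actual
    augmentation parameters are not specified.
    """
    h, w = frame.shape[:2]
    flipped = cv2.flip(frame, 1)                                     # horizontal flip
    rot_mat = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)    # rotate about the centre
    rotated = cv2.warpAffine(frame, rot_mat, (w, h))
    contrasted = cv2.convertScaleAbs(frame, alpha=contrast, beta=0)  # scale pixel intensities
    return flipped, rotated, contrasted
```

Applying the same transform to every frame of a clip yields an additional augmented copy of that video.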
Preprocessing
- Augmented videos using flipping, rotation, and contrast manipulation.
- Extracted frames using FFmpeg and generated a metadata CSV.
- Downsampled each video to 40 frames to reduce overfitting.
- Extracted features using InceptionV3, producing a 40x2048 feature matrix per video (see the sketch below).
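A minimal sketch of the downsampling and feature-extraction steps, assuming frames have already been dumped to disk with FFmpeg (e.g. `ffmpeg -i clip.mp4 frames/%04d.jpg`); the function names and resizing details are illustrative.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

SEQ_LEN = 40      # frames kept per video after downsampling
FEAT_DIM = 2048   # size of InceptionV3's pooled feature vector

# InceptionV3 pretrained on ImageNet; global average pooling gives one 2048-d vector per frame.
feature_extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def downsample(frames, seq_len=SEQ_LEN):
    """Pick seq_len evenly spaced frames from a full video."""
    idx = np.linspace(0, len(frames) - 1, seq_len).astype(int)
    return [frames[i] for i in idx]

def video_to_features(frames):
    """Convert a list of HxWx3 frames into a (40, 2048) feature matrix."""
    frames = downsample(frames)
    batch = np.stack([tf.image.resize(f, (299, 299)).numpy() for f in frames])
    batch = preprocess_input(batch)                      # scale inputs to [-1, 1]
    return feature_extractor.predict(batch, verbose=0)   # shape: (SEQ_LEN, FEAT_DIM)
```

Each video is thereby reduced to a 40x2048 matrix, which is the input the LSTM layers in the next section consume.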
Methodology
- Feature Extraction: Used InceptionV3 (pretrained on ImageNet) to extract spatial features from video frames.
- Temporal Modeling: Passed features to LSTM layers to model sequential dependencies.
- Model Architecture (sketched in code after this list):
  - Two LSTM layers (256 and 128 units)
  - Dense + ReLU + Dropout layers
  - Final Dense layer with softmax for classification
- Training Configuration:
  - Loss function: categorical crossentropy
  - Optimizer: Adam (lr=1e-5, decay=1e-5)
  - Metrics: accuracy, top-k accuracy
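A minimal Keras sketch matching the architecture and training configuration above; the Dense width and dropout rate are not given in the text, so the values here are assumptions.

```python
from tensorflow.keras import layers, models, optimizers

NUM_CLASSES = 5  # father, food, promise, tea, wife

model = models.Sequential([
    layers.Input(shape=(40, 2048)),           # 40 frames x 2048 InceptionV3 features
    layers.LSTM(256, return_sequences=True),  # first LSTM layer
    layers.LSTM(128),                         # second LSTM layer
    layers.Dense(64, activation="relu"),      # width of 64 is an assumption
    layers.Dropout(0.5),                      # dropout rate of 0.5 is an assumption
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    loss="categorical_crossentropy",
    # decay=1e-5 as reported; older tf.keras optimizers accept a `decay` argument,
    # newer versions express it as a learning-rate schedule instead.
    optimizer=optimizers.Adam(learning_rate=1e-5),
    metrics=["accuracy", "top_k_categorical_accuracy"],
)
```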
Experimental Results
- Trained for 100 epochs on Google Colab with GPU support (training call sketched after this list)
- Initial model with 66 videos/class → 98% training accuracy, 20% validation accuracy (overfitting)
- Augmented dataset to 330 videos/class → 89% training accuracy, 59% validation accuracy
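An illustrative training and evaluation call matching the 100-epoch setup; the array names and batch size are assumptions.

```python
# train_features/val_features: (N, 40, 2048) arrays; labels are one-hot over the 5 classes.
history = model.fit(
    train_features, train_labels,
    validation_data=(val_features, val_labels),
    epochs=100,
    batch_size=8,   # batch size is not stated in the text; 8 is an assumption
)
val_loss, val_acc, val_top_k = model.evaluate(val_features, val_labels)
```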
Conclusion
The system effectively classifies dynamic Nepali Sign Language gestures using a CNN-LSTM architecture. Despite limitations such as the small dataset size and constrained hardware resources, the model showed promising results. This work demonstrates the potential of deep learning for sign language recognition and can be extended to more gestures in the future.
Key Features
- Real-time sign recognition using video sequences
- Dynamic sentence generation from detected gestures
- Bi-directional conversion: Text/Voice to NSL and NSL to Text/Voice