
Usage

Combining CNNs and Mediapipe

1. Preprocessing and Feature Extraction

Before classification, the hand landmarks extracted by Mediapipe are preprocessed and transformed into a form that the CNN can process efficiently:

  • Normalization: The landmark coordinates are normalized relative to the bounding box of the hand or the frame of the image, ensuring consistency regardless of hand size or camera distance.

$$\hat{x}_i = w x_i, \quad \hat{y}_i = h y_i$$

Where $(x_i, y_i)$ are the raw landmark coordinates, and $w$, $h$ are the width and height of the image frame.

  • Angle Features: Angles between specific joint positions (e.g., between the wrist, index finger, and thumb) can also be calculated using trigonometric functions, such as:

$$\theta = \arccos\left(\frac{(x_2 - x_1)(x_3 - x_1) + (y_2 - y_1)(y_3 - y_1)}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \cdot \sqrt{(x_3 - x_1)^2 + (y_3 - y_1)^2}}\right)$$

Where $\theta$ represents the angle between three landmarks: the wrist $(x_1, y_1)$, the index finger $(x_2, y_2)$, and the thumb $(x_3, y_3)$.

These extracted features are then passed into the CNN for classification.
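The two feature types described above can be sketched in a few lines of Python. This is a minimal illustration, assuming Mediapipe-style hand landmarks whose x and y values are already normalized to the [0, 1] range; the function names and the exact feature layout are illustrative, not the project's actual code.

```python
import numpy as np

def to_pixel_coords(landmarks, w, h):
    """Scale normalized landmarks (x, y in [0, 1]) by the frame size: x_hat = w * x, y_hat = h * y."""
    return np.array([(w * x, h * y) for x, y in landmarks], dtype=np.float32)

def angle_at(p1, p2, p3):
    """Angle theta at p1 between the vectors p1 -> p2 and p1 -> p3."""
    v1 = np.asarray(p2) - np.asarray(p1)
    v2 = np.asarray(p3) - np.asarray(p1)
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

def extract_features(landmarks, w, h):
    """Build the feature vector fed to the CNN: scaled coordinates plus the wrist/index/thumb angle."""
    pts = to_pixel_coords(landmarks, w, h)
    # Mediapipe hand-landmark indices: 0 = wrist, 8 = index fingertip, 4 = thumb tip.
    theta = angle_at(pts[0], pts[8], pts[4])
    return np.concatenate([pts.flatten(), [theta]])
```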

2. Gesture Classification Using CNN

Once the features are extracted, they are input into a Convolutional Neural Network for classification. During training, the CNN learns to map combinations of hand gestures (patterns of hand landmarks) to specific classes (such as letters, words, or commands).

The classification is performed using the Softmax activation function, which outputs the probability for each class $y_i$ (i.e., a sign language gesture):

$$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Where $z_i$ are the logits (raw outputs) from the final fully connected layer, and the result is a probability distribution over the gestures.
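As a rough illustration of this step, a compact 1-D CNN over the landmark feature vector could look like the Keras sketch below. The feature count, layer sizes, and number of classes are assumptions made for the example, not the project's actual architecture.

```python
import tensorflow as tf

NUM_FEATURES = 43  # e.g. 21 landmarks x (x, y) plus one angle feature (illustrative)
NUM_CLASSES = 26   # e.g. one class per alphabet letter (illustrative)

# Small 1-D CNN over the feature vector; the final Softmax layer yields per-class probabilities.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FEATURES, 1)),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # Softmax(z_i) = e^{z_i} / sum_j e^{z_j}
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```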

3. Post-Processing

After gesture classification, the recognized sign is output as text or as spoken words using Text-to-Speech (TTS). The text can be displayed on a screen for real-time communication or converted into audio output.
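For the audio path, one common option in Python is the offline pyttsx3 package; this is an assumption for the example, and the project may use a different TTS engine.

```python
import pyttsx3

def speak(text: str) -> None:
    """Read the recognized sign aloud using the local Text-to-Speech engine."""
    engine = pyttsx3.init()  # initialize the platform's default TTS driver
    engine.say(text)         # queue the recognized text
    engine.runAndWait()      # block until speech has finished

speak("Hello")  # e.g. the class label returned by the classifier
```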

