Interpretable Digit Classification using Handcrafted Features and Euclidean Distance

Author: Austin Benedicto
Mentor: Dr. Rabih Younes
Nichols School

Abstract

The rapid growth of deep learning has overshadowed simpler, interpretable approaches to image classification. This study presents an alternative method for classifying handwritten digits using a custom feature extraction pipeline applied to the MNIST dataset. Rather than relying on convolutional neural networks, the classifier is built upon engineered features such as loop count, corner detection, symmetry score, bounding box dimensions, and writing direction. After normalization and feature weighting, a Euclidean distance classifier is used to compare new images to per-digit feature averages. The model achieves moderate accuracy (roughly 51% in its best configuration) and reveals detailed patterns of confusion between similar digits. This interpretable framework offers educational value and may serve as a lightweight alternative in domains where transparency and explainability are prioritized.

Introduction

In recent years, machine learning and artificial intelligence have revolutionized image classification, with deep neural networks achieving state-of-the-art results across many datasets (Xie & Tu, 2015). However, these models are often criticized for their lack of interpretability: they demand massive computational resources and rely on opaque architectures that hinder trust in decision-making systems (Lundberg & Lee, 2017; Fan et al., 2021). Particularly in educational settings or lightweight applications, simpler alternatives with explainable mechanisms are highly desirable. Interpretability is not just a technical challenge but a critical requirement for deploying AI responsibly, especially when end users need to understand or contest a model's decisions (Lipton, 2018).

This study explores a transparent, handcrafted pipeline for digit classification using the MNIST dataset. Instead of relying on pretrained convolutional networks, we develop a modular feature extraction system that emphasizes human-understandable visual traits. The extracted features are numerical descriptors such as bounding box dimensions, center of mass, symmetry, corner and intersection counts, and directional gradients derived from skeletonized images. These features are used in a Euclidean distance classifier that matches new digits to the closest mean feature vector per digit class. The objective is to demonstrate the utility and challenges of building a fully interpretable classification pipeline from scratch.

Dataset and Preprocessing

The experiment utilizes the well-established MNIST dataset, a collection of 70,000 grayscale images of handwritten digits ranging from 0 to 9. Each image is 28×28 pixels in size and is paired with a corresponding digit label. For the purposes of this project, only the test set (10,000 samples) is used, with a configurable limit on how many samples per digit are extracted. Preprocessing begins by parsing the IDX-format image and label files into NumPy arrays, enabling efficient manipulation. The pixel values, originally ranging from 0 to 255, are binarized into black-and-white using a simple thresholding method. This step reduces noise and computational overhead for feature extraction algorithms, particularly those based on geometry and shape. Once binarized, each image is treated as a 2D grid where white pixels represent the strokes of the digit.
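
To make this step concrete, the following minimal Python sketch parses an IDX-format image file into a NumPy array and binarizes it with a fixed threshold. The function names and the threshold of 128 are illustrative assumptions rather than the project's exact code.

    import numpy as np

    def load_idx_images(path):
        # IDX3 image files start with a 4-byte magic number (2051),
        # followed by the image count and the row/column dimensions.
        with open(path, "rb") as f:
            magic = int.from_bytes(f.read(4), "big")
            assert magic == 2051, "not an IDX3 image file"
            n = int.from_bytes(f.read(4), "big")
            rows = int.from_bytes(f.read(4), "big")
            cols = int.from_bytes(f.read(4), "big")
            data = np.frombuffer(f.read(), dtype=np.uint8)
        return data.reshape(n, rows, cols)

    def binarize(image, threshold=128):
        # Pixels at or above the threshold become foreground (1); the
        # rest become background (0). The threshold is an assumption.
        return (image >= threshold).astype(np.uint8)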

To prepare for classification, the dataset is stratified by digit class. A defined number of samples per digit (e.g., 100 images each for digits 0 through 9) are selected and then split into training and test sets. The training set comprises 80% of each digit’s samples, which are used to compute the mean feature vector for that class. The remaining 20% are reserved for evaluation. This consistent stratified sampling ensures that the model is exposed to a balanced and diverse set of handwriting styles while maintaining generalization in testing. This methodology facilitates accurate evaluation of the classifier’s performance using confusion matrices and accuracy metrics.
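
A minimal sketch of this stratified split is shown below, assuming the labels are already loaded as a NumPy array; the per-class cap, random seed, and helper name are illustrative.

    def stratified_split(labels, per_class=100, train_frac=0.8, seed=0):
        # Select up to per_class samples for each digit, shuffle them,
        # and split 80/20 into training and test index lists.
        rng = np.random.default_rng(seed)
        train_idx, test_idx = [], []
        for digit in range(10):
            idx = rng.permutation(np.flatnonzero(labels == digit)[:per_class])
            cut = int(train_frac * len(idx))
            train_idx.extend(idx[:cut])
            test_idx.extend(idx[cut:])
        return np.array(train_idx), np.array(test_idx)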

Feature Extraction Pipeline

Instead of relying on pixel-based convolutional layers or learned representations, this study employs a handcrafted feature extraction pipeline that emphasizes interpretability and simplicity. Each image undergoes a series of geometric and spatial analyses to extract meaningful numerical features. The first feature is the pixel count, the number of active (white) foreground pixels in the binary image, which serves as a proxy for stroke density. The center of mass is then calculated by averaging the coordinates of all white pixels, providing insight into digit placement and skew. Bounding box dimensions are computed by identifying the outermost white pixels and determining the height and width of the smallest rectangle that encloses the digit, which is useful for distinguishing tall digits like 1 from wider digits like 8 or 0.
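
All three of these features reduce to a few lines of NumPy; the sketch below is one plausible implementation, with a hypothetical function name.

    def basic_geometry(binary):
        # Coordinates of all foreground (stroke) pixels in the 0/1 image.
        ys, xs = np.nonzero(binary)
        pixel_count = len(xs)                    # stroke density proxy
        center_of_mass = (xs.mean(), ys.mean())  # average stroke location
        width = xs.max() - xs.min() + 1          # bounding box width
        height = ys.max() - ys.min() + 1         # bounding box height
        return pixel_count, center_of_mass, (width, height)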

Several topological features are also extracted. Loop count is determined using OpenCV’s findContours function, which detects enclosed regions in the digit’s shape. This is especially informative for digits such as 6, 8, and 9, which may contain one or more loops. The corner count uses Harris corner detection applied to the skeletonized image, which minimizes redundant stroke thickness and enhances precision. Intersection count is calculated by analyzing the number of skeletonized pixels that have more than two white neighbors in an 8-connectivity pattern, which indicates points where strokes cross or branch. Corner and intersection counts provide structural detail critical for distinguishing between digits with similar silhouettes, such as 4 and 9.
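
The sketch below shows one plausible implementation of the loop and intersection counts, using OpenCV's contour hierarchy for loops (assuming OpenCV 4's two-value return) and a neighbor-counting convolution via SciPy, an assumed dependency; the project's exact parameters may differ.

    import cv2
    from scipy.ndimage import convolve

    def loop_count(binary):
        # RETR_CCOMP builds a two-level hierarchy: outer contours and
        # the holes inside them. Contours with a parent (index != -1)
        # are enclosed regions, i.e., loops.
        contours, hierarchy = cv2.findContours(
            binary * 255, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
        if hierarchy is None:
            return 0
        return int(np.sum(hierarchy[0][:, 3] != -1))

    def intersection_count(skeleton):
        # Count each skeleton pixel's 8-connected neighbors; more than
        # two neighbors marks a junction where strokes cross or branch.
        kernel = np.array([[1, 1, 1],
                           [1, 0, 1],
                           [1, 1, 1]])
        neighbors = convolve(skeleton.astype(int), kernel, mode="constant")
        return int(np.sum((skeleton > 0) & (neighbors > 2)))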

To further enhance the feature set, a symmetry score is calculated by reflecting the image horizontally and vertically and measuring the pixel overlap between the mirrored and original images. This allows for quantification of both vertical and horizontal symmetry, key traits in digits like 0 and 8. Finally, a directional feature is derived by skeletonizing the digit and computing the gradient flow, which is then subjected to a Fourier transform to isolate the dominant direction and its magnitude. This writing direction analysis is further divided across image quadrants, enabling localized directionality insights. Together, these features provide a compact yet rich representation of each digit, making the classification process interpretable and explainable.
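
One simple way to realize the symmetry score is to compare the image with its mirror and take the fraction of agreeing pixels, as in the sketch below. This assumes whole-image overlap; the project's version may compare foreground pixels only.

    def symmetry_scores(binary):
        # 1.0 means the image is identical to its mirror; lower values
        # indicate asymmetry along that axis.
        horizontal = float(np.mean(binary == np.flipud(binary)))  # top vs. bottom
        vertical = float(np.mean(binary == np.fliplr(binary)))    # left vs. right
        return horizontal, vertical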

Classification Strategy

Following the extraction of handcrafted features from the binarized and skeletonized images, classification is performed using a simple, interpretable method based on Euclidean distance to class averages. This method was chosen over more complex machine learning models to maintain full transparency in decision-making and provide clear insight into how features influence predictions. The pipeline first computes the average feature vector for each digit class (0 through 9) using the 80% training portion of the dataset. These averages represent the typical geometric and topological characteristics of each digit, such as mean loop count for eights or average horizontal symmetry for zeros.

Each feature is then normalized on a 0 to 1 scale across the dataset to prevent features with larger ranges (e.g., pixel count) from disproportionately affecting the Euclidean distance computation. Once normalization is complete, the classifier measures the straight-line (L2) distance between each test image’s feature vector and the average vector for each digit class. The digit whose class average yields the smallest distance is assigned as the predicted label for that image.
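
A minimal min-max normalization sketch over a feature matrix X (rows are images, columns are features); the guard against zero-range columns is an added assumption.

    def min_max_normalize(X):
        # Scale every feature column to [0, 1] across the dataset.
        lo, hi = X.min(axis=0), X.max(axis=0)
        span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
        return (X - lo) / span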

To enhance flexibility and allow for fine-tuning of the classification process, the system includes support for feature weighting. Each feature can be scaled by a custom weight during distance calculation, effectively increasing or decreasing its influence on the final prediction. This allows for experimentation with different feature importance values, guided by confusion matrices and performance trends. For instance, if corner count is found to be highly discriminative between certain digits (like 4 and 7), its weight can be increased to reflect its higher utility.
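
The weighted distance computation might look like the following sketch, where the weights multiply the squared per-feature differences; whether the project weights the differences or their squares is an assumption.

    def predict(x, class_means, weights):
        # class_means[d] is the mean feature vector of digit d (0-9).
        # The digit whose mean is closest in weighted L2 distance wins.
        distances = [np.sqrt(np.sum(weights * (x - mu) ** 2))
                     for mu in class_means]
        return int(np.argmin(distances))

Setting every weight to 1.0 recovers the unweighted baseline described earlier.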

The classifier outputs a confusion matrix that visualizes the true versus predicted labels across all classes, allowing for targeted diagnosis of where misclassifications occur. The overall accuracy is also computed as the percentage of correct predictions on the test set, providing a concise summary of classifier performance. This baseline approach is not only computationally inexpensive and highly interpretable but also lays the groundwork for more advanced ensemble techniques or data-driven weight optimization in future iterations of the project.
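
Both outputs are straightforward to compute; the sketch below builds the confusion matrix with rows as true labels and columns as predictions, an orientation assumed here.

    def evaluate(y_true, y_pred, n_classes=10):
        cm = np.zeros((n_classes, n_classes), dtype=int)
        for t, p in zip(y_true, y_pred):
            cm[t, p] += 1                   # row = true digit, col = predicted
        accuracy = np.trace(cm) / cm.sum()  # fraction of correct predictions
        return cm, accuracy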

Feature Extraction

Feature extraction is the core component of this project, as it forms the foundation upon which classification is based. Instead of using deep learning to automatically learn features from the data, this project focuses on manually engineered features: interpretable numerical attributes that describe various geometric and visual properties of handwritten digits. These features are designed to help distinguish between digit classes by capturing unique patterns, shapes, and structures in the binary images of the digits (Nguyen & Bai, 2020).

The feature extraction pipeline begins by reading grayscale MNIST digit images, which are normalized and binarized so that white pixels (indicating parts of the digit) are treated as foreground and black pixels as background. From there, a series of handcrafted features are computed. One of the most basic yet important features is the total number of white (foreground) pixels, which provides a rough measure of the digit’s thickness or density.

Another key feature is the bounding box area, which captures the size of the smallest rectangle that contains all white pixels of the digit. This is complemented by the center of mass, a two-dimensional coordinate (x, y) that indicates the average location of the white pixels. Together, these features provide spatial information about the digit’s spread and balance.

Corners are detected using a skeletonized version of the digit, followed by the Harris corner detection algorithm. This isolates sharp changes in pixel direction and curvature, giving insight into how angular the digit is. A digit like “4” or “7” tends to have many corners, while “0” or “8” might have fewer. In contrast, intersections are defined as pixels in the skeleton with three or more white neighbors: these typically appear at junctions where strokes branch or cross, such as in the middle of a “4” or “8”.
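
One plausible realization of the corner count applies OpenCV's cornerHarris to the float32 skeleton and thresholds the response map. The block size, aperture, Harris k, and threshold ratio below are illustrative defaults, and a production version would likely add non-maximum suppression so that clusters of high-response pixels are not counted as separate corners.

    def corner_count(skeleton, block_size=2, ksize=3, k=0.04, thresh=0.01):
        # The Harris response is high where gradient direction changes
        # sharply, i.e., at corners of the skeletonized stroke.
        response = cv2.cornerHarris(skeleton.astype(np.float32),
                                    block_size, ksize, k)
        # Count pixels whose response exceeds a fraction of the maximum.
        return int(np.sum(response > thresh * response.max()))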

Loop detection is another critical feature. Loops are identified by performing a flood fill on the background and counting the enclosed background regions (holes) that the fill cannot reach from the image border. This helps distinguish looped digits like “8” or “6” from non-looped ones like “1” or “7”.
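
A hedged sketch of this flood-fill idea, using SciPy's connected-component labeling (an assumed dependency): background regions that do not touch the image border are enclosed holes.

    from scipy.ndimage import label

    def loop_count_floodfill(binary):
        # Label connected background regions (4-connectivity by default).
        labeled, n_regions = label(binary == 0)
        # Regions touching the border are open background, not holes.
        border = set(labeled[0, :]) | set(labeled[-1, :]) \
               | set(labeled[:, 0]) | set(labeled[:, -1])
        border.discard(0)   # 0 marks foreground pixels, not a region
        return n_regions - len(border)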

Symmetry is calculated in two directions: horizontal and vertical. For horizontal symmetry, the top half of the digit is compared to the flipped bottom half, pixel by pixel. A similar process is used for vertical symmetry. The results are stored as decimal values between 0 and 1, where 1 indicates perfect symmetry. Digits like “8” are highly symmetric, while “5” is less so.

One of the more advanced features is writing direction, which analyzes the dominant flow of pen strokes in the digit. This is estimated by skeletonizing the digit and calculating gradient vectors between connected white pixels. The directions are summarized using angular histograms and averaged over four image quadrants to better capture local directional trends. The result includes both magnitude (the strength of directional flow) and angle (the orientation), which help differentiate digits based on how they are drawn: for example, a “2” may show strong rightward curvature, while a “7” may show sharp vertical and diagonal transitions.
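
The following simplified sketch estimates a dominant direction per quadrant using magnitude-weighted vector averaging of pixel gradients. The Fourier-based analysis described above is more involved, so treat this as an approximation of the idea rather than the project's method.

    def quadrant_directions(skeleton):
        # Per-pixel gradients of the skeleton image (rows, then columns).
        gy, gx = np.gradient(skeleton.astype(float))
        magnitude, angle = np.hypot(gx, gy), np.arctan2(gy, gx)
        h, w = skeleton.shape
        results = []
        for rows in (slice(0, h // 2), slice(h // 2, h)):
            for cols in (slice(0, w // 2), slice(w // 2, w)):
                m, a = magnitude[rows, cols], angle[rows, cols]
                if m.sum() == 0:
                    results.append((0.0, 0.0))   # empty quadrant
                    continue
                # Average unit vectors weighted by gradient magnitude.
                mx = np.sum(m * np.cos(a)) / m.sum()
                my = np.sum(m * np.sin(a)) / m.sum()
                results.append((float(np.hypot(mx, my)),
                                float(np.arctan2(my, mx))))
        return results   # (strength, orientation) per quadrant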

Finally, quarter-based features are also computed by dividing each image into four equal parts. For each quadrant, features like pixel density, average stroke width, and gradient flow are independently measured. This adds a layer of localized spatial analysis and can be particularly helpful when digits share global features but differ in their layout, such as “9” vs “4”.

In summary, the feature extraction process converts raw MNIST image data into a structured vector of interpretable numeric features that describe the digit’s shape, structure, and writing dynamics. These features are exported into a CSV file for use in the classification stage, enabling an interpretable, modular approach to handwritten digit recognition.
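
The export step might look like the following, assuming each image's features are collected into a dictionary; the column names and file path are illustrative.

    import csv

    def export_features(feature_rows, path="features.csv"):
        # feature_rows: list of dicts, one per image, e.g.
        # {"label": 8, "pixel_count": 134, "loop_count": 2, ...}
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=feature_rows[0].keys())
            writer.writeheader()
            writer.writerows(feature_rows)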

Classification Pipeline

Following feature extraction, digit classification was performed using a distance-based approach. Rather than leveraging external machine learning libraries, a custom nearest-class-mean classifier (a simplified relative of K-Nearest Neighbors) was implemented from scratch. The classifier calculates the Euclidean distance between each test image’s feature vector and the average feature vector of each digit class (0–9) computed from the training set.

Before computing distances, feature values were normalized to a [0,1] range using min-max scaling to avoid bias due to differing numeric scales. Additionally, the system allows for feature weighting, meaning that more important features (e.g., symmetry or loop count) can be assigned higher influence during classification. This modular design supports experimentation with different weighting schemes to optimize accuracy.

The model evaluates its predictions using a confusion matrix, precision metrics, and accuracy scores, enabling quantitative comparison of different feature sets and weighting strategies. Errors are visually inspected to guide iterative refinement of the feature extraction and classification logic.

Dataset and Experimental Setup

For this study, we utilized the MNIST dataset, a well-known benchmark for handwritten digit recognition. The full dataset consists of 70,000 labeled grayscale images of handwritten digits (0–9), each of size 28×28 pixels. Of these, 60,000 images are used for training and 10,000 for testing. However, in our experiment, we implemented a custom CSV-based approach using a limited subset of the MNIST data. Specifically, we processed and extracted features from a fixed number of samples per digit to maintain class balance and control computational complexity.

The system was designed in two main phases: feature extraction and classification. In the feature extraction phase, each image was transformed into a row in a CSV file, where each column represented a manually engineered feature such as pixel count, loop count, symmetry, bounding box, intersection count, etc. In total, over 20 distinct features were extracted and used in the classification phase.

The classification phase was implemented using a custom Euclidean distance classifier. For each digit class (0–9), we computed the average feature vector across the training samples. When a new test image was presented, its features were compared against each of the class averages using Euclidean distance, and the closest class was chosen as the prediction.

Additionally, we introduced feature weighting, allowing specific features to have more or less influence during classification based on their discriminative power. The classification results were tracked and evaluated using accuracy metrics and confusion matrices.

Accuracy and Confusion Matrix

The performance of the classifier was evaluated using a confusion matrix, which visually represents the number of correct and incorrect predictions for each digit. This matrix enabled us to quickly identify which digits were most frequently misclassified and which were most accurately predicted.

The initial model, without any feature weighting or tuning, achieved a moderate classification accuracy, with particularly strong performance on digits like “0” and “1,” which have distinct visual structures. Digits such as “5” and “3” were more commonly confused with each other due to their visual similarity, particularly in cursive or stylized handwriting.

After iteratively tuning the feature weights, we observed a notable improvement in classification performance, especially in reducing confusion between closely related digits. For example, giving more weight to loop count, intersections, and writing direction significantly helped in distinguishing digits like “6,” “8,” and “9.”

At its best configuration, the system reached an overall accuracy of 50.81%, with some digits like “1,” “0,” and “8” achieving near-perfect classification. The confusion matrix clearly reflected the impact of feature weighting, with off-diagonal errors shrinking in many digit classes.

To assess which features contributed most effectively to accurate digit classification, a series of weight tuning experiments and feature ablation tests were conducted. These experiments involved systematically adjusting the importance (weight) of individual features during the classification process and observing the resulting changes in accuracy. This approach allowed us to isolate the features with the greatest impact on distinguishing between visually similar digits.

One of the most consistently useful features was pixel count, which reflects the total number of non-background pixels in the digit image. This feature helped differentiate digits with dense strokes, like “8,” from those with minimal writing, such as “1.” Similarly, loop count proved to be highly informative, especially for identifying digits like “8,” which contains two loops, versus digits such as “0,” “6,” or “9,” which have one loop, and digits like “1” or “7,” which have none.

The corner count and intersection count, both derived from the skeletonized version of the digit, played a key role in identifying digits that involve sharp turns or complex branch-like structures. Digits such as “4” and “8” exhibited higher intersection counts due to multiple connecting lines, while digits like “1” and “7” had noticeably fewer corners. However, corner detection was found to be sensitive to image noise and line thickness, and improvements were made by fine-tuning the skeletonization algorithm (Siddiqi & Pizer, 2008).

Another valuable set of features came from analyzing symmetry. Horizontal and vertical symmetry scores helped to recognize digits with more balanced structures, such as “0,” “3,” and “8.” In contrast, digits like “5” and “2” exhibited more asymmetry, which aided in distinguishing them from others. Symmetry-based features were particularly helpful when pixel count or loops were not sufficient on their own.

Finally, one of the most advanced features used was the writing direction, computed from gradient vectors and angular motion across the digit’s skeleton. This feature helped capture the natural drawing flow of digits. For example, “2” typically starts with a curve that swings from the top left to the bottom right, while “5” often features a left-facing arc followed by a vertical drop. By dividing the image into quadrants and calculating directional vectors in each section, we were able to capture both global and local movement trends that further improved digit differentiation.

Overall, the combination of these features, both geometric and dynamic, provided a diverse and interpretable representation of handwritten digits. When these features were strategically weighted, they significantly improved the system’s ability to correctly classify even the most visually ambiguous samples.

Visualization and Debugging

Visualization tools played a crucial role in understanding model behavior. For each test sample, the system could plot the digit image, highlight detected corners, intersections, center of mass, bounding box, and even draw gradient arrows representing writing direction. These visuals helped validate that the feature extractor was working correctly and guided the adjustment of skeletonization, thresholding, and corner detection parameters (Siddiqi & Pizer, 2008).

Skeletonization outputs, in particular, revealed occasional anomalies, such as overly thick or broken lines due to imperfect thresholding. These were later corrected through pre-processing steps and adaptive thinning (Siddiqi & Pizer, 2008).

Summary of Results

Overall, the experimental results demonstrated that an interpretable, feature-based classifier can achieve reasonable performance on a complex task like digit recognition. While not competitive with modern convolutional neural networks (CNNs), this approach provides clear insights into how features contribute to classification. The system’s modularity also makes it easy to extend, optimize, and debug.

The key takeaway is that careful feature engineering and visualization can go a long way in building effective and explainable machine learning models, even for tasks typically reserved for deep learning (Nguyen & Bai, 2020).

Strengths of the Approach

One of the major strengths of this handwritten digit classification system is the interpretability of the features used. Unlike black-box models such as neural networks, which can achieve high accuracy but offer little transparency, this approach relies on intuitive and human-understandable features, such as corner counts, pixel density, symmetry, and writing direction. These features provide not only a basis for classification but also a valuable window into the structure and characteristics of handwritten digits. This makes the model especially useful for educational purposes, explainable AI research, and deployment in systems where traceability of decisions is important.

Additionally, the design emphasizes customization and modular testing. Because each feature is extracted individually and can be visualized, the model allows for fine-grained analysis of each image. Visualization tools, such as skeleton overlays, direction arrows, and bounding boxes, enhance interpretability and assist in identifying both successful and problematic classifications. Moreover, the implementation of feature weighting allows for dynamic tuning of the classifier to prioritize certain distinguishing characteristics for specific digits, significantly improving the robustness of the model.

Limitations

Despite these strengths, the system also has notable limitations. First, feature-based classification is inherently less flexible than deep learning models. While convolutional neural networks can learn thousands of nuanced features from training data, this system relies on a fixed set of manually engineered features. As a result, it may struggle to adapt to unusual handwriting styles or generalize to out-of-distribution samples (Nguyen & Bai, 2020).

Second, while some features such as pixel count and symmetry are stable across digits, others—particularly corner and intersection counts—are sensitive to noise and variations in stroke width. Even after applying skeletonization and refinement techniques, some digits still exhibit spurious feature detections in areas of high stroke density. These inaccuracies can mislead the classifier, especially for digits like “5” and “9” that have subtle structural differences (Siddiqi & Pizer, 2008).

Another challenge arises from the uniform scaling of feature distances. Since all features are normalized to the same scale before Euclidean distance is calculated, differences in feature stability and importance can be overlooked unless explicitly corrected with proper weighting. Without optimized weights, the classifier may be biased toward features that have larger variance or noise, reducing accuracy.

Implications for Future Work

The findings from this system reinforce the idea that simple, interpretable features can still perform competitively on classification tasks when properly designed and tuned. This supports the value of feature engineering in settings where model explainability is critical. Additionally, the ability to visualize the contribution of each feature creates opportunities for human-in-the-loop optimization and error analysis.

These results also open the door to future hybrid approaches. By combining the transparent logic of engineered features with the pattern recognition strength of machine learning models, it may be possible to create hybrid systems that provide both high accuracy and clear explanations. In educational settings, this system can serve as a baseline for teaching students the fundamentals of computer vision and classification without the overhead of deep learning frameworks (Nguyen & Bai, 2020).

Finally, the architecture’s modularity makes it well-suited for experimentation with novel features. Techniques like stroke order estimation, writing speed simulation, or temporal reconstruction of digit drawing paths may offer further improvements. The flexibility and transparency of the current system provide a solid foundation for continued exploration.

Conclusion

This research project presents an interpretable, feature-based approach to handwritten digit classification using the MNIST dataset. Unlike black-box deep learning models, the method relies on clearly defined, explainable features such as pixel count, symmetry, corner and intersection detection, writing direction via Fourier and gradient analysis, and geometric properties like bounding boxes and centers of mass. Through careful engineering and visualization of these features, the system offers valuable insight into how digits can be uniquely characterized by their visual structure (Nguyen & Bai, 2020).

The classifier itself uses a weighted Euclidean distance algorithm to compare new digit samples to statistical averages derived from a training set. This approach allows the model to make data-driven predictions while maintaining transparency and flexibility. Results were visualized via confusion matrices, highlighting both successful classifications and areas where the model struggled, such as differentiating between visually similar digits like 4 and 9. Weight adjustments to the features significantly improved accuracy by emphasizing the most discriminative properties.

One of the key contributions of this work lies in the balance between accuracy and interpretability. While modern deep learning approaches may achieve higher performance metrics, they often sacrifice explainability. This project demonstrates that through methodical feature selection and modular design, it is possible to achieve strong classification performance without abandoning transparency (Nguyen & Bai, 2020).

Looking forward, this framework serves as a robust foundation for further research into human-interpretable machine learning systems. By continuing to refine feature definitions, integrating hybrid techniques, and addressing edge cases through new innovations like stroke order simulation, the model can evolve to rival more complex approaches while remaining understandable and trustworthy.

In conclusion, this project highlights the potential of interpretable, modular AI systems to achieve meaningful results in computer vision tasks, with wide-ranging applications in education, transparency-focused AI development, and real-world deployment where explainability is paramount.

References

Fan, Y., Zhao, X., Wang, L., Wang, W., Wang, S., & Xu, M. (2021). A review on interpretability of artificial neural networks. Frontiers in Neurorobotics, 15, 752666. https://doi.org/10.3389/fnbot.2021.752666

Lipton, Z. C. (2018). The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Communications of the ACM, 61(10), 36–43. https://doi.org/10.1145/3233231

Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper_files/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html

Nguyen, T. T., & Bai, L. (2020). A review of traditional and deep learning-based feature descriptors for image classification. Journal of Big Data, 7(1), 1–32. https://link.springer.com/article/10.1186/s40537-020-00327-4

Siddiqi, K., & Pizer, S. M. (2008). Medial representations: Mathematics, algorithms and applications. Springer. https://link.springer.com/book/10.1007/978-1-4020-8658-3

Xie, S., & Tu, Z. (2015). Holistically-nested edge detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1395–1403). https://openaccess.thecvf.com/content_cvpr_2015/html/Xie_Holistically-Nested_Edge_Detection_2015_CVPR_paper.html


About the author

Austin Benedicto

Austin is a 12th-grade student at the Nichols School in Buffalo, New York. He enjoys studying computer science and robotics in school. Austin has been involved in the FIRST Robotics program at his school for the last 8 years, serving both as a team member and mentoring younger students. He also served as project manager on the coding subteam and has an interest in artificial intelligence.