Recognizing handwritten digits with scikit-learn

Problem statement:

The scikit-learn library ships with a number of datasets that are useful for testing data analysis and prediction tasks, and Digits is one of them. Some scientists claim that digits from this dataset can be predicted correctly 95% of the time. Perform a data analysis to check whether this hypothesis can be accepted.

Importing the required libraries:

#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier
import warnings 
warnings.filterwarnings("ignore")

Importing the dataset from the scikit-learn library:

# Importing dataset from sk-learn library
from sklearn import datasets
digits = datasets.load_digits()

Dataset description:

# Description of dataset
print(digits.DESCR)
.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 1797
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

.. topic:: References

  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.

# shape of dataset
print(digits.data.shape)
out: (1797, 64)
# plotting a single digit image
plt.gray()
plt.matshow(digits.images[0],cmap=plt.cm.hot,interpolation = 'nearest')
plt.show()

To apply a classifier to this data, we need to flatten the images, turning each 2-D array of grayscale values from shape (8, 8) into shape (64,). The whole dataset then has shape (n_samples, n_features), where n_samples is the number of images and n_features is the total number of pixels in each image.

# flatten the images
sample = len(digits.images)
data = digits.images.reshape((sample, -1))
print(data)
out:
[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]
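
As a quick sanity check on the flattening step, the sketch below compares the reshaped images with the dataset's ready-made feature matrix, digits.data. The variable name flat is introduced here only for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image into a 64-element row; -1 lets NumPy infer the size.
n_samples = len(digits.images)
flat = digits.images.reshape((n_samples, -1))

print(digits.images.shape)  # (n_samples, 8, 8)
print(flat.shape)           # (n_samples, 64)

# The flattened images should be identical to the ready-made feature matrix.
print(np.array_equal(flat, digits.data))
```

If the two arrays match, digits.data can be passed to the classifiers directly instead of reshaping digits.images by hand.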

Train/test split:

# splitting data into training and testing sets
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(data, digits.target,test_size = 0.3,random_state=0)
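
To see what this split produces, the following sketch repeats it and prints the sizes of both parts and the number of test images per digit. It uses digits.data in place of the reshaped array, on the assumption that the two hold the same values. With 1797 samples and test_size=0.3, 540 images end up in the test set, which is the total support reported in the classification reports further down.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Same settings as in the article: 30% test data, fixed random seed.
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0
)

print(x_train.shape, x_test.shape)

# Number of test images for each digit 0..9.
print(np.bincount(y_test, minlength=10))
```

The split is random rather than stratified, so the per-digit counts in the test set are not exactly equal.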

The digits dataset consists of 8x8 pixel images of digits. The images attribute of the dataset stores an 8x8 array of grayscale values for each image. We will use these arrays to visualize the images.

Visualizing the digits 0 to 9

# Visualization of digits
plt.figure(figsize=(18,9))
for index,(image,label) in enumerate(zip(digits.data[0:10], digits.target[0:10])):
    plt.subplot(2,5,index+1)
    plt.imshow(np.reshape(image,(8,8)),cmap=plt.cm.gray)
    plt.title('Digit %d\n'% label, fontsize =15 )

Prediction using:

1. SVC (support vector classifier)

# creating the model
svc = svm.SVC(gamma=0.001, C=100.)
# fitting the svc model
svc.fit(x_train,y_train)
#Prediction
y_pred = svc.predict(x_test)
# test sample prediction
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, prediction in zip(axes, x_test, y_pred):
    ax.set_axis_off()
    image = image.reshape(8, 8)
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title(f'Prediction: {prediction}')
    
# save the figure
plt.savefig('plot7.png', dpi=300, bbox_inches='tight')

SVC classification report and confusion matrix

#using classification report and confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
svc_confusion = confusion_matrix(y_test,y_pred)
print(f'Confusion matrix: \n{svc_confusion}\n\n')
print(f'Classification Report: \n{classification_report(y_test,y_pred)}')
Confusion matrix: 
[[45  0  0  0  0  0  0  0  0  0]
 [ 0 52  0  0  0  0  0  0  0  0]
 [ 0  0 52  0  0  0  0  1  0  0]
 [ 0  0  0 54  0  0  0  0  0  0]
 [ 0  0  0  0 48  0  0  0  0  0]
 [ 0  0  0  0  0 55  1  0  0  1]
 [ 0  0  0  0  0  0 60  0  0  0]
 [ 0  0  0  0  0  0  0 53  0  0]
 [ 0  1  0  0  0  0  0  0 60  0]
 [ 0  0  0  0  0  1  0  0  0 56]]


Classification Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        45
           1       0.98      1.00      0.99        52
           2       1.00      0.98      0.99        53
           3       1.00      1.00      1.00        54
           4       1.00      1.00      1.00        48
           5       0.98      0.96      0.97        57
           6       0.98      1.00      0.99        60
           7       0.98      1.00      0.99        53
           8       1.00      0.98      0.99        61
           9       0.98      0.98      0.98        57

    accuracy                           0.99       540
   macro avg       0.99      0.99      0.99       540
weighted avg       0.99      0.99      0.99       540
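
The numbers in the classification report can be recomputed from the confusion matrix itself. The sketch below copies the SVC matrix printed above into a NumPy array (rows are true labels, columns are predicted labels) and derives accuracy, recall and precision from it. The variable names are chosen here for illustration.

```python
import numpy as np

# SVC confusion matrix copied from the output above.
cm = np.array([
    [45,  0,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 0, 52,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 0,  0, 52,  0,  0,  0,  0,  1,  0,  0],
    [ 0,  0,  0, 54,  0,  0,  0,  0,  0,  0],
    [ 0,  0,  0,  0, 48,  0,  0,  0,  0,  0],
    [ 0,  0,  0,  0,  0, 55,  1,  0,  0,  1],
    [ 0,  0,  0,  0,  0,  0, 60,  0,  0,  0],
    [ 0,  0,  0,  0,  0,  0,  0, 53,  0,  0],
    [ 0,  1,  0,  0,  0,  0,  0,  0, 60,  0],
    [ 0,  0,  0,  0,  0,  1,  0,  0,  0, 56],
])

correct = np.diag(cm)                 # correctly classified images per digit
accuracy = correct.sum() / cm.sum()   # share of all test images on the diagonal
recall = correct / cm.sum(axis=1)     # per digit: correct / all true images of that digit
precision = correct / cm.sum(axis=0)  # per digit: correct / all images predicted as that digit

print(f'accuracy: {accuracy:.4f}')
for digit in range(10):
    print(digit, round(precision[digit], 2), round(recall[digit], 2))
```

Here 535 of the 540 test images lie on the diagonal, which is the 0.99 accuracy shown in the report.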

Confusion matrix heatmap for the SVC classifier

# heatmap for SVC classifier
plt.figure(figsize=(8,6))
s1=sns.heatmap(svc_confusion,square=True, annot=True,cmap='nipy_spectral_r', fmt='d', cbar=True)
plt.title('Confusion Matrix for SVC',fontsize = 15)
plt.xlabel('Predicted label', fontsize =15)
plt.ylabel('True label', fontsize =15)

# SVC accuracy score
svc_score = svc.score(x_test, y_test)
print('SVC accuracy score:',svc_score*100)
SVC accuracy score: 99.07407407407408

The SVC classifier reaches an accuracy of 99.07%
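
A single 70/30 split yields only one accuracy estimate, and that estimate depends on which images happen to land in the test set. As an optional cross-check that is not part of the original analysis, the sketch below runs 5-fold cross-validation with the same SVC settings. The fold count of 5 is an assumption made for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()

# Same hyperparameters as the SVC model in the article.
model = svm.SVC(gamma=0.001, C=100.)

# One accuracy value per fold; each fold serves as the test set once.
scores = cross_val_score(model, digits.data, digits.target, cv=5)

print(scores)
print('Mean cross-validated accuracy:', scores.mean() * 100)
```

The spread between folds gives a rough idea of how much the reported accuracy could vary with a different split.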

2. KNN classifier (K-nearest neighbors)

# creating the model
knn = KNeighborsClassifier()
# training the model
knn.fit(x_train,y_train)
# prediction
knn_y_predict = knn.predict(x_test)
#Accuracy score 
knn_score = knn.score(x_test, y_test)

KNN classification report and confusion matrix

#using classification report and confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
knn_confusion = confusion_matrix(y_test,knn_y_predict)
print(f'Confusion matrix: \n{knn_confusion}\n\n')
print(f'Classification Report: \n{classification_report(y_test,knn_y_predict)}')
Confusion matrix: 
[[45  0  0  0  0  0  0  0  0  0]
 [ 0 51  0  0  0  1  0  0  0  0]
 [ 0  0 52  0  0  0  0  1  0  0]
 [ 0  0  1 53  0  0  0  0  0  0]
 [ 0  0  0  0 47  0  0  1  0  0]
 [ 0  0  0  0  0 55  1  0  0  1]
 [ 0  0  0  0  0  0 60  0  0  0]
 [ 0  0  0  0  0  0  0 53  0  0]
 [ 0  1  0  1  0  0  1  0 58  0]
 [ 0  0  0  0  0  1  0  0  0 56]]


Classification Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        45
           1       0.98      0.98      0.98        52
           2       0.98      0.98      0.98        53
           3       0.98      0.98      0.98        54
           4       1.00      0.98      0.99        48
           5       0.96      0.96      0.96        57
           6       0.97      1.00      0.98        60
           7       0.96      1.00      0.98        53
           8       1.00      0.95      0.97        61
           9       0.98      0.98      0.98        57

    accuracy                           0.98       540
   macro avg       0.98      0.98      0.98       540
weighted avg       0.98      0.98      0.98       540
# heatmap for KNN Classifier
plt.figure(figsize=(8,6))
sns.heatmap(knn_confusion,square=True, annot=True,cmap='nipy_spectral_r', fmt='d', cbar=True)
plt.title('Confusion Matrix for KNN',fontsize = 15)
plt.xlabel('Predicted label', fontsize =15)
plt.ylabel('True label', fontsize =15)

# KNN accuracy score
print('KNN accuracy score:',knn_score*100)
KNN accuracy score: 98.14814814814815

The K-nearest neighbors classifier reaches an accuracy of 98.15%

3. Logistic regression classifier

# creating the model
lr = LogisticRegression()
# fitting the model
lr.fit(x_train,y_train)
# prediction
lr_y_pred = lr.predict(x_test)
# prediction score
lr_score = lr.score(x_test, y_test)
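
With default settings, LogisticRegression may stop before it has converged on the raw 0..16 pixel values and emit a ConvergenceWarning, which the warnings filter at the top of this article would hide. The sketch below is one possible variant, not the model used for the results reported here: it standardizes the features and raises the iteration limit. The name pipe and the value max_iter=1000 are assumptions chosen for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0
)

# Scale each pixel feature, then fit logistic regression with more iterations.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(x_train, y_train)

pipe_score = pipe.score(x_test, y_test)
print('Accuracy of the scaled logistic regression:', pipe_score * 100)
```

Because the scaler is part of the pipeline, it is fitted on the training data only and then applied unchanged to the test data.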

Logistic regression classification report and confusion matrix

# Classification report and confusion matrix for Logistic Classification
from sklearn.metrics import confusion_matrix, classification_report
lr_confusion = confusion_matrix(y_test,lr_y_pred)
print(f'Confusion matrix: \n{lr_confusion}\n\n')
print(f'Classification Report: \n{classification_report(y_test,lr_y_pred)}')
Confusion matrix: 
[[45  0  0  0  0  0  0  0  0  0]
 [ 0 49  0  0  0  0  0  0  2  1]
 [ 0  2 49  2  0  0  0  0  0  0]
 [ 0  0  0 52  0  0  0  0  1  1]
 [ 0  0  0  0 47  0  0  1  0  0]
 [ 0  0  0  0  0 55  0  0  0  2]
 [ 0  1  0  0  0  0 59  0  0  0]
 [ 0  0  0  1  1  0  0 51  0  0]
 [ 0  3  1  0  0  0  0  0 53  4]
 [ 0  0  0  0  0  1  0  0  1 55]]


Classification Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        45
           1       0.89      0.94      0.92        52
           2       0.98      0.92      0.95        53
           3       0.95      0.96      0.95        54
           4       0.98      0.98      0.98        48
           5       0.98      0.96      0.97        57
           6       1.00      0.98      0.99        60
           7       0.98      0.96      0.97        53
           8       0.93      0.87      0.90        61
           9       0.87      0.96      0.92        57

    accuracy                           0.95       540
   macro avg       0.96      0.96      0.96       540
weighted avg       0.96      0.95      0.95       540
# heatmap of Logistic Classifier
plt.figure(figsize=(8,6))
sns.heatmap(lr_confusion,square=True, annot=True,cmap='nipy_spectral_r', fmt='d', cbar=True)
plt.title('Confusion Matrix for Logistic Classification ',fontsize = 15)
plt.xlabel('Predicted label', fontsize =15)
plt.ylabel('True label', fontsize =15)

# score
lr_score = lr.score(x_test, y_test)
print('Accuracy Score of Logistic Classifier:',(lr_score*100))
Accuracy Score of Logistic Classifier: 95.37037037037037

The logistic regression classifier reaches an accuracy of 95.37%

4. Random forest classifier

# creating the model
rf = RandomForestClassifier()
# fitting the model
rf.fit(x_train,y_train)
# prediction
rf_y_pred = rf.predict(x_test)
# prediction score
rf_score = rf.score(x_test, y_test)

Random forest classification report and confusion matrix

# Classification report and confusion matrix for Random forest
from sklearn.metrics import confusion_matrix, classification_report
rf_confusion = confusion_matrix(y_test,rf_y_pred)
print(f'Confusion matrix: \n{rf_confusion}\n\n')
print(f'Classification Report: \n{classification_report(y_test,rf_y_pred)}')
Confusion matrix: 
[[45  0  0  0  0  0  0  0  0  0]
 [ 0 51  0  0  0  1  0  0  0  0]
 [ 1  1 51  0  0  0  0  0  0  0]
 [ 0  0  0 54  0  0  0  0  0  0]
 [ 0  0  0  0 46  0  0  2  0  0]
 [ 0  0  0  1  0 55  1  0  0  0]
 [ 0  0  0  0  0  0 60  0  0  0]
 [ 0  0  0  0  0  0  0 53  0  0]
 [ 0  1  0  1  0  0  0  1 58  0]
 [ 0  0  0  1  0  0  0  0  0 56]]


Classification Report: 
              precision    recall  f1-score   support

           0       0.98      1.00      0.99        45
           1       0.96      0.98      0.97        52
           2       1.00      0.96      0.98        53
           3       0.95      1.00      0.97        54
           4       1.00      0.96      0.98        48
           5       0.98      0.96      0.97        57
           6       0.98      1.00      0.99        60
           7       0.95      1.00      0.97        53
           8       1.00      0.95      0.97        61
           9       1.00      0.98      0.99        57

    accuracy                           0.98       540
   macro avg       0.98      0.98      0.98       540
weighted avg       0.98      0.98      0.98       540
# Heatmap for Random Forest Classifier
plt.figure(figsize=(8,6))
sns.heatmap(rf_confusion,square=True, annot=True,cmap='nipy_spectral_r', fmt='d', cbar=True)
plt.title('Confusion Matrix for Random Forest Classifier',fontsize = 15)
plt.xlabel('Predicted label', fontsize =15)
plt.ylabel('True label', fontsize =15)

# score
rf_score = rf.score(x_test, y_test)
print('Accuracy Score of Random Forest Classifier:',(rf_score*100))
Accuracy Score of Random Forest Classifier: 97.96296296296296

The random forest classifier reaches an accuracy of 97.96%

Some scientists claim that the digit is predicted correctly 95% of the time. To check this hypothesis, we take the mean accuracy of all 4 classifiers.

# calculating mean accuracy of all 4 classifiers
avg_accuracy = (svc_score+lr_score+knn_score+rf_score)/4
print('Mean accuracy of all the classifiers is:',round(avg_accuracy*100,2))
Mean accuracy of all the classifiers is: 97.64

Our classifiers give a mean accuracy of 97.64%
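
Comparing a mean with 95% is an informal check. A slightly more formal one, offered here only as a sketch and not as part of the original analysis, is a one-sided binomial test per classifier: if the true accuracy were exactly 95%, how likely would it be to get at least this many of the 540 test images right? The counts of correct predictions are the diagonal sums of the confusion matrices above. The test treats each test image as an independent trial, which is an assumption.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_test = 540     # size of the test set
claimed = 0.95   # accuracy stated in the hypothesis

# Correct predictions per classifier (diagonal sums of the confusion matrices).
correct = {
    'SVC': 535,
    'KNN': 530,
    'Logistic regression': 515,
    'Random forest': 529,
}

p_values = {name: binom_upper_tail(k, n_test, claimed) for name, k in correct.items()}

for name, k in correct.items():
    print(f'{name}: accuracy {k / n_test:.4f}, p-value {p_values[name]:.3g}')
```

A small p-value suggests that the classifier's accuracy is above 95%. A large one means the result is compatible with an accuracy of exactly 95%.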

Conclusion:

The four classifiers predict the digits correctly 97.64% of the time on average, and each of them individually exceeds 95% on the test set, so we can conclude that the results are consistent with the stated hypothesis.

Thank you!

GitHub link: https://github.com/lawish/Suven-Consultants-technology-internship/blob/main/Recognizing%20Handwriting%20Digits%20(1).ipynb