A Practical SVM Example: Handwritten Digit Recognition

For more updates, please subscribe to my blog @naivedatascientist.co.in

A classic problem in pattern recognition is handwritten digit recognition. Suppose you have images of handwritten digits from 0 to 9, written by different people inside boxes of a fixed size, much like the application forms used by banks and universities.
The goal is to build a model that correctly identifies the digit (0 to 9) written in an image. For this task we use the MNIST data, a large database of handwritten digits. The pixel values of each digit (image) are the features, and the actual number from 0 to 9 is the label. Since each image is 28 x 28 pixels and each pixel is a feature, there are 784 features.
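To make the 28 x 28 = 784 arithmetic concrete, here is a minimal sketch of how an image becomes a feature vector and back. The random array is a hypothetical stand-in for a real MNIST image, not data from the article.

```python
import numpy as np

# Hypothetical stand-in for one MNIST image: 28 x 28 grayscale pixel values.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))

# Flatten row by row into a single feature vector, one feature per pixel.
features = image.reshape(-1)
print(features.shape)  # (784,)

# Reshaping back recovers the original image exactly.
restored = features.reshape(28, 28)
print(np.array_equal(image, restored))  # True
```

The same reshape(28, 28) call is what turns a CSV row back into a picture for plotting.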

Reading and inspecting the data.

Importing the required libraries
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn import svm
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import scale
import gc

Reading the data

digit=pd.read_csv("DIGIT_RECOGNITION_SVM\\train.csv")
digit.head()
digit.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42000 entries, 0 to 41999
Columns: 785 entries, label to pixel783
dtypes: int64(785)
digit.describe()
digit.isnull().sum(axis=1).any()
False

digit.isnull().sum(axis=0).any()
False

As the two checks above show, the data contain no missing values.
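The row-wise and column-wise checks can also be written as a single test over every cell. The sketch below uses a hypothetical toy frame with one deliberately missing value, so the check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing pixel value (not the MNIST data).
toy = pd.DataFrame(
    {"label": [3, 7], "pixel0": [0.0, np.nan], "pixel1": [12.0, 255.0]}
)

# One check over every cell: True if anything is missing anywhere.
has_missing = toy.isnull().values.any()

# Per-column counts show where the gaps are.
missing_per_column = toy.isnull().sum()

print(has_missing)
print(missing_per_column[missing_per_column > 0])
```

On the digit data, the same has_missing expression would be expected to come out False, in line with the two checks above.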

Plotting a few digits.

plt.figure(figsize=(12,6))
plt.subplot(1,4,1)
zero=digit.iloc[1,1:]
zero=zero.values.reshape(28,28)
plt.imshow(zero,cmap='gray')
plt.subplot(1,4,2)
one=digit.iloc[0,1:]
one=one.values.reshape(28,28)
plt.imshow(one,cmap='gray')
plt.subplot(1,4,3)
two = digit.iloc[16, 1:]
two = two.values.reshape(28, 28)
plt.imshow(two, cmap='gray')
plt.subplot(1,4,4)
three=digit.iloc[7,1:]
three=three.values.reshape(28,28)
plt.imshow(three,cmap='gray')

digit['label'].value_counts()
1    4684
7    4401
3    4351
9    4188
2    4177
6    4137
0    4132
4    4072
8    4063
5    3795
round(digit['label'].value_counts()/len(digit.index)*100,2)
1    11.15
7    10.48
3    10.36
9     9.97
2     9.95
6     9.85
0     9.84
4     9.70
8     9.67
5     9.04
Each digit makes up roughly 10% of the data (between 9.04% and 11.15%), so the classes are fairly balanced. This matters because SVMs usually do not perform well on imbalanced data.
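As a quick cross-check of the balance claim, the shares can be recomputed from the counts printed above. The max/min ratio is an extra summary introduced here for illustration, not something the original analysis reports.

```python
# Label counts copied from the value_counts() output above.
label_counts = {1: 4684, 7: 4401, 3: 4351, 9: 4188, 2: 4177,
                6: 4137, 0: 4132, 4: 4072, 8: 4063, 5: 3795}

total = sum(label_counts.values())
shares = {digit: 100 * n / total for digit, n in label_counts.items()}

# Ratio of the most frequent class to the least frequent one;
# 1.0 would mean perfectly balanced classes.
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

for digit in sorted(shares):
    print(f"digit {digit}: {shares[digit]:.2f}%")
print(f"max/min class ratio: {imbalance_ratio:.2f}")
```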
Splitting into X and Y
x=digit.drop('label',axis=1)
y=digit['label']
x=scale(x)
Splitting into train and test sets, where the training set is 10% of the data and the test set is 90%.
x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.10, random_state=10)
print('x_train : ',x_train.shape)
print('x_test : ',x_test.shape)
print('y_train : ',y_train.shape)
print('y_test : ',y_test.shape)
x_train :  (4200, 784)
x_test :  (37800, 784)
y_train :  (4200,)
y_test :  (37800,)
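The code above scales the full data set before splitting, so the test rows contribute to the scaling statistics. A common alternative, sketched here under stated assumptions, is to fit the scaler on the training rows only. The sketch uses scikit-learn's bundled 8 x 8 digits as a stand-in for the MNIST CSV, and the stratify argument is an addition that the original split does not use.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled 8 x 8 digits, not the MNIST CSV.
x, y = load_digits(return_X_y=True)

# Same 10% / 90% proportions as above; stratify keeps the class shares
# roughly equal in both parts (an assumption, not in the original code).
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.10, random_state=10, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(x_train_scaled.shape, x_test_scaled.shape)
```

Whether this changes the reported accuracies on MNIST is not something the article measures; the sketch only shows the mechanics.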

Building the model.

Linear model
model_linear = SVC(kernel='linear')
model_linear.fit(x_train,y_train)
predict
y_pred=model_linear.predict(x_test)
y_pred[:10]
array([7, 3, 9, 8, 6, 9, 7, 7, 9, 6], dtype=int64)
Accuracy
print("accuracy:", metrics.accuracy_score(y_true=y_test, y_pred=y_pred), "\n")
Confusion matrix
print(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred))
accuracy: 0.8999470899470899
Class-wise precision, recall, and F1
class_wise_accuracy=metrics.classification_report(y_true=y_test,y_pred=y_pred)
print(class_wise_accuracy)
              precision    recall  f1-score   support

           0       0.93      0.97      0.95      3716
           1       0.93      0.98      0.95      4201
           2       0.87      0.88      0.87      3767
           3       0.88      0.88      0.88      3949
           4       0.85      0.93      0.88      3624
           5       0.87      0.86      0.87      3456
           6       0.95      0.94      0.94      3692
           7       0.92      0.89      0.91      3975
           8       0.91      0.82      0.86      3672
           9       0.89      0.84      0.86      3748
gc.collect()

22

Non-linear model

Nonlinear SVM
non_linear_model=svm.SVC(kernel='rbf')
non_linear_model.fit(x_train,y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
Predict
prediction= non_linear_model.predict(x_test)

Accuracy
print(metrics.accuracy_score(y_true=y_test, y_pred=prediction))
0.9201851851851852
Confusion matrix
print(metrics.confusion_matrix(y_true=y_test,y_pred=prediction))
[[3587    0   40    5    6   10   39    3   24    2]
 [   0 4111   25   21    9    5    9    4   12    5]
 [  19   16 3543   32   48    4   20   22   54    9]
 [  12   19  219 3465    8   74    9   36   81   26]
 [   5    7   87    0 3392   10   18   13    5   87]
 [  27   12   70  107   35 3063   69    6   37   30]
 [  20    6  110    0   13   35 3492    0   16    0]
 [   8   33  173    7   69    3    2 3551    6  123]
 [  18   52   72   67   25   81   18   14 3279   46]
 [  15   11   94   57  119    9    1  112   30 3300]]
Class-wise precision, recall, and F1
class_wise_accuracy=metrics.classification_report(y_true=y_test,y_pred=prediction)
print(class_wise_accuracy)
                precision   recall   f1-score    support

           0       0.97      0.97      0.97      3716
           1       0.96      0.98      0.97      4201
           2       0.80      0.94      0.86      3767
           3       0.92      0.88      0.90      3949
           4       0.91      0.94      0.92      3624
           5       0.93      0.89      0.91      3456
           6       0.95      0.95      0.95      3692
           7       0.94      0.89      0.92      3975
           8       0.93      0.89      0.91      3672
           9       0.91      0.88      0.89      3748

   micro avg       0.92      0.92      0.92     37800
   macro avg       0.92      0.92      0.92     37800
weighted avg       0.92      0.92      0.92     37800
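The figures in the report can be recomputed from the RBF confusion matrix printed above, where rows are true labels and columns are predicted labels. This sketch copies that matrix and derives overall accuracy, per-class recall (diagonal over row sum), and per-class precision (diagonal over column sum).

```python
import numpy as np

# RBF-kernel confusion matrix copied from the output above
# (rows: true digit, columns: predicted digit).
cm = np.array([
    [3587,    0,   40,    5,    6,   10,   39,    3,   24,    2],
    [   0, 4111,   25,   21,    9,    5,    9,    4,   12,    5],
    [  19,   16, 3543,   32,   48,    4,   20,   22,   54,    9],
    [  12,   19,  219, 3465,    8,   74,    9,   36,   81,   26],
    [   5,    7,   87,    0, 3392,   10,   18,   13,    5,   87],
    [  27,   12,   70,  107,   35, 3063,   69,    6,   37,   30],
    [  20,    6,  110,    0,   13,   35, 3492,    0,   16,    0],
    [   8,   33,  173,    7,   69,    3,    2, 3551,    6,  123],
    [  18,   52,   72,   67,   25,   81,   18,   14, 3279,   46],
    [  15,   11,   94,   57,  119,    9,    1,  112,   30, 3300],
])

correct = np.diag(cm)
accuracy = correct.sum() / cm.sum()
recall = correct / cm.sum(axis=1)      # share of each true digit found
precision = correct / cm.sum(axis=0)   # share of each prediction that is right

print(f"accuracy: {accuracy:.4f}")
for digit in range(10):
    print(f"{digit}: precision={precision[digit]:.2f} recall={recall[digit]:.2f}")
```

Reading the matrix this way also explains the low precision for digit 2: its column collects many misclassified 3s and 7s (219 and 173 respectively).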
There is a slight increase in accuracy with the RBF kernel (0.920 versus 0.900 for the linear kernel). Next, we use grid search with cross-validation to tune the hyperparameters.
creating a KFold object with 5 splits 
folds = KFold(n_splits = 5, shuffle = True, random_state = 10)
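To see what this KFold object does with the 4,200 training rows, the sketch below splits a placeholder array of the same length and records how many rows each fold fits on and validates on. The zero array stands in for x_train; only its length matters.

```python
import numpy as np
from sklearn.model_selection import KFold

n_train = 4200  # number of training rows reported above
folds = KFold(n_splits=5, shuffle=True, random_state=10)

# Placeholder array standing in for x_train; only its length is used.
placeholder = np.zeros((n_train, 1))

fold_sizes = []
for i, (fit_idx, val_idx) in enumerate(folds.split(placeholder)):
    fold_sizes.append((len(fit_idx), len(val_idx)))
    print(f"fold {i}: fit on {len(fit_idx)} rows, validate on {len(val_idx)} rows")
```

Each candidate model is therefore fitted on 3,360 rows and scored on the remaining 840, five times over.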

Specifying the range of hyperparameters to search by cross-validation
hyper_params = [ {'gamma': [1e-2, 1e-3, 1e-4],
                     'C': [1, 10, 100]}]
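The grid is small, but every combination is fitted once per fold. A minimal sketch with scikit-learn's ParameterGrid enumerates the candidates and counts the cross-validation fits; the fold count of 5 is taken from the KFold object above.

```python
from sklearn.model_selection import ParameterGrid

hyper_params = [{'gamma': [1e-2, 1e-3, 1e-4],
                 'C': [1, 10, 100]}]

# Every (C, gamma) combination that the grid search will try.
candidates = list(ParameterGrid(hyper_params))
n_folds = 5
total_fits = len(candidates) * n_folds

for params in candidates:
    print(params)
print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
```

With the default refit=True, GridSearchCV adds one more fit of the best candidate on the whole training set.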


specify model
model = SVC(kernel="rbf")

set up GridSearchCV()
model_cv = GridSearchCV(estimator = model, 
                        param_grid = hyper_params, 
                        scoring= 'accuracy',
                        n_jobs=-1,
                        cv = folds, 
                        verbose = 1,
                        return_train_score=True)      

fit the model
model_cv.fit(x_train, y_train)
cv_results=pd.DataFrame(model_cv.cv_results_)

cv_results

Converting C into numeric type to plot on the x axis
cv_results['param_C']=cv_results['param_C'].astype('int')
plt.figure(figsize=(15,7))

plt.subplot(1,3,1)
gamma_01 = cv_results[cv_results['param_gamma']==0.01]
plt.plot(gamma_01["param_C"], gamma_01["mean_test_score"])
plt.plot(gamma_01["param_C"], gamma_01["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.01")
plt.ylim([0.60, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='lower left')
plt.xscale('log')

plt.subplot(1,3,2)
gamma_001 = cv_results[cv_results['param_gamma']==0.001]
plt.plot(gamma_001["param_C"], gamma_001["mean_test_score"])
plt.plot(gamma_001["param_C"], gamma_001["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.001")
plt.ylim([0.60, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='lower left')
plt.xscale('log')

plt.subplot(1,3,3)
gamma_0001 = cv_results[cv_results['param_gamma']==0.0001]
plt.plot(gamma_0001["param_C"], gamma_0001["mean_test_score"])
plt.plot(gamma_0001["param_C"], gamma_0001["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.0001")
plt.ylim([0.60, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='lower left')
plt.xscale('log')

printing the optimal accuracy score and hyperparameters
best_score = model_cv.best_score_
best_hyperparams = model_cv.best_params_

print("The best test score is {0} corresponding to hyperparameters {1}".format(best_score, best_hyperparams))

The best test score is 0.9209523809523809 corresponding to hyperparameters {'C': 10, 'gamma': 0.001}

From the plots above we can conclude:

1) At gamma=0.01, the model reaches roughly 75% accuracy on the test data but fits the training data almost perfectly, so it overfits heavily.

2) At gamma=0.001, the two accuracies are comparable at C=1, but as C increases the model starts to overfit: training accuracy keeps rising while test accuracy does not.

3) At gamma=0.0001, the model does not overfit up to C=10, but at large values of C it tends to overfit.

We can therefore conclude that the best combination is C=10 and gamma=0.001, where the test accuracy is highest (~92%).
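One way to build intuition for why the larger gamma overfits is to look at the RBF kernel itself, k(a, b) = exp(-gamma * ||a - b||^2). The sketch below evaluates it for a hypothetical pair of points that differ by 1 in each of the 784 features. The points are illustrative, not taken from the data.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

# Hypothetical pair of points that differ by 1 in each of the 784 features,
# so their squared distance is 784.
a = np.zeros(784)
b = np.ones(784)

values = {gamma: rbf_kernel_value(a, b, gamma) for gamma in (1e-2, 1e-3, 1e-4)}
for gamma, k in values.items():
    print(f"gamma={gamma:g}: k(a, b) = {k:.4f}")
```

At gamma=0.01 the kernel value for this pair is close to zero, meaning each training point only influences its immediate neighbourhood; that lets the model fit the training set very closely while generalising poorly. Smaller gamma values give smoother, wider-reaching similarity.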

Building and evaluating the final model.

Final model with the optimal hyperparameters, and a check of the accuracy it achieves.

Model

model = SVC(C=10, gamma=0.001, kernel="rbf")

model.fit(x_train, y_train)
y_pred = model.predict(x_test)

Metrics

print("accuracy", metrics.accuracy_score(y_test, y_pred), "\n")
print(metrics.confusion_matrix(y_test, y_pred), "\n")

accuracy 0.9316402116402116 

[[3617    0   35    2    4    6   35    4    9    4]
 [   0 4113   26   21    9    3    8    6    8    7]
 [  18   15 3551   31   42    7   20   28   50    5]
 [  12    9  156 3578    9   72    6   28   56   23]
 [   7    8   72    0 3415   10   17   16    2   77]
 [  29   12   53  103   34 3105   62    5   36   17]
 [  17    4   75    0   12   22 3554    0    8    0]
 [   7   32  141   11   70    2    2 3629    8   73]
 [  24   46   75   71   25   76   20    8 3300   27]
 [  10    5   75   59  103   17    2  102   21 3354]]

Conclusion

The linear kernel reaches about 90% accuracy and the tuned non-linear (RBF) kernel about 93%, which suggests that the problem is non-linear in nature.
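For readers without the MNIST CSV at hand, the linear-versus-RBF comparison can be tried on scikit-learn's bundled 8 x 8 digits. This is a sketch on a different, much smaller data set with default hyperparameters, so its accuracies need not match the figures above, and it does not by itself confirm which kernel is better on MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: scikit-learn's bundled 8 x 8 digits, not the MNIST CSV.
x, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.10, random_state=10, stratify=y
)

scores = {}
for kernel in ("linear", "rbf"):
    # Scaling lives inside the pipeline, so it is fitted on training rows only.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    clf.fit(x_train, y_train)
    scores[kernel] = clf.score(x_test, y_test)

for kernel, acc in scores.items():
    print(f"{kernel}: test accuracy = {acc:.3f}")
```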