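To lower the estimated error rate and guard against overfitting, we can split the original dataset into groups: one part serves as the training set and the other as the validation set (or test set). We first train the classifier on the training set, then evaluate the resulting model on the validation set, and use that score as the measure of the classifier's performance. Repeating this split so that each group takes a turn as the validation set, and averaging the scores, is what is known as cross-validation. The code below uses 10 folds.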
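To make the splitting step concrete, here is a minimal sketch of how k-fold indices could be generated by hand. The helper name `k_fold_indices` is hypothetical and not part of scikit-learn; it also skips shuffling, which a real workflow would normally do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Hypothetical helper for illustration; no shuffling is performed.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


# Each sample lands in the validation set exactly once across the folds.
folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

A model would be trained on each `train_idx` subset and scored on the matching `val_idx` subset, and the k scores averaged.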
```python
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# X_train and y_train are the training split prepared earlier

# creating list of K for KNN
myList = list(range(1, 50))

# subsetting just the odd ones
neighbors = list(filter(lambda x: x % 2 != 0, myList))

# empty list that will hold cv scores
cv_scores = []

# perform 10-fold cross validation
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# changing to misclassification error (1 - accuracy, not mean squared error)
MSE = [1 - x for x in cv_scores]

# determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)

# plot misclassification error vs k
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
```
```python
from collections import Counter

import numpy as np


def predict(X_train, y_train, x_test, k):
    # create list for distances and targets
    distances = []
    targets = []

    for i in range(len(X_train)):
        # first we compute the euclidean distance
        distance = np.sqrt(np.sum(np.square(x_test - X_train[i, :])))
        # add it to list of distances
        distances.append([distance, i])

    # sort the list (by distance, ties broken by index)
    distances = sorted(distances)

    # make a list of the k neighbors' targets
    for i in range(k):
        index = distances[i][1]
        targets.append(y_train[index])

    # return most common target
    return Counter(targets).most_common(1)[0][0]


def kNearestNeighbor(X_train, y_train, X_test, predictions, k):
    # loop over all observations, appending each prediction in place
    for i in range(len(X_test)):
        predictions.append(predict(X_train, y_train, X_test[i, :], k))
```
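The loop-based `predict` above computes one distance at a time. As a rough alternative, the same majority-vote logic can be sketched with NumPy broadcasting. The function name `knn_predict` and the toy arrays are made up for illustration, and this version trades memory for speed by materializing the full distance matrix.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_test, k):
    """Vectorized k-NN majority vote (illustrative sketch, Euclidean distance)."""
    # Pairwise differences via broadcasting: shape (n_test, n_train, n_features)
    diffs = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k nearest training points for every test row;
    # a stable sort breaks distance ties by training index.
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    # Majority vote among the neighbors' labels
    return np.array([Counter(y_train[row].tolist()).most_common(1)[0][0]
                     for row in nearest])


# Toy data (hypothetical): two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[0.05, 0.05], [5.0, 5.1]])

toy_predictions = knn_predict(X_toy, y_toy, X_new, k=3)
print(toy_predictions)
```

For large datasets the `(n_test, n_train, n_features)` intermediate can be too big to hold in memory, in which case the per-sample loop or a tree-based index is the safer choice.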
Using the optimal value k = 7 found above as the parameter, we build the model and make predictions.
```python
from sklearn.metrics import accuracy_score

# making our predictions
predictions = []
kNearestNeighbor(X_train, y_train, X_test, predictions, 7)

# transform the list into an array
predictions = np.asarray(predictions)
# print(y_test, predictions)

# evaluating accuracy
# for i in range(predictions.size):
#     print(predictions.tolist()[i], list(y_test)[i])
accuracy = accuracy_score(y_test, predictions)
print('\nThe accuracy of our classifier is %d%%' % int(accuracy * 100))
```
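The accuracy reported here is simply the fraction of test labels that were predicted correctly. A minimal sketch of that computation, using a hypothetical `accuracy` helper and made-up labels rather than results from this article:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels, for illustration only
y_true_demo = ["a", "b", "b", "c", "a"]
y_pred_demo = ["a", "b", "c", "c", "a"]

demo_accuracy = accuracy(y_true_demo, y_pred_demo)
print("accuracy = %.2f" % demo_accuracy)
```

Note that formatting with `%d` and `int(...)` truncates rather than rounds, so a score of 0.968 would be printed as 96%.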