Manmohan Mishra: Algorithm KNN = ScikitLearn vs Actual

Tuesday, July 25, 2017

Algorithm KNN = ScikitLearn vs Actual

With scikit-learn:

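import numpy as np
import pandas as pd
from sklearn import neighbors
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

# The file is expected to have a header row that includes 'id' and 'class'
df = pd.read_csv('breast-cancer-wisconsin.data.txt')
# Missing values are marked '?'; replace them with a large sentinel value
df.replace('?', -99999, inplace=True)
# The id column carries no predictive information
df.drop(['id'], axis=1, inplace=True)

X = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(accuracy)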
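A side note on the `-99999` replacement above. Here is a minimal sketch, using made-up feature rows rather than real records from the file, of why such a large sentinel pushes a row with a missing value far away from every other point, so it is unlikely to be picked as a neighbor. It assumes Python 3.8+ for `math.dist`.

```python
import math

# Hypothetical 3-feature rows on a 1-10 scale (illustrative, not real records)
complete_a = [5, 1, 1]
complete_b = [8, 10, 10]
with_missing = [5, 1, -99999]  # third feature was '?' before the replacement

# Distance between two complete rows stays small
d_ab = math.dist(complete_a, complete_b)
# Distance to the row holding the sentinel is dominated by the sentinel
d_a_missing = math.dist(complete_a, with_missing)

print(round(d_ab, 2))   # 13.08
print(d_a_missing)      # 100000.0
```

Even though the first two features match exactly, the sentinel row ends up about 100000 units away, so in practice it behaves like an outlier.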

Build your own model:

import random
import warnings
from collections import Counter

import numpy as np
# Don't forget this import
import pandas as pd


def k_nearest_neighbors(data, predict, k=3):
    # data maps each class label to a list of feature vectors
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')

    # Distance from the query point to every training point
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])

    # Majority vote among the k closest points
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result


df = pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?', -99999, inplace=True)
df.drop(['id'], axis=1, inplace=True)
# Convert everything to float so the column that held '?' is numeric too
full_data = df.astype(float).values.tolist()

random.shuffle(full_data)

test_size = 0.2
# Class labels in this dataset: 2 = benign, 4 = malignant
train_set = {2: [], 4: []}
test_set = {2: [], 4: []}
train_data = full_data[:-int(test_size * len(full_data))]
test_data = full_data[-int(test_size * len(full_data)):]

# The last column is the class label; the rest are features
for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1

print('Accuracy:', correct / total)
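To try the voting logic without the data file, here is a small self-contained sketch on a made-up two-group dataset. It is a stdlib-only variant of `k_nearest_neighbors` (it uses `math.dist`, Python 3.8+, instead of `np.linalg.norm`). The name `knn_vote`, the toy points, and the returned vote share are my additions for illustration, not part of the code above.

```python
import math
from collections import Counter


def knn_vote(data, predict, k=3):
    """Return (winning_group, share_of_votes) for one query point."""
    distances = []
    for group, points in data.items():
        for features in points:
            distances.append((math.dist(features, predict), group))
    # Labels of the k closest training points
    votes = [group for _, group in sorted(distances)[:k]]
    winner, count = Counter(votes).most_common(1)[0]
    # Share of the k votes that went to the winner
    return winner, count / k


# Hypothetical 2-D points: group 'k' bottom-left, group 'r' top-right
toy = {'k': [[1, 2], [2, 3], [3, 1]], 'r': [[6, 5], [7, 7], [8, 6]]}

print(knn_vote(toy, [5, 7], k=3))  # ('r', 1.0)
print(knn_vote(toy, [5, 7], k=5))  # ('r', 0.6)
```

The prediction is the same in both calls, but the vote share drops from 1.0 to 0.6 once K is large enough to pull in points from the other group. That per-prediction vote share is one way to think about confidence, as opposed to accuracy, which is measured over the whole test set.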

Points to know before you consider KNN:

1. What is your data volume? (It should not be in the terabyte range.)

2. What should the value of "K" be? It depends on your requirements. A high K value does not mean you will get better accuracy; in fact, I observed the opposite.

3. Can you multithread your algorithm? (scikit-learn's KNN already supports a parallel neighbor search: pass n_jobs=-1 to use all available cores.)

4. What is the difference between accuracy and confidence?

5. Do you need to define a radius?
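Points 2 and 5 can be illustrated with a tiny sketch. The 1-D points, the query values, and the helper names (`vote_k`, `vote_radius`) are all made up for illustration. The idea is that a large K lets a bigger, more distant class outvote the nearby one, while a radius only counts neighbors within a fixed distance and can decline to answer when nothing is close.

```python
from collections import Counter

# Hypothetical 1-D data: a small tight cluster 'a' and a larger cluster 'b'
points = [(1.0, 'a'), (1.2, 'a'), (1.4, 'a'),
          (3.0, 'b'), (3.5, 'b'), (4.0, 'b'), (4.5, 'b'), (5.0, 'b')]


def by_distance(query):
    # (distance, label) pairs, closest first
    return sorted((abs(x - query), label) for x, label in points)


def vote_k(query, k):
    # Majority vote among the k closest points
    labels = [label for _, label in by_distance(query)[:k]]
    return Counter(labels).most_common(1)[0][0]


def vote_radius(query, radius):
    # Majority vote among points within the radius; None if there are none
    labels = [label for d, label in by_distance(query) if d <= radius]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


print(vote_k(1.8, k=3))        # a
print(vote_k(1.8, k=7))        # b
print(vote_radius(1.8, 1.0))   # a
print(vote_radius(10.0, 1.0))  # None
```

With K=7 the query next to cluster 'a' is classified as 'b', simply because 'b' has more points. For real data, scikit-learn ships a radius-based variant, `RadiusNeighborsClassifier`, in `sklearn.neighbors`.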

2 comments:

Anonymous, May 18, 2026 at 10:06 PM
The K-Nearest Neighbors (KNN) algorithm implemented in scikit-learn follows the same core concept as the actual theoretical KNN algorithm, but Scikit-learn provides optimized, production-ready implementations for efficiency and scalability. In the actual KNN algorithm, the model stores all training data and predicts the output for a new data point by calculating distances (such as Euclidean distance) between the new point and all existing training samples. The algorithm then selects the nearest “K” neighbors and performs classification or regression based on majority voting or averaging.
Anonymous, May 18, 2026 at 10:07 PM
Scikit-learn enhances this process with optimized neighbor-search structures such as KD-Tree and Ball Tree, alongside plain brute-force search. It also provides additional features like automatic selection of the search algorithm, multiple distance metrics, weighted neighbors, parallel processing, and seamless integration with preprocessing pipelines. While the actual KNN algorithm is mainly a conceptual learning model, the Scikit-learn implementation is engineered for real-world machine learning applications with improved performance, scalability, and ease of use. For example, Scikit-learn allows users to train a KNN model with just a few lines of Python code while internally handling optimization and memory management efficiently.