Using scikit-learn:
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import preprocessing, model_selection, neighbors
import pandas as pd
df = pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?',-99999, inplace=True)
# recent pandas versions require axis to be passed by keyword
df.drop(['id'], axis=1, inplace=True)
X = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
Build your own model:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import warnings
from collections import Counter
# don't forget this
import pandas as pd
import random
style.use('fivethirtyeight')
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features)-np.array(predict))
            distances.append([euclidean_distance,group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
df = pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?',-99999, inplace=True)
df.drop(['id'], axis=1, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
test_size = 0.2
train_set = {2:[], 4:[]}
test_set = {2:[], 4:[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1
print('Accuracy:', correct/total)
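To sanity-check a hand-written KNN on data small enough to verify by eye, here is a minimal sketch. It is a variant of the function above, not the same function: the name `knn_with_confidence` and the toy points are made up for illustration. Besides the winning group it returns the share of the k votes that group received, which is one simple way to express confidence in a single prediction (as opposed to accuracy, which is measured over the whole test set).

```python
from collections import Counter

import numpy as np


def knn_with_confidence(data, predict, k=3):
    """Return (winning_group, share_of_the_k_votes_it_received)."""
    distances = []
    for group, points in data.items():
        for features in points:
            dist = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append((dist, group))
    # Keep only the labels of the k closest training points.
    votes = [group for _, group in sorted(distances)[:k]]
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / k


# Hypothetical 2-D points, for illustration only.
toy_data = {'k': [[1, 2], [2, 3], [3, 1]], 'r': [[6, 5], [7, 7], [8, 6]]}

label, confidence = knn_with_confidence(toy_data, [5, 7], k=3)
print(label, confidence)  # r 1.0
```

A point that sits between the two clusters, such as `[4, 4]`, gets a split vote and therefore a confidence below 1.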
Points to know before you decide:
1. What is your data volume? (It should not be in the terabyte range.)
2. What should the value of "K" be? (It depends on your requirements. A high K value does not mean you will get better accuracy; in fact, I observed the opposite.)
3. Can you multithread your algorithm? (The scikit-learn KNN classifier can already search neighbors in parallel when you pass n_jobs=-1.)
4. What is the difference between accuracy and confidence?
5. Do you need to define a radius?
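Points 2 and 3 are easy to try out. The sketch below compares a few values of K with 5-fold cross-validation and passes `n_jobs=-1` so the neighbor search uses all available cores. One assumption to flag: it uses the breast cancer dataset bundled with scikit-learn (`load_breast_cancer`) as a stand-in, which is not the same file as the CSV loaded above, and the K values are arbitrary picks.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled dataset stands in for the CSV used in the post.
X, y = load_breast_cancer(return_X_y=True)

scores_by_k = {}
for k in (1, 5, 15, 51):
    # n_jobs=-1 asks scikit-learn to use all cores for the neighbor search.
    clf = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
    scores_by_k[k] = cross_val_score(clf, X, y, cv=5).mean()

for k, score in scores_by_k.items():
    print(f'k={k:>2}  mean accuracy={score:.3f}')
```

The numbers depend on the dataset, so treat the comparison as something to run on your own data rather than a general rule about K.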
The K-Nearest Neighbors (KNN) algorithm implemented in scikit-learn follows the same core concept as the textbook KNN algorithm, but scikit-learn provides optimized, production-ready implementations for efficiency and scalability. In the textbook algorithm, the model stores all training data and predicts the output for a new data point by calculating distances (such as Euclidean distance) between the new point and all existing training samples. It then selects the nearest “K” neighbors and either classifies by majority vote or, for regression, averages the neighbors' target values.
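The description above (store everything, measure all distances, take the K closest, then vote or average) fits in a few lines of NumPy. This is a brute-force sketch for illustration, not how scikit-learn does it internally; the function name `knn_predict`, its `task` parameter, and the toy arrays are all made up here.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task='classify'):
    """Brute-force KNN: majority vote for classification, mean for regression."""
    # Euclidean distance from x_new to every stored training sample.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    neighbor_targets = y_train[nearest]
    if task == 'classify':
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    return float(neighbor_targets.mean())


# Hypothetical one-feature data, for illustration only.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_class = np.array([0, 0, 0, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0])

print(knn_predict(X_train, y_class, np.array([2.5]), k=3))                  # 0
print(knn_predict(X_train, y_value, np.array([2.5]), k=3, task='regress'))  # 2.0
```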
In machine learning projects, scikit-learn speeds this process up with optimized data structures and search strategies such as KD-Tree, Ball Tree, and brute-force search for neighbor lookup. It also provides parameter tuning through tools such as GridSearchCV, multiple distance metrics, weighted neighbors, parallel processing, and seamless integration with preprocessing pipelines. While the textbook KNN algorithm is mainly a conceptual learning model, the scikit-learn implementation is engineered for real-world machine learning applications with improved performance, scalability, and ease of use. For example, scikit-learn lets users train a KNN model with just a few lines of Python code while handling neighbor search and memory management internally.
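Here is a short sketch of several of those options together: a preprocessing pipeline, distance-weighted neighbors, an explicit metric, and each of the three search strategies. As an assumption, it uses the breast cancer dataset bundled with scikit-learn rather than the CSV from the post, and the parameter values are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the CSV used in the post.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

accuracies = {}
for algorithm in ('brute', 'kd_tree', 'ball_tree'):
    model = make_pipeline(
        StandardScaler(),  # scale features so no single column dominates the distance
        KNeighborsClassifier(
            n_neighbors=5,
            weights='distance',   # closer neighbors get a larger vote
            algorithm=algorithm,  # neighbor search strategy
            metric='minkowski',
            p=2,                  # Minkowski with p=2 is Euclidean distance
        ),
    )
    model.fit(X_train, y_train)
    accuracies[algorithm] = model.score(X_test, y_test)

print(accuracies)
```

All three strategies perform an exact search, so the accuracies should normally match; the choice mostly affects speed and memory use.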