Statistical Machine Learning by using Jupyter Notebook Problem


For this HW, we will pretend that we only have 1000 images of the MNIST dataset.

We will start with what we had in our videos:

#commonly used imports
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

#get data from sklearn
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

#get the features and labels
X, y = mnist["data"], mnist["target"]

#convert labels from characters to numbers
y = y.astype(np.uint8)

#split into training and test sets
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Suppose we train a Decision Tree Classifier (covered later in Chapter 6). There are two hyperparameters we will fine-tune for now: max_leaf_nodes and min_samples_split. Let's fine-tune these using the following code.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

#set up the decision tree
dt_clf = DecisionTreeClassifier(random_state=30)

#the hyperparameters to search through
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [3, 4, 5]}

#initialize the GridSearch with 3-fold cross validation
grid_search_cv = GridSearchCV(dt_clf, params, verbose=1, cv=3)

#do the search (but only on the first 1000 images)
grid_search_cv.fit(X_train[:1000], y_train[:1000])

What are the best hyperparameters?
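One way to read the answer off a fitted search object is through its best_params_ and best_score_ attributes. The sketch below is only an illustration: it assumes scikit-learn's small bundled digits dataset (8x8 images, not MNIST) and a deliberately narrowed grid so that it runs offline in a few seconds. The names X_demo, y_demo, demo_params, and demo_search are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): 8x8 digits bundled with scikit-learn, not MNIST
X_demo, y_demo = load_digits(return_X_y=True)

# Narrowed illustrative grid so the search finishes quickly
demo_params = {'max_leaf_nodes': [10, 20, 40], 'min_samples_split': [3, 4, 5]}
demo_search = GridSearchCV(DecisionTreeClassifier(random_state=30), demo_params, cv=3)
demo_search.fit(X_demo[:500], y_demo[:500])

# best_params_ is the winning combination; best_score_ is its mean CV accuracy
print(demo_search.best_params_)
print(demo_search.best_score_)
```

The same two attributes are available on grid_search_cv once its fit call has finished.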

Now, let's fit the training data (again pretending to only have the first 1000 images).

dt_clf = grid_search_cv.best_estimator_
dt_clf.fit(X_train[:1000], y_train[:1000])

Let's now get cross-validation accuracy scores with this fine-tuned model:

from sklearn.model_selection import cross_val_score
cross_val_score(dt_clf, X_train[:1000], y_train[:1000], cv=3, scoring="accuracy")

How accurate are your three folds?
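When comparing folds, it can help to look at the mean and the spread of the scores rather than at each number alone. The following is a minimal sketch under stated assumptions: it uses scikit-learn's bundled digits dataset as a stand-in for MNIST, and max_leaf_nodes=40 is an arbitrary illustrative value, not a tuned one. The names demo_clf and demo_scores are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): 8x8 digits bundled with scikit-learn, not MNIST
X_demo, y_demo = load_digits(return_X_y=True)

# Arbitrary illustrative hyperparameter, not the result of a search
demo_clf = DecisionTreeClassifier(max_leaf_nodes=40, random_state=30)
demo_scores = cross_val_score(demo_clf, X_demo[:500], y_demo[:500],
                              cv=3, scoring="accuracy")

# One accuracy per fold, then the average and the fold-to-fold spread
print("per-fold accuracy:", demo_scores)
print("mean:", demo_scores.mean(), "std:", demo_scores.std())
```

A small standard deviation suggests the folds agree with each other; a large one suggests the score depends heavily on which 1000-image subset slice ends up in validation.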

If we only had 1000 images and we wanted more to train on, we could do what is known as data augmentation. The augmentation we will do here is to shift the images we do have. For each of the 1000 images, we shift one pixel up, left, right, and down. So this will give us four more images per original image to add to our training set. Data augmentation such as this is a very useful technique when there are not enough training instances. If you run the following code, you will see an example of the first image shifted.

from scipy.ndimage import shift

def shift_image(image, dx, dy):
    #reshape the flat 784-pixel vector into a 28x28 image
    image = image.reshape((28, 28))
    #shift by dy rows and dx columns, filling vacated pixels with 0
    shifted_image = shift(image, [dy, dx], cval=0, mode="constant")
    #flatten back to a 784-pixel vector
    return shifted_image.reshape([-1])

image = X_train[0]
shifted_image_down = shift_image(image, 0, 5)
shifted_image_left = shift_image(image, -5, 0)

#draw the three images side by side so they do not overwrite each other
plt.figure(figsize=(12, 3))
plt.subplot(131)
plt.title("Original", fontsize=14)
plt.imshow(image.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(132)
plt.title("Shifted down", fontsize=14)
plt.imshow(shifted_image_down.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(133)
plt.title("Shifted left", fontsize=14)
plt.imshow(shifted_image_left.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.show()

To shift all of our images and add them to a training data set, we can use the following code:

X_train_augmented = [image for image in X_train[:1000]]
y_train_augmented = [label for label in y_train[:1000]]

for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for image, label in zip(X_train[:1000], y_train[:1000]):
        X_train_augmented.append(shift_image(image, dx, dy))
        #a shifted image keeps the label of its original
        y_train_augmented.append(label)

X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)

You now have 5000 images in the X_train_augmented and y_train_augmented sets (the 1000 original images, plus each image shifted up, down, left, and right).
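To check the shift direction and the size bookkeeping on something small enough to read by eye, the sketch below applies the same scipy.ndimage.shift call to a toy 5x5 array. The toy array and the variable names are illustrative assumptions, not part of the assignment data.

```python
import numpy as np
from scipy.ndimage import shift

# Toy 5x5 "image" with one bright pixel in the center (illustrative only)
toy = np.zeros((5, 5))
toy[2, 2] = 1.0

# Same [dy, dx] order as in shift_image: positive dy moves content down,
# negative dx moves it left; vacated pixels are filled with cval=0
moved_down = shift(toy, [1, 0], cval=0, mode="constant")
moved_left = shift(toy, [0, -1], cval=0, mode="constant")

print(np.round(moved_down, 3))
print(np.round(moved_left, 3))

# Size bookkeeping: the originals plus one shifted copy per direction
n_original, n_directions = 1000, 4
n_augmented = n_original + n_directions * n_original
print(n_augmented)
```

Calling X_train_augmented.shape and y_train_augmented.shape in the notebook is a quick way to confirm that both arrays grew to the same length.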

Let's now fine-tune and fit a decision tree classifier using this augmented training data.

dt_clf = DecisionTreeClassifier(random_state=30)
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [3, 4, 5]}
grid_search_cv = GridSearchCV(dt_clf, params, verbose=1, cv=3)

#This will take around 10 minutes to run
grid_search_cv.fit(X_train_augmented, y_train_augmented)


Now, what are the best hyperparameters?

Finally, get cross-validation accuracy scores with this model trained on the augmented data.

What are the accuracy scores now?

Did the augmented data help?

Think of one downside to using augmented data and explain.

  • Submit your code in a Jupyter notebook. Include all code: the code examples above and what you write.
  • Put your answers to the questions above into markdown cells.
  • Use a single hashmark heading to label the problem.