MNIST Digit Classification with Daft
The MNIST Dataset is a "large database of handwritten digits that is commonly used for training various image processing systems". It is a classic example used in machine learning.
Loading JSON Data
We start from a JSON file containing all the data for the MNIST test set. Let's load it up into a Daft Dataframe!
To peek at the dataset, simply display the images_df that was just created.
You just loaded your first Daft Dataframe! It consists of two columns:
- The "image" column is a Python column of type list, where it looks like each row contains a list of digits representing the pixels of each image.
- The "label" column is an Integer column, consisting of just the label of that image.
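To make the two-column layout concrete, here is a minimal sketch in plain Python rather than Daft's own API. It assumes the file holds one JSON object per line with an "image" list and a "label" integer. The four-pixel records and the helper names `load_records` and `to_columns` are made up for illustration; real MNIST images have 784 pixels.

```python
import json

# Hypothetical sample in the assumed layout: one JSON object per line.
# Real MNIST records would carry 784 pixel values, not 4.
SAMPLE = """\
{"image": [0, 255, 128, 0], "label": 7}
{"image": [12, 0, 0, 200], "label": 2}
"""


def load_records(text):
    """Parse newline-delimited JSON into a list of row dictionaries."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def to_columns(records):
    """Pivot rows into a column-oriented dict, the shape a dataframe works with."""
    return {
        "image": [r["image"] for r in records],
        "label": [r["label"] for r in records],
    }


columns = to_columns(load_records(SAMPLE))
print(columns["label"])
```

Each key of `columns` plays the role of one Dataframe column: a list of pixel lists for "image" and a list of integers for "label".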
Processing Columns with User-Defined Functions (UDF)
It seems our JSON file has provided us with a one-dimensional array of pixels instead of two-dimensional images. We can modify the data in this column by instructing Daft to run a method on every row in the column, here reshaping each flat list of pixels into a two-dimensional array.
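The reshape itself can be sketched in plain Python, independent of Daft. The helper name `reshape_2d` is hypothetical, and the width of 28 relies on MNIST images being 28 x 28 pixels. In practice a NumPy reshape would likely do the same job; plain lists are used here so the sketch needs no dependencies.

```python
def reshape_2d(pixels, width=28):
    """Split a flat pixel list into rows of `width` pixels each."""
    if len(pixels) % width != 0:
        raise ValueError(
            f"{len(pixels)} pixels do not divide into rows of {width}"
        )
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


# Stand-in for one MNIST image: 28 * 28 = 784 pixel values.
flat = list(range(784))
grid = reshape_2d(flat)
print(len(grid), len(grid[0]))
```

A function like this is what would run once per row of the "image" column.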
Great, but we can do one better - let's convert these two-dimensional arrays into Images. Computers speak in pixels and arrays, but humans do much better with visual patterns!
To do this, we can leverage the .apply expression method. Similar to the .as_py method, this allows us to run a single function on all rows of a given column, but provides us with more flexibility as it takes as input any arbitrary function.
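The idea of running an arbitrary function over every row can be illustrated without Daft. In the sketch below, `apply_to_column` is a hypothetical stand-in for a per-row apply, and `to_ascii` swaps in a text rendering where the tutorial produces real images, so that no imaging library is required.

```python
def apply_to_column(column, func):
    """Run `func` once per row and collect the results into a new column."""
    return [func(row) for row in column]


def to_ascii(grid, threshold=128):
    """Render a 2D grid of 0-255 pixels as text: '#' for ink, '.' for background."""
    return "\n".join(
        "".join("#" if px >= threshold else "." for px in row) for row in grid
    )


# Hypothetical 3x3 "image" standing in for a 28x28 digit.
image_2d_column = [[[0, 255, 0], [0, 255, 0], [0, 255, 0]]]
rendered = apply_to_column(image_2d_column, to_ascii)
print(rendered[0])
```

Any function that accepts one row value and returns one result fits this pattern, which is where the extra flexibility comes from.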
Amazing! This looks great and we can finally get some idea of what the dataset truly looks like.
Running a model with UDFs
Next, let's try to run a deep learning model to classify each image. Models are expensive to initialize and load, so we want to do this as few times as possible, and share a model across multiple invocations.
For the convenience of this quickstart tutorial, we pre-trained a model using a PyTorch-provided example script and saved the trained weights at https://github.com/Eventual-Inc/mnist-json/raw/master/mnist_cnn.pt. We need to define the same deep learning model "scaffold" as the trained model that we want to load. This part is all PyTorch and is not specific to Daft.
Now comes the fun part - we can define a UDF using the @udf decorator. Notice that for a batch of data we only initialize our model once!
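Why batch-level initialization matters can be shown with a small sketch in plain Python. It uses neither Daft's `@udf` decorator nor PyTorch. `FakeModel`, `classify_batch`, and `classify_row` are hypothetical names, and the "prediction" is a toy rule chosen only so the code runs.

```python
class FakeModel:
    """Stand-in for an expensive model; counts how often it gets constructed."""

    loads = 0

    def __init__(self):
        FakeModel.loads += 1  # imagine reading weights from disk here

    def predict(self, image):
        # Toy rule so the sketch runs: "classify" by pixel sum modulo 10.
        return sum(image) % 10


def classify_batch(images):
    """Batch-style function: build the model once, then reuse it for every row."""
    model = FakeModel()
    return [model.predict(img) for img in images]


def classify_row(image):
    """Row-style function: pays the construction cost on every single call."""
    return FakeModel().predict(image)


batch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

FakeModel.loads = 0
batch_preds = classify_batch(batch)
batch_loads = FakeModel.loads

FakeModel.loads = 0
row_preds = [classify_row(img) for img in batch]
row_loads = FakeModel.loads

print(batch_loads, row_loads)
```

Both approaches return the same predictions, but the batch-style function constructs the model once per batch instead of once per row.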
Using this UDF is easy: we simply run it on the columns that we want to process.
Our model ran successfully, and produced a new classification column. These look pretty good - let's filter our Dataframe to show only rows that the model predicted wrongly.
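The filter condition is simply "true label differs from predicted label". A minimal sketch in plain Python, assuming the new column is called `model_classification` (a hypothetical name) and using made-up rows:

```python
def mispredicted(rows, truth="label", predicted="model_classification"):
    """Keep only the rows where the prediction disagrees with the true label."""
    return [row for row in rows if row[truth] != row[predicted]]


# Made-up rows for illustration; the column names are assumptions.
rows = [
    {"label": 7, "model_classification": 7},
    {"label": 2, "model_classification": 3},
    {"label": 9, "model_classification": 4},
    {"label": 0, "model_classification": 0},
]
wrong = mispredicted(rows)
print(len(wrong))
```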
Some of these look hard indeed, even for a human!
Analytics
We just managed to run our model, but how well did it actually do? Dataframes expose a powerful set of operations in Groupbys/Aggregations to help us report on aggregates of our data.
Let's group our data by the true labels and calculate how many mistakes our model made per label.
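The aggregation being described, counting mistakes and totals per true label, can be sketched with the standard library alone. This is not Daft's groupby API; `mistakes_per_label` is a hypothetical helper and the labels and predictions are made up.

```python
from collections import Counter


def mistakes_per_label(labels, predictions):
    """Return {label: (wrong_count, total_count)} grouped by the true label."""
    totals = Counter(labels)
    wrong = Counter(
        label for label, pred in zip(labels, predictions) if label != pred
    )
    # Counter returns 0 for labels that were never mispredicted.
    return {label: (wrong[label], totals[label]) for label in sorted(totals)}


# Made-up data for illustration.
labels = [0, 0, 1, 1, 1, 2]
predictions = [0, 8, 1, 1, 7, 2]
report = mistakes_per_label(labels, predictions)
for label, (n_wrong, n_total) in report.items():
    print(label, n_wrong, n_total)
```

Reporting the total alongside the mistake count makes it easy to turn each row into an error rate.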
Pretty impressive, given that the model only actually trained for one epoch!