Neural Networks 101: Part 7 - Data Preprocessing
Data Preprocessing
This article will take a closer look at Data Preprocessing using the FastAI library.
Data Preprocessing is a very important step in Deep Learning. It can feel like more of a “chore”, and less glamorous than learning about Architectures, but it's still a very important step.
We can define the motivation of Data Preprocessing as:
Organizing, "cleaning" and enhancing training data in order to maximize the results.
We can loosely define the steps of Data Preprocessing (specifically with Image Classification):
- Download the training data
- Investigate and understand the structure of the data
- Separate the independent variable (the input, e.g. an image) and the dependent variable (the label) into x and y
- Standardize the input data (e.g. convert all images to the same sized tensors)
- Data Augmentation (e.g. create copies of the same input, but zoom into different segments, flip or rotate the images)
- Split the data into training and validation sets
- Batch the training data, ready for training
Below we will run through some code using FastAI that performs Data Preprocessing on images that require multi-label classification (meaning an image can have more than one label).
Code Example
- Download the training data - PASCAL_2007.
from fastai.vision.all import *
path = untar_data(URLs.PASCAL_2007)
- Using pandas, read the CSV file that contains information about the structure of the training data
data_frame = pd.read_csv(path/'train.csv')
The returned object is a DataFrame. This is a pandas object that represents data in a tabular structure.
- We need to describe how the rows of the DataFrame become training data, which is the job of a DataBlock
The DataBlock is a FastAI object that prepares the data for training. This class specifies where the data comes from, how to split the data between training and validation, how to label the data and how to augment the data.
- Using the DataBlock, we standardize and augment the data.
# Each row r of the DataFrame yields an image path (x) and a list of labels (y)
def get_x(r): return path/f"train/{r['fname']}"
def get_y(r): return r['labels'].split(" ")

def splitter(data_frame):
    # Row indices go to training or validation according to the is_valid column
    train = data_frame.index[~data_frame['is_valid']].tolist()
    valid = data_frame.index[data_frame['is_valid']].tolist()
    return train,valid

data_block = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock),
    splitter=splitter,
    get_x=get_x,
    get_y=get_y,
    item_tfms=RandomResizedCrop(128, min_scale=0.35)
)
data_loaders = data_block.dataloaders(data_frame)
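The labelling and splitting logic can be tried out without downloading the dataset. The sketch below is a plain-Python stand-in, not FastAI or pandas code. The rows and file names are hypothetical, a list of dicts replaces the DataFrame, the "train" folder stands in for the downloaded path, and splitter is reimplemented with list comprehensions instead of pandas boolean indexing.

```python
from pathlib import Path

# Hypothetical rows using the same columns the code above reads
# (fname, labels, is_valid); the values are made up for illustration.
rows = [
    {"fname": "img_001.jpg", "labels": "chair", "is_valid": True},
    {"fname": "img_002.jpg", "labels": "car", "is_valid": True},
    {"fname": "img_003.jpg", "labels": "horse person", "is_valid": False},
    {"fname": "img_004.jpg", "labels": "car", "is_valid": False},
]

base = Path("train")  # assumption: stand-in for path/'train'

def get_x(row):
    # Independent variable: where the image file lives
    return base / row["fname"]

def get_y(row):
    # Dependent variable: space-separated labels become a list
    return row["labels"].split(" ")

def splitter(rows):
    # Positions of rows for training and for validation
    train = [i for i, row in enumerate(rows) if not row["is_valid"]]
    valid = [i for i, row in enumerate(rows) if row["is_valid"]]
    return train, valid

train_idx, valid_idx = splitter(rows)
print(get_x(rows[2]), get_y(rows[2]))
print(train_idx, valid_idx)
```

The two lists hold row positions rather than the rows themselves, which is also what the pandas version returns when the DataFrame has its default integer index.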
The functions get_x and get_y are used in the DataBlock to get the independent variable (the image) and the dependent variable (the labels) from each row of the training data.
The function splitter uses the is_valid column in the DataFrame object to split the data into training and validation sets.
In the DataBlock, blocks=(ImageBlock, MultiCategoryBlock) defines the types of the input and the target.
- ImageBlock will assume the independent variables are images and will convert the images to tensors
- MultiCategoryBlock will create a one-hot encoding for the dependent variables, with one position per possible label, so several labels can be switched on for the same image.
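To make the one-hot encoding for multiple labels concrete, here is a minimal plain-Python sketch of the idea. It is an illustration only, not FastAI's implementation; build_vocab and one_hot are hypothetical helper names, and the label lists are made up.

```python
def build_vocab(label_lists):
    # Every distinct label seen in the data, in a fixed (sorted) order
    return sorted({label for labels in label_lists for label in labels})

def one_hot(labels, vocab):
    # 1.0 where the label is present in this image, 0.0 everywhere else
    present = set(labels)
    return [1.0 if name in present else 0.0 for name in vocab]

# Hypothetical labels for four images
label_lists = [["chair"], ["car"], ["horse", "person"], ["car"]]

vocab = build_vocab(label_lists)
print(vocab)                                # ['car', 'chair', 'horse', 'person']
print(one_hot(["horse", "person"], vocab))  # [0.0, 0.0, 1.0, 1.0]
```

Unlike single-label classification, more than one position can be 1.0 at the same time, and an image with none of the known labels would encode as all zeros.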
In the DataBlock, item_tfms is the item transform. It is applied to each item individually and here performs both the standardization and the Augmentation. RandomResizedCrop(128, min_scale=0.35) selects a random crop that covers at least 35% of the original image and resizes it to 128x128 pixels, so every image ends up the same size.
The final line, data_block.dataloaders(data_frame), builds a DataLoaders object from the DataBlock. It applies the configuration in the DataBlock to the rows of the DataFrame and groups the resulting items into batches, ready for training.
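The last two steps, a random crop per item followed by batching, can be sketched in plain Python. This is a simplified illustration under stated assumptions, not FastAI's code: random_crop_box and batches are hypothetical helpers, the image sizes are made up, and the crop keeps the image's own aspect ratio, whereas the real transform also varies the aspect ratio before resizing to the target size.

```python
import math
import random

def random_crop_box(width, height, min_scale=0.35, rng=random):
    """Pick a random crop box (left, top, crop_w, crop_h) inside the image."""
    # Fraction of the original area the crop should cover
    scale = rng.uniform(min_scale, 1.0)
    # Scaling both sides by sqrt(scale) scales the area by scale
    side = math.sqrt(scale)
    crop_w = max(1, round(width * side))
    crop_h = max(1, round(height * side))
    # Random position that keeps the box fully inside the image
    left = rng.randint(0, width - crop_w)
    top = rng.randint(0, height - crop_h)
    return left, top, crop_w, crop_h

def batches(items, batch_size):
    # Group items into consecutive chunks; the last one may be smaller
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

rng = random.Random(0)  # seeded so the sketch is repeatable
sizes = [(500, 375), (353, 500), (500, 333), (480, 360), (320, 240)]  # hypothetical
boxes = [random_crop_box(w, h, rng=rng) for w, h in sizes]

for batch in batches(boxes, batch_size=2):
    print(len(batch), batch)
```

Each box would then be resized to the same target size (128x128 in the article's code), which is what allows the items of a batch to be stacked into a single tensor.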
Summary
We explained how and why Data Preprocessing is required before training a Model.
The steps are:
- Download the data
- Investigate the data and its structure
- Split the input into training and validation sets
- Separate the data into its dependent and independent variables
- Standardize the input
- Augment the input
- Batch the training data