Neural Networks 101: Part 7
Data Preprocessing
This article will take a closer look at Data Preprocessing using the FastAI library.
Data Preprocessing is a crucial step in Deep Learning. It can be seen as more of a “chore” and is less glamorous than learning about architectures, but it's still very important.
We can define the motivation of Data Preprocessing as:
Organizing, “cleaning” and enhancing training data in order to maximize training results.
We can loosely define the steps of Data Preprocessing (specifically with Image Classification):
- Download the training data
- Investigate and understand the structure of the data
- Separate the independent variable (the input, e.g. an image) and the dependent variable (the label) into x and y
- Standardize the input data (e.g. convert all images to the same sized tensors)
- Data Augmentation - (e.g. create copies of the same input but, zoom into different segments, flip or rotate the images)
- Split the data into training and validation sets
- Batch the training data, ready for training
Below we will run through some code using FastAI that performs Data Preprocessing on images for multi-label classification (meaning an image can have more than one label).
Code Example
- Download the training data - the PASCAL 2007 dataset.
from fastai.vision.all import *
path = untar_data(URLs.PASCAL_2007)
- Using pandas, read the CSV file that contains information about the structure of the training data
data_frame = pd.read_csv(path/'train.csv')
The returned object is a `DataFrame`. This is a pandas object that represents data in a tabular structure.
- We need to convert the `DataFrame` to a `DataBlock`
The `DataBlock` is a FastAI object that prepares the data for training. This class specifies where the data comes from, how to split the data between training and validation, how to label the data and how to augment the data.
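To see what that tabular structure looks like, here is a miniature stand-in for `train.csv`. The values are made up, but the columns (`fname`, `labels`, `is_valid`) are the ones the code below relies on:

```python
import pandas as pd

# A tiny hypothetical frame mimicking the structure of PASCAL's train.csv:
# a filename, a space-separated label string, and a validation flag.
data_frame = pd.DataFrame({
    "fname": ["000005.jpg", "000007.jpg", "000009.jpg"],
    "labels": ["chair", "car", "horse person"],
    "is_valid": [True, True, False],
})

print(data_frame.shape)           # (3, 3): three rows, three columns
print(list(data_frame.columns))   # ['fname', 'labels', 'is_valid']
```

Inspecting the shape, columns and a few rows like this is the "investigate the structure" step in practice.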
- Using the `DataBlock`, we standardize and augment the data.
def get_x(r): return path/f"train/{r['fname']}"
def get_y(r): return r['labels'].split(" ")
def splitter(data_frame):
    train = data_frame.index[~data_frame['is_valid']].tolist()
    valid = data_frame.index[data_frame['is_valid']].tolist()
    return train, valid

data_block = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock),
    splitter=splitter,
    get_x=get_x,
    get_y=get_y,
    item_tfms=RandomResizedCrop(128, min_scale=0.35)
)
data_loaders = data_block.dataloaders(data_frame)
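As a quick sanity check, the `splitter` function above can be run on its own against a tiny hypothetical `DataFrame` (the filenames here are invented); it returns the row indices of the training rows and the validation rows:

```python
import pandas as pd

# Hypothetical rows: two training (is_valid=False), two validation (is_valid=True).
data_frame = pd.DataFrame({
    "fname": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "is_valid": [False, True, False, True],
})

def splitter(data_frame):
    # ~ negates the boolean column, so `train` gets the rows where is_valid is False
    train = data_frame.index[~data_frame['is_valid']].tolist()
    valid = data_frame.index[data_frame['is_valid']].tolist()
    return train, valid

print(splitter(data_frame))  # ([0, 2], [1, 3])
```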
The functions `get_x` and `get_y` will be used in the `DataBlock` to get the independent variable (the image) and the dependent variable (the labels) from the training data. The function `splitter` will use the `is_valid` column in the `DataFrame` object to split the training and validation data.

In the `DataBlock`, `blocks=(ImageBlock, MultiCategoryBlock)` defines the training input.
- `ImageBlock` will assume the independent variables are images and will convert the images to tensors
- `MultiCategoryBlock` will create a one-hot encoding for the dependent variables.
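One-hot encoding for multi-label targets (sometimes called multi-hot encoding) can be sketched in plain Python. The vocabulary and labels here are hypothetical, not taken from PASCAL:

```python
# The full set of possible labels (the "vocabulary").
vocab = ["car", "chair", "horse", "person"]

def one_hot(labels, vocab):
    """Return a 0/1 vector with a 1 at the position of each present label."""
    return [1.0 if v in labels else 0.0 for v in vocab]

# An image labelled both "horse" and "person" gets two 1s:
print(one_hot(["horse", "person"], vocab))  # [0.0, 0.0, 1.0, 1.0]
```

This is why the encoding suits multi-label classification: each label gets its own independent slot, so any number of them can be "on" at once.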
In the `DataBlock`, `item_tfms` is the item transform; this performs the augmentation. In this case, we are resizing all images to 128x128 pixels and setting 35% of the original image as the minimum size of the randomly selected crop.

The final line, `data_block.dataloaders(data_frame)`, converts the `DataBlock` to a `DataLoaders` object, which takes the configuration from the `DataBlock` and batches the data, ready for training.
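The batching idea itself can be sketched in plain Python. A real `DataLoaders` also shuffles the training set and collates items into tensors, but the grouping looks roughly like this:

```python
def batch(items, batch_size):
    """Group a list of items into consecutive fixed-size batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Integers stand in for (image, label) pairs; the last batch may be smaller.
print(batch(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```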
Summary
We explained why Data Preprocessing is required before training a model, and how to perform it with FastAI.
The steps are:
- Download the data
- Investigate the data and its structure
- Split the input into training and validation
- Separate the data into its dependent and independent variables
- Standardize the input
- Augment the input
- Batch the training data