When we load data to train a neural network model, we usually split it into training data and testing data.
The training data is then further split into batches to train the model. This reduces memory usage, since the model processes only one batch at a time, and can speed up training.
Let’s see how to load data into a neural network model using PyTorch.
Dataset
PyTorch provides a Dataset class to load data into a neural network model. We can create a custom dataset by inheriting from the Dataset class and implementing the __len__ and __getitem__ methods.
Here is an example of a custom dataset class that loads data from a CSV file.
import pandas as pd
import torch

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, file_path):
        self.data = pd.read_csv(file_path)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # All columns except the last are inputs; the last column is the target.
        x = torch.tensor(self.data.iloc[idx, :-1].values, dtype=torch.float32)
        y = torch.tensor(self.data.iloc[idx, -1], dtype=torch.float32)
        return x, y
This dataset class reads data from a CSV file and returns the input and output values as tensors.
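To make the two required methods concrete, here is a minimal sketch that uses in-memory tensors instead of a CSV file. The class name ToyDataset and the random data are assumptions made for illustration; they are not part of the example above.

```python
import torch
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical in-memory dataset: 3 features per row, 1 target."""

    def __init__(self, num_samples=5):
        # Random stand-in data; a real dataset would read from disk.
        self.features = torch.randn(num_samples, 3)
        self.targets = torch.randn(num_samples)

    def __len__(self):
        # Total number of samples in the dataset.
        return self.features.shape[0]

    def __getitem__(self, idx):
        # Return one (input, target) pair for the given index.
        return self.features[idx], self.targets[idx]


toy = ToyDataset(num_samples=5)
x, y = toy[0]         # indexing calls __getitem__
num_items = len(toy)  # len() calls __len__
```

Indexing and len() are all that the rest of the pipeline needs from a map-style dataset, which is why these two methods are the ones to implement.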
Image Loading example
import os

import torch
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, directory):
        self.directory = directory
        self.image_files = [f for f in os.listdir(directory) if f.endswith('.png')]

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        image = Image.open(os.path.join(self.directory, self.image_files[idx]))
        image = transform(image)
        # The label is the file name prefix before the first underscore (a string).
        label = self.image_files[idx].split('_')[0]
        return image, label
The Dataset class can be used to load the data by applying required transformations.
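The image dataset above returns each label as a string taken from the file name. Loss functions such as torch.nn.CrossEntropyLoss expect integer class indices, so one possible approach is to map the label strings to integers first. The sketch below assumes file names of the form label_id.png; the specific names are hypothetical.

```python
# Hypothetical file names following the "<label>_<id>.png" pattern used above.
image_files = ["cat_001.png", "dog_001.png", "cat_002.png", "bird_001.png"]

# Extract the label prefix from each file name.
labels = [name.split("_")[0] for name in image_files]

# Assign an integer index to each distinct label (sorted for determinism).
class_to_idx = {label: i for i, label in enumerate(sorted(set(labels)))}

# Integer targets, one per file, ready to be wrapped in torch.tensor(...).
targets = [class_to_idx[label] for label in labels]

print(class_to_idx)  # {'bird': 0, 'cat': 1, 'dog': 2}
print(targets)       # [1, 2, 1, 0]
```

A mapping like this could be built once in __init__ and applied in __getitem__, so that the dataset returns an integer label instead of a string.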
To split the data into training and testing sets, we can use the random_split function provided by PyTorch.
from torch.utils.data import random_split
dataset = CustomDataset('data.csv')
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])
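Because random_split shuffles the indices, two runs can produce different splits. If a reproducible split is wanted, a seeded generator can be passed in. The sketch below uses a stand-in TensorDataset of 100 samples and an arbitrary seed, both chosen only for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset of 100 samples (assumption for illustration).
dataset = TensorDataset(torch.arange(100).float().unsqueeze(1), torch.zeros(100))

train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size

# A seeded generator makes the split identical on every run.
generator = torch.Generator().manual_seed(42)
train_dataset, test_dataset = random_split(
    dataset, [train_size, test_size], generator=generator
)

# Each Subset stores the indices it draws from the original dataset.
overlap = set(train_dataset.indices) & set(test_dataset.indices)

print(len(train_dataset), len(test_dataset))  # 80 20
print(overlap)                                # set()
```

The two subsets share no indices, so no sample appears in both the training and the testing set.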
DataLoader
The DataLoader class is used to load the data in batches. It provides options to shuffle the data and load the data in parallel.
Here is an example of how to use the DataLoader class to load the data.
from torch.utils.data import DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
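To see what the batch size does in practice, the following sketch counts the batches produced for a stand-in dataset of 100 samples (the dataset and its shape are assumptions for illustration). It also shows the drop_last option, which discards a final batch that is smaller than batch_size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 100 samples with 4 features each (assumption).
dataset = TensorDataset(torch.randn(100, 4), torch.zeros(100))

loader = DataLoader(dataset, batch_size=32, shuffle=True)
batch_sizes = [features.shape[0] for features, _ in loader]

# drop_last=True discards the final, smaller batch.
loader_drop = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)

print(batch_sizes)       # [32, 32, 32, 4]
print(len(loader))       # 4
print(len(loader_drop))  # 3
```

Parallel loading is controlled by the num_workers argument: a value greater than 0 loads batches in worker subprocesses, while the default of 0 loads them in the main process.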
When training the model, we can iterate over the DataLoader object to get the data in batches.
for item, label in train_loader:
    output = model(item)
    loss = criterion(output, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
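The loop above relies on model, criterion, and optimizer being defined elsewhere. As a rough, self-contained sketch, the version below fills them in with placeholder choices (a single linear layer, cross-entropy loss, and SGD on random data). These choices are assumptions for illustration, not a recommended setup.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 64 samples, 4 features, 3 classes (assumption).
features = torch.randn(64, 4)
labels = torch.randint(0, 3, (64,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

model = nn.Linear(4, 3)            # placeholder model
criterion = nn.CrossEntropyLoss()  # expects logits and integer labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

initial_weight = model.weight.detach().clone()

model.train()
for epoch in range(2):
    for item, label in train_loader:
        output = model(item)
        loss = criterion(output, label)
        optimizer.zero_grad()  # clear gradients from the previous step
        loss.backward()        # compute gradients of the loss
        optimizer.step()       # update the parameters

weights_changed = not torch.equal(initial_weight, model.weight.detach())
```

Wrapping the batch loop in an outer epoch loop passes over the whole training set more than once; with shuffle=True the batches are drawn in a different order each epoch.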
To calculate accuracy, we can use the test_loader.
correct = 0
val_loss = 0
total = 0
with torch.no_grad():
    for item, label in test_loader:
        output = model(item)
        loss = criterion(output, label)
        val_loss += loss.item()  # accumulated loss over all test batches
        _, predicted = torch.max(output, 1)  # index of the highest score per sample
        total += label.size(0)
        correct += (predicted == label).sum().item()
# total is already an integer count, so divide by it directly.
accuracy = correct / total
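The same evaluation logic can be wrapped in a function so that it is reusable and easy to check. The sketch below is one way to do that; the function name evaluate and the toy data are assumptions. It also calls model.eval() before evaluating, which matters for layers such as dropout and batch normalization.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, criterion):
    """Return (average loss per batch, accuracy) over a loader."""
    model.eval()  # put layers such as dropout and batch norm in eval mode
    correct, total, loss_sum = 0, 0, 0.0
    with torch.no_grad():
        for item, label in loader:
            output = model(item)
            loss_sum += criterion(output, label).item()
            predicted = output.argmax(dim=1)
            total += label.size(0)
            correct += (predicted == label).sum().item()
    return loss_sum / len(loader), correct / total


# Toy check: one-hot "logits" passed through an identity model are
# always classified correctly.
labels = torch.arange(10) % 3
inputs = nn.functional.one_hot(labels, num_classes=3).float()
test_loader = DataLoader(TensorDataset(inputs, labels), batch_size=4)

avg_loss, accuracy = evaluate(nn.Identity(), test_loader, nn.CrossEntropyLoss())
print(accuracy)  # 1.0
```

Using an identity model on one-hot inputs gives a case where the expected accuracy is known in advance, which is a quick way to confirm the counting logic before evaluating a real model.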
Summary
- The Dataset class is used to load data into a neural network model.
- The DataLoader class is used to load the data in batches.
- The data can be split into training and testing sets using the random_split function provided by PyTorch.
- The DataLoader class provides options to shuffle the data and load the data in parallel.
- The DataLoader class can be used to iterate over the data in batches when training the model.
- The test split can be used to calculate the accuracy of the model.