
Dataset Creation

The Dataset in DJL represents both the raw data and the data loading process. For this reason, training in DJL usually requires that your data be provided through a dataset class. You can use one of the well-known datasets that are built in, or you can create a custom dataset.

Dataset Helpers

DJL provides a number of helpers that make it easier to create custom datasets. If a helper fits your data, extending it is easier than building the dataset from scratch:

CV

NLP

Tabular

Custom Datasets

If none of the provided datasets meets your requirements, you can write your own dataset as a custom class. Although a dataset technically only needs to implement Dataset, it is better to extend RandomAccessDataset, which manages data randomization and provides comprehensive data loading functionality.

RandomAccessDataset treats your data records as a list in which each record has an index. It then only needs to know how many records there are and how to load each record given its index.
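To make the index-based contract concrete, here is a minimal sketch in plain Java with no DJL dependency. IndexedDataset, LineDataset, and readAll are hypothetical names invented for this illustration and are not DJL classes or methods. The sketch only shows why a record count and a per-index loader are enough to support both sequential and shuffled reading.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical stand-in for an index-addressable dataset; not a DJL class. */
abstract class IndexedDataset<R> {
    /** Loads the record stored at the given index. */
    abstract R get(long index);

    /** Returns the total number of records. */
    abstract long availableSize();

    /** Reads every record once, optionally shuffled, using only the two methods above. */
    List<R> readAll(boolean random, long seed) {
        List<Long> order = new ArrayList<>();
        for (long i = 0; i < availableSize(); i++) {
            order.add(i);
        }
        if (random) {
            // Shuffling the indices is enough; the records themselves are never moved.
            Collections.shuffle(order, new Random(seed));
        }
        List<R> out = new ArrayList<>();
        for (long index : order) {
            out.add(get(index));
        }
        return out;
    }
}

/** Example dataset backed by an in-memory list of text lines. */
class LineDataset extends IndexedDataset<String> {
    private final List<String> lines;

    LineDataset(List<String> lines) {
        this.lines = lines;
    }

    @Override
    String get(long index) {
        return lines.get(Math.toIntExact(index));
    }

    @Override
    long availableSize() {
        return lines.size();
    }
}
```

Because the reading logic touches the data only through the index and the size, the same approach works whether records live in memory, in a file, or behind a remote store.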

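As part of implementing the dataset, two methods must be defined:

Record get(NDManager manager, long index): returns the record (both input data and output label) at a particular index.

long availableSize(): returns the number of records in the dataset.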

In addition, the dataset should have a nested builder class that holds the details of how to load the dataset. The builder extends RandomAccessDataset.BaseBuilder, which gives you a way to modify how RandomAccessDataset loads the data. You can also add your own options to the builder. For an example of what this looks like, see ImageFolder.Builder.
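The nested-builder arrangement can be sketched in plain Java without DJL on the classpath. Every class below (SketchDataset, CsvSketchDataset, and their builders) is a hypothetical stand-in written for this illustration. The names setSampling and self are chosen to resemble the style of a self-typed base builder, but the code should be read as an assumption-laden model of the pattern, not as DJL's actual API.

```java
/** Hypothetical stand-in for a dataset configured through a builder; not a DJL class. */
abstract class SketchDataset {
    final int batchSize;
    final boolean random;

    SketchDataset(BaseBuilder<?> builder) {
        this.batchSize = builder.batchSize;
        this.random = builder.random;
    }

    /** Shared loading options; T is the concrete builder type so setters can chain. */
    abstract static class BaseBuilder<T extends BaseBuilder<T>> {
        int batchSize = 1;
        boolean random = false;

        T setSampling(int batchSize, boolean random) {
            this.batchSize = batchSize;
            this.random = random;
            return self();
        }

        /** Returns this builder typed as the concrete subclass. */
        abstract T self();
    }
}

/** Example dataset that adds its own option (a file path) on top of the shared ones. */
class CsvSketchDataset extends SketchDataset {
    final String path;

    private CsvSketchDataset(Builder builder) {
        super(builder);
        this.path = builder.path;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder extends SketchDataset.BaseBuilder<Builder> {
        String path;

        /** Dataset-specific option that the base builder does not know about. */
        Builder optPath(String path) {
            this.path = path;
            return this;
        }

        @Override
        Builder self() {
            return this;
        }

        CsvSketchDataset build() {
            return new CsvSketchDataset(this);
        }
    }
}
```

With this shape, shared and dataset-specific options chain in any order, for example `CsvSketchDataset.builder().setSampling(32, true).optPath("data.csv").build()`. The self-typed generic parameter is what lets the inherited setter return the concrete builder instead of the base type.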

You can also view this example of creating a new CSV dataset.

Many of the abstract dataset helpers above also extend RandomAccessDataset. When using them, most of the same information applies. You may be asked to implement slightly different methods depending on the particular class you extend. You will also want to extend that class's BaseBuilder instead of the one found in RandomAccessDataset to get the additional data loading options from the helper.

If you create a new dataset class for a public dataset, consider contributing it back to DJL for others to use. You can follow these instructions for adding it.