The Dataset in DJL represents both the raw data and the data loading process. For this reason, training in DJL usually requires that your data be exposed through a dataset class. You can use one of the well-known datasets built into DJL, or you can create a custom dataset.
DJL provides a number of helpers that make it easier to create custom datasets. If a helper fits your data, using it is easier than building the dataset from scratch.
If none of the provided datasets meet your requirements, you can also write your own dataset as a custom class.
While technically the dataset only needs to implement Dataset, it is best to extend RandomAccessDataset instead.
It manages data randomization and provides comprehensive data loading functionality.
RandomAccessDataset treats your data as a list of records in which each record has an index.
It then only needs to know how many records there are and how to load each record given its index.
As part of implementing the dataset, there are two methods that must be defined:
Record get(NDManager manager, long index) - Returns the record (both input data and output label) for a particular index
long availableSize() - Returns the number of records in the dataset
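To make the index-based contract concrete, here is a minimal plain-Java sketch. It deliberately does not use DJL's classes: SimpleRecord, IndexedDataset, InMemoryDataset, and shuffled are hypothetical stand-ins invented for illustration, and the real get method also receives an NDManager. The sketch only shows why a record count plus a per-index loader is enough for the base class to handle randomization itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for a record: one input row plus its label.
final class SimpleRecord {
    final float[] data;
    final int label;

    SimpleRecord(float[] data, int label) {
        this.data = data;
        this.label = label;
    }
}

// Hypothetical stand-in for an index-based dataset (not DJL's RandomAccessDataset).
abstract class IndexedDataset {
    // Loads the record stored at the given index.
    abstract SimpleRecord get(long index);

    // Reports how many records exist.
    abstract long availableSize();

    // The base class can randomize on its own: it shuffles the indices
    // and then loads each record exactly once through get(index).
    List<SimpleRecord> shuffled(long seed) {
        List<Long> indices = new ArrayList<>();
        for (long i = 0; i < availableSize(); i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<SimpleRecord> records = new ArrayList<>();
        for (long i : indices) {
            records.add(get(i));
        }
        return records;
    }
}

// A concrete dataset only has to supply the two methods.
final class InMemoryDataset extends IndexedDataset {
    private final float[][] rows;
    private final int[] labels;

    InMemoryDataset(float[][] rows, int[] labels) {
        if (rows.length != labels.length) {
            throw new IllegalArgumentException("rows and labels must have the same length");
        }
        this.rows = rows;
        this.labels = labels;
    }

    @Override
    SimpleRecord get(long index) {
        int i = Math.toIntExact(index);
        return new SimpleRecord(rows[i], labels[i]);
    }

    @Override
    long availableSize() {
        return rows.length;
    }
}

class IndexedDatasetDemo {
    public static void main(String[] args) {
        InMemoryDataset dataset = new InMemoryDataset(
                new float[][] {{1f, 2f}, {3f, 4f}, {5f, 6f}}, new int[] {0, 1, 0});
        for (SimpleRecord record : dataset.shuffled(42L)) {
            System.out.println(record.data[0] + " -> " + record.label);
        }
    }
}
```

In a real DJL dataset the same two responsibilities apply, with the records built from NDArrays created through the supplied manager.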
In addition, the dataset should also have a nested builder class to contain details on how to load the dataset.
The builder would extend RandomAccessDataset.BaseBuilder.
This provides an avenue to modify how RandomAccessDataset loads the data.
You can also add your own options into the builder.
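The nested-builder idea can be sketched in plain Java as follows. This is an illustration of the pattern only, not DJL code: BaseOptions, DelimitedTextDataset, and optDelimiter are hypothetical names, and the setSampling and self methods are modeled loosely on DJL's builder, so check the API reference for the exact signatures. The self-typed generic parameter is what lets inherited options and your own options be chained in any order.

```java
// Hypothetical stand-in for a base builder that holds shared loading options.
abstract class BaseOptions<T extends BaseOptions<T>> {
    int batchSize = 1;
    boolean shuffle;

    // Subclasses return themselves so chained calls keep the concrete type.
    abstract T self();

    // Shared option inherited by every dataset builder.
    T setSampling(int batchSize, boolean shuffle) {
        this.batchSize = batchSize;
        this.shuffle = shuffle;
        return self();
    }
}

// Hypothetical dataset that is configured only through its nested builder.
final class DelimitedTextDataset {
    final int batchSize;
    final boolean shuffle;
    final String delimiter;

    private DelimitedTextDataset(Builder builder) {
        this.batchSize = builder.batchSize;
        this.shuffle = builder.shuffle;
        this.delimiter = builder.delimiter;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder extends BaseOptions<Builder> {
        // A dataset-specific option added on top of the inherited ones.
        private String delimiter = ",";

        @Override
        Builder self() {
            return this;
        }

        Builder optDelimiter(String delimiter) {
            this.delimiter = delimiter;
            return this;
        }

        DelimitedTextDataset build() {
            return new DelimitedTextDataset(this);
        }
    }
}

class BuilderDemo {
    public static void main(String[] args) {
        DelimitedTextDataset dataset = DelimitedTextDataset.builder()
                .setSampling(32, true)   // inherited option
                .optDelimiter(";")       // custom option
                .build();
        System.out.println(dataset.batchSize + " " + dataset.shuffle + " " + dataset.delimiter);
    }
}
```

Keeping the constructor private and routing all configuration through the builder means new options can be added later without changing how existing callers construct the dataset.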
For an example of what this would look like, see
You can also view this example of creating a new CSV dataset.
Many of the abstract dataset helpers above also extend RandomAccessDataset.
When using them, most of the same information applies.
You may be asked to implement slightly different methods depending on the particular extended class.
You will also want to extend that class's BaseBuilder instead of the one found in RandomAccessDataset to get the additional data loading options from the helper.
If you create a new dataset for a public dataset, consider contributing it back to DJL for others to use. You can follow these instructions for adding it.