Dataset
A dataset (or data set) is a collection of data that is used for training a machine learning model.
Machine learning typically works with three datasets:
- Training dataset
The actual dataset used to train the model. The model learns its parameters, such as weights, from this data.
- Validation dataset
The validation set is used to evaluate a given model during the training process. It helps machine learning
engineers tune hyperparameters at the model development stage.
The model doesn’t learn from the validation dataset, and the validation dataset is optional.
- Test dataset
The test dataset provides the gold standard used to evaluate the model.
It is used only once the model is completely trained.
The test dataset should give a more accurate estimate of how the model will perform on new data.
See Jason Brownlee’s article for more detail.
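The three-way split described above can be sketched in plain Java. This is a minimal illustration, not DJL's own API: the class `DatasetSplitSketch`, its `split` method, and the 80/10/10 ratios are hypothetical names and values chosen for this example. It shuffles the sample indices with a fixed seed and cuts them into disjoint training, validation, and test subsets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not part of DJL): splits sample indices into three disjoint subsets. */
final class DatasetSplitSketch {

    /** Holds the sample indices assigned to each subset. */
    static final class Split {
        final List<Integer> train;
        final List<Integer> validation;
        final List<Integer> test;

        Split(List<Integer> train, List<Integer> validation, List<Integer> test) {
            this.train = train;
            this.validation = validation;
            this.test = test;
        }
    }

    /**
     * Shuffles the indices 0..sampleCount-1 and assigns them to the three subsets.
     * Whatever is left after the training and validation shares goes to the test subset.
     */
    static Split split(int sampleCount, double trainRatio, double validationRatio, long seed) {
        if (sampleCount < 0 || trainRatio < 0 || validationRatio < 0
                || trainRatio + validationRatio > 1.0) {
            throw new IllegalArgumentException("invalid sample count or ratios");
        }
        List<Integer> indices = new ArrayList<>(sampleCount);
        for (int i = 0; i < sampleCount; i++) {
            indices.add(i);
        }
        // A fixed seed makes the split reproducible between runs.
        Collections.shuffle(indices, new Random(seed));

        int trainEnd = (int) Math.floor(sampleCount * trainRatio);
        int validationEnd =
                Math.min(sampleCount, trainEnd + (int) Math.floor(sampleCount * validationRatio));

        return new Split(
                new ArrayList<>(indices.subList(0, trainEnd)),
                new ArrayList<>(indices.subList(trainEnd, validationEnd)),
                new ArrayList<>(indices.subList(validationEnd, sampleCount)));
    }

    public static void main(String[] args) {
        // Assumed ratios: 80% training, 10% validation, remaining 10% test.
        Split s = split(1000, 0.8, 0.1, 42L);
        System.out.println("train=" + s.train.size()
                + " validation=" + s.validation.size()
                + " test=" + s.test.size());
    }
}
```

In this sketch the validation indices would be used while tuning hyperparameters, and the test indices would be touched only once, after training is finished. Setting `validationRatio` to 0 corresponds to the case where the validation dataset is omitted.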
DJL provides a number of built-in basic and standard datasets. These datasets are used to train deep learning models.
This module contains the following datasets:
CV
Image Classification
- MNIST - A small and fast handwritten digits dataset
- Fashion MNIST - A small and fast clothing type detection dataset
- CIFAR10 - A dataset consisting of 60,000 32x32 color images in 10 classes
- ImageNet - An image database organized according to the WordNet hierarchy
Note: You have to manually download the ImageNet dataset due to licensing requirements.
Object Detection
- Pikachu - 1000 Pikachu images of different angles and sizes created using an open source 3D Pikachu model
- Banana Detection - A single-object detection dataset for testing
Other CV
- Captcha - A dataset for a grayscale 6-digit CAPTCHA task
- Coco - A large-scale object detection, segmentation, and captioning dataset that contains 1.5 million object instances
  - You have to manually add the com.twelvemonkeys.imageio:imageio-jpeg:3.11.0 dependency to your project
NLP
Text Classification and Sentiment Analysis
- AmazonReview - A sentiment analysis dataset of Amazon Reviews with their ratings
- Stanford Movie Review - A sentiment analysis dataset of movie reviews and sentiments sourced from IMDB
- GoEmotions - A dataset classifying 50k curated Reddit comments into one of 27 emotion categories or neutral
Unlabeled Text
- Penn Treebank Text - The text (not POS tags) from the Penn Treebank, a collection of Wall Street Journal stories
- WikiText2 - A collection of over 100 million tokens extracted from Good and Featured articles on Wikipedia
Other NLP
Tabular
Time Series