
Dataset

A dataset (or data set) is a collection of data used to train and evaluate a machine learning model.

Machine learning typically works with three datasets:

Training dataset

The dataset that is actually used to train the model. The model learns its parameters (weights and biases) from this data.

Validation dataset

The validation dataset is used to evaluate a given model during the training process. It helps machine learning engineers fine-tune hyperparameters during the model development stage. The model doesn't learn from the validation dataset, and the validation dataset is optional.

Test dataset

The test dataset provides the gold standard used to evaluate the model. It is only used once a model is completely trained. Because the model has never seen this data, the test dataset gives a more accurate estimate of how the model will perform on new data.

See Jason Brownlee’s article for more detail.
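
The three roles described above can be illustrated with a small split helper. This is a minimal sketch in plain Java, not DJL's own API: the class name `DatasetSplit`, the nested `Split` type, and the ratio and seed parameters are all hypothetical names chosen for illustration. It shuffles a copy of the samples with a fixed seed and cuts it into three disjoint parts, so that no sample used for training also appears in validation or test.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not part of DJL): splits samples into train, validation, and test subsets. */
final class DatasetSplit {

    /** Holds the three disjoint subsets produced by a split. */
    static final class Split<T> {
        final List<T> train;
        final List<T> validation;
        final List<T> test;

        private Split(List<T> train, List<T> validation, List<T> test) {
            this.train = train;
            this.validation = validation;
            this.test = test;
        }
    }

    private DatasetSplit() {}

    /**
     * Splits the samples by ratio. Whatever is left after the training and
     * validation portions becomes the test set.
     */
    static <T> Split<T> split(List<T> samples, double trainRatio, double validationRatio, long seed) {
        if (trainRatio < 0 || validationRatio < 0 || trainRatio + validationRatio > 1.0) {
            throw new IllegalArgumentException("ratios must be non-negative and sum to at most 1");
        }

        // Shuffle a copy so the caller's list is untouched; the seed makes the split reproducible.
        List<T> shuffled = new ArrayList<>(samples);
        Collections.shuffle(shuffled, new Random(seed));

        int size = shuffled.size();
        int trainEnd = Math.min(size, (int) Math.floor(size * trainRatio));
        int validationEnd = Math.min(size, trainEnd + (int) Math.floor(size * validationRatio));

        return new Split<>(
                List.copyOf(shuffled.subList(0, trainEnd)),
                List.copyOf(shuffled.subList(trainEnd, validationEnd)),
                List.copyOf(shuffled.subList(validationEnd, size)));
    }
}
```

For example, `DatasetSplit.split(samples, 0.8, 0.1, 42L)` on a list of 100 samples yields 80 training, 10 validation, and 10 test samples. Passing `0.0` as the validation ratio reflects the fact that the validation dataset is optional.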

Basic Dataset

DJL provides a number of built-in basic and standard datasets. These datasets are used to train deep learning models. This module contains the following datasets:

CV

Image Classification

MNIST - A small and fast handwritten digits dataset
Fashion MNIST - A small and fast clothing type classification dataset
CIFAR10 - A dataset consisting of 60,000 32x32 color images in 10 classes
ImageNet - An image database organized according to the WordNet hierarchy

Note: You have to manually download the ImageNet dataset due to licensing requirements.

Object Detection

Pikachu - 1000 Pikachu images of different angles and sizes created using an open source 3D Pikachu model
Banana Detection - A single-object detection dataset intended for testing

Other CV

Captcha - A dataset for a grayscale 6-digit CAPTCHA task
Coco - A large-scale object detection, segmentation, and captioning dataset that contains 1.5 million object instances
- Note: You have to manually add the com.twelvemonkeys.imageio:imageio-jpeg:3.11.0 dependency to your project to use this dataset

NLP

Text Classification and Sentiment Analysis

AmazonReview - A sentiment analysis dataset of Amazon Reviews with their ratings
Stanford Movie Review - A sentiment analysis dataset of movie reviews and sentiments sourced from IMDB
GoEmotions - A dataset of 58k curated Reddit comments, each labeled with one or more of 27 emotion categories or as neutral

Unlabeled Text

Penn Treebank Text - The text (not POS tags) from the Penn Treebank, a collection of Wall Street Journal stories
WikiText2 - A roughly 2 million token subset of the WikiText corpus, extracted from Good and Featured articles on Wikipedia

Other NLP

Stanford Question Answering Dataset (SQuAD) - A reading comprehension dataset with text from Wikipedia articles
Tatoeba English French Dataset - An English-French translation dataset from the Tatoeba Project

Tabular

Airfoil Self-Noise - A 6-feature dataset from NASA tests of airfoils
Ames House Pricing - An 80-feature dataset to predict house prices
Movielens 100k - A 6-feature dataset of movie ratings on 1682 movies from 943 users

Time Series

Daily Delhi Climate
