Creating Your Own Dataset In Hugging Face | Generative AI with Hugging Face | Ingenium Academy

Ingenium Academy
19 Sept 2023 | 10:09

TLDR: This video tutorial from Ingenium Academy guides viewers through working with datasets in Hugging Face, including creating and uploading a custom dataset to the Hugging Face Hub. It covers the installation of the necessary libraries, loading datasets, pre-processing data, and splitting data into training and test sets. The video demonstrates how to extract data from the Reuters 21578 dataset, create JSONL files, and use the 'datasets' library to load and manage data. Finally, it shows the process of sharing the dataset on the Hugging Face Hub, which requires an access token for authentication.

Takeaways

  • 😀 The video teaches how to work with datasets from Hugging Face and how to create and push your own dataset to the Hugging Face Hub.
  • 🛠️ It's essential to install the Transformers, torch, and datasets libraries to access datasets from Hugging Face.
  • 📚 Hugging Face allows loading datasets by calling `load_dataset` with the dataset's path, which returns a `DatasetDict` object.
  • 🔄 Not all datasets have the same structure; some may include only a training set, while others have training, validation, and test splits.
  • 💾 Before loading certain datasets, you might need to install additional packages with `pip install`.
  • 🔄 The script demonstrates data pre-processing, including shuffling and splitting the dataset into training and test sets.
  • 📊 The video uses the Reuters 21578 dataset as an example to show how to create a custom dataset from a machine learning archive.
  • 🔗 It's necessary to decompress and parse the dataset files, such as .sgm files, to extract relevant data like titles and bodies of articles.
  • 📝 The script shows how to save processed articles into JSONL files, which are then used to create training, validation, and test dataset splits.
  • 🌐 The video walks through sharing the created dataset on the Hugging Face Hub using an access token for authentication, and demonstrates how to load the uploaded dataset back from the Hub.

Q & A

  • What is the main focus of the video?

    -The main focus of the video is to teach viewers how to work with datasets from Hugging Face, create their own dataset, and push it to their Hugging Face Hub account.

  • What libraries are suggested to be installed at the beginning of the video?

    -The video suggests installing the 'transformers', 'torch', and 'datasets' libraries to work with Hugging Face datasets.

  • How does Hugging Face allow users to load a dataset?

    -Hugging Face allows users to load a dataset by calling the 'load_dataset' function and providing the path to the dataset.
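
For example, a minimal sketch of that loading step, assuming the libraries are already installed; the dataset path 'fka/awesome-chatgpt-prompts' is chosen here because it matches the 'act'/'prompt' features mentioned below, but the exact path used in the video may differ:

```python
# Install once beforehand: pip install transformers torch datasets
from datasets import load_dataset

# Dataset path chosen to match the 'act'/'prompt' features discussed in the video;
# substitute any dataset path from the Hugging Face Hub.
dataset = load_dataset("fka/awesome-chatgpt-prompts")

print(dataset)  # a DatasetDict showing its splits and their features
```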

  • What are the different features of a dataset in Hugging Face?

    -A dataset loaded from Hugging Face is returned as a dataset dict that contains splits such as 'train', 'test', and 'validation', and each split exposes features specific to that dataset, such as 'act' and 'prompt'.

  • Why might some datasets require additional pip installations?

    -Some datasets require additional pip installations because they have unique requirements or dependencies that need to be met for proper functionality.

  • How can you access individual examples within a Hugging Face dataset?

    -You can access individual examples within a Hugging Face dataset by indexing into the 'train' (or another relevant) split of the dataset and then into the specific example by position.
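
A short sketch, assuming a `DatasetDict` named `dataset` that has a 'train' split:

```python
# Index into a split to get a single example as a plain Python dict
example = dataset["train"][0]
print(example)

# Slicing returns a dict of lists, one list per feature/column
batch = dataset["train"][:3]
print(batch.keys())
```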

  • What is the purpose of shuffling a dataset in the preprocessing section?

    -Shuffling a dataset is done to randomize the order of the data, which helps prevent any natural ordering from influencing the model training process, especially when creating train and test splits.
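
A sketch of that preprocessing step, assuming `dataset` is a loaded `DatasetDict` with only a 'train' split; the train size and seed used in the video may differ:

```python
# Shuffle with a fixed seed for reproducibility, then carve out a test split (80/20)
shuffled = dataset["train"].shuffle(seed=42)
splits = shuffled.train_test_split(train_size=0.8, seed=42)

print(splits["train"].num_rows, splits["test"].num_rows)
```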

  • How does the video demonstrate creating a custom dataset?

    -The video demonstrates creating a custom dataset by downloading and processing the Reuters 21578 dataset, extracting titles and bodies of articles, and then splitting them into train, validation, and test sets.

  • What is the significance of JSONL file format in the context of the video?

    -The JSONL format is significant because each line of the file is a separate JSON object, which makes it straightforward to write train, validation, and test files that can be loaded directly into Hugging Face datasets.
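
A minimal sketch of writing one split to a JSONL file; `train_articles` is a hypothetical list of dicts (e.g. with 'title' and 'body' keys) extracted from the Reuters files, and the same loop is repeated for the validation and test splits:

```python
import json

with open("train.jsonl", "w") as f:
    for article in train_articles:           # hypothetical list of {"title": ..., "body": ...} dicts
        f.write(json.dumps(article) + "\n")  # one JSON object per line
```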

  • How does the video guide users to push their custom dataset to the Hugging Face Hub?

    -The video guides users to push their custom dataset to the Hugging Face Hub by using the 'huggingface_hub' library to log in with an access token and then calling the dataset's 'push_to_hub' method.
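
A sketch of that upload step, assuming `dataset` is the `DatasetDict` built from the JSONL files; the repository name is a placeholder:

```python
from huggingface_hub import notebook_login

notebook_login()  # paste the access token from your Hugging Face account settings

# Pushes all splits to a dataset repository under your account
dataset.push_to_hub("your-username/reuters-articles")
```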

  • What is the role of an access token in uploading a dataset to the Hugging Face Hub?

    -An access token is required for authentication purposes, allowing users to securely push their custom datasets to their Hugging Face Hub account.

Outlines

00:00

😀 Introduction to Hugging Face Datasets

This paragraph introduces the tutorial's focus on working with datasets from Hugging Face. It covers the installation of the necessary libraries (transformers, torch, and datasets) and the process of creating a dataset on the Hugging Face Hub. The speaker demonstrates how to load a dataset using the 'load_dataset' function and explains the structure of a dataset object in Hugging Face, which includes splits like 'train', 'validation', and 'test'. The paragraph also touches on the need to install additional packages for certain datasets and concludes with a demonstration of loading a summarization dataset from Hugging Face.

05:02

📚 Creating and Sharing a Custom Dataset

The second paragraph delves into the process of creating a custom dataset from the Reuters 21578 dataset. It details the steps of downloading the dataset, extracting articles using Beautiful Soup, and preparing the data by splitting it into training, validation, and test sets. The speaker then shows how to save these sets into JSONL files and subsequently load them back into a Hugging Face dataset object. The paragraph concludes with a demonstration of sharing the created dataset on the Hugging Face Hub, which involves logging in with an access token and pushing the dataset to the user's repository.

10:02

🚀 Advanced Dataset Operations

The final paragraph briefly mentions that more advanced operations with datasets, such as fine-tuning models, will be covered in later videos. It serves as a teaser for upcoming content, indicating that the current tutorial is just the beginning of a more comprehensive exploration of working with datasets in the context of machine learning and natural language processing.


Keywords

💡Hugging Face

Hugging Face is a company that provides a platform for developers to build, train, and deploy natural language processing (NLP) models. In the context of the video, Hugging Face is used to access and manipulate datasets, which are crucial for training NLP models. The video demonstrates how to use the Hugging Face platform to load datasets and even create and push a custom dataset to the Hugging Face Hub.

💡Dataset

A dataset in the context of machine learning and NLP is a collection of data used to train models. The video discusses how to work with datasets from Hugging Face, including loading them, creating custom datasets, and manipulating them for model training. The script mentions datasets like 'samsum' (SAMSum) for summarization and 'Reuters 21578' for text classification.

💡Transformers

Transformers are a type of deep learning model architecture that has become the standard for NLP tasks. The video script refers to the 'Transformers' library, which is a collection of pre-trained models and tools provided by Hugging Face. It is used for installing necessary packages to work with datasets and models.

💡Torch

PyTorch, installed and imported as 'torch', is an open-source machine learning library used for applications such as computer vision and natural language processing. In the video, 'torch' is one of the packages installed for working with datasets and models in Hugging Face.

💡Data Pre-processing

Data pre-processing is a crucial step in preparing data for model training. It involves tasks like shuffling, splitting, and cleaning data. The video script describes pre-processing steps such as shuffling a dataset to create a train-test split, which is essential for ensuring that the model learns from a representative sample of the data.

💡Train-Test Split

A train-test split is a method used in machine learning to divide a dataset into a training set and a test set. The video script illustrates how to create a train-test split using a specified train size and random seed, ensuring that the model can be trained on one part of the data and validated on another part to assess its performance.

💡JSON Lines (JSONL)

JSON Lines, or JSONL, is a file format where each line in the file is a separate JSON object. This format is useful for storing datasets where each entry is a self-contained JSON object. The video script describes how to save a dataset in JSONL format, which is then used to create a custom dataset on Hugging Face.

💡Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with parsers to provide ways of navigating, searching, and modifying the parse tree. In the video, Beautiful Soup is used to extract titles and bodies from .sgm files to create a dataset of news articles.
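
A minimal sketch of that extraction for a single .sgm file, assuming the Reuters-21578 markup with TITLE and BODY tags inside each REUTERS element; the filename, encoding, and the video's exact filtering of articles are assumptions:

```python
from bs4 import BeautifulSoup

with open("reut2-000.sgm", encoding="latin-1") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

articles = []
for item in soup.find_all("reuters"):            # html.parser lowercases tag names
    title, body = item.find("title"), item.find("body")
    if title and body:                           # skip articles missing either field
        articles.append({"title": title.get_text(), "body": body.get_text()})

print(len(articles))
```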

💡Hugging Face Hub

The Hugging Face Hub is a platform where users can share their models and datasets. The video script includes a step-by-step guide on how to push a custom dataset to the Hugging Face Hub, making it accessible to others in the community. This is an important aspect of collaborative machine learning and model development.

💡Access Token

An access token is a security credential used to authenticate requests to a service or API. In the context of the video, an access token is required to push a custom dataset to the Hugging Face Hub. The script explains the process of obtaining an access token from the Hugging Face account settings, which is necessary for users to interact with the Hub.
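
Outside a notebook, the 'huggingface_hub' library also accepts the token directly through its `login` function; reading it from an environment variable (the variable name HF_TOKEN below is an example) avoids hard-coding the secret:

```python
import os
from huggingface_hub import login

# Token created under Settings -> Access Tokens on huggingface.co,
# stored in an environment variable rather than pasted into code.
login(token=os.environ["HF_TOKEN"])
```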

Highlights

Learn how to work with datasets from Hugging Face and create your own dataset.

Install necessary libraries: Transformers, torch, datasets.

Load a dataset using the 'load_dataset' function.

Loading a dataset from Hugging Face returns a 'DatasetDict' object.

Some datasets require additional package installations.

Datasets may contain different splits like train, test, and validation.

Demonstration of accessing and manipulating dataset features.

Data preprocessing techniques like shuffling and splitting datasets.

Creating a custom dataset from the Reuters 21578 dataset.

Using 'wget' to download and 'tar' to decompress dataset files.
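
An equivalent of those shell commands in plain Python (the video itself uses `wget` and `tar`); the URL below points at the UCI Machine Learning Repository copy of Reuters-21578 and is an assumption, as the exact source used in the video may differ:

```python
import tarfile
import urllib.request

# Assumed archive location; swap in the URL used in the video if different
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz"

urllib.request.urlretrieve(URL, "reuters21578.tar.gz")
with tarfile.open("reuters21578.tar.gz", "r:gz") as tar:
    tar.extractall("reuters21578")  # extracts the .sgm files
```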

Parsing SGML files to extract titles and bodies of articles.

Splitting the dataset into train, validation, and test sets.

Saving the dataset in JSONL format for Hugging Face compatibility.

Loading the custom dataset using Hugging Face's 'load_dataset' function.
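
A sketch of that step, assuming the three JSONL files written earlier sit in the working directory:

```python
from datasets import load_dataset

# Build a DatasetDict from the JSONL files, one file per split
dataset = load_dataset(
    "json",
    data_files={
        "train": "train.jsonl",
        "validation": "validation.jsonl",
        "test": "test.jsonl",
    },
)
print(dataset)  # DatasetDict with train/validation/test splits
```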

Sharing the custom dataset on the Hugging Face Hub.

Using 'notebook_login' to authenticate and push the dataset to the Hub.

Accessing the uploaded dataset from the Hugging Face Hub.
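
Once pushed, the dataset can be loaded back by its repository name (the name below is the same placeholder used above); private repositories additionally require being logged in:

```python
from datasets import load_dataset

dataset = load_dataset("your-username/reuters-articles")
```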