Training Data

Now let's talk about training data in the context of machine learning.

In machine learning, we need data to train our models, and on top of having data, we need good data. Good has to be defined, of course, but as a general effect, if you put bad data (called garbage) into your model, you're going to get garbage out of your model - meaning your model won't be good.

Training data, cleaning the data, and making sure that it is good for your use case is one of, if not the most critical stage to build a good model.

There are several options to model data, and this will impact the type of algorithms you can use to train your models. We'll cover two main categorizations:

labeled versus unlabeled data, and
structured versus unstructured data.

Labeled vs Unlabeled Data

Labeled Data

Labeled data is data that has both input features and output labels.

Example: Here we have some images of animals, and each image is going to be labeled with the corresponding animal type. In the image we have dogs, and cats

sup-cats-dogs

In this, image is itself input feature and the output label corresponds to what the image is (i.e. Cats or Dogs)

Key characteristics:

When we have labeled data, it enables supervised learning.
The algorithm learns to map inputs to known outputs like:
- We teach the algorithm that "this image should have a predicted value of dog, and we know it's a dog because we've labeled it"

Unlabeled Data

Unlabeled data only includes input features without any output labels.

Example: A collection of images without any associated labels:

We have images (say, four cats and two dogs)
We don't tell the algorithm "this is a dog" or "this is a cat"
The algorithm must figure out that there is such thing as a cat and such thing as a dog.

unsup-cats-dogs

Key characteristics:

Enables unsupervised learning when there are no labels
It is more complicated than supervised learning
The algorithm finds patterns between things or structures in the data and groups them together
It is used when you have so much data that it's very costly or simply impossible to label everything. (It is when you have too much unlabelled data)

alt text

This is why in the field of machine learning, we have algorithm for both of the use case of labeled and unlabeled data.

Structured vs Unstructured Data

Structured Data

Structured data is organized in a structured format, usually in rows and columns, just like in Microsoft Excel.

Tabular Data:

Data is arranged in a table with rows representing records and columns representing features
For example: Columns with Customer_ID, Name, Age, Purchase_Amount and with the rows

Time Series Data

Data points collected or recorded at successive points in time
Example: Stock price of a company over time
You can have time series data in tabular format or simply two columns (Date and Stock Price)

In both of the cases of time series and tabular data, it is very easy to read it and very easy to structure the data.

Unstructured Data

Unstructured data is the data that doesn't follow a specific structure and is usually text-heavy or multimedia content.

Example 1: Text Data You have:

Articles online
Social media posts
Customer review on your business

Then this data is considered unstructured data.

For Example: Here is the review of a Yoga Class

alt text

This is long text with no structure except the fact that it is just a long text.

Example 2: Image Data

Image data is unstructured data
This is just pixels with no organized structure beyond the pixel data itself

So both of these: image and text data are unstructured with no specific organizational structure. We have specific type of algorithms to deal with this data.

Summary

Now we've learned about labeled and unlabeled data, the necessity of having good data for ML algorithms, and discussed structured and unstructured data. These concepts form the foundation for understanding how different types of data require different algorithmic approaches in machine learning.

Keyboard shortcuts

AWS AI Practitioner Study Guide