What is semi-supervised learning? Semi-supervised learning is a broad category of machine learning methods that make use of both labeled and unlabeled data; as its name implies, it is a combination of supervised and unsupervised learning methods.
This tutorial offers a gentle introduction to semi-supervised learning in machine learning. Let’s get started.
What Is Semi-supervised Learning?
In a nutshell, semi-supervised learning (SSL) is a machine learning technique that uses a small portion of labeled data and lots of unlabeled data to train a predictive model.
To understand the idea behind SSL, it helps to view it through the lens of its two better-known relatives: supervised learning, which requires every training example to carry a label, and unsupervised learning, which uses no labels at all.
How Does Semi-supervised Learning Work?
Imagine that you want to train a model using a sizable collection of unlabeled data. Manually labeling all of this data will probably cost you a fortune, in addition to taking months to complete. That is where the semi-supervised machine learning approach comes to the rescue.
The operating principle is quite straightforward. Instead of labeling the entire dataset, you hand-label a small portion of it, use that portion to train a model, and then apply the model to the vast amount of unlabeled data.
Self-training
Self-training is one of the simplest examples of semi-supervised learning.
Self-training is a procedure in which you take any supervised method for classification or regression and modify it to work in a semi-supervised manner, taking advantage of both labeled and unlabeled data. The typical process is as follows.
- You choose a modest amount of labeled data, such as images showing cats and dogs with their respective tags, and you use this dataset to train a base model with the help of ordinary supervised methods.
- Then you apply the process known as pseudo-labeling: you take the partially trained model and use it to make predictions for the rest of the dataset, which is still unlabeled. The resulting labels are called pseudo-labels because they are produced from the originally labeled data, which has its limitations (say, an uneven representation of classes, such as more dogs than cats, resulting in bias).
- From this point, you take the model’s most confident predictions (for instance, you may require a confidence level of over 80% that an image depicts a cat rather than a dog). Any pseudo-labels above this confidence threshold are added to the labeled dataset, creating a new combined input for training an improved model.
- There may be several iterations of the process, with each iteration adding more pseudo-labels. Assuming the data is appropriate for the process, the model’s performance keeps improving with each iteration; a minimal code sketch follows this list.
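Under the assumption that scikit-learn is available, a minimal self-training sketch might look like the following; the logistic-regression base model, the roughly 5% labeled fraction, and the 80% confidence threshold are illustrative choices rather than fixed parts of the method:

```python
# Minimal self-training sketch with scikit-learn (illustrative assumptions:
# logistic-regression base model, ~5% labeled data, 80% confidence threshold).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Hide most labels: scikit-learn marks unlabeled points with -1.
rng = np.random.RandomState(0)
y_train = y.copy()
y_train[rng.rand(len(y)) > 0.05] = -1

# Pseudo-labels are accepted only above the 80% confidence threshold,
# and the self-training loop repeats for up to 10 iterations.
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.8, max_iter=10)
model.fit(X, y_train)

print("Accuracy against the hidden true labels:", model.score(X, y))
```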
Although self-training has been used successfully in some cases, it should be noted that performance can differ significantly from dataset to dataset, and there are many instances where self-training results in lower performance than the purely supervised path.
Co-training
Co-training, an improved variant derived from the self-training approach, is another semi-supervised learning technique used when only a small portion of labeled data is available. Unlike self-training, co-training trains two individual classifiers based on two views of the data.
The views are distinct sets of features that describe the same instance and are ideally independent of one another given the class. Each view is also sufficient on its own: either set of features can reliably predict the class of a sample.
According to the original co-training research paper, the approach can be used effectively for tasks like classifying web content. Each web page can be described in two ways: by the words that appear on the page itself and by the anchor text of links that point to it.
Here is a brief explanation of how co-training operates.
- With the help of a small amount of labeled data, you first train a unique classifier (model) for each view.
- The larger pool of unlabeled data is then added and assigned pseudo-labels.
- The pseudo-labels with the highest confidence are used to co-train the classifiers: data confidently pseudo-labeled by the first classifier is used to update the second classifier, and vice versa. This way, if the first classifier confidently predicts the real label for a sample that the second classifier gets wrong, the second classifier learns from it.
- The combined predictions from the two updated classifiers form the classification result in the final step.
Co-training, like self-training, involves numerous iterations that build up an additional labeled training dataset from the enormous amounts of unlabeled data. A simplified sketch of the procedure follows.
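To make the procedure concrete, here is a deliberately simplified co-training sketch; the two feature “views” are just halves of a synthetic feature matrix, and the 95% acceptance threshold and five rounds are assumptions for illustration rather than a faithful reproduction of the original Blum and Mitchell algorithm:

```python
# Simplified co-training sketch: two classifiers trained on two disjoint
# feature "views" exchange confident pseudo-labels through a shared label set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=1000, n_features=20, random_state=0)
view1, view2 = X[:, :10], X[:, 10:]       # two views: disjoint feature subsets

rng = np.random.RandomState(0)
labels = np.full(len(y_true), -1)          # -1 marks "unlabeled"
seed = rng.rand(len(y_true)) < 0.05        # start with ~5% labeled points
labels[seed] = y_true[seed]

clf1, clf2 = LogisticRegression(), LogisticRegression()

for _ in range(5):                         # a few co-training rounds
    known = labels != -1
    clf1.fit(view1[known], labels[known])
    clf2.fit(view2[known], labels[known])

    # Each classifier pseudo-labels the pool; its confident picks are added
    # to the shared labeled set, so the *other* classifier learns from them.
    for clf, view in ((clf1, view1), (clf2, view2)):
        pool = np.flatnonzero(labels == -1)
        if len(pool) == 0:
            break
        proba = clf.predict_proba(view[pool])
        confident = proba.max(axis=1) > 0.95
        labels[pool[confident]] = proba[confident].argmax(axis=1)

# Final prediction combines the two updated classifiers' probabilities.
pred = (clf1.predict_proba(view1) + clf2.predict_proba(view2)).argmax(axis=1)
print("Accuracy vs. hidden true labels:", (pred == y_true).mean())
```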
SSL With Graph-based Label Propagation
A popular way to run SSL is to represent labeled and unlabeled data in the form of a graph and then apply a label propagation algorithm, which spreads the few human-made annotations across the entire data network.
Picture a network of data points, most of them unlabeled, with four of them carrying labels (two red and two green points representing two classes). The task is to spread these colored labels across the network. One way to do this is to pick a point in the network, say point 4, and count all the walks that lead from it to each labeled node. Doing so, you find four walks leading to green points and five walks leading to red points, so we infer that point 4 falls into the red category. The procedure is then repeated for every point in the graph; a minimal sketch follows.
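As a rough illustration, scikit-learn ships a LabelPropagation estimator that builds the graph from a k-nearest-neighbor (or RBF) kernel rather than from hand-drawn edges; the two-moons dataset, the two labeled seeds per class, and the neighborhood size below are assumptions for the sketch:

```python
# Minimal label-propagation sketch with scikit-learn. Only two points per
# class carry labels (like the two red and two green points above); the
# algorithm spreads those labels across the rest of the graph.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

rng = np.random.RandomState(0)
labels = np.full(len(y), -1)               # -1 marks unlabeled points
for cls in (0, 1):
    seeds = rng.choice(np.flatnonzero(y == cls), size=2, replace=False)
    labels[seeds] = cls

model = LabelPropagation(kernel="knn", n_neighbors=7)
model.fit(X, labels)

print("Accuracy against the hidden true labels:", model.score(X, y))
```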
Personalization and recommender systems are examples of where this technique is used in practice. Label propagation lets you forecast a customer’s interests based on what is known about other customers. Here, we can rely on a variation of the continuity assumption: for instance, it is very likely that two people who are connected on social media will have similar interests.
Benefits And Limitations Of Semi-supervised Learning
The practical difficulties associated with data labeling serve as the inspiration for techniques that make use of unlabeled data.
Businesses and professionals spend large amounts of time and money every year labeling datasets for machine learning. Unlabeled data, meanwhile, sits around unused: while it is usually inexpensive and simple to collect, it is difficult to get good results from it without labels. If it is possible to avoid manual data labeling while still getting good results, machine learning practitioners can save priceless resources that would otherwise be wasted.
Given two datasets of the same size, a supervised learning task with a fully labeled dataset will unquestionably train a better model than one where a portion of the points is unlabeled. When labels are scarce and unlabeled data is abundant, however, semi-supervised learning is effective: the model is exposed to situations similar to those it might face during deployment without the need to spend time and resources labeling a huge number of additional examples.
Some of the most potent applications are found where data labeling is challenging.
Labeling data is especially laborious and time-consuming, and can require domain knowledge, in many NLP tasks such as webpage classification, speech analysis, or named-entity recognition, as well as in less conventional machine learning applications such as protein sequence classification. Here, dataset engineering is more effective when it uses as much unlabeled data as possible.
Whenever unlabeled data is convenient to collect, combining it with the labeled dataset can meaningfully improve model performance.
Semi-supervised Learning Techniques
Now let’s introduce some semi-supervised learning implementations.
Consistency Regularization
Consistency regularization is used primarily as a way to benefit from the continuity and cluster assumptions.
Let’s say we have a dataset in the semi-supervised setting that contains examples of two classes that are both labeled and unlabeled.
We handle labeled and unlabeled data points differently during training: for labeled data points, we optimize using conventional supervised learning, calculating loss by comparing our prediction to our label; for unlabeled data points, we want to enforce that similar data points have similar predictions on our low-dimensional manifold.
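A minimal PyTorch sketch of one such training step is shown below; the tiny network, the Gaussian-noise perturbation, and the mean-squared-error consistency term are illustrative assumptions (published methods such as the Pi-Model or Mean Teacher differ in their details):

```python
# One consistency-regularization training step: supervised cross-entropy on
# the labeled batch plus a consistency penalty on perturbed unlabeled points.
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x_labeled, y_labeled, x_unlabeled, lam=1.0):
    # Supervised term: ordinary cross-entropy on the labeled batch.
    sup_loss = F.cross_entropy(model(x_labeled), y_labeled)

    # Consistency term: a point and its perturbed copy should get similar predictions.
    noisy = x_unlabeled + 0.1 * torch.randn_like(x_unlabeled)
    p_clean = F.softmax(model(x_unlabeled), dim=1)
    p_noisy = F.softmax(model(noisy), dim=1)
    cons_loss = F.mse_loss(p_noisy, p_clean.detach())

    loss = sup_loss + lam * cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example batch: 8 labeled points with labels, 32 unlabeled points.
loss = train_step(torch.randn(8, 2), torch.randint(0, 2, (8,)), torch.randn(32, 2))
```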
Pseudo-labeling
Pseudo-labeling is where, during training, the model’s own predictions are converted into “one-hot” labels.
As an example, take a data point from our classifier for the moons dataset that the model predicts to be blue with a probability of 0.75.
All confident model predictions are converted into “one-hot” vectors, where the most confident class becomes the label. From this, we train on the new “one-hot” probability distribution as a pseudo-label.
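In code, this thresholding-and-one-hot step can be expressed as a small loss term; the 0.8 confidence threshold below is an assumption, and full recipes such as FixMatch add strong data augmentation and a confidence ramp-up on top:

```python
# Small pseudo-labeling loss sketch (assumed 0.8 confidence threshold).
import torch
import torch.nn.functional as F

def pseudo_label_loss(logits_unlabeled, threshold=0.8):
    probs = F.softmax(logits_unlabeled, dim=1)
    conf, pseudo = probs.max(dim=1)          # most confident class per point
    mask = conf > threshold                  # keep only confident predictions
    if not mask.any():
        return logits_unlabeled.new_zeros(())
    # Train against the hard ("one-hot") pseudo-labels with cross-entropy.
    return F.cross_entropy(logits_unlabeled[mask], pseudo[mask])
```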
Beyond simply supplying artificial labels, training on pseudo-labels also reduces entropy, which encourages the model’s predictions to be highly confident on unlabeled data points. In a similar vein, by assuming some predictions to be accurate, we avoid deducing broad principles about the true data distribution (inductive learning). Pseudo-labels instead provide a method of transductive learning, in which we infer labels for the specific unlabeled points at hand rather than a general rule that transfers to unseen data.
Semi-supervised Learning Examples
Because the amount of data keeps growing exponentially, it is impossible to label all of it in a timely manner. Consider an active TikTok user who regularly uploads up to 20 videos per day; now multiply that by 1 billion active users. Semi-supervised learning in this situation has a wide range of applications, including the classification of text documents, web content, and speech.
Speech Recognition
Because labeling audio requires a lot of time and resources, semi-supervised learning can be used to get around this problem and improve performance. Facebook (now Meta) successfully used semi-supervised learning, specifically the self-training method, to enhance its speech recognition models. Their starting point was a base model trained on 100 hours of human-annotated audio data. Another 500 hours of unlabeled speech data were then added to improve the models further. The results showed a significant improvement, with the word error rate (WER) declining by 33.9 percent.
Web Content Classification
With billions of websites out there offering all different types of content, classifying the information on web pages by assigning corresponding labels would require a large team of human annotators. To enhance the user experience, different forms of semi-supervised learning are used instead to annotate and categorize web content. Many search engines, including Google, apply SSL to their ranking component in order to better understand human language and how well potential search results fit a query. Google Search uses SSL to locate the material that is most relevant to a given user query.
Text Document Classification
Building a classifier for text documents is another instance where semi-supervised learning can be used effectively. The technique works well here because it is very challenging for human annotators to read through numerous long texts just to assign a basic label, such as a type or genre.
For instance, a classifier can be constructed on top of deep learning architectures such as LSTM (long short-term memory) networks, which can find long-term dependencies in data and retain past knowledge over time. Training a neural network typically takes a lot of data, both labeled and unlabeled. A small number of text examples can be manually labeled to train a base LSTM model, which is then applied to a much larger number of unlabeled samples; this is why a semi-supervised learning framework fits so well. A rough sketch of such a base model appears below.
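As a hedged illustration, here is a small LSTM classifier in PyTorch followed by the pseudo-labeling step; the vocabulary size, sequence length, class count, and the 0.9 confidence threshold are assumptions, and the pseudo-labeling loop would then proceed as in the self-training sketch earlier:

```python
# A base LSTM text classifier plus a confident pseudo-labeling pass
# (assumed vocabulary size, embedding size, and number of classes).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, EMBED_DIM, HIDDEN, NUM_CLASSES = 20_000, 128, 64, 4  # assumed sizes

class LSTMClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN, batch_first=True)  # long-term dependencies
        self.head = nn.Linear(HIDDEN, NUM_CLASSES)

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.head(h_n[-1])          # classify from the final hidden state

model = LSTMClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on the small hand-labeled subset (dummy token ids here).
x_labeled = torch.randint(0, VOCAB_SIZE, (100, 200))
y_labeled = torch.randint(0, NUM_CLASSES, (100,))
loss = F.cross_entropy(model(x_labeled), y_labeled)
loss.backward()
optimizer.step()

# Pseudo-label the large unlabeled pool, keeping only confident predictions.
x_unlabeled = torch.randint(0, VOCAB_SIZE, (1000, 200))
with torch.no_grad():
    probs = F.softmax(model(x_unlabeled), dim=1)
confident = probs.max(dim=1).values > 0.9
pseudo_labels = probs[confident].argmax(dim=1)
```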
SALnet, a text classifier built by researchers at Yonsei University in Seoul, South Korea, shows how well the SSL approach works for tasks like sentiment analysis.
Applications Of Semi-supervised Learning
Industry adoption of semi-supervised learning models is rising. Here are a few of the main applications.
- Speech analysis: This is the most well-known illustration of a semi-supervised learning application. Since labeling audio data is laborious and requires a lot of human resources, the problem lends itself naturally to a semi-supervised learning model.
- Web content classification: Labeling every page on the internet is important yet practically impossible, since it requires some level of human intervention. Semi-supervised learning algorithms can help solve this issue; Google, for example, employs them to rank a website for a particular query.
- Protein sequence classification: DNA strands are typically very long, so labeling them necessitates active human involvement. Semi-supervised models have therefore become a natural fit in this field.
- Text document classification: As we all know, it is very difficult to find a large amount of labeled text data, so semi-supervised learning is a natural way around the problem.
When To Use And Not Use Semi-supervised Learning?
Semi-supervised learning exhibits promising results in classification tasks with little labeled data and lots of unlabeled data while leaving the door open for other ML tasks. Basically, the method can be applied to almost any supervised algorithm with the necessary adjustments. Additionally, if the data matches the profile, SSL works well for clustering and anomaly detection as well. Semi-supervised learning is still a young field, but it has already been shown to be successful in a variety of applications.
That said, not all tasks are suitable for semi-supervised learning. The method might not work if the labeled sample isn’t representative of the distribution as a whole. Consider the situation where you must categorize pictures of colored objects that look different from every angle: unless you have a significant amount of labeled data, the results will be inaccurate. Semi-supervised learning also isn’t the best option if a lot of labeled data is already available. Whether you like it or not, supervised learning won’t go away anytime soon, because many real-world applications still require a ton of labeled data.
Takeaways
In a generic semi-supervised algorithm, given a dataset of labeled and unlabeled data, examples are handled one of two different ways:
- Labeled data points are handled as in conventional supervised learning; predictions are made, losses are computed, and network weights are updated by gradient descent.
- Unlabeled data points help the model make more reliable and consistent predictions. They augment the labeled examples, either through an additional unsupervised loss term (as in consistency regularization) or through pseudo-labels.