In March 2023, OpenAI released GPT-4, the successor to the famous GPT-3 model. It has many impressive features, such as longer context windows and more advanced reasoning, natural language generation and understanding skills, but one of the most surprising additions is the ability to process image input. Indeed, GPT-4 is multimodal, meaning it can generate content based on both image and text inputs.
What is multimodal learning? Why would we do this? How can we leverage and combine information from different modalities of data that are represented in completely different formats (pixels, waveforms, sentences…)? These are all questions that came to my mind when I first heard of multimodal learning.
As human beings, we experience the world in different ways: we see, we hear, we feel, we taste and smell things. Information comes to us in a variety of flavours and we use all those impressions to derive the context around us. When biking through the city, for example, we use not only our vision but also our hearing to get safely through traffic. The same goes for the social interactions we have with our peers: we listen to what they say, we listen to the tone of their voice, and we look at body language and facial expressions to get a better understanding of the message they actually want to convey.
This has inspired the field of multimodal learning, which aims at processing and integrating information from different modalities. By leveraging information from several modalities, a model can handle noisy, incomplete or ambiguous data more effectively, which can improve accuracy and performance in a variety of tasks.
Some of these multimodal tasks include:
- Medical diagnosis based on medical images (X-ray, MRI …) and patient data such as medical history and current symptoms
- Autonomous vehicles using cameras, radar and lidar data combined
- Social media analysis: combine image, text and video to detect for example fake news or perform sentiment analysis
- Sarcasm/sentiment detection: combine speech data with visual data
- Image captioning: describe images in natural language text
Unfortunately, combining these different sources of information is not straightforward. One of the main challenges is finding an appropriate representation for each modality and ensuring that the representations can be aligned and translated into a common feature space. Another challenge is the fusion and co-learning of the modalities, which involves deciding how the different types of information should be combined to produce the best result. Scalability is also a concern, as multimodal learning often requires large amounts of data to train models effectively. Finally, interpretability can be a challenge, as it can be difficult to understand and disentangle how the different modalities contribute to the final output.
Let’s now dive into the five core challenges: representation, translation, alignment, fusion and co-learning.
How can we represent the different modalities?
Multimodal learning involves combining information from highly diverse sources such as images (represented as pixels), audio (represented as waveforms) and text (represented as sequences of words). The representation has to capture all relevant information from each modality in a form that allows a machine learning model to make sense of it and gain insights from it.
There are two main approaches, namely joint and coordinated representation. Joint representation embeds all modalities in one single vector representation in the same space. Coordinated representation, on the other hand, represents each modality separately, but ensures that the representations work together towards a common goal. This involves alignment techniques such as attention mechanisms or cross-modal embedding techniques to align the representations across modalities.
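To make the difference more concrete, here is a minimal sketch in plain Python. The feature values and weight matrices are made up for illustration (in practice they would be learned), and cosine similarity is only one possible way to compare coordinated representations.

```python
import math

def linear(vec, weights):
    """Multiply a vector by a weight matrix given as a list of output rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy features (made up): a 3-d image feature and a 2-d text feature.
image_feat = [1.0, 0.0, 2.0]
text_feat = [0.5, 1.5]

# Joint representation: concatenate both modalities, then project them
# into ONE shared vector. The weights are hypothetical, not learned.
W_joint = [[0.1, 0.2, 0.3, 0.4, 0.5],
           [0.5, 0.4, 0.3, 0.2, 0.1]]
joint_vec = linear(image_feat + text_feat, W_joint)

# Coordinated representation: each modality keeps its own projection,
# and the two resulting vectors are compared in the shared space.
W_image = [[1.0, 0.0, 0.5],
           [0.0, 1.0, 0.5]]
W_text = [[2.0, 0.0],
          [0.0, 1.0]]
image_vec = linear(image_feat, W_image)
text_vec = linear(text_feat, W_text)
similarity = cosine(image_vec, text_vec)

print("joint vector:", joint_vec)
print("coordinated similarity:", round(similarity, 3))
```

In the joint case the downstream model only ever sees `joint_vec`. In the coordinated case the two vectors stay separate, and training would push `similarity` up for matching image-text pairs and down for mismatched ones.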
How can we translate information between different modalities?
The translation problem refers to the challenge of translating information between different modalities, such as converting auditory signals (sound) to linguistic signals (text).
This problem can be tackled, for instance, by leveraging encoder-decoder models or attention mechanisms to map the features from one modality to the other. In the case of an image captioning task, the model needs to translate the visual information in the image into a textual format that captures the semantic context of the image. To do this, the model can use an encoder to generate a visual representation of the image and a decoder to generate a natural language description based on that encoded representation.
How do we align information from different modalities?
Alignment is important because it allows the model to identify relationships between the different modalities and use this information to improve its predictions. Let’s say we want to detect emotions in a video. We have visual information in the form of frames and audio information in the form of sound waves. The model needs to align the visual information with the audio information so that it can identify relationships between audio and video and detect the correct emotions. This is often done with an attention mechanism, which enables the model to identify which parts of the input modalities correspond to each other and to assign higher weights to the parts that are more relevant for the prediction. For example, the model might attend to specific frames of the video and specific segments of the audio signal that are relevant to the emotion being predicted.
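As a rough illustration of the idea, the sketch below lets one audio segment attend over a handful of video frames using scaled dot-product attention. The embeddings are invented for the example and the function name `cross_modal_attention` is my own; a real model would learn the embeddings and usually apply separate query, key and value projections first.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(query, keys, values):
    """Attend from one modality (query) over another (keys and values)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The context vector is the attention-weighted average of the values.
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Hypothetical embeddings: one audio segment and three video frames.
audio_segment = [1.0, 0.0]
video_frames = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

weights, context = cross_modal_attention(audio_segment, video_frames, video_frames)
best_frame = weights.index(max(weights))

print("attention weights:", [round(w, 3) for w in weights])
print("most relevant frame:", best_frame)
```

The frame whose embedding points in the same direction as the audio segment receives the largest weight, which is the "higher weights for the relevant parts" behaviour described above.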
How can we join information from multiple modalities to perform classification or regression?
Commonly used data fusion techniques include early fusion, late fusion and intermediate fusion.
In early fusion, the inputs from different modalities are combined at the beginning of the model architecture. In other words, the features extracted from the different modalities are concatenated and then fed as input to the first layer of the model. The main advantage is that this approach allows the model to capture the interactions between modalities from the very beginning. Additionally, it simplifies the learning process, since all modalities are trained together in a single model. A drawback, however, is that the input vector can become too large to handle effectively, resulting in computational overhead. To improve early fusion’s performance, techniques such as principal component analysis (PCA) and canonical correlation analysis (CCA) are often used. PCA can be used to reduce the dimensionality of the high-dimensional multimodal data while preserving the most important information. CCA, on the other hand, can be used to find linear combinations of features that are most correlated across different modalities.
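A minimal sketch of the concatenation step could look like this. The feature values are invented, and the per-feature standardization is an assumption on my side (it keeps a modality with large raw values from dominating the fused vector); the PCA or CCA step mentioned above is not shown.

```python
def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = []
    for col, mean in zip(columns, means):
        std = (sum((x - mean) ** 2 for x in col) / n) ** 0.5
        stds.append(std if std > 0 else 1.0)  # avoid dividing by zero
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def early_fusion(*modalities):
    """Concatenate the per-sample features of every modality into one vector."""
    scaled = [standardize(rows) for rows in modalities]
    return [sum(sample_parts, []) for sample_parts in zip(*scaled)]

# Toy data (made up): three samples, each with 2 image features and 1 audio feature.
image_feats = [[0.0, 10.0], [2.0, 30.0], [4.0, 20.0]]
audio_feats = [[100.0], [300.0], [200.0]]

fused = early_fusion(image_feats, audio_feats)
for row in fused:
    print([round(x, 3) for x in row])
```

Each row of `fused` is the single input vector that the first layer of the model would receive for one sample.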
In late fusion, the actual fusion of the different modalities happens at prediction time. The different modalities are first processed by separate unimodal networks. The learned representations are then combined at a later stage and fed into the final network layer to make a prediction. This approach resembles an ensemble classifier. The fusion mechanism at the end can be voting, a weighted sum or an ML approach. Since each modality is processed separately in its own unimodal network, these networks can leverage more specialised feature extraction methods, potentially leading to a better representation of each modality. However, the individual processing makes this approach quite computationally expensive, since multiple separate models need to be trained. Lastly, since every modality is trained individually, some correlations between modalities might not be captured, which discards potentially important information.
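The two simplest fusion mechanisms mentioned above, voting and a weighted sum, can be sketched as follows. The class probabilities and the modality weights are hypothetical; in a real system they would come from trained unimodal models and from validation performance.

```python
from collections import Counter

def weighted_sum_fusion(prob_lists, weights):
    """Combine per-modality class probabilities with a weighted average."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]

def majority_vote(prob_lists):
    """Let each modality vote for its most likely class."""
    votes = [probs.index(max(probs)) for probs in prob_lists]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical outputs of three unimodal classifiers over the classes
# [negative, neutral, positive].
text_probs = [0.2, 0.5, 0.3]
audio_probs = [0.1, 0.3, 0.6]
video_probs = [0.2, 0.2, 0.6]
all_probs = [text_probs, audio_probs, video_probs]

# Assumption: the audio model is trusted twice as much as the others.
fused = weighted_sum_fusion(all_probs, weights=[1.0, 2.0, 1.0])
fused_class = fused.index(max(fused))
voted_class = majority_vote(all_probs)

print("fused probabilities:", [round(p, 3) for p in fused])
print("weighted-sum class:", fused_class, "| majority-vote class:", voted_class)
```

Note that the text model on its own would have predicted "neutral"; the other two modalities overrule it in both fusion schemes.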
Intermediate fusion is the most flexible method, as it allows the different modalities to be fused at different levels of the model network. Each modality is first processed separately and the extracted representations are then combined at some intermediate layer in the model. The main challenge here lies in determining the optimal combination of modalities and the layers at which they are merged.
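As a toy illustration, the forward pass below gives each modality its own small layer and merges the hidden representations before a shared layer. All weights are hand-picked for readability rather than learned, and merging by concatenation after exactly one layer is just one of many possible design choices.

```python
def dense_relu(vec, weights, bias):
    """One fully connected layer followed by a ReLU activation."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, vec)) + b)
        for row, b in zip(weights, bias)
    ]

# Toy inputs (made up): a 2-d image feature and a 1-d text feature.
image_input = [1.0, -1.0]
text_input = [0.5]

# Modality-specific layers: each modality is processed separately first.
image_hidden = dense_relu(image_input, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
text_hidden = dense_relu(text_input, [[2.0], [-2.0]], [0.0, 0.0])

# Fusion at an intermediate layer: concatenate the hidden representations.
fused_hidden = image_hidden + text_hidden

# Shared layer on top of the fused representation.
output = dense_relu(fused_hidden, [[0.5, 0.5, 0.5, 0.5]], [0.0])

print("fused hidden representation:", fused_hidden)
print("output:", output)
```

Moving the concatenation earlier turns this into early fusion, and moving it to the very end turns it into late fusion, which is why this variant is the most flexible of the three.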
How can we transfer knowledge between modalities?
Co-learning is about transferring knowledge between modalities by incorporating external data. The idea behind co-learning is to use one modality to help learn the other modality, which can be especially useful when one modality has limited resources, noisy input, or unreliable labels. By incorporating external data, co-learning can help improve the accuracy and robustness of the learning process. There are various techniques used in co-learning such as transfer learning, multi-task learning, and domain adaptation.
In conclusion, multimodal learning is an exciting and challenging field of research that aims to combine information from different modalities in a way that allows machines to understand and learn from the world around us in the same way humans do. Multimodal learning has the potential to enhance many applications such as image and speech recognition, natural language processing, and robotics. However, there are still many challenges to overcome in this field, including representation, translation, alignment, fusion, co-learning, scalability, and interpretability. A powerful example of the potential of multimodal learning is GPT-4 which, in contrast to GPT-3, can process both text and image input. However, the specifics of how GPT-4 achieves this multimodality are yet to be made public. As researchers continue to explore these challenges, we can expect to see even more innovative and advanced multimodal learning systems in the future.