Pretty neat explainer by the researchers themselves of Apple's multi-modal 4M-21 model from just a few days ago. What's neat and new is the combination of 4 different tokenisers to achieve a total of 21 different modalities, including text, images, edges, color palettes, metadata, geometric data and feature maps.

1. One of the standout features is that each modality maps to every other, enabling steerable multi-modal generation: the model can generate outputs in one modality based on inputs from any other.

2. This naturally works the other way around as well for multi-modal retrieval: the model can predict global embeddings across all other modalities from any input modality, which allows for very versatile retrieval, e.g. image search based on text descriptions.

3. Equally, related vision tasks are handled with great precision, including surface normal estimation, depth estimation, semantic segmentation and 3D human pose estimation.

4. All of this comes in different sizes, the smallest model being just 198M parameters (think local model on your iPhone!), the biggest one with 2.8B parameters.

All of this seems to perform incredibly well compared to much larger and more specialised models:
https://storage.googleapis.com/four_m_site/videos/4M-21_Website_Video.mp4

Already published as open source on GitHub:
https://github.com/apple/ml-4m
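To make point 2 concrete: once every modality can be projected into one shared global embedding space, cross-modal retrieval is just nearest-neighbour search in that space. Here's a minimal sketch of that retrieval step, with random vectors standing in for real embeddings; the shapes, names and numbers are made up for illustration and this is not Apple's actual ml-4m API.

```python
import numpy as np

# Hypothetical setup: stand-ins for global embeddings that, in 4M-21, would be
# predicted from any input modality (text, image, depth, ...). Random here.
rng = np.random.default_rng(0)
dim = 8

# Fake "database" of 5 image embeddings.
image_db = rng.normal(size=(5, dim))

# Fake "text query" embedding, deliberately placed close to image 3,
# mimicking a caption that describes that image.
query = image_db[3] + 0.01 * rng.normal(size=dim)

def retrieve(query, db):
    """Return database indices ranked by cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = d @ q           # cosine similarity of query vs. every db entry
    return np.argsort(-sims)

ranking = retrieve(query, image_db)
print(ranking[0])  # index of the best-matching image, here 3
```

Because the embedding space is shared, the same function works in any direction: an image query against a database of text embeddings, a depth map against images, and so on.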