Modern AI Models for Vision and Multimodal Understanding

This course is part of the Computer Vision Specialization

Instructor: Tom Yeh

4 modules

Gain insight into a topic and learn the fundamentals.

Advanced level

Recommended experience

1 week to complete (under 10 hours per week)

Flexible schedule

Learn at your own pace

What you'll learn

Apply Nonlinear Support Vector Machines (NSVMs) and Fourier transforms to analyze and process visual data.
Use probabilistic reasoning and implement Recurrent Neural Networks (RNNs) to model temporal sequences and contextual dependencies in visual data.
Explain the principles of transformer architectures and how Vision Transformers (ViT) perform image classification and visual understanding tasks.
Implement CLIP for multimodal learning, and utilize diffusion models to generate high-fidelity images.

Skills you'll gain

Category: Linear Algebra

Details to know

Shareable certificate

Add to your LinkedIn profile

Recently updated!

August 2025

Assessments

18 assignments

Taught in English

Build your subject-matter expertise

This course is part of the Computer Vision Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 4 modules in this course

Step into the frontier of artificial intelligence with this advanced course designed to explore the latest models powering visual and multimodal intelligence. From foundational mathematical tools to state-of-the-art architectures, you'll gain the skills to understand and build systems that interpret images, text, and more, just like today’s leading AI models.

You'll begin by discovering how Nonlinear Support Vector Machines (NSVMs) and Fourier transforms lay the groundwork for signal processing and pattern recognition in visual data. You'll then build a strong foundation in probabilistic reasoning and temporal modeling with RNNs, enabling AI systems to understand sequences and context. Next, you'll learn how transformer architectures revolutionize both language and vision tasks. Finally, you'll dive into multimodal learning with CLIP, which connects images and text, and explore diffusion models that generate high-fidelity images through iterative refinement.

This course is ideal for learners who want to go beyond traditional deep learning and explore the models shaping the future of AI. With a blend of theory, code, and real-world applications, you'll be equipped to tackle cutting-edge challenges in computer vision and multimodal AI.

This course can be taken for academic credit as part of CU Boulder’s MS in Computer Science degree offered on the Coursera platform. These fully accredited graduate degrees offer targeted courses, short 8-week sessions, and pay-as-you-go tuition. Admission is based on performance in three preliminary courses, not academic history. CU degrees on Coursera are ideal for recent graduates or working professionals. Learn more: https://coursera.org/degrees/ms-computer-science-boulder.

Welcome to Modern AI Models for Vision and Multimodal Understanding, the third course in the Computer Vision specialization. In this first module, you’ll explore foundational mathematical tools used in modern AI models for vision and multimodal understanding. You’ll begin with Support Vector Machines (SVMs), learning how linear and radial basis function (RBF) kernels define decision boundaries and how support vectors influence classification. Then, you’ll dive into the Fourier Transform, starting with 1D signals and progressing to 2D applications. You’ll learn how to move between time/spatial and frequency domains using the Discrete Fourier Transform (DFT) and its inverse, and how these transformations reveal patterns and structures in data. By the end of this module, you’ll understand how SVMs and Fourier analysis contribute to feature extraction, signal decomposition, and model interpretability in AI systems.
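To make the move between the time domain and the frequency domain concrete, here is a minimal sketch of the 1D DFT and its inverse written as matrix products. It is not taken from the course workbook: the function names (`dft_matrix`, `dft`, `idft`) and the 8-sample cosine signal are assumptions chosen for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def dft_matrix(n):
    """Return the n x n DFT matrix W with W[k, t] = exp(-2j*pi*k*t/n)."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    t = np.arange(n).reshape(1, -1)   # time index (columns)
    return np.exp(-2j * np.pi * k * t / n)

def dft(x):
    """Forward DFT: project the signal onto each complex sinusoid."""
    x = np.asarray(x, dtype=complex)
    return dft_matrix(len(x)) @ x

def idft(X):
    """Inverse DFT: conjugate matrix, scaled by 1/n."""
    X = np.asarray(X, dtype=complex)
    n = len(X)
    return (np.conj(dft_matrix(n)) @ X) / n

# Illustrative signal (an assumption): a cosine with 2 cycles over 8 samples.
signal = np.cos(2 * np.pi * 2 * np.arange(8) / 8)
spectrum = dft(signal)
recovered = idft(spectrum).real

print(np.round(np.abs(spectrum), 6))  # energy concentrated in bins 2 and 6
print(np.allclose(recovered, signal))
```

The spectrum has non-zero magnitude only at bins 2 and 6 (the cosine's frequency and its mirror), which is the kind of pattern the frequency domain is meant to reveal. In practice `np.fft.fft` computes the same result far more efficiently.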

What's included

14 videos, 7 readings, 4 assignments

14 videos (total 80 minutes)

Meet Your Instructor (2 minutes)
Linear SVM (11 minutes)
Visualize Linear (8 minutes)
Radial Basis Function (RBF) (6 minutes)
RBF Kernel (3 minutes)
Visualize a RBF SVM (10 minutes)
1D DFT (5 minutes)
1D Inverse DFT (7 minutes)
1D Basic Functions (5 minutes)
Frequency and Time (6 minutes)
2D DFT (2 minutes)
2D Inverse DFT (2 minutes)
2D Basic Functions (4 minutes)
Frequency and Spatial (3 minutes)

7 readings (total 49 minutes)

Earn Academic Credit for your Work! (10 minutes)
Course Support (10 minutes)
Inside the Course (5 minutes)
Assessment Expectations (10 minutes)
AI Citation and Acknowledgement (10 minutes)
Get the Workbook: SVM (2 minutes)
Get the Workbook: Fourier 1D & 2D (2 minutes)

4 assignments (total 75 minutes)

Support Vector Machine (SVM) (15 minutes)
Fourier 1D (15 minutes)
Fourier 2D (15 minutes)
SVM and Fourier (30 minutes)

This module invites you to explore how probability theory and sequential modeling power modern AI systems. You’ll begin by examining how conditional and joint probabilities shape predictions in language and image models, and how the chain rule enables structured generative processes. Then, you’ll transition to recurrent neural networks (RNNs), learning how they handle sequential data through hidden states and feedback loops. You’ll compare RNNs to feedforward models, explore architectures like one-to-many and sequence-to-sequence, and address challenges like vanishing gradients. By the end, you’ll understand how probabilistic reasoning and temporal modeling combine to support tasks ranging from text generation to autoregressive image synthesis.
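The chain rule mentioned above factors a joint probability into a product of conditionals, P(x1, ..., xn) = P(x1) · P(x2 | x1) · ... · P(xn | x1, ..., xn-1), which is the basis of autoregressive generation. The sketch below illustrates this with a hypothetical lookup table; the table `TABLE`, its probabilities, and the function name `joint_probability` are made up for illustration and do not come from the course.

```python
import math

# Hypothetical conditional distributions, keyed by the prefix seen so far.
# TABLE[prefix][token] = P(token | prefix). All values are made up.
TABLE = {
    (): {"the": 0.5, "a": 0.5},
    ("the",): {"cat": 0.4, "dog": 0.6},
    ("the", "cat"): {"sat": 0.25, "ran": 0.75},
}

def joint_probability(tokens, table):
    """Chain rule: multiply P(token_i | tokens before i) over the sequence.

    Log-probabilities are summed rather than multiplying raw probabilities,
    which avoids numerical underflow on long sequences.
    """
    log_p = 0.0
    for i, token in enumerate(tokens):
        prefix = tuple(tokens[:i])
        log_p += math.log(table[prefix][token])
    return math.exp(log_p)

p = joint_probability(["the", "cat", "sat"], TABLE)
print(p)  # product 0.5 * 0.4 * 0.25, up to floating-point rounding
```

An autoregressive language or pixel model replaces the lookup table with a learned network that outputs the conditional distribution for the next token or pixel given everything generated so far.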

What's included

15 videos, 2 readings, 5 assignments

15 videos (total 122 minutes)

Probability in Language Models (10 minutes)
Conditional Probabilities (8 minutes)
The Chain Rule of Probabilities (10 minutes)
Calculating Joint Probabilities (12 minutes)
Pixel-Base Image Models (12 minutes)
Autoregressive Image Model (16 minutes)
Attention Mechanisms in Transformer Models (13 minutes)
Batch vs Recurrent (4 minutes)
MLP vs RNN (11 minutes)
Many to One (3 minutes)
One to Many (2 minutes)
One to One (5 minutes)
Sequence to Sequence (2 minutes)
Deep RNN (5 minutes)
Autoregressive RNN (3 minutes)

2 readings (total 4 minutes)

Get the Workbook: Probability (2 minutes)
Get the Workbook: RNN (2 minutes)

5 assignments (total 90 minutes)

Probability Part One (15 minutes)
Probability Part Two (15 minutes)
RNN Part One (15 minutes)
RNN Part Two (15 minutes)
Probability and RNN (30 minutes)

This module explores how attention-based architectures have reshaped the landscape of deep learning for both language and vision. You’ll begin by unpacking the mechanics of the Transformer, including self-attention, multi-head attention, and the encoder-decoder structure that enables parallel sequence modeling. Then, you’ll transition to Vision Transformers (ViTs), where images are tokenized and processed using the same principles that revolutionized NLP. Along the way, you’ll examine how normalization, positional encoding, and projection layers contribute to model performance. By the end, you’ll understand how Transformers and ViTs unify sequence and spatial reasoning in modern AI systems.
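As a rough illustration of the self-attention step described above, the following sketch computes single-head scaled dot-product attention with NumPy. The dimensions (4 tokens, width 8), the random weights, and the function names are assumptions for illustration only; a real Transformer or ViT adds multiple heads, positional encoding, normalization, and an MLP block around this core.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X has shape (num_tokens, d_model); each W_* maps d_model -> d_head.
    Returns the attended outputs and the attention weight matrix.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # token-to-token similarity
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

# Illustrative sizes and random parameters (assumptions, not course values).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # 4 tokens (or image patches)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

out, attn = self_attention(X, W_q, W_k, W_v)
print(out.shape, attn.shape)
```

For a ViT, the rows of `X` would be embeddings of image patches rather than word tokens; the attention computation itself is unchanged.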

What's included

15 videos, 2 readings, 5 assignments

15 videos (total 80 minutes)

Batch vs Recurrent vs Attention (6 minutes)
Attention + MLP (4 minutes)
Dot-Product Self-Attention (4 minutes)
QKV Self-Attention (4 minutes)
Transformer Encoder (3 minutes)
Self vs Cross Attention (5 minutes)
Encoder and Decoder for Transformer (7 minutes)
Decoder Output Layer (3 minutes)
Image to Tokens (10 minutes)
Normalization for ViT (3 minutes)
Self-Attention for ViT (5 minutes)
Multi-Head Attention (8 minutes)
MLP Forward Feed (3 minutes)
ViT Output Layer (4 minutes)
Loss Gradient for ViT (3 minutes)

2 readings (total 4 minutes)

Get the Workbook: Transformer (2 minutes)
Get the Workbook: ViT (2 minutes)

5 assignments (total 90 minutes)

Transformer Part One (15 minutes)
Transformer Part Two (15 minutes)
ViT Part One (15 minutes)
ViT Part Two (15 minutes)
Transformer and ViT (30 minutes)

In this module, you’ll explore two transformative approaches in multimodal and generative AI. First, you’ll dive into CLIP, a model that learns a shared embedding space for images and text using contrastive pre-training. You’ll see how CLIP enables zero-shot classification by comparing image embeddings to textual descriptions, without needing labeled training data. Then, you’ll shift to diffusion models, which generate images through a gradual denoising process. You’ll learn how noise prediction, time conditioning, and reverse diffusion combine to produce high-quality samples. This module highlights how foundational models can bridge modalities and synthesize data with remarkable flexibility.
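The zero-shot classification idea can be sketched in a few lines: embed the image and each candidate caption in the shared space, compare them by cosine similarity, and pick the best match. The example below uses tiny hand-written 3-dimensional vectors in place of real encoder outputs; the embeddings, the labels, the `temperature` value, and the function names are all assumptions for illustration and not CLIP's actual API.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Return the index of the best-matching text and the class probabilities.

    temperature is an illustrative scaling value; smaller values sharpen
    the softmax over similarities.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_embs)
    logits = (txt @ img) / temperature        # one similarity per caption
    logits = logits - logits.max()            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs

# Made-up embeddings standing in for the text and image encoder outputs.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.2, 0.9, 0.1])

idx, probs = zero_shot_classify(image_emb, text_embs)
print(labels[idx])
```

No classifier is trained for these three labels: swapping in a different list of captions changes the label set without touching the model, which is what makes the approach zero-shot.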

What's included

11 videos, 2 readings, 4 assignments

11 videos (total 75 minutes)

Batch of Pairs (5 minutes)
Image Encoder (Batch) (6 minutes)
Text Encoder (Batch) (10 minutes)
Joint Embedding (4 minutes)
Contrastive Pre-Training (12 minutes)
Zero-Shot Image Classifier (6 minutes)
Zero-Shot Image Prediction (6 minutes)
Diffusion Introduction (4 minutes)
Noise Prediction (5 minutes)
Time Conditioning and Parallel Training (4 minutes)
Reverse Diffusion (6 minutes)

2 readings (total 4 minutes)

Get the Workbook: CLIP (2 minutes)
Get the Workbook: Diffusion (2 minutes)

4 assignments (total 75 minutes)

CLIP Part One (15 minutes)
CLIP Part Two (15 minutes)
Diffusion (15 minutes)
CLIP and Diffusion (30 minutes)

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Tom Yeh

University of Colorado Boulder

4 courses, 10,149 learners

Offered by

University of Colorado Boulder

Frequently asked questions

To access the course materials, assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

If you subscribed, you get a 7-day free trial during which you can cancel at no penalty. After that, we don’t give refunds, but you can cancel your subscription at any time. See our full refund policy.