Build Multimodal Generative AI Applications

This course is part of the IBM RAG and Agentic AI Professional Certificate

Instructor: Hailey Quach

Included with Coursera Plus

3 modules

Gain insight into a topic and learn the fundamentals.

Intermediate level

7 hours to complete

Flexible schedule

Learn at your own pace

What you'll learn

Gain the job-ready skills you need to build multimodal generative AI applications in just 3 weeks
Understand the fundamental concepts and challenges in multimodal AI, including the integration of text, speech, images, and video
Build multimodal AI applications using state-of-the-art models and frameworks such as IBM’s Granite, Meta’s Llama, and OpenAI’s Whisper, DALL·E, and Sora
Develop multimodal AI solutions, including chatbots and image/video generation models, using IBM watsonx.ai, Hugging Face, Flask, and Gradio

Skills you'll gain

Web Applications
Flask (Web Framework)
OpenAI
Image Analysis
PyTorch (Machine Learning Library)
Large Language Modeling
Artificial Intelligence
TensorFlow
Prompt Engineering
Generative AI
Application Development
Natural Language Processing
Computer Vision

Details to know

Shareable certificate

Add to your LinkedIn profile

Recently updated!

May 2025

Assessments

6 assignments

Taught in English

Build your expertise in Software Development

This course is part of the IBM RAG and Agentic AI Professional Certificate

When you enroll in this course, you'll also be enrolled in this Professional Certificate.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate from IBM

Earn a career certificate.

Add this credential to your LinkedIn profile or résumé.

Share it on social media and in your performance review.

There are 3 modules in this course

Ready to level up your GenAI skills? Step into the exciting world of multimodal AI, where language, images, and speech come together to build smarter, more interactive applications.

In this hands-on course, you’ll learn how to build systems that work across multiple modalities, from creating AI-powered storytellers and meeting assistants to developing image captioning tools and video generation apps. You’ll gain experience with real-world tools like IBM’s Granite, OpenAI’s Whisper, Sora and DALL·E, Meta’s Llama, Mistral’s Mixtral, and Gradio. Plus, you'll explore multimodal search, question answering, and retrieval systems that combine text, speech, and visual data. By the end of the course, you’ll be able to design and build full-stack multimodal AI solutions using Python and frameworks like Flask and Gradio. If you’re looking to gain in-demand skills for building the next generation of AI applications, enroll today and power up your AI career!

This module provides an in-depth introduction to multimodal AI, focusing on how AI systems process and integrate multiple data types, including text, speech, and images. You will explore core concepts and some of the challenges you will face in multimodal AI, gaining foundational skills with text and speech processing techniques. Through hands-on labs, you will apply AI-powered storytelling, speech-to-text transcription, and text-to-speech synthesis to real-world applications, such as AI-generated audiobooks and automated meeting assistants. 

What's included

4 videos, 2 readings, 2 assignments, 2 app items, 6 plugins

4 videos (28 minutes total)

Video: Course Introduction (4 minutes)
Video: Introduction to Multimodal AI (8 minutes)
Text-to-Speech Technologies (8 minutes)
Speech-to-Text Technologies (7 minutes)

2 readings (5 minutes total)

Reading: Course Overview (3 minutes)
Reading: Summary and Highlights (2 minutes)

2 assignments (36 minutes total)

Practice Quiz: Introduction to Multimodal AI: Text and Speech Processing (15 minutes)
Graded Quiz: Foundations of Multimodal AI (21 minutes)

2 app items (75 minutes total)

Lab: Use Mistral and gTTS to Create Your Personal Storyteller (30 minutes)
Lab: Build a Meeting Assistant with Whisper, LangChain, & Gradio (45 minutes)

6 plugins (32 minutes total)

Helpful Tips for Course Completion (3 minutes)
Reading: What is Multimodal Generative AI and Why Does It Matter? (5 minutes)
Reading: What is Computer Vision? (7 minutes)
Reading: Text Processing, Speech Processing, and Text-to-Speech (7 minutes)
Reading: Challenges in Multimodal AI Integration (5 minutes)
Cheat Sheet: Foundations of Multimodal AI (5 minutes)

This module explores how AI processes and generates visual data by integrating images and videos with text. You will examine text-to-image/image-to-text and text-to-video/video-to-text models, image captioning, and the fusion techniques necessary for effective multimodal AI systems. Through hands-on labs, you will apply state-of-the-art models like DALL·E and Sora to generate images and videos from text prompts. Additionally, you will implement an image captioning system using Meta’s Llama 4, gaining practical experience in combining vision and language models for real-world applications.

What's included

2 videos, 1 reading, 2 assignments, 2 app items, 3 plugins

2 videos (14 minutes total)

Video: Understanding Image Captioning with Meta's Llama (7 minutes)
Demo Video: Text-to-Video Generation with OpenAI's Sora (7 minutes)

1 reading (3 minutes total)

Reading: Summary and Highlights (3 minutes)

2 assignments (31 minutes total)

Image Generation and Captioning (10 minutes)
Graded Quiz: Integrating Visual and Video Modalities (21 minutes)

2 app items (50 minutes total)

Lab: DALL·E Image Generation Guide for Beginners (20 minutes)
Lab: Build an Image Captioning System with watsonx and IBM's Granite (30 minutes)

3 plugins (35 minutes total)

Reading: Introduction to Text-to-Video and Image-to-Video Technologies (12 minutes)
Reading: Strengths, Limitations, and Practical Applications of Multimodal Vision Models in Real World Scenarios (8 minutes)
Cheat Sheet: Integrating Visual and Video Modalities (15 minutes)

The final module explores advanced multimodal AI applications, integrating image, text, and retrieval-based systems to build innovative solutions. You will dive into multimodal retrieval and search, multimodal Question Answering (QA), and chatbots, learning how cross-modal retrieval techniques enhance search engines and recommendation systems. Additionally, you will learn how integrating visual and textual data improves chatbot interactions. Through hands-on labs, you will build fully functional web applications with multimodal capabilities using Flask, applying state-of-the-art models and frameworks. 

What's included

3 videos, 3 readings, 2 assignments, 2 app items, 1 plugin

3 videos (18 minutes total)

Introduction to Multimodal Retrieval-Augmented Generation (MM-RAG) (6 minutes)
Multimodal Chatbots and QA Systems (7 minutes)
Video: Course Wrap-up (3 minutes)

3 readings (6 minutes total)

Summary and Highlights (2 minutes)
Reading: Congratulations and Next Steps (2 minutes)
Thanks from the Course Team (2 minutes)

2 assignments (36 minutes total)

Build Advanced Multimodal Applications (15 minutes)
Graded Quiz: Advanced Multimodal Applications (21 minutes)

2 app items (75 minutes total)

Lab: Build a Style Finder Using Multimodal Retrieval and Search (45 minutes)
Lab: Building Your First GenAI-Powered Image-Based Web Application: AI Nutrition Coach (30 minutes)

1 plugin (10 minutes total)

Cheat Sheet: Advanced Multimodal Applications (10 minutes)

Instructor

Hailey Quach, IBM

2 courses, 378 learners

Offered by IBM

Explore more from Software Development

Fundamentals of Building AI Agents (IBM, course)
Agentic AI Fundamentals with LangChain and LangGraph (IBM, course)
Agentic AI with LangGraph, CrewAI, AutoGen and BeeAI (IBM, course)
Advanced RAG with Vector Databases and Retrievers (IBM, course)

Frequently asked questions

Skills in multimodal generative AI, where systems integrate text, speech, images, and video, are in high demand for roles such as AI developer, machine learning engineer, multimodal AI researcher, and full-stack developer specializing in AI-powered user experiences.

Not necessarily. If you’re a Python developer, you can start building with generative AI using tools like IBM watsonx.ai, Flask, and Gradio—no advanced ML background required.

Multimodal AI apps go beyond typical app development by incorporating multimodal large language models (MLLMs) and media-based inputs like speech, images, and video. You’ll still use familiar tools like Python, Flask and Gradio, but you’ll also learn to integrate and orchestrate models for tasks like transcription, image generation, and AI-powered storytelling.

Access to lectures and assignments depends on your type of enrollment. If you take a course in audit mode, you will be able to see most course materials for free. To access graded assignments and to earn a Certificate, you will need to purchase the Certificate experience, during or after your audit. If you don't see the audit option:

The course may not offer an audit option. You can try a Free Trial instead, or apply for Financial Aid.
The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Certificate, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile. If you only want to read and view the course content, you can audit the course for free.