Build Multimodal Generative AI Applications

This course is part of the IBM RAG and Agentic AI Professional Certificate

Instructor: Hailey Quach

Included with Coursera Plus

3 modules

Gain insight into a topic and learn the fundamentals.

Intermediate level

Recommended experience

7 hours to complete

3 weeks at 2 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Gain the job-ready skills you need to build multimodal generative AI applications in just 3 weeks
Understand the fundamental concepts and challenges in multimodal AI, including the integration of text, speech, images, and video
Build multimodal AI applications using state-of-the-art models and frameworks such as IBM’s Granite, Meta’s Llama, OpenAI’s Whisper, DALL·E and Sora
Develop multimodal AI solutions, including chatbots and image/video generation models, using IBM watsonx.ai, Hugging Face, Flask and Gradio

Skills you'll gain

Web Applications
Flask (Web Framework)
OpenAI
Image Analysis
PyTorch (Machine Learning Library)
Large Language Modeling
Artificial Intelligence
TensorFlow
Prompt Engineering
Generative AI
Application Development
Natural Language Processing
Computer Vision

Details to know

Shareable certificate

Add to your LinkedIn profile

Recently updated!

May 2025

Assessments

6 assignments

Taught in English

Build your Software Development expertise

This course is part of the IBM RAG and Agentic AI Professional Certificate

When you enroll in this course, you are also enrolled in this Professional Certificate.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable Professional Certificate from IBM

Earn a Professional Certificate

Add this credential to your LinkedIn profile or your CV

Share it on social media and in your performance review

There are 3 modules in this course

Ready to level up your GenAI skills? Step into the exciting world of multimodal AI, where language, images, and speech come together to build smarter, more interactive applications.

In this hands-on course, you’ll learn how to build systems that work across multiple modalities, from creating AI-powered storytellers and meeting assistants to developing image captioning tools and video generation apps. You’ll gain experience with real-world tools like IBM’s Granite, OpenAI’s Whisper, Sora and DALL·E, Meta’s Llama, Mistral’s Mixtral, and Gradio. Plus, you'll explore multimodal search, question answering, and retrieval systems that combine text, speech, and visual data. By the end of the course, you’ll be able to design and build full-stack multimodal AI solutions using Python and frameworks like Flask and Gradio. If you’re looking to gain in-demand skills for building the next generation of AI applications, enroll today and power up your AI career!

This module provides an in-depth introduction to multimodal AI, focusing on how AI systems process and integrate multiple data types, including text, speech, and images. You will explore core concepts and some of the challenges you will face in multimodal AI, gaining foundational skills with text and speech processing techniques. Through hands-on labs, you will apply AI-powered storytelling, speech-to-text transcription, and text-to-speech synthesis to real-world applications, such as AI-generated audiobooks and automated meeting assistants. 
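As a rough illustration of how a meeting assistant chains modalities, the sketch below wires a speech-to-text stage into a text summarization stage. It is not the course's lab code: `run_meeting_assistant`, `MeetingNotes`, and the two stub functions are hypothetical names, and the stubs stand in for real models (such as a transcription model and a language model) so that the example runs with the standard library alone.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases: each stage is just a function, so a real
# speech-to-text or language model could be swapped in for the stubs below.
Transcriber = Callable[[bytes], str]  # audio bytes -> transcript text
Summarizer = Callable[[str], str]     # transcript text -> summary text


@dataclass
class MeetingNotes:
    transcript: str
    summary: str


def run_meeting_assistant(
    audio: bytes, transcribe: Transcriber, summarize: Summarizer
) -> MeetingNotes:
    """Chain speech-to-text and text summarization into one pipeline."""
    transcript = transcribe(audio).strip()
    if not transcript:
        # Nothing was recognized, so skip the summarization step.
        return MeetingNotes(transcript="", summary="")
    return MeetingNotes(transcript=transcript, summary=summarize(transcript))


# Stub stages for illustration only; they do not call any real model.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio "is" its transcript


def fake_summarize(text: str) -> str:
    return text.split(".")[0].strip() + "."  # keep only the first sentence


notes = run_meeting_assistant(
    b"We agreed to ship on Friday. Alice owns the release notes.",
    fake_transcribe,
    fake_summarize,
)
print(notes.summary)
```

Passing the stages in as plain callables keeps the pipeline independent of any particular model or provider, which is one reasonable way to structure this kind of application.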

What's included

4 videos, 2 readings, 2 assignments, 2 app items, 6 plugins

4 videos (total 28 minutes)

Video: Course Introduction (4 minutes)
Video: Introduction to Multimodal AI (8 minutes)
Text-to-Speech Technologies (8 minutes)
Speech-to-Text Technologies (7 minutes)

2 readings (total 5 minutes)

Reading: Course Overview (3 minutes)
Reading: Summary and Highlights (2 minutes)

2 assignments (total 36 minutes)

Graded Quiz: Foundations of Multimodal AI (21 minutes)
Practice Quiz: Introduction to Multimodal AI: Text and Speech Processing (15 minutes)

2 app items (total 75 minutes)

Lab: Use Mistral and gTTS to Create Your Personal Storyteller (30 minutes)
Lab: Build a Meeting Assistant with Whisper, LangChain, & Gradio (45 minutes)

6 plugins (total 32 minutes)

Helpful Tips for Course Completion (3 minutes)
Reading: What is Multimodal Generative AI and Why Does It Matter? (5 minutes)
Reading: What is Computer Vision? (7 minutes)
Reading: Text Processing, Speech Processing, and Text-to-Speech (7 minutes)
Reading: Challenges in Multimodal AI Integration (5 minutes)
Cheat Sheet: Foundations of Multimodal AI (5 minutes)

This module explores how AI processes and generates visual data by integrating images and videos with text. You will examine text-to-image, image-to-text, text-to-video, and video-to-text models, image captioning, and the fusion techniques necessary for effective multimodal AI systems. Through hands-on labs, you will apply state-of-the-art models like DALL·E and Sora to generate images and videos from text prompts. Additionally, you will implement an image captioning system using Meta’s Llama 4, gaining practical experience in combining vision and language models for real-world applications.
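The fusion techniques mentioned above are often grouped into early fusion (combine features before modeling) and late fusion (combine per-modality predictions afterward). The following is a minimal sketch of that distinction, assuming toy, hand-written feature vectors and scores; `early_fusion` and `late_fusion` are hypothetical helper names, and the course may present fusion differently.

```python
from typing import Sequence


def early_fusion(
    image_features: Sequence[float], text_features: Sequence[float]
) -> list[float]:
    """Early fusion: concatenate per-modality features into one vector
    that a single downstream model would consume."""
    return list(image_features) + list(text_features)


def late_fusion(
    image_score: float, text_score: float, image_weight: float = 0.5
) -> float:
    """Late fusion: each modality is scored separately, then the scores
    are combined with a weighted average."""
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must be between 0 and 1")
    return image_weight * image_score + (1.0 - image_weight) * text_score


# Toy numbers for illustration; real features would come from trained encoders.
fused_vector = early_fusion([0.2, 0.7], [0.9, 0.1, 0.4])
fused_score = late_fusion(image_score=0.8, text_score=0.4, image_weight=0.25)
print(fused_vector)
print(round(fused_score, 2))
```

Early fusion lets one model learn interactions between modalities, while late fusion keeps each modality's model independent and only merges their outputs; which one fits depends on the task and the available data.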

What's included

2 videos, 1 reading, 2 assignments, 2 app items, 3 plugins

2 videos (total 14 minutes)

Video: Understanding Image Captioning with Meta's Llama (7 minutes)
Demo Video: Text-to-Video Generation with OpenAI's Sora (7 minutes)

1 reading (total 3 minutes)

Reading: Summary and Highlights (3 minutes)

2 assignments (total 31 minutes)

Graded Quiz: Integrating Visual and Video Modalities (21 minutes)
Image Generation and Captioning (10 minutes)

2 app items (total 50 minutes)

Lab: DALL·E Image Generation Guide for Beginners (20 minutes)
Lab: Build an Image Captioning System with watsonx and IBM's Granite (30 minutes)

3 plugins (total 35 minutes)

Reading: Introduction to Text-to-Video and Image-to-Video Technologies (12 minutes)
Reading: Strengths, Limitations, and Practical Applications of Multimodal Vision Models in Real World Scenarios (8 minutes)
Cheat Sheet: Integrating Visual and Video Modalities (15 minutes)

The final module explores advanced multimodal AI applications, integrating image, text, and retrieval-based systems to build innovative solutions. You will dive into multimodal retrieval and search, multimodal Question Answering (QA), and chatbots, learning how cross-modal retrieval techniques enhance search engines and recommendation systems. Additionally, you will learn how integrating visual and textual data improves chatbot interactions. Through hands-on labs, you will build fully functional web applications with multimodal capabilities using Flask, applying state-of-the-art models and frameworks. 
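Cross-modal retrieval generally relies on embedding text and images into a shared vector space and ranking items by similarity to the query. The sketch below shows only that ranking step, assuming hand-written three-dimensional vectors in place of real model embeddings; the file names, the `search` helper, and the query vector are all hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query_vec: list[float], index: dict[str, list[float]], top_k: int = 2
) -> list[str]:
    """Return the names of the top_k indexed items most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical image embeddings, written by hand for illustration only.
image_index = {
    "red_dress.jpg": [0.9, 0.1, 0.0],
    "blue_jeans.jpg": [0.1, 0.9, 0.1],
    "red_scarf.jpg": [0.8, 0.0, 0.3],
}

# Pretend this is the embedding of the text query "something red".
text_query = [1.0, 0.0, 0.1]
print(search(text_query, image_index))
```

With these toy vectors the two "red" images rank ahead of the jeans. In a real system the vectors would come from a multimodal encoder, and the linear scan would typically be replaced by a vector database or an approximate nearest-neighbor index.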

What's included

3 videos, 3 readings, 2 assignments, 2 app items, 1 plugin

3 videos (total 18 minutes)

Introduction to Multimodal Retrieval-Augmented Generation (MM-RAG) (6 minutes)
Multimodal Chatbots and QA Systems (7 minutes)
Video: Course Wrap-up (3 minutes)

3 readings (total 6 minutes)

Summary and Highlights (2 minutes)
Reading: Congratulations and Next Steps (2 minutes)
Thanks from the Course Team (2 minutes)

2 assignments (total 36 minutes)

Graded Quiz: Advanced Multimodal Applications (21 minutes)
Build Advanced Multimodal Applications (15 minutes)

2 app items (total 75 minutes)

Lab: Build a Style Finder Using Multimodal Retrieval and Search (45 minutes)
Lab: Building Your First GenAI-Powered Image-Based Web Application: AI Nutrition Coach (30 minutes)

1 plugin (total 10 minutes)

Cheat Sheet: Advanced Multimodal Applications (10 minutes)

Instructor

Hailey Quach

IBM

2 courses, 216 learners

Offered by

IBM

Explore more from Software Development

IBM
Fundamentals of Building AI Agents
Course
IBM
Agentic AI Fundamentals with LangChain and LangGraph
Course
IBM
Agentic AI with LangGraph, CrewAI, AutoGen and BeeAI
Course
IBM
Advanced RAG with Vector Databases and Retrievers
Course

Frequently asked questions

Skills in multimodal generative AI, where systems integrate text, speech, images, and video, are in high demand for roles such as AI developer, machine learning engineer, multimodal AI researcher, and full-stack developer specializing in AI-powered user experiences.

Not necessarily. If you’re a Python developer, you can start building with generative AI using tools like IBM watsonx.ai, Flask, and Gradio; no advanced ML background is required.

Multimodal AI apps go beyond typical app development by incorporating multimodal large language models (MLLMs) and media-based inputs like speech, images, and video. You’ll still use familiar tools like Python, Flask and Gradio, but you’ll also learn to integrate and orchestrate models for tasks like transcription, image generation, and AI-powered storytelling.

Access to lectures and assignments depends on your type of enrollment. If you take a course in audit mode, you will be able to see most course materials for free. To access graded assignments and to earn a Certificate, you will need to purchase the Certificate experience, during or after your audit. If you don't see the audit option:

The course may not offer an audit option. You can try a Free Trial instead, or apply for Financial Aid.
The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Certificate, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile. If you only want to read and view the course content, you can audit the course for free.