[The Future of Artificial General Intelligence] AGENT AI: SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION
On 25 Jan 2024, Stanford University, Microsoft Research, the University of California, Los Angeles, and the University of Washington co-published a survey paper titled "AGENT AI: SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION".
Abstract:
The authors sketch the future of multimodal AI systems and present a new concept, "Agent AI", as a promising avenue toward Artificial General Intelligence (AGI).
An Agent AI system can perceive and act in many different domains and applications, possibly serving as a route towards AGI using an agent paradigm. (Figure 1)
How can ubiquitous multimodal AI systems be used?
Agents within physical and virtual environments, like Jarvis in Iron Man.
Overview of an Agent AI system:
Data flow:
We define "Agent AI" as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions.
We explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback.
Mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs.
We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment (like the OASIS game in the film "Ready Player One").
1. Introduction
1.1 Motivation
The AI community is on the cusp of a significant paradigm shift, transitioning from creating AI models for passive, structured tasks to models capable of assuming dynamic, agentic roles in diverse and complex environments.
- AI models for highly structured, fixed tasks → dynamic, agentic models in diverse and complex environments.
- LLMs and VLMs → a blend of linguistic proficiency, visual cognition, contextual memory, intuitive reasoning, and adaptability.
- Applied to gaming, robotics, and healthcare domains → redefine human experiences and elevate the operational standard.
- Transformative impacts on industries and socio-economics.
1.2 Background
Large Foundation Models: can potentially tackle complex tasks such as mathematical reasoning, professional law, and generating complex plans for robots and game AI.
Embodied AI: leverages LLMs to perform task planning for robots.
The LLM generates Python code → a low-level controller executes that code.
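A minimal sketch of this pattern, assuming a hypothetical llm_generate() wrapper and a toy SimpleArmController (neither comes from the paper):

```python
# Sketch of the "LLM writes robot code, low-level controller executes it" pattern.
# `llm_generate` and `SimpleArmController` are hypothetical stand-ins, not APIs from the paper.

class SimpleArmController:
    """Toy low-level controller exposing primitive skills."""
    def move_to(self, x: float, y: float, z: float) -> None:
        print(f"moving end-effector to ({x}, {y}, {z})")

    def close_gripper(self) -> None:
        print("closing gripper")

def llm_generate(prompt: str) -> str:
    """Stand-in for a real LLM call; here it returns a canned plan."""
    return "controller.move_to(0.3, 0.1, 0.05)\ncontroller.close_gripper()"

prompt = (
    "You control a robot arm via `controller.move_to(x, y, z)` and "
    "`controller.close_gripper()`. Write Python code to pick up the cube at (0.3, 0.1)."
)
generated_code = llm_generate(prompt)

controller = SimpleArmController()
exec(generated_code, {"controller": controller})  # in practice: sandbox and verify first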
Interactive Learning:
- Feedback-based Learning: The AI adapts its responses based on direct user feedback.
- Observational Learning: The AI observes user interactions and learns implicitly.
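A toy sketch of feedback-based learning: the agent keeps a running score per response style and favors whatever users rated highest (all names are illustrative, not from the paper):

```python
# Toy feedback-based learning: prefer the response style users rated highest,
# with a little exploration. Purely illustrative.
from collections import defaultdict
import random

scores = defaultdict(float)   # style -> cumulative feedback score
counts = defaultdict(int)
styles = ["concise", "detailed", "step-by-step"]

def respond(query: str) -> str:
    # mostly exploit the best-rated style, occasionally explore
    if random.random() < 0.2:
        style = random.choice(styles)
    else:
        style = max(styles, key=lambda s: scores[s] / max(counts[s], 1))
    return f"[{style}] answer to: {query}"

def record_feedback(style: str, rating: float) -> None:
    scores[style] += rating   # direct user feedback updates the preference
    counts[style] += 1

print(respond("How do I reset the router?"))
record_feedback("concise", rating=1.0)
```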
2. Agent AI Integration
Previous technologies and future challenges
2.1 [Previous technology] Infinite AI agent
AI agent systems’ abilities:
- Predictive Modeling
- Decision Making
- Handling Ambiguity
- Continuous Improvement

[Infinite AI agent] 2D/3D embodied generation and editing interaction.
RoboGen: autonomously runs the cycles of task proposition, environment generation, and skill learning. https://robogen-ai.github.io/videos/pipeline_cropped.mp4
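A schematic of the RoboGen-style cycle (task proposition → environment generation → skill learning); every function here is a placeholder stub, not RoboGen's actual API:

```python
# Schematic of the task-proposition / environment-generation / skill-learning
# cycle that RoboGen runs autonomously. All functions are placeholder stubs.

def propose_task(history: list[str]) -> str:
    return f"task_{len(history)}: stack two blocks"      # e.g. proposed by an LLM

def generate_environment(task: str) -> dict:
    return {"task": task, "assets": ["table", "block_a", "block_b"]}  # e.g. a simulated scene

def learn_skill(env: dict) -> str:
    return f"policy_for_{env['task']}"                   # e.g. RL in the generated scene

history: list[str] = []
for _ in range(3):        # an "infinite" agent would, in principle, never stop this loop
    task = propose_task(history)
    env = generate_environment(task)
    skill = learn_skill(env)
    history.append(task)
    print(f"learned {skill}")
```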
2.2 [Challenge] Agent AI with Large Foundation Models
2.2.1 Hallucinations (fabricated answers)
2.2.2 Biases and Inclusivity
Sources of bias: Training Data / Historical and Cultural Biases / Language and Context Limitations / Policies and Guidelines / Overgeneralization / Constant Monitoring and Updating / Amplification of Dominant Views / Ethical and Inclusive Design / User Guidelines
Mitigation:
Diverse and Inclusive Training Data / Bias Detection and Correction / Ethical Guidelines and Policies / Diverse Representation / Bias Mitigation / Cultural Sensitivity / Accessibility / Language-based Inclusivity / Ethical and Respectful Interactions / User Feedback and Adaptation / Compliance with Inclusivity Guidelines
2.2.3 Data Privacy and Usage
Data Collection, Usage and Purpose / Storage and Security / Data Deletion and Retention / Data Portability and Privacy Policy / Anonymization
2.2.4 Interpretability and Explainability
Imitation Learning → Decoupling
Decoupling → Generalization
Generalization → Emergent Behavior
2.2.5 Inference Augmentation
Data Enrichment / Algorithm Enhancement / Human-in-the-Loop (HITL) / Real-Time Feedback Integration / Cross-Domain Knowledge Transfer / Customization for Specific Use Cases / Ethical and Bias Considerations / Continuous Learning and Adaptation
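A minimal sketch of the data-enrichment idea: retrieve relevant external knowledge and prepend it to the model input at inference time (knowledge_base, retrieve(), and llm() are made-up stubs):

```python
# Inference augmentation via data enrichment: ground the query with retrieved
# external knowledge before calling the model. All names below are stubs.

knowledge_base = {
    "kitchen": "The stove is left of the sink; knives are in the top drawer.",
    "garage": "The toolbox sits on the workbench next to the door.",
}

def retrieve(query: str) -> str:
    hits = [fact for key, fact in knowledge_base.items() if key in query.lower()]
    return " ".join(hits) or "no relevant facts found"

def llm(prompt: str) -> str:
    return f"(model answer conditioned on: {prompt[:60]}...)"  # stand-in for a model call

query = "Where should the agent look for a knife in the kitchen?"
augmented_prompt = f"Context: {retrieve(query)}\nQuestion: {query}"
print(llm(augmented_prompt))
```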
2.2.6 Regulation
To address unpredictable output (uncertainty) from the model:
→ Provide environmental information within the prompt
→ Design prompts so that LLMs/VLMs include explanatory text
→ Perform pre-execution verification and modification under human guidance
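A sketch combining all three regulation ideas: environment state in the prompt, a request for explanatory text, and human approval before execution (all names are hypothetical):

```python
# Regulating unpredictable outputs: (1) put environment state in the prompt,
# (2) ask the model to explain its plan, (3) verify with a human before executing.

env_state = {"objects": ["cup", "kettle"], "robot_at": "counter"}

prompt = (
    f"Environment: {env_state}\n"
    "Plan the next action and explain your reasoning before the action line."
)

def llm(p: str) -> str:   # stand-in for a real model call
    return "Reasoning: the kettle must be grasped before pouring.\nAction: grasp(kettle)"

output = llm(prompt)
print(output)

# pre-execution verification and modification under human guidance
approved = input("Execute this action? [y/N] ").strip().lower() == "y"
if approved:
    print("executing: grasp(kettle)")   # hand off to the low-level controller
else:
    print("action rejected; asking the model to revise")
```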
2.3 [Challenge] Agent AI for Emergent Abilities
Current modeling practices require developers to prepare large datasets for each domain to finetune/pretrain models; however, this process is costly and even impossible if the domain is new.
⭐Unseen environments or scenarios?
The new emergent mechanism — Mixed Reality with Knowledge Inference Interaction
Enables the exploration of unseen environments for adaptation to virtual reality.
- Environment and Perception with task-planning and skill observation;
- Agent learning;
- Memory;
- Agent action;
- Cognition.
3 Agent AI Paradigm
We seek to accomplish several goals with our proposed framework:
- Make use of existing pre-trained models and pre-training strategies to effectively bootstrap our agents with an effective understanding of important modalities, such as text or visual inputs.
- Support for sufficient long-term task-planning capabilities.
- Incorporate a framework for memory that allows for learned knowledge to be encoded and retrieved later.
- Allow for environmental feedback to be used to effectively train the agent to learn which actions to take.
LLMs and VLMs | Agent Transformer Definition | Agent Transformer Creation
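A skeleton of the proposed paradigm's loop, reflecting the four goals above; the Agent class and its methods are illustrative, not the paper's implementation:

```python
# Skeleton of the agent paradigm: pretrained perception, planning informed by
# retrieved memory, and environmental feedback used for learning. All stubs.

class Agent:
    def __init__(self):
        self.memory: list[tuple[str, str, float]] = []   # (observation, action, reward)

    def perceive(self, observation: str) -> str:
        return f"features({observation})"                # e.g. a pretrained VLM encoder

    def plan(self, features: str) -> str:
        useful = [m for m in self.memory if m[2] > 0]    # retrieve past successes
        return f"action given {features} and {len(useful)} past successes"

    def learn(self, observation: str, action: str, reward: float) -> None:
        self.memory.append((observation, action, reward))  # encode knowledge for later retrieval

agent = Agent()
for step in range(3):
    obs = f"frame_{step}"
    action = agent.plan(agent.perceive(obs))
    reward = 1.0                                         # environmental feedback
    agent.learn(obs, action, reward)
```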
4 Agent AI Learning
4.1 Strategy and Mechanism
4.1.1 Reinforcement Learning (RL)
4.1.2 Imitation Learning (IL)
Seeks to leverage expert data to mimic the actions of experienced agents or experts
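A minimal behavior-cloning sketch of this idea: fit a policy to expert state-action pairs by supervised learning (toy data, standard PyTorch; not the paper's implementation):

```python
# Behavior cloning: supervised learning on expert (state, action) pairs.
# Toy data and a tiny MLP; illustrative only.
import torch
import torch.nn as nn

states = torch.randn(256, 4)                  # expert states (toy)
actions = (states.sum(dim=1) > 0).long()      # expert's binary actions (toy rule)

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(50):
    logits = policy(states)
    loss = loss_fn(logits, actions)           # mimic the expert's choices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final imitation loss:", loss.item())
```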
4.1.3 Traditional RGB
4.1.4 In-context Learning
4.1.5 Optimization in the Agent System
4.2 Agent Systems (zero-shot and few-shot level)
5 Agent AI Categorization
5.1 Generalist Agent Areas:
5.2 Embodied Agents
5.2.1 Action Agents
5.2.2 Interactive Agents
5.3 Simulation and Environments Agents
5.4 Generative Agents
5.4.1 AR/VR/mixed-reality Agents
5.5 Knowledge and Logical Inference Agents
5.5.1 Knowledge Agent
5.5.2 Logic Agents
5.5.3 Agents for Emotional Reasoning
5.5.4 Neuro-Symbolic Agents
6 Agent AI Application Tasks
6.1 Agents for Gaming
NPC Behavior
Human-NPC Interaction
Agent-based Analysis of Gaming
Scene Synthesis for Gaming
6.2 Robotics
Visual Motor Control
Language Conditioned Manipulation
Skill Optimization
6.3 Healthcare
Diagnostic Agents
Knowledge Retrieval Agents
Telemedicine and Remote Monitoring
6.4 Multimodal Agents
6.4.1 Image-Language Understanding and Generation
6.4.2 Video and Language Understanding and Generation
6.6 Agent for NLP
6.6.1 LLM agent
6.6.2 General LLM agent
6.6.3 Instruction-following LLM agents
7 Agent AI Across Modalities, Domains, and Realities
7.1 Agents for Cross-modal Understanding (image, text, audio, video…)
7.2 Agents for Cross-domain Understanding
7.3 Interactive Agents for Cross-modality and Cross-reality
7.4 Sim to Real Transfer
⭐8 Continuous and Self-improvement for Agent AI
8.1 Human-based Interaction Data
- Additional training data
- Human preference learning
- Safety training (red-teaming)
8.2 Foundation Model Generated Data
- LLM Instruction-tuning
- Vision-language pairs
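A sketch of the 8.2 idea: use a strong foundation model as a "teacher" to generate (instruction, response) pairs for tuning the agent (teacher() is a stub standing in for a real model API):

```python
# Foundation-model-generated training data: a strong "teacher" model writes
# instruction/response pairs that are later used to instruction-tune the agent.
import json

def teacher(prompt: str) -> str:
    # stand-in for a real foundation-model call returning JSON
    return '{"instruction": "Summarize the scene.", "response": "A cup sits on the table."}'

seed_topics = ["kitchen scenes", "robot navigation", "game NPC dialogue"]
dataset = []
for topic in seed_topics:
    raw = teacher(f"Write one instruction/response pair about {topic} as JSON.")
    dataset.append(json.loads(raw))

with open("generated_instructions.jsonl", "w") as f:
    for example in dataset:
        f.write(json.dumps(example) + "\n")   # feed to the instruction-tuning pipeline
```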
9 Agent Dataset and Leaderboard
9.1 “CuisineWorld” Dataset for Multi-agent Gaming
• We provide a benchmark called Microsoft MindAgent and correspondingly release the "CuisineWorld" dataset to the research community.
(Note: there is a typo in Section 9.1.2, "Task", of the paper.)
9.2 Audio-Video-Language Pre-training Dataset
Video Text Retrieval
Video Assisted Informative Question Answering