
[The Future of Artificial General Intelligence] AGENT AI: SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION


On 25 Jan 2024, Stanford University, Microsoft Research, the University of California, Los Angeles, and the University of Washington co-published a survey paper titled "AGENT AI: SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION".

Abstract:

The authors sketch the future of multi-modal AI systems and present a new concept, "Agent AI", as a promising avenue toward Artificial General Intelligence (AGI).

An Agent AI system can perceive and act in many different domains and applications, possibly serving as a route towards AGI using an agent paradigm. — Figure 1

How can we use ubiquitous multi-modal AI systems?

Agents within physical and virtual environments, like Jarvis in Iron Man.


Overview of an Agent AI system:


Data flow:


We define "Agent AI" as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions.

We explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback.

They aim to mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs.

We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment (like the OASIS game in the film "Ready Player One").

1. Introduction

1.1 Motivation

The AI community is on the cusp of a significant paradigm shift, transitioning from
creating AI models for passive, structured tasks to models capable of assuming dynamic, agentic roles in diverse and complex environments.

  • From AI models for highly structured, fixed tasks → to agents that act dynamically in diverse and complex environments. LLMs and VLMs → a blend of linguistic proficiency, visual cognition, contextual memory, intuitive reasoning, and adaptability → applied to gaming, robotics, and healthcare → redefining human experiences and elevating operational standards → transformative impacts on industries and socio-economics.

1.2 Background

Large Foundation Models: can potentially tackle complex tasks such as mathematical reasoning, professional law, and generating complex plans for robots and game AI.


Embodied AI: leverages LLMs to perform task planning for robots; the LLM generates Python code, which is then executed by a low-level controller.
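A minimal sketch of that pattern, assuming a hypothetical query_llm call and a toy Controller class (neither comes from the paper):

# Sketch of the "LLM writes a plan as Python, a low-level controller executes it" pattern.
# query_llm and Controller are illustrative placeholders, not APIs from the paper.

PROMPT = (
    "You control a robot arm. Available calls: controller.move_to(x, y, z), "
    "controller.grasp(), controller.release(). "
    "Write Python code that picks up the cup and places it on the tray."
)

def query_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; returns a canned plan so the sketch runs end to end.
    return (
        "controller.move_to(0.3, 0.1, 0.2)\n"
        "controller.grasp()\n"
        "controller.move_to(0.5, 0.0, 0.2)\n"
        "controller.release()"
    )

class Controller:
    """Toy low-level controller that just logs the primitive actions it receives."""
    def move_to(self, x, y, z): print(f"move_to({x}, {y}, {z})")
    def grasp(self): print("grasp()")
    def release(self): print("release()")

generated_code = query_llm(PROMPT)
exec(generated_code, {"controller": Controller()})  # run the generated plan on the controller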


Interactive Learning:

  1. Feedback-based learning: the AI adapts its responses based on direct user feedback (a toy sketch follows below).
  2. Observational learning: the AI observes user interactions and learns implicitly.
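A toy illustration of the feedback-based variant (the FeedbackAgent class and its scoring scheme are my own invention, not from the paper):

# Sketch of feedback-based interactive learning: user ratings shift which
# candidate response the agent prefers next time. Purely illustrative.
from collections import defaultdict

class FeedbackAgent:
    def __init__(self):
        self.scores = defaultdict(float)   # response candidate -> accumulated feedback

    def respond(self, candidates):
        # pick the candidate with the best feedback so far (ties -> first)
        return max(candidates, key=lambda c: self.scores[c])

    def give_feedback(self, response, rating):
        # rating in [-1, 1] from the user; the agent adapts its future choices
        self.scores[response] += rating

agent = FeedbackAgent()
reply = agent.respond(["short answer", "detailed answer"])
agent.give_feedback(reply, -1.0)                             # user disliked it
reply = agent.respond(["short answer", "detailed answer"])   # now prefers the other one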

2. Agent AI Integration

Previous technologies and future challenges

2.1 [Previous technology] Infinite AI agent

AI agent systems’ abilities:

  1. Predictive Modeling
  2. Decision Making
  3. Handling Ambiguity
  4. Continuous Improvement

[Infinite AI agent] 2D/3D embodied generation and editing interaction

RoboGen: autonomously runs cycles of task proposition, environment generation, and skill learning.

https://robogen-ai.github.io/videos/pipeline_cropped.mp4

2.2 [Challenge] Agent AI with Large Foundation Models

2.2.1 Hallucinations (fabricated answers)

2.2.2 Biases and Inclusivity

Training Data / Historical and Cultural Biases / Language and Context Limitations / Policies and Guidelines / Overgeneralization / Constant Monitoring and Updating / Amplification of Dominant Views / Ethical and Inclusive Design / User Guidelines


Mitigations:

Diverse and Inclusive Training Data / Bias Detection and Correction / Ethical Guidelines and Policies / Diverse Representation / Bias Mitigation / Cultural Sensitivity / Accessibility / Language-based Inclusivity / Ethical and Respectful Interactions / User Feedback and Adaptation / Compliance with Inclusivity Guidelines

2.2.3 Data Privacy and Usage

Data Collection, Usage and Purpose / Storage and Security / Data Deletion and Retention / Data Portability and Privacy Policy / Anonymization

2.2.4 Interpretability and Explainability

Imitation Learning → Decoupling

Decoupling → Generalization

Generalization → Emergent Behavior


2.2.5 Inference Augmentation

Data Enrichment / Algorithm Enhancement / Human-in-the-Loop (HITL) / Real-Time Feedback Integration / Cross-Domain Knowledge Transfer / Customization for Specific Use Cases / Ethical and Bias Considerations / Continuous Learning and Adaptation

2.2.6 Regulation

To address unpredictable (uncertain) outputs from the model:

→ provide environmental information within the prompt

→ design prompts so that LLMs/VLMs include explanatory text

→ apply pre-execution verification and modification under human guidance
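A hedged illustration of how such a prompt and verification gate might look in practice (the template, JSON fields, and helper names are my own, not from the paper):

# Sketch: constrain an LLM agent by (1) putting environment state in the prompt,
# (2) requiring explanatory text, and (3) verifying the action before execution.
import json

def build_prompt(env_state: dict, task: str) -> str:
    return (
        "Environment state:\n"
        + json.dumps(env_state, indent=2)
        + f"\nTask: {task}\n"
        + 'Reply only with JSON of the form {"action": ..., "explanation": ...}.'
    )

def human_approves(action: str, explanation: str) -> bool:
    print(f"Proposed action: {action}\nModel explanation: {explanation}")
    return input("Execute? [y/N] ").strip().lower() == "y"

env_state = {"objects": ["cup", "tray"], "gripper": "empty"}
prompt = build_prompt(env_state, "put the cup on the tray")
# response = query_llm(prompt)   # hypothetical LLM call; a canned reply is used below
response = '{"action": "pick_up(cup)", "explanation": "The gripper is empty, so the cup must be grasped first."}'
parsed = json.loads(response)
if human_approves(parsed["action"], parsed["explanation"]):
    print("executing:", parsed["action"])  # pre-execution verification gate passed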

2.3 [Challenge] Agent AI for Emergent Abilities

Current modeling practices require developers to prepare large datasets for each domain to finetune/pretrain models; however, this process is costly and even impossible if the domain is new.

⭐ What about unseen environments or scenarios?

The new emergent mechanism — Mixed Reality with Knowledge Inference Interaction

Enables the exploration of unseen environments for adaptation to virtual reality.

Untitled

  1. Environment and Perception with task-planning and skill observation;
  2. Agent learning;
  3. Memory;
  4. Agent action;
  5. Cognition.

3 Agent AI Paradigm

We seek to accomplish several goals with our proposed framework:

  • Make use of existing pre-trained models and pre-training strategies to effectively bootstrap our agents with an effective understanding of important modalities, such as text or visual inputs.
  • Support for sufficient long-term task-planning capabilities.
  • Incorporate a framework for memory that allows for learned knowledge to be encoded and retrieved later.
  • Allow for environmental feedback to be used to effectively train the agent to learn which actions to take.
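A compact structural sketch of how these four goals could fit together (class and method names are mine; the paper does not specify this interface):

# Sketch of the proposed agent loop: pre-trained perception, a memory that stores
# and retrieves learned knowledge, a planner, and environment feedback for learning.
class AgentAI:
    def __init__(self, vision_encoder, language_model, memory):
        self.vision_encoder = vision_encoder   # frozen pre-trained VLM/LLM components
        self.language_model = language_model
        self.memory = memory                   # e.g. a vector store of past episodes

    def step(self, observation, goal, env):
        context = self.memory.retrieve(goal)                    # recall relevant knowledge
        state = self.vision_encoder(observation)                # understand the current scene
        plan = self.language_model.plan(state, goal, context)   # long-horizon task plan
        action = plan[0]                                        # take the first planned sub-task
        next_obs, reward, done = env.execute(action)            # act in the environment
        self.memory.store(goal, state, action, reward)          # encode experience for later retrieval
        return next_obs, reward, done                           # reward/feedback drives learning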

LLMs and VLMs | Agent Transformer Definition | Agent Transformer Creation


4 Agent AI Learning

4.1 Strategy and Mechanism

4.1.1 Reinforcement Learning (RL)

4.1.2 Imitation Learning (IL)

Imitation learning seeks to leverage expert data to mimic the actions of experienced agents or experts.
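In its simplest behavior-cloning form this reduces to supervised learning on expert state-action pairs; a minimal NumPy sketch (the data and shapes are arbitrary toy values):

# Minimal behavior cloning: fit a policy to expert (state, action) pairs.
import numpy as np

rng = np.random.default_rng(0)
expert_states = rng.normal(size=(256, 4))     # toy expert demonstrations
expert_actions = (expert_states @ np.array([1.0, -0.5, 0.2, 0.0]) > 0).astype(float)

W = np.zeros(4)
for _ in range(500):                          # logistic-regression policy, gradient descent
    logits = expert_states @ W
    probs = 1.0 / (1.0 + np.exp(-logits))
    grad = expert_states.T @ (probs - expert_actions) / len(expert_actions)
    W -= 0.5 * grad

accuracy = ((expert_states @ W > 0) == (expert_actions > 0.5)).mean()
print(f"imitation accuracy on the demonstrations: {accuracy:.2f}")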

4.1.3 Traditional RGB

4.1.4 In-context Learning

4.1.5 Optimization in the Agent System

4.2 Agent Systems (zero-shot and few-shot level)

5 Agent AI Categorization


5.1 Generalist Agent Areas:

5.2 Embodied Agents

5.2.1 Action Agents

5.2.2 Interactive Agents

5.3 Simulation and Environments Agents

5.4 Generative Agents

5.4.1 AR/VR/mixed-reality Agents

5.5 Knowledge and Logical Inference Agents

5.5.1 Knowledge Agent

5.5.2 Logic Agents

5.5.3 Agents for Emotional Reasoning

5.5.4 Neuro-Symbolic Agents

6 Agent AI Application Tasks

6.1 Agents for Gaming

NPC Behavior

Human-NPC Interaction

Agent-based Analysis of Gaming

Scene Synthesis for Gaming

6.2 Robotics

Visual Motor Control

Language Conditioned Manipulation

Skill Optimization

6.3 Healthcare

Diagnostic Agents.

Knowledge Retrieval Agents.

Telemedicine and Remote Monitoring.

6.4 Multimodal Agents

6.4.1 Image-Language Understanding and Generation

6.4.2 Video and Language Understanding and Generation

6.6 Agent for NLP

6.6.1 LLM agent

6.6.2 General LLM agent

6.6.3 Instruction-following LLM agents

7 Agent AI Across Modalities, Domains, and Realities

7.1 Agents for Cross-modal Understanding (image, text, audio, video…)

7.2 Agents for Cross-domain Understanding

7.3 Interactive agent for cross-modality and cross-reality

7.4 Sim to Real Transfer

8 Continuous and Self-improvement for Agent AI

8.1 Human-based Interaction Data

  • Additional training data
  • Human preference learning
  • Safety training (red-teaming)

8.2 Foundation Model Generated Data

  • LLM Instruction-tuning
  • Vision-language pairs

9 Agent Dataset and Leaderboard

9.1 “CuisineWorld” Dataset for Multi-agent Gaming

• We provide a benchmark called Microsoft MindAgent and correspondingly release a dataset, "CuisineWorld", to the research community.

(There is a typo in section 9.1.2, "Task", of the paper.)

9.2 Audio-Video-Language Pre-training Dataset

  1. Video Text Retrieval

  2. Video Assisted Informative Question Answering

CVPR 2024 Tutorial on Generalist Agent AI

Despite our lagging development over the past century and exclusion by an international community dominated by Western capital, through the tireless efforts of several generations over recent decades we have finally caught up to world-leading levels in most fields.

However, because most industries and fields originated in Western societies, we have had to comply with unified standards defined by the West in order to cater to the international market, which in my view is a very dangerous thing. Admittedly, if the whole world could use one set of standards to make products within the same industry, products from different companies would interoperate better; but this inevitably carries a great deal of uncertainty, for example:

  1. There is no guarantee that a standard carries no built-in favoritism: in my experience, most Western-defined standards favor someone. A handful of industry oligarchs, in order to define specifications that favor their own production, will band together and set standards that are not all that grounded in reality.
  2. There is no guarantee that a standard is free of patent oligarchs: some standards, when first established, are matched only by the patented products of certain specific companies. Once such a standard is fixed, essentially every company must buy that company's patents in order to produce anything, which is also deeply unreasonable.
  3. There is no guarantee that national sanctions will not become standards sanctions: in some cases, because we are sanctioned, we may not even be allowed to use the same standards as the international community, since standards often involve the patent oligarchs mentioned in point 2.

Therefore, given the current grave situation, I believe our compatriots in every industry should establish our own standards alliance as a precaution for the future.

In fact, we already have many precedents of successful competition over standards, such as UnionPay versus Visa and Mastercard, or Huawei's fight over the 5G specifications. Still, I would argue that those standards were defined as resistance within a framework set by Western blocs. So I keep asking myself: why can't we step outside the Western framework and walk our own path? Our products sell all over the world and their quality is among the best; we have our own technology and our own sales channels, so why should we follow other people's standards? Why can't entrepreneurs in each of our industries manufacture according to their own standards? Why can't we establish an industry standards-setting alliance like the IEEE?

I believe that, given our national conditions, such an industry standards alliance cannot simply copy the Western approach; we should develop an alliance with our own characteristics. Here are a few of my thoughts:

  1. Cross the river on others' shoulders: as the title suggests, defining every standard completely from scratch is unrealistic. We can imitate standards others have already defined, improve them, and then define our own. Especially in industries where others no longer let us play, there is no need to be polite or bound by contractual spirit or moral codes; just imitate first and then define our own specifications, crossing the river on others' shoulders and standing on the shoulders of "giants".
  2. Be pioneers in emerging fields: for mature industries we can "cross the river on others' shoulders", but for emerging industries we should have confidence as an industry and directly define our own specifications and standards. Especially in AI, chips, and operating systems, we must define our own standards and pool our strength to grow them; these fields urgently need everyone in the industry to work together to shake off sanctions.
  3. Companies should unite: since we want to define our own standards and specifications, someone must define and follow them. Here I call on companies across industries to set aside past grudges born of competition and understand that when the lips are gone the teeth grow cold. With large enterprises taking the lead, everyone should unite to define standards and specifications that support the industry's sustainable development.
  4. Share prerequisite patents: defining standards and specifications inevitably involves prerequisite patents. I believe that if our companies could generously share these prerequisite patents for free or at low cost, it would do nothing but good; after all, a patent without a market can never be converted into profit.
  5. Government participation and oversight: I think there is nothing wrong with companies pursuing profit as their sole aim, but we cannot ignore the interests of consumers and the nation's people, so a measure of government participation and intervention is necessary. As for why the government should not define the specifications directly, it is probably better to let professionals do professional work.

In short, I am calling on companies in every industry to unite and jointly establish a "unified standards-setting alliance", so that together we can define our own industry standards and avoid being choked by external forces. Take operating systems as an example: domestic operating systems have developed rapidly in recent years, but each company works in its own camp on its own standards, which does nothing to help the software ecosystem above the OS. If companies could be open with each other, share their patents, specifications, and standards, stop reinventing the wheel, and concentrate on building one standard system, then developers could build their code once and run it on every OS, which would greatly attract developers' enthusiasm.

All of the above is just some of my rambling, a wild brainstorm that may look very unrealistic or even laughable to insiders. If you read it and find it reasonable, feel free to contact me so we can discuss a better blueprint together; if you think it is a joke, then please just treat it as a joke and laugh it off.

An AI "mining" system for AIGC: taking the Stable Diffusion image generation model as an example

(The English was originally machine-translated from the Chinese by Notion.)

With the rise of AIGC models, when someone wants to generate an image, some people are severely short of computing power while others have plenty of it. Moreover, in AI applications roughly 70% of the computing power is spent on inference. So I wonder whether we could design something similar to Ethereum mining: we buy gas with cryptocurrency, purchasing gas is roughly equivalent to buying computing power and storage space, and we can then use that gas to deploy our AI smart contracts.

When a contract runs, the model may be split across several machines, and splittable models are still an emerging research direction (for example the Forward-Forward Algorithm and my DFF: Distributed Forward-Forward Algorithm paper); the research is not yet mature. Therefore, here we treat the steps of the Stable Diffusion model as the unit of splitting: each step is dispatched to a different machine, as shown in the figure below:

ai_miner.drawio.png
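A toy scheduling illustration of the idea, splitting the denoising steps round-robin across "miner" nodes and crediting gas per step (it does not implement Stable Diffusion or any blockchain logic; all names are placeholders):

# Toy illustration: distribute the denoising steps of a diffusion-style loop
# across several "miner" nodes, each credited with gas per step it executes.
import numpy as np

def denoise_step(latent, step, total_steps):
    # placeholder for one reverse-diffusion step; real SD would run a U-Net here
    return latent * (1.0 - 1.0 / (total_steps - step + 1))

class Node:
    def __init__(self, name):
        self.name, self.gas_earned = name, 0

    def run_step(self, latent, step, total_steps, gas_per_step=1):
        self.gas_earned += gas_per_step
        return denoise_step(latent, step, total_steps)

nodes = [Node("node-A"), Node("node-B"), Node("node-C")]
latent = np.random.default_rng(0).normal(size=(4, 4))   # stand-in for the SD latent
TOTAL_STEPS = 30
for step in range(TOTAL_STEPS):
    node = nodes[step % len(nodes)]                      # round-robin scheduling of steps
    latent = node.run_step(latent, step, TOTAL_STEPS)

print({n.name: n.gas_earned for n in nodes})             # gas accounting per node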

The main open problems in building this system are (more to be added later):

  1. How do we split a general problem into smaller problems?
  2. If the goal is to train a model, how is the dataset handled during training?
  3. How does each node judge whether it has "mined" the model result we hope to obtain?
  4. How do we backpropagate, and is backpropagation even needed?
  5. How do we handle concurrency, and how do we design the model for concurrent inference or training?

My rethinking of image generation and recognition

(The English was originally machine-translated from the Chinese by Notion.)

The human brain is the most powerful image-generation machine. Although we cannot project what it produces directly, we can imagine all sorts of strange things in our minds, and in our dreams we can roam freely through a world of pure imagination.

But how do we humans generate these things in our minds? As I understand it, even though we can imagine anything, that "anything" rarely goes beyond our cognitive range, which is what we usually call the limits of imagination.

I believe we have this limitation because we must first observe existing things and abstract their features before we can imagine other things based on some of those features.

Therefore, if we build AI models that follow the human way of thinking, I assume the model is divided into three parts: observation, imagination, and thinking.

humangen.drawio.png

The figure above shows what I think of as the logic of human thought, or equivalently the structure of an image-generation AI model; the structure is somewhat similar to a GAN (Generative Adversarial Network).

Observation: we observe real images and learn to extract their hidden features.

Hidden Feature: the observed features are used in two ways, one for thinking and one for generating images.

Imagination: from the summarized hidden features we try to generate a new image that is similar to, yet different from, the real image; the goal is to produce an image from which Observation can extract features similar to those of the real image.

Let us return to how humans think and learn. When we were babbling infants learning the word "car", a teacher would show us a picture card of a car and repeatedly tell us that this thing is a "car". Once we had memorized that picture and the corresponding word, something miraculous happened: even when we saw a car that looked completely different, we could still "guess" that it was a car! I attribute this ability to human imagination. After seeing the picture card, we learn not only the features of that particular image but also many abstract features, such as a car having four wheels and a steering wheel, and based on these abstract features we freely imagine different cars; these imagined cars in turn help us recognize real cars in the world. After N rounds of observing real images and imagined images, we can both think about what a "car" is and hold a model capable of imagining what a car looks like, as shown below:

baby_learn.drawio.png

Thinking: when reasoning about the features, we roughly need to answer two questions:

Is the image real or imagined? (a real/fake binary classification problem)

What does the image describe? (an NLP description problem)

Therefore, based on this line of thinking, we might be able to build a one-shot or few-shot model: by repeatedly observing (Observation) and imagining (Imagination) the same image and thinking about both, we might obtain three models: an encoder, a generator, and a thinker, as shown below:

aigen.drawio.png

Loss design: the loss is mainly defined on the outputs of the Thinker module; both the real/fake decision and the corresponding NLP description can be given their own distance-based loss functions.
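A rough PyTorch-style skeleton of the three modules and the two Thinker losses sketched above (layer sizes and the single-label description target are arbitrary placeholders):

# Skeleton of the Observation / Imagination / Thinking idea: an encoder extracts
# hidden features, a generator "imagines" an image from them, and a thinker both
# judges real vs. imagined and produces a description. Shapes are placeholders.
import torch
import torch.nn as nn

FEAT, IMG, VOCAB = 64, 3 * 32 * 32, 100

encoder      = nn.Sequential(nn.Linear(IMG, FEAT), nn.ReLU())   # Observation
generator    = nn.Sequential(nn.Linear(FEAT, IMG), nn.Tanh())   # Imagination
thinker_real = nn.Linear(FEAT, 1)                               # Thinking: real or imagined?
thinker_desc = nn.Linear(FEAT, VOCAB)                           # Thinking: what does it show?

real_img = torch.rand(8, IMG)
word_id  = torch.randint(0, VOCAB, (8,))                        # e.g. the label "car"

feat_real = encoder(real_img)
fake_img  = generator(feat_real)                                # imagined image
feat_fake = encoder(fake_img)

bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()
loss_real_fake = (
    bce(thinker_real(feat_real), torch.ones(8, 1))
    + bce(thinker_real(feat_fake), torch.zeros(8, 1))
)                                                               # binary real/imagined loss
loss_desc = (
    ce(thinker_desc(feat_real), word_id)
    + ce(thinker_desc(feat_fake), word_id)
)                                                               # description (NLP) loss
(loss_real_fake + loss_desc).backward()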

In addition, we can design a prompt-encoding module to give our imagination a hint, so that we can better control what the generator produces.

aigen_with_prompt.drawio.png

That is a little of my thinking about human learning and AI learning.

Problem

When I tried to use Interactive Annotation in Label Studio with the Segment Anything Model (commit version b2c31d3), I ran into a problem: the previous annotation disappeared after making a new annotation. (The latest version has fixed this problem [link].)

Fix

To fix this problem, we need to modify the label-studio-ml-backend/label_studio_ml/examples/segment_anything_model/segment_anything_model.py file and change the way the result id is generated, from:


''.join(random.SystemRandom().choice(string.ascii_uppercase + string.ascii_lowercase + string.digits))

to

from uuid import uuid4
str(uuid4())[:4]

Combat Effectiveness Detection using YOLOv8 and Tensorflow.js



Combat Effectiveness Detection application right in your browser, serving YOLOv8 in the browser using tensorflow.js with the WebGL backend.

DEMO

Check it!

Setup

git clone https://github.com/dengbuqi/Combat-Effectiveness-Detection_yolov8-tfjs
cd Combat-Effectiveness-Detection_yolov8-tfjs
yarn install #Install dependencies

Scripts

yarn start # Start dev server
yarn build # Build for productions

Model

YOLOv8n model converted to tensorflow.js.

used model: yolov8n
size: 13 MB

Use another model

Use another YOLOv8 model.

  1. Export YOLOv8 model to tfjs format. Read more on the official documentation

    from ultralytics import YOLO

    # Load a model
    model = YOLO("yolov8n.pt") # load an official model

    # Export the model
    model.export(format="tfjs")
  2. Copy yolov8*_web_model to ./public

  3. Update modelName in App.jsx to new model name

    ...
    // model configs
    const modelName = "yolov8*"; // change to new model name
    ...
  4. Done! 😊

Reference

heartbeat-js + FaceAPI heart pulse rate monitoring

This project combines heartbeat-js and FaceAPI. By detecting the human face with FaceAPI, we can estimate the heart pulse rate using the rPPG method implemented by the heartbeat-js project.

Demo

Check the demo here~

Welcome to Hexo! This is your very first post. Check documentation for more info. If you get any problems when using Hexo, you can find the answer in troubleshooting or you can ask me on GitHub.

Quick Start

Create a new post

$ hexo new "My New Post"

More info: Writing

Run server

$ hexo server

More info: Server

Generate static files

$ hexo generate

More info: Generating

Deploy to remote sites

$ hexo deploy

More info: Deployment