
1. ComfyUI

https://github.com/comfyanonymous/ComfyUI

Before introducing our decentralized content generation platform, let's first explain what ComfyUI is and the idea behind it.

ComfyUI — "The most powerful and modular diffusion model GUI and backend."


From that description, we can see that ComfyUI is a framework that lets you design a workflow tailored to your own needs on top of a large number of generative models such as Stable Diffusion. Each model or data operation is implemented as a node that can be added to your workflow, so you can combine the strengths of different models and build your own content generation workflow locally. This includes, but is not limited to, text2img, img2img, inpainting, and LoRA workflows.

2. ComfyUI Pain Points

However, ComfyUI also has a significant pain point: if your workflow runs a large number of models, you need a single high-performance machine to run them all. In reality, an individual's PC can usually run only one or a few of the models in a workflow.


3. Decentralized ComfyUI

Thanks to ComfyUI's design, we can treat each node in a workflow as an independent unit. So why should we run identical nodes on different computers? That is simply a waste of computing power. Imagine that my computer can only run model node A, while Smith's computer can run model node B and Jones's can run model node C. If my workflow needs nodes A, B, and C, couldn't I borrow Smith's and Jones's computing power to run it? Likewise, Smith and Jones could use my node to run the workflows they want. The only thing we need is to expose our nodes to each other as API services.

DComfyU.drawio.png

4. How to Build a Decentralized ComfyUI

In my plan, we should rely heavily on ComfyUI, or at least on its node concept, to build our decentralized platform. The main design points are the following:

1. ComfyUI API Service Node

Building on ComfyUI's features, we can directly build API service nodes, automatically packaging other model nodes into individual API services that are exposed to other users.
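As a rough illustration of what such a service node could look like, here is a minimal Python sketch that wraps a locally running ComfyUI instance as an HTTP service. It assumes ComfyUI is listening on 127.0.0.1:8188 and accepts workflows through its standard POST /prompt endpoint; the port-9000 relay and its behaviour are purely illustrative, not a finished design.

# Minimal sketch: expose a locally running ComfyUI instance as an API service node.
# Assumes ComfyUI listens on 127.0.0.1:8188 and accepts workflows via POST /prompt.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

COMFYUI_URL = "http://127.0.0.1:8188/prompt"  # local ComfyUI API endpoint

class ServiceNode(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the (sub-)workflow submitted by a remote user.
        length = int(self.headers.get("Content-Length", 0))
        workflow = json.loads(self.rfile.read(length))
        # Forward it to the local ComfyUI backend and return its response.
        req = urllib.request.Request(
            COMFYUI_URL,
            data=json.dumps({"prompt": workflow}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Expose the node on port 9000 so other users (or the management center) can call it.
    HTTPServer(("0.0.0.0", 9000), ServiceNode).serve_forever()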

2. Node Management Center

Although we can build many ComfyUI API service nodes ourselves, most people do not have a public IP through which to expose their services. We should therefore build a node management center that manages the service nodes and forwards requests to nodes sitting behind private IPs, exposing them to other users. Users, in turn, can access the node services they want through the management center.
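A minimal sketch of what the management center's registry and forwarding choice could look like is below; the model names, endpoints, and round-robin selection are only assumptions for illustration (in practice, providers behind private IPs would keep an outbound tunnel or long-lived connection open to the center).

# Minimal sketch of a node management center: providers register their service nodes,
# users look up a node by model name, and the center picks one provider to forward to.
import itertools

class NodeRegistry:
    def __init__(self):
        self.nodes = {}          # model name -> list of provider endpoints
        self.round_robin = {}    # model name -> rotating iterator for simple load balancing

    def register(self, model_name, endpoint):
        self.nodes.setdefault(model_name, []).append(endpoint)
        self.round_robin[model_name] = itertools.cycle(self.nodes[model_name])

    def pick(self, model_name):
        # Choose one provider for this model, rotating between registered providers.
        if not self.nodes.get(model_name):
            raise LookupError(f"no provider registered for {model_name}")
        return next(self.round_robin[model_name])

registry = NodeRegistry()
registry.register("sdxl-base", "http://provider-a.example:9000")  # hypothetical endpoints
registry.register("sdxl-base", "http://provider-b.example:9000")
print(registry.pick("sdxl-base"))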

3. Security (tokens, encrypted data transmission, anonymity, and so on)

Once the node management center is in place, we need to build secure and reliable communication between service nodes, the management center, and users. Token authentication, encryption of data in transit, anonymization, and so on all need to be supported.
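For the token part, here is a minimal sketch of how the management center might issue and verify API tokens; the functions and in-memory storage are illustrative only, and transport encryption and anonymization would be layered on top (e.g. TLS).

# Minimal sketch of token issuing/verification at the management center.
import secrets
import hmac

ISSUED_TOKENS = {}  # user id -> token (in practice this would live in persistent storage)

def issue_token(user_id):
    token = secrets.token_urlsafe(32)   # random, URL-safe API token
    ISSUED_TOKENS[user_id] = token
    return token

def verify_token(user_id, presented):
    expected = ISSUED_TOKENS.get(user_id)
    if expected is None:
        return False
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(expected, presented)

token = issue_token("alice")
print(verify_token("alice", token))     # True
print(verify_token("alice", "forged"))  # False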

4. Payment Strategy (cryptocurrency? Online payment? Credit card?)

To promote our decentralized nodes as widely as possible, profit is of course the biggest driver. Node users should pay a fee to node providers, and the node management center also needs funds to cover server and network costs. Only then can the whole community grow in a healthy direction.

5. Everyone can make money, and everyone can use large-scale computing services

At that point, everyone can set up a node on their own computer to earn income, and everyone can tap into a large pool of computing power to generate the content they want.

5. Moreover

In addition, although our platform is based on the concepts of ComfyUI, I believe that with sensible design we can apply it not only to content generation workflows but to computing power sharing for any AI model workflow. We aim to build a global AI computing power sharing platform. Perhaps in the future this platform could also be used for upstream stages of AI models, such as model training and data collection.

Dynamic Neural Architecture (DNA)

Check out my project on GitHub

This is my little research project. I try to build a dynamic neural network with one layer for decentralized artificial intelligence (DAI) purposes. Neural cell nodes can be dynamically created and deleted. The idea is inspired by Hebbian theory. I am not sure whether it will work or not. Let's see!

Fire together, wire together (Hebbian theory)

Let us assume that the persistence or repetition of a reverberatory activity (or “trace”) tends to induce lasting cellular changes that add to its stability. … When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased. 

– From Hebbian theory

So, in my understanding, for an AI model neuron: the assumption of 'the persistence or repetition of a reverberatory activity' corresponds to the training process, and a 'cell' is a neural structure in the model. Then 'When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.' means that if neuron A is activated, A will subsequently activate (fire) neuron B.

How about making an AI model based on this theory?

Before designing my model, I modify one rule of Hebbian theory. In Hebbian theory, cell A is activated before B, and cell B is then activated by cell A. But to obtain a parallel architecture for my future plan of a decentralized model, I am going to activate neuron A and neuron B together, because in my opinion they will ultimately both be activated by the same activity regardless of the order.

Neural cell

So, in my idea, a neural cell can be modeled as a weighted signal function minus a signal leak (cf. the Hodgkin–Huxley model).

$$I = C_m \frac{\mathrm{d}V_m}{\mathrm{d}t} + g_K(V_m - V_K) + g_{Na}(V_m - V_{Na}) + g_l(V_m - V_l)$$

So, our neural cell can be simplified to:

$$I = wV_m - b$$

Yes, we could also replace this simple function with a more complicated one, such as a KAN model.
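As a tiny illustration, the simplified cell $I = wV_m - b$ could be coded roughly like this (the class name and defaults are placeholders):

# Minimal sketch of the simplified neural cell I = w * V_m - b.
class NeuralCell:
    def __init__(self, w=1.0, b=0.0):
        self.w = w  # passthrough weight
        self.b = b  # signal leak

    def fire(self, v_m):
        # Output current for an input signal v_m.
        return self.w * v_m - self.b

cell = NeuralCell()
print(cell.fire(0.7))  # with w=1, b=0 the signal passes through unchanged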

Can it work?

Generally speaking, the universal approximation theorem says that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function. So I think that if we have enough neural cells, we can approximate any continuous function. Meanwhile, according to the Kolmogorov–Arnold representation theorem, every multivariate continuous function can be represented as a superposition of continuous one-variable functions and addition, so we can build many simple many-to-one neural cells for each variable to represent any continuous function. [Maybe I am wrong.]

Create and delete neural cell

Yes, in my DNA system, neural cells are dynamically created and deleted.

Neural cell creation

This simulates 'fire together, wire together'. When the passthrough (which we treat as the weight $w_A$) of neural cell A is large, we need to create a branch neural cell B to reinforce this signal. Once we create neural cell B, A and B share the original passthrough equally, which means the initialization will be:

Let $w$ denote A's passthrough before the split:

$$w_A = 0.5 \times w,\quad b_A = b_A,\quad w_B = 0.5 \times w,\quad b_B = 0$$
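A small sketch of this creation rule, splitting the original passthrough equally between A and the new branch B (the NeuralCell class here is the same toy definition as above):

# Minimal sketch of the creation rule: cell B takes half of A's original passthrough.
class NeuralCell:
    def __init__(self, w=1.0, b=0.0):
        self.w, self.b = w, b

def create_branch(cell_a):
    w_original = cell_a.w
    cell_a.w = 0.5 * w_original                   # A keeps half of the passthrough
    return NeuralCell(w=0.5 * w_original, b=0.0)  # B gets the other half, b_B = 0

a = NeuralCell(w=2.0, b=0.3)
b = create_branch(a)
print(a.w, a.b, b.w, b.b)  # 1.0 0.3 1.0 0.0 (b_A is unchanged)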

Neural cell deletion

Conversely, we also need to delete neural cells with very small passthrough. We simply delete the cell together with its input and output connections to other neural cells.

But how

Well, I am still struggling with how to define a rule that tells us when to create neural cells and when to delete them.

My naive idea for now is to create a new branch for the top 10% of neural cells by passthrough and delete the bottom 10%. [TBD, needs rethinking]

But When (Extinction)

In my plan, the neural cell creation and deletion process (which sounds like model pruning) is carried out once we have trained for n epochs and the loss has not gone down for another k epochs (similar to early stopping, except that we do not stop). I give this process a new name: 'extinction'. It is as if the Earth has gone through many epochs in its history, and in some of them species suffered mass extinctions before entering another thriving epoch. Likewise, after n epochs our DNA model undergoes a mass extinction of neural cells and then enters a new thriving epoch.
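A rough sketch of when an extinction could be triggered, reusing the naive top/bottom 10% idea from above; n, k, and the loss trace are placeholder values:

# Minimal sketch of the extinction trigger: after at least n epochs, if the loss has not
# improved for k consecutive epochs, restructure the model instead of stopping.
def should_trigger_extinction(epoch, stale_epochs, n=10, k=3):
    return epoch + 1 >= n and stale_epochs >= k

best_loss, stale = float("inf"), 0
losses = [1.0, 0.8, 0.7, 0.69, 0.69, 0.69, 0.69, 0.68, 0.68, 0.68, 0.68]  # dummy loss trace
for epoch, loss in enumerate(losses):
    if loss < best_loss:
        best_loss, stale = loss, 0
    else:
        stale += 1
    if should_trigger_extinction(epoch, stale):
        # Extinction: branch the top 10% passthrough cells, delete the bottom 10%.
        print(f"extinction at epoch {epoch}: branch top 10%, delete bottom 10% of cells")
        stale = 0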

Model structure initialization

Before we train the DNA model, we need to decide on the initial model structure. The initial model is just like a fully connected layer, except that each node in the layer is a single variable (signal) and each edge is a neural cell.
Besides, each neural cell's trainable parameters $w$ and $b$ are initialized to 1 and 0, respectively.

(Figure: DNA model layer)

TODO

  • Neural cell class
  • Model data structure to manage each neural cell’s input and output
  • Neural cell creation function
  • Neural cell deletion function
  • Extinction function
  • Training example
  • Result evaluation

Next level

  • Implement the Forward-Forward algorithm or another backpropagation-free method
  • Distribute neural cells across different devices (DAI)
  • Self-perception, self-learning, and self-evolution

YOLO-World: Real-Time Open-Vocabulary Object Detection

https://github.com/AILab-CVC/YOLO-World/blob/master/yolo_world/models/layers/yolo_bricks.py

Open Vocabulary Object Detection

Open-vocabulary detection (OVOD) aims to generalize beyond the limited number of base classes labeled during the training phase. The goal is to detect novel classes defined by an unbounded (open) vocabulary at inference. [From link]

Novelty

Traditional Object Detector


Previous Open-Vocabulary Detector


YOLO-World Model Architecture

Text Encoder (CLIP)

YOLO Detector

YOLOv8 Backbone


Re-parameterizable Vision-Language Path Aggregation Network (Vision-Language PAN)


Text-guided CSPLayer (T-CSPLayer)

Dark Bottleneck (C2f Layer)

From: https://openmmlab.medium.com/dive-into-yolov8-how-does-this-state-of-the-art-model-work-10f18f74bab1

Max-Sigmoid

$$X_l^{\prime} = X_l \cdot \delta\left(\max_{j \in \{1..C\}}\left(X_l W_j^{\top}\right)\right)^{\top}$$
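A small numpy sketch of this max-sigmoid attention (shapes and variable names are illustrative): each image feature is rescaled by the sigmoid of its maximum similarity over the C text guidance embeddings.

# Illustrative numpy sketch of max-sigmoid attention in the T-CSPLayer.
import numpy as np

def max_sigmoid_attention(x_l, w):
    # x_l: (H*W, D) flattened image features, w: (C, D) text embeddings
    sim = x_l @ w.T                                                # (H*W, C) similarities
    gate = 1.0 / (1.0 + np.exp(-sim.max(axis=1, keepdims=True)))   # sigmoid of max over C
    return x_l * gate                                              # rescale each position

x = np.random.randn(16, 8)  # 16 spatial positions, 8-dim features
w = np.random.randn(4, 8)   # 4 text embeddings
print(max_sigmoid_attention(x, w).shape)  # (16, 8)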

Image-Pooling Attention (I-Pooling Attention)

$$W^{\prime} = W + \text{MultiHead-Attention}(W, \tilde{X}, \tilde{X})$$

Text Contrastive Head

Region-Text Matching

$$s_{k,j} = \alpha \cdot \text{L2-Norm}(e_k) \cdot \text{L2-Norm}(w_j)^{\top} + \beta$$
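A small numpy sketch of this region-text similarity (alpha and beta stand in for the learnable scaling and shift factors):

# Illustrative numpy sketch of region-text matching: dot product of L2-normalised
# object embeddings e_k and text embeddings w_j, scaled by alpha and shifted by beta.
import numpy as np

def region_text_similarity(e, w, alpha=1.0, beta=0.0):
    # e: (K, D) object embeddings, w: (C, D) text embeddings
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    w = w / np.linalg.norm(w, axis=1, keepdims=True)
    return alpha * (e @ w.T) + beta   # (K, C) similarity matrix s_{k,j}

e = np.random.randn(5, 8)
w = np.random.randn(3, 8)
print(region_text_similarity(e, w).shape)  # (5, 3)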

Loss

$$\mathcal{L}(I) = \mathcal{L}_{\mathrm{con}} + \lambda_I \cdot \left(\mathcal{L}_{\mathrm{iou}} + \mathcal{L}_{\mathrm{dfl}}\right)$$

$\mathcal{L}_{\mathrm{con}}$ is the region-text contrastive loss.
$\mathcal{L}_{\mathrm{iou}}$ is the IoU loss.
$\mathcal{L}_{\mathrm{dfl}}$ is the distributed focal loss.
$\lambda_I$ is an indicator factor, set to 1 when the input image I comes from detection or grounding data and to 0 when it comes from image-text data.

[The Future of Artificial General Intelligence] AGENT AI: SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION


On 25 Jan 2024, Stanford University, Microsoft Research, the University of California, Los Angeles, and the University of Washington co-published a survey paper called AGENT AI: SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION.

Abstract:

They try to sketch the future of multi-modal AI systems. They present a new concept, 'Agent AI', as a promising avenue toward Artificial General Intelligence (AGI).

An Agent AI system can perceive and act in many different domains and applications, possibly serving as a route towards AGI using an agent paradigm. — Figure 1

How do we use ubiquitous multi-modal AI systems?

Agents within physical and virtual environments, like Jarvis in Iron Man.


Overview of an Agent AI system:


Data flow:


We define “Agent AI” as a class of interactive systems that can

perceive visual stimuli, language inputs, and other environmentally-grounded data,

and can produce meaningful embodied actions.

We explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback.

mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs

We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment (like the OASIS game in the film "Ready Player One").

1. Introduction

1.1 Motivation

The AI community is on the cusp of a significant paradigm shift, transitioning from
creating AI models for passive, structured tasks to models capable of assuming dynamic, agentic roles in diverse and complex environments.

  • AI models for highly structured, fixed tasks → dynamic agents in diverse and complex environments. LLMs and VLMs → a blend of linguistic proficiency, visual cognition, contextual memory, intuitive reasoning, and adaptability → gaming, robotics, and healthcare domains → redefining human experiences and elevating operational standards → transformative impacts on industry and socio-economics.

1.2 Background

Large Foundation Models: can potentially tackle complex tasks

mathematical reasoning, professional law, generating complex plans for robots and game AI, etc.


Embodied AI: leverages LLMs to perform task planning for robots

generate Python code → execute this code in a low-level controller


Interactive Learning:

  1. Feedback-based learning: The AI adapts its responses based on direct user feedback.
  2. Observational Learning: The AI observes user interactions and learns implicitly.

2. Agent AI Integration

Previous technologies and future challenges

2.1 [Previous technology] Infinite AI agent

AI agent systems’ abilities:

  1. Predictive Modeling
  2. Decision Making
  3. Handling Ambiguity
  4. Continuous Improvement

[Infinite AI agent] 2D/3D embodied generation and editing interaction

RoboGen: autonomously runs the cycles of task proposition, environment generation, and skill learning. https://robogen-ai.github.io/videos/pipeline_cropped.mp4

2.2 [Challenge] Agent AI with Large Foundation Models

2.2.1 Hallucinations (fake answers)

2.2.2 Biases and Inclusivity

Training Data / Historical and Cultural Biases / Language and Context Limitations / Policies and Guidelines / Overgeneralization / Constant Monitoring and Updating / Amplification of Dominant Views / Ethical and Inclusive Design / User Guidelines


Mitigation:

Diverse and Inclusive Training Data / Bias Detection and Correction / Ethical Guidelines and Policies / Diverse Representation / Bias Mitigation / Cultural Sensitivity / Accessibility / Language-based Inclusivity / Ethical and Respectful Interactions / User Feedback and Adaptation / Compliance with Inclusivity Guidelines

2.2.3 Data Privacy and Usage

Data Collection, Usage and Purpose / Storage and Security / Data Deletion and Retention / Data Portability and Privacy Policy / Anonymization

2.2.4 Interpretability and Explainability

Imitation Learning → Decoupling

Decoupling → Generalization

Generalization → Emergent Behavior


2.2.5 Inference Augmentation

Data Enrichment / Algorithm Enhancement / Human-in-the-Loop (HITL) / Real-Time Feedback Integration / Cross-Domain Knowledge Transfer / Customization for Specific Use Cases / Ethical and Bias Considerations. / Continuous Learning and Adaptation

2.2.6 Regulation

To address unpredictable outputs (uncertainty) from the model:

→ Provide environmental information within the prompt

→ designing prompts to make LLM/VLMs include explanatory text

→ pre-execution verification and modification under human guidance

2.3 [Challenge] Agent AI for Emergent Abilities

Current modeling practices require developers to prepare large datasets for each domain to finetune/pretrain models; however, this process is costly and even impossible if the domain is new.

⭐Unseen environments or scenarios?

The new emergent mechanism — Mixed Reality with Knowledge Inference Interaction

Enables the exploration of unseen environments for adaptation to virtual reality.


  1. Environment and Perception with task-planning and skill observation;
  2. Agent learning;
  3. Memory;
  4. Agent action;
  5. Cognition.

3 Agent AI Paradigm

We seek to accomplish several goals with our proposed framework:

  • Make use of existing pre-trained models and pre-training strategies to effectively bootstrap our agents with an effective understanding of important modalities, such as text or visual inputs.
  • Support for sufficient long-term task-planning capabilities.
  • Incorporate a framework for memory that allows for learned knowledge to be encoded and retrieved later.
  • Allow for environmental feedback to be used to effectively train the agent to learn which actions to take.

LLMs and VLMs | Agent Transformer Definition | Agent Transformer Creation


4 Agent AI Learning

4.1 Strategy and Mechanism

4.1.1 Reinforcement Learning (RL)

4.1.2 Imitation Learning (IL)

Seeks to leverage expert data to mimic the actions of experienced agents or experts

4.1.3 Traditional RGB

4.1.4 In-context Learning

4.1.5 Optimization in the Agent System

4.2 Agent Systems (zero-shot and few-shot level)

5 Agent AI Categorization


5.1 Generalist Agent Areas:

5.2 Embodied Agents

5.2.1 Action Agents

5.2.2 Interactive Agents

5.3 Simulation and Environments Agents

5.4 Generative Agents

5.4.1 AR/VR/mixed-reality Agents

5.5 Knowledge and Logical Inference Agents

5.5.1 Knowledge Agent

5.5.2 Logic Agents

5.5.3 Agents for Emotional Reasoning

5.5.4 Neuro-Symbolic Agents

6 Agent AI Application Tasks

6.1 Agents for Gaming

NPC Behavior

Human-NPC Interaction

Agent-based Analysis of Gaming

Scene Synthesis for Gaming

6.2 Robotics

Visual Motor Control

Language Conditioned Manipulation

Skill Optimization

6.3 Healthcare

Diagnostic Agents.

Knowledge Retrieval Agents.

Telemedicine and Remote Monitoring.

6.4 Multimodal Agents

6.4.1 Image-Language Understanding and Generation

6.4.2 Video and Language Understanding and Generation

6.6 Agent for NLP

6.6.1 LLM agent

6.6.2 General LLM agent

6.6.3 Instruction-following LLM agents

7 Agent AI Across Modalities, Domains, and Realities

7.1 Agents for Cross-modal Understanding (image, text, audio, video, …)

7.2 Agents for Cross-domain Understanding

7.3 Interactive agent for cross-modality and cross-reality

7.4 Sim to Real Transfer

8 Continuous and Self-improvement for Agent AI

8.1 Human-based Interaction Data

  • Additional training data
  • Human preference learning
  • Safety training (red-teaming)

8.2 Foundation Model Generated Data

  • LLM Instruction-tuning
  • Vision-language pairs

9 Agent Dataset and Leaderboard

9.1 “CuisineWorld” Dataset for Multi-agent Gaming

• We provide a dataset and the related benchmark, called Microsoft MindAgent, and correspondingly release the dataset "CuisineWorld" to the research community.

Typo in 9.1.2 Task

9.2 Audio-Video-Language Pre-training Dataset

  1. Video Text Retrieval

  2. Video Assisted Informative Question Answering

CVPR 2024 Tutorial on Generalist Agent AI

Because of nearly a century of lagging development and exclusion by an international community dominated by Western capital, it took the desperate efforts of several generations over recent decades for us to finally catch up to a world-leading level in most fields.

However, because most industries and fields originated in Western society, we have had to follow the unified norms and standards set by the West in order to serve international markets, which in my view is a very dangerous thing. Admittedly, if the whole world produces products in an industry under a single set of standards, products from different companies interoperate better; but this inevitably brings a great deal of uncertainty, for example:

  1. There is no guarantee that the standards are free of favorable bias: in my experience, most Western-made standards carry some bias toward particular interests. The big industry oligopolies, in order to set rules that favor their own products, will band together to define standards that are not all that grounded in reality.
  2. There is no guarantee that some standards are free of patent oligopolies: when some standards are first established, only certain companies hold the matching patented products, so once such a standard is fixed, practically every company must license those companies' patents in order to produce. Such standards are also highly unreasonable.
  3. There is no guarantee against standards being weaponized through national sanctions: in some cases, because we are sanctioned, we are not even allowed to use the same standards as the international community, since those standards often involve the patent oligopolies mentioned in point 2.

Therefore, given the current severe situation, I believe our compatriots in every industry should establish our own standards alliance and prepare for what is to come.

In fact, we already have many precedents of successful competition in standards setting, such as UnionPay versus Visa and Mastercard, or Huawei's fight over 5G standards. Still, I think those standards were defined as resistance within the framework of the Western bloc. I keep reflecting: why can't we step outside the Western framework and walk our own path? Our products sell all over the world, their quality is among the best, and we have our own technology and our own sales channels, so why should we follow other people's standards? Why can't the entrepreneurs in each of our industries produce according to their own standards? Why can't we establish an industry standards-setting alliance like the IEEE?

I believe that, given our national conditions, our industry standards alliance cannot simply copy the Western world's approach. We should develop a standards alliance with our own characteristics. Here are a few of my thoughts:

  1. Cross the river on others' shoulders: as the title says, starting from scratch and redefining every standard is unrealistic. We can perfectly well imitate standards others have already set, improve on them, and then define our own. Especially in industries where others no longer let us play, there is no need to be polite or bound by contractual spirit or moral codes; just imitate first and then set our own rules, crossing the river on others' shoulders and working while standing on the shoulders of "giants".
  2. Be pioneers in new fields: for mature industries we can "cross the river on others' shoulders", but for emerging industries we should have confidence in our own industries and directly define our own norms and standards. Especially in AI, chips, and operating systems, we must set our own standards and pool our strength to grow big and strong; these fields urgently need everyone in the industry to work together to break free of sanctions.
  3. Companies should unite: since we want to set our own standards, someone has to define and follow them. On this point I call on companies in every industry to set aside the grudges born of past competition and understand that when the lips are gone the teeth grow cold. Let the big companies take the lead, everyone stand together, and jointly define standards and norms that support the industry's sustainable development.
  4. Share prerequisite patents: setting standards and norms inevitably involves prerequisite patents. I believe that if our companies could generously share these prerequisite patents for free or at low cost, it would bring every benefit and no harm; after all, a patent without a market cannot be turned into profit.
  5. Government participation and oversight: there is nothing wrong with companies pursuing profit as their sole aim, but we cannot ignore the interests of consumers and the nation's people, so a moderate amount of government participation and intervention is still very necessary. As for why the government should not define the standards directly, it is probably more appropriate to let professionals handle professional matters.

In short, I am calling here for companies in every industry to unite and together found a "Unified Standards Setting Alliance", where we jointly define our own industry standards and avoid being held by the throat by outside powers. Take operating systems as an example: domestic operating systems have developed rapidly in recent years, but each company still works in its own camp with its own standards, which does nothing to help the software ecosystem built on top of them. If companies could deal with each other openly, share their patents, norms, and standards, stop reinventing the wheel, and concentrate on building one standard system, then software developers could build their code once and run it on every operating system, which would greatly attract developers' enthusiasm.

All of the above is just my rambling, a wild brainstorm that insiders may find very unrealistic or even laughable. If readers find it makes sense, feel free to contact me and we can dig deeper and build a better blueprint together; if you think it is a joke, then please just treat it as one, laugh, and move on.

An AI version of a "mining" system for AIGC: taking the Stable Diffusion image generation model as an example


With the rise of AIGC models, when we want to generate an image, some people lack computing power while others have plenty of it. Moreover, in AI applications, roughly 70% of the computing power goes into inference. So I wonder whether we could design something similar to Ethereum mining: we buy gas with cryptocurrency, purchasing gas is roughly equivalent to buying computing power and storage space, and we can then use the gas to deploy our AI smart contracts.

When a contract runs, the model may be split across several machines, and splittable models are still an emerging research direction, e.g. the Forward-Forward Algorithm and my DFF: Distributed Forward-Forward Algorithm paper; current research is not yet mature. Therefore, here we take the steps of the Stable Diffusion model as the unit of splitting. That is, each step is dispatched to a different machine, as shown in the figure below:

ai_miner.drawio.png
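A purely conceptual Python sketch of the step-splitting idea is below: the latent produced by one step is shipped to whichever machine runs the next step. The /run_step endpoint, the worker URLs, and the latent format are all hypothetical placeholders; note that the steps stay strictly sequential, which is exactly the concurrency question in problem 5 below.

# Conceptual sketch: dispatch each diffusion step to a different machine (round-robin).
import json
import urllib.request

WORKERS = [  # hypothetical service node addresses
    "http://worker-a.example:9000",
    "http://worker-b.example:9000",
    "http://worker-c.example:9000",
]

def run_step_remote(worker_url, latent, step, prompt):
    # Ship the current latent to a worker's (hypothetical) /run_step endpoint.
    payload = json.dumps({"latent": latent, "step": step, "prompt": prompt}).encode("utf-8")
    req = urllib.request.Request(worker_url + "/run_step", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["latent"]

def distributed_sampling(prompt, num_steps=30):
    latent = [0.0] * 16  # stand-in for the initial noise latent
    for step in range(num_steps):
        worker = WORKERS[step % len(WORKERS)]   # assign steps round-robin across machines
        latent = run_step_remote(worker, latent, step, prompt)
    return latent  # the final latent would be decoded to an image elsewhere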

The main open problems in building this system are (more to be added later):

  1. How do we split a general problem into smaller problems?
  2. If we want to train a model, how do we handle the dataset during training?
  3. How does each node evaluate whether it has "mined" the model result we hope to get?
  4. How do we backpropagate? Is backpropagation needed at all?
  5. How do we handle concurrency, and how do we design the model for concurrent inference or training?

My rethinking of image generation and recognition


The human brain is the most powerful image generation machine. Although we cannot project the images directly, we can imagine all sorts of things in our heads; in our dreams we can even roam freely through a world of pure imagination.

But how do we humans generate these things in our minds? In my understanding, even though we can imagine anything, this "anything" rarely goes beyond our cognitive range, which is what we usually call the limitation of imagination.

I believe we have this limitation because we must first observe existing things and abstract their features before we can imagine other things built from some of those features.

Therefore, if we build AI models that follow the human way of thinking, I assume the model is divided into three parts: observation, imagination, and thinking.

humangen.drawio.png

The figure above is what I think the human brain's reasoning logic looks like, or equivalently the structure of an image generation AI model; the structure is somewhat similar to a GAN (Generative Adversarial Network).

Observation: we observe real images and learn to extract their hidden features (Hidden Feature).

Hidden Feature: the observed features are used in two ways, one for thinking and one for generating images.

Imagination: from the summarized Hidden Feature we try to generate a new image that is similar to the original real image yet different; the goal is to produce an image from which the observation module extracts features similar to those of the real image.

Let's return to the way humans learn. When we were toddlers learning the word "car", the teacher would show us a picture card of a car and repeatedly tell us that this thing is a "car". Once we remembered that picture and the corresponding word, something miraculous happened: even when we saw a car that looked completely different, we could still "guess" that it was a car! I attribute this magical ability to human imagination. After seeing the picture card of a "car", we learn not only the features of that particular picture but also many abstract features, for example that a car has four wheels and a steering wheel, and from these abstract features we freely imagine different cars. The appearance of these imagined cars in turn helps us recognize real cars in the world. After observing real and imagined pictures N times, we can both think about what a "car" is and imagine what a car looks like, as shown below:

baby_learn.drawio.png

Thinking: when thinking about the features, we mainly need to consider two questions:

Is the picture real or imagined?

A real-vs-fake binary classification problem.

What does the image describe?

An NLP description problem.

Therefore, based on this line of thinking, we might be able to build a one-shot or few-shot model: by repeatedly observing (Observation) and imagining (Imagination) the same picture and thinking about both, we may end up with three models: an encoder, a generator, and a thinker, as shown below:

aigen.drawio.png
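As a very rough PyTorch sketch of these three modules (all shapes, layer sizes, and the description head are placeholder assumptions, not a worked-out architecture):

# Minimal PyTorch sketch of the encoder (Observation), generator (Imagination),
# and thinker (Thinking: real/fake judgement plus a stand-in description head).
import torch
import torch.nn as nn

class Encoder(nn.Module):            # Observation -> Hidden Feature
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU())
    def forward(self, img):
        return self.net(img)

class Generator(nn.Module):          # Imagination: Hidden Feature -> imagined image
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 3 * 64 * 64), nn.Tanh())
    def forward(self, feat):
        return self.net(feat).view(-1, 3, 64, 64)

class Thinker(nn.Module):            # Thinking: is it real, and what does it describe?
    def __init__(self, feat_dim=128, desc_dim=32):
        super().__init__()
        self.real_or_fake = nn.Linear(feat_dim, 1)     # real/imagined binary head
        self.describe = nn.Linear(feat_dim, desc_dim)  # stand-in for an NLP description head
    def forward(self, feat):
        return torch.sigmoid(self.real_or_fake(feat)), self.describe(feat)

encoder, generator, thinker = Encoder(), Generator(), Thinker()
real = torch.randn(4, 3, 64, 64)     # a batch of "real" images
feat_real = encoder(real)            # observe the real image
imagined = generator(feat_real)      # imagine a similar image
feat_fake = encoder(imagined)        # observe the imagined image
p_real, desc = thinker(feat_fake)    # think about the imagined image
print(p_real.shape, desc.shape)      # torch.Size([4, 1]) torch.Size([4, 32])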

Loss design: when we design the loss functions, they mainly target the outputs of the Thinker module; for both the real/fake output and the corresponding NLP description we can design a suitable distance loss.

Additionally, we can design a prompt encoding module to give our imagination a hint, so as to better control what the generator produces.

aigen_with_prompt.drawio.png

The above is a little of my thinking about human learning and AI learning.

Problem

When I tried to use Interactive Annotation in Label Studio with the Segment Anything Model at commit version b2c31d3, I ran into a problem: the previous annotation disappeared after making a new annotation. (The latest version has fixed this problem [link].)

Fix

To fix this problem, we need to modify the label-studio-ml-backend/label_studio_ml/examples/segment_anything_model/segment_anything_model.py file and change the way the result id is generated, from:


''.join(random.SystemRandom().choice(string.ascii_uppercase + string.ascii_lowercase + string.digits))

to

from uuid import uuid4
str(uuid4())[:4]

Combat Effectiveness Detection using YOLOv8 and Tensorflow.js



Combat Effectiveness Detection application right in your browser. Serves YOLOv8 in the browser using tensorflow.js with the WebGL backend.

DEMO

Check it!

Setup

git clone https://github.com/dengbuqi/Combat-Effectiveness-Detection_yolov8-tfjs
cd Combat-Effectiveness-Detection_yolov8-tfjs
yarn install #Install dependencies

Scripts

yarn start # Start dev server
yarn build # Build for productions

Model

YOLOv8n model converted to tensorflow.js.

used model : yolov8n
size : 13 Mb

Use another model

Use another YOLOv8 model.

  1. Export YOLOv8 model to tfjs format. Read more on the official documentation

    from ultralytics import YOLO

    # Load a model
    model = YOLO("yolov8n.pt") # load an official model

    # Export the model
    model.export(format="tfjs")
  2. Copy yolov8*_web_model to ./public

  3. Update modelName in App.jsx to new model name

    ...
    // model configs
    const modelName = "yolov8*"; // change to new model name
    ...
  4. Done! 😊

Reference