
I have been studying computer science for almost 11 years now, and doing artificial intelligence research for more than five. Research life is of course never easy: what we face every day is work that pushes at the limits of human knowledge, so things can hardly go smoothly all the time. Even a top researcher like Kaiming He has said that 95% of research time is frustrating; we persist precisely for the joy of the other 5%. But our time in life is finite, while the pressure from family, society, and the road ahead is endless. On a personal level, how to get the best return on our invested effort within limited time and according to our own abilities is a question we need to keep asking along the research road. I stumbled over these questions for many years myself; below are some lessons I have drawn from my own research life.

  • 1. Mindset:

    The first thing I want to talk about is mindset. In my life experience, mood and mindset are the most important things. Whatever we do, if we keep a pleasant mood and an optimistic mindset, things tend to move in a good direction. If it doesn't work this time, try again next time; if it still doesn't work, change the method and try another way. Success without any setbacks is of course delightful, but continual failure is the norm for ordinary people.

    Therefore, whenever I faced an important decision or major event in my life, my dear grandfather (writing this, I suddenly miss him very much; he passed away in April 2024, and he did not close his eyes when he left. I flew 2,000 kilometers back to close them with my own hands; I like to think he was, in some way, waiting for me) would most often quote Mao Zedong to me: "Despise the enemy strategically, but take the enemy seriously tactically." "Despising the enemy strategically" means daring to face the difficulties and adversaries in life and believing we are capable of overcoming them all. "Taking the enemy seriously tactically" means being very careful and rigorous in concrete actions and plans, leaving no room for error.

    In short, mindset matters a great deal in scientific research. No matter how brilliant the people around you are, or how anxious you are to graduate, keep yourself in good spirits. Do your research seriously with a calm mind, and the results will come naturally, like water finding its channel; in the end it is all just practice.

  • 2. Keeping Up with the Latest Research:

    When it comes to actual research, I believe everyone starts out lost: not knowing what to work on, what is likely to produce results, or what they are good at. Honestly, I still haven't fully figured this out. But I figure that what everyone else is working on cannot be too bad, and standing on the shoulders of giants should at least be the safest bet. So unless there is something you are absolutely determined to do, something you would devote your life to, just track the latest research and follow what others are doing. I know many people will object that research needs depth and continuity to produce major results. But consider that we work in artificial intelligence, a field developing at breakneck speed, with enormous resources and manpower already pouring in, a field where papers I read half a year ago are already obsolete; I really don't expect to cling to a single problem for long here. Besides, most of us are ordinary people who simply cannot compete with the big names who have resources and connections.

    So how exactly do we get the latest research information?

    I think there are roughly three channels:

    1. Online channels: Online channels are the best: you depend on no one, the information volume is large, and updates are fast. Plenty of free websites let us see the latest research every day, for example: https://deeplearn.org/ https://paperswithcode.com/ https://huggingface.co/models https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/ These sites post the latest papers and code in real time, and we can follow them according to our own research.
    2. The job market: Whatever the job market is hiring for, that direction is certainly among the hottest right now and the most likely to create commercial value. We don't necessarily have to do research in that direction, but we do need its basic knowledge and skills, which also better support our ability to carry out projects.
    3. Personal connections: Don't underestimate the power of people. As the old saying goes, "the onlooker sees the game best." When I get stuck in research, I often cannot see where the crux of the problem lies. Talking to and asking people with relevant experience may break us out of our habitual thinking and lead to better approaches. How do we meet such people? In one sentence: get moving, open your mouth, don't be shy, ask bravely. Attend more conferences and join more study and discussion groups; whether online or in the real world, open up your social circle, don't be shy, ask more, listen more, learn more, and you will eventually get the right information.
  • 3. Your Own Research:

    After reading papers widely and in volume, we should roughly know which direction we could try. Personally, I think agreeing on a research direction with your advisor is essential, but when you pitch your direction to them, be thoroughly prepared and confident, ideally with some preliminary results in hand. Why? Because only then will your advisor trust that you can keep going in this direction, and even if it stalls, the existing results can still yield some small output.

    So how exactly do we pick a direction, do our own research, and get results? I see the following aspects:

    1. Read high-quality papers widely: Computer science and AI are special in being disciplines where both theory and engineering practice matter enormously; you cannot even fake it, because code does not lie. This means that when we find a valuable paper, we should also check its code and run it to see whether it works as described. Once we accumulate paper and code knowledge in a direction, research ideas come naturally.
    2. Build research ideas on top of others' work: Constructing an idea entirely from scratch is possible, but such an idea requires a solid mathematical foundation to prove its theoretical feasibility, highly proficient programming skills to turn the theory into engineering-grade code, and a bit of luck (the luck to succeed, and the luck that nobody publishes it before you finish).
    3. Don't chase absolute perfection: I believe every researcher starts out wanting to perfect their work before publishing. But how many chances at perfection does a life offer? If our results reach a submittable level, my suggestion is to submit a paper first, then continue the research cycle and polish the problem further.
    4. Have a plan for the research cycle: After finishing one paper, arrange to enter the next research cycle (find a problem → survey the literature → identify points to improve → experiment → results → paper).
  • 4. Health and Research Time Management:

    Facing a high-pressure research environment and project deadlines, we inevitably run into health problems, and that is when a strong body matters most. I am not a positive example here: at my busiest, I still pulled all-nighters running experiments and writing papers. But physical health is an essential guarantee of sustained research. To keep a good physical state and mood for ongoing research work, sensible time planning becomes especially important.

    1. First, once a schedule is made, stick to it, and only make schedules you can actually complete.
    2. Second, leave slack time for every task instead of riding the deadline; otherwise, if an unexpected event costs you time, you will do everything afterwards in a rush and in anxiety (and inevitably stay up all night).
    3. Sort tasks by urgency and importance. I know everyone knows this, but it really matters: light, non-urgent tasks can be set aside and done slowly, and it is fine if they get delayed. There is also a class of tasks we expect to fail; for these, learn to cut losses and give them up when appropriate. Giving up now is for better outcomes later.
    4. Finally, if a real emergency makes staying up unavoidable, don't do it more than two days in a row, and catch up with about four hours of sleep during the day.
    5. Life is short, so enjoy it in time: play hard when it is time to play, and work hard when it is time to work.
  • 5. Paper Writing:

    Once we have some research results, it is time to write the paper. Beyond a good experimental result, a paper can also rest on a good idea or a good story.

    Telling a good story is a key skill in paper writing; no reviewer will spend time working out all your details. They will only see what you present to them. Only when your writing is interesting and meaningful will it draw the reviewer in. So how do we tell the story of a paper well?

    1. Writing ability: Nobody can do this for you, not even ChatGPT. It is not merely a matter of vocabulary and grammar (though those matter too; even with GPT's help we still need to be careful), but of language habits and structural habits, which can only be improved through extensive reading.
    2. Building the story: Motivation: A good motivation is a good opening. If the motivation is interesting or solves a real problem, it strongly attracts people to keep reading your paper. Unless you are a top-tier researcher, please follow this rule.
    3. Building the story: Contribution: A good story needs good methods and contributions to show the paper's value. Sometimes breakthrough contributions are out of reach, which is frustrating, but we can demonstrate our method's contribution from different angles, such as parameter count, novel and effective algorithm design, computation time, memory usage, efficiency, multi-tasking, and so on. We can also describe, where appropriate, which applications and downstream tasks these contributions provide valuable experience for, thereby showing the paper's value.
    4. Building the story: Results and Conclusion: The presentation of results and conclusions is critical, and reviewers usually pay close attention to it. So choose carefully which results to present, and don't just casually throw SOTA numbers in for comparison. Your results need to be compared against models evaluated under conditions similar to yours, to show the advantage of your method. If possible, also run ablation studies to further demonstrate that the methods used in the paper improve performance.
    5. Prefer short, simple sentences over long, complex ones: People from East Asia writing in English easily fall into the mindset that only long, complex sentences show professional writing. If you genuinely have that level of writing skill, long sentences can indeed show competence. But East Asian language habits make it easy to mix subjects, leaving one sentence with several subjects and making it hard to read. So in my view, when our writing ability is limited, several simple sentences explain what we want to say better.
    6. Be rigorous with figures and tables: On a first pass, reviewers look at the abstract, figures, tables, results, and conclusion. This means our figures and tables must be as close to perfect as possible, because defects in them are very easy to spot, and any error there gets magnified. The most common errors are in captions and in the text inside figures and tables.
    7. Use tools well: Skilled use of tools significantly improves writing efficiency, for example LaTeX for typesetting and figures, DeepL for translation, Grammarly for grammar checking, ChatGPT for structuring the framework and logic, Notion AI for polishing the text, Diagram for drawing figures, and Google Scholar for building reference entries.
  • 6. Paper Submission:

    Once the paper is finished, let's submit it. As the saying goes: choice matters more than effort. Choosing a good venue gives our paper a better return on effort. First, an introduction to how international journals and conferences in computer science are categorized.

    Journals:

    Journal Impact Factor (IF): Generally, for SCI papers, the higher the IF, the higher the journal's standing.

    IF.png

    Indexing: SCI/SCIE, EI/EI Compendex, Scopus, ACM Digital Library/IEEE Xplore, etc.

    JCR journal ranking (SCI-indexed):

    Q1: top journals, top 25%
    Q2: upper-middle journals, 25%-50%
    Q3: middle journals, 50%-75%
    Q4: relatively ordinary, 75%-100%

    Publishers:

    IEEE/ACM/Springer Nature/Elsevier/Wiley/Oxford/MDPI/Hindawi/Frontiers

    Generally speaking, JCR Q1, the publishers IEEE, ACM, Springer Nature/Elsevier, and CCF rank A are best, with prestige decreasing in the order introduced.

    Academic Conferences:

    The conference situation is more complicated and changes considerably every year. Korea has the BK list and China has the CCF list for reference. Generally national < regional < international, and the more editions a conference has held, the more it proves it has a stable stream of submissions keeping it going.

    Compared with journals, conferences give you fast access to the most cutting-edge work and the most capable people. I think attending conferences is very good for one's research. Don't shy away from the fatigue or the cost: this is the most direct way to learn where the field is heading and to meet like-minded people. The knowledge gained and the people met at a conference may become part of our next paper.

  • 7. Paper Collaboration:

    I think any chance to collaborate with people outside your own lab is a great experience. Such collaboration lets us examine the soundness of our research from different perspectives. Academic research is not about working behind closed doors; only through communication and collaboration can we find the key problems and come up with interesting methods.

    Paper collaboration mainly comes in two types:
    Collaboration within the same field:

    I have no experience here, but if such opportunities arise, I recommend taking them. Most excellent papers today are not written by a single institution; they are often joint work across several.

    Cross-disciplinary collaboration:

    Given the special nature of computer science, and AI in particular, seeking cross-disciplinary collaboration actually puts us at an advantage, and the tasks may not even be hard. Such cross-disciplinary experience enriches our CV and makes us look more capable, and it can even open the door to later moving from a computer science base into research in another field.

  • 8. Project Experience:

    Project experience is not only the integrated application of what we have learned, but also the key route from theory to real solutions. Beyond research, systematically participating in high-quality projects significantly raises our overall competitiveness and demonstrates our thinking and execution in real problem settings. In employers' eyes, the ability to turn theoretical knowledge into deployable solutions matters just as much; project experience is the concentrated evidence of this "ability to deliver." A complete project usually spans requirements analysis, solution design, model implementation, evaluation, and deployment optimization, showcasing our technical breadth, engineering skill, and teamwork.

    Building real problem-solving ability

    Through project practice we not only become familiar with toolchains and engineering environments; more importantly, we learn how to locate and decompose problems in complex, real systems and find optimal or near-optimal solutions. This ability goes far beyond paper knowledge: it represents systematic thinking and strong execution, truly "solving problems" rather than just "doing research."

    When job hunting

    In job hunting, project experience is often more direct and persuasive than papers. Employers can more easily see from project history whether we have the skills the position needs and whether we can adapt quickly to the work environment. For example, in a CV or interview we can present a project in the structure "problem background - solution - implementation - result analysis" so the interviewer quickly grasps our strengths. Moreover, many positions care more about the concrete outputs of a project, such as code quality, performance metrics, and user feedback, which are hard to convey in papers.

  • 9. Resource Constraints:

    Resources are finite while human desires are infinite, so we are bound to hit resource limits in research.

    Limited physical resources (compute, equipment, time, ...):

    • Optimize the model structure and use lightweight methods

    • Use cleverer methods

    • Use open-source tools

    • Split the problem into several small problems, solve them one at a time, and publish a paper per solved sub-problem

    • Solve the "most cost-effective problem" first.

    Limited interpersonal resources:

    • Actively participate in online communities, technical forums, open-source projects, and academic conferences

    • Even small-scale collaborations can spark new ideas

    • A good project, a good article, a talk: each is part of "accumulating resources" and building your personal visibility

    • Aim for efficient communication and deep collaboration

  • 10. Handling Setbacks:

    Know your own strength and act within your means

    Finally, after we have made every effort, the result may still not be what we hoped. We should prepare in advance for bad outcomes. We need to know ourselves and our limits. People spend a lifetime getting to know and understand themselves, and through that, breaking their limits; so on life's road these setbacks and failures are unavoidable and acceptable. If we fail, pick ourselves up and try again. But not more than three times: if it fails three times, let it go and look for another way. All roads lead to Rome.

1. ComfyUI

https://github.com/comfyanonymous/ComfyUI

Before introducing our decentralized content generation platform, let's first explain what ComfyUI is and its core concept.

ComfyUI—The most powerful and modular diffusion model GUI and backend.


From its description, we can see that it is a framework that lets you design a workflow that meets your own needs across a large number of generative models such as Stable Diffusion. Each model or data operation is designed as a node that can be added to your workflow. Here you can combine the strengths of various models to build your own content generation workflow locally, including but not limited to text2img, img2img, inpainting, LoRA, and other workflows.

2. ComfyUI Pain Points

However, ComfyUI also has a significant pain point: if your workflow runs a large number of models, you need a single high-performance computer to run them all locally. The reality is that, as individuals, we may only be able to run one or a few of the models in a workflow on our own PC.


3. Decentralized ComfyUI

Thanks to ComfyUI's design, we can actually treat each node in a workflow as an independent entity. So why should we create the same nodes on different computers? That is a pure waste of computing power. Imagine my computer can only run model node A, while Smith's computer can run model node B and Jones's can run model node C. If my workflow needs nodes A, B, and C, couldn't I borrow Smith's and Jones's computing power to run it? Correspondingly, Smith and Jones could also use my node to run the workflows they want. The only thing we need is to expose our nodes to each other as API services.

DComfyU.drawio.png

4. How to Build a Decentralized ComfyUI

In my plan, we should rely heavily on ComfyUI, or at least its node concept, to build our decentralized platform. The main design points are the following:

1. ComfyUI API Service Node

Based on ComfyUI's features, we can directly build API service nodes, automatically packaging other models' nodes into individual API services exposed to other users.
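As a rough illustration, a provider could wrap their local ComfyUI instance behind a small web service. This is a minimal sketch only, assuming a stock ComfyUI instance listening on 127.0.0.1:8188 with its usual POST /prompt endpoint; the wrapper's route name and structure are made up here.

import requests
from fastapi import FastAPI

app = FastAPI()
COMFYUI_URL = "http://127.0.0.1:8188/prompt"  # local ComfyUI HTTP endpoint (assumed default)

@app.post("/node/run")
def run_node(workflow: dict):
    # Relay the (partial) workflow JSON to the local ComfyUI instance
    # and hand its response back to the remote caller.
    resp = requests.post(COMFYUI_URL, json={"prompt": workflow})
    return resp.json()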

2. Node Management Center

Although we can build many ComfyUI API service nodes ourselves, most people do not have a public IP with which to expose their services. We should therefore build a node management center that manages the service nodes and, at the same time, forwards the various private-IP service nodes, exposing them to other users. Users can then consume the node services they want through the management center.
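A toy sketch of such a center, under the assumption that providers register which node types they can run and the center simply relays requests (all names here are hypothetical; in practice providers behind NAT would also need a reverse tunnel to the center):

import requests
from fastapi import FastAPI, HTTPException

app = FastAPI()
registry: dict = {}  # node type -> provider endpoint URL

@app.post("/register")
def register(node_type: str, endpoint: str):
    # A provider announces that it can serve a given node type.
    registry[node_type] = endpoint
    return {"status": "registered"}

@app.post("/run/{node_type}")
def run(node_type: str, payload: dict):
    if node_type not in registry:
        raise HTTPException(status_code=404, detail=f"no provider for {node_type}")
    # Forward the request to the registered provider on the user's behalf.
    return requests.post(registry[node_type], json=payload).json()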

3. Security (tokens, data-transmission encryption, anonymization, and so on)

Once the node management center is built, we then need secure and reliable communication among the service nodes, the management center, and the users. We need to support token authentication, data-transmission encryption, anonymization, and more.
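For example, token authentication between the center and the nodes could start from something as simple as shared-secret HMAC tokens. This is a sketch only; a real deployment would layer this on TLS and an established scheme such as OAuth2 or mutual TLS.

import hashlib
import hmac

SECRET = b"shared-secret-distributed-out-of-band"  # placeholder secret

def issue_token(node_id: str) -> str:
    # Sign the node id so the center can later verify who is calling.
    return hmac.new(SECRET, node_id.encode(), hashlib.sha256).hexdigest()

def verify_token(node_id: str, token: str) -> bool:
    return hmac.compare_digest(issue_token(node_id), token)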

4. Payment Strategy (cryptocurrency? Online payment? Credit card?)

To spread our decentralized nodes as widely as possible, profit is certainly the biggest driver. Node users need to pay a fee to node providers, and the node management center also needs funds to cover server and telecom costs. In this way we can promote the healthy development of the whole community.

5. Everyone can make money, and everyone can use large-scale computing services

At this point, everyone can set up a node on their own computer to earn income, and everyone can freely use a huge amount of computing power to generate the content they want.

5. Moreover

In addition, although our platform is based on the concept of ComfyUI, I believe that with reasonable design we can apply it not only to content generation workflows but to compute sharing for any AI model workflow. We are striving to build a global AI compute-sharing platform. Perhaps in the future the platform could also be used for upstream stages of AI models, such as model training and data collection.

Dynamic Neural Architecture (DNA)

Check out my project on GitHub

This is my little research project. I am trying to make a dynamic neural network with one layer for decentralized artificial intelligence (DAI) purposes. We can dynamically delete and create neural cell nodes. The idea is inspired by Hebbian theory. I am not sure whether it will work or not. Let's see!

Fire together, wire together (Hebbian theory)

Let us assume that the persistence or repetition of a reverberatory activity (or “trace”) tends to induce lasting cellular changes that add to its stability. … When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased. 

– From Hebbian theory

So, in my understanding, for an AI model neuron: the "persistence or repetition of a reverberatory activity" is the training process, and a "cell" is a neural structure in the model. "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased" means that if neuron A is activated, A will subsequently activate (fire) neuron B.

How about making an AI model based on this theory?

Before designing my model, I try to modify one rule in Hebbian theory. In Hebbian theory, cell A is activated before B, and then cell B is activated by cell A. But to make a parallel architecture for my future plan of a decentralized model, I am going to activate neuron A and neuron B simultaneously, because in my opinion they will eventually both be activated by the same activity regardless of the order.

Neural cell

So, in my idea, a neural cell can be modeled as a weighted signal function minus a signal leak [Hodgkin–Huxley model].

I = C_m \frac{\mathrm{d} V_m}{\mathrm{d} t} + g_K\left(V_m - V_K\right) + g_{Na}\left(V_m - V_{Na}\right) + g_l\left(V_m - V_l\right)

So, our neural cell can be simplified as:

I = w V_m - b

Yes, we could also replace this simple function with a more complicated one, such as those used in the KAN model.
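As a minimal sketch of this simplified cell (the class layout and field names are my own placeholders), each cell is just I = w·V − b plus bookkeeping of its input/output connections:

from dataclasses import dataclass, field

@dataclass
class NeuralCell:
    w: float = 1.0  # passthrough weight (initialized to 1; see "Model structure initialization")
    b: float = 0.0  # leak / bias term (initialized to 0)
    inputs: list = field(default_factory=list)   # integer ids of upstream cells
    outputs: list = field(default_factory=list)  # integer ids of downstream cells

    def forward(self, v: float) -> float:
        return self.w * v - self.b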

Can it work?

Commonly, the universal approximation theorem says that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function. So I think that if we have enough neural cells, we can approximate any continuous function. Meanwhile, according to the Kolmogorov–Arnold representation theorem, every multivariate continuous function can be represented as a superposition of two-argument additions of continuous functions of one variable, so we can build many simple many-to-one neural cells for each variable to represent any continuous function. [Maybe I am wrong.]

Create and delete neural cell

Yes, in my DNA system the neural cells will be dynamically created or deleted.

Neural cell creation

It's a simulation of "fire together, wire together". When the passthrough (we consider it as the weight $w_A$) of neural cell A is large, we create a branch neural cell B to enhance this signal. Once we create neural cell B, A and B will share A's passthrough equally, which means the initialization will be:

w_B = 0.5 \times w_A; \quad b_B = 0; \quad w_A \leftarrow 0.5 \times w_A; \quad b_A \leftarrow b_A
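A sketch of this split rule, reusing the NeuralCell class from above (copying A's connection lists to B is my own assumption about what "branching" means here):

def create_branch(a: NeuralCell) -> NeuralCell:
    # B inherits half of A's passthrough and copies A's wiring.
    b = NeuralCell(w=0.5 * a.w, b=0.0,
                   inputs=list(a.inputs), outputs=list(a.outputs))
    a.w = 0.5 * a.w  # A keeps the other half
    return b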

Neural cell deletion

On the opposite side, we also need to delete neural cells whose passthrough is very small. We simply delete such a cell along with its input and output relations to other neural cells.

But how

Well, I am still struggling with how to make a rule that tells us when to create neural cells and when to delete them.

My naive idea for now is simply to create a new branch for the top 10% of neural cells by passthrough and delete the bottom 10%. [TBD, needs rethinking]
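A sketch of that naive rule over a pool of cells, where cells is a dict mapping integer cell ids to NeuralCell objects (the 10% thresholds are exactly the placeholders flagged as TBD above):

def select_and_restructure(cells: dict) -> None:
    # Rank cell ids by absolute passthrough.
    ranked = sorted(cells, key=lambda cid: abs(cells[cid].w))
    k = max(1, len(ranked) // 10)
    for cid in ranked[:k]:  # bottom 10%: delete the cell and its wiring
        cells.pop(cid)
        for other in cells.values():
            if cid in other.inputs:
                other.inputs.remove(cid)
            if cid in other.outputs:
                other.outputs.remove(cid)
    next_id = max(cells) + 1 if cells else 0
    for cid in ranked[-k:]:  # top 10%: grow a branch cell
        if cid in cells:
            cells[next_id] = create_branch(cells[cid])
            next_id += 1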

But When(Extinction)

In my plan, the neural creation and deletion process (which sounds like model pruning) will be carried out when we have finished n training epochs and the loss has not gone down for k more epochs (following the early-stopping method, except we do not stop). I give this process a new name: "extinction". It is as if our Earth has gone through many epochs in its history; in some of them species suffered mass extinctions, and then a new thriving epoch began. Likewise, in our DNA model, after n epochs the model undergoes a mass extinction of neural cells and then enters a new thriving epoch.
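In a training loop, the extinction trigger might look like the following sketch (train_one_epoch and eval_loss are hypothetical placeholders for the usual training and validation steps):

def train(cells: dict, n_warmup: int, k_patience: int, max_epochs: int):
    best_loss, stale = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(cells)
        loss = eval_loss(cells)
        if loss < best_loss:
            best_loss, stale = loss, 0
        else:
            stale += 1
        # Extinction: prune weak cells and branch strong ones, then keep training.
        if epoch >= n_warmup and stale >= k_patience:
            select_and_restructure(cells)
            stale = 0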

Model structure initialization

Before we train the DNA model, we need to decide on the initial model structure. The initial model is just like a fully connected layer, except that each node in this layer is a single variable (signal) and each edge in this layer is a neural cell.
Besides, each neural cell's trainable parameters $w$ and $b$ will be set to 1 and 0, respectively.

[Figure: DNA model layer]

TODO

  • Neural cell class
  • Model data structure to manage each neural cell’s input and output
  • Neural cell creation function
  • Neural cell deletion function
  • Extinction function
  • Training example
  • Result evaluation

Next level

  • Forward-Forward algorithm or other backpropagation-free method implementation
  • Distributing neural cells across different devices (DAI)
  • Self-perception, self-learning, and self-evolution.

YOLO-World: Real-Time Open-Vocabulary Object Detection

https://github.com/AILab-CVC/YOLO-World/blob/master/yolo_world/models/layers/yolo_bricks.py

Open Vocabulary Object Detection

Open-vocabulary detection (OVOD) aims to generalize beyond the limited number of base classes labeled during the training phase. The goal is to detect novel classes defined by an unbounded (open) vocabulary at inference. [From link]

Novelty

Traditional Object Detector

Previous Open-Vocabulary Detector

YOLO-World Model Architecture

Text Encoder (CLIP)

YOLO Detector

YOLOv8 Backbone


Re-parameterizable Vision-Language Path Aggregation Network (Vision-Language PAN)


Text-guided CSPLayer(T-CSPLayer)

Dark Bottleneck (C2f Layer)

From: https://openmmlab.medium.com/dive-into-yolov8-how-does-this-state-of-the-art-model-work-10f18f74bab1

Max-Sigmoid

X_l^{\prime} = X_l \cdot \delta\left(\max_{j \in \{1..C\}}\left(X_l W_j^{\top}\right)\right)^{\top}
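A rough PyTorch reading of this formula (shapes are my assumptions: X_l as (B, HW, D) image features and W as (C, D) text embeddings; δ is the sigmoid):

import torch

def max_sigmoid_attention(x_l: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    logits = x_l @ w.T                               # (B, HW, C) region-text similarities
    gate = torch.sigmoid(logits.max(dim=-1).values)  # max over the C text embeddings
    return x_l * gate.unsqueeze(-1)                  # reweight the image features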

Image-Pooling Attention(I-Pooling Attention)

W^{\prime} = W + \text{MultiHead-Attention}(W, \tilde{X}, \tilde{X})

Text Contrastive Head

Region-Text Matching

S_{k,j} = \alpha \cdot \text{L2-Norm}(e_k) \cdot \text{L2-Norm}(w_j)^{\top} + \beta
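Read as code (my assumptions: e holds K object embeddings, w holds C text embeddings, and alpha, beta are learnable scale/shift scalars):

import torch.nn.functional as F

def region_text_scores(e, w, alpha, beta):
    # Cosine-style similarity between L2-normalized object and text embeddings.
    return alpha * F.normalize(e, dim=-1) @ F.normalize(w, dim=-1).T + beta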

Loss

\mathcal{L}(I) = \mathcal{L}_{\mathrm{con}} + \lambda_I \cdot \left(\mathcal{L}_{\mathrm{iou}} + \mathcal{L}_{\mathrm{dfl}}\right)

where \mathcal{L}_{\mathrm{con}} is the region-text contrastive loss, \mathcal{L}_{\mathrm{iou}} is the IoU loss, \mathcal{L}_{\mathrm{dfl}} is the distributed focal loss, and \lambda_I is an indicator factor set to 1 when the input image I comes from detection or grounding data and 0 when it comes from image-text data.

[The Future of Artificial General Intelligence] AGENT AI: SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION


On 25 Jan 2024, Stanford University, Microsoft Research, the University of California, Los Angeles, and the University of Washington co-published a survey paper called AGENT AI: SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION.

Abstract:

They try to sketch the future of multimodal AI systems. They present a new concept, "Agent AI", a promising avenue toward Artificial General Intelligence (AGI).

An Agent AI system can perceive and act in many different domains and applications, possibly serving as a route towards AGI using an agent paradigm. — Figure 1

How to use ubiquitous multimodal AI systems?

Agents within physical and virtual environments, like Jarvis in Iron Man.


Overview of an Agent AI system:


Data flow:


We define "Agent AI" as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions.

We explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback.

Mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs.

We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment (like the OASIS game in the film "Ready Player One").

1. Introduction

1.1 Motivation

The AI community is on the cusp of a significant paradigm shift, transitioning from creating AI models for passive, structured tasks to models capable of assuming dynamic, agentic roles in diverse and complex environments.

  • From AI models for highly structured, fixed tasks → dynamic, agentic roles in diverse and complex environments. LLMs and VLMs → a blend of linguistic proficiency, visual cognition, contextual memory, intuitive reasoning, and adaptability → gaming, robotics, and healthcare domains → redefining human experiences and elevating operational standards → transformative impacts on industries and socio-economics.

1.2 Background

Large Foundation Models: can potentially tackle complex tasks

mathematical reasoning, professional law, generating complex plans for robots and game AI


Embodied AI: leverages LLMs to perform task planning for robots

generates Python code → executes the code in a low-level controller


Interactive Learning:

  1. Feedback-based learning: the AI adapts its responses based on direct user feedback.
  2. Observational learning: the AI observes user interactions and learns implicitly.

2. Agent AI Integration

Previous technology and future challenge

2.1 [Previous technology] Infinite AI agent

AI agent systems’ abilities:

  1. Predictive Modeling
  2. Decision Making
  3. Handling Ambiguity
  4. Continuous Improvement

[Infinite AI agent] 2D/3D embodied generation and editing interaction

RoboGen: autonomously runs the cycles of task proposition, environment generation, and skill learning. https://robogen-ai.github.io/videos/pipeline_cropped.mp4

2.2 [Challenge] Agent AI with Large Foundation Models

2.2.1 Hallucinations (fake answers)

2.2.2 Biases and Inclusivity

Training Data / Historical and Cultural Biases / Language and Context Limitations / Policies and Guidelines / Overgeneralization / Constant Monitoring and Updating / Amplification of Dominant Views / Ethical and Inclusive Design / User Guidelines


Mitigation:

Diverse and Inclusive Training Data / Bias Detection and Correction / Ethical Guidelines and Policies / Diverse Representation / Bias Mitigation / Cultural Sensitivity / Accessibility / Language-based Inclusivity / Ethical and Respectful Interactions / User Feedback and Adaptation / Compliance with Inclusivity Guidelines

2.2.3 Data Privacy and Usage

Data Collection, Usage and Purpose / Storage and Security / Data Deletion and Retention / Data Portability and Privacy Policy / Anonymization

2.2.4 Interpretability and Explainability

Imitation Learning → Decoupling

Decoupling → Generalization

Generalization → Emergent Behavior


2.2.5 Inference Augmentation

Data Enrichment / Algorithm Enhancement / Human-in-the-Loop (HITL) / Real-Time Feedback Integration / Cross-Domain Knowledge Transfer / Customization for Specific Use Cases / Ethical and Bias Considerations. / Continuous Learning and Adaptation

2.2.6 Regulation

To address unpredictable output(uncertainty) from the model.

→ Provide environmental information within the prompt

→ designing prompts to make LLM/VLMs include explanatory text

→ pre-execution verification and modification under human guidance

2.3 [Challenge] Agent AI for Emergent Abilities

Current modeling practices require developers to prepare large datasets for each domain to finetune/pretrain models; however, this process is costly and even impossible if the domain is new.

⭐Unseen environments or scenarios?

The new emergent mechanism — Mixed Reality with Knowledge Inference Interaction

Enables the exploration of unseen environments for adaptation to virtual reality.


  1. Environment and Perception with task-planning and skill observation;
  2. Agent learning;
  3. Memory;
  4. Agent action;
  5. Cognition.

3 Agent AI Paradigm

We seek to accomplish several goals with our proposed framework:

  • Make use of existing pre-trained models and pre-training strategies to effectively bootstrap our agents with an effective understanding of important modalities, such as text or visual inputs.
  • Support for sufficient long-term task-planning capabilities.
  • Incorporate a framework for memory that allows for learned knowledge to be encoded and retrieved later.
  • Allow for environmental feedback to be used to effectively train the agent to learn which actions to take.

LLMs and VLMs | Agent Transformer Definition | Agent Transformer Creation


4 Agent AI Learning

4.1 Strategy and Mechanism

4.1.1 Reinforcement Learning (RL)

4.1.2 Imitation Learning (IL)

Seeks to leverage expert data to mimic the actions of experienced agents or experts

4.1.3 Traditional RGB

4.1.4 In-context Learning

4.1.5 Optimization in the Agent System

4.2 Agent Systems (zero-shot and few-shot level)

5 Agent AI Categorization


5.1 Generalist Agent Areas:

5.2 Embodied Agents

5.2.1 Action Agents

5.2.2 Interactive Agents

5.3 Simulation and Environments Agents

5.4 Generative Agents

5.4.1 AR/VR/mixed-reality Agents

5.5 Knowledge and Logical Inference Agents

5.5.1 Knowledge Agent

5.5.2 Logic Agents

5.5.3 Agents for Emotional Reasoning

5.5.4 Neuro-Symbolic Agents

6 Agent AI Application Tasks

6.1 Agents for Gaming

NPC Behavior

Human-NPC Interaction

Agent-based Analysis of Gaming

Scene Synthesis for Gaming

6.2 Robotics

Visual Motor Control

Language Conditioned Manipulation

Skill Optimization

6.3 Healthcare

Diagnostic Agents.

Knowledge Retrieval Agents.

Telemedicine and Remote Monitoring.

6.4 Multimodal Agents

6.4.1 Image-Language Understanding and Generation

6.4.2 Video and Language Understanding and Generation

6.6 Agent for NLP

6.6.1 LLM agent

6.6.2 General LLM agent

6.6.3 Instruction-following LLM agents

7 Agent AI Across Modalities, Domains, and Realities

7.1 Agents for Cross-modal Understanding(image, text, audio, video…)

7.2 Agents for Cross-domain Understanding

7.3 Interactive agent for cross-modality and cross-reality

7.4 Sim to Real Transfer

8 Continuous and Self-improvement for Agent AI

8.1 Human-based Interaction Data

  • Additional training data
  • Human preference learning
  • Safety training (red-teaming)

8.2 Foundation Model Generated Data

  • LLM Instruction-tuning
  • Vision-language pairs

9 Agent Dataset and Leaderboard

9.1 “CuisineWorld” Dataset for Multi-agent Gaming

• We provide a benchmark called Microsoft MindAgent and correspondingly release a dataset, "CuisineWorld", to the research community.

Note: there is a typo in Section 9.1.2 (Task).

9.2 Audio-Video-Language Pre-training Dataset

  1. Video Text Retrieval

  2. Video Assisted Informative Question Answering

CVPR 2024 Tutorial on Generalist Agent AI

Because of our lagging development over the past century and exclusion by a Western-capital-dominated international community, it has taken decades of desperate effort across several generations, but we have finally caught up to world-leading levels in most fields.

However, because most industries and fields originated in Western societies, we have had to follow unified norms and standards set by the West in order to serve international markets, which in my view is a very dangerous thing. Admittedly, if the whole world produces products in one industry under one set of standards, products from different companies interoperate better; but this unavoidably carries a great deal of uncertainty, for example:

  1. There is no guarantee the standards are free of built-in bias: in my experience, most Western-set standards carry some degree of favoritism. Industry oligopolies, in order to set norms favorable to their own production, will band together to establish standards that do not fit reality that well.
  2. There is no guarantee a standard is free of patent monopolies: some standards, at the moment of their creation, can only be matched by the patented products of a few specific companies. Once such a standard is fixed, virtually every company has to buy those companies' patents in order to produce, which is deeply unreasonable.
  3. There is no guarantee against standard sanctions driven by national sanctions: in some cases, when sanctioned, we are not even allowed to use the same standards as the international community, because standards often contain the patent monopolies mentioned in point 2.

Therefore, given today's grim situation, I think compatriots in every industry should establish our own standards alliance, as a precaution for what may come.

In fact, we already have many precedents of successful competition in standard setting, such as UnionPay versus Visa and Mastercard, and Huawei's fight over 5G standards. Still, I think these standards were defined as resistance within the framework of the Western bloc. I keep asking myself: why can't we step outside the Western framework and walk our own road? Our products sell all over the world, their quality is among the best, and we have our own technology and our own sales channels, so why follow others' standards? Why can't the entrepreneurs in each of our industries produce according to their own standards? Why can't we found an industry standards alliance like the IEEE?

I believe that, given our national conditions, our standards alliance cannot simply copy the Western world's approach. We should develop an alliance with our own characteristics. Here are a few of my thoughts:

  1. Cross the river on others' shoulders: as the phrase suggests, starting from scratch and redefining every standard is unrealistic. We can perfectly well imitate standards others have already set, improve them, and establish our own. Especially in industries where others no longer let us play, there is no need to be polite or bound by contractual spirit or moral codes; just imitate and then set our own norms, crossing the river on others' shoulders and working from the shoulders of "giants".
  2. Be pioneers in emerging fields: for mature industries we can "cross the river on others' shoulders", but for emerging industries we should have our own industry confidence and directly set our own norms and standards. Especially in AI, chips, and operating systems, we must define our own standards and pool our strength to grow big and strong; these fields urgently need the whole industry working together to shake off sanctions.
  3. Companies must unite: if we are to set our own standards, someone has to establish and follow them. On this point I call on industries and companies to put aside the rifts born of past competition and understand that when the lips are gone the teeth grow cold. With large companies taking the lead and everyone united, we can together set standards and norms for the industry's sustainable development.
  4. Share prerequisite patents: setting standards and norms inevitably involves prerequisite patents. I believe that if our companies could generously share these patents for free or at low cost, it would be all benefit and no harm; after all, a patent without a market can never be converted into profit.
  5. Government participation and oversight: it is understandable that companies exist purely to pursue profit, but we cannot ignore the interests of consumers and the nation's people, so moderate government participation and intervention is necessary. As for why the government should not set the standards directly: professional matters are probably better handled by professionals.

In short, I am calling on companies in every industry to unite and found a "unified standards-setting alliance", so that together we set our own industry standards and avoid being held by the throat by external forces. Take operating systems as an example: domestic operating systems have developed rapidly in recent years, but companies each work in their own camp on their own standards, which does not help the software ecosystem above the OS. If companies could be candid with each other, share their patents, norms, and standards, stop reinventing the wheel, and concentrate on building one standard system, then developers could write code once and run it on every OS, which would greatly attract developers' enthusiasm.

All of the above is my rambling, a wild brainstorm that insiders may find unrealistic or even laughable. If you read it and find it reasonable, contact me and let's discuss how to build a better blueprint together; if you think it is a joke, then please just take it as a joke and smile it away.

AI version "Mining" system for AIGC: Take the stable diffusion image generation model as an example


With the rise of AIGC models, if we want to generate an image, some people sorely lack computing power while others have plenty of it. Moreover, in AI applications, about 70% of computing power goes to inference. So I am wondering whether we could design something similar to Ethereum mining: we buy gas with cryptocurrency, purchasing gas is roughly equivalent to buying computing power and storage space, and we can then use the gas to deploy our AI smart contracts.

When a contract runs, the model may be split across machines, and splittable models are an emerging research direction, such as the Forward-Forward Algorithm and my DFF: Distributed Forward-Forward Algorithm paper; the research there is not yet mature. Therefore, here we take the steps of the stable diffusion model as the unit of splitting. That is, we split the steps across machines, as shown in the figure below:

ai_miner.drawio.png
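As a toy sketch of this step-splitting idea (send_to_worker is a hypothetical RPC call; each worker would run its assigned range of denoising steps and pass the latent on to the next):

def distributed_sample(latent, total_steps: int, workers: list):
    # Divide the denoising steps evenly across the workers in sequence.
    chunk = total_steps // len(workers)
    for i, worker in enumerate(workers):
        start, end = i * chunk, (i + 1) * chunk
        latent = send_to_worker(worker, latent, step_range=(start, end))
    return latent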

The main open problems in completing this system are (to be supplemented later):

  1. How do we split general problems into smaller problems?
  2. If we want to train models, how do we handle the dataset during training?
  3. How does each node judge whether it has "mined" the model result we hope to get?
  4. How do we backpropagate? Is backpropagation even needed?
  5. How do we handle concurrency, and how do we design the model for concurrent inference or training?

My rethink about Image generation and recognition


The human brain is the most powerful image generation machine there is. Although we cannot display the images directly, we can imagine all sorts of strange things in our heads; in our dreams we can even roam freely through a world of pure imagination.

But how do we humans generate these things in our minds? In my understanding, even though we can imagine anything, this "anything" can hardly exceed our cognitive range, which is what we usually call the limitation of imagination.

I believe we have this limitation because we must first observe existing things and abstractly consider their features before we can imagine other things based on some of those features.

Therefore, if we build AI models that follow the human way of thinking, I assume the model divides into three parts: observation, imagination, and thinking.

humangen.drawio.png

The figure above is what I believe to be the thinking logic of a human brain, or the structure of an image generation AI model; the structure is somewhat similar to a GAN (Generative Adversarial Network).

Observation: We observe real images, learning and extracting their hidden features.

Hidden Feature: The observed features are used in two ways: for thinking, and for generating images.

Imagination: Using the summarized hidden features, we try to generate a new image that is similar to the original real image yet different; the goal is to generate an image from which observation can extract features similar to those of the real image.

Let's return to the way humans learn. When we were toddlers learning to speak, suppose we were learning the word "car": the teacher would show us a picture card of a car and repeatedly teach us that this thing is a "car". Once we remembered the picture and the corresponding word "car", something miraculous happened: even when we saw a car that looked different, we could still "guess" it was a car! I attribute this magical ability to human imagination. After seeing the "car" picture card, we learned not only the features of that particular picture but also many abstract features, such as a car having four wheels and a steering wheel, and based on these abstract features we freely imagined different cars; the imagined cars in turn help us recognize real cars in the real world. After observing real images and imagined images N times, we can reason about what a "car" is, and we also acquire a model that can imagine what cars look like. As below:

baby_learn.drawio.png

Thinking: When thinking about the features, we generally need to consider two questions:

Is the picture real or imagined? (a real/fake binary classification problem)

What does the image describe? (an NLP description problem)

Therefore, based on this chain of reasoning, we may be able to build a one-shot or few-shot model: by repeatedly observing (Observation) and imagining (Imagination) the same picture and thinking about both, we may end up with three models: an encoder, a generator, and a thinker. As below:

aigen.drawio.png
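To make the three roles concrete, here is a rough PyTorch sketch (all sizes, layer choices, and the 28x28 image shape are hypothetical placeholders, not a fixed design):

import torch.nn as nn

class Encoder(nn.Module):  # Observation: image -> hidden feature
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, dim), nn.ReLU())

    def forward(self, img):
        return self.net(img)

class Generator(nn.Module):  # Imagination: hidden feature -> imagined image
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 28 * 28), nn.Sigmoid())

    def forward(self, h):
        return self.net(h).view(-1, 1, 28, 28)

class Thinker(nn.Module):  # Thinking: real/fake logit + description logits
    def __init__(self, dim: int = 128, vocab: int = 1000):
        super().__init__()
        self.real_fake = nn.Linear(dim, 1)
        self.describe = nn.Linear(dim, vocab)

    def forward(self, h):
        return self.real_fake(h), self.describe(h)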

Loss design: the loss function mainly targets the outputs of the Thinker module; both the real/fake decision and the corresponding NLP description can be given distance-based loss terms.

Additionally, we can design a prompt encoding module that gives our imagination a hint, so as to better control what the generator produces.

aigen_with_prompt.drawio.png

The above is a little bit of my thinking about human learning and AI learning.

Problem

When I tried to use Interactive Annotation in Label Studio with the Segment Anything Model at commit version b2c31d3, I ran into a problem: the previous annotation disappeared after making a new annotation. (The latest version has fixed this problem [link].)

Fix

To fix this problem, we need to modify the label-studio-ml-backend/label_studio_ml/examples/segment_anything_model/segment_anything_model.py file and change the method used to generate the result id, from:

import random
import string

''.join(random.SystemRandom().choice(string.ascii_uppercase + string.ascii_lowercase + string.digits))

to

from uuid import uuid4

str(uuid4())[:4]