
Rethinking Image Generation and Recognition

The human brain may be the most powerful image generation machine there is. We cannot output what we picture directly, but we can imagine all sorts of things in our minds. In dreams, we roam freely through a world of pure imagination.


But how do we humans generate these things in our minds? In my understanding, although we can imagine almost anything, that “anything” rarely goes beyond what we already know. This is often called the limit of imagination.


I believe this limit exists because we must first observe existing things, then abstract and think about their features, before we can imagine other things built on some of those features.

Therefore, if we build AI models that follow the human way of thinking, I assume the model divides into three parts: observation, imagination, and thinking.


The figure above shows what I take to be the logic of human thinking, or equivalently the structure of an image generation AI model. This structure somewhat resembles a GAN (Generative Adversarial Network).

Observation: We observe real images, then learn and extract their hidden features.

Hidden Feature: The observed features are used in two ways: for thinking and for generating images.

Imagination: From the summarized Hidden Feature, we try to generate a new image that is similar to the original Real image but not identical to it. The goal is to generate an image from which Observation extracts features similar to those of the Real image.
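The Observation and Imagination steps can be sketched in a few lines of plain Python. This is only a toy illustration under my own assumptions: the functions `encode`, `generate`, and `feature_distance` are hypothetical names, the encoder is a random linear projection, and the generator simply reuses the encoder weights in transposed form and adds noise. A real model would use trained neural networks instead.

```python
import random

def encode(image, weights):
    """Observation: project a flat image (list of floats) to a hidden feature vector."""
    return [sum(w * x for w, x in zip(row, image)) for row in weights]

def generate(feature, weights, noise_scale, rng):
    """Imagination: map a hidden feature back to image space and add noise.

    Assumption: the generator reuses the encoder weights (transposed).
    """
    n_pixels = len(weights[0])
    image = []
    for j in range(n_pixels):
        value = sum(feature[i] * weights[i][j] for i in range(len(feature)))
        image.append(value + rng.gauss(0.0, noise_scale))
    return image

def feature_distance(a, b):
    """Mean squared distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
n_pixels, n_features = 16, 4

# Untrained, random encoder weights (one row per hidden feature).
enc_w = [[rng.uniform(-0.5, 0.5) for _ in range(n_pixels)]
         for _ in range(n_features)]

real_image = [rng.random() for _ in range(n_pixels)]

hidden = encode(real_image, enc_w)                       # Observation
imagined = generate(hidden, enc_w, noise_scale=0.05, rng=rng)  # Imagination

# The quantity training would try to make small: features observed in the
# imagined image should be close to the features of the real image.
consistency = feature_distance(hidden, encode(imagined, enc_w))
print(consistency)
```

The value of `consistency` is what training would push toward zero, while the added noise keeps the imagined image from being a plain copy of the real one.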


Let’s return to how we humans learn. Suppose that, when we were just learning to talk, we were taught the word “car”. The teacher would show us a picture card of a car and tell us repeatedly that this thing is a “car”. Once we had memorized the picture and the word “car”, something remarkable happened: when we saw a different car, we could still “guess” that it was a car! I attribute this ability to human imagination. From the picture card we learned not only the features of that particular picture but also many abstract features, for example that a car has four wheels and a steering wheel. From these abstract features, our imagination runs free and pictures many different cars, and these imagined cars in turn help us recognize real cars. After observing the real image and the imagined images N times, we form an idea of what a “car” is and also acquire a model that can imagine what a car looks like. As below:


Thinking: When thinking about the features, we need to answer roughly two questions:

Is the picture real or imagined? This is a binary (real or fake) classification problem.

What does the image describe? This is an NLP description problem.
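The two questions can be pictured as two output heads that share the same hidden feature. The sketch below is a simplification under my own assumptions: `think`, `real_head`, and `word_heads` are hypothetical names, the weights are hand-picked rather than learned, and the NLP description is reduced to choosing a single word from a tiny vocabulary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def think(feature, real_head, word_heads, vocabulary):
    """Return (probability the image is real, best word, word probabilities)."""
    # Head 1: real vs. imagined, a binary classification.
    p_real = sigmoid(dot(real_head, feature))
    # Head 2: description, simplified here to picking one word.
    word_probs = softmax([dot(w, feature) for w in word_heads])
    best = max(range(len(vocabulary)), key=lambda i: word_probs[i])
    return p_real, vocabulary[best], word_probs

# Hand-picked toy values, for illustration only.
feature = [1.0, 0.0]
real_head = [2.0, -1.0]
word_heads = [[3.0, 0.0], [0.0, 3.0]]
vocabulary = ["car", "tree"]

p_real, word, word_probs = think(feature, real_head, word_heads, vocabulary)
print(p_real, word)
```

Sharing one feature vector between both heads mirrors the idea that the same observed features serve both judgments.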


Based on this logic, we may be able to build a one-shot or few-shot model. By repeatedly observing and imagining the same picture and thinking about both versions, we may be able to train three models: an encoder, a generator, and a thinker. As below:


Loss design: The loss function is designed mainly around the outputs of the Thinker module. A corresponding distance loss can be defined both for the real/fake judgment and for the NLP description.
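One possible way to combine the two terms is sketched below. The specific choices are my assumptions, not a fixed design: binary cross-entropy for the real/fake output, a mean squared distance between a predicted and a target description vector, and a hypothetical `desc_weight` parameter to balance them.

```python
import math

def binary_cross_entropy(p_real, is_real, eps=1e-12):
    """Penalty for the real/fake output; p_real is the predicted probability of 'real'."""
    p = min(max(p_real, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if is_real else -math.log(1.0 - p)

def squared_distance(predicted, target):
    """Mean squared distance between two description vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def thinker_loss(p_real, is_real, desc_pred, desc_target, desc_weight=1.0):
    """Sum of the real/fake term and the weighted description term."""
    return (binary_cross_entropy(p_real, is_real)
            + desc_weight * squared_distance(desc_pred, desc_target))

# The thinker outputs p_real = 0.9 and the same description in both cases.
loss_real = thinker_loss(0.9, True, [0.8, 0.2], [1.0, 0.0])
loss_fake = thinker_loss(0.9, False, [0.8, 0.2], [1.0, 0.0])
print(loss_real, loss_fake)
```

With the same prediction, the loss is much larger when the image was actually imagined, which is the signal that would train the thinker to tell the two apart.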


Additionally, we can design a prompt encoding module to add a hint to our imagination, thereby better controlling the content generated by the generator.
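A very simple form of such a hint is to encode the prompt as a vector and append it to the hidden feature before it reaches the generator. The sketch below assumes a bag-of-words prompt encoder over a tiny fixed vocabulary; `encode_prompt` and `condition` are hypothetical names, and a real system would more likely use a learned text encoder.

```python
def encode_prompt(prompt, vocabulary):
    """Count how often each vocabulary term appears in the prompt."""
    words = prompt.lower().split()
    return [float(words.count(term)) for term in vocabulary]

def condition(hidden_feature, prompt_vector):
    """Concatenate the hidden feature and the prompt vector as generator input."""
    return list(hidden_feature) + list(prompt_vector)

vocabulary = ["red", "car", "truck"]
prompt_vector = encode_prompt("a red car", vocabulary)

# [0.3, -1.2] stands in for a hidden feature produced by Observation.
generator_input = condition([0.3, -1.2], prompt_vector)
print(generator_input)
```

The generator then sees both what was observed and what was asked for, so changing the prompt changes its input even when the observed image stays the same.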



These are a few of my thoughts on human learning and AI learning.

