Recent years have witnessed astonishing progress in generative image modeling, with neural network-based models able to synthesize increasingly realistic and detailed images. This rapid advancement is quantitatively reflected in the steady decrease of Fréchet Inception Distance (FID) scores over time. FID measures the similarity between generated and real images by comparing feature activations extracted from a pretrained Inception classifier network. Lower FID scores indicate greater similarity to real images and thus higher-quality generations from the model.
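Concretely, FID fits a Gaussian to each set of features and compares the two fits in closed form. Below is a minimal sketch in Python, assuming the Inception features have already been extracted (function and variable names here are illustrative):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """Frechet Inception Distance between two feature matrices
    (one row per image). Fits a Gaussian to each set and computes
    ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 sqrt(C_r C_g))."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerics
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```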
Around 2018, architectural innovations like BigGAN precipitated a substantial leap in generated image fidelity as measured by FID. BigGAN scaled generative adversarial networks (GANs) to much larger batch sizes and model capacities, combining techniques such as class-conditional batch normalization, self-attention, and the truncation trick to stabilize training and generate higher-resolution, more realistic images than prior GANs.
The introduction of BigGAN and related architectures drove FID scores down from around 30 to nearly 10 on common benchmark datasets. Since then, diffusion models have become the predominant approach for further improvements in image generation quality.
Diffusion models are trained to reverse a diffusion process that gradually corrupts real images into noise. By learning to invert this corruption, they can map samples from a simple noise distribution back to the complex distribution of real images. Because the network only needs to model each small denoising step accurately, training is quite stable. The result has been a steady decrease in FID from around 5 to 3 on datasets like CIFAR-10 and ImageNet over the past couple of years.
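To make the training recipe concrete, here is a minimal sketch of a DDPM-style training step, assuming a hypothetical noise-prediction network `model(x_t, t)` and a precomputed noise schedule; exact schedules and parameterizations vary across papers:

```python
import torch

def ddpm_training_step(model, x0, alphas_cumprod):
    """One DDPM-style training step: sample a timestep, corrupt the
    clean image batch x0 toward noise, and regress the injected noise.
    `alphas_cumprod` is the cumulative product of the noise schedule."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)          # per-sample noise level
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps  # forward corruption
    return ((model(x_t, t) - eps) ** 2).mean()          # epsilon-prediction loss
```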
However, while FID is a convenient automatic measure of image quality, it does not necessarily capture all aspects of human perceptual judgments. An alternative evaluation is to directly measure how often human observers are "fooled" into thinking generated images are real. By this metric of human error rate, the current state-of-the-art model is PFGM++, proposed by researchers at MIT. PFGM++ consistently achieves the highest human error rates, meaning it most reliably fools humans into misclassifying its generated images as real.
PFGM++ represents the latest iteration in a line of work developing generative models based on mathematical physics, in particular electrostatics. The core insight underlying these Poisson Flow models is to interpret the data distribution as a charge distribution in space. The electric field arising from this charge distribution can then guide samples from a simple prior, such as a uniform distribution on a large sphere, toward the data distribution. Intuitively, samples follow the electric field lines emitted by the charges until they intersect the data distribution itself.
More precisely, each data point is modeled as a point charge. The collective charge distribution gives rise to an electric potential that satisfies Poisson's equation, with the charge density acting as the source term. While directly solving this partial differential equation is intractable for high-dimensional data like images, only the gradient of the potential, i.e. the electric field, is needed, and this gradient can be approximated by Monte Carlo estimation over batches of data points. The original Poisson Flow model trains a neural network to predict this empirical electric field at sampled query points.
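A minimal sketch of that Monte Carlo estimate, treating a minibatch of data points as unit charges (names are illustrative; real implementations typically also normalize the field and perturb the query points during training):

```python
import torch

def empirical_field(x, data, eps=1e-8):
    """Monte Carlo estimate of the electric field at query points `x`
    sourced by point charges at `data`, following the inverse-power
    law E(x) ~ mean_i (x - y_i) / ||x - y_i||^dim in `dim` dimensions.
    x: (B, dim) queries, data: (N, dim) charges -> (B, dim) field."""
    dim = x.shape[1]
    diff = x[:, None, :] - data[None, :, :]        # (B, N, dim) displacements
    dist = diff.norm(dim=-1, keepdim=True) + eps   # (B, N, 1) distances
    return (diff / dist.pow(dim)).mean(dim=1)      # average over charges
```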
During generation, samples from a uniform prior distribution on a sphere are evolved by following the learned electric field lines via numerical integration of an ordinary differential equation (ODE). As samples move along the field lines, noise is gradually reduced according to a schedule. Eventually samples intersect the data distribution and generation terminates.
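In its simplest form, this amounts to forward-Euler integration of the learned field. A toy sketch (actual PFGM samplers integrate in the augmented space described below, using the extra dimension as the integration variable with adaptive step sizes):

```python
import torch

@torch.no_grad()
def sample(field_net, x_init, n_steps=200, step_size=0.05):
    """Toy Euler integration of dx/dt = E(x): start from the spherical
    prior sample `x_init` and follow the learned field lines toward
    the data distribution. `field_net` is the trained field predictor."""
    x = x_init
    for _ in range(n_steps):
        x = x + step_size * field_net(x)  # one Euler step along the field
    return x
```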
While conceptually appealing, directly applying this idea results in "mode collapse" where samples just end up concentrated around the data mean. The electric field lines all terminate at the center of mass of the charge distribution. To address this, Poisson Flow models augment the data distribution with one extra dimension. Samples now follow electric field lines in this higher dimensional space. By carefully designing the charge distribution, samples traverse the entire data distribution before ending up at the origin in the extra dimension. This enforces diversity and enables defining a smooth projection from the spherical prior to the data distribution.
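In symbols, data in R^N are embedded on the z = 0 hyperplane of the augmented space, and the field is computed there (notation is mine, following the Poisson Flow construction):

```latex
% Data y_i in R^N are embedded at z = 0 in the augmented space R^{N+1}:
\tilde{y}_i = (y_i, 0), \qquad \tilde{x} = (x, z) \in \mathbb{R}^{N+1}

% Empirical field over n charges (inverse-power law in N+1 dimensions):
E(\tilde{x}) \;\propto\; \frac{1}{n} \sum_{i=1}^{n}
  \frac{\tilde{x} - \tilde{y}_i}{\lVert \tilde{x} - \tilde{y}_i \rVert^{\,N+1}}
```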
The original Poisson Flow model was later generalized in PFGM++, which augments the data with an arbitrary number D of extra dimensions rather than just one. This turns the construction into a continuum between rigid electrostatics-based models and diffusion-like models: as D grows large, the model increasingly resembles a diffusion model. Experiments showed that intermediate values of D achieved the best results, balancing training stability against inference robustness.
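The correspondence with diffusion can be sketched more precisely. Under the alignment below (the specific scaling is an assumption drawn from the PFGM++ analysis), the electrostatics-derived perturbation kernel approaches a Gaussian as D grows:

```latex
% Hypothesized alignment between the augmented radius r and a noise level sigma:
r = \sigma \sqrt{D}
\quad\Longrightarrow\quad
p_r(x \mid y) \;\xrightarrow{\;D \to \infty\;}\; \mathcal{N}\!\left(x;\, y,\ \sigma^2 I\right)
```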
Alongside the extra dimensions, PFGM++ also enhances the training procedure. First, the expensive objective of fitting the electric field from large batches of samples is replaced by a more efficient perturbation-based objective in the style of score matching, removing the need for costly simulation of the field lines. Second, the extra dimensions produce a more stable training trajectory in which the model sees a wider range of sample norms than diffusion models do.
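A minimal sketch of the resulting simulation-free "perturb and regress" structure, shown with a plain Gaussian perturbation for simplicity; PFGM++ itself draws perturbations from an electrostatics-derived kernel rather than a Gaussian:

```python
import torch

def perturbation_loss(model, y, sigma):
    """Simulation-free objective: perturb clean data `y` at noise scale
    `sigma` and regress the perturbation direction, instead of fitting
    a field estimated from large batches. Plain denoising score
    matching is shown; PFGM++ samples its perturbations from an
    electrostatics-derived kernel instead of a Gaussian."""
    eps = torch.randn_like(y)
    x = y + sigma * eps              # perturbed sample
    pred = model(x, sigma)           # network predicts the perturbation
    return ((pred - eps) ** 2).mean()
```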
Experiments across datasets like CIFAR-10, FFHQ, and LSUN demonstrate superior image quality from PFGM++ over diffusion baselines such as DDPM. PFGM++ also displays greater robustness when perturbations are introduced into the generation process, whether via added noise or model quantization and compression. The additional dimensions curb the compounding of errors during sampling by widening the distribution of training examples the model has seen.
In summary, physics and electrostatics have provided a fertile source of insights for improving generative modeling of complex data like images. PFGM++ currently produces the most realistic images according to human evaluation. Its training procedure is more data efficient owing to the modified objective function. The inference process is also more stable compared to diffusion-based alternatives, enabled by the expanded sample distribution.
This illustrates the value of exploring diverse sources of inspiration from fields like physics when designing and enhancing neural models for generative tasks. While deep learning provides exceptional function approximation capabilities, injecting inductive biases and structure from scientific domains can clearly confer additional benefits. Physics-guided techniques offer one compelling paradigm, but likely many other fruitful connections remain untapped.
At the same time, key challenges and opportunities for future work remain. Current diffusion models exhibit instabilities and inefficiencies relating to the inference procedure that physics-based approaches only partially solve. Additional improvements to training and sampling efficiency without sacrificing image quality remain an active research direction. Distilling diffusion models into smaller and faster student networks also offers tangible benefits but has proven difficult thus far.
Controllability and predictability of image generation given text or other conditional inputs likewise remain quite poor in existing models. For applications like text-to-image generation, a user must still explore myriad prompts to obtain their desired output. More predictable and fine-grained control would enhance the usability of these models. Recent work has started making progress on this front by better aligning internal model representations with desired attributes to exert precise control over selected outputs.
In parallel, autoregressive models present another rapidly evolving class of generative models with complementary strengths, such as stable scaling to high resolutions. For example, recent work from DeepMind, Anthropic, and others demonstrates megapixel image generation through an autoregressive approach of sequentially predicting pixel values. Such models exhibit different tradeoffs compared to diffusion methods, which excel at parallel sampling. Determining the ideal modeling formalisms and training frameworks to unify the key advantages of each remains an open problem.
Beyond images, diffusion-based and physics-inspired techniques have proven widely applicable to other modalities like text, audio, 3D shapes, and even protein structures. But in many domains, identifying the right inductive biases and architectural backbones to maximize sample quality and training stability remains an active research endeavor. As models scale up and find deployment in real-world settings, additional considerations around safety, ethics, and societal impact rise in prominence as well.
Overall though, the rapid progress in generative modeling over just the past few years signals an exciting future ahead. Models have already crossed an important threshold from mostly producing blurry unrealistic outputs to now generating highly convincing samples across diverse data types. Ongoing innovations spanning training techniques, model architectures, inference algorithms, and evaluative metrics will unlock further revolutionary possibilities in this space. The seeds planted by infusing ideas from physics into generative neural networks exemplify the immense potential still remaining to be tapped.