Odunolaoluwa Shadrack Jenrola

An Introduction to Flows and Diffusion

A stripped-down, beginner-friendly introduction to generative modelling.

This is my very first blog post, which will kick off a series exploring flows and diffusion models as I learn about them.

I'll introduce essential terms and concepts, building our understanding step by step, and I'll try to keep things brief and easy to understand.


Vectors, Distributions and Models

First, we'll explore vectors, distributions, probability distributions, sampling and vector fields.

Vectors

Vectors are quantities that have both magnitude and direction. Mathematically, a vector in Rd is an ordered list of d real numbers:

$$ x = (x_1, x_2, \dots, x_d)^T $$

Now, vectors represent points or directions in space. We can relate them to data we are familiar with, e.g. images, audio, and video, since vectors can represent all of them in a structured manner. A continuous audio waveform can be sampled and represented as a discrete list of numbers whose length grows with the duration of the recording. A grayscale image resized to a square of 28 × 28 pixels can be flattened into a 784-dimensional vector. A video is just a sequence of images with an extra dimension to signify time (think of how a screenshot of a video at any instant in time is just an image).
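To make this concrete, here is a minimal NumPy sketch (using random arrays as stand-ins for a real image and video) of how such data becomes vectors:

```python
import numpy as np

# A stand-in for a 28 x 28 grayscale image: each entry is a pixel intensity.
image = np.random.rand(28, 28)

# Flattening turns the 2D grid into a single 784-dimensional vector.
x = image.reshape(-1)
print(x.shape)  # (784,)

# A video can be represented by stacking frames along an extra time axis,
# e.g. 30 frames of 28 x 28 images:
video = np.random.rand(30, 28, 28)
print(video.shape)  # (30, 28, 28)
```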


Distributions

Distributions describe how vectors are spread across a space. When visualized, a distribution reveals which regions in space are more likely to contain vectors and which are less likely. In other words, distributions assign density values to points in space, showing the probability of finding a vector near each point.

A Probability Distribution is a kind of distribution where the total mass sums to one. Assume that p(x) is a probability density over Rd; then:

$$ \int_{\mathbb{R}^d} p(x) \, dx = 1 $$

Think of a long list of values that add up to 1. A popular example is the Normal distribution (also called the Gaussian distribution). The standard normal distribution is the special case with mean μ = 0 and standard deviation σ = 1.
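As a quick sanity check on the equation above, the sketch below (a rough numerical integration I've added for illustration, truncating the real line to [-10, 10] where essentially all the mass lives) confirms that the standard normal density integrates to about 1:

```python
import numpy as np

# Density of the standard normal distribution (mean 0, std 1).
def p(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Riemann-sum approximation of the integral over [-10, 10]; the mass
# in the tails beyond this interval is negligible.
xs = np.linspace(-10, 10, 100_001)
total_mass = np.sum(p(xs)) * (xs[1] - xs[0])
print(total_mass)  # ~1.0
```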

Normal Distributions, Image embedded from Wikipedia.

Notice how the distribution is densest around the mean μ, with a degree of spread measured by σ. Please take note of this; it’s super important.

Next we’ll explain sampling. Sampling from a probability distribution simply means randomly generating vectors according to the likelihoods specified by that distribution. Remember that the normal distribution is denser around its mean, with the density falling off as we move away from it. So if we sample three numbers from the standard normal distribution 𝒩(0,1), we would get numbers like 0.1, -0.1, and 1.3 far more often than numbers like 10, 11, or -12.

In more formal words, sampling from a probability distribution produces random vectors such that the probability of observing a vector in any given region is given by the integral of p(x) over that region.
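Here’s a small NumPy sketch of this behaviour: drawing many samples from 𝒩(0, 1) and checking how often they land near the mean versus far away from it:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Draw 10,000 samples from the standard normal distribution N(0, 1).
samples = rng.standard_normal(10_000)

# Most samples land near the mean; extreme values are vanishingly rare.
print(np.mean(np.abs(samples) < 1))  # ~0.68 (within one std dev of the mean)
print(np.mean(np.abs(samples) > 4))  # ~0.0  (almost never)
```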

Good, I think we have enough to go on for now.


Generative and Deterministic Models

Let’s talk about what makes a generative model different from a deterministic model. Deterministic models always produce the same output for the same input; there’s no randomness involved. Think of a simple classifier that distinguishes between cats and dogs. A deterministic model will give the same log probability for the exact same cat image every time you input it, just like a regular function. In fact, all deterministic models can be expressed by this familiar relationship:

$$ y = f(x) $$

Generative models are different. Let’s say our goal is to generate a picture of a cat. Think of a probability distribution that describes the vectors representing a large, diverse set of images of cats.

Throughout this series (hopefully) and in most texts, such a distribution is referred to as pdata, or the data distribution.

Notice that pdata is unknown: it is impossible for us to determine the mathematical form this data distribution takes. We cannot exactly write down or evaluate pdata at an arbitrary point.

Now let’s think about sampling from this distribution. What would we get when we sample from pdata, if we had perfect access to it? A picture of a cat, which is exactly what generative models produce. By sampling repeatedly we get different random, realistic images of cats. We denote samples of any data distribution as x ~ pdata. So generating an object simply means sampling from a suitable distribution.

A Generative Model allows us to sample new objects x from the data distribution.

So far we have established that to generate objects we require a data distribution to sample from. How do we get this data distribution?

Let’s say we have access to some other initial distribution that we can sample from. We’ll signify this as pinit. Since we saw earlier that it’s relatively easy to sample from the Gaussian distribution, let’s choose it as our initial distribution. What if we could transform vectors sampled from pinit into vectors that look like they came from pdata?

That question captures exactly the operation we are after, and it is what paradigms like diffusion and flows allow us to do: learn a transformation that maps a simple distribution into a complex, high-dimensional data distribution, such as the one that describes cats.

Diffusion and Flows allow us to learn a transformation that maps simple distributions to more complex ones.

It’s interesting to note that the major criterion for choosing an initial distribution is that it should be easy to sample from. This means that pinit does not necessarily have to be the Gaussian all the time.

Good, we’re making progress. We’re just about ready to dive into flow and diffusion models via their relationships with ordinary differential equations (ODEs) and stochastic differential equations (SDEs), respectively, but first let’s define a few more terms: vector fields and trajectories.


Flows and Diffusion Models

Vector Fields and Trajectories

A vector field in Rd is a function that assigns a vector to each point in space. We’ll consider time-dependent vector fields of the form:

$$ u : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d, \quad (x, t) \mapsto u_t(x) $$

Where x is a point in Rd, t is a time in [0,1], and ut(x) is the vector assigned to the point x at time t.

Think of it as a function that defines what direction you’d go and how fast you’d move if you were a particle in space.

Now, if a vector field tells a particle which direction to go, a trajectory is the path the particle would eventually take.

From Wikipedia: A rotational vector field around a central point.

A trajectory Xt is the path a particle follows through space over time, starting from some initial position x0 and obeying the rules laid down by the vector field. Mathematically, it is any function

$$ X : [0,1] \to \mathbb{R}^d, \quad t \mapsto X_t $$

that satisfies the ordinary differential equation:

$$ \frac{dX_t}{dt} = u_t(X_t), \quad X_0 = x_0 $$

Recall that ut(Xt) is the vector field, and by X0=x0 we mean that the trajectory starts at position x0 at time t=0 and evolves forward in time from there.

If a trajectory defines the movement of a single particle through space over time, a flow is the function that tells us where that particle would be at time t.

The flow ψt of a vector field u is a function that maps each initial point x0 to its corresponding position at time t along the trajectory X that starts at x0. Formally:

$$ \psi_t : \mathbb{R}^d \to \mathbb{R}^d, \quad x_0 \mapsto X_t $$

We can interpret flows in two ways:

  1. For a single starting point x0, the flow ψt(x0) tells us where a particle starting at x0 would be at time t when following the vector field.
  2. If we consider all particles in space, the flow ψt:RdRd represents the collective movement of all of these possible particles. It transforms the entire space according to the vector field’s dynamics over time t. It describes the positions at time t of all particles for all starting points.

So while Xt is a single solution to the ODE above, the flow satisfies, for every time t:

$$ \frac{d\psi_t(x)}{dt} = u_t(\psi_t(x)), \quad \psi_0(x) = x $$

for the same ODE.

The relationship between flows and trajectories can be expressed as:

$$ \psi_t(x_0) = X_t \quad \text{where} \quad X_0 = x_0 $$

You see how it tells us what’s happening at time t? It’ll make some more sense in a second.

It’s helpful to note that the ODE needs to have a unique solution for the flow to be well defined; this is guaranteed when the vector field is sufficiently well behaved (for example, Lipschitz continuous in x).
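To ground these definitions, here’s a toy sketch (an illustrative example of my own, not from any particular model) using the simple vector field ut(x) = -x, whose flow happens to be known in closed form: ψt(x) = e^(-t) x. Numerically simulating the trajectory should land close to the exact flow:

```python
import numpy as np

# A simple time-independent vector field: u_t(x) = -x.
def u(x, t):
    return -x

# Its flow is known in closed form: psi_t(x) = exp(-t) * x.
def exact_flow(x0, t):
    return np.exp(-t) * x0

# Simulate the trajectory step by step (a preview of the Euler method,
# discussed later in this post).
x0 = np.array([2.0, -1.0])
n_steps = 100
h = 1.0 / n_steps
x, t = x0.copy(), 0.0
for _ in range(n_steps):
    x = x + h * u(x, t)
    t = t + h

print(x)                    # Euler estimate of X_1
print(exact_flow(x0, 1.0))  # psi_1(x0) = exp(-1) * x0, very close
```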


Flows

Recall that for generative modelling we want to learn a transformation that converts a simple initial distribution pinit to a more complex target distribution pdata. This transformation can be realized through a flow defined by a vector field.

Assume we sample the initial position variable X0 from the initial distribution pinit

$$ X_0 \sim p_{init} $$

Then, provided we follow the right vector field, the trajectory of this particle ends at X1, which resembles a sample from the data distribution:

$$ X_1 \sim p_{data} $$

Now we can see something interesting. Recall from our definition of flows (point 2) that a flow ψt(x) describes the positions at time t of all particles moving according to the rules of the vector field ut(x). If we move each sample along the flow for time t=1, starting from the initial position above, then

$$ X_1 = \psi_1(x), \quad X_0 = x $$

We just got pdata, exactly what we were looking for. But to arrive at this distribution we had to make sure we used the right vector field. By parameterizing the vector field as utθ(x), we can learn it by training a neural network.

The flow from the ODE defined by the learned vector field is then used to carry pinit to pdata.

Let’s rewrite the ODE again with this in mind.

$$ \frac{dX_t}{dt} = u_t^\theta(X_t), \quad X_0 \sim p_{init} $$

In practice we cannot compute the flow ψtθ analytically, so we have to simulate the ODE numerically. Recall the different numerical methods for solving ordinary differential equations: the Euler method (the simplest), the midpoint method, Runge-Kutta methods, etc. Each of these methods comes with trade-offs in speed, accuracy, and required compute.
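As an illustration of what "parameterizing the vector field" might look like in code, here is a minimal sketch assuming PyTorch; the tiny MLP, its layer sizes, and the name VectorField are all illustrative choices of mine, not a prescribed architecture (real systems use much larger networks such as U-Nets or transformers):

```python
import torch
import torch.nn as nn

# A toy parameterization of u_t^theta(x): a small MLP that takes the
# current position x and the time t as input and outputs a vector in R^d.
class VectorField(nn.Module):
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden),  # +1 input feature for the time t
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Concatenate time onto each sample so the field is time-dependent.
        return self.net(torch.cat([x, t], dim=-1))

model = VectorField(dim=784)
x = torch.randn(16, 784)  # a batch of 16 positions
t = torch.rand(16, 1)     # one time value per sample
v = model(x, t)           # the predicted vectors u_t(x)
print(v.shape)            # torch.Size([16, 784])
```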


SDEs and Stochastic Processes

Now let’s extend all of this to the concept of diffusion. We mentioned before that flow and diffusion models correspond to ODEs and SDEs respectively.

A stochastic differential equation for a trajectory Xt in Rd is written as

$$ dX_t = u_t(X_t) \, dt + \sigma_t \, dW_t, \quad X_0 = x_0 $$

Where:

  1. ut(Xt) is the drift term, the same kind of vector field we met with ODEs.
  2. σt is the diffusion coefficient that scales the noise.
  3. Wt is Brownian motion, the source of the randomness.

For SDEs, trajectories Xt are stochastic rather than deterministic. That means each path is random, not a single fixed curve. Such random paths are called stochastic trajectories or stochastic processes.

σt can be a scalar or a matrix multiplying the Brownian motion term. It controls the intensity of the random fluctuations.

Brownian motion introduces true stochasticity and can be thought of as the random motion of particles suspended in a medium. It is also called the Wiener process, hence the W. It is characterised by the following properties:

  1. W0 = 0; its path starts at 0.
  2. W has independent increments.
  3. W has Gaussian increments: Wt - Ws ~ 𝒩(0, (t - s)I) for s < t.
  4. W almost surely has continuous paths.

Image from Wikipedia.
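To make these properties concrete, here is a minimal NumPy sketch (an illustrative simulation, with the step count chosen arbitrarily) that builds a Brownian path from independent Gaussian increments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate Brownian motion on [0, 1] with n small steps: W_0 = 0, and each
# increment W_{t+h} - W_t is an independent Gaussian with variance h.
n = 1000
h = 1.0 / n
increments = np.sqrt(h) * rng.standard_normal(n)
W = np.concatenate([[0.0], np.cumsum(increments)])  # the path starts at 0

print(W[0])   # 0.0 (property 1)
print(W[-1])  # W_1, a single sample from N(0, 1)
```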

SDEs add stochastic dynamics to ODEs because real-world processes are not perfectly smooth. Note that if the diffusion coefficient σt=0, the stochastic differential equation reverts to an ODE.

Unlike ODEs, which have unique flows, SDEs produce random trajectories, so there is no concept of a flow. Recall, though, that we never had to compute the flow explicitly when we talked about flow models; all we needed was the vector field the neural network learnt.

Now let’s define diffusion models just like we did for flow models. Recall that our goal is to find some way to convert a simple initial distribution pinit to a more complex target distribution pdata.

Let’s write down the SDE again, but this time we parameterize ut(x).

$$ dX_t = u_t^\theta(X_t) \, dt + \sigma_t \, dW_t, \quad X_0 \sim p_{init} $$

By training a neural network to learn the vector field utθ(x), we can sample by initializing X0~pinit and simulating the SDE.

The result will approximately follow pdata.


Sampling Methods

Finally, let’s explain in pseudo-code how we sample from flow and diffusion models by simulating ODEs and SDEs respectively. Note that sampling here means simulating a differential equation.

For ordinary differential equations, one good method for sampling from flow models is the Euler method. Below are the steps, followed by a short code sketch:

Given a learned vector field utθ:

  1. Choose a step count n.

  2. Set the step size h=1/n

  3. Draw X0~pinit

  4. Set t=0

  5. In a loop over i = 0, 1, ..., n-1:

    1. Compute v=utθ(Xi)

    2. Update the state:

      $$ X_{i+1} = X_i + hv $$

    3. Increment time: t=t+h

  6. Output Xn ≈ ψ1θ(X0); this should approximately follow pdata.

    At the end of the loop, if all goes according to plan,

    $$ X_1 = X_n \sim p_{data} $$
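A minimal NumPy sketch of these steps might look like the following (the vector field here is a toy stand-in for a trained network):

```python
import numpy as np

def euler_sample(u_theta, x0, n_steps):
    """Simulate dX_t/dt = u_theta(x, t) from t = 0 to t = 1 (Euler method)."""
    h = 1.0 / n_steps              # step size
    x = np.asarray(x0, dtype=float)
    t = 0.0
    for _ in range(n_steps):
        v = u_theta(x, t)          # evaluate the vector field at (X_i, t)
        x = x + h * v              # update: X_{i+1} = X_i + h v
        t = t + h                  # increment time
    return x                       # X_n, approximately psi_1(x0)

# Toy stand-in for a trained network, just for demonstration.
u_toy = lambda x, t: -x
x0 = np.random.default_rng(0).standard_normal(2)  # X_0 ~ p_init (Gaussian)
print(euler_sample(u_toy, x0, n_steps=100))
```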

For stochastic differential equations, the Euler-Maruyama method is usually used to sample. The steps are similar, again followed by a code sketch:

  1. Choose step count n.
  2. Set the step size h=1/n
  3. Draw X0~pinit
  4. Set t=0
  5. In a loop over i = 0, 1, ..., n-1:
    1. Compute the drift v=utθ(Xi)

    2. Draw noise ξi~𝒩(0,I)

    3. Update the state:

      $$ X_{i+1}=X_i + hv + \sigma_t \sqrt{h}\xi_i $$

    4. Increment time: t=t+h

  6. Output Xn, which should approximately follow pdata after the diffusion process.

Again, if all goes according to plan, at the end of the loop we should get:

$$ X_1 = X_n \sim p_{data} $$
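And a matching sketch of Euler-Maruyama, again with toy stand-ins for the learned drift and the noise schedule:

```python
import numpy as np

def euler_maruyama_sample(u_theta, sigma, x0, n_steps, seed=0):
    """Simulate dX_t = u_theta(x, t) dt + sigma(t) dW_t from t = 0 to t = 1."""
    rng = np.random.default_rng(seed)
    h = 1.0 / n_steps
    x = np.asarray(x0, dtype=float)
    t = 0.0
    for _ in range(n_steps):
        v = u_theta(x, t)                  # drift term
        xi = rng.standard_normal(x.shape)  # noise ~ N(0, I)
        x = x + h * v + sigma(t) * np.sqrt(h) * xi
        t = t + h
    return x                               # X_n, a random approximate sample

# Toy stand-ins for the learned drift and the diffusion coefficient.
u_toy = lambda x, t: -x
sigma = lambda t: 0.5
x0 = np.random.default_rng(1).standard_normal(2)  # X_0 ~ p_init
print(euler_maruyama_sample(u_toy, sigma, x0, n_steps=100))
```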

Note: In subsequent articles of this series, we’ll place more emphasis on flow models, although diffusion models will occasionally appear as well.

References

  1. MIT 6.S184: Introduction to Flow Matching and Diffusion Models. MIT computer science class by Peter Holderrieth and Ezra Erives. Lecture videos available on YouTube.
  2. Lipman, Yaron, et al. Flow Matching Guide and Code. arXiv preprint arXiv:2412.06264, 2024.
  3. Lipman, Yaron, et al. Flow Matching for Generative Modeling. arXiv preprint arXiv:2210.02747, 2022.