Odunolaoluwa Shadrack Jenrola

An Introduction to Flows and Diffusion

A stripped-down, beginner-friendly introduction to generative modelling.

This is my very first blog post, which will kick off a series exploring flows and diffusion models as I learn about them.

I'll introduce essential terms and concepts, building our understanding step by step, and I'll try to keep things brief and easy to understand.


Vectors, Distributions and Models

First, we'll explore vectors, distributions, probability distributions, sampling and vector fields.

Vectors

Vectors are quantities that have both magnitude and direction. Mathematically, a vector in Rd is an ordered list of d real numbers:

$$ x = (x_1, x_2, \dots, x_d)^T $$

Now, vectors represent points or directions in space. We can relate them to data we are familiar with, e.g. images, audio, and video, since vectors can represent all of them in a structured manner. A continuous audio waveform can be sampled and represented as a discrete list of numbers whose length grows with the duration of the recording. A grayscale image resized to a square of 28 × 28 pixels can be flattened into a 784-dimensional vector. A video is just a sequence of images with an extra dimension to signify time (think of how a screenshot of a video at any instant in time is just an image).
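To make this concrete, here is a minimal NumPy sketch (using random arrays as stand-ins for a real image and video) of how such data becomes vectors:

```python
import numpy as np

# A stand-in for a 28 x 28 grayscale image: each entry is a pixel intensity.
image = np.random.rand(28, 28)

# Flattening turns the 2D grid into a single 784-dimensional vector.
x = image.reshape(-1)
print(x.shape)  # (784,)

# A video can be represented by stacking frames along an extra time axis,
# e.g. 30 frames of 28 x 28 images:
video = np.random.rand(30, 28, 28)
print(video.shape)  # (30, 28, 28)
```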


Distributions

Distributions describe how vectors are spread across a space. When visualized, a distribution reveals which regions in space are more likely to contain vectors and which are less likely. In other words, distributions assign density values to points in space, showing the probability of finding a vector near each point.

A Probability Distribution is a kind of distribution where the total mass sums to one. Assume that p(x) is a probability density over Rd; then:

$$ \int_{\mathbb{R}^d} p(x) \, dx = 1 $$

Think of a long list of values that add up to 1. A popular example is the Normal distribution (also called the Gaussian distribution). The standard normal distribution is the special case with mean μ = 0 and standard deviation σ = 1.
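As a quick sanity check on the equation above, the sketch below (a rough numerical integration I've added for illustration, truncating the real line to [-10, 10] where essentially all the mass lives) confirms that the standard normal density integrates to about 1:

```python
import numpy as np

# Density of the standard normal distribution (mean 0, std 1).
def p(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Riemann-sum approximation of the integral over [-10, 10]; the mass
# in the tails beyond this interval is negligible.
xs = np.linspace(-10, 10, 100_001)
total_mass = np.sum(p(xs)) * (xs[1] - xs[0])
print(total_mass)  # ~1.0
```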

Normal Distributions, Image embedded from Wikipedia.

Notice how the distribution is densest around the mean μ, with a degree of spread measured by σ. Please take note of this; it’s super important.

Next we’ll explain sampling. Sampling from a probability distribution simply means randomly generating vectors according to the likelihoods specified by that distribution. Remember that the normal distribution is denser around its mean, with the density falling off as we move away from it. So if we sample three numbers from the standard normal distribution 𝒩(0,1), we would get numbers like 0.1, -0.1, and 1.3 far more often than numbers like 10, 11, or -12.

In more formal words, sampling from a probability distribution produces random vectors such that the probability of observing a vector in any given region is given by the integral of p(x) over that region.
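Here’s a small NumPy sketch of this behaviour: drawing many samples from 𝒩(0, 1) and checking how often they land near the mean versus far away from it:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Draw 10,000 samples from the standard normal distribution N(0, 1).
samples = rng.standard_normal(10_000)

# Most samples land near the mean; extreme values are vanishingly rare.
print(np.mean(np.abs(samples) < 1))  # ~0.68 (within one std dev of the mean)
print(np.mean(np.abs(samples) > 4))  # ~0.0  (almost never)
```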

Good, I think we have enough to go on for now.


Generative and Deterministic Models

Let’s talk about what makes a generative model different from a deterministic model. Deterministic models always produce the same output for the same input; there’s no randomness involved. Think of a simple classifier that distinguishes between cats and dogs. A deterministic model will give the same log probability for the exact same cat image every time you input it, just like a regular function. In fact, all deterministic models can be expressed by this familiar relationship:

$$ y = f(x) $$

Generative models are different. Let’s say our goal is to generate a picture of a cat. Think of a probability distribution that describes the vectors representing a large, diverse set of images of cats.

Throughout this series (hopefully) and in most texts, such a distribution is referred to as pdata, or the data distribution.

Notice that pdata is unknown: it is impossible for us to determine the mathematical form this data distribution takes. We cannot exactly write down or evaluate pdata at an arbitrary point.

Now let’s think about sampling from this distribution. What would we get when we sample from pdata, if we had perfect access to it? A picture of a cat, which is exactly what generative models produce. By sampling repeatedly we get different random, realistic images of cats. We denote samples of any data distribution as x ~ pdata. So generating an object simply means sampling from a suitable distribution.

A Generative Model allows us to sample new objects x from the data distribution.

So far we have established that to generate objects we require a data distribution to sample from. How do we get this data distribution?

Let’s say we have access to some other initial distribution that we can sample from. We’ll signify this as pinit. Since we saw earlier that it’s relatively easy to sample from the Gaussian distribution, let’s choose it as our initial distribution. What if we could transform vectors sampled from pinit into vectors that look like they came from pdata?

That question captures exactly the operation we are after, and it is what paradigms like diffusion and flows allow us to do: learn a transformation that maps a simple distribution into a complex, high-dimensional data distribution, such as the one that describes cats.

Diffusion and Flows allow us to learn a transformation that maps simple distributions to more complex ones.

It’s interesting to note that the major criterion for choosing an initial distribution is that it should be easy to sample from. This means that pinit does not necessarily have to be the Gaussian all the time.

Good, we’re making progress. We’re just about ready to dive into flow and diffusion models via their relationships with ordinary differential equations (ODEs) and stochastic differential equations (SDEs), respectively, but first let’s define a few more terms: vector fields and trajectories.


Flows and Diffusion Models

Vector Fields and Trajectories

A vector field in Rd is a function that assigns a vector to each point in space. We’ll consider time-dependent vector fields of the form:

$$ u : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d, \quad (x, t) \mapsto u_t(x) $$

Where x is a point in Rd, t is a time in [0,1], and ut(x) is the vector assigned to the point x at time t.

Think of it as a function that defines what direction you’d go and how fast you’d move if you were a particle in space.

Now, if a vector field tells a particle which direction to go, a trajectory is the path the particle would eventually take.

From Wikipedia: A rotational vector field around a central point.

A trajectory Xt is the path a particle follows through space over time, starting from some initial position x0 and obeying the rules laid down by the vector field. Mathematically, it is any function

$$ X : [0,1] \to \mathbb{R}^d, \quad t \mapsto X_t $$

that satisfies the ordinary differential equation:

$$ \frac{dX_t}{dt} = u_t(X_t), \quad X_0 = x_0 $$

Recall that ut(Xt) is the vector field, and by X0=x0 we mean that the trajectory starts at position x0 at time t=0 and evolves forward in time from there.

If a trajectory defines the movement of a single particle through space over time, a flow is the function that tells us where that particle would be at time t.

The flow ψt of a vector field u is a function that maps each initial point x0 to its corresponding position at time t along the trajectory X that starts at x0. Formally:

$$ \psi_t : \mathbb{R}^d \to \mathbb{R}^d, \quad x_0 \mapsto X_t $$

We can interpret flows in two ways:

  1. For a single starting point x0, the flow ψt(x0) tells us where a particle starting at x0 would be at time t when following the vector field.
  2. If we consider all particles in space, the flow ψt:RdRd represents the collective movement of all of these possible particles. It transforms the entire space according to the vector field’s dynamics over time t. It describes the positions at time t of all particles for all starting points.

So while Xt is a single solution to the ODE above, the flow satisfies, for every time t:

$$ \frac{d\psi_t(x)}{dt} = u_t(\psi_t(x)), \quad \psi_0(x) = x $$

for the same ODE.

The relationship between flows and trajectories can be expressed as:

$$ \psi_t(x_0) = X_t \quad \text{where} \quad X_0 = x_0 $$

You see how it tells us what’s happening at time t? It’ll make some more sense in a second.

It’s helpful to note that the ODE needs to have a unique solution for the flow to be well defined; this is guaranteed when the vector field is sufficiently well behaved (for example, Lipschitz continuous in x).
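To ground these definitions, here’s a toy sketch (an illustrative example of my own, not from any particular model) using the simple vector field ut(x) = -x, whose flow happens to be known in closed form: ψt(x) = e^(-t) x. Numerically simulating the trajectory should land close to the exact flow:

```python
import numpy as np

# A simple time-independent vector field: u_t(x) = -x.
def u(x, t):
    return -x

# Its flow is known in closed form: psi_t(x) = exp(-t) * x.
def exact_flow(x0, t):
    return np.exp(-t) * x0

# Simulate the trajectory step by step (a preview of the Euler method,
# discussed later in this post).
x0 = np.array([2.0, -1.0])
n_steps = 100
h = 1.0 / n_steps
x, t = x0.copy(), 0.0
for _ in range(n_steps):
    x = x + h * u(x, t)
    t = t + h

print(x)                    # Euler estimate of X_1
print(exact_flow(x0, 1.0))  # psi_1(x0) = exp(-1) * x0, very close
```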


Flows

Recall that for generative modelling we want to learn a transformation that converts a simple initial distribution pinit to a more complex target distribution pdata. This transformation can be realized through a flow defined by a vector field.

Assume we sample the initial position variable X0 from the initial distribution pinit

$$ X_0 \sim p_{init} $$

Then, provided we follow the right vector field, the trajectory of this particle ends at X1, which resembles a sample from the data distribution:

$$ X_1 \sim p_{data} $$

Now we can see something interesting. Recall from our definition of flows (point 2) that a flow ψt(x) describes the positions at time t of all particles moving according to the rules of the vector field ut(x). If we move each sample along the flow for time t=1, starting from the initial position above, then

$$ X_1 = \psi_1(x), \quad X_0 = x $$

We just got pdata, exactly what we were looking for. But to arrive at this distribution we had to make sure we used the right vector field. By parameterizing the vector field as utθ(x), we can learn it by training a neural network.

The flow from the ODE defined by the learned vector field is then used to carry pinit to pdata.

Let’s rewrite the ODE again with this in mind.

$$ \frac{dX_t}{dt} = u_t^\theta(X_t), \quad X_0 \sim p_{init} $$

In practice we cannot compute the flow ψtθ analytically, so we have to simulate the ODE numerically. Recall the different numerical methods for solving ordinary differential equations: the Euler method (the simplest), the midpoint method, Runge-Kutta methods, etc. Each of these methods comes with trade-offs in speed, accuracy, and required compute.
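As an illustration of what "parameterizing the vector field" might look like in code, here is a minimal sketch assuming PyTorch; the tiny MLP, its layer sizes, and the name VectorField are all illustrative choices of mine, not a prescribed architecture (real systems use much larger networks such as U-Nets or transformers):

```python
import torch
import torch.nn as nn

# A toy parameterization of u_t^theta(x): a small MLP that takes the
# current position x and the time t as input and outputs a vector in R^d.
class VectorField(nn.Module):
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden),  # +1 input feature for the time t
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Concatenate time onto each sample so the field is time-dependent.
        return self.net(torch.cat([x, t], dim=-1))

model = VectorField(dim=784)
x = torch.randn(16, 784)  # a batch of 16 positions
t = torch.rand(16, 1)     # one time value per sample
v = model(x, t)           # the predicted vectors u_t(x)
print(v.shape)            # torch.Size([16, 784])
```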


SDEs and Stochastic Processes

Now let’s extend all of this to the concept of diffusion. We mentioned before that flow and diffusion models correspond to ODEs and SDEs respectively.

A stochastic differential equation for a trajectory Xt in Rd is written as

$$ dX_t = u_t(X_t) \, dt + \sigma_t \, dW_t, \quad X_0 = x_0 $$

Where:

  1. ut(Xt) is the drift term, the same kind of vector field we met with ODEs.
  2. σt is the diffusion coefficient that scales the noise.
  3. Wt is Brownian motion, the source of the randomness.

For SDEs, trajectories Xt are stochastic rather than deterministic. That means each path is random, not a single fixed curve. Such random paths are called stochastic trajectories or stochastic processes.

σt can be a scalar or a matrix multiplying the Brownian motion term. It controls the intensity of the random fluctuations.

Brownian motion introduces true stochasticity and can be thought of as the random motion of particles suspended in a medium. It is also called the Wiener process, hence the W. It is characterised by the following properties:

  1. W0 = 0; its path starts at 0.
  2. W has independent increments.
  3. W has Gaussian increments: Wt - Ws ~ 𝒩(0, (t - s)I) for s < t.
  4. W almost surely has continuous paths.

Image from Wikipedia.
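To make these properties concrete, here is a minimal NumPy sketch (an illustrative simulation, with the step count chosen arbitrarily) that builds a Brownian path from independent Gaussian increments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate Brownian motion on [0, 1] with n small steps: W_0 = 0, and each
# increment W_{t+h} - W_t is an independent Gaussian with variance h.
n = 1000
h = 1.0 / n
increments = np.sqrt(h) * rng.standard_normal(n)
W = np.concatenate([[0.0], np.cumsum(increments)])  # the path starts at 0

print(W[0])   # 0.0 (property 1)
print(W[-1])  # W_1, a single sample from N(0, 1)
```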

SDEs add stochastic dynamics to ODEs because real-world processes are not perfectly smooth. Note that if the diffusion coefficient σt=0, the stochastic differential equation reverts to an ODE.

Unlike ODEs, which have unique flows, SDEs produce random trajectories, so there is no concept of a flow. Recall, though, that we never had to compute the flow explicitly when we talked about flow models; all we needed was the vector field the neural network learnt.

Now let’s define diffusion models just like we did for flow models. Recall that our goal is to find some way to convert a simple initial distribution pinit to a more complex target distribution pdata.

Let’s write down the SDE again, but this time we parameterize ut(x).

$$ dX_t = u_t^\theta(X_t) \, dt + \sigma_t \, dW_t, \quad X_0 \sim p_{init} $$

By training a neural network to learn the vector field utθ(x), we can sample by initializing X0~pinit and simulating the SDE.

The result will approximately follow pdata.


Sampling Methods

Finally, let’s explain in pseudo-code how we sample from flow and diffusion models by simulating ODEs and SDEs respectively. Note that sampling here means simulating a differential equation.

For ordinary differential equations, one good method for sampling from flow models is the Euler method. Below are the steps, followed by a short code sketch:

Given a learned vector field utθ:

  1. Choose a step count n.

  2. Set the step size h=1/n

  3. Draw X0~pinit

  4. Set t=0

  5. In a loop over i = 0, 1, ..., n-1:

    1. Compute v=utθ(Xi)

    2. Update the state:

      $$ X_{i+1} = X_i + hv $$

    3. Increment time: t=t+h

  6. Output Xn ≈ ψ1θ(X0); this should approximately follow pdata.

    At the end of the loop, if all goes according to plan,

    $$ X_1 = X_n \sim p_{data} $$
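A minimal NumPy sketch of these steps might look like the following (the vector field here is a toy stand-in for a trained network):

```python
import numpy as np

def euler_sample(u_theta, x0, n_steps):
    """Simulate dX_t/dt = u_theta(x, t) from t = 0 to t = 1 (Euler method)."""
    h = 1.0 / n_steps              # step size
    x = np.asarray(x0, dtype=float)
    t = 0.0
    for _ in range(n_steps):
        v = u_theta(x, t)          # evaluate the vector field at (X_i, t)
        x = x + h * v              # update: X_{i+1} = X_i + h v
        t = t + h                  # increment time
    return x                       # X_n, approximately psi_1(x0)

# Toy stand-in for a trained network, just for demonstration.
u_toy = lambda x, t: -x
x0 = np.random.default_rng(0).standard_normal(2)  # X_0 ~ p_init (Gaussian)
print(euler_sample(u_toy, x0, n_steps=100))
```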

For stochastic differential equations, the Euler-Maruyama method is usually used to sample. The steps are similar, again followed by a code sketch:

  1. Choose step count n.
  2. Set the step size h=1/n
  3. Draw X0~pinit
  4. Set t=0
  5. In a loop over i = 0, 1, ..., n-1:
    1. Compute the drift v=utθ(Xi)

    2. Draw noise ξi~𝒩(0,I)

    3. Update the state:

      $$ X_{i+1}=X_i + hv + \sigma_t \sqrt{h}\xi_i $$

    4. Increment time: t=t+h

  6. Output Xn, which should approximately follow pdata after the diffusion process.

Again, if all goes according to plan, at the end of the loop we should get:

$$ X_1 = X_n \sim p_{data} $$
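And a matching sketch of Euler-Maruyama, again with toy stand-ins for the learned drift and the noise schedule:

```python
import numpy as np

def euler_maruyama_sample(u_theta, sigma, x0, n_steps, seed=0):
    """Simulate dX_t = u_theta(x, t) dt + sigma(t) dW_t from t = 0 to t = 1."""
    rng = np.random.default_rng(seed)
    h = 1.0 / n_steps
    x = np.asarray(x0, dtype=float)
    t = 0.0
    for _ in range(n_steps):
        v = u_theta(x, t)                  # drift term
        xi = rng.standard_normal(x.shape)  # noise ~ N(0, I)
        x = x + h * v + sigma(t) * np.sqrt(h) * xi
        t = t + h
    return x                               # X_n, a random approximate sample

# Toy stand-ins for the learned drift and the diffusion coefficient.
u_toy = lambda x, t: -x
sigma = lambda t: 0.5
x0 = np.random.default_rng(1).standard_normal(2)  # X_0 ~ p_init
print(euler_maruyama_sample(u_toy, sigma, x0, n_steps=100))
```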

Note: In subsequent articles of this series, we’ll place more emphasis on flow models, although diffusion models will occasionally appear as well.

References

  1. MIT 6.S184: Introduction to Flow Matching and Diffusion Models. MIT computer science class by Peter Holderrieth and Ezra Erives. Lecture videos available on YouTube.
  2. Lipman, Yaron, et al. Flow Matching Guide and Code. arXiv preprint arXiv:2412.06264, 2024.
  3. Lipman, Yaron, et al. Flow Matching for Generative Modeling. arXiv preprint arXiv:2210.02747, 2022.