Making a computer make images isn’t easy at the best of times, but 3D graphics are a substantial step up in difficulty. Balancing speed with quality is a challenge, especially when the objects we want to render are complex and detailed.

The leading realtime rendering method (i.e. for video games) is **raster graphics**, which stores 3D objects as a collection of triangles, then projects those triangles onto a 2D screen. We can use the relative depths of triangles to cull any obscured triangles, greatly speeding up rendering time, and we can use the normal vectors of the triangles to shade the resulting image.

However, effects like soft shadows, reflections, and antialiasing (smoothing jagged edges) are prohibitively difficult to implement with raster graphics, and the complexity of the objects we can render is limited by the number of triangles that make them up.

An alternative to raster graphics is **ray tracing**, which more closely matches our physical world. We emit “light” rays from the camera and compute their intersection with the entire scene. We can then continue with the reflected ray, intersecting and bouncing it until it reaches a light source. In effect, we simulate only the light rays that eventually reach the camera.

Unfortunately, computing those intersections is extraordinarily taxing, and although fancy effects become natural to express, they remain expensive to compute. Outside of limited use in modern video games on cutting-edge hardware, ray tracing is still constrained to non-realtime rendering (e.g. 3D movies).

How can we get the effects of ray tracing without having to compute those intersections?

We could avoid having to calculate the geometry of an object if we instead knew a **distance estimator**, or DE, for it: a function that takes in a point in $\mathbb{R}^3$ and outputs a lower bound on the distance to the surface of the object (ideally the exact minimum distance), with negative outputs indicating the point is inside.

Some examples:

- Sphere of radius $r$ at the origin: $d(p) = |p| - r$
- Plane $z = 0$: $d(p) = |p_z|$
- Cube of side length $2r$ at the origin: $d(p) = \max\{|p_x| - r, |p_y| - r, |p_z| - r\}$
- Union of objects with DEs $d_1$ and $d_2$: $d(p) = \min\{d_1(p), d_2(p)\}$
- Intersection of objects with DEs $d_1$ and $d_2$: $d(p) = \max\{d_1(p), d_2(p)\}$
- Set difference of objects with DEs $d_1$ and $d_2$: $d(p) = \max\{d_1(p), -d_2(p)\}$
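These formulas translate directly into code. A minimal Python sketch (the function names are my own, and a real renderer would evaluate these on the GPU):

```python
import math

def de_sphere(p, r):
    # Sphere of radius r at the origin.
    return math.sqrt(p[0]**2 + p[1]**2 + p[2]**2) - r

def de_plane(p):
    # The plane z = 0.
    return abs(p[2])

def de_cube(p, r):
    # Cube of side length 2r at the origin (exact inside, a lower bound outside).
    return max(abs(p[0]) - r, abs(p[1]) - r, abs(p[2]) - r)

def union(d1, d2):
    return lambda p: min(d1(p), d2(p))

def intersection(d1, d2):
    return lambda p: max(d1(p), d2(p))

def difference(d1, d2):
    # Points inside the first object but outside the second.
    return lambda p: max(d1(p), -d2(p))
```

The combinators compose: `difference(big_sphere, small_sphere)` hollows out a shell, for instance.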

Given a camera position $c \in \mathbb{R}^3$ and a unit vector $v \in T_c\mathbb{R}^3$, we can “march” a ray from $c$ in the direction of $v$.

Since there are no objects within $d(c)$ of the camera, we can move to $p_1 = c + d(c)v$ without marching inside an object.

We can then march safely to $p_1 + d(p_1)v$, and repeat until the DE falls below some threshold $\varepsilon$ (a hit) or the total distance exceeds some clip distance (a miss) — lower values of $\varepsilon$ yield a more accurate image at the cost of more marching steps.
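This stepping scheme is often called *sphere tracing*. A sketch of the loop, with illustrative parameter choices:

```python
def march(de, origin, direction, eps=1e-4, clip=100.0, max_steps=128):
    # Sphere tracing: step by the DE until we hit (DE < eps) or escape (t > clip).
    t = 0.0
    for _ in range(max_steps):
        p = tuple(origin[i] + t * direction[i] for i in range(3))
        d = de(p)
        if d < eps:
            return t  # hit: distance traveled along the ray
        t += d        # safe: no surface within d of p
        if t > clip:
            break
    return None  # miss
```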

Repeating this process for every pixel in a grid, using a different direction for each, we can render an image by coloring pixels based on whether they reach the object.
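As a self-contained toy, the whole pipeline fits in a few dozen lines: one ray per character, `#` for a hit and `.` for a miss (the scene and camera here are arbitrary choices for illustration):

```python
import math

def render_ascii(width=40, height=20):
    # Toy scene: unit sphere at the origin, camera at (0, 0, -3) looking down +z.
    de = lambda p: math.sqrt(p[0]**2 + p[1]**2 + p[2]**2) - 1.0
    rows = []
    for j in range(height):
        row = ""
        for i in range(width):
            # Map the pixel to a ray direction through a screen one unit ahead.
            x = (i + 0.5) / width * 2 - 1
            y = 1 - (j + 0.5) / height * 2
            norm = math.sqrt(x * x + y * y + 1)
            d = (x / norm, y / norm, 1 / norm)
            t, hit = 0.0, False
            for _ in range(64):
                dist = de((t * d[0], t * d[1], -3.0 + t * d[2]))
                if dist < 1e-3:   # close enough: call it a hit
                    hit = True
                    break
                t += dist         # safe to step by the DE
                if t > 10.0:      # clip distance: give up
                    break
            row += "#" if hit else "."
        rows.append(row)
    return "\n".join(rows)
```

Calling `print(render_ascii())` draws a filled circle of `#` characters against a field of dots.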

What else can we do with DEs? Everything might look like a nail to our hammer, but this world might genuinely be made of nails.

Once we hit an object, we can find the normal vector to the surface by computing the gradient of the DE numerically. By taking the normal vector’s dot product with a vector pointing to a light source, we can approximate shading.
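A sketch of both steps — central differences for the gradient, then a clamped dot product (assuming `light_dir` is a unit vector pointing toward the light):

```python
import math

def normal(de, p, h=1e-4):
    # Numerical gradient of the DE by central differences; at the surface,
    # the normalized gradient is the outward surface normal.
    g = []
    for i in range(3):
        q_hi = list(p); q_hi[i] += h
        q_lo = list(p); q_lo[i] -= h
        g.append((de(q_hi) - de(q_lo)) / (2 * h))
    n = math.sqrt(sum(c * c for c in g))
    return tuple(c / n for c in g)

def lambert(n, light_dir):
    # Diffuse shading: clamp the dot product with the unit light direction.
    return max(0.0, sum(a * b for a, b in zip(n, light_dir)))
```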

We can also slightly darken the color based on the number of marches taken to simulate ambient occlusion, and by blending the final color with the sky color based on distance from the camera, we can simulate fog.
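Both effects are a line or two each; the constants below are arbitrary knobs rather than canonical values:

```python
import math

def ambient_occlusion(steps, max_steps=128, strength=0.5):
    # Rays that take many marches pass close to geometry, so darken them.
    return 1.0 - strength * steps / max_steps

def apply_fog(color, sky, t, density=0.05):
    # Exponentially blend toward the sky color with distance t from the camera.
    f = 1.0 - math.exp(-density * t)
    return tuple((1 - f) * c + f * s for c, s in zip(color, sky))
```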

Once we hit an object, we can use the surface normal to bump the position back outside of the object, then turn and march straight toward the light source. If we hit something else along the way, we darken the pixel.

Amazingly, we also get soft shadows nearly for free with this method: even when the shadow ray doesn't hit anything on its way to the light source, we darken the pixel based on how close it came to hitting something.
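A common implementation of this trick (popularized by Inigo Quilez) tracks the minimum of $k\,d/t$ along the shadow ray, where $t$ is the distance traveled so far; a small minimum means a near miss and a darker pixel. A sketch, assuming `p` has already been bumped off the surface:

```python
def soft_shadow(de, p, light_dir, k=8.0, eps=1e-3, max_t=20.0):
    # March from just above the surface toward the light; rays that pass close
    # to other geometry are partially darkened (k controls penumbra sharpness).
    t, shade = 10 * eps, 1.0
    while t < max_t:
        d = de((p[0] + t * light_dir[0],
                p[1] + t * light_dir[1],
                p[2] + t * light_dir[2]))
        if d < eps:
            return 0.0                  # fully in shadow
        shade = min(shade, k * d / t)   # remember the closest near miss
        t += d
    return shade
```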

Reflections are even simpler: by reflecting the direction through the surface normal, we can start a new march and mix it with our current color.
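The reflected direction is the usual formula $v - 2(v \cdot n)\,n$ for a unit normal $n$:

```python
def reflect(v, n):
    # Reflect direction v through the unit surface normal n.
    d = sum(a * b for a, b in zip(v, n))
    return tuple(a - 2 * d * b for a, b in zip(v, n))
```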

The real power of ray marching comes from the ways in which we can manipulate DEs to effectively cheat on computation time while the ray is still traveling.

For example, if we modulo the position by a constant in all three directions at every step, the effect is to render infinitely many spheres at **no additional time complexity**, and we can invert the sphere DE to render a room-like space instead.
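A sketch of the folding trick — note it is only a valid DE when the object sits comfortably inside one cell, since the estimate ignores copies in neighboring cells:

```python
def repeat(de, period):
    # Fold all of space into one cell of side `period`, centered at the origin;
    # the single object's DE then describes infinitely many copies at once.
    half = period / 2.0
    def tiled(p):
        q = tuple(((c + half) % period) - half for c in p)
        return de(q)
    return tiled
```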

Let’s render a cube fractal that’s reminiscent of the free group on three generators. We’ll begin with a DE for a cube, and based on which of the six sides we’re closest to, we’ll union it with a smaller cube on that side. Repeating this produces $O(6^n)$ cubes in $O(n)$ time.

By carving out cubes instead of adding them, we can also easily render the Menger Sponge.
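One widely used Menger sponge DE (transcribed to Python from Inigo Quilez's well-known GLSL version) intersects the cube with the complement of a repeating cross of holes at successively finer scales:

```python
def de_box(p, r):
    # Cube of side 2r at the origin (max-norm bound, as in the DE list above).
    return max(abs(p[0]) - r, abs(p[1]) - r, abs(p[2]) - r)

def de_menger(p, iterations=4):
    # Menger sponge: carve a repeating cross of square holes out of a cube,
    # rescaling by 3 each iteration.
    d = de_box(p, 1.0)
    s = 1.0
    for _ in range(iterations):
        a = tuple((c * s) % 2.0 - 1.0 for c in p)
        s *= 3.0
        r = tuple(abs(1.0 - 3.0 * abs(c)) for c in a)
        da, db, dc = max(r[0], r[1]), max(r[1], r[2]), max(r[2], r[0])
        hole = (min(da, min(db, dc)) - 1.0) / s
        d = max(d, hole)  # set difference = intersect with the hole's complement
    return d
```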

By changing the paths along which light travels, we can simulate a different curvature of space.

With (a lot) more work, we can render scenes in the eight Thurston geometries on 3-manifolds.

Ray marching is an excellent tool for rendering anything in 3D, particularly when it involves complicated mathematical objects.

More broadly, it’s an excellent example of extracting an incredible amount of detail and quality from a limited amount of information.