The goal of this post is to show the math of backpropagating a derivative for a
fully-connected (FC) neural network layer consisting of matrix multiplication
and bias addition. I have briefly mentioned this in an earlier post dedicated
but here I want to give some more attention to FC layers specifically.
Here is a fully-connected layer for input vectors with N elements, producing
output vectors with T elements:
As a formula, we can write:
Presumably, this layer is part of a network that ends up computing some loss
L. We’ll assume we already have the derivative of the loss w.r.t. the output
of the layer .
We’ll be interested in two other derivatives:
Back to the fully-connected layer
Circling back to our fully-connected layer, we have the loss – a
scalar function . We also have the
function . If we’re interested in the derivative w.r.t the
weights, what are the dimensions of this function? Our “variable part” is then
W, which has NT elements overall, and the output has T elements, so
The chain rule tells us how to compute the derivative of L w.r.t. W:
Since we’re backpropagating, we already know ; because of the
dimensionality of the L function, the dimensions of are
[1,T] (one row, T columns). has NT inputs and T outputs,
so the dimensions of are [T,NT]. Overall, the dimensions of
are then [1,NT]. This makes sense if you think about it,
because as a function of W, the loss has NT inputs and a single scalar
What remains is to compute , the Jacobian of y w.r.t. W. As
mentioned above, it has T rows – one for each output element of y, and NT
columns – one for each element in the weight matrix W. Computing such a large
Jacobian may seem daunting, but we’ll soon see that it’s very easy to generalize
from a simple example. Let’s start with :
What’s the derivative of this result element w.r.t. each element in W? When
the element is in row 1, the derivative is (j being the column
of W); when the element is in any other row, the derivative is 0.
Similarly for , we’ll have non-zero derivatives only for the second
row of W (with the same result of being the derivative for the
j-th column), and so on.
Generalizing from the example, if we split the index of W to i and j, we
This goes into row t, column in the Jacobian matrix. Overall,
we get the following Jacobian matrix with shape [T,NT]:
Now we’re ready to finally multiply the Jacobians together to complete the
The left-hand side is this row vector:
And we’re multiplying it by the matrix shown above. Each item in the
result vector will be a dot product between and the corresponding
column in the matrix . Since has a single non-zero element
in each column, the result is fairly trivial. The first N entries are:
The next N entries are:
And so on, until the last (T-th) set of N entries is all x-es multiplied
To better see how to apply each derivative to a corresponding element in W, we
can “re-roll” this result back into a matrix of shape [T,N]:
Computational cost and shortcut
While the derivation shown above is complete and mathematically correct, it can
also be computationally intensive; in realistic scenarios, the full Jacobian
matrix can be really large. For example, let’s say our input is a (modestly
sized) 128×128 image, so N=16,384. Let’s also say that T=100. The weight
matrix then has NT=1,638,400 elements; respectably big, but nothing out of the
Now consider the size of the full Jacobian matrix: it’s T by NT, or over
160 million elements. At 4 bytes per element that’s more than half a GiB!
Moreover, to compute every backpropagation we’d be forced to multiply this full
Jacobian matrix by a 100-dimensional vector, performing 160 million
multiply-and-add operations for the dot products. That’s a lot of compute.
But the final result is the size of W – 1.6 million
elements. Do we really need 160 million computations to get to it? No, because
the Jacobian is very sparse – most of it is zeros. And in fact, when we look
at the found above – it’s fairly straightforward to
compute using a single multiplication per element.
Moreover, if we stare at the matrix a
bit, we’ll notice it has a familiar pattern: this is just the outer product between the vectors
If we have to compute this backpropagation in Python/Numpy, we’ll likely write
code similar to:
# Assuming dy (gradient of loss w.r.t. y) and x are column vectors, by # performing a dot product between dy (column) and x.T (row) we get the # outer product. dW = np.dot(dy, x.T)
We’ve just seen how to compute weight gradients for a fully-connected layer.
Computing the gradients for the bias vector is very similar, and a bit simpler.
This is the chain rule equation applied to the bias vector:
The shapes involved here are: is still [1,T], because the
number of elements in y remains T. has T inputs (bias
elements) and T outputs (y elements), so its shape is [T,T]. Therefore, the
shape of the gradient is [1,T].
To see how we’d fill the Jacobian matrix , let’s go back to the
formula for y:
When derived by anything other than , this would be 0; when derived
by the result is 1. The same applies to every other element of y:
In matrix form, this is just an identity matrix with dimensions [T,T].
For a given element of b, its gradient is just the corresponding element in
Fully-connected layer for a batch of inputs
The derivation shown above applies to a FC layer with a single input vector x
and a single output vector y. When we train models, we almost always try to
do so in batches (or mini-batches) to better leverage the parallelism of
modern hardware. So a more typical layer computation would be:
Where the shape of X is [N,B]; B is the batch size, typically a
not-too-large power of 2, like 32. W and b still have the same shapes, so
the shape of Y is [T,B]. Each column in X is a new input vector (for a
total of B vectors in a batch); a corresponding column in Y is the output.
As before, given , our goal is to find
. While the end results are fairly simple
and pretty much what you’d expect, I still want to go through the full Jacobian
computation to show how to find the gradiends in a rigorous way.
Starting with the weigths, the chain rule is:
The dimensions are:
- : [1,TB] because Y has T outputs for each input vector in
- : [TB,TN] since has TB outputs and TN
- : [1,TN] same as in the batch-1 case, because the same
weight matrix is used for all inputs in the batch.
Also, we’ll use the notation to talk about the i-th
element in the b-th input vector x (out of a total of B such input
With this in hand, let’s see how the Jacobians look; starting with
, it’s the same as before except that we have to take the batch
into account. Each batch element is independent of the others in loss
computations, so we’ll have:
As the Jacobian element; how do we arrange them in a 1-dimensional vector with
shape [1,TB]? We’ll just have to agree on a linearization here – same as we did
with W before. We’ll go for row-major again, so in 1-D the array Y would be:
And so on for T elements. Therefore, the Jacobian of L w.r.t Y is:
To find , let’s first see how to compute Y. The i-th element
of Y for batch b is:
Recall that the Jacobian now has shape [TB,TN]. Previously we had
to unroll the [T,N] of the weight matrix into the rows. Now we’ll also have to
unrill the [T,B] of the output into the columns. As before, first all b-s for
t=1, then all b-s for t=2, etc. If we carefully compute the derivative,
we’ll see that the Jacobian matrix has similar structure to the single-batch
case, just with each line repeated B times for each of the batch elements:
Multiplying the two Jacobians together we get the full gradient of L w.r.t.
each element in the weight matrix. Where previously (in the non-batch case) we
Now, instead, we’ll have:
Which makes total sense, since it’s simply taking the loss gradient computed
from each batch separately and adds them up. This aligns with our intuition of
how gradient for a whole batch is computed – compute the gradient for each batch
element separately and add up all the gradients .
As before, there’s a clever way to express the final gradient using matrix
operations. Note the sum across all batch elements when computing
. We can express this as the matrix
This is a good place to recall the computation cost again. Previously we’ve seen
that for a single-input case, the Jacobian can be extremely large ([T,NT] having
about 160 million elements). In the batch case, the Jacobian would be even
larger since its shape is [TB,NT]; with a reasonable batch of 32, it’s something
like 5-billion elements strong. It’s good that we don’t actually have to hold
the full Jacobian in memory and have a shortcut way of computing the gradient.
Bias gradient for a batch
For the bias, we have:
here has the shape [1,TB]; has the shape [TB,T].
Therefore, the shape of is [1,T], as
From the formula for computing Y:
We get, for any batch b:
So, whereas was an identity matrix in the no-batch case, here it
looks like this:
With B identical rows at a time, for a total of TB rows. Since
is the same as before, their matrix
multiplication result has this in column j:
Which just means adding up the gradient effects from every batch element
Addendum – gradient w.r.t. x
This post started by explaining that the parameters of a fully-connected layer
we’re usually looking to optimize are the weight matrix and bias. In most cases
this is true; however, in some other cases we’re actually interested in
propagating a gradient through x – often when there are more layers before
the fully-connected layer in question.
Let’s find the derivative . The chain
rule here is:
Dimensions: is [1, T] as before; has T outputs
(elements of y) and N inputs (elements of x), so its dimensions are [T, N].
Therefore, the dimensions of are [1, N].
We know that . Generalizing
this, we get ; in other words,
is just the weight matrix W. So
is the dot product of
with the i-th column of W.
Computationally, we can express this as follows:
Again, recall that our vectors are column vectors. Therefore, to multiply dy
from the left by W we have to transpose it to a row vector first. The result
of this matrix multiplication is a [1, N] row-vector, so we transpose it again
to get a column.
An alternative method to compute this would transpose W rather than dy and
then swap the order:
These two methods produce exactly the same dx; it’s important to be familiar
with these tricks, because otherwise it may be confusing to see a transposed W
when we expect the actual W from gradient computations.