# Backpropagation through a fully-connected layer

The goal of this post is to show the math of backpropagating a derivative for a

fully-connected (FC) neural network layer consisting of matrix multiplication

and bias addition. I have briefly mentioned this in an earlier post dedicated

to Softmax,

but here I want to give some more attention to FC layers specifically.

Here is a fully-connected layer for input vectors with *N* elements, producing

output vectors with *T* elements:

As a formula, we can write:

Presumably, this layer is part of a network that ends up computing some loss

*L*. We’ll assume we already have the derivative of the loss w.r.t. the output

of the layer .

We’ll be interested in two other derivatives:

and

.

## Back to the fully-connected layer

Circling back to our fully-connected layer, we have the loss – a

scalar function . We also have the

function . If we’re interested in the derivative w.r.t the

weights, what are the dimensions of this function? Our “variable part” is then

*W*, which has *NT* elements overall, and the output has *T* elements, so

.

The chain rule tells us how to compute the derivative of *L* w.r.t. *W*:

Since we’re backpropagating, we already know ; because of the

dimensionality of the *L* function, the dimensions of are

[1,T] (one row, *T* columns). has *NT* inputs and *T* outputs,

so the dimensions of are [T,NT]. Overall, the dimensions of

are then [1,NT]. This makes sense if you think about it,

because as a function of *W*, the loss has *NT* inputs and a single scalar

output.

What remains is to compute , the Jacobian of *y* w.r.t. *W*. As

mentioned above, it has *T* rows – one for each output element of *y*, and *NT*

columns – one for each element in the weight matrix *W*. Computing such a large

Jacobian may seem daunting, but we’ll soon see that it’s very easy to generalize

from a simple example. Let’s start with :

What’s the derivative of this result element w.r.t. each element in *W*? When

the element is in row 1, the derivative is (*j* being the column

of *W*); when the element is in any other row, the derivative is 0.

Similarly for , we’ll have non-zero derivatives only for the second

row of *W* (with the same result of being the derivative for the

*j*-th column), and so on.

Generalizing from the example, if we split the index of *W* to *i* and *j*, we

get:

This goes into row *t*, column in the Jacobian matrix. Overall,

we get the following Jacobian matrix with shape [T,NT]:

Now we’re ready to finally multiply the Jacobians together to complete the

chain rule:

The left-hand side is this row vector:

And we’re multiplying it by the matrix shown above. Each item in the

result vector will be a dot product between and the corresponding

column in the matrix . Since has a single non-zero element

in each column, the result is fairly trivial. The first *N* entries are:

The next *N* entries are:

And so on, until the last (*T*-th) set of *N* entries is all *x*-es multiplied

by .

To better see how to apply each derivative to a corresponding element in *W*, we

can “re-roll” this result back into a matrix of shape [T,N]:

## Computational cost and shortcut

While the derivation shown above is complete and mathematically correct, it can

also be computationally intensive; in realistic scenarios, the full Jacobian

matrix can be *really* large. For example, let’s say our input is a (modestly

sized) 128×128 image, so *N=16,384*. Let’s also say that *T=100*. The weight

matrix then has *NT=1,638,400* elements; respectably big, but nothing out of the

ordinary.

Now consider the size of the full Jacobian matrix: it’s *T* by *NT*, or over

160 million elements. At 4 bytes per element that’s more than half a GiB!

Moreover, to compute every backpropagation we’d be forced to multiply this full

Jacobian matrix by a 100-dimensional vector, performing 160 million

multiply-and-add operations for the dot products. That’s a lot of compute.

But the final result is the size of *W* – 1.6 million

elements. Do we really need 160 million computations to get to it? No, because

the Jacobian is very *sparse* – most of it is zeros. And in fact, when we look

at the found above – it’s fairly straightforward to

compute using a single multiplication per element.

Moreover, if we stare at the matrix a

bit, we’ll notice it has a familiar pattern: this is just the outer product between the vectors

and *x*:

If we have to compute this backpropagation in Python/Numpy, we’ll likely write

code similar to:

```
# Assuming dy (gradient of loss w.r.t. y) and x are column vectors, by
# performing a dot product between dy (column) and x.T (row) we get the
# outer product.
dW = np.dot(dy, x.T)
```

## Bias gradient

We’ve just seen how to compute weight gradients for a fully-connected layer.

Computing the gradients for the bias vector is very similar, and a bit simpler.

This is the chain rule equation applied to the bias vector:

The shapes involved here are: is still [1,T], because the

number of elements in *y* remains *T*. has *T* inputs (bias

elements) and *T* outputs (*y* elements), so its shape is [T,T]. Therefore, the

shape of the gradient is [1,T].

To see how we’d fill the Jacobian matrix , let’s go back to the

formula for *y*:

When derived by anything other than , this would be 0; when derived

by the result is 1. The same applies to every other element of *y*:

In matrix form, this is just an identity matrix with dimensions [T,T].

Therefore:

For a given element of *b*, its gradient is just the corresponding element in

.

## Fully-connected layer for a batch of inputs

The derivation shown above applies to a FC layer with a single input vector *x*

and a single output vector *y*. When we train models, we almost always try to

do so in *batches* (or *mini-batches*) to better leverage the parallelism of

modern hardware. So a more typical layer computation would be:

Where the shape of *X* is [N,B]; *B* is the batch size, typically a

not-too-large power of 2, like 32. *W* and *b* still have the same shapes, so

the shape of *Y* is [T,B]. Each column in *X* is a new input vector (for a

total of *B* vectors in a batch); a corresponding column in *Y* is the output.

As before, given , our goal is to find

and

. While the end results are fairly simple

and pretty much what you’d expect, I still want to go through the full Jacobian

computation to show how to find the gradiends in a rigorous way.

Starting with the weigths, the chain rule is:

The dimensions are:

- : [1,TB] because
*Y*has*T*outputs for each input vector in

the batch. - : [TB,TN] since has
*TB*outputs and*TN*

inputs overall. - : [1,TN] same as in the batch-1 case, because the same

weight matrix is used for all inputs in the batch.

Also, we’ll use the notation to talk about the *i*-th

element in the *b*-th input vector *x* (out of a total of *B* such input

vectors).

With this in hand, let’s see how the Jacobians look; starting with

, it’s the same as before except that we have to take the batch

into account. Each batch element is independent of the others in loss

computations, so we’ll have:

As the Jacobian element; how do we arrange them in a 1-dimensional vector with

shape [1,TB]? We’ll just have to agree on a linearization here – same as we did

with *W* before. We’ll go for row-major again, so in 1-D the array *Y* would be:

And so on for *T* elements. Therefore, the Jacobian of *L* w.r.t *Y* is:

To find , let’s first see how to compute *Y*. The *i*-th element

of *Y* for batch *b* is:

Recall that the Jacobian now has shape [TB,TN]. Previously we had

to unroll the [T,N] of the weight matrix into the rows. Now we’ll also have to

unrill the [T,B] of the output into the columns. As before, first all *b*-s for

*t=1*, then all *b*-s for *t=2*, etc. If we carefully compute the derivative,

we’ll see that the Jacobian matrix has similar structure to the single-batch

case, just with each line repeated *B* times for each of the batch elements:

Multiplying the two Jacobians together we get the full gradient of *L* w.r.t.

each element in the weight matrix. Where previously (in the non-batch case) we

had:

Now, instead, we’ll have:

Which makes total sense, since it’s simply taking the loss gradient computed

from each batch separately and adds them up. This aligns with our intuition of

how gradient for a whole batch is computed – compute the gradient for each batch

element separately and add up all the gradients .

As before, there’s a clever way to express the final gradient using matrix

operations. Note the sum across all batch elements when computing

. We can express this as the matrix

multiplication:

This is a good place to recall the computation cost again. Previously we’ve seen

that for a single-input case, the Jacobian can be extremely large ([T,NT] having

about 160 million elements). In the batch case, the Jacobian would be even

larger since its shape is [TB,NT]; with a reasonable batch of 32, it’s something

like 5-billion elements strong. It’s good that we don’t actually have to hold

the full Jacobian in memory and have a shortcut way of computing the gradient.

## Bias gradient for a batch

For the bias, we have:

here has the shape [1,TB]; has the shape [TB,T].

Therefore, the shape of is [1,T], as

before.

From the formula for computing *Y*:

We get, for any batch *b*:

So, whereas was an identity matrix in the no-batch case, here it

looks like this:

With *B* identical rows at a time, for a total of *TB* rows. Since

is the same as before, their matrix

multiplication result has this in column *j*:

Which just means adding up the gradient effects from every batch element

independently.

## Addendum – gradient w.r.t. x

This post started by explaining that the parameters of a fully-connected layer

we’re usually looking to optimize are the weight matrix and bias. In most cases

this is true; however, in some other cases we’re actually interested in

propagating a gradient through *x* – often when there are more layers before

the fully-connected layer in question.

Let’s find the derivative . The chain

rule here is:

Dimensions: is [1, T] as before; has T outputs

(elements of *y*) and N inputs (elements of *x*), so its dimensions are [T, N].

Therefore, the dimensions of are [1, N].

From:

We know that . Generalizing

this, we get ; in other words,

is just the weight matrix *W*. So

is the dot product of

with the *i*-th column of *W*.

Computationally, we can express this as follows:

Again, recall that our vectors are *column* vectors. Therefore, to multiply *dy*

from the left by *W* we have to transpose it to a row vector first. The result

of this matrix multiplication is a [1, N] row-vector, so we transpose it again

to get a column.

An alternative method to compute this would transpose *W* rather than *dy* and

then swap the order:

These two methods produce exactly the same *dx*; it’s important to be familiar

with these tricks, because otherwise it may be confusing to see a transposed *W*

when we expect the actual *W* from gradient computations.