Estimating Layers in Transformers

We are going to estimate the number of layers in a transformer for a given number of parameters. We will take a simplified view of a decoder-style transformer. The goal is not to count every small term, but to get a useful first-order estimate.

The Building Blocks

What are the building blocks of this transformer?

  1. Input embedding, parameterised by \(\W_E \in \mathbb{R}^{d_{\text{vocab}} \times d_{\text{emb}}}\).
  2. Positional embedding, parameterised by \(\W_{E_{\text{p}}} \in \mathbb{R}^{T_{\max} \times d_{\text{emb}}}\), where \(T_{\max}\) is the maximum context length. For architectures that use rotary position embeddings (RoPE), there is no separate learned positional embedding matrix. So, this building block is omitted in our main calculation.
  3. \(L\) layers, or transformer blocks, where each block has:
    1. Multi-head attention, formed from \(n_{\text{head}}\) attention heads:
      1. Each head has \(Q\), \(K\), and \(V\) projections, parameterised by \(\W_Q, \W_K, \W_V \in \mathbb{R}^{d_{\text{emb}} \times d_{\text{head}}}\). The head dimension is usually chosen so that \(d_{\text{head}} = d_{\text{emb}} / n_{\text{head}}\).
      2. After the heads are concatenated, a single output projection is applied, parameterised by \(\W_O \in \mathbb{R}^{d_{\text{emb}} \times d_{\text{emb}}}\).
    2. Feed-forward layer, parameterised by \(\W_{\uparrow} \in \mathbb{R}^{d_{\text{emb}} \times 4d_{\text{emb}}}\) and \(\W_{\downarrow} \in \mathbb{R}^{4 d_{\text{emb}} \times d_{\text{emb}}}\).
  4. Output embedding, parameterised by \(\W_{E_\text{o}} \in \mathbb{R}^{d_{\text{emb}} \times d_{\text{vocab}}}\). In some transformer architectures, the output embedding is not separate but shared with the input embedding, i.e. \(\W_{E_\text{o}} = \W_E^\top\). Here, however, we treat it as a separate, untied matrix.

For simplicity, we ignore bias terms, LayerNorm parameters, and other small architectural details. We also assume standard multi-head attention and a two-matrix feed-forward layer with expansion factor \(4\). Architectures with grouped-query attention, multi-query attention, gated feed-forward layers, or mixture-of-experts layers change the constants below.

Counting Parameters

Let's use \(n\left[\cdot\right]\) to denote the number of parameters. For example, since \(\W_E\) is a matrix of dimension \(d_{\text{vocab}} \times d_{\text{emb}}\), the number of parameters is

\[ n\left[\W_E\right] = d_{\text{vocab}} d_{\text{emb}}. \]

If learned absolute positional embeddings are included, they add \(T_{\max}d_{\text{emb}}\) parameters. In the main calculation below, we use the modern RoPE-style case, so that positional-embedding term is omitted. We do keep the output embedding as a separate, untied matrix, as noted in the building blocks above, so it contributes another \(d_{\text{vocab}}d_{\text{emb}}\) parameters.

So, the total number of parameters is:

\begin{align} n\left[\text{Transformer}\right] = {} & n\left[\text{Embeddings}\right] + L \cdot n\left[\text{Transformer Block}\right]. \label{eq:param_decomposition} \end{align}

Equation \(\eqref{eq:param_decomposition}\) is the main decomposition: count the embeddings once, then add the same block count \(L\) times.

Since we assume no learned absolute positional embedding matrix, the embedding term is:

\begin{align} n\left[\text{Embeddings}\right] = {} & n\left[\W_E\right] + n\left[\W_{E_\text{o}}\right] \nonumber \\ = {} & 2 d_{\text{vocab}} d_{\text{emb}}. \label{eq:embedding_params} \end{align}
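
As a quick sanity check on the embedding term, here is a minimal Python sketch. The function name `embedding_params` and its `tied` and `t_max` options are illustrative choices, not part of the derivation; the defaults correspond to the untied, RoPE-style case used in the main calculation.

```python
from typing import Optional


def embedding_params(
    d_vocab: int,
    d_emb: int,
    tied: bool = False,
    t_max: Optional[int] = None,
) -> int:
    """Count embedding parameters under the simplified model.

    tied=False and t_max=None reproduce the 2 * d_vocab * d_emb term.
    """
    n = d_vocab * d_emb            # input embedding W_E
    if not tied:
        n += d_vocab * d_emb       # separate output embedding W_Eo
    if t_max is not None:
        n += t_max * d_emb         # learned absolute positional embedding
    return n


# Untied embeddings, no learned positional matrix (the main-calculation case).
print(embedding_params(d_vocab=128_256, d_emb=4_096))  # 1050673152
```

Setting `tied=True` halves the term, and passing `t_max` adds the \(T_{\max}d_{\text{emb}}\) contribution of a learned positional matrix.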

For self-attention:

\begin{align} n\left[\text{Self-Attention}\right] = {} & n_{\text{head}} \left( n\left[\W_Q\right] + n\left[\W_K\right] + n\left[\W_V\right] \right) + n\left[\W_O\right] \nonumber \\ = {} & n_{\text{head}} \left( 3 \cdot d_{\text{emb}} \frac{d_{\text{emb}}}{n_{\text{head}}} \right) + d_{\text{emb}} d_{\text{emb}} \nonumber \\ = {} & 4d^2_{\text{emb}}. \label{eq:self_attention_params} \end{align}

Notice that \(n_{\text{head}}\) cancels in \(\eqref{eq:self_attention_params}\). Increasing the number of heads decreases \(d_{\text{head}}\) in the same proportion, so the total size of the \(Q\), \(K\), and \(V\) projections stays the same under this simplified setup.
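
The cancellation can be checked numerically. The sketch below counts the attention matrices head by head, assuming \(d_{\text{head}} = d_{\text{emb}} / n_{\text{head}}\) exactly; the function name `attention_params` is hypothetical.

```python
def attention_params(d_emb: int, n_head: int) -> int:
    """Count Q, K, V and output-projection parameters, ignoring biases."""
    if d_emb % n_head != 0:
        raise ValueError("d_emb must be divisible by n_head")
    d_head = d_emb // n_head
    per_head = 3 * d_emb * d_head   # W_Q, W_K, W_V for one head
    w_o = d_emb * d_emb             # single output projection W_O
    return n_head * per_head + w_o


d_emb = 4_096
for n_head in (1, 8, 32, 64):
    # The total is 4 * d_emb^2 regardless of how many heads we use.
    assert attention_params(d_emb, n_head) == 4 * d_emb**2
```

This only holds for standard multi-head attention; with grouped-query or multi-query attention the \(K\) and \(V\) terms shrink and the total drops below \(4d^2_{\text{emb}}\).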

For the feed-forward layer:

\begin{align} n\left[\text{Feed-forward}\right] = {} & n\left[\W_{\uparrow}\right] + n\left[\W_{\downarrow}\right] \nonumber \\ = {} & d_{\text{emb}}(4d_{\text{emb}}) + (4d_{\text{emb}})d_{\text{emb}} \nonumber \\ = {} & 8d^2_{\text{emb}}. \label{eq:feed_forward_params} \end{align}

So, under this simplified architecture, the feed-forward layer in \(\eqref{eq:feed_forward_params}\) has twice as many parameters as the self-attention layer in \(\eqref{eq:self_attention_params}\). A single transformer block therefore has:

\begin{align} n\left[\text{Transformer Block}\right] = {} & n\left[\text{Self-Attention}\right] + n\left[\text{Feed-forward}\right] \nonumber \\ = {} & 4d^2_{\text{emb}} + 8d^2_{\text{emb}} \nonumber \\ = {} & 12d^2_{\text{emb}}. \label{eq:block_params} \end{align}
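
The per-block count is easy to express in code. The sketch below exposes the feed-forward expansion factor as a hypothetical `ffn_mult` argument (default \(4\), as assumed above), so it is easy to see how the constant \(12\) depends on that choice.

```python
def block_params(d_emb: int, ffn_mult: int = 4) -> int:
    """Parameters in one transformer block under the simplified model."""
    attention = 4 * d_emb**2                 # W_Q, W_K, W_V (all heads) + W_O
    feed_forward = 2 * ffn_mult * d_emb**2   # W_up and W_down
    return attention + feed_forward


d_emb = 4_096
print(block_params(d_emb))                   # 201326592, i.e. 12 * d_emb^2
print(block_params(d_emb) // d_emb**2)       # 12
```

With a different expansion factor the constant becomes \(4 + 2 \cdot \texttt{ffn\_mult}\); gated feed-forward layers, which use three matrices rather than two, are not covered by this sketch.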

Solving for the Number of Layers

Putting \(\eqref{eq:embedding_params}\) and \(\eqref{eq:block_params}\) into \(\eqref{eq:param_decomposition}\) gives:

\begin{align} n\left[\text{Transformer}\right] = {} & 2d_{\text{vocab}} d_{\text{emb}} + L \cdot 12d^2_{\text{emb}}. \label{eq:transformer_params} \end{align}

Finally, solving \(\eqref{eq:transformer_params}\) for \(L\) gives the estimated number of layers:

\begin{align} L = {} & \frac{ n\left[\text{Transformer}\right] - 2d_{\text{vocab}} d_{\text{emb}} }{ 12d^2_{\text{emb}} }. \label{eq:layer_estimate} \end{align}

The value from \(\eqref{eq:layer_estimate}\) should be read as a rough estimate and then rounded to a whole number, since an actual transformer has an integer number of blocks.
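
As a worked example, the following sketch evaluates the layer estimate for one set of inputs. The function name `estimate_layers` is illustrative, and the numbers plugged in are the Llama-3-8B values used in the comparison table.

```python
def estimate_layers(n_params: float, d_vocab: int, d_emb: int) -> float:
    """Unrounded layer estimate: (N - 2 * d_vocab * d_emb) / (12 * d_emb^2)."""
    embeddings = 2 * d_vocab * d_emb   # untied input and output embeddings
    per_block = 12 * d_emb**2          # attention (4 d^2) + feed-forward (8 d^2)
    return (n_params - embeddings) / per_block


# Llama-3-8B: 8.03 B parameters, vocabulary 128,256, hidden size 4,096.
raw = estimate_layers(8.03e9, d_vocab=128_256, d_emb=4_096)
print(f"{raw:.2f} -> {round(raw)}")    # 34.67 -> 35
```

The function returns the unrounded value on purpose, so the caller can see how far the estimate is from a whole number before rounding.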

Checking Against Real Models

To see how well \(\eqref{eq:layer_estimate}\) holds up, the table below applies it to eleven open-weight decoder models. For each one we plug in the published vocabulary size \(d_{\text{vocab}}\), hidden size \(d_{\text{emb}}\), and total parameter count, then compare the estimated layer count against the true number of blocks.

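| Family | Model | Parameters | \(d_{\text{vocab}}\) | \(d_{\text{emb}}\) | Actual \(L\) | Estimated \(L\) |
|---|---|---|---|---|---|---|
| Llama | Llama-3.2-1B | 1.22 B | 128,256 | 2,048 | 16 | 14 |
| Llama | Llama-3-8B | 8.03 B | 128,256 | 4,096 | 32 | 35 |
| Llama | Llama-3-70B | 70.6 B | 128,256 | 8,192 | 80 | 85 |
| Llama | Llama-3-405B | 405 B | 128,256 | 16,384 | 126 | 124 |
| Qwen | Qwen2.5-0.5B | 0.49 B | 151,936 | 896 | 24 | 23 |
| Qwen | Qwen2.5-1.5B | 1.54 B | 151,936 | 1,536 | 28 | 38 |
| Qwen | Qwen2.5-7B | 7.61 B | 152,064 | 3,584 | 28 | 42 |
| Qwen | Qwen2.5-72B | 72.5 B | 152,064 | 8,192 | 80 | 87 |
| Mistral | Mistral-7B-v0.1 | 7.24 B | 32,000 | 4,096 | 32 | 35 |
| Mistral | Mistral-Nemo-12B | 12.2 B | 131,072 | 5,120 | 40 | 35 |
| Mistral | Mistral-Large | 123 B | 32,768 | 12,288 | 88 | 67 |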
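
The estimated column can be reproduced with a short script. This is a sketch that uses the rounded parameter counts exactly as listed in the table; the names `MODELS`, `estimate_layers`, and `rows` are illustrative.

```python
# (model, parameters, d_vocab, d_emb, actual layers), copied from the table.
MODELS = [
    ("Llama-3.2-1B", 1.22e9, 128_256, 2_048, 16),
    ("Llama-3-8B", 8.03e9, 128_256, 4_096, 32),
    ("Llama-3-70B", 70.6e9, 128_256, 8_192, 80),
    ("Llama-3-405B", 405e9, 128_256, 16_384, 126),
    ("Qwen2.5-0.5B", 0.49e9, 151_936, 896, 24),
    ("Qwen2.5-1.5B", 1.54e9, 151_936, 1_536, 28),
    ("Qwen2.5-7B", 7.61e9, 152_064, 3_584, 28),
    ("Qwen2.5-72B", 72.5e9, 152_064, 8_192, 80),
    ("Mistral-7B-v0.1", 7.24e9, 32_000, 4_096, 32),
    ("Mistral-Nemo-12B", 12.2e9, 131_072, 5_120, 40),
    ("Mistral-Large", 123e9, 32_768, 12_288, 88),
]


def estimate_layers(n_params: float, d_vocab: int, d_emb: int) -> float:
    """Unrounded layer estimate under the 12 * d_emb^2 per-block assumption."""
    return (n_params - 2 * d_vocab * d_emb) / (12 * d_emb**2)


rows = []
for name, n_params, d_vocab, d_emb, actual in MODELS:
    estimated = round(estimate_layers(n_params, d_vocab, d_emb))
    # Positive diff: the formula overshoots; negative: it undershoots.
    rows.append((name, actual, estimated, estimated - actual))

for name, actual, estimated, diff in rows:
    print(f"{name:<18} actual={actual:>3}  estimated={estimated:>3}  diff={diff:+d}")
```

Because the inputs are rounded to three significant figures, the unrounded estimates carry some slack of their own, separate from the modelling error.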

Plotting the true layer count against the estimate makes the fit easy to read. Points on the dashed line are perfect estimates; points below the line are overestimates (the formula guessed more layers than the model has), and points above are underestimates.

Figure 1. Actual layer count plotted against the estimate for eleven open-weight models. The dashed line is the line of perfect agreement (\(\text{actual} = \text{estimated}\)).

I was surprised by how close most of the estimates are to the actual layer counts. Beyond that, I haven't investigated each architecture in detail; this is a back-of-the-envelope estimate.

The direction of the error still tells us something simple. If the formula overshoots the layer count, then more parameters must be going into each transformer block than our assumed \(12d^2_{\text{emb}}\), and the formula makes up the difference by adding extra layers. For the Qwen models (e.g. Qwen2.5-7B), the MLP blocks are larger than the \(4\times\) expansion we assumed, so each real block costs more than \(12d^2_{\text{emb}}\) and we overshoot.

On the other hand, we underestimate the layers for Mistral-Large, which means each of its transformer blocks holds fewer parameters than \(12d^2_{\text{emb}}\).
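
One way to make this concrete is to run the formula backwards: fix the actual layer count and solve for the per-block constant that the published totals imply. The sketch below does that using only numbers from the table; `implied_block_constant` is a hypothetical helper, and the constant it returns absorbs everything the simplified model ignores (norms, biases, rounding of the published parameter counts).

```python
def implied_block_constant(
    n_params: float, d_vocab: int, d_emb: int, actual_layers: int
) -> float:
    """Solve N = 2 * d_vocab * d_emb + L * c * d_emb^2 for c, given the real L."""
    non_embedding = n_params - 2 * d_vocab * d_emb
    return non_embedding / (actual_layers * d_emb**2)


c_qwen = implied_block_constant(7.61e9, 152_064, 3_584, actual_layers=28)
c_mistral = implied_block_constant(123e9, 32_768, 12_288, actual_layers=88)

print(f"Qwen2.5-7B:    c = {c_qwen:.1f}")     # about 18.1, above the assumed 12
print(f"Mistral-Large: c = {c_mistral:.1f}")  # about 9.2, below the assumed 12
```

A constant above \(12\) corresponds to an overestimate of the layer count and a constant below \(12\) to an underestimate, which matches the direction of the errors described above.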