Positional Encoding (PE) plays an important role in transformers, which are widely used in today's LLMs. There are many types of PE, and RoPE is one of them.
RoPE was first proposed in RoFormer [1] and has been adopted by many popular transformer models, such as LLaMa [2], open-sourced by Meta.
In this blog, we will introduce the basic concept of RoPE, its derivation, and length extrapolation techniques related to it.
Before we start, let's review some basic concepts of transformers and positional encoding.
Transformer 101
Given a sequence $S_N$ with length $N$, $t_i$ is the $i$-th token in the sequence, and $x_i$ is the $d$-dimensional embedding of $t_i$. We can formulate $S_N$ and $E_N$ as:
$$S_N = \{t_1, t_2, \dots, t_N\}, \quad E_N = \{x_1, x_2, \dots, x_N\}$$
Before computing self-attention, we need to transform each $x_i \in E_N$ into $Q_i$, $K_i$, $V_i$ with a linear projection while adding extra positional information. We can formulate this as:
$$Q_i = f_q(x_i, p_i), \quad K_j = f_k(x_j, p_j), \quad V_j = f_v(x_j, p_j)$$
where $p_i$ and $p_j$ are the positional information of $x_i$ and $x_j$ respectively, and $f_q$, $f_k$, $f_v$ are the transform functions.
Transformers are parallel architectures, which means they cannot capture the order of tokens in a sequence by themselves. Positional encoding comes to the rescue. Generally, there are two major approaches:
Fuse positional information into the input embeddings, which is called absolute positional embedding;
Fuse positional information into the self-attention scores, which is called relative positional embedding.
Absolute Positional Embedding (APE). The most common approach was proposed in the original transformer paper [3]: a fixed positional embedding is added to the input embedding. Periodic functions, namely sine and cosine, are used to generate the positional embedding. The formula is:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
where $pos$ is the position of the token, $d$ is the dimension of the embedding, and $i$ indexes the dimension pairs.
The Python code is as follows:
```python
import numpy as np

# seq_len: length of sequence
# d: dimension of embedding
def get_pos_embedding(seq_len, d):
    pos_embedding = np.zeros((seq_len, d))
    pos = np.arange(seq_len)  # positions 0 .. seq_len - 1
    for i in range(d):
        if i % 2 == 0:
            # even index uses sine
            pos_embedding[:, i] = np.sin(pos / 10000 ** (i / d))
        else:
            # odd index uses cosine
            pos_embedding[:, i] = np.cos(pos / 10000 ** ((i - 1) / d))
    return pos_embedding
```
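As a quick usage example (continuing the snippet above), the encoding is simply added to the token embeddings before the first attention layer; the shapes and random embeddings below are just placeholders for illustration:

```python
seq_len, d = 8, 16
token_embeddings = np.random.randn(seq_len, d)         # stand-in for real token embeddings
x = token_embeddings + get_pos_embedding(seq_len, d)   # fuse absolute position into the input
```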
It's evident that the sine/cosine positional encoding is periodic, so it can be expected to extrapolate to longer sequences to a certain degree. [4]
Another common choice is a learned version of APE, where the positional embedding is a trainable parameter, as in GPT-3 [5].
Relative Positional Embedding (RPE). Relative positional encoding doesn't model the position of each individual token.
Instead, it models the relative position between tokens when computing self-attention scores.
For example, T5's relative bias [6] first maps the relative distance $(i - j)$ between the tokens at positions $i$ and $j$ to a scalar bias value $b = f(i - j)$.
This bias is then added to the dot product of query and key in the self-attention mechanism.
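To make this concrete, here is a minimal sketch of adding a relative bias to attention scores. Note that T5 actually buckets relative distances logarithmically; for clarity this sketch just clips the distance and indexes a learned bias table (the function name and clipping scheme are illustrative assumptions, not T5's exact implementation):

```python
import numpy as np

def add_relative_bias(scores, bias_table, max_dist):
    # scores: (seq_len, seq_len) raw q.k attention scores
    # bias_table: (2 * max_dist + 1,) learned scalar biases, one per clipped offset
    seq_len = scores.shape[0]
    i = np.arange(seq_len)[:, None]              # query positions
    j = np.arange(seq_len)[None, :]              # key positions
    rel = np.clip(i - j, -max_dist, max_dist)    # relative distance (i - j), clipped
    b = bias_table[rel + max_dist]               # scalar bias b = f(i - j)
    return scores + b                            # bias added to the dot product
```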
Rotary Positional Embedding(RoPE)
APE is straightforward and easy to implement, while RPE is more intuitive and often more effective; RoPE manages to combine the advantages of both.
Formulation
Given $q, k$, we can add absolute positional information as follows:
$$q_m = f(q, m), \quad k_n = f(k, n)$$
$f(\cdot, m)$ is the function that adds positional information to its input. Since attention is computed by dot product, the equation below needs to be satisfied:
$$\langle f(q, m), f(k, n) \rangle = g(q, k, m - n)$$
where $g(\cdot, \cdot, \cdot)$ is the function that computes the self-attention score. We can safely assume $f(q, 0) = q$ and $f(k, 0) = k$.
Consider the 2-D case and the complex field, i.e. $q = (q_1, q_2) = q_1 + i q_2$. Solving the constraint above (see RoFormer [1] for the full derivation), we find that $f$ rotates $q$ in the complex plane by an angle proportional to its position:
$$f(q, m) = q \, e^{i m \theta}$$
In the 2-D case, we can transform $q \, e^{i m \theta}$ into the following matrix form:
$$q \, e^{i m \theta} = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} q_1 \\ q_2 \end{pmatrix}$$
As I said at the beginning, $f(\cdot, m)$ is the function that adds position information, so it can be seen as rotating $q$ by the angle $m\theta$. In other words, $f(\cdot, m)$ is a rotation function.
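A quick numerical check of the relative-position property: rotating $q$ and $k$ by their own positions and taking the dot product gives a score that depends only on the offset $m - n$ (the value of $\theta$ below is an arbitrary choice for illustration):

```python
import numpy as np

def rotate_2d(x, pos, theta=0.5):
    # f(x, pos): rotate the 2-D vector x by the angle pos * theta
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([[c, -s], [s, c]]) @ x

q = np.array([0.3, -1.2])
k = np.array([0.7, 0.4])
print(rotate_2d(q, 5) @ rotate_2d(k, 3))  # positions (5, 3), offset 2
print(rotate_2d(q, 9) @ rotate_2d(k, 7))  # positions (9, 7), offset 2 -> same score
```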
General Forms
For higher-dimensional spaces, we can simply split the $d$ dimensions into $d/2$ pairs and apply a 2-D rotation to each pair.
Putting all these pieces together, we get the general form of $f(q, m)$:
$$f(q, m) = \begin{pmatrix} M_1 & & & \\ & M_2 & & \\ & & \ddots & \\ & & & M_{d/2} \end{pmatrix} \begin{pmatrix} q_0 \\ q_1 \\ \vdots \\ q_{d-1} \end{pmatrix}$$
where each $M_i$ is a 2-D rotation matrix
$$M_i = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}$$
and $\theta_i$ is the rotation frequency of the $i$-th 2-D pair (RoFormer [1] sets $\theta_i = 10000^{-2(i-1)/d}$).
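To make the block-diagonal form concrete, here is a naive sketch that materializes the rotation matrix and applies it to a $d$-dimensional query; the frequency schedule follows RoFormer with 0-based indexing, and real implementations never build this sparse matrix, as discussed next:

```python
import numpy as np

def rope_naive(q, m):
    # Build the block-diagonal rotation matrix for position m and apply it to q.
    d = q.shape[0]  # d must be even
    R = np.zeros((d, d))
    for i in range(d // 2):
        theta_i = 10000 ** (-2 * i / d)      # per-pair rotation frequency
        c, s = np.cos(m * theta_i), np.sin(m * theta_i)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R @ q
```

The same rotation is applied to the keys, so the attention score between positions $m$ and $n$ depends only on the relative offset $m - n$.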
Considering the sparsity of $M$, we can use the element-wise form for computation: