How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability — AI News