📚 The path to LLM mastery is paved with broken embeddings and resurrected gradients.
- base
- transformer
- vision transformer (ViT)
- lm head
- kv cache
- GPU Architecture
- SM80
- SM90
- SM100
- SM120
- attention
- self attention
- online attention
- flash attention
- flash attention 2
- flash attention 3
- flash decoding
- flash decoding++
- scaled dot-product attention (SDPA)
- multi-head self-attention (MHSA)
- multi-head attention (MHA)
- grouped-query attention (GQA)
- multi-query attention (MQA)
- multi-head latent attention (MLA)
- multi-token attention (MTA)
- sage attention 1
- sage attention 2
- sage attention 2++
- sage attention 3
- paged attention
- ring attention
- ring flash attention
- linear attention
- lightning attention
- native sparse attention (NSA)
- grouped latent attention (GLA)
- grouped-tied attention (GTA)
- softmax
- softmax
- safe softmax
- online softmax
- kv cache optimization
- sparse
- quantization
- allocator
- window
- share
- norm
- Batch Norm
- Layer Norm
- RMS Norm
- position embedding
- RoPE
- ALiBi
- 2D RoPE
- 3D RoPE
- NTK-Aware RoPE
- YaRN
- quantization
- SmoothQuant
- AWQ
- KIVI
- GPTQ
- design
- chunked prefill
- continuous batching
- speculative decoding
- Medusa
- Lookahead decoding
- NGram
- OSD
- EAGLE 1, 2, 3
- sliding window
- multi-token prediction (MTP)
- reinforcement learning
- PPO
- GRPO
- DAPO
- GPG
- gemm
- deep gemm
- cutlass
- cooperative and ping-pong gemm scheduler
- cublas
- open source
- flash mla
- ptx instructions
- mbarrier
- cp.async
- ldmatrix
- mma
- wgmma