Just before Christmas, I stumbled upon the Machine Learning Engineering Open Book by Stas Bekman. It's a book I'd long been looking for: a collection of insights on operating a large GPU cluster and using it to train and run LLMs – AI accelerators, fast intra-node and inter-node networking, optimized cluster storage, and large-scale LLM training and inference – with lots of first-hand experience on these topics. I want to share my key takeaways from the book, as well as other excellent resources I've discovered along the way.