Lessons from the Trenches on Reproducible Evaluation of Language Models — AI News