I came across a piece from WorkOS on how they test their AI agents. They built eval datasets, scoring systems, regression checks, and comparison tooling. Because nothing existed to do it for them. ๐ง๐ต๐ฎ๐ ๐๐ฒ๐น๐น๐ ๐๐ผ๐ ๐๐ผ๐บ๐ฒ๐๐ต๐ถ๐ป๐ด ๐ถ๐บ๐ฝ๐ผ๐ฟ๐๐ฎ๐ป๐. This isnโt just a tooling problem. Itโs an infrastructure gap.

