
We Randomized PPO Hyperparameters 100 Times. Only 2 Really Mattered.
PPO has a reputation for being the “stable” reinforcement learning algorithm. That reputation is mostly deserved. Compared with older policy-gradient methods, PPO is easier to get running, easier to debug, and less likely to destroy a policy with one bad update. But “stable” is not the same thing as “insensitive.” In a previous experiment, How Sensitive […]
Read Main Article→



