The multi-model Markov decision process (MMDP)
is a promising framework for computing policies
that are robust to parameter uncertainty in MDPs.
MMDPs aim to find a policy that maximizes the
expected return over a distribution of MDP models. Because MMDPs are NP-hard to solve, most
methods resort to approximations. In this paper,
we derive the policy gradient of MMDPs and propose CADP, which combines a coordinate ascent
method with a dynamic programming algorithm for
solving MMDPs. The main innovation of CADP
compared with earlier algorithms is its coordinate ascent perspective: model weights are adjusted iteratively, which guarantees monotone policy improvement toward a local maximum. A theoretical analysis
of CADP proves that it never performs worse than
previous dynamic programming algorithms such as
WSU. Our numerical results indicate that CADP
substantially outperforms existing methods on several benchmark problems.
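A minimal sketch of the coordinate-ascent-plus-dynamic-programming loop described above, assuming a small tabular finite-horizon setting. The function name cadp, the array shapes, and the specific weight update (prior model weight times per-model state occupancy under the current policy) are illustrative assumptions based on the abstract, not the paper's exact algorithm.

import numpy as np

def cadp(P, r, lam, s0, H, iters=50):
    """Sketch of a CADP-style loop for a tabular multi-model MDP.

    P   : (M, S, A, S) transition tensors, one per model
    r   : (M, S, A)    reward tables, one per model
    lam : (M,)         prior weights over models
    s0  : (S,)         initial state distribution
    H   : horizon (number of decision stages)
    """
    M, S, A, _ = P.shape
    pi = np.zeros((H, S), dtype=int)          # deterministic policy per stage

    for _ in range(iters):
        # Forward pass: per-model state occupancies under the current policy.
        d = np.tile(s0, (M, 1))               # d[m, s] at the current stage
        occ = np.zeros((H, M, S))
        for t in range(H):
            occ[t] = d
            nxt = np.zeros((M, S))
            for m in range(M):
                for s in range(S):
                    nxt[m] += d[m, s] * P[m, s, pi[t, s]]
            d = nxt

        # Backward pass: dynamic programming on weighted Q-values, with
        # stage weights proportional to lam[m] * occupancy (an assumption).
        V = np.zeros((M, S))                  # per-model value-to-go
        new_pi = np.zeros_like(pi)
        for t in reversed(range(H)):
            Q = r + np.einsum('msan,mn->msa', P, V)   # (M, S, A)
            w = lam[:, None] * occ[t]                 # (M, S) unnormalized weights
            scores = np.einsum('ms,msa->sa', w, Q)    # weighted Q per state
            new_pi[t] = scores.argmax(axis=1)
            for m in range(M):
                V[m] = Q[m, np.arange(S), new_pi[t]]
        if np.array_equal(new_pi, pi):        # converged: policy unchanged
            break
        pi = new_pi
    return pi

The alternation between the forward occupancy pass and the backward dynamic programming pass is what gives the loop its coordinate ascent structure: policy and model weights are improved in turn rather than jointly.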