Nathan Kallus - Toward Better Off-Policy Reinforcement Learning in High-Stakes Domains

Date

Mon November 11th 2019, 1:10pm

Location

Graduate School of Business, Gunn Building, Rm. G101

Information regarding parking: http://www.gsb.stanford.edu/visit


Talk title: Toward Better Off-Policy Reinforcement Learning in High-Stakes Domains

Abstract: In medicine and other high-stakes domains, randomized exploration is often infeasible and simulations unreliable, yet the potential for personalized and sequential decision making remains immense. The hope is that off-policy methods can leverage the available large-scale observational data, such as electronic medical records, to support better, more personalized decision making. Yet many significant hurdles to realizing this in practice remain. In this talk I will discuss two of the most challenging hurdles and possible remedies for them: the curse of horizon and the presence of unobserved confounding. At the end I will also mention other efforts in this direction.

Off-policy evaluation (OPE) is notoriously difficult in long and infinite horizons because, as trajectories grow long, the similarity (overlap) between any proposed policy and the observed data diminishes quickly, a phenomenon known as the curse of horizon. We characterize exactly when this curse bites by considering, for the first time, the semiparametric efficiency limits of OPE in Markov decision processes (MDPs). This shows that infinite-horizon OPE may be feasible in ergodic time-invariant MDPs. We develop the first efficient OPE estimator for this setting, termed Double Reinforcement Learning (DRL), for which we establish many favorable characteristics.
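The curse of horizon can be illustrated with a small sketch. This is not the DRL estimator from the talk; it only looks at plain trajectory-level importance weights under a deliberately simplified assumption: both policies ignore the state and choose among two actions independently at each step. The function names and the example policies are hypothetical, chosen for illustration only. Under that assumption the variance of the cumulative importance weight has a closed form that grows exponentially in the horizon, and a simulation lets one inspect how the effective sample size behaves as trajectories lengthen.

```python
import random


def per_step_second_moment(target, behavior):
    """Second moment of the one-step ratio pi_e(a)/pi_b(a), with a drawn from pi_b."""
    return sum(pe * pe / pb for pe, pb in zip(target, behavior))


def weight_variance(target, behavior, horizon):
    """Exact variance of the product of `horizon` independent one-step ratios.

    Each ratio has mean 1 under the behavior policy, so the product has mean 1
    and second moment (per-step second moment) ** horizon.
    """
    return per_step_second_moment(target, behavior) ** horizon - 1.0


def simulate_weights(target, behavior, horizon, n_trajectories, seed=0):
    """Sample cumulative importance weights for trajectories drawn from pi_b."""
    rng = random.Random(seed)
    actions = list(range(len(behavior)))
    weights = []
    for _ in range(n_trajectories):
        w = 1.0
        for a in rng.choices(actions, weights=behavior, k=horizon):
            w *= target[a] / behavior[a]
        weights.append(w)
    return weights


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


if __name__ == "__main__":
    # Hypothetical policies: behavior is uniform, target prefers action 0.
    behavior = [0.5, 0.5]
    target = [0.8, 0.2]
    for horizon in (1, 5, 10, 20, 40):
        ws = simulate_weights(target, behavior, horizon, n_trajectories=10_000)
        print(horizon, weight_variance(target, behavior, horizon),
              effective_sample_size(ws))
```

In this toy setting the per-step second moment exceeds 1 whenever the two policies differ, so the variance of the trajectory weight compounds geometrically with the horizon. Real MDPs have state-dependent policies and dependent steps, so the sketch only conveys the qualitative effect.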

Since observational data, no matter how rich, will inevitably have some unobserved confounding, another concern is whether assuming unconfoundedness (or other unverifiable identifying assumptions) might lead to harm. Instead, we propose a method that seeks to obtain the best-possible uniform control on the range of true policy regrets that might be realized due to unobserved confounding. We will establish various guarantees that support its soundness and illustrate it in the context of hormone replacement therapy using the Women's Health Initiative parallel observational study and clinical trial, which will demonstrate the power and safety of our approach.

This talk is based on joint work with Masatoshi Uehara (https://arxiv.org/abs/1909.05850, https://arxiv.org/abs/1908.08526) and Angela Zhou (https://arxiv.org/abs/1805.08593).

Bio:

Nathan Kallus is an Assistant Professor at Cornell Tech and in the Operations Research and Information Engineering Department at Cornell University.