May 8, 2017 - 1:10pm
Rm. G102, Gunn Building, Stanford Graduate School of Business
Estimating metrics of an interactive system from historical data is hard when the metric is computed from user actions (such as clicks and purchases). The key challenge is the counterfactual nature of the problem: in the example of Bing, any change to the search engine may produce different search result pages for the same query, but we normally cannot infer reliably from historical search logs how users would react to the new search results. To compare two systems on a target metric, one typically runs an A/B test on live users, much like a randomized clinical trial. While A/B tests have been very successful, they are unfortunately expensive and time-consuming.
Recently, offline evaluation (a.k.a. off-policy estimation) of interactive systems, without the need for online A/B testing, has seen growing interest in both industry and academia, with successes in important applications. This approach effectively allows one to run (potentially infinitely) many A/B tests *offline* from historical logs, making it possible to estimate and optimize online metrics easily and inexpensively. In this talk, I will formulate the problem in the framework of reinforcement learning (under the contextual bandit and Markov decision process models in particular), describe the basic techniques and their applications, and briefly cover recent theoretical advances.
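To give a flavor of the basic techniques mentioned above, here is a minimal sketch of off-policy estimation in the contextual bandit setting, using the standard inverse propensity scoring (IPS) estimator. The function names and the toy logging scenario below are illustrative assumptions, not material from the talk itself: logged data records the context, the action the deployed (logging) policy took, its probability of taking that action, and the observed reward, and IPS reweights each logged reward by how much more (or less) likely the new policy is to take the same action.

```python
import numpy as np

rng = np.random.default_rng(0)

def ips_estimate(contexts, actions, rewards, propensities, target_policy):
    """Inverse propensity scoring (IPS) estimate of a target policy's
    average reward from logged bandit data.

    contexts:      logged contexts x_i
    actions:       actions a_i chosen by the logging policy
    rewards:       observed rewards r_i
    propensities:  logging probabilities mu(a_i | x_i)
    target_policy: function (x, a) -> probability pi(a | x) under the
                   policy being evaluated
    """
    weights = np.array([target_policy(x, a) / p
                        for x, a, p in zip(contexts, actions, propensities)])
    return float(np.mean(weights * rewards))

# Toy log: binary context, 2 actions, uniform logging policy (propensity 0.5).
n = 10_000
contexts = rng.integers(0, 2, size=n)
actions = rng.integers(0, 2, size=n)
propensities = np.full(n, 0.5)
# Reward is 1 when the action matches the context, else 0.
rewards = (actions == contexts).astype(float)

# Target policy to evaluate: deterministically pick the action equal to
# the context, so its true average reward is 1.0.
def target(x, a):
    return 1.0 if a == x else 0.0

est = ips_estimate(contexts, actions, rewards, propensities, target)
```

Because the importance weights correct for the mismatch between the logging and target policies, the estimate is unbiased and converges to the target policy's true value (1.0 here) as the log grows; its variance, however, can be large when the two policies differ sharply, which motivates the more refined estimators studied in this line of work.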
Joint work with Wei Chu, Miro Dudik, Nan Jiang, John Langford, Rob Schapire and Csaba Szepesvari, among many others.
Institute for Research in the Social Sciences and Graduate School of Business