# How to tailor the DS to YOUR application
(Adapted from Section 2 of the MWT White Paper.)
Our running example is optimization of a news website. While it is based on a successful product deployment, we ignore some real-life complications for the sake of clarity. Thus, let us consider a very simple news website, henceforth called SimpleNews, which displays exactly one headline to each user. When a user arrives, some information is available pertaining to the interaction with this particular user, e.g., age, gender, geolocation, time of day, and possibly much more; such information is summarily called the context. The website chooses a news topic (e.g., politics, sports, tech, etc.), possibly depending on the context, and the top headline for the chosen topic is displayed. The website's goal is to maximize the number of clicks on the headlines that it displays. For this example, we are only concerned with the choice of news topic: that is, we want to pick a news topic whose top headline is most likely to be clicked on for a given context.
The interactions between your application system (APP) and its users should be interpreted as a sequence of small interactions with individual users, possibly overlapping in time. Each small interaction should follow the template: observe a context, make a decision by choosing from the available alternatives, and observe the outcome of this decision. The meaning of contexts, decisions and outcomes should be consistent throughout. An interaction fitting this template is called an experimental unit. APP's objective (as far as the Decision Service is concerned) is to optimize the decision for a given context so as to bring about the most desirable outcome.
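A minimal sketch of one experimental unit in Python is below. The `decision_service` client and its `choose_action` / `report_reward` methods, as well as the `show_top_headline` and `wait_for_click` hooks, are illustrative placeholders for APP-side code, not the actual Decision Service API:

```python
# Sketch of a single experimental unit: observe a context, make a decision,
# observe the outcome. All names below are illustrative assumptions.

def handle_user_visit(decision_service, user):
    # 1. Observe a context: properties of this interaction known to APP.
    context = {
        "gender": user.gender,
        "location": user.location,
        "time_of_day": user.time_of_day,
    }

    # 2. Make a decision: choose among the allowed actions for this context.
    allowed_actions = ["politics", "sports", "tech", "arts"]
    topic, event_id = decision_service.choose_action(context, allowed_actions)
    show_top_headline(topic, user)  # APP controls and executes the decision

    # 3. Observe the outcome of the decision and report the derived reward
    #    (e.g., 1 if the headline was clicked within 20 sec, 0 otherwise).
    clicked = wait_for_click(user, timeout_sec=20)
    decision_service.report_reward(event_id, reward=1.0 if clicked else 0.0)
```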
One needs to decide what contexts, decisions, and rewards mean in your APP:
- The context should encompass the properties of the current user and/or task to be accomplished, and must be known to APP. It can describe complicated objects such as a webpage (e.g., to place an ad on). The context is typically represented as a feature vector.
- The decision must be controlled by APP. Typically the decision is to choose an action among a set of allowed actions. (However, in some applications the decision can have a more complicated semantics, such as choosing a slate of actions.) The set of allowed actions should be known: either fixed in advance or specified by the context. It is often useful to describe actions with features of their own, a.k.a. action-specific features.
- The outcome must be observable by APP not too long after the action is chosen. The outcome (perhaps jointly with the context) should define a reward: the short-term objective to be optimized. The total reward collected over time should provide a good proxy for the "true" long-term objective of APP, such as long-term user satisfaction.
Once the model is chosen, one should be able to fill in the following table:
| | SimpleNews | Your APP |
|---|---|---|
| Context | {gender, location, time-of-day} | |
| Decision | a news topic to display | |
| Allowed actions | {politics, sports, tech, arts} | |
| Action-specific features | none | |
| Outcome | click or no click within 20 sec | |
| Reward | 1 if clicked, 0 otherwise | |
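As an illustration, the SimpleNews column of this table might translate into the following definitions. This is a sketch: the names and the remark about encoding are assumptions for illustration, not part of the Decision Service.

```python
# Sketch: the SimpleNews column of the table as concrete definitions.
# Names and encoding choices are illustrative assumptions.

ALLOWED_ACTIONS = ["politics", "sports", "tech", "arts"]  # no action-specific features

def make_context(gender: str, location: str, time_of_day: str) -> dict[str, str]:
    # The context consists of the three features from the table; in practice each
    # categorical feature would be represented numerically (e.g., one-hot encoded).
    return {"gender": gender, "location": location, "time_of_day": time_of_day}

def reward(clicked_within_20_sec: bool) -> float:
    # Outcome: click or no click within 20 sec; reward: 1 if clicked, 0 otherwise.
    return 1.0 if clicked_within_20_sec else 0.0
```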
There can be multiple possible ways to define these notions for a given APP: to wit, which features to include in the context and (if applicable) in each action's feature vector, and how to represent these features numerically; what exactly is the decision that we want to optimize; how to synthesize the "reward" from the available observations. Such issues arise in many applications of machine learning; they are usually application-specific, and do not have generic, easy-to-describe solutions. (We describe how these issues are resolved in a specific application in Section 6 of the MWT White Paper.)
Consider SimpleNews to illustrate how the semantics of the decision, outcome and reward can be non-trivial. For example, SimpleNews can be modified so that actions correspond to news articles, rather than news topics, and the decision in each experimental unit consists of choosing a slate of news articles. The outcome might distinguish between clicks on different articles in the slate, and can include other information such as dwell time; the reward might depend on all items included in the outcome.
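For this slate variant, the outcome and reward might be structured roughly as follows. The field names and the reward formula are assumptions made for illustration; the real choices are application-specific.

```python
# Sketch of the slate variant of SimpleNews: the decision is a slate of articles,
# the outcome records per-article clicks and dwell times, and the reward aggregates them.
from dataclasses import dataclass, field

@dataclass
class SlateOutcome:
    clicked_positions: list[int] = field(default_factory=list)      # which slate slots were clicked
    dwell_time_sec: dict[int, float] = field(default_factory=dict)  # dwell time per clicked slot

def slate_reward(outcome: SlateOutcome) -> float:
    # One possible choice: count clicks, with a small bonus for longer dwell times.
    click_reward = float(len(outcome.clicked_positions))
    dwell_bonus = 0.01 * sum(outcome.dwell_time_sec.values())
    return click_reward + dwell_bonus
```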
To estimate the scale and feasibility of the learning problem, one needs to estimate a few parameters: the number of features (#features) and the number of allowed actions (#actions) in a typical experimental unit, the typical delay between making a decision and observing the corresponding outcome, and the data rate: the number of experimental units per second (say). If the outcome may include a rare event whose frequency is crucial to estimate (e.g., clicks are typically rare compared to non-clicks), then we also need a rough estimate of this frequency. Finally, we need the stationarity interval: the time interval during which a typical experimental unit does not change too much. Here we care about several properties: which contexts are likely to arrive, what the allowed actions are, and what the expected rewards are. More specifically, two things should not change significantly during the interval: the distribution of arriving contexts (where a context includes the set of allowed actions and, if applicable, their features), and the expected reward for each context-action pair. In the SimpleNews example, this corresponds to the distribution of arriving user profiles and the click probability for the top headline (for a given news topic, when presented to a typical user with a given user profile).
To summarize, one should be able to fill in the last column in the following table:
| | SimpleNews | Your APP |
|---|---|---|
| #features | 3 | |
| #actions | 4 news topics | |
| Typical delay | 5 sec | |
| Data rate | 100 users/sec | |
| Rare event frequency | typical click prob. 2-5% | |
| Stationarity interval | one week | |
A good rule of thumb is that for a successful application of the Decision Service, the stationarity interval should be much larger than the typical delay, and moreover we should have
StatInterval x DataRate x RareEventsFreq >> #actions x #features
The left-hand side is (roughly) the number of rare events observed during one stationarity interval, and the right-hand side characterizes the complexity of the learning problem. This rule is for the default linear representation of policies; other representations may require more or fewer events while yielding more or less benefit.
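For example, plugging in the SimpleNews numbers from the table above gives a quick back-of-the-envelope check (using one week = 604,800 seconds and the low end of the 2-5% click-probability range):

```python
# Back-of-the-envelope feasibility check for SimpleNews, using the numbers
# from the table above.
stat_interval_sec = 7 * 24 * 3600   # stationarity interval: one week = 604,800 sec
data_rate = 100                     # experimental units (users) per second
rare_event_freq = 0.02              # typical click probability (low end of 2-5%)
num_actions = 4                     # news topics
num_features = 3                    # gender, location, time-of-day

lhs = stat_interval_sec * data_rate * rare_event_freq   # ~1.2 million clicks per interval
rhs = num_actions * num_features                        # problem complexity: 12

print(f"LHS = {lhs:,.0f}, RHS = {rhs}")  # 1,209,600 >> 12: the rule of thumb is comfortably satisfied
```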