RECORD - public
This is the code repository for our work Learning with an Open Horizon in Ever-Changing Dialogue Circumstances.
This work proposes using the lifetime return and meta-learning of hyperparameters to improve continual reinforcement learning training. We optimize DDPT, the state-of-the-art architecture for continual RL of dialogue policies (see https://aclanthology.org/2022.coling-1.21/).
As base algorithms, we use PPO and CLEAR. While PPO is an on-policy algorithm, CLEAR is an off-policy algorithm specifically built for continual reinforcement learning. Moreover, the dialogue policies can be trained with different user simulator setups: a single user simulator (rule-based or transformer-based) or multiple simulators together.
Installation
The code builds upon ConvLab-3. To install ConvLab-3, please follow the instructions in the repository:
https://github.com/ConvLab/ConvLab-3
In addition, to use meta-learning and evaluation, you need to install the higher and rliable libraries using
pip install higher
pip install -U rliable
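If you want to quickly check that both libraries are available in your environment, a minimal import check such as the following sketch suffices:

# Minimal check that the extra dependencies are importable.
import higher    # differentiable inner-loop optimization, used for meta-learning
import rliable   # robust evaluation metrics and plotting utilities

print("higher and rliable imported successfully")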
Training
The code for training models can be found in the folders ppo_DPT and vtrace_DPT. We explain the usage with vtrace_DPT; ppo_DPT works analogously.
Train with rule-based simulator
The rule-based simulator has different configurations that output either few or many actions per turn. We train with both using the following two configurations:
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl.json
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl_shy.json
Train with transformer-based simulator TUS
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl_tus.json
Train with all three simulators
We can leverage all simulators during learning with the following command:
python convlab/policy/vtrace_DPT/train_ocl_meta_users.py --seed=0
We can run the training with different seeds to obtain multiple results. The results are stored in the folder experiments and moved to finished_experiments once training is done.
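To run several seeds in a row, a small wrapper around the commands above can be used. The following is only a sketch: the script and config paths are taken from the commands above, while the file name launch_seeds.py and the seed range are arbitrary choices.

# launch_seeds.py -- sketch: run the training script above for several seeds in sequence
import subprocess
import sys

SCRIPT = "convlab/policy/vtrace_DPT/train_ocl_meta.py"
CONFIG = "convlab/policy/vtrace_DPT/semantic_level_config_ocl.json"

for seed in range(5):  # seeds 0-4; adjust as needed
    cmd = [sys.executable, SCRIPT, f"--seed={seed}", f"--path={CONFIG}"]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)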
Leveraging Lifetime Return and Meta-Learning
We can specify whether to use meta-learning and whether to use the episodic return, the lifetime return, or both in the config file convlab/policy/vtrace_DPT/config.json (see the sketch after the list below):
- lifetime_weight: number between 0 and 1; 0 means no lifetime return, 1 means using lifetime return
- only_lifetime: true or false; true means only lifetime return is used, false means both lifetime return and episodic return are used
- meta: true or false; true means meta-learning is used, false means no meta-learning is used
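As an illustration, the three options can also be set programmatically before starting a run. This is only a sketch: it assumes the options are top-level keys of config.json and leaves all other settings in the file untouched.

# set_ocl_options.py -- sketch: enable the lifetime return and meta-learning in config.json
# Assumption: the three options are top-level keys; adjust the accesses below if they are nested.
import json

CONFIG_PATH = "convlab/policy/vtrace_DPT/config.json"

with open(CONFIG_PATH) as f:
    config = json.load(f)

config["lifetime_weight"] = 1.0   # 0 = no lifetime return, 1 = use the lifetime return
config["only_lifetime"] = False   # False = use both the lifetime and the episodic return
config["meta"] = True             # True = meta-learn the hyperparameters

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=4)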
Specifying the Timeline
We provide the timelines used for the paper in convlab/policy/ocl_utils/timelines. You specify the following:
- timeline: a dictionary, where the keys are given by domains. The values determine after how many dialogues the domain should be introduced
- num_domain_probs: for every integer n, the probability of using n domains in a user goal
- domain_probs: for every domain, the probability of using the domain in a user goal
- new_domain_probs: probability that the newly introduced domain should be part of the user goal
- num_dialogues_stationary: number of dialogues before the user demand changes
- std_deviation: specifies the variation of user demand changes
During training, you specify the timeline to use via timeline_path in the training config, e.g. in semantic_level_config_ocl.json (an example sketch follows).
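For illustration, a timeline with the fields described above could be assembled and written to a file as follows. All domain names and numbers are made up for this example, and the exact file layout expected by the training code may differ from this flat dictionary, so treat it as a sketch rather than a reference format.

# make_timeline.py -- sketch: build a timeline file with the fields described above
# All domain names and values below are illustrative, not the ones used in the paper.
import json

timeline_spec = {
    # domain -> number of dialogues after which the domain is introduced
    "timeline": {"restaurant": 0, "hotel": 2000, "train": 6000},
    # probability that a user goal contains n domains
    "num_domain_probs": {"1": 0.6, "2": 0.3, "3": 0.1},
    # probability of each domain appearing in a user goal
    "domain_probs": {"restaurant": 0.4, "hotel": 0.35, "train": 0.25},
    # probability that the newly introduced domain is part of the user goal
    "new_domain_probs": 0.5,
    # number of dialogues before the user demand changes
    "num_dialogues_stationary": 1000,
    # variation of the user demand changes
    "std_deviation": 0.1,
}

with open("convlab/policy/ocl_utils/timelines/example_timeline.json", "w") as f:
    json.dump(timeline_spec, f, indent=4)

# Point timeline_path in the training config (e.g. semantic_level_config_ocl.json)
# to the written file so that training follows this timeline.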
Evaluation
Let us assume we have run two experiments, one with meta-learning and one baseline, each with 5 different seeds. We create the folders meta and baseline for the two experiments, each containing the corresponding seed folders, and assume both lie in the folder meta-experiments (a helper sketch for creating this layout follows the tree below).
meta-experiments
- meta
  - seed_0
  - seed_1
  - seed_2
  - seed_3
  - seed_4
- baseline
  - seed_0
  - seed_1
  - seed_2
  - seed_3
  - seed_4
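This layout can be created by hand or with a small helper such as the following sketch, which copies finished runs into per-experiment seed folders. The run directories in RESULTS are placeholders that you need to replace with your own run folders.

# organize_results.py -- sketch: arrange finished runs into the evaluation layout above
# The run directories listed in RESULTS are placeholders; replace them with your own.
import shutil
from pathlib import Path

RESULTS = {
    "meta": ["finished_experiments/run_meta_seed0", "finished_experiments/run_meta_seed1"],
    "baseline": ["finished_experiments/run_base_seed0", "finished_experiments/run_base_seed1"],
}

target_root = Path("meta-experiments")
for experiment, runs in RESULTS.items():
    for i, run_dir in enumerate(runs):
        destination = target_root / experiment / f"seed_{i}"
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(run_dir, destination)
        print(f"Copied {run_dir} -> {destination}")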
We can evaluate the experiments using the following command:
python convlab/policy/ocl_utils/plot_ocl.py meta baseline --dir_path meta-experiments
More generally, you pass a list of experiment names and, via --dir_path, the folder they are saved in. The script then creates the different plots from the paper and saves them in that folder (here meta-experiments).