Neural Approaches to Conversational AI Reading Notes


This is my reading notes for Gao et al’s Neural Approaches to Conversational AI. I just marked something I think is important and made the structure more logistic clear. Since I just read the paper once, there must be something that I wrongly understood. Feel free to contact me at haroldliuj@gmail if you find something wrong. Have a nice trip!

Dialogue Hierarchical Structure

  • There is a top-level process selects which sub-process should be called.

Neural NLP Tasks

3 Common Steps

  • Encode symbolic space to neural space
  • Reasoning in neural space
  • Decode neural space to symbolic space


CH2 Reinforcement Learning

Challenges of RL

  • Exploration-exploitation tradeoff
  • Delayed reward and temporal credit assignment
  • Partially observed states


Markov Decision Process

  • Denotes


  • Basic Idea


  • Policy $\pi = P(A=a|s=s_t)$ stands for the policy that the agent can adapt.

  • Once we choose the policy, we can compute the reward:

    However, since $\pi$ is a possibility distribution, we can not use $G_t$ directly as the final score, because is different each time we compute it. So we use expected value as our score. So we will define 2 scores, the value of the state and the value of the state with action

  • The Value of the State:

  • The Value of the State with Action

  • Difference between those 2 eqns above:



Basic Idea

  • Since we use Q to represent how good the policy can be, so we can find the max Q so that we can find the best policy.

  • Q-Learning : Use “Temporal Difference” to update gradient:

Policy Gradient

  • The policy $\pi$ itself is directly parameterized by $\theta \in \R^d$

CH3 QA & Machine Reading

Steps for KB-QA

  • Semantic Parsing
  • Multi-Step Reasoning

Semantic Parsing for KB-QA

1. Embedding-based Methods

  • Same as word vectors, semantic vectors are encoded via knowledge base completion, that is, to recognize whether a triple is in KB

  • We can compute the score for a triple to mark how likely the triple will exist in KB:

    $score(s,r,t;\theta) = x_s^TW_rx_t$

    where $x_r$ is the semantic vector. The model will learn $x$ and $W$ simultaneously

Multi-Step Reasoning on KB

Since not every relation for two entities is represented in KB, we need some way to infer the relations between two entities with the relations we have.

1. Symbolic Methods —- Path Ranking Algorithm(PRA)

  • Random walk in KB and find some possible path

  • When given a query $q=(s,r,?)$, search the path that may suitable for $r$, denoted as $B_r=\{\pi_1, \pi_2…\}$

  • Travel those paths and find the candidate target $t$

  • Compute the score of $t$

    $score(q,t) = \sum_{\pi \in B_r}\lambda_\pi P\{t|s, \pi\}$ where $\lambda_\pi$ is the learned weight

2. Neural Methods —- Implicit ReasoNet

Intuitive Idea: tell algorithm, “Hey, you should pay some attention on here in KB” for many times


  • Encoder: Encode $s, r$ into semantic vector space

  • Shared Memory $M$: Each vector represents a concept, is embedded from KB

  • Controller: A RNN with attention

    • Compute the how much attention the current state $s_t$ should pay on each concept in $M$
    • Compute weighted memory as input of next time step
    • Compute next state $s_{t+1}$

Conversational KB-QA Agent

1. Common Architecture


  • Belief Trackers:
    • Resolving coreferences and ellipsis in user utterances using conversational context
    • Identifying user intents
    • Extracting associated attributes
    • Tracking the dialogue state
  • Interface with KB to query for relevant results
  • Beliefs Summary: Summarize the state into a vector
  • Dialogue Policy: Selects the next action based on the dialogue state.

Machine Reading for Text-QA

1. Common Architecture

  • Lexicon Embedding Layer: Maps each word to a vector space using a pre-trained word embedding

  • Contextual Embedding Layer: Utilizes contextual cues from surrounding words to refine the embedding of the words

    • e.g. bank of a river vs bank of America
  • Attention Layer

    • Compute attention for Passage through Questions
  • Reasoning Layer

    • Single-Step Reasoning

      1. Summarizing question vector:

      2. Predict start position:

      3. Predict end position:

        The idea here is integrating question and the answer start position info to predict the end position.

        $\mathbf{M}$ is shared semantic info

    • Multi-Step Reasoning

      • The idea is the same as the single-step reasoning, while multi-step can compute the start and end position over $T$ memory steps and get t start position and end position possibilities with attention. Finally compute the average value of those position possibilities as the final prediction.

        At Each time step t:

CH4 Task-oriented Dialogue Systems

Task-oriented Dialogue is also called slot-fitting, that is there are some predefined slots(essential info that used to finish the task. e.g. num of tickets, departure date for ticket booking) by the domain expert and the goal of the task is to get the information from users and fill the slots so that the system can finish the task. There are two kinds of slots:

  • Inform able slots: the value for this slot can be used to constrain the conversation, such as phone number;
  • Request able slots: the speaker can ask for its value, such as ticket price.

Common Architecture


  • Natural Language Understanding(NLU): Takes user’s raw utterance as input and converts it to the semantic form of dialogue acts.
  • Dialogue Manager(DM): The central controller of the system, this module will decide whether the system will interact with the user or with the DB
    • Dialogue State Tracking: track the current dialogue state
    • Dialogue Policy
  • Natural Language Generation(NLG): If the policy chooses to respond to the user, this module will convert this action


  • Categories

    • task success rate

    • cost of dialogue


      • simulation based

      • human based

Natural Language Understanding

1. 3 Tasks

  • Domain detection(classification task)
  • Intent determination(classification task)
  • Slot tagging(bidirectional LSTM)

ps: In most cases, they are trained separately while it can be more cost-saving if we train this model jointly

Dialogue State Tracking

  • Dialogue state contains all information about what the user is looking for at the current turn of the conversation.

  • A example Neural Belief Tracker


Dialogue Policy Learning

1. RL Method

Train a Fully Connected Neural Network to find the Q-Function so that we can use the action correspond to the maximum Q-value to make our decision.

  • Input $\mathbf{s}$ contains:
    • one-hot representations of the dialogue act and slot corresponding to the last user action;
    • the same one-hot representations of the dialogue act and slot corresponding to the last system action;
    • a bag of slots corresponding to all previously filled slots in the conversation so far;
    • current turn count
    • the number of results from the knowledge base that match the already filled-in constraints for informed slots
  • Output $\mathbf{q}$ is the predicted Q-Values for corresponded actions

Since training a network from zero can use great amount of data, we can use warm start policy to pretrain the network

Besides, RL agent should explore and extend slots over using, so we should introduce domain extension tech

2. Policy Learning for Composite-task Dialogues

  • Composite-task

    • User will ask the machine to finish a series of tasks that are correlated to each other. That is, there are cross-subtask constraints(e.g. the hotel check in time should be later then the plane arrival time)
  • 2 ways to resolve:

    • Hierarchy Ways: Top level policy will decide with sub-level policy to use


    • Feudal Reinforce Leaning: The policy will first decide whether to gather more info or provide info to user, then choose corresponding actions.

3. Policy Learning for Multi-Domain Dialogues

  • Multi-Domain Dialogues
    • There are no cross-task constraints but it still have to maintain a large dialogue states
  • 2 Ways to resolve:
    • Train many policies, then vote for action
    • Train many policies which are asked to finish different subtasks and train a picker to decide which policy to use.

4. Integration of Planning and Learning

  • Planning: Learn from simulator
  • Learning: Learn from real user
  • Use different weights for simulator and real user

5. Reward Function Learning

  • Hand write
  • Neural network method(may be better than hand write functions)

CH5 Fully Data-Driven Conversation Model and Social Bots

End to End Conversation Models


To exploit longer-term context

  • Two-level hierarchy structure that combine 2 RNNs:

    • One at a word level
    • One at the dialogue turn level


  • Use this extra dialogue turn level to capture context in conversation level rather than just word level

2. Attention Models

  • Less effective in E2E dialogue modeling

3. Pointer-Network Models

The basic idea is to use what has been shown in the context(“Copy and Paste”)

  • produces an output sequence consisting of elements from the input sequence:
    • drawn from a fixed-size vocabulary
    • selected from the source sequence

Challenges and Remedies

1. Response Blandness

  • The agent will tend to response something less informative such as “I don’t know” for all questions
  • Remedies:
    • Maximum Mutual Information
      • Hard to optimize, so just use it at inference time
      • Just can mitigate rather than completely eliminates the blandness probles
    • GAN
    • VHERD
    • Use retrieval-based method
      • Hard to extend

2. Speaker Consistency

  • Let agent reply with a consist personality

  • Methods:

    • Combine word vectors with speaker embeddings


3. Word Repetitions

  • Remedies:
    • Self-attention

Grounded Conversational Models

  • Let chatbot response base on real world information rather than what is in the training data
  • Train an extra Facts Encoder to encode real world info


Reinforcement Learning Can Make Better Results

  • Need to find reward function since normally there is no clear goal for users when they are making chitchats

  • Example of reward function, has 3 components:

    • $-p(\mathbf{Dull\ Response} | T_i)$: to penalize noninformative response
    • $-log\operatorname{Sigmoid cos}(T_{i-1}, T_i)$: to ensure that every response should contain some new information
    • $lopp(T_{i-1}|T_i) + logp(T_i|T_i-1)$: to counterbalance the aforementioned two rewards


  • Gao J, Galley M, Li L. Neural approaches to conversational AI[C]//The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 2018: 1371-1374.