Introduction

Within the field of Natural Language Processing, Natural Language Generation (NLG) continues to be an active area of research. NLG, as defined by Artificial Intelligence: Natural Language Processing Fundamentals, is the “process of producing meaningful phrases and sentences in the form of natural language.” Essentially, NLG takes structured data as input and generates (short or long-form) narratives that describe, summarize, or explain the input in a human-like manner. An example of NLG is automatically generating a text description of records in a database, as shown below.
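As a concrete (and purely hypothetical) illustration, the input records and the generated description might look like this; the field names and text below are illustrative only:

```python
# Hypothetical records and output; field names and wording are illustrative only.
records = {
    "name": "The Golden Palace",
    "eatType": "coffee shop",
    "area": "riverside",
    "priceRange": "cheap",
}

generated = "The Golden Palace is a cheap coffee shop located in the riverside area."
```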

Empirically, neural encoder-decoder models have been successful at text generation, specifically with regard to improving the fluency of generation. Traditionally, these models use an encoder network to transform a source knowledge base into a continuous, latent space interpretable by a machine (but not by humans) and subsequently use a decoder network to emit tokens word-by-word, conditioned on the source encoding and the previously emitted tokens.
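To make the setup concrete, here is a minimal sketch of such an encoder-decoder model in PyTorch; the module names and sizes are illustrative assumptions, not the architecture of any particular published system:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a linearized source knowledge base into a continuous latent encoding."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                 # src: (batch, src_len) token ids
        _, h = self.rnn(self.embed(src))    # h: (1, batch, hidden), the latent encoding
        return h

class Decoder(nn.Module):
    """Emits tokens word-by-word, conditioned on the encoding and the emitted prefix."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, prefix, h):           # prefix: (batch, tgt_len) token ids
        out, _ = self.rnn(self.embed(prefix), h)
        return self.proj(out)               # next-token logits at every position
```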

However, encoder-decoder models have two major problems; they are:
(1) uninterpretable
(2) difficult to control in terms of their phrasing or content

Because of their end-to-end training, black-box nature, and continuous, latent representations, it can be extremely difficult to identify why an encoder-decoder model makes a mistake. This is evident with universal adversarial triggers, “input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input” [3].

Figure 3. The proposed model learns a discrete, latent neural template that is used to generate a human-like text description of the records in the knowledge base. Each cell represents a segment in the learned segmentation. During generation, the slots (represented by “blanks” above) are filled.

The neural hidden semi-Markov model (HSMM) allows for:
1) the explicit representation of what the system wants to say (in the form of the learned template)
2) how it is attempting to say it (in the form of an instantiated template)

The NLG model can be trained in an entirely data-driven way, using back-propagation through inference. The authors have found that this approach of learning neural templates scales well to large datasets and performs almost as well as encoder-decoder NLG models.

Goal: generate a human-like text description of a knowledge base, as in Figure 3 above.

Let us define some notation first:

x: the source knowledge base, a collection of records (e.g., fields and their corresponding values) that we want to describe.

From x, we would like to produce a fluent text description ŷ of all the records.

The fluent text description ŷ is a sequence of tokens.

To date, most approaches have used an encoder network to transform x into a continuous, latent space and then a conditional decoder network to generate ŷ, with the encoder-decoder model trained in an end-to-end manner. To generate a description of a knowledge base, the decoder network produces a distribution over the next word, a new word is selected, and the next output is conditioned on all the previous words, including this new word. But it’s hard to determine which features of x are correlated with the output of the model, which makes it difficult to:
(1) interpret why the model produced a particular output
(2) control the content or phrasing of that output

It’s important to be able to control the content of the model’s output! Suppose we want to generate a description of a knowledge base of customer reviews of a product, but we would like to exclude customer names. While it is possible to filter out the customer’s name with the encoder-decoder model, this could potentially lead to changes we don’t expect in the model’s output, due to the model’s black-box nature, which could, in turn, compromise the overall quality of the model’s output.

Hence, we need the proposed neural HSMM NLG system whose intent is always explicit and which produces controllable and interpretable output in a principled manner.

An HSMM is capable of modeling latent segmentations in an output sequence. Compared to Hidden Markov Models (HMMs), an HSMM can emit multiple tokens per hidden state (a multi-step segment), and these multi-step emissions do NOT have to be conditionally independent of each other given the hidden state.

Let us define some more notation:

y = y_1, …, y_T: the sequence of observed tokens, and z_t ∈ {1, …, K}: the discrete latent state at timestep t.

We also have two per-timestep variables to model multi-step segments:

l_t: the first of the two per-timestep variables. It represents the length of the current segment.

f_t: the second per-timestep variable, a binary indicator of whether the current segment ends at timestep t.

An HSMM essentially specifies a joint distribution over the observations and latent segmentations. If we let θ be our model’s parameters, then we get this expression for the joint-likelihood:
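Roughly (the exact indexing may differ slightly from the paper's formulation):

$$
p(y, z, l, f \mid x; \theta) \;=\; \prod_{t < T:\, f_t = 1} p\big(z_{t+1}, l_{t+1} \mid z_t\big) \;\times\; \prod_{t:\, f_t = 1} p\big(y_{t-l_t+1\,:\,t} \mid z_t, l_t, x\big),
$$

where f_t = 1 marks the timesteps at which a segment ends.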

Woah, that’s a long equation; don’t get too bogged down in the notation though! Incorporating some assumptions about how the probabilities in the first product term factor, all we’re saying is that the joint-likelihood is the product of:

1. the probability of each discrete state given the previous state (a.k.a. the transition distribution)

2. the probability of each segment’s length given its discrete state (a.k.a. the length distribution)

3. the probability of the observations in each segment given its discrete state (a.k.a. the emission distribution)

Figure 4. Example HSMM factor graph. The great thing about this model is that we get to incorporate LSTMs and attention, which have an excellent track record of effective neural text generation, while keeping the HSMM structure.

The transition distribution is a K-by-K matrix of probabilities, where each row sums to 1.

We simply let all length probabilities be uniform, up to a maximum length L.
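A minimal sketch of these two distributions in PyTorch; the sizes and the softmax parameterization of the transition matrix are illustrative assumptions:

```python
import math
import torch

K, L = 10, 4   # number of latent states and maximum segment length (illustrative values)

# Transition distribution: a K-by-K matrix whose rows are normalized with a softmax,
# so each row is a probability distribution over the next latent state.
trans_scores = torch.nn.Parameter(torch.randn(K, K))
trans_logprob = torch.log_softmax(trans_scores, dim=-1)

# Length distribution: uniform over lengths 1..L for every state.
len_logprob = torch.full((L, K), -math.log(L))
```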

The emission model generates a text segment conditioned on a latent state and source information. It is based on an RNN decoder.
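In symbols, for a segment of l tokens generated under latent state z = k (a paraphrase of the emission factorization):

$$
p\big(y_{t+1:t+l} \mid z = k,\, l,\, x\big) \;=\; \prod_{i=1}^{l} p\big(y_{t+i} \mid y_{t+1:t+i-1},\, z = k,\, x\big)
$$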

Again, don’t get too bogged down in the notation! All this is saying is that a segment’s probability is a product over token-level probabilities.

We then apply a single softmax over both the attention scores for the output vocabulary and the copy scores for the source records.

Finally, we write:
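Approximately (the paper's exact scoring functions may differ):

$$
p\big(y_i = w \mid y_{<i},\, z = k,\, x\big) \;=\; \tilde{\alpha}_w \;+\; \sum_{r\,:\,\mathrm{val}(r) = w} \tilde{\beta}_r,
$$

where the first term is the normalized score of w in the output vocabulary and the sum runs over the normalized copy scores of the records whose value equals w; both sets of scores are normalized together by a single softmax.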

Woah, another big equation; again, don’t get too bogged down in the notation! Essentially, all we’re saying is that the probability of the next observed word being w (given all the previously observed words in the segment, the current state, and the entire knowledge base) is equal to the normalized attention score corresponding to w (after the RNN has run over i-1 words) plus the sum of all the normalized copy scores for all the records whose value is equal to w.

We can also formulate an autoregressive variant of the statement above. Currently, the model assumes that segments are conditionally independent given the latent state and x. To allow for interdependence between tokens (but still not between segments) in a computationally tractable way, we can have each next-token distribution depend on ALL the previously generated tokens (instead of just the previous tokens in the current segment), by using an additional RNN that runs over all the preceding tokens.

The HSMM model requires optimizing a large number of neural network parameters. Since z, l, and f are unobserved, we marginalize over these variables and maximize the log marginal-likelihood of the observed tokens y given x. Calculating the marginal-likelihood can be accomplished using a dynamic program; thus, all the parameters can be optimized by back-propagating through the dynamic program.
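A minimal, illustrative sketch of that dynamic program: an HSMM forward recursion in log space, written here with assumed tensor shapes rather than the authors' actual implementation. Every step is differentiable, so back-propagating through it trains all the parameters.

```python
import torch

def hsmm_log_marginal(emit_lp, trans_lp, len_lp, init_lp):
    """Log marginal-likelihood of y given x under an HSMM, via the forward recursion.

    Assumed (illustrative) tensor layout:
      emit_lp[t, l, k] : log prob of the segment covering positions t-l..t under state k  (T, L, K)
      trans_lp[j, k]   : log p(next state k | previous state j)                           (K, K)
      len_lp[l, k]     : log p(segment length l+1 | state k)                              (L, K)
      init_lp[k]       : log p(first state k)                                             (K,)
    """
    T, L, K = emit_lp.shape
    alphas = []  # alphas[t][k] = log p(y_0..y_t, a segment ends at position t in state k)
    for t in range(T):
        scores = []
        for l in range(min(L, t + 1)):           # a segment of length l+1 ending at t
            start = t - l
            if start == 0:                       # first segment of the sequence
                prev = init_lp
            else:                                # marginalize over the previous state
                prev = torch.logsumexp(alphas[start - 1].unsqueeze(1) + trans_lp, dim=0)
            scores.append(prev + len_lp[l] + emit_lp[t, l])
        alphas.append(torch.logsumexp(torch.stack(scores), dim=0))
    return torch.logsumexp(alphas[-1], dim=0)    # sum out the state of the final segment
```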

Given an example x from a database and a “ground-truth” text description y, we can determine the assignments of z, l, and f using maximum a posteriori (MAP) estimation with a dynamic program similar to the Viterbi algorithm. Importantly, the MAP segmentations allow for the association of text segments with the latent, discrete states z that regularly generate them.

Figure 5. An example of the Viterbi algorithm used to segment y. The subscripts represent the corresponding latent state. We see that there are 17 segments.
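A sketch of the corresponding MAP computation, under the same assumed tensor layout as hsmm_log_marginal above: the recursion mirrors the forward pass, with max in place of logsumexp plus backpointers for recovering the segmentation.

```python
def hsmm_map_segmentation(emit_lp, trans_lp, len_lp, init_lp):
    """Viterbi-style MAP segmentation: returns a list of (state, segment_length) pairs."""
    T, L, K = emit_lp.shape
    delta = emit_lp.new_full((T, K), float("-inf"))
    back = {}   # (t, k) -> (prev_t, prev_k, segment_length)
    for t in range(T):
        for k in range(K):
            for l in range(min(L, t + 1)):
                start = t - l
                if start == 0:   # first segment: no predecessor
                    cands = [(init_lp[k] + len_lp[l, k] + emit_lp[t, l, k], -1, -1)]
                else:            # maximize over the previous state
                    cands = [(delta[start - 1, j] + trans_lp[j, k]
                              + len_lp[l, k] + emit_lp[t, l, k], start - 1, j)
                             for j in range(K)]
                score, prev_t, prev_k = max(cands, key=lambda c: c[0])
                if score > delta[t, k]:
                    delta[t, k] = score
                    back[(t, k)] = (prev_t, prev_k, l + 1)
    # Trace back from the best final state to recover the segmentation.
    t, k = T - 1, int(delta[T - 1].argmax())
    segments = []
    while t >= 0:
        prev_t, prev_k, length = back[(t, k)]
        segments.append((k, length))
        t, k = prev_t, prev_k
    return list(reversed(segments))
```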

We can now extract templates, that is, collect the most common sequences of hidden states seen in the data! Each template z^(i) is a sequence of discrete, latent states. We can restrict the HSMM model to use one template during generation.
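A sketch of the extraction step, assuming the MAP segmentation of every training example has already been computed (for instance with hsmm_map_segmentation above); each example's sequence of segment states becomes a candidate template:

```python
from collections import Counter

def extract_templates(map_segmentations, top_n=100):
    """map_segmentations: one [(state, length), ...] list per training example.
    Returns the top_n most common templates, each a tuple of latent states."""
    counts = Counter(tuple(state for state, _ in segs) for segs in map_segmentations)
    return [template for template, _ in counts.most_common(top_n)]
```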

To generate text, we do:
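Roughly, for each extracted template z^(i):

$$
\hat{y}^{(i)} \;=\; \arg\max_{y'} \; p\big(y' \mid z^{(i)},\, x;\, \theta\big)
$$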

All this is saying is that, for each extracted template z^(i), we should find the generated text description with the maximum likelihood given a knowledge base x. Because exhaustively computing the arg max is intractable due to the use of RNNs, a constrained beam search (over a segment, for a specific latent state) is used.

Interpretability –– Every segment generated is typed by a corresponding latent, discrete state.

Controllability –– Generation is forced to use a template, which is learned from the data.

Possible Issue –– The assumption that segments are independent of each other given their latent state and x might be a problem. But the authors argue that a good encoder should be able to capture the interdependence between segments anyway.

The authors applied the HSMM model to two data-driven NLG tasks: E2E and WikiBio.

Datasets –– E2E and WikiBio.

Comparisons –– For both datasets, published encoder-decoder models and direct template-style baselines are compared. The E2E task is evaluated using BLEU, NIST, ROUGE, CIDEr, and METEOR, which are all well-known text generation quality metrics. The WikiBio task is evaluated in terms of BLEU, NIST, and ROUGE.

Model and Training Details –– More plausible segmentations of the observed tokens y were learned by constraining the HSMM model to not split up phrases that appear in any record.

While a larger K (recall: the number of discrete, latent states) makes for a more expressive model, computing K separate emission distributions can be too computationally expensive. A single 1-layer LSTM was therefore used to define the emission distributions, tying emission parameters across states while still respecting the autonomy of states to have their own transition distributions.

Additionally, to reduce memory usage, the probability distribution over the output vocabulary did not include words found in record values.

For all experiments, the 100 most common templates were selected, and beam search was performed with a beam size of 5.

Figure 6. The neural HSMM models are competitive with the encoder-decoder system on the validation data AND offer the benefits of interpretability and controllability; however, the gap increases on test data.

Controllable Diversity –– The template can be manipulated while leaving the database example constant.

Interpretable States –– Moreover, because our latent states are discrete, rather than continuous, it is simple to guess to which states certain fields correspond. To show that learned states align with field types, the average purity of the discrete states learned for each dataset was calculated. For each discrete state for which the majority of its generated words appear in some record, the state’s purity is the percentage of the state’s words that come from the most frequent record type the state represents. The average purities of the states for both datasets with the HSMM model were relatively high. The autoregressive variant of the model did not score as high due to its reliance on the autoregressive RNN, rather than the latent state, for segment typing.
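A sketch of how such a purity score could be computed; the data layout here is an assumption for illustration, not the authors' evaluation code:

```python
from collections import Counter

def state_purity(state_words):
    """state_words: list of (word, record_type) pairs emitted by one latent state,
    where record_type names the record the word appears in (None if it appears in none).
    Returns the fraction of the state's words that come from its most frequent record
    type, or None when the majority of the state's words appear in no record."""
    in_record = [rtype for _, rtype in state_words if rtype is not None]
    if 2 * len(in_record) <= len(state_words):
        return None
    _, count = Counter(in_record).most_common(1)[0]
    return count / len(state_words)
```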

What was Proposed: A neural, template-like generation model based on an HSMM decoder, which can:
1. be learned tractably using back-propagation through a dynamic program
2. allow for the extraction of template-like latent objects in a principled manner for the purpose of structured text generation

Benefits: The model scales to large datasets and can compete with state-of-the-art neural encoder-decoder models. It also allows for controlling the diversity of generation and producing interpretable states during generation. In short, the HSMM model combines competitive generation quality with controllability and interpretability.

Future Work: Next steps involve learning templates that are:
1. minimal (maximally different)
2. representative of paragraphs, documents, etc.
3. hierarchical
