Homework 3: Decoding

Decoding is process of taking input in one language (e.g. French):

honorables sénateurs , que se est - il passé ici , mardi dernier ?

…And finding its best English translation under your model:

honourable senators , what happened here last Tuesday ?

To decode, we need a model of English sentences conditioned on the French sentence. You did most of the work of creating such a model in word alignment. In this assignment, we will give you some French sentences and a probabilistic model consisting of a phrase-based translation model $p_{\textrm{TM}}(\textbf{f},\textbf{a} \mid \textbf{e})$ and an n-gram language model $p_{\textrm{LM}}(\textbf{e})$. Your challenge is to find the most probable English translation under the model. We assume a noisy channel decomposition.

$$\begin{align*} \textbf{e}^* & = \arg \max_{\textbf{e}} p(\textbf{e} \mid \textbf{f}) \\ & = \arg \max_{\textbf{e}} \frac{p_{\textrm{TM}}(\textbf{f} \mid \textbf{e}) \times p_{\textrm{LM}}(\textbf{e})}{p(\textbf{f})} \\ &= \arg \max_{\textbf{e}} p_{\textrm{TM}}(\textbf{f} \mid \textbf{e}) \times p_{\textrm{LM}}(\textbf{e}) \\ &= \arg \max_{\textbf{e}} \sum_{\textbf{a}} p_{\textrm{TM}}(\textbf{f},\textbf{a} \mid \textbf{e}) \times p_{\textrm{LM}}(\textbf{e}) \end{align*}$$

Getting Started

Clone your HW3 from github classroom.

Under the decode directory, you now have simple decoder. Test it out!

python decode > output

This creates the file output with translations of data/input. You can compute $p(\textbf{e} \mid \textbf{f})$ using compute-model-score.

python compute-model-score < output

This command sums over all possible ways that the model could have generated the English from the French, including translations that permute the phrases. This sum is intractable, but the phrase dictionary is fixed and sparse, so we can compute it in a few minutes. It is still easier to do this than it is to find the optimal translation. But if you look at this command you may get some hints about how to do the assignment!

The decoder generates the most probable translations that it can find, using three common approximations.

First, it seeks the Viterbi approximation to the most probable translation. Instead of computing the intractable sum over all alignments for each sentence, we simply find the best single alignment and use its translation.

$$\begin{align*} \textbf{e}^* &= \arg \max_{\textbf{e}} \sum_{\textbf{a}} p_{\textrm{TM}}(\textbf{f},\textbf{a} \mid \textbf{e}) \times p_{\textrm{LM}}(\textbf{e}) \\ &\approx \arg \max_{\textbf{e}} \max_{\textbf{a}} p_{\textrm{TM}}(\textbf{f},\textbf{a} \mid \textbf{e}) \times p_{\textrm{LM}}(\textbf{e}) \end{align*}$$

Second, it translates French phrases into English without changing their order. So, it only reorders words if the reordering has been memorized as a phrase pair. For example, in the first sentence, we see that mardi dernier is correctly translated as last Tuesday. If we consult data/tm, we will find that the model has memorized the phrase pair mardi dernier ||| last Tuesday. But in the second sentence, we see that Comité de sélection is translated as committee selection, rather than the more correct selection committee. To show that this is a search problem rather than a modeling problem, we can generate the latter output by hand and check that the model really prefers it.

head -2 data/input | tail -1 > example
python decode -i example | python compute-model-score -i example
echo a selection committee was achievement . | python compute-model-score -i example

The scores are reported as log-probabilities, and higher scores (with lower absolute value) are better. We see that the model prefers selection committee, but the decoder does not consider this word order.

Finally, our decoder uses strict pruning. As it consumes the input sentence from left to right, it keeps only the highest-scoring output up to that point. You can vary the number of number of outputs kept at each point in the translation using the -s parameter. See how this affects the resulting model score.

python decode | python compute-model-score
python decode -s 10000 | python compute-model-score

The Challenge

Your task is to find the most probable English translation. Our model assumes that any segmentation of the French sentence into phrases followed by a one-for-one substitution and permutation of those phrases is a valid translation. We make the simplifying assumption that segmentation and ordering probabilities are uniform across all sentences, hence constant. This means that $p(\textbf{e},\textbf{a} \mid \textbf{f})$ is proportional to the product of the n-gram probabilities in $p_{\textrm{LM}}(\textbf{e})$ and the phrase translation probabilities in $p_{\textrm{TM}}(\textbf{f},\textbf{a} \mid \textbf{e})$. To avoid numerical underflow we work in logspace, seeking $\arg \max_{\textbf{e}} \max_{\textbf{a}} \log p_{\textrm{TM}}(\textbf{f},\textbf{a} \mid \textbf{e}) + \log p_{\textrm{LM}}(\textbf{e})$. The baseline decoder works with log probabilities, so you can simply follow what it does.

To pass, you must implement a beam-search decoder like the one we have given you that is also capable of swapping adjacent phrases. To get full credit, you must additionally experiment with another decoding algorithm. Any permutation of phrases is a valid translation, so we strongly suggest searching over all or some part of this larger space. This search is NP-Hard, so it will not be easy. You can trade efficiency for search effectiveness by implementing histogram pruning or threshold pruning, or by using reordering limits as described in the textbook (Chapter 6). Or, you might consider implementing other approaches to solving this combinatorial optimization problem:

These methods all attempt to approximate or solve the Viterbi approximation to decoding. You can also try to approximate $p(\textbf{e} \mid \textbf{f})$ directly.

Credits: This assignment was developed by Adam Lopez, Matt Post, and Chris Callison-Burch. Chris Dyer made many improvements to this assignment.