\chapter{Domain Adaptation for Hierarchical Phrase-Based Translation Systems}
\chaptermark{Hiero Domain Adaptation}
\label{chap:wmt}

% TODOFINAL grep for a + space + vowel and replace by an (also reverse)
% TODOFINAL grep for remind and replace by recall
% TODOFINAL search replace mix-cased/mix cased/mix case by case-sensitive
% TODOFINAL grep for best
% TODOFINAL check tabular or table without center
% TODOFINAL search for figure and remove colon. similarly search for equation and put colon instead of period.
% TODOFINAL search replace lowercase/lower-case by case-insensitive
% TODOFINAL include shallow-2 in background
% TODONEVER harmonize table layout (e.g. bold header, horizontal and vertical lines)
% TODOFINAL check bib through online tool
% TODONEVER add row numbers in exp tables
% TODOFINAL probably need to mention lattice mert in background

In \autoref{chap:hfile}, we have demonstrated improvements in scalability
for generation of and retrieval from hierarchical phrase-based grammars.
In \autoref{chap:extractionFromPosteriors}, we have introduced
an original model for hierarchical phrase-based grammar extraction
and estimation. In this chapter, we will investigate further
refinements for hierarchical phrase-based grammar modelling in the context
of domain adaptation. We will also investigate possible improvements
for language modelling both in the context of domain adaptation and
in terms of training data size.

\section{Introduction}

As described in \autoref{sec:theSMTpipeline}, state-of-the-art SMT
systems consist of a pipeline of various modules. These modules
interact in complex ways, which makes experimentation challenging, but
also provides an opportunity to make improvements in separate modules with
the hope that the end-to-end system will be improved overall.
In this chapter, we describe investigations into language model
and grammar refinement techniques for SMT. We compare various strategies for grammar
and language modelling in the context of domain adaptation. We also explore
the interaction between a first-pass language model used in decoding and a second-pass
language model used in rescoring. Finally, we compare two smoothing
strategies for large language model rescoring.
The techniques presented in this chapter may be used
to build high-quality systems for a
translation evaluation or to customise a translation system
for a client in an industry setting.
\autoref{fig:modulesToBeImproved} indicates which modules we attempt
to improve.
%
\begin{figure}
  \tikzstyle{TranslationModule} = [rectangle, draw, rounded corners,
    align = center, text width = 4cm]
  \tikzstyle{line} = [draw, very thick, color=black!50, -latex']
  \begin{center}
    \begin{tikzpicture}[node distance = 1.5cm]
      % Place nodes
      \node [TranslationModule] (parallelData) {Parallel Text};
      \node [TranslationModule, below of=parallelData] (preprocessing) {Preprocessing};
      \node [TranslationModule, below of=preprocessing] (wordalignment) {Word Alignment};
      \node [TranslationModule, below of=wordalignment, very thick, blue] (rulextraction) {\textbf{Grammar Extraction}};
      \node [TranslationModule, right = 3cm of parallelData] (monolingualData) {Monolingual Text};
      \node [TranslationModule, below = 1.57cm of monolingualData] (preprocessing2) {Preprocessing};
      \node [TranslationModule, below = 1.3cm of preprocessing2, very thick, blue] (languageModel) {\textbf{Language Modelling}};
      \node (emptyMiddleGrammarLm) at ($(rulextraction)!0.5!(languageModel)$) {};
      \node [TranslationModule, below of = emptyMiddleGrammarLm] (decoding) {Decoding};
      \node [TranslationModule, below of=decoding] (tuning) {Tuning};
      \node [TranslationModule, below of=tuning, very thick, blue, text width = 4.5cm] (rescoring) {\textbf{Language Modelling \\ for Rescoring}};
      % Draw edges
      \path [line] (parallelData) -- (preprocessing);
      \path [line] (preprocessing) -- (wordalignment);
      \path [line] (wordalignment) -- (rulextraction);
      \path [line] (monolingualData) -- (preprocessing2);
      \path [line] (preprocessing2) -- (languageModel);
      \path [line] (rulextraction) -- (decoding);
      \path [line] (languageModel) -- (decoding);
      \path [line] (decoding) edge [bend right] (tuning);
      \path [line] (tuning) edge [bend right] (decoding);
      \path [line] (tuning) -- (rescoring);
      \draw [line, latex'-latex', color = blue] (languageModel) to [bend left = 45] node [right, align = left] {First pass LM / \\ rescoring LM \\ interaction} (rescoring);
      % Circle important modules
      %\node [ellipse, draw=red, fit= (wordalignment)] {};
      %\node [ellipse, draw=red, fit= (rulextraction)] {};
    \end{tikzpicture}
    \caption{Machine Translation System Development Pipeline. Modules that we attempt to refine are in bold blue.
    Grammar extraction and first-pass language modelling are refined in the context of domain adaptation.
    We also investigate smoothing for language modelling in rescoring. Finally, we study the interaction
    between the first-pass language model and the rescoring language model.
    }
    \label{fig:modulesToBeImproved}
  \end{center}
\end{figure}

Experiments are based on a system submitted to the WMT13
Russian-English translation shared
task~\citep{pino-waite-xiao-degispert-flego-byrne:2013:WMT}.
Because we developed a very competitive system for this evaluation,
refinements on that system give a good indication of which techniques
are worth applying to SMT systems.
We briefly summarise the system building in
\autoref{sec:wmt13ExperimentalSetup}.
In \autoref{sec:domainAdaptationMT}, we review domain adaptation
techniques for machine translation. In \autoref{sec:domainAdaptationLM},
we show how to adapt a language model to obtain better performance
on a specific domain such as newswire text.
In \autoref{sec:domainAdaptationGrammar}, we show how additional
grammar rule features related to specific domains may help in translation.
In \autoref{sec:bestPossibleRescoring}, we investigate
various strategies for combining first-pass translation
and language model rescoring in terms of language model
training data size and $n$-gram language model order.
Finally, in \autoref{sec:sbVSkn}, we compare two smoothing
strategies for large language model rescoring.

\section{Description of the System to be Developed}
\label{sec:wmt13ExperimentalSetup}

The experiments reported in this chapter are based
on the system submitted to the WMT13 Russian-English translation shared
task~\citep{pino-waite-xiao-degispert-flego-byrne:2013:WMT}.
Nineteen systems were submitted to this particular task. The
CUED system obtained the best case-sensitive BLEU
score and the second best human judgement score
amongst constrained-track
systems~\citep{bojar-buck-callisonburch-federmann-haddow-koehn-monz-post-soricut-specia:2013:WMT}.\footnote{http://matrix.statmt.org/matrix/systems\_list/1738}
In this section, we summarise the system building.

We use all the Russian-English parallel data available in the constrained track.
We filter out non Russian-English sentence pairs with the
\emph{language-detection} library.\footnote{http://code.google.com/p/language-detection/}
A sentence pair is filtered out if the language detector detects a different language with probability
more than 0.999995 in either the source or the target.
This discards 78,543 sentence pairs from an initial 2,543,462.
For example, the sentence in \autoref{fig:englishActuallyFrench} was detected as French with
a probability of 0.999997.
%
\begin{figure}
\begin{quote}
Sur base des précédents avis, il semble que l'hôtel a fait un effort.
\end{quote}
\caption{Example of target side sentence of a Russian-English sentence pair. The sentence was detected as actually being French with 0.999998 probability.}
\label{fig:englishActuallyFrench}
\end{figure}
%
In addition, sentence pairs where the source sentence has no Russian character, defined by the
Perl regular expression [{\textbackslash}x\{0400\}-{\textbackslash}x\{04ff\}], are discarded.
This regular expression corresponds to the Unicode block for Cyrillic
characters (in hexadecimal notation, code 0400 to code 04ff).
This further discards 19,000 sentence pairs. We take the view
that discarding a small portion of training data in order to
obtain cleaner data is beneficial in translation, when such cues are easily and
reliably available.
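As a consistency check on the two filtering steps, the language-detection
rule can be written as a threshold test. The notation is introduced here for
illustration only: $P_{\text{det}}(\ell \mid s)$ denotes the probability that the
detector assigns to language $\ell$ for a sentence $s$, and $\tau = 0.999995$ is
the threshold quoted above. A sentence pair $(\bm{f}, \bm{e})$ is discarded if
%
\[
  \max_{\ell \neq \text{Russian}} P_{\text{det}}(\ell \mid \bm{f}) > \tau
  \quad \text{or} \quad
  \max_{\ell \neq \text{English}} P_{\text{det}}(\ell \mid \bm{e}) > \tau .
\]
%
Taking the rounded figure of 19,000 pairs for the Cyrillic filter at face value,
the number of sentence pairs that remain after both steps is
%
\[
  2{,}543{,}462 - 78{,}543 - 19{,}000 = 2{,}445{,}919 ,
\]
%
so the two filters together remove approximately $3.8\%$ of the initial parallel text.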

The Russian side of the parallel corpus is tokenised with the
Stanford CoreNLP toolkit.\footnote{http://nlp.stanford.edu/software/corenlp.shtml}
The English side of the parallel corpus is tokenised with a standard English tokeniser,
which splits sentences into tokens using white space and punctuation, and retains
abbreviations as single tokens.
Both sides of the parallel corpus are then lowercased. % TODOFINAL say somewhere that BLEU scores are lowercase
Corpus statistics after filtering and tokenisation are summarised in
\autoref{tab:parallelStatsWMT13}.
%
\begin{table}[htbp]
\begin{center}
\begin{tabular}{*{3}{|r}|}
\hline
Language & \# Tokens & \# Types \\
\hline
\hline
%Russian & CoreNLP & 2445919 & 47426938 & 1191325 \\
RU & 47.4M & 1.2M \\
\hline
%English & Aachen & 2445919 & 50419263 & 711001 \\
EN & 50.4M & 0.7M \\
\hline
\end{tabular}
\end{center}
\caption{Russian-English parallel corpus statistics after filtering
  and tokenisation.
  Parallel text contains approximately 50M tokens on each side. This translation
  task can be characterised as a \emph{medium size} translation task.}
\label{tab:parallelStatsWMT13}
\end{table}
%

Parallel data is aligned using the MTTK
toolkit~\citep{deng-and-byrne:2008:ASLP}:
we train a word-to-phrase HMM model with a maximum phrase length of 4 in both
source-to-target and target-to-source
directions (see \autoref{sec:statisticalMachineTranslationHmmAlignmentModel}).
The final alignments are obtained
by taking the union of alignments obtained in both directions.
A hierarchical phrase-based grammar
is extracted from the alignments as described
in \autoref{sec:rulextractMapReduce}. The constraints for extraction
described in \autoref{sec:hfileForHiero} are set as follows:
%
\begin{itemize}
  \item $s_{\text{max}} = 9$ (maximum number of source terminals for phrase-based rules)
  \item $s_{\text{max elements}} = 5$ (maximum number of source terminals and nonterminals)
  \item $s_{\text{max terminals}} = 5$ (maximum number of source consecutive terminals)
  \item $s_{\text{max NT}} = 10$ (maximum source nonterminal span)
\end{itemize}
%
We are using a shallow-$1$ hierarchical
grammar (see \autoref{sec:constraintsOnHierarhicalGrammars}) in our
experiments.
This model is constrained enough that the decoder can build exact search spaces,
i.e. there is no pruning in search that may lead to search errors under the model. % TODO (note that we still use hiero grammar threshold filtering)

We use the KenLM toolkit~\citep{heafield-pouzyrevsky-clark-koehn:2013:ACL} to estimate
a single modified Kneser-Ney smoothed 4-gram language
model (see \autoref{sec:StatisticalMachineTranslationKneserNey})
on all monolingual data available in the constrained track.
Statistics for the monolingual data are presented in \autoref{tab:monolingualStats}.
%
\begin{table}[htbp]
\begin{center}
\begin{tabular}{l|r}
Corpus & \# Tokens \\
\hline
%EU + NC + UN + CzEn + Yx & 30883622 & 652478959 \\
EU + NC + UN + CzEng + Yx & 652.5M \\
%Giga + CC + Wiki & 25091335 & 654121365 \\
Giga + CC + Wiki & 654.1M \\
%news & 68521563 & 1594300877 \\
%News Crawl & 68521563 & 1594300877 \\
News Crawl & 1594.3M \\
%afp & 31200783 & 874138041 \\
afp & 874.1M \\
%apw & 56608701 & 1429342034 \\
apw & 1429.3M \\
%cna + wpb & 2213142 & 66368468 \\
cna + wpb & 66.4M \\
%ltw & 12343912 & 326531595 \\
ltw & 326.5M \\
%nyt & 68702044 & 1744284446 \\
nyt & 1744.3M \\
%xin & 14581673 & 425356551 \\
xin & 425.4M \\
\hline
%Total        & 310146775 & 7766922336 \\
Total & 7766.9M \\
\end{tabular}
\end{center}
\caption{Statistics for English monolingual corpora used to train language models.
  Abbreviations are used for the following corpora: Europarl (EU), News Commentary (NC), United Nations (UN), Czech-English corpus (CzEng), Yandex (Yx),
  $10^9$ French-English corpus (Giga), Common Crawl (CC), Wikipedia Headlines (Wiki). Corpora ``afp'' and below are the various news agencies from the
  GigaWord corpus.} % TODOFINAL cite GigaWord
\label{tab:monolingualStats}
\end{table}

For first-pass decoding, we use HiFST as described in \autoref{sec:hifst}. First-pass
decoding is followed by a large 5-gram language model rescoring step
as described in \autoref{sec:rescoring}. The 5-gram language model
also uses all available monolingual data described
in \autoref{tab:monolingualStats}.

Two development sets are available: \emph{newstest2012} with 3003 sentences and \emph{newstest2013} with
3000 sentences.
We extract odd-numbered sentences from \emph{newstest2012} in order to
obtain a tuning set \emph{newstest2012.tune}. Even-numbered sentences from \emph{newstest2012}
form a test set \emph{newstest2012.test}. The \emph{newstest2013} set is used as an additional test set.
\emph{newstest2012.tune}, \emph{newstest2012.test} and \emph{newstest2013} are respectively
shortened to \emph{tune}, \emph{test1} and \emph{test2}.
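Assuming that sentences are numbered from 1 (the numbering convention is not
stated here), the odd/even split of the 3003 sentences of \emph{newstest2012}
would give two sets of almost equal size:
%
\[
  \left\lceil 3003 / 2 \right\rceil = 1502 \ \text{(\emph{tune})}, \qquad
  \left\lfloor 3003 / 2 \right\rfloor = 1501 \ \text{(\emph{test1})}.
\]
%
With numbering from 0, the two sizes would be swapped.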
% TODOFINAL mention baseline features as well as lattice mert. should be baseline feature section in background. should be lattice mert section in background.

We report our baseline experiment
in \autoref{tab:baselineWMT}. We indicate case-insensitive BLEU
score after first-pass decoding and 5-gram language model rescoring.
%
\begin{table}
  \begin{center}
  \begin{tabular}{l|lll}
    Configuration & \emph{tune} & \emph{test1} & \emph{test2} \\
    \hline
    baseline 1st pass & 33.22 & 32.01 & 25.35 \\
    baseline +5g & 33.33 & 32.26 & 25.53 \\
  \end{tabular}
  \caption{WMT13 Russian-English baseline system. Performance is measured
    by case-insensitive BLEU. The 1st pass configuration indicates results obtained
    after decoding. The +5g configuration indicates results obtained
    after 5-gram language model rescoring.}
  \label{tab:baselineWMT}
  \end{center}
\end{table}
%
We will now investigate various strategies in order to improve
performance over the baseline. We will investigate
domain adaptation techniques in order to
estimate better language models and grammars for
translating newswire data.
We will also study the interaction between the first-pass language model
and the rescoring language model. Finally, we will study
an alternative smoothing technique to Stupid Backoff for the
rescoring language model.

\section{Domain Adaptation for Machine Translation}
\label{sec:domainAdaptationMT}

In this section, we first introduce in
\autoref{sec:domainAdaptationGeneralProblem} the general problem of domain
adaptation and how certain machine translation settings
may be instances of this problem. We then review
previous work on addressing the problem of domain adaptation
in SMT in \autoref{sec:domainAdaptationSMTrelatedWork}.

\subsection{Domain Adaptation}
\label{sec:domainAdaptationGeneralProblem}

Let us first formalise the problem of domain adaptation.
In a standard multi-class classification problem, we are
given a training set
$\{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}, i \in [1, N]\}$
where $\mathcal{X}$ is a set of instances to be labelled and
$\mathcal{Y}$ is a finite set of labels. The multi-class classification
learning problem
is to find a function $f : \mathcal{X} \rightarrow \mathcal{Y}$
that minimises the number of prediction errors.
Machine translation
can be seen as a multi-class classification problem where $\mathcal{X}$
is the set of all possible source sentences and $\mathcal{Y}$ is the set of
all possible target sentences. The multi-class classification framework is not very intuitive
for machine translation. First, at least in theory, there are an
infinite number of possible target sentences to choose from in translation.
In addition, many possible translations are correct
or acceptable in general.
However, the multi-class classification setting is assumed by the original objective
function (see \autoref{sec:sourceChannelModel}), which we
recall in \autoref{eq:originalSMTobjectiveReminder}. This objective
function minimises the number of misclassifications.
%
\begin{equation}
  \bm{\hat{e}} = \argmax_{\bm{e}} p(\bm{e} \mid \bm{f})
  \label{eq:originalSMTobjectiveReminder}
\end{equation}
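%
The link between \autoref{eq:originalSMTobjectiveReminder} and misclassification
can be made explicit with a short derivation. The loss notation is introduced
here for illustration only: let $\ell(\bm{e}, \bm{e}')$ be $0$ if
$\bm{e} = \bm{e}'$ and $1$ otherwise. If the correct translation of $\bm{f}$ is
distributed according to $p(\bm{e} \mid \bm{f})$, the expected loss incurred by
outputting $\bm{e}'$ is
%
\[
  \sum_{\bm{e}} p(\bm{e} \mid \bm{f}) \, \ell(\bm{e}, \bm{e}')
  = \sum_{\bm{e} \neq \bm{e}'} p(\bm{e} \mid \bm{f})
  = 1 - p(\bm{e}' \mid \bm{f}) ,
\]
%
which is smallest for the hypothesis $\bm{e}'$ with the highest posterior
probability, that is, for $\bm{\hat{e}}$.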
%

A \emph{domain} is a particular distribution $\mathcal{D}$
over $\mathcal{X}$. For example, $\mathcal{D}$ can be such that
source sentences in $\mathcal{X}$ drawn from $\mathcal{D}$ are
in the newswire
domain. If source sentences are in English, then source sentences
sampled from $\mathcal{X}$ with distribution $\mathcal{D}$
may look like
the sentence in \autoref{fig:exampleSentenceNewswire}.
%
\begin{figure}
  \begin{quote}
    In addition, a new seven-year EU budget needs to be passed, which is very complicated due to the current crisis.
  \end{quote}
  \caption{Example of English sentence in the newswire domain.}
  \label{fig:exampleSentenceNewswire}
\end{figure}
%
% TODOFINAL fix space here

For domain adaptation, we assume an out-of-domain distribution
$\mathcal{D}_O$ and an in-domain distribution $\mathcal{D}_I$.
A model is trained on
a sample drawn from $\mathcal{D}_O$ but the model performance
is evaluated on a sample drawn from $\mathcal{D}_I$.
\emph{Domain adaptation} consists in altering the training procedure
by using information from domain
$\mathcal{D}_I$ in order to achieve better performance on $\mathcal{D}_I$.
The out-of-domain and in-domain distributions are also called
source domain and target domain respectively.
A simple example of a domain adaptation setting for SMT is one where
the parallel text is a corpus of
tweets\footnote{\url{https://twitter.com/}} and the sentences to be translated
come from newspaper articles.
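
The mismatch between training and evaluation can be stated more explicitly.
The following is only a sketch: the loss $\ell$, the reference labelling $y(x)$
and the risk symbols $\hat{R}_O$ and $R_I$ are notational assumptions introduced
here for illustration. Training selects $f$ so that the empirical risk on the
out-of-domain sample is small,
%
\[
  \hat{R}_O(f) = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i), y_i),
  \qquad x_i \sim \mathcal{D}_O ,
\]
%
whereas performance is measured by the expected loss under the in-domain distribution,
%
\[
  R_I(f) = \sum_{x \in \mathcal{X}} \mathcal{D}_I(x) \, \ell(f(x), y(x)) .
\]
%
Domain adaptation uses information about $\mathcal{D}_I$ during training with the
aim of lowering $R_I(f)$ for the selected $f$.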

\subsection{Previous Work on Domain Adaptation for SMT}
\label{sec:domainAdaptationSMTrelatedWork}

% orthogonal axes: language model, translation model, linear mixture, loglinear mixture

%TODOFINAL: distinguish lm, tm, linear, loglinear

In machine translation, we can distinguish two types of domain
adaptation problem: \emph{cross-domain} adaptation and \emph{dynamic}
adaptation~\citep{foster-kuhn:2007:WMT}. In cross-domain adaptation,
a small amount of in-domain training data is available while
in dynamic adaptation, no in-domain training data is available.
Dynamic adaptation may be important for online translation systems for example.
Our concern is cross-domain adaptation, where at least an in-domain
development set is available for parameter tuning.

%notes on Experiments in Domain Adaptation for Statistical Machine Translation by Keohn and Schroeder
%large europarl // data, small newscommentary // data, test set in news domain
%baseline: concatenate the parallel corpora
%in domain data to train language model
%interpolated language model: train 2 lms: one in out of domain, one in in domain
%grid search on interpolation weights
%use the 2 lms as separate features
\citet{koehn-schroeder:2007:WMT} explore various techniques for
machine translation domain adaptation. They use
Europarl~\citep{koehn:2005:MTSummit} as out-of-domain
training data and a news commentary parallel
corpus\footnote{\url{http://statmt.org/wmt07/shared-task.html}}
for in-domain training data. Training the language model
on in-domain training data gives an improvement of 0.8 BLEU
with respect to the baseline. Training two separate language models
on the in-domain and out-of-domain training data and interpolating
them with weights set to minimise perplexity on the tuning set
gives an improvement of 0.4 BLEU. Using these two language models
as separate features to be optimised by MERT~\citep{och:2003:ACL}
gives an improvement of 0.6 BLEU. Finally, using two separate
translation models trained on the in-domain and out-of-domain
data and tuning the weights with MERT gives an improvement
of 0.9 BLEU. These results are reported on the tuning set only.
In this chapter, we explore very similar techniques.

%notes on Domain Adaptation for Statistical Machine Translation with Monolingual Resources by Bertoldi and Federico
%exploit monolingual in domain data (src or trg)
%adapt sp-en system from UN to Europarl
%distinguish cross-domain and dynamic adaptation
%large in-domain monolingual data
%use directly to adapt lm
%generate synthetic // corpus and adapt translation and reordering model
%generate synthetic // data directly with alignment given by decoder
%phrase table is the union and the translation features are smoothed by some
%lexical prob in case one phrase is not extracted from one of the // corpora

% TODOFINAL add this
%\citet{bertoldi-federico:2009:WMT} show how to exploit
%a monolingual corpus for machine translation domain
%adaptation. An in-domain monolingual corpus is available
%either on the source or target side. A source side
%monolingual corpus is automatically translated with
%an out-of-domain baseline system. The two resulting
%phrase tables are merged and the translation features
%for phrase pairs not extracted from one of the corpus
%are smoothed with a lexical feature. The language model
%is trained on the target side of the synthetic corpus.
%This strategy gives a gain of 1 BLEU. Using in-domain
%monolingual data on the target side is much more effective.
%Simply training the language model on this data gives
%a gain of 5.2 BLEU. Creating a synthetic parallel corpus
%and merging the phrase tables as just described gives
%an additional 0.3 BLEU.

% notes on Discriminative Instance Weighting for Domain Adaptation
% in Statistical Machine Translation by foster goutte kuhn
%baseline adaptation technique
%obvious: use mert tune set in domain
%more difficult: how to adapt lm and s2t and t2s
%no adaptation of alignment
%concatenate in and out training data
%train two models and interpolate. one way to do this
%is two use them as separate features in mert. drawback is
%that mert becomes unstable
%log linear combination not good because multiplies prob (?)
%use linear intepolation instead
%optimization on in domain dev set for LM
%for TM not so straightforward
%hat{alpha} = argmax_alpha sum_s,t ptilda(s,t) log sum_i alpha_i p_i(s | t)
%ptilda(s,t): joint empirical distrib on in domain dev set using normal
%rule extraction
%alternative: use MAP (see paper for formula)
%sentence selection: filter out-of-domain training to match
%input source sentences
%sentence selection crude. matsoukas et al use sentence pair weighting.
%extension: learn weights on phrase pairs
%model
%p(s|t) = alpha_t p_I(s | t) + (1 - alpha_t) p_o(s | t)
%p_I: s2t as usual computed on the in domain corpus
%p_o: instance weighted model computed on the out of domain corpus

% notes on Simulating Discriminative Training for Linear Mixture Adaptation in Statistical Machine Translation by Foster-Chen-Kuhn
% linear mixture for domain adaptation. problem: mixture weight biased towards corpus size.
% typical effective recipe: heterogeneous subcorpora, in domain dev set
% linear mixture and loglinear mixture
% linear mixture better
% challenge for linear mixture: need to mix at local level (ngram or phrase pair)
% linear mixture usually tuned towards perplexity (sennrich 2012). can linear mixture be tuned towards BLEU ?
% looks difficult to modify tuning algo (mert, smooth bleu gradient descent, mira)
% maximum likelihood mixtures
% p(s | t) = sum_i w_i p_i(s|t)
% extract multiset of phrase pairs from small indomain parallel corpus: joint distribution ptilda(s,t)
% hat{w} = argmax_w sum_s,t ptilda(s,t) log sum_i w_i p_i (s | t).
% then train using EM (basically same thing as mixture lm to minimize perplexity for dev set)
% correction for large corpus bias
%


%papers to review:
%eck et al 2004
%hildebrand et al 2005 (similar to eck et al 2004)
%foster and kuhn 2007
%finch sumita 2008
%civera and juan 2007
%ueffing et al 2007
%schwenk 2008
% daume 2007, daume marcu 2007
% matsoukas et al 2009
%Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine Translation
% Simulating Discriminative Training for Linear Mixture Adaptation in Statistical Machine Translation
% Analysing the effect of out-of-domain data on smt systems.
% Perplexity minimization for translation model domain adaptation in statistical machine translation.
% Discriminative instance weighting for domain adaptation in statistical machine translation.

\section{Language Model Adaptation for Machine Translation}
\label{sec:domainAdaptationLM}

In this section, we contrast two simple language model adaptation
techniques for machine translation. Experiments are carried out % TODOFINAL what's different about previous work
on the Russian-English system described in \autoref{sec:wmt13ExperimentalSetup}.

We use the KenLM toolkit to
estimate separate modified Kneser-Ney smoothed 4-gram
language models for each of the
corpora listed in \autoref{tab:monolingualStats}. The component models are then
interpolated with the SRILM toolkit~\citep{stolcke:2002:SLP} to form a single LM.
The interpolation weights are optimised for perplexity on the
concatenation of the English side of the \emph{news-test2008},
\emph{newstest2009} and \emph{newssyscomb2009} development sets which were provided
for previous translation evaluations, using
the \emph{compute-best-mix} tool, which is part of the SRILM toolkit.
The weights reflect
both the size of the component models and the genre of the corpus the component models
are trained on, e.g. weights are larger for larger corpora in the news
genre~\citep{foster-chen-kuhn:2013:MTSummit}.
Thus we obtain a linear mixture of language models. We call this configuration ``linear''.
As a contrast, we also keep the individual language models as separate features
in optimisation. We call this configuration ``loglinear''.
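
The difference between the two configurations can be sketched as follows. The
symbols $K$, $\lambda_i$, $\mu_i$, $w_t$ and $h_t$ are notation introduced for this
illustration only. With $K = 9$ component models $p_1, \dots, p_K$, the linear
configuration defines a single normalised model
%
\[
  p_{\text{linear}}(w \mid h) = \sum_{i=1}^{K} \lambda_i \, p_i(w \mid h),
  \qquad \lambda_i \geq 0, \quad \sum_{i=1}^{K} \lambda_i = 1 ,
\]
%
where $w$ is a word and $h$ its $n$-gram history. Minimising perplexity on a
development text $w_1, \dots, w_T$ is equivalent to choosing the weights that
maximise its log-likelihood:
%
\[
  \hat{\lambda} = \argmax_{\lambda} \sum_{t=1}^{T} \log \sum_{i=1}^{K} \lambda_i \, p_i(w_t \mid h_t) .
\]
%
In the loglinear configuration, each component model instead contributes its own
term to the decoder score of a hypothesis $\bm{e}$,
%
\[
  \sum_{i=1}^{K} \mu_i \log p_i(\bm{e}) ,
\]
%
and the weights $\mu_i$ are tuned for BLEU together with the other feature weights,
so the components are combined as an unnormalised product rather than as a mixture.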

We compare the baseline, the linear and the loglinear configurations
in \autoref{tab:lmInterpolationBestStrategy}.
We observe that the best strategy for
building a language model is to do offline linear interpolation to minimise
perplexity on a development set that has the same genre as the translation
test sets (row 3). The second best strategy is to use an uninterpolated
language model (row 1). The worst strategy is to do a log-linear interpolation
of the various language model components by tuning these
language models as separate features with lattice MERT (row 5). Note that
these observations are confirmed even after rescoring with a large
Stupid Backoff 5-gram language model (rows 2, 4 and 6).
It is possible that the loglinear configuration, which has 20 features as opposed
to 12 features for the baseline and the linear configuration, performs worst because
the MERT algorithm
becomes unstable with more features~\citep{foster-kuhn:2009:WMT}.
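The two feature counts are consistent with replacing the single language model
feature of the baseline by one feature per component model. Assuming that the
other 11 baseline features are left unchanged (an assumption, since the feature
set is not itemised here), the count is
%
\[
  12 - 1 + 9 = 20 .
\]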
%
\begin{table}
  \begin{center}
    \begin{tabular}{l|l|lll}
      Row & Configuration & \emph{tune} & \emph{test1} & \emph{test2} \\
      \hline
      1 & baseline 1st pass & 33.22 & 32.01 & 25.35 \\
      2 & baseline +5g & 33.33 & 32.26 & 25.53 \\
      \hline
      3 & linear 1st pass & 33.71 & 32.26 & 25.47 \\
      4 & linear +5g & 33.74 & 32.51 & 25.58 \\
      \hline
      5 & loglinear 1st pass & 33.23 & 31.75 & 25.37 \\
      6 & loglinear +5g & 33.28 & 31.85 & 25.48
    \end{tabular}
    \caption{Performance comparison, measured by case-insensitive BLEU, between an uninterpolated language model (baseline), a
    linearly interpolated language model (linear) and a log-linearly interpolated language model (loglinear).
    Off-line linear interpolation (rows 3 and 4) is the best strategy. The loglinear configuration performs worst; however, this
    may be due to the MERT optimisation algorithm not being able to find a high quality
    set of parameters with 20 features.}
    \label{tab:lmInterpolationBestStrategy}
  \end{center}
\end{table}
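%
For reference, the differences in BLEU with respect to the corresponding baseline
rows of \autoref{tab:lmInterpolationBestStrategy} are listed below. These are simple
subtractions of the reported scores and do not involve any significance test.
%
\[
  \begin{array}{l|rrr}
    \text{Configuration} & \text{\emph{tune}} & \text{\emph{test1}} & \text{\emph{test2}} \\
    \hline
    \text{linear 1st pass (row 3 $-$ row 1)} & +0.49 & +0.25 & +0.12 \\
    \text{linear +5g (row 4 $-$ row 2)} & +0.41 & +0.25 & +0.05 \\
    \text{loglinear 1st pass (row 5 $-$ row 1)} & +0.01 & -0.26 & +0.02 \\
    \text{loglinear +5g (row 6 $-$ row 2)} & -0.05 & -0.41 & -0.05 \\
  \end{array}
\]
%
The loglinear configuration is therefore clearly worse than the baseline on \emph{test1} only;
on \emph{tune} and \emph{test2} it is within 0.05 BLEU of the baseline.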
%
These results contrast with those obtained
previously~\citep{koehn-schroeder:2007:WMT}, where log-linear interpolation
was found to provide greater benefits than linear interpolation. Our experiments
also differ in two respects: more language models are considered (9 vs.\ 2), and
both tune and test results are reported here.
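
For concreteness, the two interpolation schemes compared in
\autoref{tab:lmInterpolationBestStrategy} can be sketched as follows. The notation
($K$ component language models $p_1, \dots, p_K$, a word $w$ with history $h$,
a hypothesis $\bm{e}$, and weights $\lambda_k$ and $\mu_k$) is introduced here for
illustration only. In linear interpolation, the components are combined off-line
into a single model,
\[
  p_{\mathrm{lin}}(w \mid h) = \sum_{k=1}^{K} \lambda_k \, p_k(w \mid h),
  \qquad \lambda_k \geq 0, \quad \sum_{k=1}^{K} \lambda_k = 1,
\]
where the weights $\lambda_k$ are chosen to minimise perplexity on the development
set, so that the decoder sees a single language model feature. In log-linear
interpolation, each component contributes its own feature to the model score,
\[
  \sum_{k=1}^{K} \mu_k \log p_k(\bm{e}),
\]
and the weights $\mu_k$ are tuned by lattice MERT together with all other feature
weights. With $K = 9$, replacing one language model feature by nine is consistent
with the increase from 12 to 20 features noted above, assuming that the baseline
uses exactly one language model feature.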

%We note that the differences in BLEU between the various configurations are
%relatively small. Therefore, we perform a statistical significance analysis
%using bootstrap resampling~\citep{koehn:2004:EMNLP}.\footnote{implementation at \url{http://www.ark.cs.cmu.edu/MT/paired_bootstrap_v13a.tar.gz}}
%We do not use a more
%recent significance testing analysis~\citep{clark-dyer-lavie-smith:2011:ACL} that
%accounts for random restarts in MERT simply because we do not use
%random restarts. % justify why no random restarts ?
%We carry out the following tests: baseline vs.\ linear and baseline vs.\ loglinear, both
%for first pass decoding and for 5g-rescored results, on \emph{test1} and \emph{test2}.
%BLEU scores superscripted with a star indicate that the difference with respect to the
%baseline is statistically significant at the 0.05 level with 3000 samples.

% TODOFINAL IF TIME do exp that compares linear and loglinear interp of ONLY 2 lms.

\section{Domain Adaptation with Provenance Features}
\label{sec:domainAdaptationGrammar}

% notes on chiang's paper
%run porter stemmer on en side of // corpus
%build two word translation table
%t(e' | f ) and t(e|e')
%t_m(e | f ) = sum_{e'} t(e' | f) t(e | e')
%t_m(f | e) = sum_{e'} t(f | e') t(e' | e) = t(f | e')
%conditioning on provenance
%each sentence pair has genre/collection info
%compute word translation tables t_s(e | f) and t_s(f | e) for each feature s
%for unseen word pairs, use Witten-Bell smoothing
%for each s, for phrase pair/rule (f,e) add two features - log t_s(e | f)/t(e|f)

In \autoref{sec:domainAdaptationLM}, we have presented domain adaptation strategies
for the language model. In this section, we will focus on domain
adaptation for the translation model.
\citet{chiang-deneefe-pust:2011:ACL} introduce \emph{provenance} lexical
features. The concept of provenance was introduced
previously~\citep{matsoukas-rosti-zhang:2009:EMNLP} and
provides metadata information such as genre or collection for each
sentence pair in parallel text.
% TODOFINAL add this to background + refer to background
%The lexical feature $t(\bm{e} \mid \bm{f})$ for a phrase
%pair $(\bm{f}, \bm{e})$ is defined in \autoref{eq:lexicalFeature}:
%
%\begin{equation}
%  t(\bm{e} \mid \bm{f}) = \prod_{i = 1}^{|\bm{e}|}
%  \begin{cases}
%    \frac{1}{|a_i|} \sum_{j \in a_i} t(e_i \mid f_j) & \text{if } |a_i| > 0 \\
%    t(e_i \mid \text{NULL}) & \text{otherwise}
%  \end{cases}
%  \label{eq:lexicalFeature}
%\end{equation}
%
%where $t(e_i \mid f_j)$ is a word translation table computed from the
%word aligned parallel text.
Word translation tables and
lexical features are computed for each provenance. % TODOFINAL ref lex feature background
Finally, for each provenance, the ratio of the provenance-specific
lexical feature to the global lexical feature is added as a feature
to the translation system.
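
As a sketch of the feature just described, write $t(\bm{e} \mid \bm{f})$ for the
global lexical feature of a phrase pair $(\bm{f}, \bm{e})$ and
$t_s(\bm{e} \mid \bm{f})$ for the same feature computed from the word translation
table of provenance $s$; the name $h_s$ below is introduced here for illustration
only. The ratio feature for provenance $s$ then takes the form
\[
  h_s(\bm{f}, \bm{e}) = \log \frac{t_s(\bm{e} \mid \bm{f})}{t(\bm{e} \mid \bm{f})},
\]
up to the sign convention of the decoder, and an analogous feature can be defined
for the target-to-source direction. The feature is zero when the provenance-specific
and global estimates agree, and departs from zero as they diverge.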

We use a very similar approach with the following differences and
extensions:
%
\begin{itemize}
  \item The lexical features are computed with Model 1 rather than Viterbi alignments, as described in
    \autoref{sec:features}.
  \item The features added to the system are not the ratios between a
    provenance-specific lexical feature and the global lexical feature but
    simply the provenance-specific lexical features.
  \item We extend this technique to compute provenance-specific
    source-to-target and target-to-source translation models
    as described in \autoref{sec:rulextractMapReduce}.
\end{itemize}
%
For our experiments, we use 4 provenances that correspond
to the 4 subcorpora provided for parallel text: the Common Crawl corpus, the
News Commentary corpus, the Yandex corpus and the Wiki Headlines
corpus. This gives an additional 16 features to our system: source-to-target
and target-to-source translation and lexical features for each provenance.
This configuration is called ``provenance''.

When retrieving relevant rules for a particular test set, various thresholds are applied, such % TODOFINAL as described in section...
as number of targets per source or translation probability cutoffs.
Thresholds involving source-to-target translation scores are applied separately for each
provenance and the union of all surviving rules for each provenance is kept. This
configuration is called ``provenance union''. This specific configuration
was used for our submission to the WMT13 translation task.
To our knowledge, this technique has not been employed previously.
Results show that it provides additional gains over the use
of provenance features alone.
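
The rule selection can be written compactly as follows, where all symbols are
introduced here for illustration only: $\mathcal{R}$ is the set of candidate rules
for a test set, $\mathcal{S}$ is the set of provenances, and
$\mathcal{F}_s(\mathcal{R}) \subseteq \mathcal{R}$ is the set of rules that survive
the thresholds when source-to-target scores are taken from the translation model
of provenance $s$. The provenance union rule set is then
\[
  \mathcal{R}_{\cup} = \bigcup_{s \in \mathcal{S}} \mathcal{F}_s(\mathcal{R}).
\]
A rule is therefore kept as soon as it passes the thresholds for at least one
provenance, even if it would have been pruned under the scores of every other
provenance.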

The first-pass language model is the linear mixture described
in \autoref{sec:domainAdaptationLM}. We also compare
the various configurations after 5-gram rescoring.
Results are presented in \autoref{tab:noprovVsProvVsProvunion}.
%
\begin{table}
  \begin{center}
    \begin{tabular}{l|l|lll}
      Row & Configuration & \emph{tune} & \emph{test1} & \emph{test2} \\
      \hline
      1 & no provenance 1st pass & 33.71 & 32.26 & 25.47 \\
      2 & no provenance +5g           & 33.74 & 32.51 & 25.58 \\
      \hline
      3 & provenance 1st pass & 33.22 & 32.14 & 25.40 \\
      4 & provenance +5g        & 33.22 & 32.14 & 25.41 \\
      \hline
      5 & provenance union 1st pass & 33.65 & 32.36 & 25.55 \\
      6 & provenance union +5g              & 33.67 & 32.58 & 25.63 \\
      \hline
      7 & no provenance union 1st pass & 33.22 & 31.78 & 25.59 \\
      8 & no provenance union +5g      & 33.22 & 31.78 & 25.59 \\
    \end{tabular}
    \caption{Effect of using provenance features. Case-insensitive
      BLEU scores are reported. The most effective strategy is
      provenance union. Because the ``no provenance union'' configuration does
      not improve consistently over the ``no provenance'' baseline, we conclude that the gains
      obtained by the provenance union strategy are not only due
      to additional rules; instead, they are due to the conjunction
      of additional rules with provenance features for better discrimination.}
      %Statistical significant differences on \emph{test1}
      %and \emph{test2} with respect to the baseline are marked
      %with a superscript star.}
    \label{tab:noprovVsProvVsProvunion}
  \end{center}
\end{table}
%
We can see that adding provenance features slightly degrades
performance (row 3 vs.\ row 1, row 4 vs.\ row 2). This again
may be due to the inability of the MERT optimisation algorithm to
reliably tune more than the 12 basic features.
However, with the provenance union strategy, having additional
rules together with provenance features improves over the ``no provenance''
configuration on both test sets (row 5 vs.\ row 1, row 6 vs.\ row 2), although
not on the tuning set.
In order to verify whether the gains are only due to additional
rules, we rerun the experiment with the provenance union configuration but
remove the provenance
features (row 7 and row 8). The ``no provenance union''
configuration is worse than the ``no provenance'' configuration on \emph{tune}
and \emph{test1} and only slightly better on \emph{test2}, which
suggests that the gains provided by the provenance union strategy are not only
due to having additional translation
rules. The effectiveness of the provenance union strategy is due to
additional rules in conjunction with additional features for better discrimination
between rules.

% TODOFINAL significance testing

\section{Interaction between First Pass Language Model and Rescoring Language Model}
\label{sec:bestPossibleRescoring}

In \autoref{sec:domainAdaptationLM} and \autoref{sec:domainAdaptationGrammar},
we have investigated various strategies to adapt the language model
and translation model to the genre of the test set to be translated.
In each case, we verified that the conclusions held even after
rescoring with a large 5-gram language model.
In this section, we study the interaction between the first pass language model
and the rescoring language model in terms of $n$-gram order and training
data size. We give recommendations on how much
data should be used to train a first pass language model
and what order should be chosen for the first pass
$n$-gram language model.

Our translation systems are typically run in two steps. The
first step usually consists of decoding with
a 4-gram language model. The second step consists
of 5-gram lattice rescoring.
\citet{brants-popat-xu-och-dean:2007:EMNLP-CoNLL} argue
that single pass decoding is conceptually simpler
and may lose less information.
In this section, we also attempt to verify this claim
by incorporating 5-gram language models directly in
first-pass decoding and evaluating performance with respect
to a two-pass strategy.

We describe the various configurations used for experiments and
reported in \autoref{tab:lmSizes}. For monolingual data size,
we use a ``large'' configuration which uses all available
data described in \autoref{tab:monolingualStats}, and a
``medium'' configuration which uses the following corpora: EU, NC, UN, CzEng, Yx, Giga, CC, Wiki, afp and xin.
(The keys for abbreviations are described in \autoref{tab:monolingualStats}).
We also train 4-gram and 5-gram language models with large and
medium data for the first pass language model.
The ``large 4g'' configuration corresponds to the linear
configuration from \autoref{tab:lmInterpolationBestStrategy}.

The initial parameters for the MERT optimisation
step were the same for all configurations. These parameters were obtained
from a separate Arabic-to-English experiment carried out earlier. Reusing
already optimised parameters between experiments allows us to do faster
experimentation since fewer iterations are needed until convergence.

Results are presented in \autoref{tab:lmSizes}.
Comparing row 1 and row 3, we can see that for a 4-gram
language model, more training data is beneficial in first-pass decoding.
This conclusion is confirmed after 5-gram rescoring (row 2 vs.\ row 4).
However, when comparing row 5 vs.\ row 7 and row 6 vs.\ row 8, we conclude that
in the case of a first-pass 5-gram language model, more training
data is only helpful in first-pass decoding.
It is possible that even more monolingual data is required for a 5-gram
language model to generate a lattice of sufficient quality in first-pass
decoding so as to obtain improvements in rescoring.

Comparing row 2 vs.\ row 6 and row 4 vs.\ row 8, we see
that for equal amounts of training data, the best strategy
is to use a first-pass 4-gram language model followed by
large 5-gram language model rescoring. One possible reason
is that a first pass 5-gram language model may not have enough training
data to be reliably estimated and therefore may not
produce a first-pass lattice of high quality translations to be rescored, whereas
a first pass 4-gram language model is able to discard poor translations in first pass
decoding.
One caveat with this
conclusion is, of course, that the maximum amount of data experimented
with is 7.8B words, as opposed to the 2 trillion words reported
by \citet{brants-popat-xu-och-dean:2007:EMNLP-CoNLL}. Future work may be dedicated to filling this gap in experimentation.
For example, we plan to exploit a large 5-gram language model~\citep{buck-heafield-vanooyen:2014:LREC}
trained on the Common Crawl corpus\footnote{\url{https://commoncrawl.org/}}
in rescoring experiments.
%
\begin{table}
  \begin{center}
    \begin{tabular}{l|l|lll}
      Row & Configuration & \emph{tune} & \emph{test1} & \emph{test2} \\
      \hline
      1 & medium 4g 1st pass & 32.96 & 31.53 & 24.60 \\
      2 & medium 4g +5g &       33.43 & 32.12 & 25.30 \\
      \hline
      3 & large 4g 1st pass  & 33.71 & 32.26 & 25.47 \\
      4 & large 4g +5g       & 33.74 & 32.51 & 25.58 \\
      \hline
      5 & medium 5g 1st pass & 32.62 & 31.62 & 24.77 \\
      6 & medium 5g +5g       & 33.20 & 31.96 & 25.27 \\
      \hline
      7 & large 5g 1st pass  & 32.99 & 31.79 & 25.18 \\
      8 & large 5g +5g       & 32.99 & 31.79 & 25.18 \\
    \end{tabular}
    \caption{Comparing various data size conditions and $n$-gram orders
      for language model training and the effect on large 5-gram language
      model rescoring. Case-insensitive BLEU scores are reported.
      In conclusion, more language
      model training data is helpful but the use of higher order $n$-gram language
      models is only beneficial in rescoring.}
    \label{tab:lmSizes}
  \end{center}
\end{table}
%
We are not aware of any studies in the machine translation literature on
interactions between first pass and second pass language models.
The main conclusion, i.e.\ that a two-pass decoding approach is preferable to a single-pass
approach given the amount of training data, has not been demonstrated experimentally before.

% TODOFINAL if time make the large 5g first pass interpolated by using restricted vocab.
% TODOFINAL check 1st pass lattice sizes

\section{Language Model Smoothing in Language Model Lattice Rescoring}
\label{sec:sbVSkn}

%TODOFINAL: refer back to background section on 5g rescoring

In \autoref{sec:bestPossibleRescoring}, we have given recommendations
about what strategy to adopt in first-pass decoding in order
to obtain optimal results in rescoring. The conclusion was to use
a 4-gram language model trained on large amounts of data.
In this section, we attempt to improve on the 5-gram rescoring
procedure by exploring two alternative smoothing strategies for the
5-gram language model used for rescoring.

We train a Stupid Backoff 5-gram language
model (see \autoref{sec:stupidBackoffSmoothing}
and \autoref{sec:applicationsMapReduceSMT})
and a modified Kneser-Ney 5-gram language model on all available
English data for WMT13, described in \autoref{tab:monolingualStats}.
We compare the two smoothing methods in rescoring over various
configurations used throughout this chapter. We did not include
the ``no provenance union'' and the ``large 5g'' configurations.
On average, over the seven experiments from \autoref{tab:SB5gVsKN5g},
we obtain a gain of 0.11 BLEU on the tuning set \emph{tune}, a gain of
0.14 BLEU on the first
test set \emph{test1} and a gain of
0.13 BLEU on the second test set \emph{test2}. We conclude that the
use of modified Kneser-Ney smoothing is slightly beneficial in 5-gram
lattice rescoring when the amount of training data is in the order
of several billion tokens. This confirms conclusions from
Figure 5 in a previous publication~\citep{brants-popat-xu-och-dean:2007:EMNLP-CoNLL}
that Kneser-Ney smoothing is superior to Stupid Backoff smoothing
with amounts of training data of a similar order to those used here. These
conclusions were drawn for first-pass decoding, whereas here the comparison
is made in rescoring. This is
the first time these two smoothing strategies are compared
in the context of 5-gram language model lattice rescoring.
We also note that we obtain an improvement over the system submitted to the WMT13 translation
task using the ``provenance union + KN 5g'' configuration (row 15 in bold vs.\ row 14).
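
The averages quoted above can be checked directly from
\autoref{tab:SB5gVsKN5g}. Taking the difference between the Kneser-Ney and the
Stupid Backoff rescoring results for the seven configurations, in the order in
which they appear in the table, gives
\[
  \begin{array}{lrcl}
    \mbox{\emph{tune}:}  & (-0.02 - 0.01 + 0.02 + 0.32 + 0.22 + 0.04 + 0.22)/7 & = & 0.79/7 \approx 0.11 \\
    \mbox{\emph{test1}:} & (0.02 - 0.25 + 0.07 + 0.23 + 0.24 + 0.33 + 0.34)/7 & = & 0.98/7 = 0.14 \\
    \mbox{\emph{test2}:} & (0.12 - 0.10 - 0.13 + 0.34 + 0.21 + 0.20 + 0.25)/7 & = & 0.89/7 \approx 0.13
  \end{array}
\]
The gain is not uniform across configurations: for the linear configuration,
Kneser-Ney rescoring is 0.25 BLEU below Stupid Backoff rescoring on \emph{test1}
and 0.10 BLEU below on \emph{test2}, and for the loglinear configuration it is
0.13 BLEU below on \emph{test2}.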
%
\begin{table}
  \begin{center}
    \begin{tabular}{l|l|lll}
      Row & Configuration         & \emph{tune} & \emph{test1} & \emph{test2} \\
      \hline
      1 & baseline 1st pass & 33.22 & 32.01 & 25.35 \\
      2 & baseline + SB 5g                 & 33.33 & 32.26 & 25.53 \\
      3 & baseline + KN 5g                 & 33.31 & 32.28 & 25.65 \\
      \hline
      4 & linear 1st pass & 33.71 & 32.26 & 25.47 \\
      5 & linear + SB 5g               & 33.74 & 32.51 & 25.58 \\
      6 & linear + KN 5g               & 33.73 & 32.26 & 25.48 \\
      \hline
      7 & loglinear 1st pass      & 33.23 & 31.75 & 25.37 \\
      8 & loglinear + SB 5g                 & 33.28 & 31.85 & 25.48 \\
      9 & loglinear + KN 5g                 & 33.30 & 31.92 & 25.35 \\
      \hline
      10 & provenance 1st pass              & 33.22 & 32.14 & 25.40 \\
      11 & provenance + SB 5g                 & 33.22 & 32.14 & 25.41 \\
      12 & provenance + KN 5g                 & 33.54 & 32.37 & 25.75 \\
      \hline
      13 & provenance union 1st pass        & 33.65 & 32.36 & 25.55 \\
      14 & provenance union + SB 5g                 & 33.67 & 32.58 & 25.63 \\
      15 & provenance union + KN 5g                 & \textbf{33.89} & \textbf{32.82} & \textbf{25.84} \\
      \hline
      16 & medium 5g 1st pass  & 32.62 & 31.62 & 24.77 \\
      17 & medium 5g + SB 5g                 & 33.20 & 31.96 & 25.27 \\
      18 & medium 5g + KN 5g                 & 33.24 & 32.29 & 25.47 \\
      \hline
      19 & medium 4g 1st pass  & 32.96 & 31.53 & 24.60 \\
      20 & medium 4g + SB 5g                 & 33.43 & 32.12 & 25.30 \\
      21 & medium 4g + KN 5g                 & 33.65 & 32.46 & 25.55 \\
    \end{tabular}
    \caption{Comparison between the gains obtained by Stupid Backoff
    5-gram rescoring and Kneser-Ney 5-gram rescoring. Case-insensitive
    BLEU scores are reported. In conclusion, at the level of several billion
    tokens for language model training, Kneser-Ney smoothing is beneficial
    over Stupid Backoff smoothing in the context of 5-gram lattice rescoring.
    Row 15 (in bold) achieves a better performance than row 14, which was our system submitted to the WMT13 translation shared task.}
    \label{tab:SB5gVsKN5g}
  \end{center}
\end{table}

\section{Conclusion}

In this chapter, we have investigated various refinements
on translation system building.
We have explored domain adaptation techniques in order
to refine the language model and the translation model.
For language model building, we find that the best
strategy for cross-domain adaptation
is to build an interpolated
language model and tune the interpolation weights in order
to minimise the perplexity on a development set in the
target domain, as opposed to tuning the language model
weights with MERT. % TODOFINAL result not in line with Koehn 2004 but this was with 9 lms so need to redo with 2 lms
We have also examined the effect of provenance features in the
translation model. We found that the use of provenance features
is beneficial only when these features are used to discriminate
rules from a larger rule set.

After having attempted to refine models used in first-pass decoding, we
have studied the interaction between the first pass language model
and the 5-gram language model used in lattice rescoring. We
have come to the conclusion that in our setting, a two-pass
strategy is preferable and that using large amounts of data
to train a 4-gram language model for first-pass decoding yields
better results in rescoring.

Finally, we have shown that Kneser-Ney smoothing
produces better results on average than Stupid Backoff smoothing
when training a 5-gram language model on 7.8 billion
words for rescoring purposes. This confirms, in the context of
rescoring, previous research carried out in first-pass decoding.