\documentclass[11pt]{article}
\usepackage{times}
\usepackage{latexsym}
\usepackage{acl07}
\usepackage[all]{xy}
\author{\textbf{Nathan C. Sanders} \\ Department of Linguistics \\
Indiana University, Bloomington, IN 47401, USA \\ \texttt{[email protected]}}
\title{Measuring Syntactic Difference in British English}
\begin{document}
\maketitle
%TODO:
% 7. Maybe run a set of tests with R^2? That would be easy and
% worthwhile.
% 9. *Maybe* real-language examples for methods. But if you have trouble with
% abc/xyz how do you handle REAL CS?! Or theoretical syntax? Or PERL?!?
\begin{abstract}Recent work by
\cite{nerbonne06} has provided a foundation for measuring
syntactic differences between corpora. It uses part-of-speech trigrams as an
approximation to syntactic structure, comparing the trigrams of two
corpora for statistically significant differences.
This paper extends the method and its application. It extends the
method by using leaf-path ancestors of \cite{sampson00} instead of
trigrams, which capture internal syntactic structure---every leaf in
a parse tree records the path back to the root.
The corpus used for testing is the International Corpus of English,
Great Britain (ICE-GB) \cite{nelson02}, which contains syntactically
annotated transcriptions of British speech. The speakers are grouped into
geographical regions based on place of birth. This differs in
both nature and number from the previous experiments, which found
differences between two groups of Norwegian L2 learners of
English. We show that dialectal variation in eleven British regions of the ICE-GB
is detectable by our algorithm, using both leaf-ancestor paths and trigrams.
\end{abstract}
\section{Introduction}
In the measurement of linguistic distance, older work such as
\cite{seguy73} was able to measure distance in most areas of
linguistics, such as phonology, morphology, and syntax. The
features used for comparison were hand-picked based on
linguistic knowledge of the area being surveyed. These features,
while probably lacking in completeness of coverage, certainly allowed
a rough comparison of distance in all linguistic domains.
In contrast, computational methods have
focused on a single area of language. For example, a method for
determining phonetic distance is given by \cite{heeringa04}. Heeringa
and others have also done related work on phonological distance in
\cite{nerbonne97} and \cite{gooskens04}. A measure of syntactic
distance is the obvious next step: \cite{nerbonne06} provide one such
method. This method approximates internal syntactic structure using
vectors of part-of-speech trigrams. The trigram types can then be
compared for statistically significant differences using a permutation
test.
This study can be extended in a few ways.
First, the trigram approximation works well, but it
does not necessarily capture all the information of syntactic
structure such as long-distance movement. Second,
the experiments did not test data for geographical dialect variation,
but compared two generations of Norwegian L2 learners of English, with
differences between ages of initial acquisition.
We address these areas by using the syntactically annotated speech
section of the International Corpus of English, Great Britain (ICE-GB)
\cite{nelson02}, which provides a corpus with full syntactic annotations,
one that can be divided into groups for comparison. The sentences
of the corpus, being represented as parse trees rather than a vector
of POS tags, are
converted into a vector of leaf-ancestor paths, which were developed
by \cite{sampson00} to aid in parser evaluation by providing a way to
compare gold-standard trees with parser output trees.
In this way, each sentence produces its own vector of leaf-ancestor
paths. Fortunately, the
permutation test used by \cite{nerbonne06} is already designed to
normalize the effects of differing sentence length when combining POS
trigrams into a single vector per region. The only change needed is
the substitution of leaf-ancestor paths for trigrams.
The speakers in the ICE-GB are divided by place of birth into
geographical regions of
England based on the nine Government Office Regions, plus Scotland and
Wales. The
average region contains a little over 4,000
sentences and 40,000 words. This is less than the size of the
Norwegian corpora, and leaf-ancestor paths are more
complex than trigrams, meaning that the amount of data required for
obtaining significance should increase. Testing on smaller corpora
should quickly show whether corpus size can be reduced without losing
the ability to detect differences.
Experimental results show that differences can be detected among the
larger regions: as should be expected with a method
that measures statistical significance, larger corpora allow easier
detection of significance. The limit seems to be around 250,000 words for
leaf-ancestor paths, and 100,000 words for POS trigrams, but more careful
tests are needed to verify this.
Comparisons to judgments of dialectologists have not yet
been made. The comparison is difficult because of the
difference in methodology and amount of detail in
reporting. Dialectology tends to collect data from a few informants
at each location and to provide a more complex account of relationship
than the like/unlike judgments provided by permutation tests.
\section{Methods}
The methods used to implement the syntactic difference test come from two
sources. The primary source is the syntactic comparison of
\cite{nerbonne06}, which uses a permutation test, explained in
\cite{good95} and in particular for linguistic purposes in
\cite{kessler01}. Their permutation test
collects POS trigrams from a random subcorpus of sentences
sampled from the combined corpora. The trigram frequencies are
normalized to neutralize the
effects of sentence length, then compared to the
trigram frequencies of the complete corpora.
% \cite{nerbonne06} compare
% two generations of Norwegian L2 learners of English.
The principal difference between \cite{nerbonne06}'s work and ours is
the use of leaf-ancestor paths.
Leaf-ancestor paths were developed by \cite{sampson00} for
estimating parser performance by providing a measure of similarity of
two trees, in particular a gold-standard tree and a machine-parsed
tree. This distance is not used for our method, since for our purposes,
it is enough that leaf-ancestor paths represent syntactic information, such as
upper-level tree structure, more explicitly than trigrams.
The permutation test used by \cite{nerbonne06} is independent of the
type of item whose frequency is measured, treating the items as atomic
symbols. Therefore, leaf-ancestor paths should do just as well as
trigrams as long as they do not introduce any additional constraints
% constraints should=> statistical anomalies!
on how they are generated from the corpus. Fortunately, this is not
the case; \cite{nerbonne06} generate $N-2$ POS trigrams from each
sentence of length $N$; we generate $N$ leaf-ancestor paths from each
parsed sentence in the corpus. Normalization is needed to account for
the frequency differences caused by sentence length variation; it is
presented below. Since the numbers of trigrams and leaf-ancestor paths
generated for each sentence differ only by a constant of two, the same
normalization can be used for both methods.
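For concreteness, the following sketch (in Python; an illustration,
not the code used in either study) shows the trigram side of this
comparison: a tagged sentence of length $N$ yields $N-2$
part-of-speech trigrams.
\begin{verbatim}
# POS trigrams of a tagged sentence: a
# sentence of length N yields N-2 trigrams.
def pos_trigrams(tags):
    return [tuple(tags[i:i + 3])
            for i in range(len(tags) - 2)]

print(pos_trigrams(["Det", "N", "V", "Adv"]))
# [('Det', 'N', 'V'), ('N', 'V', 'Adv')]
\end{verbatim}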
\subsection{Leaf-Ancestor Paths}
Sampson's leaf-ancestor paths represent syntactic structure
by aggregating nodes starting from each leaf and proceeding up to
the root---for our experiment, the leaves are parts of speech.
This maintains constant input from
the lexical items of the sentence, while giving the parse tree some
weight in the representation.
For example, the parse tree
\[\xymatrix{
&&\textrm{S} \ar@{-}[dl] \ar@{-}[dr] &&\\
&\textrm{NP} \ar@{-}[d] \ar@{-}[dl] &&\textrm{VP} \ar@{-}[d]\\
\textrm{Det} \ar@{-}[d] & \textrm{N} \ar@{-}[d] && \textrm{V} \ar@{-}[d] \\
\textrm{the}& \textrm{dog} && \textrm{barks}\\}
\]
creates the following leaf-ancestor paths:
\begin{itemize}
\item S-NP-Det-the
\item S-NP-N-dog
\item S-VP-V-barks
\end{itemize}
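Generating such paths amounts to a single traversal of the parse
tree. The following sketch (in Python; an illustration, not the
implementation used for the experiments) assumes trees given as
(label, children) pairs, with each part of speech dominating its word
as a plain string, and produces the three paths above:
\begin{verbatim}
# Leaf-ancestor path extraction. Trees are
# (label, children) pairs; a part of speech
# dominates its word as a plain string.
def leaf_ancestor_paths(tree, above=()):
    label, children = tree
    if isinstance(children, str):  # POS node
        return [list(above) + [label, children]]
    paths = []
    for child in children:
        paths.extend(
            leaf_ancestor_paths(child,
                                above + (label,)))
    return paths

tree = ("S", [("NP", [("Det", "the"),
                      ("N", "dog")]),
              ("VP", [("V", "barks")])])
for p in leaf_ancestor_paths(tree):
    print("-".join(p))
# S-NP-Det-the
# S-NP-N-dog
# S-VP-V-barks
\end{verbatim}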
There is one path for each word, and the root appears
in all three. However, there can be ambiguities if some
node happens
to have identical siblings. Sampson gives the example
of the two trees
\[\xymatrix{
&&\textrm{A} \ar@{-}[dl] \ar@{-}[dr] &&&\\
&\textrm{B} \ar@{-}[d] \ar@{-}[dl] &&\textrm{B} \ar@{-}[d] \ar@{-}[dr] & \\
\textrm{p} & \textrm{q} && \textrm{r} & \textrm{s} \\
}
\]
and
\[\xymatrix{
&&\textrm{A} \ar@{-}[d] &&&\\
&&\textrm{B} \ar@{-}[dll] \ar@{-}[dl] \ar@{-}[dr] \ar@{-}[drr]&&& \\
\textrm{p} & \textrm{q} && \textrm{r} & \textrm{s} \\
}
\]
which would both produce
\begin{itemize}
\item A-B-p
\item A-B-q
\item A-B-r
\item A-B-s
\end{itemize}
There is no way to tell from the paths which leaves belong to which
B node in the first tree, and there is no way to tell the paths of
the two trees apart despite their different structure. To avoid this
ambiguity, Sampson uses a bracketing system; brackets are inserted
at appropriate points to produce
\begin{itemize}
\item $[$A-B-p
\item A-B]-q
\item A-[B-r
\item A]-B-s
\end{itemize}
and
\begin{itemize}
\item $[$A-B-p
\item A-B-q
\item A-B-r
\item A]-B-s
\end{itemize}
Left and right brackets are inserted, at most one
per path. A left bracket is inserted in a path containing a leaf
that is a leftmost sibling, and a right bracket is inserted in a path
containing a leaf that is a rightmost sibling. The bracket is inserted
at the highest node for which the leaf is leftmost or rightmost.
It is a good exercise to derive the bracketing of the previous two trees in detail.
In the first tree, with two B
siblings, the first path is A-B-p. Since $p$ is a leftmost child,
a left bracket must be inserted, at the root in this case. The
resulting path is [A-B-p. The next leaf, $q$, is rightmost, so a right
bracket must be inserted. The highest node for which it is rightmost
is B, because the rightmost leaf of A is $s$. The resulting path is
A-B]-q. Contrast this with the path for $q$ in the second tree; here $q$
is not rightmost, so no bracket is inserted and the resulting path is
A-B-q. $r$ is in almost the same position as $q$, but reversed: it is the
leftmost, and the right B is the highest node for which it is the
leftmost, producing A-[B-r. Finally, since $s$ is the rightmost leaf of
the entire sentence, the right bracket appears after A: A]-B-s.
At this point, the alert reader will have
noticed that both a left bracket and right bracket can be inserted for
a leaf with no siblings since it is both leftmost and rightmost. That is,
a path with two brackets on the same node could be produced: A-[B]-c. Because
of this redundancy, single children are
excluded by the bracket markup algorithm. There is still
no ambiguity between two single leaves and a single node with two
leaves because only the second case will receive brackets.
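A compact sketch of this bracket markup (again in Python, not
Sampson's implementation; the tree representation is the same
illustrative one as above) reproduces the bracketed paths of both
example trees:
\begin{verbatim}
# Bracket markup for leaf-ancestor paths.
# Trees are (label, children) pairs; leaves
# are plain strings.
def bracketed_paths(tree):
    paths = []

    def walk(node, labels, la, ra):
        label, children = node
        labels = labels + [label]
        d = len(labels)  # path index of a child
        last = len(children) - 1
        for k, child in enumerate(children):
            # highest node for which this child's
            # leaves are leftmost / rightmost
            cla = la if k == 0 else d
            cra = ra if k == last else d
            if isinstance(child, str):
                p = labels + [child]
                if last > 0:  # leaf has siblings
                    if k == 0:
                        p[cla] = "[" + p[cla]
                    if k == last:
                        p[cra] = p[cra] + "]"
                paths.append("-".join(p))
            else:
                walk(child, labels, cla, cra)

    walk(tree, [], 0, 0)
    return paths

t1 = ("A", [("B", ["p", "q"]),
            ("B", ["r", "s"])])
t2 = ("A", [("B", ["p", "q", "r", "s"])])
print(bracketed_paths(t1))
# ['[A-B-p', 'A-B]-q', 'A-[B-r', 'A]-B-s']
print(bracketed_paths(t2))
# ['[A-B-p', 'A-B-q', 'A-B-r', 'A]-B-s']
\end{verbatim}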
% See for yourself:
% \[\xymatrix{
% &\textrm{A} \ar@{-}[dl] \ar@{-}[dr] &\\
% \textrm{B} \ar@{-}[d] &&\textrm{B} \ar@{-}[d] \\
% \textrm{p} && \textrm{q} \\
% }
% \]
% \[\xymatrix{
% &\textrm{A} \ar@{-}[d] &\\
% &\textrm{B} \ar@{-}[dl] \ar@{-}[dr] & \\
% \textrm{p} && \textrm{q} \\
% }
% \]
% \cite{sampson00} also gives a method for comparing paths to obtain an
% individual path-to-path distance, but this is not necessary for the
% permutation test, which treats paths as opaque symbols.
\subsection{Permutation significance test}
Once the paths of each sentence have been generated from the corpus
and sorted by type into vectors, we must determine
whether the paths of one region occur in significantly different
numbers from the paths of another region. To do this, we calculate a
measure that characterizes the difference between two vectors as a single
number. \cite{kessler01} defines a simple measure called the
{\sc Recurrence} metric ($R$ hereafter), which
is the sum, over all path types $i$, of the absolute difference between
each token count $c_{ai}$ from the first corpus $A$ and the mean of that
count and the corresponding count $c_{bi}$ from the second corpus
$B$.
\[ R = \sum_i |c_{ai} - \bar{c}_i| \textrm{ where } \bar{c}_i = \frac{c_{ai} + c_{bi}}{2}\]
However, to find out if the value of $R$ is significant, we
must use a permutation test with a Monte Carlo technique described by
\cite{good95}, following
closely the same usage by \cite{nerbonne06}. The intuition behind
the technique is to compare the $R$ of the two corpora with the $R$ of
two random subsets of the combined corpora. If the random subsets' $R$s
are greater than the $R$ of the two actual corpora less than $p$ percent
of the time, then we can reject the null hypothesis that the two
were actually drawn from the same corpus: that is, we can assume that
the two corpora are different.
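The following sketch illustrates the metric and the Monte Carlo
test. It is deliberately simplified: it permutes individual path
tokens into two random halves, rather than sampling subcorpora of
sentences for comparison with the complete corpora, and it omits the
normalizations described next.
\begin{verbatim}
import random
from collections import Counter

# Recurrence metric R over two vectors of
# path-type counts.
def recurrence(ca, cb):
    types = set(ca) | set(cb)
    return sum(abs(ca[t] - (ca[t] + cb[t]) / 2.0)
               for t in types)

# Monte Carlo permutation test: estimate how
# often a random split of the pooled paths
# yields an R at least as large as observed.
def permutation_test(paths_a, paths_b,
                     trials=1000):
    observed = recurrence(Counter(paths_a),
                          Counter(paths_b))
    pooled = paths_a + paths_b
    hits = 0
    for _ in range(trials):
        random.shuffle(pooled)
        ra = Counter(pooled[:len(paths_a)])
        rb = Counter(pooled[len(paths_a):])
        if recurrence(ra, rb) >= observed:
            hits += 1
    return hits / float(trials)
\end{verbatim}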
However, before the $R$ values can be compared, the path counts in the
random subsets must
be normalized since not all paths will occur in every subset, and
average sentence length will differ, causing relative path frequency
to vary. There are two normalizations that must occur: normalization
with respect to sentence length, and
normalization with respect to other paths within a subset.
The first stage of normalization normalizes the counts for each path
within the pair of vectors $a$ and $b$. The purpose is to neutralize the
difference in sentence length, in which longer sentences with more
words cause paths to be relatively less frequent.
Each count is converted to a frequency $f$ \[f=\frac{c}{N}\] where
$c$ is either $c_{ai}$ or $c_{bi}$ from above and $N$ is the total
number of path tokens in the containing vector $a$ or $b$. This
produces two frequencies, $f_{ai}$ and
$f_{bi}$. Then each frequency is scaled
back up to a redistributed count by the equation
\[\forall j \in a,b : c'_{ji} = \frac{f_{ji}(c_{ai}+c_{bi})}{f_{ai}+f_{bi}}\]
This will redistribute the total of a pair from $a$ and $b$ based on
their relative frequencies. In other words, the total of each path
type $c_{ai} + c_{bi}$ will remain the same, but the values of
$c_{ai}$ and $c_{bi}$ will be balanced by their frequency
within their respective vectors.
For example, assume that the two corpora have 10 sentences each, with
a corpus $a$ with only 40 words and another, $b$, with 100 words. This
results in $N_a = 40$ and $N_b = 100$. Assume also that there is a
path $i$ that occurs in both: $c_{ai} = 8$ in $a$ and $c_{bi} = 10$ in
$b$. This means that the relative frequencies are $f_{ai} = 8/40 = 0.2$
and $f_{bi} = 10/100 = 0.1$. The first normalization will redistribute the
total count (18) according to relative size of the frequencies. So
\[c_{ai}' = \frac{0.2(18)}{0.2+0.1} = 3.6 / 0.3 = 12\] and
\[c_{bi}' = \frac{0.1(18)}{0.2+0.1} = 1.8 / 0.3 = 6\]
Now that 8 has been scaled to 12 and 10 to 6, the effect of sentence length
has been neutralized. This reflects the intuition that something that
occurs 8 of 40 times is more important than something that occurs 10
of 100 times.
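The following sketch (not the experiment's code) carries out this
first normalization on the worked example:
\begin{verbatim}
# First normalization: redistribute the total
# count of one path type between the two
# vectors according to its within-corpus
# relative frequency. n_a and n_b are the
# total path-token counts of each vector.
def redistribute(c_a, c_b, n_a, n_b):
    f_a = c_a / float(n_a)
    f_b = c_b / float(n_b)
    total = c_a + c_b
    return (f_a * total / (f_a + f_b),
            f_b * total / (f_a + f_b))

# The worked example above:
print(redistribute(8, 10, 40, 100))
# -> (12.0, 6.0)
\end{verbatim}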
% this is the (*2n) / N bit
The second normalization normalizes all values in both
permutations with respect to each other. This is simple: find the
average number of times each path appears, then divide each scaled
count by it. This produces numbers whose average is 1.0 and whose
values express how many times larger than the average each count is.
The average path
count is $N / 2n$, where $N$ is the number of path tokens in
both permutations together and $n$ is the number of path types. The division by
two is necessary because $N$ counts tokens from both permutations,
while each count being scaled comes from a single permutation. Each type entry in the
vector now becomes \[\forall j \in a,b : s_{ji} = \frac{2nc'_{ji}}{N}\]
Starting from the previous example, this second normalization first
finds the average. Assuming 5 unique paths (types) for $a$ and 30 for
$b$ gives \[n = 5 + 30 = 35\] and
\[N = N_a + N_b = 40 + 100 = 140\]
Therefore, the average path type has $140 / (2 \cdot 35) = 2$
tokens in each of $a$ and $b$. Dividing $c_{ai}'$ and $c_{bi}'$ by this average gives $s_{ai} = 6$
and $s_{bi} = 3$. In other words, path $i$ occurs in $a$ six times as often
as the average path type.
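A corresponding sketch of the second normalization, continuing the
same worked example:
\begin{verbatim}
# Second normalization: express each
# redistributed count as a multiple of the
# average count per path type, N / 2n.
def scale_to_average(c_prime, n_types, n_tokens):
    return 2.0 * n_types * c_prime / n_tokens

# Continuing the worked example
# (n = 35 types, N = 140 tokens):
print(scale_to_average(12, 35, 140))  # 6.0
print(scale_to_average(6, 35, 140))   # 3.0
\end{verbatim}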
%% This is definitely the code at the end of normaliseall
% The third stage of normalization normalizes all subset corpora with
% respect to each other. This process has the same effect as the second stage of
% normalization, but carried out over all subsets:
% each path type is averaged over every corpus; then each type is divided
% by that average. The final result is the same as the second normalization:
% each corpus consists of multiples of an average of $1.0$. The
% difference is that after the second normalization all types within a
% single subset corpus vary around an average of 1.0. After the third
% normalization, the values for a single type vary around an average of
% 1.0 across all subsets.
% For example,
% assume that $s_a = 6$ and $s_b = 3$ from the previous example are
% members of a subset corpus $C_1$ and that another corpus $C_2$
% has values $ss_a = 2$ and $ss_b = 3$ for the same path. Assume this
% path type has an average of 4 across all subsets. This gives a
% normalized value of $s'_a = 1.5$ and
% $ss'_a = 0.5$. Doing the same operation with an average $S_b = 3$ gives
% normalized values $s'_b = 1$ and $ss'_b = 1$. Remember, $s_a = 6$ means
% that $s_a$ appears 6 times more than the average path in its
% subset corpus. After the last normalization, $s_a = 1.5$ means that 6 is
% 1.5 times more than the average of this path across all
% subsets. Conversely, $ss_a$'s 2 times is only 0.5 the average for
% this path.
\section{Experiment and Results}
The experiment was run on the syntactically annotated part of the
International Corpus of English, Great Britain corpus (ICE-GB).
The syntactic annotation labels terminals with one of twenty parts of
speech and internal nodes with a category and a function
marker. Therefore, the leaf-ancestor paths each started at the root of
the sentence and ended with a part of speech.
For comparison to the experiment conducted by \cite{nerbonne06}, the
experiment was also run with POS trigrams. Finally, a control
experiment was conducted by comparing two permutations from the same
corpus and ensuring that they were not significantly different.
ICE-GB reports the place of birth of each speaker, which is the best
available approximation to which dialect a speaker uses. As a simple,
objective partitioning, the speakers were divided into 11 geographical
regions based on the 9 Government Office Regions of England with Wales
and Scotland added as single regions. Some speakers had to be excluded
at this point because they lacked birthplace information or were
born outside the UK. The regions varied in size; however, the average
number of sentences per corpus was 4682, with an average of 44,726
words per corpus (see table \ref{size}). Thus, the average
sentence length was 9.55 words. The average corpus was smaller than
the Norwegian L2 English corpora of \cite{nerbonne06}, which had two
groups, one with 221,000 words and the other with 84,000.
% NW=NW
% NE=Northumbria
% Yorkshire=Yorkshire/Humber
% East=East Anglia
% London=London
% Southeast=SE
% Southwest=West
% East-Midlands~=Middle England (except I think bigger)
% West-Midlands~=Heart of England (except smaller--nothing to S or E)
\begin{table}
\begin{tabular}{|lcc|} \hline
Region & Sentences & Words \\
\hline \hline
East England & 855 & 10471 \\ \hline
East Midlands & 1944 & 16924 \\ \hline
London & 24836& 244341 \\ \hline
Northwest England & 3219 & 27070 \\ \hline
Northeast England & 1012 & 10199 \\ \hline
Scotland & 2886 & 27198 \\ \hline
Southeast England & 11090 & 88915 \\ \hline
Southwest England & 939 & 7107 \\ \hline
West Midlands & 960 & 12670 \\ \hline
Wales & 2338 & 27911 \\ \hline
Yorkshire & 1427 & 19092 \\ \hline
\end{tabular}
\caption{Subcorpus size}
\label{size}
\end{table}
%Priscilla Rasmussen, ACL, 209 N. Eighth Street, Stroudsburg, PA 18360
\begin{table}
\begin{tabular}{|c|c|} \hline % or use * ** *** notation
Region & Significantly different ($p < 0.05$) \\ \hline
London & East Midlands, NW England \\
& SE England, Scotland \\ \hline
SE England & Scotland \\ \hline
\end{tabular}
\caption{Significant differences, leaf-ancestor paths}
\label{diffs}
\end{table}
Significant differences (at $p < 0.05$) were found when
comparing the largest regions, but no significant differences were
found when comparing small regions to other small regions. The
significant differences found are given in tables \ref{diffs} and
\ref{trigramdiffs}. It seems that the summed corpus size must reach a
certain threshold before differences can be observed reliably: about 250,000
words for leaf-ancestor paths and 100,000 for trigrams. There are exceptions in
both directions; the combined size of London and Wales is larger than
that of London
and the East Midlands, yet the former pair is not significantly different.
On the other hand, the combined size of Southeast England and
Scotland is only half that of the other significantly different comparisons; this
may be a result of
syntactic differences between these regions that are more extreme than elsewhere.
Finally, it is interesting to note that the summed Norwegian corpus
size is around 305,000 words, which is about three times the size needed
for significance as estimated from the ICE-GB data.
\begin{table}
\begin{tabular}{|c|c|} \hline % or use * ** *** notation
Region & Significantly different ($p < 0.05$) \\ \hline
London & East Midlands, NW England, \\
& NE England, SE England,\\
& Scotland, Wales \\ \hline
SE England & London, East Midlands, \\
& NW England, Scotland \\ \hline
Scotland & London, SE England, Yorkshire \\ \hline
\end{tabular}
\caption{Significant differences, POS trigrams}
\label{trigramdiffs}
\end{table}
\section{Discussion}
Our work extends that of \cite{nerbonne06} in a number of ways. We
have shown that an alternate method of representing syntax still
allows the permutation test to find significant differences between
corpora. In addition, we have shown differences between corpora divided
by geographical area rather than language proficiency, with many more
corpora than before. Finally, we have shown that the size of the
corpus can be reduced somewhat while still obtaining significant results.
Furthermore, we also have shown that both leaf-ancestor paths and POS
trigrams give similar results, although the more complex paths require more data.
However, there are a number of directions that this experiment should
be extended. A comparison that divides the speakers into traditional
British dialect areas is needed to see if the same differences can be
detected. This is very likely, because corpus divisions that better
reflect reality have a better chance of achieving a significant difference.
In fact, even though leaf-ancestor paths should provide finer
distinctions than trigrams and thus require more data for detectable
significance, the regional corpora presented here were smaller than
the Norwegian speakers' corpora \cite{nerbonne06} by up to a factor of
10. This raises the question of a lower limit on corpus size. Our
experiment suggests that the two corpora must have at least 250,000 words,
although we suspect that better divisions will allow smaller corpus sizes.
As corpus size is reduced, the growing number of smaller corpora
might as well be compared in an advantageous
order. It should be possible to cluster corpora by the point at which
they fail to achieve a significant difference when split from a
larger corpus. In this way, regions could be
grouped by their detectable boundaries, not a priori distinctions
based on geography or existing knowledge of dialect boundaries.
Of course this indirect method would not be needed if one had a direct
method for clustering speakers, by distance or other
measure. Development of such a method is worthwhile research for the future.
%THEND
\bibliographystyle{acl}
\bibliography{central}
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: