\documentclass[11pt,letterpaper]{article}
\pdfpagewidth=\paperwidth
\pdfpageheight=\paperheight
\usepackage{times}
\usepackage[all]{xy}
\usepackage{robbib}
\author{Nathan Sanders, Indiana University}
\title{Syntax Distance for Dialectometry}
\begin{document}
\maketitle
This dissertation will examine syntax distance in dialectometry using
computational methods. It continues my previous work \cite{sanders07},
\cite{sanders08b} and the earlier work of \namecite{nerbonne06}, which
introduced the first computational measure of syntax distance.
Dialectometry has existed as a field since \namecite{seguy73} and is a
subfield of dialectology \cite{chambers98}. Recently, computational
methods have come to dominate dialectometry, but they are narrower in
focus than earlier work: most have explored phonological distance
only, whereas earlier methods integrated phonological, lexical, and
syntactic data.

Dialectology is the study of linguistic variation. Its goal is to
characterize the linguistic features that separate two language
varieties. Dialectometry is a subfield of dialectology that uses
mathematically sophisticated methods to extract and combine
linguistic features. In recent years it has been associated with
computational linguistic work, most of which has focused on phonology,
starting with \namecite{kessler95} and treated most comprehensively by
\namecite{heeringa04}.

% MERGE
In dialectometry, a distance measure can be defined in two parts:
first, a method of decomposing the linguistic data into minimal,
linguistically meaningful features, and second, a method of combining
those features in a mathematically and linguistically sound way.
Dialectometry has focused on phonological distance measures, while
syntactic measures have remained undeveloped. Unfortunately, simply
copying the phonological approach is not a good solution. Dialectology
has traditionally worked with small corpora, which suffice for
phonology because reliable phonological features are easy to extract
automatically. For syntax, however, it is not possible to identify
reliable features automatically in small corpora.

A better approach is to define a distance measure that matches the
specific properties of syntactic data. Syntactic structure is easily
decomposed into a variety of features, and large syntactic corpora are
available for many languages. Together, these properties mean that
large corpora should provide enough evidence to rank automatically
extracted features. Based on work by \namecite{kessler01},
\namecite{nerbonne06} proposed a simple measure called $R$ together
with a test for statistical significance. At present, however, $R$ has
not been adequately shown to detect dialect differences. A small body
of work suggests that it does, but to date its results have not been
satisfactorily correlated with existing results from the dialectology
literature on syntax. \namecite{nerbonne06} found differences between
two generations of L2 English speakers, and I found differences
between most regions of England \cite{sanders08b}, but neither result
corresponded to existing dialectology beyond the general outline.
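
For concreteness, a minimal sketch of the measure under my reading of
\namecite{nerbonne06}, with notation chosen here only for exposition:
if $f_{1,i}$ and $f_{2,i}$ are the normalized frequencies of feature
$i$ (for example, a part-of-speech trigram) in the two collections
being compared, then
\[
R = \sum_i \left| f_{1,i} - f_{2,i} \right| ,
\]
and significance is estimated by a permutation test that repeatedly
reassigns sentences to the two collections at random and recomputes
$R$. The normalization and resampling details follow
\namecite{nerbonne06}; the formula above is only a summary.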
% SUMMARISE Questions/Hypotheses here
Therefore, the most important question this dissertation will address
is whether $R$ is a good measure of syntax distance. Specifically, are
the ambiguous results of previous research due to a shortcoming of
$R$, to differences between phonological and syntactic corpora, or to
differences between phonological and syntactic dialect boundaries?
This research will eliminate the corpus variability that introduced
these confounding factors in \namecite{sanders08b}. Then, to determine
whether $R$ agrees with traditional dialectology results for syntax,
the syntactic features it ranks highest will be enumerated and
compared to the syntactic dialect features found by dialectologists.
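
As an illustration only, and not a commitment to a particular
implementation, one natural way to enumerate the highest-ranked
features is to sort them by their contribution to $R$, taking the
per-feature contribution to be the summand $|f_{1,i} - f_{2,i}|$
sketched above. The function name and data layout below are invented
for this example.
\begin{verbatim}
# Hypothetical sketch: rank features by their contribution to R.
# f1, f2 map each feature (e.g. a POS trigram) to its normalized
# frequency in the two collections being compared.
def top_features(f1, f2, n=20):
    feats = set(f1) | set(f2)
    contribution = {i: abs(f1.get(i, 0.0) - f2.get(i, 0.0))
                    for i in feats}
    return sorted(feats, key=contribution.get, reverse=True)[:n]
\end{verbatim}
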
Besides $R$, this dissertation will propose and evaluate alternative
syntactic distance measures. Specifically, $R$ treats features as
atomic, making no use of their internal structure. As such, it is not
much different from Goebl's WIV. This may not be a problem if the
decomposition methods used to generate features adequately capture
dialect differences in independent, atomic features. However, since I
plan to investigate features from both constituent and dependency
representations of syntax, a more context-aware method of combination
may work better than treating features as independent.
% TODO: WIV, also Kullback-Leibler divergence could work.
% Maybe also k-NN/MBL, HMM binary classifier (?), maybe even a
% neural net
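
One concrete candidate, given here only as a reference definition
rather than a proposal: the Kullback-Leibler divergence compares whole
feature distributions instead of summing per-feature differences. For
distributions $P$ and $Q$ over the same feature set,
\[
D_{\mathrm{KL}}(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} ,
\]
which is asymmetric and requires $Q(i) > 0$ wherever $P(i) > 0$, so in
practice the feature counts would need smoothing before comparison.
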
Two secondary questions become relevant once a useful syntax distance
measure is established. The first is which features provide the best
results. My previous work has shown that leaf-ancestor paths provide a
small advantage over part-of-speech trigrams, presumably by capturing
syntactic structure higher in the parse tree. Further feature sets
will be evaluated to find other ways to represent syntactic
information. The second is whether $R$ agrees with phonological
distance measures such as Levenshtein distance. Unlike agreement with
traditional dialectology, there is no {\it a priori} reason to expect
agreement between phonology and syntax in delineating dialect
boundaries.
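
To make the two feature types concrete, the following toy sketch shows
how leaf-ancestor paths and part-of-speech trigrams can be read off a
small parse. It is illustrative only; the tree encoding and function
names are invented for this example and are not the extraction code
used in this dissertation.
\begin{verbatim}
# A parse tree is encoded as nested tuples: (label, child, ...) for an
# internal node, a plain string for a leaf (word).

def leaf_ancestor_paths(tree, prefix=()):
    """For each word, yield the node labels from the root down to it,
    with the word itself appended at the end."""
    if isinstance(tree, str):
        yield prefix + (tree,)
        return
    label, *children = tree
    for child in children:
        yield from leaf_ancestor_paths(child, prefix + (label,))

def pos_trigrams(tags):
    """Part-of-speech trigrams of a flat tag sequence."""
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

# Toy parse of "the dog barks".
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VBZ", "barks")))
for path in leaf_ancestor_paths(tree):
    print("-".join(path))                    # e.g. S-NP-DT-the
print(pos_trigrams(["DT", "NN", "VBZ"]))     # [('DT', 'NN', 'VBZ')]
\end{verbatim}
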
\bibliographystyle{robbib}
\bibliography{central}
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: