
\chapter{Discussion}
\label{discussion-chapter}

This chapter compares the dissertation's results to three areas. It
compares the results to dialectology, starting with the traditional
dialect regions of Sweden and moving to individual dialect phenomena. Then it
compares the results to phonological dialectometry, which uses many of
the same analytical techniques on phonology data. Finally, it compares
this work to previous work in the field of syntactic dialectometry and
summarizes its improvements. The chapter ends with a summary of the
work and its contributions to dialectology at large and Swedish
dialectology in particular.

% A big question is why trigrams are so good. All of the fancier feature
% sets do worse than trigrams. I should address this in the summary for
% sure.

% The reason is that the simpler tags are easier to tag. So I recommend
% for the real-world analyses with only untagged transcriptions as input
% to simply use trigrams.

\section{Comparison to Syntactic Dialectology}
\label{discussion-chapter-dialectology-section}

The comparison to syntactic dialectology consists of three
sections. The first section looks at the general expectations of
dialectology with respect to correlation with geographic distance. The
second section compares the traditional dialect regions of Sweden to
the ones found by the statistical dialect measure. The third section
finishes by comparing specific phenomena of dialect regions to the
corresponding features from interview sites.

\subsection{General Expectations}

The default expectation of dialect distance is that it should
correlate with geographic distance (see \namecite{chambers98} and
\namecite{gooskens04a}). The principal places where geographic distance
fails to correlate with dialect distance are dialect boundaries
between adjacent sites; here, a small geographic distance
is paired with a large dialect distance. For non-adjacent sites, in
contrast, a large geographical distance may be paired with a small
dialect distance. This can occur, for example, with relic dialects,
where the innovative dialect expands from the center, leaving similar
dialects isolated on the edges. However, neither of these cases holds
for the Scandinavian languages; \namecite{hallberg05} points out that
Swedish dialect areas form a continuous gradient without any strong
boundaries. This implies that the correlation between geography and
dialect should be particularly strong. Therefore, the first step is to
measure how well geographic distance correlates with dialect distance
as measured here.

Unfortunately, the correlations between geographic distance and
dialect distance are uniformly low, even when they do attain
significance. The highest correlation is 0.36. Correlating dialect
distance with travel distance rather than geographic distance gives
0.37, which is an improvement, albeit a small one. However, as
Gooskens points out, the time and distance required to travel between two
points at the beginning of the 21st century are considerably less than
they were one hundred years ago or more. Measuring travel time between
sites at some point in the past, as she does, might provide an even
better correlation with dialect distance.

Nonetheless, the overall pattern agrees with Hallberg's analysis:
there is a fairly smooth north-to-south gradient. The
composite cluster maps (figure \ref{map-composite-5-1000} in chapter
\ref{results-chapter}, for example) show this pattern best, but the
consensus tree and MDS maps show it as well. The exceptions to this
gradient are the areas surrounding Stockholm and Malm\"o, as well as
the whole of the southern provinces Sk\aa{}ne and Blekinge. It may be
that modern urbanization has created a city/country divide, with
Stockholm and Malm\"o innovating and the rural areas becoming relic
dialects. These two exceptions will be discussed more in the next
section.

\subsection{Dialect Regions}
% so which is it? A city/country divide or is just that the
% traditional areas were Right All Along.
% I guess upon reflection, it's probably the latter...

According to dialectology, Sweden does not have strong dialect
boundaries, but it still has some traditional dialect areas. However,
these are loosely defined and do not have sharp borders; the Eastern
area is centered around Stockholm, the Western around G\"oteborg, the
Southern around Malm\"o, and the Northern area covers the north of
Sweden. In addition, the island of Gotland forms a separate area. The
MDS maps and consensus tree maps reproduce these areas with varying
degrees of fidelity.

For example, in the consensus tree figure \ref{map-consensus-5-1000}, the
cyan cluster corresponds to the Northern and Western dialect areas,
the orange cluster corresponds to the Eastern area, and the red/yellow
cluster corresponds to the Southern. This grouping raises a question,
though: why should the northern and western areas
appear in the consensus tree as one group? It looks as if the
consensus tree map makes it more important that they differ from the
East and South than that they differ from each other. The MDS maps
reinforce this point; they show that the western sites and northern sites
do in fact differ quite a bit. However, because the eastern and
southern sites are so close, a clustering technique, like consensus
trees, with exclusive group membership will put distant sites in the
same group.

The boundary between Sk\aa{}ne and Blekinge is quite abrupt,
presumably mirroring the former Danish border that existed until the
end of the Middle Ages. This contradicts Hallberg, who explicitly
mentions that dialectology research finds no border there, and
that the strongest north/south division more closely approximates
\quotecite{leinonen08} diagonal boundary in map
\ref{leinonen-factors-3-4} below.

There are three possible explanations for this: first, there could be
statistical, accumulative evidence which Swedish dialectologists have
missed; second, the distribution of Swediasyn interview sites may be
too sparse to reflect the real border; in particular, there are very
few in Sm\aa{}land; third, the dialect landscape may have changed
since the prevailing dialectology opinion was established.  The last
explanation is attractive, since the Swedia corpus is around 50 years
newer than the newest dialectology studies. However, this is an old
boundary: it mirrors the Sweden-Denmark political border that existed
over 400 years ago. It would be odd for it to disappear for over 350
years and re-appear just before 2000. Instead, I believe the first
explanation is more likely: Leinonen's results, in addition to
reproducing the boundary described by Hallberg, also place a boundary
at the same location as these syntactic results. This boundary is
visible in factors 2 (figure \ref{leinonen-factors-1-2}) and 5 (figure
\ref{leinonen-factors-5-6}) of her factor analysis based on the
phonology data of Swedia, discussed below in section
\ref{discussion-chapter-phonological-dialectometry}. Both Leinonen's
method and mine are capable of detecting distributional patterns that
are difficult to see from manual analysis. For example, in the
previous chapter, I showed that the trigram AB-AV-AB, despite
appearing in all interview sites, was more common in the central Swedish
cluster A.

% The second explanation may also be true, considering that Leinonen's
% results reproduce the boundary that Hallberg mentions, but it does not
% invalidate the boundary that the syntactic measure detects, since her
% results reproduce it as well.

% remember, these produce a
%   city/country divide, a boundary at Jamshog, and something kind of
%   near Goteborg.

\subsection{Dialect Features}

The literature for Swedish syntactic dialectology is not extensive,
largely because there is not much syntactic dialectology for any
language. As a result, I will compare my results to two papers,
\namecite{delsing03} and \namecite{rosenkvist07}. The first paper is a
survey of syntactic dialectology from the late 19th and early 20th
centuries. In the same volume, other papers analyze specific phenomena
in more detail; the survey is mostly concerned with the dialect differences
and distributions rather than the syntactic analysis. The second is an
analysis of the South Swedish Apparent Cleft.

\namecite{delsing03} surveys a number of dialectology studies. These
studies date from the height of the field in Sweden, from circa
1880--1930, which Delsing at times augments with modern data. It is
worth noting that the Swedia data in the comparison was collected
around 2000, so there were likely changes in the dialects in the
intervening 70--120 years. This is particularly true in the northern
dialect areas, where improved travel and communication have
leveled the dialects considerably \cite{hallberg05}.

However, comparison to the phenomena in the survey may still yield
interesting results, so for each
phenomenon I will start with a summary of the phenomenon for Swedish
dialects: its geographic distribution and its linguistic
realizations. Then I will match the geographic distribution with
Swediasyn interview sites and represent the phenomenon in terms of the
feature sets developed in this dissertation. For this initial
analysis, trigram features are used because they are simple. This
matches Delsing's survey descriptions, which are for the most part
surface-oriented; other papers in the same volume with his survey
analyze the phenomena in more detail.

With the target sites and features defined, it is straightforward to count the
number of occurrences of each feature in each site and compare the
two. If the predicted dialect phenomenon is reflected in the data,
then the sites associated with the phenomenon will have more
occurrences of the target features than the non-associated sites. This
difference is precisely what the distance measures use.
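
As a rough sketch of this comparison, the occurrence rate of a set of
target features $F$ in a group of sites $R$ can be written as below. The
symbols $c_s(f)$ (the count of feature $f$ at site $s$) and $N_s$ (the
total number of feature tokens at site $s$) are notation introduced here
for illustration only, and normalizing by the total token count is an
assumption about how the rates are computed:
\[
  r_R(F) = \frac{\sum_{s \in R} \sum_{f \in F} c_s(f)}{\sum_{s \in R} N_s}
\]
Under this notation, a dialect phenomenon associated with the sites $R$
is reflected in the data when $r_R(F) > r_{\bar{R}}(F)$, where $\bar{R}$
is the set of all remaining sites.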

This method has two limitations. First, the translation of
linguistic analysis to feature representation will not be perfect and
may miss some valid instances of the linguistic phenomenon. Second,
and more importantly, the differences are not yet checked for statistical
significance. As such, the comparison can only be suggestive;
checking for statistical significance will have to wait for future
work.
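
One candidate for such a check, offered only as a sketch and not as a
commitment to a particular test, is a two-proportion $z$-test. Here
$x_1$ and $x_2$ are the counts of the target features in the associated
and non-associated sites, and $n_1$ and $n_2$ are the corresponding
total numbers of feature tokens; all four symbols are hypothetical
notation introduced for this illustration:
\[
  \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad
  z = \frac{x_1/n_1 - x_2/n_2}
           {\sqrt{\hat{p}\,(1 - \hat{p})\left(1/n_1 + 1/n_2\right)}}
\]
This test assumes independent observations, which overlapping trigrams
drawn from running text do not satisfy, so a permutation test over sites
may turn out to be more appropriate.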

% As an aside, much of this missing information IS available to me, so I
% could look manually. But none of it made it through to the distance
% measures, and this analysis compares the way the distance measures
% make the decisions with the way that linguists make their
% decisions. So I have to use only the information that the distance
% measures used.

The maps reproduced here are taken from Delsing's survey.

\subsubsection{``Partitive'' Article}

Northern Sweden uses the suffixed article much more than the rest of
Sweden. The reason, Delsing says, is that some uses of the suffixed
article are not definite in the north; they have a partitive function,
similar to the partitive article in French, which is not present in
the rest of the country. See figure \ref{partitive-article} for an
example.

\begin{figure}
  \gll H\"a finns vattne d\"ari hinken. \\
  Here found water-the in bucket-the \\
  \trans `There is water in the bucket.'
  \caption{Suffix marking for partitive}
  \label{partitive-article}
\end{figure}

Unfortunately, the part-of-speech tag set used for this dissertation
is quite coarse; it does not record whether nouns are marked with the
definite suffix. Therefore, there is no way for the distance measure
to tell the difference between suffixed dialect usage and bare
standard usage.

\subsubsection{Proper-Noun Articles}

\begin{figure}
  \includegraphics[scale=0.7]{dialektboka-karta3}
  \caption{Proper-Noun Articles}
  \label{indefinite-article-proper-noun-map}
\end{figure}

\begin{figure}
 \gll En Bjurstr\"om ha aff\"arn. \\
  A Bjurstr\"om has the-store. \\
  \trans `Bjurstr\"om has the store.'
  \caption{Indefinite Article for Proper Nouns: First Names}
  \label{indefinite-article-proper-noun}
\end{figure}

In Northern Scandinavia, first names are preceded by an indefinite
article, and sometimes last names as well. The indefinite article also
precedes kinship terms that are used as proper names, for example
``Mother'' or ``Grandfather''. An example is given in figure
\ref{indefinite-article-proper-noun}. Standard Swedish does not
include this feature. In Sweden, this feature is found along the
border with Norway as well as in northern Sweden. In the Swediasyn data, this
includes the interview sites K\"ola, Indal, and Anundsj\"o---the dark
area in figure \ref{indefinite-article-proper-noun-map}, there labeled
``Prepropriell artikel''.

Unlike the partitive article suffix, this feature is easy to detect
with a coarse part-of-speech tag set. Specifically, it can be represented as the
bigram EN-PN (indefinite article-proper noun), which can be used as a
search term in the trigram feature set. The same EN-PN sequence is
expected for leaf-head paths, since the indefinite article depends on
the proper noun. The phrase-structure-rule features should
look something like NP$\to$EN-PN.

Occurrences of the EN-PN bigram in the trigram feature set for
Leksand, Indal and K\"ola agree with the linguistic analysis: a rate
of 0.00007 at these sites versus 0.00006 elsewhere. Unfortunately, this
result cannot be trusted because the rate of occurrence is so low in
both regions and so similar between them. The only conclusion that can
be drawn is that the hypothesis is not yet disproved.

\subsubsection{Possessives and the article}

\begin{figure}
  \includegraphics[scale=0.7]{dialektboka-karta4}
  \caption{Possessives with Articles}
  \label{possessive-plus-article-map}
\end{figure}

\begin{figure}
 \gll naboens den stribede kat \\
  Neighbors' the striped cat \\
  \trans `The neighbors' striped cat'
  \caption{Simultaneous possessive and determiner in noun phrase in
    Danish, and at one time Southwest Sweden}
  \label{possessive-plus-article-example}
\end{figure}

In Swedish, and in the other Scandinavian languages, there is a good
deal of variation in the handling of possessives with articles. In
Swedish, normally only one is allowed in a noun phrase: either a
possessive or a determiner, but not both. However, in Danish and the
Danish-influenced areas of Sweden, both are allowed in certain cases:
for example, when the possessive and determiner are separated from the
noun by an adjective. Delsing gives an example from Danish, shown here
in figure \ref{possessive-plus-article-example}.  This pattern also
exists in the southwest corner of Sweden, very near to Denmark. In
figure \ref{possessive-plus-article-map}, this area is shaded
left-to-right diagonally; it includes the interview site Bara. In
addition, this pattern alternates with the standard Swedish pattern on
the island of Gotland (cross-hatched on the map), which includes the
interview sites Fole, F\aa{}r\"o and Sproge.

This pattern can be detected by analyzing the per-site recall for the
4-grams PO-PO-AJ-NN, PR-PO-AJ-NN and NN-PO-AJ-NN. The first is the
sequence pronoun-pronoun-adjective-noun, for example {\it mitt det
  gamla huset} ``My the old house-the''. The second starts with a
proper name, such as {\it Pers} ``Per's'', and the third starts with a
noun, such as {\it naboens} ``neighbor's''. These three 4-tag sequences
can be encoded as trigrams by breaking each into two overlapping pieces.
This allows them to be searched for in the form in which the distance
measure would have encountered them.
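
To illustrate the decomposition, the sketch below assumes that each
4-gram is split into its first three tags and its last three tags:
\[
  \begin{array}{lcl}
    \mbox{PO-PO-AJ-NN} & \longrightarrow & \mbox{PO-PO-AJ},\ \mbox{PO-AJ-NN} \\
    \mbox{PR-PO-AJ-NN} & \longrightarrow & \mbox{PR-PO-AJ},\ \mbox{PO-AJ-NN} \\
    \mbox{NN-PO-AJ-NN} & \longrightarrow & \mbox{NN-PO-AJ},\ \mbox{PO-AJ-NN}
  \end{array}
\]
Under this assumption, all three 4-grams share the final trigram
PO-AJ-NN, so it is the initial trigrams that distinguish the three
variants from one another.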

In addition to this pattern, there is a second in the north of
Sweden. Here, it is simply that possessive personal pronouns are
allowed both before and after the noun. This pattern includes the
interview sites Indal and Anundsj\"o and is covered in the next
section.

Searching Bara, in the southwest of Sweden, for the previously
mentioned trigram patterns does not find them: the rate of occurrence
is 0.00289 inside Bara but 0.000341 outside. It should be higher in
Bara. However, Delsing, writing in 2003, mentions that residents of
Sk\aa{}ne that he has asked do not recognize this form either, so it is
possible that it has fallen out of use in the 70 years or so since it
was last reported.

A similar search for the alternation of standard
Swedish with the possessive pronoun pattern in the Gotland sites (F\aa{}r\"o,
Fole and Sproge), using the standard Swedish trigrams PO-AJ-NN, PR-AJ-NN and
NN-AJ-NN, shows similar results: 0.00441 in Gotland versus 0.00495 outside
Gotland. This is opposite the predicted direction.

The final region in figure \ref{possessive-plus-article-map}, in northern
Sweden, which includes Indal and Anundsj\"o, is actually more
complicated than can be captured by the part-of-speech tags used here;
this region allows possessive proper nouns to occur with
suffix-determiner nouns. But this can occur in either order: for
example, both ``Pers huset'' and ``huset Pers'' are allowed. Although
both ``Pers hus'' and ``Pers huset'' produce identical tags (PN-NN),
trigrams do encode order, so the unusual order in ``huset Pers'' can be
searched for. Since both orders should be present in this northern
area, it should overuse bigrams like NN-PN (noun-proper noun) relative
to the rest of Sweden.

Searching for the bigrams NN-PN (noun-proper noun) and NN-PO
(noun-pronoun) shows a usage rate of 0.02532 for Indal and Anundsj\"o
and a rate of 0.02438 for the rest of Sweden. This is the expected
direction, but the rate of usage is very similar between the two
regions. The comparison is really too close to support a conclusion
because the difference is not likely to be significant.

% \subsubsection{Pronominal Possessives}

% In Swedish, as well all of mainland Scandinavian, another possessive
% construction is the reflexive genitive, which consists of a
% noun-reflexive-noun sequence. An example is given in figure
% \ref{genitive-reflexive-normal-example}. However, this construction
% does not allow pronouns: the sequence noun-reflexive-pronoun is not
% allowed (see figure \ref{genitive-reflexive-pronoun-example}).

% \begin{figure}
%   {\it Per sitt hus} \\
%   Per its house
% ``Per's house'' \\
% \caption{Standard Swedish genitive reflexive construction}
% \label{genitive-reflexive-normal-example}
% \end{figure}

% \begin{figure}
%   {\it han sitt hus} \\
%   his its house
% ``his house'' \\
% \caption{Pronominal genitive reflexive construction}
% \label{genitive-reflexive-pronoun-example}
% \end{figure}

% However, this construction is allowed in NORTHERN SWEDEN.
% Oops, actually I think this whole section is whole throwaway intro to
% something else. Boooooo.

% However, this is not allowed with possessive pronouns:
% *{\it han sitt hus}.

% The prepositional genitive
% behaves the same way: {\it huset till Per} ``Per's
% house'' (gloss: house-the of Per) is legal but *{\it huset till meg}
% ``my house'' (gloss: house-the of me) is not.

% There is an exception for kinship words, which I don't understand
% yet. But somehow ``far min'' is different (maybe just because it's not
% ``min far''?)

% So basically standard Swedish allows trigrams sequences like NN-PO-NN
% ({\it Per sitt hus}) but not PO-PO-NN ({\it han sitt hus}). It also
% allows sequence like NN-PR-NN ({\it huset till Per}) but not NN-PR-PO
% ({\it huset till meg}).

% Does not work (is too close to call): 0.02229 vs 0.2429

% Reversing the bigram, looking for PO-NN in the south gives
% Works (but is still super close): 0.04243 vs 0.03998

% It looks like one set just uses more nouns than the others or
% something. Conclusion: inconclusive, leaning toward no---it looks like
% they're the same.

\subsubsection{Proper Noun Possessives}

\begin{figure}
 \gll Huset hans Per \\
  The-house his Per \\
  \trans `Per's house'
  \caption{Possessive formed of Possessive Pronoun and Proper Noun}
  \label{proper-noun-post-possessive}
\end{figure}

\begin{figure}
  \includegraphics[scale=0.7]{dialektboka-karta6}
  \caption{Proper-Noun Possessives}
  \label{proper-noun-post-possessive-map}
\end{figure}

In addition to the post-nominal possessive pattern of the previous
section, there is a variant that is common in Norway. Here, the
sequence is noun-possessive pronoun-proper noun. An example
of this pattern is given in figure \ref{proper-noun-post-possessive}.

This pattern overlaps slightly into Sweden, covering the interview
site K\"ola. The distribution is given in figure
\ref{proper-noun-post-possessive-map}. Note that the northern area
with small stripes is the same as in figure
\ref{possessive-plus-article-map}, and the northern area with thin
stripes has no matching sites. The area of interest is the one with
larger, thick stripes that covers the majority of Norway.

This phenomenon maps to a trigram NN-PO-PN: noun-pronoun-proper
noun. The occurrence rate of this trigram in K\"ola compared to the rest of
Sweden is 0 versus 0.00001. This is the wrong direction, and the value is
so low that it is probably noise. There are two possible causes for
this essentially zero result: either neither region has this feature
or there is not sufficient data to tell.

\subsubsection{Noun possessives}

Delsing mentions briefly that central Sweden, including \"Alvdalen and
V\"asterdalarna, uses the dative form of nouns for the
s-genitive. However, the part-of-speech tag set used here does not
distinguish between dative and other cases on nouns, so it is not
possible to represent this phenomenon in a way that the distance
measures could have used.

\subsubsection{Double indefinite}

In northern Sweden and northern Norway, indefinite articles are used
both before and after adjectives when modifying nouns. In map
\ref{double-indefinite-map}, this is the area covered by dark
diagonals, labeled ``Postadjektivisk artikel''. Delsing also calls
this the ``double indefinite''; for an example, see figure
\ref{double-indefinite-example}. One indefinite article is used after
each adjective, even for multiple adjectives, so {\it en stor en bil}
(a large car) but also {\it en stor en fin en bil} (a large fine car).

\begin{figure}
 \gll en stor en bil \\
  a large a car \\
  \trans `A large car'
  \caption{Double Indefinite}
  \label{double-indefinite-example}
\end{figure}

\begin{figure}
  \includegraphics[scale=0.7]{dialektboka-karta8}
  \caption{Double indefinite (post-adjectival articles)}
  \label{double-indefinite-map}
\end{figure}

In central Sweden, a similar pattern occurs, but the article is not
perceived as independent. Instead it is perceived as a suffix of the
adjective. In other words, the above example is perceived as {\it en
  stor-en bil} instead. According to Delsing, there is a difference in
intonation compared to the North Swedish construction, which does not
stress the intermediate articles nor co-ordinate them morphologically
as would be expected with a suffix. Unfortunately, this pattern
appears identical to the ordinary Swedish case given the coarse
part-of-speech tag set in use.  In contrast, the first pattern is
quite easy to represent with trigrams: the 4-gram EN-AJ-EN-NN and the
6-gram EN-AJ-EN-AJ-EN-NN---alternating series of indefinite articles
and adjectives ended by a noun. These larger n-grams can be broken
into the trigrams EN-AJ-EN and AJ-EN-NN in order to search for them in
the Swedia-based data.

The northern pattern includes the interview sites Anundsj\"o and
Indal. When measured, these trigrams occur at a rate of 0.00054 there,
compared to 0.00012 in the rest of Sweden. From this we
can conclude that this is a rare phenomenon, but one that occurs in
the north about 4.5 times as often as in the rest of Sweden.

\subsubsection{Double Definite}

\begin{figure}
 \gll det store huset \\
  The large the-house \\
  \trans `The large house'
  \caption{Double definite (Sweden and Norway)}
  \label{double-definite-example}
\end{figure}
\begin{figure}
 \gll det store hus \\
  The large house\\
  \trans `The large house'
  \caption{Single Definite (Denmark)}
  \label{single-definite-example}
\end{figure}
\begin{figure}
  \gll gamla h\'usid \\
  old house-the \\
  \trans `The old house'
  \caption{Single definite suffix (Iceland)}
  \label{single-definite-suffix-example}
\end{figure}
\begin{figure}
 \gll storhuset \\
  large-house-the \\
  \trans `The large house'
  \caption{Single definite suffix with combined adjective (Northern Sweden)}
  \label{adjective-single-definite-suffix-example}
\end{figure}

\begin{figure}
  \includegraphics[scale=0.7]{dialektboka-karta9}
  \caption{Double definite (and combined adjectives)}
  \label{double-definite-map}
\end{figure}

Double-definite with adjectives is standard in Sweden and Norway,
where there is a definite article as well as a definite suffix on the
noun (see figure \ref{double-definite-example}). This is not the case
in Denmark (figure \ref{single-definite-example}), where the definite
suffix disappears in case of a definite article, nor in Iceland, where
the definite is suffix-only and there is no article (figure
\ref{single-definite-suffix-example}).

However, in North Sweden, there is a fourth option, where the
adjective combines with the noun into a single word (figure
\ref{adjective-single-definite-suffix-example}). Delsing gives
examples like {\it storhuset} (the big house) and {\it
  stor-svart-gamm-katta} (the big, black, old cat), in which a series
of adjectives appear prefixed to a noun without their usual
morphological inflection. In Norrland, Delsing finds that this
construction is used almost to the exclusion of the normal Swedish
one. Further south, the two co-exist.


Since the annotation scheme does not differentiate between
a combined noun like {\it storhuset} and a normal noun like {\it
  huset}, the better way to detect the regional difference is to count
the rate of normal trigrams like PO-AJ-NN (pronoun-adjective-noun);
this is the feature type that occurs rarely or not at all in the
north. If the region division in map \ref{double-definite-map} is
detected, then northern Sweden will have a lower rate of occurrence of
these standard trigrams.

As before, the two northern sites are Indal and Anundsj\"o. The rate
of PO-AJ-NN in this region is 0.00152, compared to 0.00216 for the
rest of Sweden. This difference is in the right direction, and it is
larger than most of the other comparisons here. However, like the
other comparisons, it has not been checked for significance so it is
currently only suggestive.

\subsubsection{Rosenkvist's Analysis of the South Swedish Apparent Cleft}

\begin{figure}
 \gll Det \"ar som han har missuppfattat. \\
  it is {\it som} he has misunderstood \\
  \trans `He has misunderstood.'
  \caption{Apparent Cleft}
  \label{apparent-cleft-example1}
\end{figure}

\begin{figure}
 \gll Det \"ar bara som han finner p\aa{}. \\
  it is only {\it som} he finds-on \\
  \trans `He just makes it up.'
  \caption{Apparent Cleft with adverb expressing speaker attitude}
  \label{apparent-cleft-example2}
\end{figure}


\namecite{rosenkvist07} analyzes a phenomenon he calls the South
Swedish Apparent Cleft. It involves an embedded clause, similar to a
cleft, but with no clefted constituent. Instead, the subordinating
conjunction {\it som} is directly preceded either by the verb or an
adverb expressing speaker attitude. The subject of the {\it
  som}-clause must be a pronoun, though Rosenkvist notes that this may
be a pragmatic, not a syntactic, restriction. The two main variants
are given in figures \ref{apparent-cleft-example1} and
\ref{apparent-cleft-example2}, but the apparent cleft is also found in
yes/no questions and embedded clauses.

Unfortunately, Rosenkvist does not give a comprehensive syntactic
analysis of the apparent cleft. This means that a translation to our
feature set based on his description will necessarily be
surface-oriented in the same way that his analysis and results are
surface-oriented.

Accordingly, translating the sequences like {\it Det \"ar som han
  \ldots} gives the 4-gram PO-AV-UK-PO, and {\it Det \"ar bara som han
  \ldots} gives the 5-gram PO-AV-AB-UK-PO (pronoun-be
verb-adverb-subordinating conjunction-pronoun). Although these
part-of-speech sequences can obviously appear in other contexts, they
should appear more in the region that has apparent clefts than in the
region that does not. Converting these sequences to trigrams is
straightforward, producing 5 unique trigrams of interest, which the
distance measures should also have used to obtain their distances.
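
Concretely, assuming that each sequence is split into all of its
consecutive trigrams, the two sequences decompose as follows:
\[
  \begin{array}{lcl}
    \mbox{PO-AV-UK-PO} & \longrightarrow &
      \mbox{PO-AV-UK},\ \mbox{AV-UK-PO} \\
    \mbox{PO-AV-AB-UK-PO} & \longrightarrow &
      \mbox{PO-AV-AB},\ \mbox{AV-AB-UK},\ \mbox{AB-UK-PO}
  \end{array}
\]
No trigram is shared between the two sequences, which gives the
$2 + 3 = 5$ unique trigrams.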

Rosenkvist captures the geographical distribution of the apparent
cleft in two ways. He first consults two collections of Swedish
novels, using the authors' birthplaces as proxies for their
dialect. Second, he uses the results of a questionnaire that he issued
to university students at several Swedish universities: Stockholm,
Gothenburg, Lund and Ume\aa{}.

Using author birthplace as a proxy for dialect, the apparent cleft can
be seen throughout southern and middle Sweden---this includes all the
interview sites except \AA{}rsunda, Indal and Anundsj\"o. However, based
on the survey results, the apparent cleft is only accepted by speakers
from Halland, Sm\aa{}land and Sk\aa{}ne. This includes the interview sites
Frilles\aa{}s, V\aa{}xtorp, Ankarsrum, Tors\aa{}s, Bara, L\"oderup, Norra
Rorum and \"Ossj\"o.

Therefore, the test for this comparison is the occurrence rates
for the 5 trigrams based on the two common forms Rosenkvist gives as
examples, with two variations: one region division based on author
birthplaces and one region division based on the student survey. The
southern region in both cases should have more occurrences of the
target trigrams.

For the larger cleft region division based on author birthplaces, the
comparison goes in the expected direction: a rate of 0.02430 in the
south and 0.02427 in the north. But these rates are so close to identical
that they should not be regarded as different. For the smaller division
based on the student survey, the comparison goes in the opposite
direction: 0.02264 in the south and 0.02491 in the north. Again, this
is not much of a difference.

With such a small difference, it is not possible to draw any
conclusions or even suggest whether the distance measures consistently
notice this difference. One problem is that it is hard to capture a
phenomenon like this with trigrams, where the surface form is only
subtly different from that produced by other syntactic structures. A
more complete syntactic analysis of the phenomenon is needed so that
more advanced feature sets from dialectometry can be used to compare
to the results from dialectology.

\subsection{Conclusion}

The dialect constructions surveyed here do not support the agreement
of the new dialectometry results with existing dialectology results
nearly as well as the previous sections, which compared the results at
a less detailed level. The larger problem is that no good method yet
exists for making this comparison; the differences were in some cases
large enough to be suggestive, but without significance testing, it is
not possible to know whether they are reliable. It is possible that the
small differences are significant and are already being used by the
distance measures to distinguish regions; after all, the aggregation of
many small differences is inherent in the workings of the statistical
approach in this dissertation.
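
For reference, several of the reported rates can be restated as ratios
of the rate in the region predicted to show the phenomenon to the rate
in the rest of Sweden. The ratios below are computed directly from the
rates given in the preceding subsections and carry no information about
significance:
\[
  \begin{array}{lcl}
    \mbox{EN-PN (proper-noun articles)} &
      0.00007 / 0.00006 & \approx 1.17 \\
    \mbox{NN-PN, NN-PO (northern possessives)} &
      0.02532 / 0.02438 & \approx 1.04 \\
    \mbox{EN-AJ-EN, AJ-EN-NN (double indefinite)} &
      0.00054 / 0.00012 & = 4.5 \\
    \mbox{PO-AJ-NN (double definite, predicted lower)} &
      0.00152 / 0.00216 & \approx 0.70 \\
    \mbox{Apparent cleft, author birthplaces} &
      0.02430 / 0.02427 & \approx 1.00 \\
    \mbox{Apparent cleft, student survey} &
      0.02264 / 0.02491 & \approx 0.91
  \end{array}
\]
Only the double indefinite and the double definite depart from 1 by
more than about 20\%.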

\section{Comparison to Phonological Dialectometry}
\label{discussion-chapter-phonological-dialectometry}
\begin{figure}
  \includegraphics[scale=0.4]{leinonen-factors-1-2}
  \caption{Factors 1 and 2 of Swedish vowels}
  \label{leinonen-factors-1-2}
\end{figure}

\begin{figure}
  \includegraphics[scale=0.4]{leinonen-factors-3-4}
  \caption{Factors 3 and 4 of Swedish vowels}
  \label{leinonen-factors-3-4}
\end{figure}

\begin{figure}
  \includegraphics[scale=0.4]{leinonen-factors-5-6}
  \caption{Factors 5 and 6 of Swedish vowels}
  \label{leinonen-factors-5-6}
\end{figure}

The comparison to phonological dialectometry is currently difficult in
two ways. First, there are few statistical methods in phonological
dialectometry. I proposed a simple Bayesian method \cite{sanders06}
and \namecite{hinrichs07} proposed two more complex methods, one
vector-based and the other from information theory. However, these
methods are less effective on small corpora than Levenshtein distance
and have not gained traction in the field.  Second, even comparing
results only, there has been little Swedish dialectometry to date. To
my knowledge, the only paper at the time of this writing is
\namecite{leinonen08}; its method is more similar to
\quotecite{spruit08} approach to syntax. It uses factor analysis to
characterize the distribution of nine phonological variables across
Sweden, but does not cluster the sites based on these
variables. However, the overall regions can still be compared. I
compare Leinonen's individual feature maps to my composite cluster and
MDS maps.

In addition, Leinonen's dissertation, currently unpublished, will
cover phonological dialectometry of Sweden comprehensively. In future
work, a better comparison should be possible, since both dissertations
are based on the same corpus.

Looking at Leinonen's first two maps, reproduced here as figure
\ref{leinonen-factors-1-2}, we see patterns similar to the
city/countryside difference from the syntactic results: in the first diagram,
Stockholm and Uppsala differ from the rest of the country, and in the
second, the Stockholm, Uppsala and Malm\"o areas all differ.

In Leinonen's third and fourth maps (figure \ref{leinonen-factors-3-4}),
there is a north/south divide roughly halfway between Stockholm and
Malm\"o. This boundary generally reflects the north/south
gradient from my results. However, the phonological boundary is stronger and more
localized than the numerous small syntactic ones, such as those seen in
the composite cluster map \ref{map-composite-5-1000}. It is closer to
the diagonal north/south boundary mentioned by \namecite{hallberg05}.

The fifth map (figure \ref{leinonen-factors-5-6}) is more specific
than the previous four; most of the sites are blue, but there are a
few in the south that are much yellower than the rest. These are the
same three sites that form the red cluster in figure \ref{red-cluster}
from the consensus tree results in chapter \ref{results-chapter}:
J\"amshog, \"Ossj\"o and Tors\aa{}s. The sixth map, however, shows a
clear east/west divide that is not reflected in my data.

Although this region-to-region comparison is not precise, it gives
reason to hope that a quantitative comparison between the two result
sets will provide statistical evidence of high agreement. Qualitatively,
the level of agreement between the phonological results and syntactic
results is quite high. Of the six variables Leinonen illustrates with
the maps in figures
\ref{leinonen-factors-1-2}--\ref{leinonen-factors-5-6}, all but one
reflect some aspect of the combined syntactic results. The
exact overlap between Leinonen's fifth variable and the red cluster
from the consensus tree results is surprising for statistical methods.

\section{Comparison to Syntactic Dialectometry}

In the progression from dialectology of Swedish to phonological
dialectometry of Swedish and finally to syntactic dialectometry, there
is less and less existing literature. To my knowledge, this
dissertation is the first treatment of syntactic dialectometry for
Swedish. Even outside Swedish, very little syntactic dialectometry
exists. Besides \quotecite{spruit08} dissertation, based on Goebl's
limited-data techniques, statistical work is limited to Nerbonne and
Wiersma's work on Finnish \cite{nerbonne06} and \cite{wiersma09}, and
my work on English \cite{sanders07} and \cite{sanders08b}.

This dissertation is the first to show that a statistical measure
designed for syntax can find distances between dialect regions. It
directly addresses the shortcomings of the previous work, which showed
that a statistical measure could detect significant differences, but failed to
produce dialect distances. It evaluates
parameter variations, establishing which combinations of feature set,
distance measure and corpus size produce valid and useful results,
taking into account a number of practical considerations, such as
amount of existing annotation.

This dissertation shows that fairly small sites, on the order of
6,000--10,000 words, can produce significant distances. This contrasts
with previous work; the significant distances between English sites
were for much larger sizes: the ICE data for London had over 200,000
words, and Scotland over 25,000. The conclusion should be that when
the sites consist of properly collected dialect speech, the size
required to detect distance drops considerably. The Swediasyn corpus
captures dialect speech in a way that the ICE does not; the Swediasyn
contains interviews in homes, while the majority of the ICE is
interviews of students and professors at University College London.

In addition, the syntactic results of this dissertation agree closely
with the phonological results of \namecite{leinonen08}. Although
agreement of syntax and phonology is not necessarily a prediction when
looking for dialect regions, it is not surprising, and it provides
circumstantial evidence that the new method is valid because it agrees
with an existing one. This contrasts strongly with the English work, which found no
significant correlation of syntactic distance with phonological
distance. It may be that using the same corpus for both Swedish
studies was the key difference; the two English corpora's ages differed by
almost 50 years.

This dissertation agrees more closely with dialectology than previous
work. Although the English study reproduced the north/south divide
well known in British dialectology, it did not produce any more
detailed regions. In contrast, this study reproduced all of the
Swedish dialect regions. With respect to individual phenomena,
however, the feature comparison was inconclusive; a few results were
positive, but most were very close to zero. There are two problems.
First, the corpora once again differ in age: most of Swedish
dialectology dates from around 1900 while the Swediasyn was collected
in 2000. Second, significance testing is lacking. The small feature
differences found
may well be significant, since the nature of statistical
methods is to accumulate many small differences, but it is not
possible to tell without a test.

Significance testing for precise feature analysis is future work, but
this is not necessarily a problem. For phonological dialectometry,
which began with Kessler's paper on Irish \cite{kessler95}, extraction
of specific features did not begin until much later, two to three
years after Heeringa's dissertation on the subject \cite{heeringa04},
with such work as \quotecite{prokic07}.
% I published similar work but it sucked and was only ever a
% presentation at a medical conference
In any case, \namecite{wiersma09} mentions a method for features of
individual regions that could be adapted to comparisons between a pair
of regions.

\chapter{Conclusion}
\label{conclusion-chapter}

The previous chapter discussed the impact of this work with respect to
previous work in various fields. In particular, it provided a picture
of how it advanced syntactic dialectometry. This chapter briefly
covers avenues of future work to which this work leads. This future work falls
into two categories: syntactic dialectometry and Swedish
dialectology.

\section{Future Work}
% TODO: Cite Shannon in methods chapter
% TODO: To the future work section, add:
% 1. Extract deps from CFG parses from Berkeley
% 2. Label dep features with both arc and POS tags interleaved in the
% proper order.
% 3. Better tag set. (duh, probably already have this one)
% 4. Non-linear feature set combination.
% Some other stuff from end of results chapter? (I think is only on
% paper)
% TODO: Various kinds of tag backoff; for example, to bigrams or coarser node
% tags.

% -- Other stuff left to do in results chapter --- %%
% TODO: Remove R.app's captions in favour of mine.
% TODO: Remove R.app's x-scale (y-scale) too
% TODO: CITE this, I think it's a Pieter Klieweg paper
% TODO: Get a whole example sentence probably. Ugh. People want so
% much context!
% TODO: Also format these examples properly.
% TODO: Get the full sentence either from jones or flenser


Some avenues of future work are obvious; Swediasyn is part of the
larger Nodalida project to create a syntactic dialect corpus for all
Scandinavian languages. And Swediasyn is itself not a complete
transcription of Swedia; for example, it does not include any of
Swedish-speaking Finland yet \cite{johennessen09}. Unfortunately, this work depends on
others since I do not speak any Scandinavian language natively. Once
these corpora are complete, they will provide a more complete picture
of syntactic variation over the entire Scandinavian language area.

With regard to feature sets, it is interesting that trigrams perform better
than the more complicated feature sets. From a linguist's point of
view, this is disturbing: why should the flattest representation of
syntax perform the best? This performance difference also
discourages others from developing even more complicated and
linguistically interesting feature sets. The reason for trigrams'
performance is likely the amount of automatic annotation
that is a prerequisite for the complex features developed
here. Trigrams rely on an automatic part-of-speech tagger, while
leaf-ancestor paths rely on an automatic parser that uses automatic
part-of-speech tags from that same tagger, so tagging errors are
compounded by parsing errors.

To enable more complex feature sets, manual annotation is needed. But
this is labor intensive. Failing that, improved automatic annotation
is needed, although this still usually implies some manual annotation
in the form of a seed corpus for bootstrapping
\cite{blitzer07,mcdonald06}. Bootstrapping should help automatic
parsing of dialect interviews: not only does the subject matter of an
interview differ from the typical newspaper training corpus, the
syntactic features where the dialect differs from the standard
language are precisely those that are hardest to parse. Giving a
machine parser a sample of dialect speech as training would allow it
to identify some of these features. For example, in the case of the
possible double modals discussed at the end of chapter
\ref{results-chapter}, the part of speech tagger never saw the tokens
\textit{``skulla kunna''} juxtaposed in the training. If both words
were not part of a closed class, it is likely that the tagger would
not produce the correct tag for this pair. The same problem applies to
the parser, but because syntactic training is even more sparse, the
parser is less likely to have seen similar structures in
training. The parser is correspondingly less likely to produce a
double modal structure without having seen it in training.

Processing of features is another area for future work: normalization
is the first half of this problem. The current sentence-level
normalizations function well for aggregate comparisons like cluster
maps, but for individual feature comparison, the overuse normalization
tends to rank highly features that may just be noise from
annotation error. On the other hand, without the overuse
normalization, only very common features are highly ranked. This makes
it hard to notice the unique features of a dialect that do not occur
much. A compromise that takes frequency into account to some extent is
needed, so that rare features can be highly ranked without introducing
noise from annotation errors.

The other half of the feature-processing problem is a test for
significance when comparing two regions. This would make sure that
comparisons to the dialectology literature are significant in the
future. \namecite{wiersma09} provides a similar method for testing
significance of individual features in a single region, so it should
be easy to modify this to work for comparisons between two regions.
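
As a sketch only, and not the procedure of \namecite{wiersma09}, one
way to set up such a two-region test is a permutation test. The
notation here is introduced purely for illustration. Let $p_A(f)$ and
$p_B(f)$ be the relative frequencies of feature $f$ in regions $A$ and
$B$, and let $\delta = |p_A(f) - p_B(f)|$ be the observed
difference. Pooling the sentences of both regions and redistributing
them at random into two groups of the original sizes $N$ times gives
differences $\delta_1, \ldots, \delta_N$, from which an estimated
$p$-value follows:
\begin{equation}
  \hat{p} = \frac{1 + |\{\, i : \delta_i \geq \delta \,\}|}{N + 1}
\end{equation}
A small $\hat{p}$ would indicate that the difference in $f$ between
the two regions is unlikely to arise from chance assignment of
sentences to regions. When many features are tested at once, a
correction for multiple comparisons would also be needed.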

Finally, another obvious extension of this work is a quantitative
comparison of these results on Swedish to the results in Leinonen's
upcoming dissertation on Swedish phonological dialectometry. Given the
agreement between these results and her published work, it is likely
that the correlation will be high. This comparison should be fairly
easy since both results use the same dialect corpus as a basis.
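
One possible form for this comparison, given here as an assumption
rather than a settled design, is the correlation between the two sets
of pairwise site distances. Writing $s_{ij}$ for the syntactic
distance and $t_{ij}$ for the phonological distance between sites $i$
and $j$, with means $\bar{s}$ and $\bar{t}$ over all pairs $i < j$:
\begin{equation}
  r = \frac{\sum_{i<j} (s_{ij} - \bar{s})(t_{ij} - \bar{t})}
           {\sqrt{\sum_{i<j} (s_{ij} - \bar{s})^2}\,
            \sqrt{\sum_{i<j} (t_{ij} - \bar{t})^2}}
\end{equation}
Because the pairwise distances are not independent of each other, the
significance of $r$ would have to be assessed with a Mantel test,
which recomputes $r$ under random permutations of the site labels of
one matrix, rather than with a standard test for correlation.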

\section{Conclusion}

This dissertation establishes that statistical methods are a useful
direction for syntactic dialectometry. Its results show that
significant differences can be obtained with dialect corpora. This
much had been accomplished by previous work. However, this work goes
on to establish that even smaller interviews of dialect speakers are
sufficient to produce significant distances, and investigates
variations on both feature set and distance measure. It shows that a
syntactic measure can reproduce the traditional regions of
dialectology, and that it can produce agreement with a phonological
measure. Its comparison to individual dialect phenomena is
inconclusive, but opens an avenue for future investigation, and more
importantly, future development of methods to compare and rank
individual features.

Future directions based on this work are twofold. First, with a
statistical method established for syntax, dialectometry can begin to
investigate the syntactic features of other languages. Second, in
Swedish, this work and future work similar to it can contribute to
dialectology in general; syntax has been relatively neglected in
Swedish dialectology. As Swediasyn and Nodalida are completed, the
automatic analysis detailed in this dissertation can provide a quick
analysis of new data, and point linguists toward interesting dialect
features.

In conclusion, this dissertation has answered the questions of
agreement with dialectology and the best parameter configuration for
practical measurements, as well as agreement with phonological
dialectometry. It has established statistical methods for syntactic
dialectometry, pointing the way for future syntactic dialect studies,
future expansion of statistical methods in dialectometry, and future
syntactic analysis of Swedish.

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "dissertation.tex"
%%% End: