\chapter{Discussion}
\label{discussion-chapter}
This chapter compares the dissertation's results to three areas. It
compares the results to dialectology, starting with the traditional
dialect regions of Sweden and moving to individual dialect phenomena. Then it
compares the results to phonological dialectometry, which uses many of
the same analytical techniques on phonology data. Finally, it compares
this work to previous work in the field of syntactic dialectometry and
summarizes its improvements. The chapter ends with a summary of the
work and its contributions to dialectology at large and Swedish
dialectology in particular.
% A big question is why trigrams are so good. All of the fancier feature
% sets do worse than trigrams. I should address this in the summary for
% sure.
% The reason is that the simpler tags are easier to tag. So I recommend
% for the real-world analyses with only untagged transcriptions as input
% to simply use trigrams.
\section{Comparison to Syntactic Dialectology}
\label{discussion-chapter-dialectology-section}
The comparison to syntactic dialectology consists of three
sections. The first section looks at the general expectations of
dialectology with respect to correlation with geographic distance. The
second section compares the traditional dialect regions of Sweden to
the ones found by the statistical dialect measure. The third section
finishes by comparing specific phenomena of dialect regions to the
corresponding features from interview sites.
\subsection{General Expectations}
The default expectation of dialect distance is that it should
correlate with geographic distance (see \namecite{chambers98} and
\namecite{gooskens04a}). The principal places where geographic distance
fails to correlate with dialect distance are dialect boundaries
between adjacent sites; here, a small geographic distance
is paired with a large dialect distance. For non-adjacent sites, in
contrast, a large geographical distance may be paired with a small
dialect distance. This can occur, for example, with relic dialects,
where the innovative dialect expands from the center, leaving similar
dialects isolated on the edges. However, neither of these cases holds
for the Scandinavian languages; \namecite{hallberg05} points out that
Swedish dialect areas form a continuous gradient without any strong
boundaries. This predicts a particularly strong correlation between
geography and dialect. Therefore, the first step is to compare the
correlation of geographic distance with dialect distance as measured
here.
Unfortunately, the correlations between geographic distance and
dialect distance are uniformly low, even when they do attain
significance. The highest correlation is 0.36. Correlating dialect
distance with travel distance rather than geographic distance gives
0.37, which is an improvement, albeit a small one. However, as
Gooskens points out, the time and distance required to travel between two
points at the beginning of the 21st century are considerably less than
they were one hundred years ago or more. Measuring travel time between
sites at some point in the past, as she does, might provide an even
better correlation with dialect distance.
Nonetheless, the overall pattern agrees with Hallberg's analysis:
there is a fairly smooth north-to-south gradient. The
composite cluster maps (figure \ref{map-composite-5-1000} in chapter
\ref{results-chapter}, for example) show this pattern best, but the
consensus tree and MDS maps do as well. The exceptions to this
gradient are the areas surrounding Stockholm and Malm\"o, as well as
the whole of the southern provinces Sk\aa{}ne and Blekinge. It may be
that modern urbanization has created a city/country divide, with
Stockholm and Malm\"o innovating and the rural areas becoming relic
dialects. These two exceptions will be discussed more in the next
section.
\subsection{Dialect Regions}
% so which is it? A city/country divide or is just that the
% traditional areas were Right All Along.
% I guess upon reflection, it's probably the latter...
According to dialectology, Sweden does not have strong dialect
boundaries, but it still has some traditional dialect areas. However,
these are loosely defined and do not have sharp borders; the Eastern
area is centered around Stockholm, the Western around G\"oteborg, the
Southern around Malm\"o, and the Northern area covers the north of
Sweden. In addition, the island of Gotland forms a separate area. The
MDS maps and consensus tree maps reproduce these areas with varying
degrees of fidelity.
For example, in the consensus tree figure \ref{map-consensus-5-1000}, the
cyan cluster corresponds to the Northern and Western dialect areas,
the orange cluster corresponds to the Eastern area, and the red/yellow
cluster corresponds to the Southern. A question arises
from this grouping, though: why should the northern and western areas
appear in the consensus tree as one group? It looks as if the
consensus tree map makes it more important that they differ from the
East and South than that they differ from each other. The MDS maps
reinforce this point; they show that the western sites and northern sites
do in fact differ quite a bit. However, because the eastern and
southern sites are so close, a clustering technique, like consensus
trees, with exclusive group membership will put distant sites in the
same group.
The boundary between Sk\aa{}ne and Blekinge is quite abrupt,
presumably mirroring the former Danish border that existed until the
end of the Middle Ages. This contradicts Hallberg, who explicitly
mentions that dialectology research finds no border there, and
that the strongest north/south division more closely approximates
\quotecite{leinonen08} diagonal boundary in map
\ref{leinonen-factors-3-4} below.
There are three possible explanations for this: first, there could be
statistical, accumulative evidence which Swedish dialectologists have
missed; second, the distribution of Swediasyn interview sites may be
too sparse to reflect the real border; in particular, there are very
few in Sm\aa{}land; third, the dialect landscape may have changed
since the prevailing dialectology opinion was established. The last
explanation is attractive, since the Swedia corpus is around 50 years
newer than the newest dialectology studies. However, this is an old
boundary: it mirrors the Sweden-Denmark political border that existed
over 400 years ago. It would be odd for it to disappear for over 350
years and re-appear just before 2000. Instead, I believe the first
explanation is more likely: Leinonen's results, in addition to
reproducing the boundary described by Hallberg, also place a boundary
at the same location as these syntactic results. This boundary is
visible in factors 2 (figure \ref{leinonen-factors-1-2}) and 5 (figure
\ref{leinonen-factors-5-6}) of her factor analysis based on the
phonology data of Swedia, discussed below in section
\ref{discussion-chapter-phonological-dialectometry}. Both Leinonen's
method and mine are capable of detecting distributional patterns that
are difficult to see from manual analysis. For example, in the
previous chapter, I showed that the trigram AB-AV-AB, despite
appearing in all interview sites, was more common in central Swedish
cluster A.
% The second explanation may also be true, considering that Leinonen's
% results reproduce the boundary that Hallberg mentions, but it does not
% invalidate the boundary that the syntactic measure detects, since her
% results reproduce it as well.
% remember, these produce a
% city/country divide, a boundary at Jamshog, and something kind of
% near Goteborg.
\subsection{Dialect Features}
The literature for Swedish syntactic dialectology is not extensive,
largely because there is not much syntactic dialectology for any
language. As a result, I will compare my results to two papers,
\namecite{delsing03} and \namecite{rosenkvist07}. The first paper is a
survey of syntactic dialectology from the late 19th and early 20th
centuries. In the same volume, other papers analyze specific phenomena
in more detail; the survey is mostly concerned with the dialect differences
and distributions rather than the syntactic analysis. The second is an
analysis of the South Swedish Apparent Cleft.
\namecite{delsing03} surveys a number of dialectology studies. These
studies date from the height of the field in Sweden, from circa
1880--1930, which Delsing at times augments with modern data. It is
worth noting that the Swedia data in the comparison was collected
around 2000, so there were likely changes in the dialects in the
intervening 70--120 years. This is particularly true in the northern
dialect areas, where improved travel and communication have
leveled the dialects considerably \cite{hallberg05}.
However, comparison to the phenomena in the survey may still yield
interesting results, so for each
phenomenon I will start with a summary of the phenomenon for Swedish
dialects: its geographic distribution and its linguistic
realizations. Then I will match the geographic distribution with
Swediasyn interview sites and represent the phenomenon in terms of the
feature sets developed in this dissertation. For this initial
analysis, trigram features are used because they are simple. This
matches Delsing's survey descriptions, which are for the most part
surface-oriented; other papers in the same volume with his survey
analyze the phenomena in more detail.
With the target sites and features defined, it is straightforward to count the
number of occurrences of each feature in each site and compare the
two. If the predicted dialect phenomenon is reflected in the data,
then the sites associated with the phenomenon will have more
occurrences of the target features than the non-associated sites. This
difference is precisely what the distance measures use.
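One way to make this comparison explicit is the following sketch. The
notation is introduced here for illustration only, and the
normalization by the total number of feature tokens is an assumption
rather than a description of the implementation. For a set of sites
$R$ and a set of target features $F$, let $c_s(f)$ be the number of
occurrences of feature $f$ at site $s$, and let $N_s$ be the total
number of feature tokens at site $s$. The occurrence rate is then
\[
r_R(F) = \frac{\sum_{s \in R} \sum_{f \in F} c_s(f)}{\sum_{s \in R} N_s}
\]
and the phenomenon is reflected in the data when $r_R(F) > r_{R'}(F)$,
where $R'$ is the set of all remaining sites.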
This method is inadequate for two reasons: first, the translation of
linguistic analysis to feature representation will not be perfect and
may miss some valid instances of the linguistic phenomenon. Second,
more importantly, the differences are not yet checked for statistical
significance. As such, the comparison can only be suggestive;
checking for statistical significance will have to wait for future
work.
% As an aside, much of this missing information IS available to me, so I
% could look manually. But none of it made it through to the distance
% measures, and this analysis compares the way the distance measures
% make the decisions with the way that linguists make their
% decisions. So I have to use only the information that the distance
% measures used.
The maps reproduced here are taken from Delsing's survey.
\subsubsection{``Partitive'' Article}
Northern Sweden uses the suffixed article much more than the rest of
Sweden. The reason, Delsing says, is that some uses of the suffixed
article are not definite in the north; they have a partitive function,
similar to the partitive article in French, which is not present in
the rest of the country. See figure \ref{partitive-article} for an
example.
\begin{figure}
\gll H\"a finns vattne d\"ari hinken. \\
Here found water-the in bucket-the \\
\trans `There is water in the bucket.'
\caption{Suffix marking for partitive}
\label{partitive-article}
\end{figure}
Unfortunately, the part-of-speech tag set used for this dissertation
is quite coarse; it does not record whether nouns are marked with the
definite suffix. Therefore, there is no way for the distance measure
to tell the difference between suffixed dialect usage and bare
standard usage.
\subsubsection{Proper-Noun Articles}
\begin{figure}
\includegraphics[scale=0.7]{dialektboka-karta3}
\caption{Proper-Noun Articles}
\label{indefinite-article-proper-noun-map}
\end{figure}
\begin{figure}
\gll En Bjurstr\"om ha aff\"arn. \\
A Bjurstr\"om has the-store. \\
\trans `Bjurstr\"om has a store.'
\caption{Indefinite Article for Proper Nouns: First Names}
\label{indefinite-article-proper-noun}
\end{figure}
In Northern Scandinavia, first names are preceded by an indefinite
article, and sometimes last names as well. The indefinite article also
precedes kinship terms that are used as proper names, for example
``Mother'' or ``Grandfather''. An example is given in figure
\ref{indefinite-article-proper-noun}. Standard Swedish does not
include this feature. In Sweden, this feature is found along the
border with Norway as well as in Northern Sweden. In the Swediasyn data, this
includes the interview sites K\"ola, Indal, and Anundsj\"o---the dark
area in figure \ref{indefinite-article-proper-noun-map}, there labeled
``Prepropriell artikel''.
Unlike the partitive article suffix, this feature is easy to detect
with a coarse part-of-speech tag set. Specifically, it can be represented as the
bigram EN-PN (indefinite article-proper noun), which can be used as a
search term in the trigram feature set. The same EN-PN sequence is
expected for leaf-head paths, since the indefinite article depends on
the proper noun. The phrase-structure-rule features should
look something like NP$\to$EN-PN.
Occurrences of the EN-PN bigram in the trigram feature set for
Leksand, Indal and K\"ola agree with the linguistic analysis: a rate
of 0.00007 in these sites versus 0.00006 elsewhere. Unfortunately, this
result cannot be trusted because the rate of occurrence is so low in
both regions and so close between them. The only conclusion that can
be drawn is that the hypothesis is not yet disproved.
\subsubsection{Possessives and the article}
\begin{figure}
\includegraphics[scale=0.7]{dialektboka-karta4}
\caption{Possessives with Articles}
\label{possessive-plus-article-map}
\end{figure}
\begin{figure}
\gll naboens den stribede kat \\
Neighbors' the striped cat \\
\trans `The neighbors' striped cat'
\caption{Simultaneous possessive and determiner in noun phrase in
Danish, and at one time Southwest Sweden}
\label{possessive-plus-article-example}
\end{figure}
In Swedish, and in the other Scandinavian countries, there is a good
deal of variation in the handling of possessives with articles. In
Swedish, normally only one is allowed in a noun phrase: either a
possessive or a determiner, but not both. However, in Danish and the
Danish-influenced areas of Sweden, both are allowed in certain cases:
for example, when the possessive and determiner are separated from the
noun by an adjective. Delsing gives an example from Danish, shown here
in figure \ref{possessive-plus-article-example}. This pattern also
exists in the southwest corner of Sweden, very near to Denmark. In
figure \ref{possessive-plus-article-map}, this area is shaded
left-to-right diagonally; it includes the interview site Bara. In
addition, this pattern alternates with the standard Swedish pattern on
the island of Gotland (cross-hatched on the map), which includes the
interview sites Fole, F\aa{}r\"o and Sproge.
This pattern can be detected by analyzing the per-site recall for the
4-grams PO-PO-AJ-NN, PR-PO-AJ-NN and NN-PO-AJ-NN. The first is the
sequence pronoun-pronoun-adjective-noun, for example {\it mitt det
gamla huset} ``My the old house-the''. The second starts with a
proper name, such as {\it Pers} ``Per's'', and the third starts with a
noun, such as {\it naboens} ``neighbor's''. These three 4-tag sequences
can be encoded as trigrams by breaking each into two pieces. This
allows them to be searched for in the form in which the distance
measure would have encountered them.
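As an illustration of this decomposition, assuming the same
overlapping-window split that this chapter uses for the double
indefinite, a 4-gram $t_1$-$t_2$-$t_3$-$t_4$ yields the two trigrams
$t_1$-$t_2$-$t_3$ and $t_2$-$t_3$-$t_4$:
\[
\begin{array}{lcl}
\mbox{PO-PO-AJ-NN} & \to & \mbox{PO-PO-AJ},\ \mbox{PO-AJ-NN} \\
\mbox{PR-PO-AJ-NN} & \to & \mbox{PR-PO-AJ},\ \mbox{PO-AJ-NN} \\
\mbox{NN-PO-AJ-NN} & \to & \mbox{NN-PO-AJ},\ \mbox{PO-AJ-NN}
\end{array}
\]
Under this split the second trigram is shared by all three sequences,
so only the first trigram of each pair distinguishes them.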
In addition to this pattern, there is a second in the north of
Sweden. Here, it is simply that possessive personal pronouns are
allowed both before and after the noun. This pattern includes the
interview sites Indal and Anundsj\"o and is covered in the next
section.
Searching Bara, in the southwest of Sweden, for the previously
mentioned trigram patterns does not find them: the rate of occurrence
is 0.00289 inside Bara but 0.000341 outside. It should be higher in
Bara. However, Delsing, writing in 2003, mentions that residents of
Sk\aa{}ne that he has asked do not recognize this form either, so it is
possible that it has fallen out of use in the 70 years or so since it
was last reported.
Executing a similar search for the alternation of standard
Swedish with the possessive pronoun pattern in the Gotland sites (F\aa{}r\"o,
Fole and Sproge), the standard Swedish trigrams PO-AJ-NN, PR-AJ-NN and
NN-AJ-NN show similar results: 0.00441 in Gotland, 0.00495 outside
Gotland. This is opposite the predicted direction.
The final region in figure \ref{possessive-plus-article-map}, in northern
Sweden, which includes Indal and Anundsj\"o, is actually more
complicated than can be captured by the part-of-speech tags used here;
this region allows possessive proper nouns to occur with
suffix-determiner nouns. But this can occur in either order: for
example, both ``Pers huset'' and ``huset Pers'' are allowed. Although
both ``Pers hus'' and ``Pers huset'' produce identical tags (PN-NN),
trigrams do encode order, so the unusual order in ``huset Pers'' can be
searched for. Since both orders should be present in this northern
area, it should overuse bigrams like NN-PN (noun-proper noun) relative
to the rest of Sweden.
Searching for the bigrams NN-PN (noun-proper noun) and NN-PO
(noun-pronoun) shows a usage rate of 0.02532 for Indal and Anundsj\"o
and a rate of 0.02438 for the rest of Sweden. This is the expected
direction, but the rate of usage is very similar between the two
regions. The comparison is too close to support a conclusion
because the difference is unlikely to be significant.
% \subsubsection{Pronominal Possessives}
% In Swedish, as well all of mainland Scandinavian, another possessive
% construction is the reflexive genitive, which consists of a
% noun-reflexive-noun sequence. An example is given in figure
% \ref{genitive-reflexive-normal-example}. However, this construction
% does not allow pronouns: the sequence noun-reflexive-pronoun is not
% allowed (see figure \ref{genitive-reflexive-pronoun-example}).
% \begin{figure}
% {\it Per sitt hus} \\
% Per its house
% ``Per's house'' \\
% \caption{Standard Swedish genitive reflexive construction}
% \label{genitive-reflexive-normal-example}
% \end{figure}
% \begin{figure}
% {\it han sitt hus} \\
% his its house
% ``his house'' \\
% \caption{Pronominal genitive reflexive construction}
% \label{genitive-reflexive-pronoun-example}
% \end{figure}
% However, this construction is allowed in NORTHERN SWEDEN.
% Oops, actually I think this whole section is whole throwaway intro to
% something else. Boooooo.
% However, this is not allowed with possessive pronouns:
% *{\it han sitt hus}.
% The prepositional genitive
% behaves the same way: {\it huset till Per} ``Per's
% house'' (gloss: house-the of Per) is legal but *{\it huset till meg}
% ``my house'' (gloss: house-the of me) is not.
% There is an exception for kinship words, which I don't understand
% yet. But somehow ``far min'' is different (maybe just because it's not
% ``min far''?)
% So basically standard Swedish allows trigrams sequences like NN-PO-NN
% ({\it Per sitt hus}) but not PO-PO-NN ({\it han sitt hus}). It also
% allows sequence like NN-PR-NN ({\it huset till Per}) but not NN-PR-PO
% ({\it huset till meg}).
% Does not work (is too close to call): 0.02229 vs 0.2429
% Reversing the bigram, looking for PO-NN in the south gives
% Works (but is still super close): 0.04243 vs 0.03998
% It looks like one set just uses more nouns than the others or
% something. Conclusion: inconclusive, leaning toward no---it looks like
% they're the same.
\subsubsection{Proper Noun Possessives}
\begin{figure}
\gll Huset hans Per \\
The-house his Per \\
\trans `Per's house'
\caption{Possessive formed of Possessive Pronoun and Proper Noun}
\label{proper-noun-post-possessive}
\end{figure}
\begin{figure}
\includegraphics[scale=0.7]{dialektboka-karta6}
\caption{Proper-Noun Possessives}
\label{proper-noun-post-possessive-map}
\end{figure}
In addition to the post-nominal possessive pattern of the previous
section, there is a variant that is common in Norway. Here, the
sequence is noun-possessive pronoun-proper noun. An example
of this pattern is given in figure \ref{proper-noun-post-possessive}.
This pattern overlaps slightly into Sweden, covering the interview
site K\"ola. The distribution is given in figure
\ref{proper-noun-post-possessive-map}. Note that the northern area
with small stripes is the same as in figure
\ref{possessive-plus-article-map}, and the northern area with thin
stripes has no matching sites. The area of interest is the one with
larger, thick stripes that covers the majority of Norway.
This phenomenon maps to a trigram NN-PO-PN: noun-pronoun-proper
noun. The occurrence rate of this trigram is 0 in K\"ola versus
0.00001 in the rest of Sweden. This is the wrong direction, and the value is
so low that it is probably noise. There are two possible causes for
this essentially zero result: either neither region has this feature
or there is not sufficient data to tell.
\subsubsection{Noun possessives}
Delsing mentions briefly that central Sweden, including \"Alvdalen and
V\"asterdalarna, uses the dative form of nouns for the
s-genitive. However, the part-of-speech tag set used here does not
distinguish between dative and other cases on nouns, so it is not
possible to represent this phenomenon in a way that the distance
measures could have used.
\subsubsection{Double indefinite}
In northern Sweden and northern Norway, indefinite articles are used
both before and after adjectives when modifying nouns. In map
\ref{double-indefinite-map}, this is the area covered by dark
diagonals, labeled ``Postadjektivisk artikel''. Delsing also calls
this the ``double indefinite''; for an example, see figure
\ref{double-indefinite-example}. One indefinite article is used after
each adjective, even for multiple adjectives, so {\it en stor en bil}
(a large car) but also {\it en stor en fin en bil} (a large fine car).
\begin{figure}
\gll en stor en bil \\
a large a car \\
\trans `A large car'
\caption{Double Indefinite}
\label{double-indefinite-example}
\end{figure}
\begin{figure}
\includegraphics[scale=0.7]{dialektboka-karta8}
\caption{Double indefinite (post-adjectival articles)}
\label{double-indefinite-map}
\end{figure}
In central Sweden, a similar pattern occurs, but the article is not
perceived as independent. Instead it is perceived as a suffix of the
adjective. In other words, the above example is perceived as {\it en
stor-en bil} instead. According to Delsing, there is a difference in
intonation compared to the North Swedish construction, which does not
stress the intermediate articles nor co-ordinate them morphologically
as would be expected with a suffix. Unfortunately, this pattern
appears identical to the ordinary Swedish case given the coarse
part-of-speech tag set in use. In contrast, the first pattern is
quite easy to represent with trigrams: the 4-gram EN-AJ-EN-NN and the
6-gram EN-AJ-EN-AJ-EN-NN---alternating series of indefinite articles
and adjectives ended by a noun. These larger n-grams can be broken
into the trigrams EN-AJ-EN and AJ-EN-NN in order to search for them in
the Swedia-based data.
The northern pattern includes the interview sites Anundsj\"o and
Indal. When measured, these trigrams occur at a rate of 0.00054 there,
versus a rate of 0.00012 in the rest of Sweden. From this we
can conclude that this is a rare phenomenon, but one that happens in
the north about 4.5 times as often as in the rest of Sweden.
\subsubsection{Double Definite}
\begin{figure}
\gll det store huset \\
The large the-house \\
\trans `The large house'
\caption{Double definite (Sweden and Norway)}
\label{double-definite-example}
\end{figure}
\begin{figure}
\gll det store hus \\
The large house\\
\trans `The large house'
\caption{Single Definite (Denmark)}
\label{single-definite-example}
\end{figure}
\begin{figure}
\gll gamla h\'usid \\
old house-the \\
\trans `The old house'
\caption{Single definite suffix (Iceland)}
\label{single-definite-suffix-example}
\end{figure}
\begin{figure}
\gll storhuset \\
large-house-the \\
\trans `The large house'
\caption{Single definite suffix with combined adjective (Northern Sweden)}
\label{adjective-single-definite-suffix-example}
\end{figure}
\begin{figure}
\includegraphics[scale=0.7]{dialektboka-karta9}
\caption{Double definite (and combined adjectives)}
\label{double-definite-map}
\end{figure}
Double-definite with adjectives is standard in Sweden and Norway,
where there is a definite article as well as a definite suffix on the
noun (see figure \ref{double-definite-example}). This is not the case
in Denmark (figure \ref{single-definite-example}), where the definite
suffix disappears in case of a definite article, nor in Iceland, where
the definite is suffix-only and there is no article (figure
\ref{single-definite-suffix-example}).
However, in North Sweden, there is a fourth option, where the
adjective combines with the noun into a single word (figure
\ref{adjective-single-definite-suffix-example}). Delsing gives
examples like {\it storhuset} (the big house) and {\it
stor-svart-gamm-katta} (the big, black, old cat), in which a series
of adjectives appear prefixed to a noun without their usual
morphological inflection. In Norrland, Delsing finds that this
construction is used almost to the exclusion of the normal Swedish
one. Further south, the two co-exist.
Therefore, since the annotation scheme does not differentiate between
a combined noun like {\it storhuset} and a normal noun like {\it
huset}, the better way to detect the region difference is to count
the rate of normal trigrams like PO-AJ-NN (pronoun-adjective-noun);
this is the feature type that occurs rarely or not at all in the
north. If the region division in map \ref{double-definite-map} is
detected, then northern Sweden will have a lower rate of occurrence of
these standard trigrams.
As before, the two northern sites are Indal and Anundsj\"o. The rate
of PO-AJ-NN in this region is 0.00152, compared to 0.00216 for the
rest of Sweden. This difference is in the right direction, and it is
larger than most of the other comparisons here. However, like the
other comparisons, it has not been checked for significance so it is
currently only suggestive.
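As a rough worked comparison of the two reported rates (a ratio only,
not a significance test):
\[
\frac{0.00152}{0.00216} \approx 0.70
\]
That is, the northern rate of PO-AJ-NN is about 70\% of the rate in
the rest of Sweden, or roughly 30\% lower.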
\subsubsection{Rosenkvist's Analysis of the South Swedish Apparent Cleft}
\begin{figure}
\gll Det \"ar som han har missuppfattat. \\
it is {\it som} he has misunderstood \\
\trans `He has misunderstood.'
\caption{Apparent Cleft}
\label{apparent-cleft-example1}
\end{figure}
\begin{figure}
\gll Det \"ar bara som han finner p\aa{}. \\
it is only {\it som} he finds-on \\
\trans `He just makes it up.'
\caption{Apparent Cleft with adverb expressing speaker attitude}
\label{apparent-cleft-example2}
\end{figure}
\namecite{rosenkvist07} analyzes a phenomenon he calls the South
Swedish Apparent Cleft. It involves an embedded clause, similar to a
cleft, but with no clefted constituent. Instead, the subordinating
conjunction {\it som} is directly preceded either by the verb or an
adverb expressing speaker attitude. The subject of the {\it
som}-clause must be a pronoun, though Rosenkvist notes that this may
be a pragmatic, not a syntactic, restriction. The two main variants
are given in figures \ref{apparent-cleft-example1} and
\ref{apparent-cleft-example2}, but the apparent cleft is also found in
yes/no questions and embedded clauses.
Unfortunately, Rosenkvist does not give a comprehensive syntactic
analysis of the apparent cleft. This means that a translation to our
feature set based on his description will necessarily be
surface-oriented in the same way that his analysis and results are
surface-oriented.
Accordingly, translating the sequences like {\it Det \"ar som han
\ldots} gives the 4-gram PO-AV-UK-PO, and {\it Det \"ar bara som han
\ldots} gives the 5-gram PO-AV-AB-UK-PO (pronoun-be
verb-adverb-subordinating conjunction-pronoun). Although these
part-of-speech sequences can obviously appear in other contexts, they
should appear more in the region that has apparent clefts than in the
region that does not. Converting these sequences to trigrams is
straightforward, producing 5 unique trigrams of interest, which the
distance measures should also have used to obtain their distances.
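Spelled out, and assuming an overlapping three-tag window, the two
sequences decompose as follows:
\[
\begin{array}{lcl}
\mbox{PO-AV-UK-PO} & \to & \mbox{PO-AV-UK},\ \mbox{AV-UK-PO} \\
\mbox{PO-AV-AB-UK-PO} & \to & \mbox{PO-AV-AB},\ \mbox{AV-AB-UK},\ \mbox{AB-UK-PO}
\end{array}
\]
An $n$-gram yields $n-2$ such trigrams, so the 4-gram contributes two
and the 5-gram three. None is shared between the two sequences, which
gives the five unique trigrams.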
Rosenkvist captures the geographical distribution of the apparent
cleft in two ways. He first consults two collections of Swedish
novels, using the authors' birthplaces as proxies for their
dialect. Second, he uses the results of a questionnaire that he issued
to university students at several Swedish universities: Stockholm,
Gothenburg, Lund and Ume\aa{}.
Using author birthplace as a proxy for dialect, the apparent cleft can
be seen throughout southern and middle Sweden---this includes all the
interview sites except \AA{}rsunda, Indal and Anundsj\"o. However, based
on the survey results, the apparent cleft is only accepted by speakers
from Halland, Sm\aa{}land and Sk\aa{}ne. This includes the interview sites
Frilles\aa{}s, V\aa{}xtorp, Ankarsrum, Tors\aa{}s, Bara, L\"oderup, Norra
Rorum and \"Ossj\"o.
Therefore, the test for this comparison uses the occurrence rates
of the 5 trigrams derived from the two common forms Rosenkvist gives as
examples, with two variations: one region division based on author
birthplaces and one region division based on the student survey. The
southern region in both cases should have more occurrences of the
target trigrams.
For the larger cleft region division based on author birthplaces, the
comparison goes in the expected direction: a rate of 0.02430 in the
south and 0.02427 in the north. But these rates are so close to identical
that they should not be regarded as different. For the smaller division
based on the student survey, the comparison goes in the opposite
direction: 0.02264 in the south and 0.02491 in the north. Again, this
is not much of a difference.
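Written out as subtractions, the two comparisons are as follows. The
symbols $r_S$ and $r_N$ are shorthand introduced only for this
illustration and stand for the southern and northern rates reported
above. The first difference has the predicted sign but lies in the
fifth decimal place; the second has the opposite sign.

\[ r_S - r_N = 0.02430 - 0.02427 = 0.00003 \quad \mbox{(author birthplaces)} \]
\[ r_S - r_N = 0.02264 - 0.02491 = -0.00227 \quad \mbox{(student survey)} \]
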
With such a small difference, it is not possible to draw any
conclusions or even suggest whether the distance measures consistently
notice this difference. One problem is that it is hard to capture a
phenomenon like this with trigrams, where the surface form is only
subtly different from that produced by other syntactic structures. A
more complete syntactic analysis of the phenomenon is needed so that
more advanced feature sets from dialectometry can be used to compare
to the results from dialectology.
\subsection{Conclusion}
The dialect constructions surveyed here do not support the agreement
of the new dialectometry results with existing dialectology results
nearly as well as the previous sections, which compared the results at
a less detailed level. The larger problem is that no good method yet
exists for comparing at this level of detail; the differences were in
some cases large enough
to be suggestive, but without significance testing, it is not possible
to know that they are reliable. It is possible that the small
differences are significant and are already being used by the distance
measures to distinguish regions; after all, the aggregation of many
small differences is inherent in the workings of the statistical
approach in this dissertation.
\section{Comparison to Phonological Dialectometry}
\label{discussion-chapter-phonological-dialectometry}
\begin{figure}
\includegraphics[scale=0.4]{leinonen-factors-1-2}
\caption{Factors 1 and 2 of Swedish vowels}
\label{leinonen-factors-1-2}
\end{figure}
\begin{figure}
\includegraphics[scale=0.4]{leinonen-factors-3-4}
\caption{Factors 3 and 4 of Swedish vowels}
\label{leinonen-factors-3-4}
\end{figure}
\begin{figure}
\includegraphics[scale=0.4]{leinonen-factors-5-6}
\caption{Factors 5 and 6 of Swedish vowels}
\label{leinonen-factors-5-6}
\end{figure}
The comparison to phonological dialectometry is currently difficult in
two ways. First, there are few statistical methods in phonological
dialectometry. I proposed a simple Bayesian method \cite{sanders06}
and \namecite{hinrichs07} proposed two more complex methods, one
vector-based and the other from information theory. However, these
methods are less effective on small corpora than Levenshtein distance
and have not gained traction in the field. Second, even comparing
results only, there has been little Swedish dialectometry to date. To
my knowledge, the only paper at the time of this writing is
\namecite{leinonen08}; its method is more similar to
\quotecite{spruit08} approach to syntax. It uses factor analysis to
characterize the distribution of nine phonological variables across
Sweden, but does not cluster the sites based on these
variables. However, the overall regions can still be compared. I
compare Leinonen's individual feature maps to my composite cluster and
MDS maps.
In addition, Leinonen's dissertation, currently unpublished, will
cover phonological dialectometry of Sweden comprehensively. In future
work, a better comparison should be possible, since both dissertations
are based on the same corpus.
Looking at Leinonen's first two maps, reproduced here as figure
\ref{leinonen-factors-1-2}, we see patterns similar to the
city/countryside difference from the syntactic results: in the first diagram,
Stockholm and Uppsala differ from the rest of the country, and in the
second, the Stockholm, Uppsala and Malm\"o areas all differ.
In Leinonen's third and fourth maps (figure \ref{leinonen-factors-3-4}),
there is a north/south divide roughly halfway between Stockholm and
Malm\"o. This boundary generally reflects the north/south
gradient from my results. However, the phonological boundary is stronger and more
localized than the numerous small syntactic ones, such as those seen in
the composite cluster map \ref{map-composite-5-1000}. It is closer to
the diagonal north/south boundary mentioned by \namecite{hallberg05}.
The fifth map (figure \ref{leinonen-factors-5-6}) is more specific
than the previous four; most of the sites are blue, but there are a
few in the south that are much yellower than the rest. These are the
same three sites that form the red cluster in figure \ref{red-cluster}
from the consensus tree results in chapter \ref{results-chapter}:
J\"amshog, \"Ossj\"o and Tors\aa{}s. The sixth map, however, shows a
clear east/west divide that is not reflected in my data.
Although this region-to-region comparison is not precise, it provides
hope that a quantitative comparison between the two result sets will
support high agreement with statistical evidence. The level of
agreement between the phonological results and syntactic results is
quite high. Of the six variables Leinonen illustrates with the maps in
figures \ref{leinonen-factors-1-2} -- \ref{leinonen-factors-5-6}, all
but one reflect some aspect of the combined syntactic results. The
exact overlap between Leinonen's fifth variable and the red cluster
from the consensus tree results is surprising for statistical methods.
\section{Comparison to Syntactic Dialectometry}
In the progression from dialectology of Swedish to phonological
dialectometry of Swedish and finally to syntactic dialectometry, there
is less and less existing literature. To my knowledge, this
dissertation is the first treatment of syntactic dialectometry for
Swedish. Even outside Swedish, very little syntactic dialectometry
exists. Besides \quotecite{spruit08} dissertation, based on Goebl's
limited-data techniques, statistical work is limited to Nerbonne and
Wiersma's work on Finnish \cite{nerbonne06} and \cite{wiersma09}, and
my work on English \cite{sanders07} and \cite{sanders08b}.
This dissertation is the first to show that a statistical measure
designed for syntax can find distances between dialect regions. It
directly addresses the shortcomings of the previous work, which showed
that a statistical measure could detect significant differences, but failed to
produce dialect distances. It evaluates
parameter variations, establishing which combinations of feature set,
distance measure and corpus size produce valid and useful results,
taking into account a number of practical considerations, such as
amount of existing annotation.
This dissertation shows that fairly small sites, on the order of
6,000--10,000 words, can produce significant distances. This contrasts
with previous work; the significant distances between English sites
were for much larger sizes: the ICE data for London had over 200,000
words, and Scotland over 25,000. The conclusion should be that when
the sites consist of properly collected dialect speech, the size
required to detect distance drops considerably. The Swediasyn corpus
captures dialect speech in a way that the ICE does not; the Swediasyn
contains interviews in homes, while the majority of the ICE is
interviews of students and professors at University College London.
In addition, the syntactic results of this dissertation agree closely
with the phonological results of \namecite{leinonen08}. Although
agreement of syntax and phonology is not necessarily a prediction when
looking for dialect regions, it is not surprising, and it provides circumstantial
evidence that the new method is valid because it agrees with an existing
one. This contrasts strongly with the English work, which found no
significant correlation of syntactic distance with phonological
distance. It may be that using the same corpus for both Swedish
studies was the key difference; the two English corpora's ages differed by
almost 50 years.
This dissertation agrees more closely with dialectology than previous
work. Although the English study reproduced the north/south divide
well known in British dialectology, it did not produce any more
detailed regions. In contrast, this study reproduced all of the
Swedish dialect regions. With respect to individual phenomena,
however, the feature comparison was inconclusive; a few results were
positive, but most were very close to zero. There are two problems.
First, the corpora once again differ in age: most of Swedish dialectology dates
from around 1900 while the Swediasyn was collected in 2000. Second, there
is no significance testing. The small feature differences found
may well be significant, since the nature of statistical
methods is to accumulate many small differences, but it is not
possible to tell without a test.
Significance testing for precise feature analysis is future work, but
this is not necessarily a problem. For phonological dialectometry,
which began with Kessler's paper on Irish \cite{kessler95}, extraction
of specific features did not begin until much later, two to three
years after Heeringa's dissertation on the subject \cite{heeringa04},
with such work as \quotecite{prokic07}.
% I published similar work but it sucked and was only ever a
% presentation at a medical conference
In any case, \namecite{wiersma09} mentions a method for features of
individual regions that could be adapted to comparisons between a pair
of regions.
\chapter{Conclusion}
\label{conclusion-chapter}
The previous chapter discussed the impact of this work with respect to
previous work in various fields. In particular, it provided a picture
of how it advanced syntactic dialectometry. This chapter briefly
covers avenues of future work to which this work leads. This future work falls
into two categories: syntactic dialectometry and Swedish
dialectology.
\section{Future Work}
% TODO: Cite Shannon in methods chapter
% TODO: To the future work section, add:
% 1. Extract deps from CFG parses from Berkeley
% 2. Label dep features with both arc and POS tags interleaved in the
% proper order.
% 3. Better tag set. (duh, probably already have this one)
% 4. Non-linear feature set combination.
% Some other stuff from end of results chapter? (I think is only on
% paper)
% TODO: Various kinds of tag backoff; for example, to bigrams or coarser node
% tags.
% -- Other stuff left to do in results chapter --- %%
% TODO: Remove R.app's captions in favour of mine.
% TODO: Remove R.app's x-scale (y-scale) too
% TODO: CITE this, I think it's a Pieter Klieweg paper
% TODO: Get a whole example sentence probably. Ugh. People want so
% much context!
% TODO: Also format these examples properly.
% TODO: Get the full sentence either from jones or flenser
Some avenues of future work are obvious; Swediasyn is part of the
larger Nodalida project to create a syntactic dialect corpus for all
Scandinavian languages. And Swediasyn is itself not a complete
transcription of Swedia; for example, it does not include any of
Swedish-speaking Finland yet \cite{johennessen09}. Unfortunately, this work depends on
others since I do not speak any Scandinavian language natively. Once
these corpora are complete, they will provide a more complete picture
of syntactic variation over the entire Scandinavian language area.
With regard to feature sets, it is interesting that trigrams perform better
than the more complicated feature sets. From a linguist's point of
view, this is disturbing: why should the flattest representation of
syntax perform the best? This performance difference also
discourages others from developing even more complicated and
linguistically interesting feature sets. The likely reason for trigrams'
performance is the amount of automatic annotation
that is a prerequisite for the complex features developed
here. Trigrams rely on an automatic part-of-speech tagger, while
leaf-ancestor paths rely on an automatic parser that uses automatic
part-of-speech tags from that same tagger.
To enable more complex feature sets, manual annotation is needed. But
this is labor intensive. Failing that, improved automatic annotation
is needed, although this still usually implies some manual annotation
in the form of a seed corpus for bootstrapping
\cite{blitzer07,mcdonald06}. Bootstrapping should help automatic
parsing of dialect interviews: not only does the subject matter of an
interview differ from the typical newspaper training corpus, the
syntactic features where the dialect differs from the standard
language are precisely those that are hardest to parse. Giving a
machine parser a sample of dialect speech as training would allow it
to identify some of these features. For example, in the case of the
possible double modals discussed at the end of chapter
\ref{results-chapter}, the part-of-speech tagger never saw the tokens
\textit{``skulla kunna''} juxtaposed in training. If both words
were not part of a closed class, it is likely that the tagger would
not produce the correct tag for this pair. The same problem applies to
the parser, but because syntactic training is even more sparse, the parser
is less likely to have seen similar structures in
training. The parser is correspondingly less likely to produce a
double modal structure without having seen it in training.
Processing of features is another area for future work: normalization
is the first half of this problem. The current sentence-level
normalizations function well for aggregate comparisons like cluster
maps, but for individual feature comparison, the overuse normalization
tends to rank highly features that may just be noise from
annotation error. On the other hand, without the overuse
normalization, only very common features are highly ranked. This makes
it hard to notice the unique features of a dialect that do not occur
much. A compromise that takes frequency into account to some extent is
needed, so that rare features can be highly ranked without introducing
noise from annotation errors.
The other half of the feature-processing problem is a test for
significance when comparing two regions. This would make sure that
comparisons to the dialectology literature are significant in the
future. \namecite{wiersma09} provides a similar method for testing
significance of individual features in a single region, so it should
be easy to modify this to work for comparisons between two regions.
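One form such a two-region test might take is a permutation test. What
follows is only a sketch under assumptions: the notation is introduced
here for illustration, and it is not taken from \namecite{wiersma09}.
Let $r_A(f)$ and $r_B(f)$ be the normalized rates of feature $f$ in
regions $A$ and $B$, and let $d(f) = r_A(f) - r_B(f)$ be the observed
difference. Pooling the sentences of both regions, randomly
reassigning them to two groups of the original sizes $n$ times, and
recomputing the difference $d_i(f)$ after each reassignment gives the
estimated $p$-value below. A feature would be reported as differing
between the regions only when $\hat{p}(f)$ falls below a chosen
threshold, with some correction for the large number of features
tested.

\[ \hat{p}(f) = \frac{1 + \#\{\, i : |d_i(f)| \geq |d(f)| \,\}}{n + 1} \]
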
Finally, another obvious extension of this work is a quantitative
comparison of these results on Swedish to the results in Leinonen's
upcoming dissertation on Swedish phonological dialectometry. Given the
agreement between these results and her published work, it is likely
that the correlation will be high. This comparison should be fairly
easy since both results use the same dialect corpus as a basis.
\section{Conclusion}
This dissertation establishes that statistical methods are a useful
direction for syntactic dialectometry. Its results show that
significant differences can be obtained with dialect corpora. This
much had been accomplished by previous work. However, this work goes
on to establish that even smaller interviews of dialect speakers are
sufficient to produce significant distances, and investigates
variations on both feature set and distance measure. It shows that a
syntactic measure can reproduce the traditional regions of
dialectology, and that it can produce agreement with a phonological
measure. Its comparison to individual dialect phenomena is
inconclusive, but opens an avenue for future investigation, and more
importantly, future development of methods to compare and rank
individual features.
Future directions based on this work are twofold. First, with a
statistical method established for syntax, dialectometry can begin to
investigate the syntactic features of other languages. Second, in
Swedish, this work and future work similar to it can contribute to
dialectology in general; syntax has been relatively neglected in
Swedish dialectology. As Swediasyn and Nodalida are completed, the
automatic analysis detailed in this dissertation can provide a quick
analysis of new data, and point linguists toward interesting dialect
features.
In conclusion, this dissertation has answered the questions of
agreement with dialectology and best parameter configuration for
practical measurements, as well as agreement with phonological
dialectometry. It has established statistical methods for syntactic
dialectometry, pointing the way for future syntactic dialect studies,
future expansion of statistical methods in dialectometry, and future
syntactic analysis of Swedish.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "dissertation.tex"
%%% End: