No, let's not tokenize hyphens and dashes #14

@ebeshero

The collateX software does not normally tokenize hyphens and dashes, and last fall we wondered whether tokenizing them might improve alignments. We introduced a function in the Python script that inserts newline characters around all forms of hyphens and dashes so they are treated as their own tokens. However, in testing collation outputs over the holiday break, I compared outputs with and without the function in operation and discovered that it is not helping us as much as we had hoped. In many cases the tokenized hyphens do no harm, but in others they create alignment problems. NOT treating them as separate tokens produces longer aligned readings and a clearer view of variation. So we are better off letting collateX attach hyphens and dashes to the word token on their left (as it handles other punctuation). Here are examples of outputs with and without the function:

Python script APPLIES function to tokenize hyphens and dashes: output example from C-13:

	<app>
		<rdgGrp n="['or', 'confine', 'a', 'mountain']">
			<rdg wit="f1818">or confine a mountain </rdg>
			<rdg wit="f1823">or confine a mountain </rdg>
			<rdg wit="fThomas">or confine a mountain </rdg>
			<rdg wit="f1831">or confine a mountain </rdg>
			<rdg wit="fMS">or confine a &lt;lb n=&quot;c56-0086__main__27&quot;/&gt;mountain </rdg>
		</rdgGrp>
	</app>
	<app>
		<rdgGrp n="['-']">
			<rdg wit="f1818">- </rdg>
			<rdg wit="f1823">- </rdg>
			<rdg wit="fThomas">- </rdg>
			<rdg wit="f1831">- </rdg>
		</rdgGrp>
	</app>
	<app>
		<rdgGrp n="['stream', 'with', 'a']">
			<rdg wit="f1818">stream with a </rdg>
			<rdg wit="f1823">stream with a </rdg>
			<rdg wit="fThomas">stream with a </rdg>
			<rdg wit="f1831">stream with a </rdg>
			<rdg wit="fMS">stream with a </rdg>
		</rdgGrp>
	</app>

Python script REMOVES function to tokenize hyphens and dashes: output example of the same passage from C-13:

	<app>
		<rdgGrp n="['or', 'confine', 'a']">
			<rdg wit="f1818">or confine a </rdg>
			<rdg wit="f1823">or confine a </rdg>
			<rdg wit="fThomas">or confine a </rdg>
			<rdg wit="f1831">or confine a </rdg>
			<rdg wit="fMS">or confine a </rdg>
		</rdgGrp>
	</app>
	<app>
		<rdgGrp n="['mountain', 'stream']">
			<rdg wit="fMS">&lt;lb n=&quot;c56-0086__main__27&quot;/&gt;mountain stream </rdg>
		</rdgGrp>
		<rdgGrp n="['mountain-stream']">
			<rdg wit="f1818">mountain-stream </rdg>
			<rdg wit="f1823">mountain-stream </rdg>
			<rdg wit="fThomas">mountain-stream </rdg>
			<rdg wit="f1831">mountain-stream </rdg>
		</rdgGrp>
	</app>
	<app>
		<rdgGrp n="['with', 'a']">
			<rdg wit="f1818">with a </rdg>
			<rdg wit="f1823">with a </rdg>
			<rdg wit="fThomas">with a </rdg>
			<rdg wit="f1831">with a </rdg>
			<rdg wit="fMS">with a </rdg>
		</rdgGrp>
	</app>
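
One way to compare the two outputs mechanically is to list which witnesses each <app> leaves out. The following is only a minimal sketch: the helper name `missing_witnesses` and the `ALL_WITNESSES` set are assumptions for illustration and are not part of our collation script. The XML strings are copied from the examples above.

```python
import xml.etree.ElementTree as ET

# Assumed list of witnesses, taken from the wit attributes in the examples.
ALL_WITNESSES = {"f1818", "f1823", "fThomas", "f1831", "fMS"}


def missing_witnesses(app_xml):
    """Return the witnesses that have no <rdg> inside the given <app>."""
    app = ET.fromstring(app_xml)
    present = {rdg.get("wit") for rdg in app.iter("rdg")}
    return ALL_WITNESSES - present


# The <app> produced when hyphens are tokenized separately.
hyphen_app = """<app>
  <rdgGrp n="['-']">
    <rdg wit="f1818">- </rdg>
    <rdg wit="f1823">- </rdg>
    <rdg wit="fThomas">- </rdg>
    <rdg wit="f1831">- </rdg>
  </rdgGrp>
</app>"""

# The <app> produced when hyphens stay attached to their words.
compound_app = """<app>
  <rdgGrp n="['mountain', 'stream']">
    <rdg wit="fMS">&lt;lb n=&quot;c56-0086__main__27&quot;/&gt;mountain stream </rdg>
  </rdgGrp>
  <rdgGrp n="['mountain-stream']">
    <rdg wit="f1818">mountain-stream </rdg>
    <rdg wit="f1823">mountain-stream </rdg>
    <rdg wit="fThomas">mountain-stream </rdg>
    <rdg wit="f1831">mountain-stream </rdg>
  </rdgGrp>
</app>"""

print(missing_witnesses(hyphen_app))    # fMS has no reading here
print(missing_witnesses(compound_app))  # empty set: every witness is present
```
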

The second alignment is better. Why?

  • The first example creates an <app> around only the - character, and that <app> does not include all the witnesses (fMS is missing), so it is hard to notice where the hyphen is absent. In the Variorum Viewer, this would be difficult to read.
  • The second example shows the variation more clearly: four witnesses give us "mountain-stream" while one (fMS) gives us "mountain stream". That is the real difference, and all the witnesses are present in the <app> to show it, so every witness can show how it differs from the others (no one is left out of the app).
    @Yuying-Jin
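
For reference, the kind of function described at the top of this issue can be sketched roughly as follows. This is not the code from our Python script: the names `DASHES`, `tokenize_dashes`, and `tokens` are hypothetical, the set of dash characters is an assumption, and plain whitespace splitting only stands in for how the text is actually tokenized before collation.

```python
import re

# Assumed character set: the ASCII hyphen plus the Unicode range
# U+2010..U+2015 (hyphen, non-breaking hyphen, figure dash, en dash,
# em dash, horizontal bar).
DASHES = re.compile(r"([-\u2010-\u2015])")


def tokenize_dashes(text):
    """Surround every hyphen or dash with newlines so it becomes its own token."""
    return DASHES.sub(r"\n\1\n", text)


def tokens(text):
    """Split on whitespace, as a stand-in for the real tokenizer."""
    return text.split()


sample = "or confine a mountain-stream with a"

with_function = tokens(tokenize_dashes(sample))
without_function = tokens(sample)

print(with_function)
# ['or', 'confine', 'a', 'mountain', '-', 'stream', 'with', 'a']
print(without_function)
# ['or', 'confine', 'a', 'mountain-stream', 'with', 'a']
```

With the function applied, "mountain" and "stream" line up with the manuscript's two words and the hyphen is stranded in its own <app>. Without it, "mountain-stream" stays one token and is compared directly against "mountain stream".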
