No, let's not tokenize hyphens and dashes #14

collateX does not normally tokenize hyphens and dashes, and last fall we wondered whether tokenizing them might improve our alignments. We introduced a function in the Python script that inserts newline characters around all forms of hyphens and dashes so that they are treated as their own tokens. Over the holiday break, while testing collation outputs, I compared outputs with and without the function in operation and discovered that it is not helping us as much as we had hoped. In many cases the tokenized hyphens do no harm, but in others they cause alignment problems. NOT treating them as separate tokens produces longer aligned readings and a clearer view of variation. So we are better off letting collateX attach hyphens and dashes to the word token on their left, the way it handles other punctuation. Here are examples of outputs with and without the function:

### Python script APPLIES function to tokenize hyphens and dashes: output example from C-13:

```xml
<app>
		<rdgGrp n="['or', 'confine', 'a', 'mountain']">
			<rdg wit="f1818">or confine a mountain </rdg>
			<rdg wit="f1823">or confine a mountain </rdg>
			<rdg wit="fThomas">or confine a mountain </rdg>
			<rdg wit="f1831">or confine a mountain </rdg>
			<rdg wit="fMS">or confine a &lt;lb n=&quot;c56-0086__main__27&quot;/&gt;mountain </rdg>
		</rdgGrp>
	</app>
	<app>
		<rdgGrp n="['-']">
			<rdg wit="f1818">- </rdg>
			<rdg wit="f1823">- </rdg>
			<rdg wit="fThomas">- </rdg>
			<rdg wit="f1831">- </rdg>
		</rdgGrp>
	</app>
	<app>
		<rdgGrp n="['stream', 'with', 'a']">
			<rdg wit="f1818">stream with a </rdg>
			<rdg wit="f1823">stream with a </rdg>
			<rdg wit="fThomas">stream with a </rdg>
			<rdg wit="f1831">stream with a </rdg>
			<rdg wit="fMS">stream with a </rdg>
		</rdgGrp>
	</app>
```
### Python script REMOVES function to tokenize hyphens and dashes: output example of the same passage from C-13:

```xml
<app>
		<rdgGrp n="['or', 'confine', 'a']">
			<rdg wit="f1818">or confine a </rdg>
			<rdg wit="f1823">or confine a </rdg>
			<rdg wit="fThomas">or confine a </rdg>
			<rdg wit="f1831">or confine a </rdg>
			<rdg wit="fMS">or confine a </rdg>
		</rdgGrp>
	</app>
	<app>
		<rdgGrp n="['mountain', 'stream']">
			<rdg wit="fMS">&lt;lb n=&quot;c56-0086__main__27&quot;/&gt;mountain stream </rdg>
		</rdgGrp>
		<rdgGrp n="['mountain-stream']">
			<rdg wit="f1818">mountain-stream </rdg>
			<rdg wit="f1823">mountain-stream </rdg>
			<rdg wit="fThomas">mountain-stream </rdg>
			<rdg wit="f1831">mountain-stream </rdg>
		</rdgGrp>
	</app>
	<app>
		<rdgGrp n="['with', 'a']">
			<rdg wit="f1818">with a </rdg>
			<rdg wit="f1823">with a </rdg>
			<rdg wit="fThomas">with a </rdg>
			<rdg wit="f1831">with a </rdg>
			<rdg wit="fMS">with a </rdg>
		</rdgGrp>
	</app>

```
The second alignment is better. Why?
* The first example creates an `<app>` around the `-` character alone, and that `<app>` does not include all the witnesses (fMS is missing), so it is hard to notice where the hyphen is absent. In the Variorum Viewer, this would be difficult to read.
* The second example shows the variation more clearly: four witnesses read "mountain-stream" while one (fMS) reads "mountain stream". That is the real difference, and all the witnesses are present in the `<app>` to show it, so every witness can show how it differs from the others (no one is left out of the app).
@Yuying-Jin 
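
For anyone who has not looked at the script, here is a minimal sketch of the kind of function this issue is about. The name `isolate_dashes`, the set of characters it matches, and the whitespace-splitting `tokens` helper are assumptions made for illustration; this is not the project's actual code, and `tokens` is only a stand-in for collateX's own tokenizer.

```python
import re

# Assumed character set: ASCII hyphen-minus plus the Unicode hyphens and
# dashes U+2010 through U+2015. The real script may match a different set.
DASHES = re.compile("([-\u2010-\u2015])")


def isolate_dashes(text: str) -> str:
    """Surround every hyphen or dash with newlines so it becomes its own token."""
    return DASHES.sub(r"\n\1\n", text)


def tokens(text: str) -> list:
    """Split on whitespace; a simplified stand-in, not collateX's tokenizer."""
    return text.split()


printed = "or confine a mountain-stream with a"      # reading in the print witnesses
manuscript = "or confine a mountain stream with a"   # reading in fMS

# With the function applied, the hyphen becomes a separate token.
with_fn = tokens(isolate_dashes(printed))
# Without it, the hyphen stays inside the word.
without_fn = tokens(printed)

print(with_fn)
print(without_fn)
print(tokens(manuscript))
```

With the function applied, the print witnesses and fMS differ only by the lone `-` token, which is why that token ends up in an `<app>` of its own with fMS absent. Without it, `mountain-stream` stays whole and is compared directly against `mountain stream`.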
