README.md: 21 additions & 30 deletions
@@ -63,7 +63,7 @@ Before running any alignment, make sure you edit your copy of `config.ini`. See
63
63
#### NOTE: source designates the source database from which reuses are deemed to originate, and target is the collection borrowing from source. In practice, the number of alignments won't vary significantly if you swap source and target
64
64
65
65
The sequence aligner is executed via the `textpair` command. The basic command is:
`textpair` takes the following command-line arguments:
69
69
@@ -72,13 +72,14 @@ The sequence aligner is executed via the `textpair` command. The basic command i
72
72
-`--output_path`: path to results
73
73
-`--debug`: turn on debugging
74
74
-`--workers`: Set number of workers/threads to use for parsing, ngram generation, and alignment.
75
-
-`--load_only_web_app`: Define whether to load results into a database viewable via a web application. Set to True by default.
76
75
-`--update_db`: update database without rebuilding web_app. Should be used in conjunction with the --file argument
77
76
-`--file`: alignment results file to load into database. Only used with the --update_db argument.
78
-
-`--only_align`: Skip parsing or ngram generation phase to go straight to the aligner. This is when you want to run a new alignment without having to go through preprocessing.
77
+
-`--source_metadata`: source metadata needed for loading database. Used only with the --update_db and --file arguments.
78
+
-`--target_metadata`: target metadata needed for loading database. Used only with the --update_db and --file arguments.
79
+
-`--only_align`: Run alignment based on preprocessed text data from a previous alignment.
80
+
-`--load_only_web_app`: Define whether to load results into a database viewable via a web application. Set to True by default.
79
81
-`--skip_web_app`: define whether to load results into a database and build a corresponding web app
80
82
81
-
82
83
## Configuring the alignment
83
84
84
85
When running an alignment, you need to provide a configuration file to the `textpair` command.
@@ -88,14 +89,17 @@ Then you can start editing this file. Note that all parameters have comments exp
88
89
89
90
While most values are reasonable defaults and don't require any edits, here are the most important settings you will want to check out:
90
91
91
-
#### In the FILE_PATHS section
92
+
#### In the TEXT_SOURCES section
93
+
92
94
This is where you should define the paths for your source and target files. Note that if you define no target, files from source will be compared to one another. In this case, a file is treated as a source only when it is older than, or from the same year as, the file it is compared against. This avoids treating a document as the source of a text that was written before it.
93
95
To leverage a PhiloLogic database to extract text and relevant metadata, point to the directory of the PhiloLogic DB used. You should then use the `--is_philo_db` flag.
96
+
To link your TextPAIR web app to PhiloLogic databases (for source and target), set source_url and target_url.
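The date rule described above for source-only comparisons can be sketched as follows. This is only an illustration of the rule, not TextPAIR's implementation: the `(filename, year)` tuples and the `comparable_pairs` helper are hypothetical, and skipping self-comparison is an assumption.

```python
from itertools import product


def comparable_pairs(docs):
    """Yield (source, target) pairs where the source is not newer than the target.

    `docs` is a hypothetical list of (filename, year) tuples.
    """
    for (src, src_year), (tgt, tgt_year) in product(docs, repeat=2):
        if src == tgt:
            continue  # assumption: a file is never compared with itself
        if src_year <= tgt_year:
            yield src, tgt


docs = [("a.xml", 1750), ("b.xml", 1760), ("c.xml", 1750)]
pairs = list(comparable_pairs(docs))
```

Under this rule, two files from the same year are compared in both directions, while a 1760 file is never treated as the source of a 1750 file.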
94
97
95
-
#### In the TEI Parsing section
98
+
#### In the TEXT_PARSING section
96
99
97
100
-`parse_source_files` and `parse_target_files`: both of these settings determine whether you want TextPAIR to parse your TEI files or not.
98
101
Set to `yes` by default. If you are relying on parsed output from PhiloLogic, you will want to set this to `no` or `false`.
102
+
-`source_file_type` and `target_file_type`: define the type of text file, either TEI or plain text. If using plain text, you will need to supply a metadata file in the TEXT_SOURCES section.
99
103
-`source_words_to_keep` and `target_words_to_keep`: defines files containing lists of words (separated by a newline) which the parser should keep.
100
104
Other words are discarded.
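To check what a given `config.ini` actually says before launching a run, a short script can help. The sketch below assumes the file is standard INI that Python's `configparser` can read, and that the keys sit under the section names mentioned above; the sample values and the example URL are made up for illustration.

```python
import configparser

# Hypothetical excerpt of a config.ini; values are placeholders.
SAMPLE = """
[TEXT_SOURCES]
source_url = https://example.org/philologic/source_db

[TEXT_PARSING]
parse_source_files = no
parse_target_files = yes
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)  # use config.read("config.ini") for a real file

# getboolean() accepts yes/no as well as true/false, matching the values described above.
parse_source = config.getboolean("TEXT_PARSING", "parse_source_files")
parse_target = config.getboolean("TEXT_PARSING", "parse_target_files")
source_url = config.get("TEXT_SOURCES", "source_url")
```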
101
105
@@ -104,39 +108,24 @@ To leverage a PhiloLogic database to extract text and relevant metadata, point t
104
108
-`source_text_object_level` and `target_text_object_level`: Define the text object level at which texts are compared.
105
109
Possible values are `doc`, `div1`, `div2`, `div3`, `para`, `sent`. This is only used when relying on a PhiloLogic database.
106
110
-`ngram`: Size of your ngram. The default is 3, which seems to work well in most cases. A lower number tends to produce more uninteresting short matches.
107
-
-`language`: This determines the language used by the Porter Stemmer as well as by Spacy (if using more advanced POS filtering features and lemmatization).
111
+
-`language`: This determines the language used by the Porter Stemmer as well as by Spacy (if using more advanced POS filtering features, lemmatization, or NER).
108
112
Note that you should use language codes from the <a href="https://spacy.io/models/">Spacy
109
113
documentation</a>.
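To make the `ngram` setting above more concrete, here is a minimal sketch of word n-gram generation. It is a generic illustration rather than TextPAIR's own code, and the `word_ngrams` helper is hypothetical.

```python
def word_ngrams(tokens, n=3):
    """Return all contiguous n-grams of `tokens` as tuples (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the quick brown fox jumps".split()
trigrams = word_ngrams(tokens, 3)  # 3 trigrams from 5 tokens
bigrams = word_ngrams(tokens, 2)   # 4 bigrams from 5 tokens
```

Shorter n-grams are more likely to be shared by unrelated passages purely by chance, which is consistent with the note above that a lower value tends to produce more uninteresting short matches.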
110
-
Note that there is a section on Vector Space Alignment preprocessing. These options are for the `vsa` matcher (see next section) only. It is not recommended that you use these at this time.
114
+
Note that there is a section on Vector Space Alignment preprocessing. These options are for the `vsa` matcher (see next section) only. It is not recommended that you use these at this time.
111
115
112
116
#### In the Matching section
113
-
Note that there are two different types of matching algorithms, with different parameters. The current recommended one is `sa` (for sequence alignment). The `vsa` algorithm is HIGHLY experimental, still under heavy development, and is not guaranteed to work.
114
117
115
-
#### In the Web Application section
116
-
117
-
-`api_server`: This should point to the server where the TextPAIR is running.
118
-
-`source_philo_db_link` and `target_philo_db_link`: Provide a URL for the source and target PhiloLogic databases if you want to
119
-
link back to the original PhiloLogic instance to contextualize your results.
120
-
121
-
122
-
Note that the `--is_philo_db` flag assumes both source and target DBs are PhiloLogic databases.
118
+
Note that there are two different types of matching algorithms, with different parameters. The current recommended one is `sa` (for sequence alignment). The `vsa` algorithm is HIGHLY experimental, still under heavy development, and is not guaranteed to work.
123
119
124
120
## Run comparison between preprocessed files manually
125
121
126
122
It is possible to run a comparison between documents without having to regenerate ngrams. In this case you need to use the
127
-
`--only_align` argument with the `textpair` command. Source files (and target files if doing a cross-database alignment) need to point
128
-
to the location of generated ngrams. You will also need to point to the `metadata.json` file which should be found in the `metadata`
129
-
directory found in the parent directory of your ngrams.
130
-
131
-
-`--source_files`: path to source ngrams generated by `textpair`
132
-
-`--target_files`: path to target ngrams generated by `textpair`. If this option is not defined, the comparison will be done between source files.
133
-
-`--source_metadata`: path to source metadata
134
-
-`--target_metadata`: path to target metadata
123
+
`--only_align` argument with the `textpair` command.
135
124
136
-
Example: assuming source files are in `./source` and target files in `./target`:
@@ -150,26 +139,28 @@ In this file, there are a number of fields that can be configured:
150
139
-`webServer`: should not be changed as only Apache is supported for the foreseeable future.
151
140
-`appPath`: this should match the WSGI configuration in `/etc/text-pair/apache_wsgi.conf`. Should not be changed without knowing how to work with `mod_wsgi`.
152
141
-`databaseName`: Defines the name of the PostgreSQL database where the data lives.
142
+
-`matchingAlgorithm`: DO NOT EDIT: tells the web app which matching method you used, and therefore impacts functionality within the Web UI.
153
143
-`databaseLabel`: Title of the database used in the Web Application
154
144
-`branding`: Defines links in the header
155
145
-`sourcePhiloDBLink` and `targetPhiloDBLink`: Provide URL to PhiloLogic database to contextualize shared passages.
156
146
-`sourceLabel` and `targetLabel` are the names of source DB and target DB. This field supports HTML tags.
157
-
-`metadataTypes`: defines the value type of field. Either `TEXT` or `INTEGER`.
158
147
-`sourceCitation` and `targetCitation` define the bibliography citation in results. `field` defines the metadata field to use, and `style` is for CSS styling (using key/value for CSS rules)
159
148
-`metadataFields` defines the fields available for searching in the search form for `source` and `target`.
160
149
`label` is the name used in the form and `value` is the actual name of the metadata field as stored in the SQL database.
161
150
-`facetFields` works the same way as `metadataFields` but for defining which fields are available in the faceted browser section.
162
151
-`timeSeriesIntervals` defines the time intervals available for the time series functionality.
152
+
-`banalitiesStored` DO NOT EDIT: defines whether banalities (formulaic passages) have been stored.
163
153
164
154
Once you've edited these fields to your liking, you can regenerate your database by running the `npm run build` command from the directory where the `appConfig.json` file is located.
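As a rough illustration of the `label`/`value` relationship in `metadataFields`, the sketch below maps form labels to SQL field names. The JSON layout shown is an assumption made for this example (a `source` and a `target` list of `label`/`value` objects), and the field names are invented; check your own `appConfig.json` for the actual structure.

```python
import json

# Hypothetical excerpt; the real appConfig.json layout may differ.
SAMPLE_CONFIG = """
{
  "metadataFields": {
    "source": [{"label": "Author", "value": "source_author"}],
    "target": [{"label": "Author", "value": "target_author"}]
  }
}
"""


def form_fields(config, side):
    """Map each form label to its underlying metadata field for `side`."""
    return {field["label"]: field["value"] for field in config["metadataFields"][side]}


config = json.loads(SAMPLE_CONFIG)
source_fields = form_fields(config, "source")
target_fields = form_fields(config, "target")
```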
165
155
166
156
Built with support from the Mellon Foundation and the Fondation de la Maison des Sciences de l'Homme.
167
157
168
158
## Post processing alignment results
169
159
170
-
TextPAIR produces two different files (found in the `output/results/` directory) as a result of each alignment task:
160
+
TextPAIR produces two (or three if passage filtering is enabled) different files (found in the `output/results/` directory) as a result of each alignment task:
171
161
172
162
- The `alignments.jsonl` file: this contains all alignments which were found by TextPAIR. Each line is formatted as an individual JSON string.
173
-
- The `duplicate_files.txt` file: this contains a list of potential duplicate files TextPAIR identified between the source and target databases.
163
+
- The `duplicate_files.csv` file: this contains a list of potential duplicate files TextPAIR identified between the source and target databases.
164
+
- The `filtered_passages` file: shows source_passages which were filtered out based on phrase matching. Only generated if a file containing passages to filter has been provided.
174
165
175
166
These files are designed to be used for further inspection of the alignments, and potential post processing tasks such as alignment filtering or clustering.
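Since each line of `alignments.jsonl` is an individual JSON string, the file can be streamed one alignment at a time. The sketch below makes no assumption about the keys inside each record; the `read_alignments` helper is hypothetical and not part of TextPAIR.

```python
import json


def read_alignments(path):
    """Yield one parsed alignment (a dict) per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

For example, `sum(1 for _ in read_alignments("output/results/alignments.jsonl"))` counts the alignments without loading the whole file into memory, which is a reasonable starting point for filtering or clustering.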