Commit b51be42

update docs, and NPM packages
1 parent 49a6d5c commit b51be42

7 files changed

+77
-190
lines changed

README.md

Lines changed: 21 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -63,7 +63,7 @@ Before running any alignment, make sure you edit your copy of `config.ini`. See
6363
#### NOTE: source designates the source database from which reuses are deemed to originate, and target is the collection borrowing from source. In practice, the number of alignments won't vary significantly if you swap source and target
6464

6565
The sequence aligner is executed via the `textpair` command. The basic command is:
66-
`textpair --config=/path/to/config [OPTIONS] [database_name`]
66+
`textpair --config=/path/to/config [OPTIONS] [database_name]`
6767

6868
`textpair` takes the following command-line arguments:
6969

@@ -72,13 +72,14 @@ The sequence aligner is executed via the `textpair` command. The basic command i
7272
- `--output_path`: path to results
7373
- `--debug`: turn on debugging
7474
- `--workers`: Set number of workers/threads to use for parsing, ngram generation, and alignment.
75-
- `--load_only_web_app`: Define whether to load results into a database viewable via a web application. Set to True by default.
7675
- `--update_db`: update database without rebuilding web_app. Should be used in conjunction with the --file argument
7776
- `--file`: alignment results file to load into database. Only used with the --update_db argument.
78-
- `--only_align`: Skip parsing or ngram generation phase to go straight to the aligner. This is when you want to run a new alignment without having to go through preprocessing.
77+
- `--source_metadata`: source metadata needed for loading database. Used only with the --update_db and --file arguments.
78+
- `--target_metadata`: target metadata needed for loading database. Used only with the --update_db and --file arguments.
79+
- `--only_align`: Run alignment based on preprocessed text data from a previous alignment.
80+
- `--load_only_web_app`: Define whether to load results into a database viewable via a web application. Set to True by default.
7981
- `--skip_web_app`: define whether to load results into a database and build a corresponding web app
8082

81-
8283
## Configuring the alignment
8384

8485
When running an alignment, you need to provide a configuration file to the `textpair` command.
@@ -88,14 +89,17 @@ Then you can start editing this file. Note that all parameters have comments exp
8889

8990
While most values are reasonable defaults and don't require any edits, here are the most important settings you will want to check out:
9091

91-
#### In the FILE_PATHS section
92+
#### In the TEXT_SOURCES section
93+
9294
This is where you should define the paths for your source and target files. Note that if you define no target, files from source will be compared to one another. In this case, files will be compared only when the source file is older or of the same year as the target file. This avoids treating a document as a source when it was written after the target.
9395
To leverage a PhiloLogic database to extract text and relevant metadata, point to the directory of the PhiloLogic DB used. You should then use the `--is_philo_db` flag.
96+
To link your TextPAIR web app to PhiloLogic databases (for source and target), set source_url and target_url.
9497

95-
#### In the TEI Parsing section
98+
#### In the TEXT_PARSING section
9699

97100
- `parse_source_files` and `parse_target_files`: both of these settings determine whether you want TextPAIR to parse your TEI files or not.
98101
Set to `yes` by default. If you are relying on parsed output from PhiloLogic, you will want to set this to `no` or `false`.
102+
- `source_file_type` and `target_file_type`: define the type of text file, either TEI or plain text. If using plain text, you will need to supply a metadata file in the TEXT_SOURCES section.
99103
- `source_words_to_keep` and `target_words_to_keep`: defines files containing lists of words (separated by a newline) which the parser should keep.
100104
Other words are discarded.
101105

@@ -104,39 +108,24 @@ To leverage a PhiloLogic database to extract text and relevant metadata, point t
104108
- `source_text_object_level` and `target_text_object_level`: Define the individual text object used as the unit of comparison between texts.
105109
Possible values are `doc`, `div1`, `div2`, `div3`, `para`, `sent`. This is only used when relying on a PhiloLogic database.
106110
- `ngram`: Size of your ngram. The default is 3, which seems to work well in most cases. A lower number tends to produce more uninteresting short matches.
107-
- `language`: This determines the language used by the Porter Stemmer as well as by Spacy (if using more advanced POS filtering features and lemmatization).
111+
- `language`: This determines the language used by the Porter Stemmer as well as by Spacy (if using more advanced POS filtering features, lemmatization, or NER).
108112
Note that you should use language codes from the <a href="https://spacy.io/models/">Spacy
109113
documentation</a>.
110-
Note that there is a section on Vector Space Alignment preprocessing. These options are for the `vsa` matcher (see next section) only. It is not recommended that you use these at this time.
114+
Note that there is a section on Vector Space Alignment preprocessing. These options are for the `vsa` matcher (see next section) only. It is not recommended that you use these at this time.
111115

112116
#### In the Matching section
113-
Note that there are two different types of matching algorithms, with different parameters. The current recommended one is `sa` (for sequence alignment). The `vsa` algorith is HIGHLY experimental, still under heavy development, and is not guaranteed to work.
114117

115-
#### In the Web Application section
116-
117-
- `api_server`: This should point to the server where the TextPAIR is running.
118-
- `source_philo_db_link` and `target_philo_db_link`: Provide a URL for the source and target PhiloLogic databases if you want to
119-
link back to the original PhiloLogic instance to contextualize your results.
120-
121-
122-
Note that the `--is_philo_db` flag assumes both source and target DBs are PhiloLogic databases.
118+
Note that there are two different types of matching algorithms, with different parameters. The current recommended one is `sa` (for sequence alignment). The `vsa` algorithm is HIGHLY experimental, still under heavy development, and is not guaranteed to work.
123119

124120
## Run comparison between preprocessed files manually
125121

126122
It is possible to run a comparison between documents without having to regenerate ngrams. In this case you need to use the
127-
`--only_align` argument with the `textpair` command. Source files (and target files if doing a cross-database alignment) need to point
128-
to the location of generated ngrams. You will also need to point to the `metadata.json` file which should be found in the `metadata`
129-
directory found in the parent directory of your ngrams.
130-
131-
- `--source_files`: path to source ngrams generated by `textpair`
132-
- `--target_files`: path to target ngrams generated by `textpair`. If this option is not defined, the comparison will be done between source files.
133-
- `--source_metadata`: path to source metadata
134-
- `--target_metadata`: path to target metadata
123+
`--only_align` argument with the `textpair` command.
135124

136-
Example: assuming source files are in `./source` and target files in `./target`:
125+
Example:
137126

138127
```console
139-
textpair --only_align --source_files=source/ngrams --source_metadata=source/metadata/metadata.json --target_files=target/ngrams --target_metadata=target/metadata/metadata.json --workers=10 --output_path=results/
128+
textpair --config=config.ini --only_align --workers=10 my_database_name
140129
```
141130

142131
## Configuring the Web Application
@@ -150,26 +139,28 @@ In this file, there are a number of fields that can be configured:
150139
- `webServer`: should not be changed as only Apache is supported for the foreseeable future.
151140
- `appPath`: this should match the WSGI configuration in `/etc/text-pair/apache_wsgi.conf`. Should not be changed without knowing how to work with `mod_wsgi`.
152141
- `databaseName`: Defines the name of the PostgreSQL database where the data lives.
142+
- `matchingAlgorithm`: DO NOT EDIT: tells the web app which matching method you used, and therefore impacts functionality within the Web UI.
153143
- `databaseLabel`: Title of the database used in the Web Application
154144
- `branding`: Defines links in the header
155145
- `sourcePhiloDBLink` and `targetPhiloDBLink`: Provide URL to PhiloLogic database to contextualize shared passages.
156146
- `sourceLabel` and `targetLabel` are the names of source DB and target DB. This field supports HTML tags.
157-
- `metadataTypes`: defines the value type of field. Either `TEXT` or `INTEGER`.
158147
- `sourceCitation` and `targetCitation` define the bibliography citation in results. `field` defines the metadata field to use, and `style` is for CSS styling (using key/value for CSS rules)
159148
- `metadataFields` defines the fields available for searching in the search form for `source` and `target`.
160149
`label` is the name used in the form and `value` is the actual name of the metadata field as stored in the SQL database.
161150
- `facetFields` works the same way as `metadataFields` but for defining which fields are available in the faceted browser section.
162151
- `timeSeriesIntervals` defines the time intervals available for the time series functionality.
152+
- `banalitiesStored` DO NOT EDIT: defines whether banalities (formulaic passages) have been stored.
163153

164154
Once you've edited these fields to your liking, you can regenerate your database by running the `npm run build` command from the directory where the `appConfig.json` file is located.
165155

166156
Built with support from the Mellon Foundation and the Fondation de la Maison des Sciences de l'Homme.
167157

168158
## Post processing alignment results
169159

170-
TextPAIR produces two different files (found in the `output/results/` directory) as a result of each alignment task:
160+
TextPAIR produces two (or three if passage filtering is enabled) different files (found in the `output/results/` directory) as a result of each alignment task:
171161

172162
- The `alignments.jsonl` file: this contains all alignments which were found by TextPAIR. Each line is formatted as an individual JSON string.
173-
- The `duplicate_files.txt` file: this contains a list of potential duplicate files TextPAIR identified between the source and target databases.
163+
- The `duplicate_files.csv` file: this contains a list of potential duplicate files TextPAIR identified between the source and target databases.
164+
- The `filtered_passages` file: shows source_passages which were filtered out based on phrase matching. Only generated if a file containing passages to filter has been provided.
174165

175166
These files are designed to be used for further inspection of the alignments, and potential post processing tasks such as alignment filtering or clustering.
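
As a rough illustration of the post-processing described above, the following Python sketch reads an `alignments.jsonl` file line by line and keeps alignments whose source passage reaches a minimum length. It assumes an uncompressed file. `source_passage` is the only field name taken from the code in this commit; the function names and the length threshold are hypothetical.

```python
import json
from typing import Any, Dict, Iterator, List


def iter_alignments(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one alignment per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf8") as input_file:
        for line in input_file:
            line = line.strip()
            if line:
                yield json.loads(line)


def keep_long_passages(path: str, min_chars: int = 50) -> List[Dict[str, Any]]:
    """Return alignments whose source passage has at least min_chars characters.

    min_chars is an illustrative threshold, not a TextPAIR setting.
    """
    return [
        alignment
        for alignment in iter_alignments(path)
        if len(alignment.get("source_passage", "")) >= min_chars
    ]
```

Reading lazily with a generator keeps memory use flat on large result files; the list is only built for the alignments that pass the filter.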

config/config.ini

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -182,7 +182,7 @@ max_gap = 15
182182
minimum_matching_ngrams = 4
183183

184184
# Automatically increase max_gap once minimum_matching_ngrams is reached
185-
flex_gap = false
185+
flex_gap = true
186186

187187
# ONLY FOR VSA: defines similarity threshold. Value between 0 and 1, with values closer to one
188188
# meaning higher similarity.
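
A minimal sketch of how the three matching settings shown in this hunk could be read with Python's standard `configparser`. The section name `MATCHING` is an assumption made for illustration (the hunk does not show the section header), and the sample text only reproduces the values visible above.

```python
import configparser

# Sample text mirroring the values in this hunk; the [MATCHING] header is assumed.
SAMPLE_CONFIG = """
[MATCHING]
max_gap = 15
minimum_matching_ngrams = 4
flex_gap = true
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CONFIG)
matching = parser["MATCHING"]

max_gap = matching.getint("max_gap")
minimum_matching_ngrams = matching.getint("minimum_matching_ngrams")
# getboolean accepts true/false, yes/no, on/off and 1/0 (case-insensitive).
flex_gap = matching.getboolean("flex_gap")

print(max_gap, minimum_matching_ngrams, flex_gap)
```

Using `getboolean` rather than comparing strings means `true`, `True`, and `yes` are all read the same way.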

docker/Dockerfile

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ RUN curl -sL https://deb.nodesource.com/setup_16.x | sudo -E bash && apt-get ins
88

99
RUN apt-get clean && rm -rf /var/lib/apt
1010

11-
RUN mkdir textpair && curl -L https://github.com/ARTFL-Project/text-pair/archive/v2.0-beta.11.tar.gz | tar xz -C textpair --strip-components 1 &&\
11+
RUN mkdir textpair && curl -L https://github.com/ARTFL-Project/text-pair/archive/v2.0-beta.12.tar.gz | tar xz -C textpair --strip-components 1 &&\
1212
cd textpair && sh install.sh && mkdir -p /var/www/html/text-pair
1313

1414
RUN echo "[WEB_APP]\nweb_app_path = /var/www/html/text-pair\napi_server = http://localhost/text-pair-api\n[DATABASE]\ndatabase_name = textpair\ndatabase_user = textpair\ndatabase_password = textpair" > /etc/text-pair/global_settings.ini

lib/textpair/banality_finder.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -114,7 +114,7 @@ def phrase_matcher(filepath: str, banality_phrases_path: str, count: Optional[in
114114
for phrase in banality_phrases:
115115
if phrase in clean_text(alignment["source_passage"]):
116116
passage = f"{phrase}\nFOUND IN:\n{alignment['source_passage']}\n\n"
117-
filtered_passages.write(passage.encode("utf8"))
117+
filtered_passages.write(passage.encode("utf8")) # type: ignore
118118
passages_filtered += 1
119119
banality = True
120120
break
@@ -127,13 +127,13 @@ def phrase_matcher(filepath: str, banality_phrases_path: str, count: Optional[in
127127
if __name__ == "__main__":
128128
import sys
129129

130-
filepath = sys.argv[1]
130+
file_path = sys.argv[1]
131131
# ngrams_file = sys.argv[2]
132132
# ngram_doc_path = sys.argv[3]
133133
# percentage = float(sys.argv[4])
134134
# with open(filepath.replace("alignments.jsonl.lz4", "count.txt"), "rb") as input_file:
135135
# count = int(input_file.read().strip())
136136
# total = banality_auto_detect(filepath, ngrams_file, ngram_doc_path, count, percentage=percentage)
137137
phrase_path = sys.argv[2]
138-
total = phrase_matcher(filepath, phrase_path, int(sys.argv[3]))
138+
total = phrase_matcher(file_path, phrase_path, int(sys.argv[3]))
139139
print(total, "banalities found.")
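
The phrase-matching loop in this file can be sketched in isolation as follows. This is a simplified stand-in rather than the module's implementation: `clean_text` here is a hypothetical normalizer (lowercasing and collapsing non-word characters), and `split_banalities` is a hypothetical helper that works on in-memory dictionaries instead of the results file.

```python
import re
from typing import Dict, List, Tuple


def clean_text(text: str) -> str:
    """Hypothetical normalizer: lowercase and collapse non-word runs to spaces."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def split_banalities(
    alignments: List[Dict[str, str]], banality_phrases: List[str]
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Separate alignments whose source passage contains a listed phrase."""
    kept, filtered = [], []
    for alignment in alignments:
        passage = clean_text(alignment["source_passage"])
        # Substring test, as in the loop above; phrases are assumed pre-normalized.
        if any(phrase in passage for phrase in banality_phrases):
            filtered.append(alignment)
        else:
            kept.append(alignment)
    return kept, filtered
```

Because the test is a plain substring check, the phrases must be normalized the same way as the passages, otherwise a phrase containing punctuation or capitals would never match.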

lib/textpair/compare_ngrams.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,12 @@
11
#!/usr/bin/env python3
22

33

4-
import rapidjson as json
54
import os
65
from collections import defaultdict
76
from math import floor
8-
from multiprocess import Pool
97

8+
import rapidjson as json
9+
from multiprocess import Pool
1010
from namedlist import namedlist
1111

1212
docIndex = namedlist("docIndex", "doc_id, ngrams, ngram_length")
