Update docs and NPM packages

clovis · clovis · commit b51be42280d6 · 2023-05-16T22:14:54.000-05:00
diff --git a/README.md b/README.md
@@ -63,7 +63,7 @@ Before running any alignment, make sure you edit your copy of `config.ini`. See
 #### NOTE: source designates the source database from which reuses are deemed to originate, and target is the collection borrowing from source. In practice, the number of alignments won't vary significantly if you swap source and target
 
 The sequence aligner is executed via the `textpair` command. The basic command is:
-`textpair --config=/path/to/config [OPTIONS] [database_name`]
+`textpair --config=/path/to/config [OPTIONS] [database_name]`
 
 `textpair` takes the following command-line arguments:
 
@@ -72,13 +72,14 @@ The sequence aligner is executed via the `textpair` command. The basic command i
 -   `--output_path`: path to results
 -   `--debug`: turn on debugging
 -   `--workers`: Set number of workers/threads to use for parsing, ngram generation, and alignment.
--   `--load_only_web_app`: Define whether to load results into a database viewable via a web application. Set to True by default.
 -   `--update_db`: update database without rebuilding web_app. Should be used in conjunction with the --file argument
 -   `--file`: alignment results file to load into database. Only used with the --update_db argument.
--   `--only_align`: Skip parsing or ngram generation phase to go straight to the aligner. This is when you want to run a new alignment without having to go through preprocessing.
+-   `--source_metadata`: source metadata needed for loading database. Used only with the `--update_db` and `--file` arguments.
+-   `--target_metadata`: target metadata needed for loading database. Used only with the `--update_db` and `--file` arguments.
+-   `--only_align`: Run alignment based on preprocessed text data from a previous alignment.
+-   `--load_only_web_app`: Define whether to load results into a database viewable via a web application. Set to True by default.
 -   `--skip_web_app`: define whether to load results into a database and build a corresponding web app
 
-
 ## Configuring the alignment
 
 When running an alignment, you need to provide a configuration file to the `textpair` command.
@@ -88,14 +89,17 @@ Then you can start editing this file. Note that all parameters have comments exp
 
 While most values are reasonable defaults and don't require any edits, here are the most important settings you will want to checkout:
 
-#### In the FILE_PATHS section
+#### In the TEXT_SOURCES section
+
 This is where you should define the paths for your source and target files. Note that if you define no target, files from source will be compared to one another. In this case, files will be compared only when the source file is older or of the same year as the target file. This is to avoid considering a source a document which was written after the target.
 To leverage a PhiloLogic database to extract text and relevant metadata, point to the directory of the PhiloLogic DB used. You should then use the `--is_philo_db` flag.
+To link your TextPAIR web app to PhiloLogic databases (for source and target), set `source_url` and `target_url`.
 
-#### In the TEI Parsing section
+#### In the TEXT_PARSING section
 
 -   `parse_source_files`, and `parse_target_files`: both of these setting determine whether you want textPAIR to parse your TEI files or not.
     Set to `yes` by default. If you are relying on parsed output from PhiloLogic, you will want to set this to `no` or `false`.
+-   `source_file_type` and `target_file_type`: define the type of text file, either TEI or plain text. If using plain text, you will need to supply a metadata file in the TEXT_SOURCES section.
 -   `source_words_to_keep` and `target_words_to_keep`: defines files containing lists of words (separated by a newline) which the parser should keep.
     Other words are discarded.
 
@@ -104,39 +108,24 @@ To leverage a PhiloLogic database to extract text and relevant metadata, point t
 -   `source_text_object_level` and `target_text_object_level`: Define the individual text object from which to compare other texts with.
     Possible values are `doc`, `div1`, `div2`, `div3`, `para`, `sent`. This is only used when relying on a PhiloLogic database.
 -   `ngram`: Size of your ngram. The default is 3, which seems to work well in most cases. A lower number tends to produce more uninteresting short matches.
--   `language`: This determines the language used by the Porter Stemmer as well as by Spacy (if using more advanced POS filtering features and lemmatization).
+-   `language`: This determines the language used by the Porter Stemmer as well as by Spacy (if using more advanced POS filtering features, lemmatization, or NER).
     Note that you should use language codes from the <a href="https://spacy.io/models/">Spacy
     documentation</a>.
-Note that there is a section on Vector Space Alignment preprocessing. These options are for the `vsa` matcher (see next section) only. It is not recommended that you use these at this time.
+    Note that there is a section on Vector Space Alignment preprocessing. These options are for the `vsa` matcher (see next section) only. It is not recommended that you use these at this time.
 
 #### In the Matching section
-Note that there are two different types of matching algorithms, with different parameters. The current recommended one is `sa` (for sequence alignment). The `vsa` algorith is HIGHLY experimental, still under heavy development, and is not guaranteed to work.
 
-#### In the Web Application section
-
--   `api_server`: This should point to the server where the TextPAIR is running.
--   `source_philo_db_link` and `target_philo_db_link`: Provide a URL for the source and target PhiloLogic databases if you want to
-    link back to the original PhiloLogic instance to contextualize your results.
-
-
-Note that the `--is_philo_db` flag assumes both source and target DBs are PhiloLogic databases.
+Note that there are two types of matching algorithms, each with its own parameters. The currently recommended one is `sa` (sequence alignment). The `vsa` algorithm is HIGHLY experimental, still under heavy development, and is not guaranteed to work.
 
 ## Run comparison between preprocessed files manually
 
 It is possible to run a comparison between documents without having to regenerate ngrams. In this case you need to use the
-`--only_align` argument with the `textpair` command. Source files (and target files if doing a cross-database alignment) need to point
-to the location of generated ngrams. You will also need to point to the `metadata.json` file which should be found in the `metadata`
-directory found in the parent directory of your ngrams.
-
--   `--source_files`: path to source ngrams generated by `textpair`
--   `--target_files`: path to target ngrams generated by `textpair`. If this option is not defined, the comparison will be done between source files.
--   `--source_metadata`: path to source metadata
--   `--target_metadata`: path to target metadata
+`--only_align` argument with the `textpair` command.
 
-Example: assuming source files are in `./source` and target files in `./target`:
+Example:
 
 ```console
-textpair --only_align --source_files=source/ngrams --source_metadata=source/metadata/metadata.json --target_files=target/ngrams --target_metadata=target/metadata/metadata.json --workers=10 --output_path=results/
+textpair --config=config.ini --only_align --workers=10 my_database_name
 ```
 
 ## Configuring the Web Application
@@ -150,26 +139,28 @@ In this file, there are a number of fields that can be configured:
 -   `webServer`: should not be changed as only Apache is supported for the foreseeable future.
 -   `appPath`: this should match the WSGI configuration in `/etc/text-pair/apache_wsgi.conf`. Should not be changed without knowing how to work with `mod_wsgi`.
 -   `databaseName`: Defines the name of the PostgreSQL database where the data lives.
+-   `matchingAlgorithm`: DO NOT EDIT: tells the web app which matching method you used, and therefore impacts functionality within the Web UI.
 -   `databaseLabel`: Title of the database used in the Web Application
 -   `branding`: Defines links in the header
 -   `sourcePhiloDBLink` and `targetPhiloDBLink`: Provide URL to PhiloLogic database to contextualize shared passages.
 -   `sourceLabel` and `targetLabel` are the names of source DB and target DB. This field supports HTML tags.
--   `metadataTypes`: defines the value type of field. Either `TEXT` or `INTEGER`.
 -   `sourceCitation` and `targetCitation` define the bibliography citation in results. `field` defines the metadata field to use, and `style` is for CSS styling (using key/value for CSS rules)
 -   `metadataFields` defines the fields available for searching in the search form for `source` and `target`.
     `label` is the name used in the form and `value` is the actual name of the metadata field as stored in the SQL database.
 -   `facetFields` works the same way as `metadataFields` but for defining which fields are available in the faceted browser section.
 -   `timeSeriesIntervals` defines the time intervals available for the time series functionnality.
+-   `banalitiesStored`: DO NOT EDIT: defines whether banalities (formulaic passages) have been stored.
 
 Once you've edited these fields to your liking, you can regenerate your database by running the `npm run build` command from the directory where the `appConfig.json` file is located.
 
 Built with support from the Mellon Foundation and the Fondation de la Maison des Sciences de l'Homme.
 
 ## Post processing alignment results
 
-TextPAIR produces two different files (found in the `output/results/` directory) as a result of each alignment task:
+TextPAIR produces two (or three if passage filtering is enabled) different files (found in the `output/results/` directory) as a result of each alignment task:
 
 -   The `alignments.jsonl` file: this contains all alignments which were found by TextPAIR. Each line is formatted as an individual JSON string.
--   The `duplicate_files.txt` file: this contains a list of potential duplicate files TextPAIR identified between the source and target databases.
+-   The `duplicate_files.csv` file: this contains a list of potential duplicate files TextPAIR identified between the source and target databases.
+-   The `filtered_passages` file: this contains the source passages that were filtered out based on phrase matching. It is only generated if a file containing passages to filter has been provided.
 
 These files are designed to be used for further inspection of the alignments, and potential post processing tasks such as alignment filtering or clustering.
diff --git a/config/config.ini b/config/config.ini
@@ -182,7 +182,7 @@ max_gap = 15
 minimum_matching_ngrams  = 4
 
 # Automatically increase max_gap once minimum_matching_ngrams is reached
-flex_gap = false
+flex_gap = true
 
 # ONLY FOR VSA: defines similarity threshold. Value between 0 and 1, with values closer to one
 # meaning higher similarity.
diff --git a/docker/Dockerfile b/docker/Dockerfile
@@ -8,7 +8,7 @@ RUN curl -sL https://deb.nodesource.com/setup_16.x | sudo -E bash && apt-get ins
 
 RUN apt-get clean && rm -rf /var/lib/apt
 
-RUN mkdir textpair && curl -L  https://github.com/ARTFL-Project/text-pair/archive/v2.0-beta.11.tar.gz | tar xz -C textpair --strip-components 1 &&\
+RUN mkdir textpair && curl -L  https://github.com/ARTFL-Project/text-pair/archive/v2.0-beta.12.tar.gz | tar xz -C textpair --strip-components 1 &&\
     cd textpair && sh install.sh && mkdir -p /var/www/html/text-pair
 
 RUN echo "[WEB_APP]\nweb_app_path = /var/www/html/text-pair\napi_server = http://localhost/text-pair-api\n[DATABASE]\ndatabase_name = textpair\ndatabase_user = textpair\ndatabase_password = textpair" > /etc/text-pair/global_settings.ini
diff --git a/lib/textpair/banality_finder.py b/lib/textpair/banality_finder.py
@@ -114,7 +114,7 @@ def phrase_matcher(filepath: str, banality_phrases_path: str, count: Optional[in
             for phrase in banality_phrases:
                 if phrase in clean_text(alignment["source_passage"]):
                     passage = f"{phrase}\nFOUND IN:\n{alignment['source_passage']}\n\n"
-                    filtered_passages.write(passage.encode("utf8"))
+                    filtered_passages.write(passage.encode("utf8")) # type: ignore
                     passages_filtered += 1
                     banality = True
                     break
@@ -127,13 +127,13 @@ def phrase_matcher(filepath: str, banality_phrases_path: str, count: Optional[in
 if __name__ == "__main__":
     import sys
 
-    filepath = sys.argv[1]
+    file_path = sys.argv[1]
     # ngrams_file = sys.argv[2]
     # ngram_doc_path = sys.argv[3]
     # percentage = float(sys.argv[4])
     # with open(filepath.replace("alignments.jsonl.lz4", "count.txt"), "rb") as input_file:
     #     count = int(input_file.read().strip())
     # total = banality_auto_detect(filepath, ngrams_file, ngram_doc_path, count, percentage=percentage)
     phrase_path = sys.argv[2]
-    total = phrase_matcher(filepath, phrase_path, int(sys.argv[3]))
+    total = phrase_matcher(file_path, phrase_path, int(sys.argv[3]))
     print(total, "banalities found.")
diff --git a/lib/textpair/compare_ngrams.py b/lib/textpair/compare_ngrams.py
@@ -1,12 +1,12 @@
 #!/usr/bin env python3
 
 
-import rapidjson as json
 import os
 from collections import defaultdict
 from math import floor
-from multiprocess import Pool
 
+import rapidjson as json
+from multiprocess import Pool
 from namedlist import namedlist
 
 docIndex = namedlist("docIndex", "doc_id, ngrams, ngram_length")
diff --git a/web-app/package-lock.json b/web-app/package-lock.json
diff --git a/web-app/package.json b/web-app/package.json