openlibhums · ajrbyers · Oct 22, 2024 · Dec 13, 2024 · Dec 13, 2024
diff --git a/README.md b/README.md
@@ -1,28 +1,135 @@
-# Pandoc Plugin
 
-This is a plugin for [Janeway](https://github.com/BirkbeckCTP/janeway) that provides a button for typesetters to automatically generate html files from user article submissions in docx/rtf. These files are first converted to markdown, and from there to html, and then registered as galleys of the original article.
+# Pandoc Plugin for Janeway
 
-## How to install:
+This plugin integrates [Pandoc](https://pandoc.org/) into the Janeway platform, enabling document conversion for articles. It supports generating HTML and PDF files from manuscripts and adding optional stamped coversheets.
 
-(You can find plugin installation instructions in the README for the back_content plugin [here](https://github.com/BirkbeckCTP/back_content))
+## Features
 
-1. SSH into the server and navigate to: /path/to/janeway/src/plugins
-2. Use git to clone the plugin's repository here. For example: `git clone https://github.com/hackman104/pandoc_plugin.git`
-3. Make sure you have activated janeway's virtual environment
-4. Return to /path/to/janeway/src and run `python manage.py install_plugins`
-5. Restart apache (command will depend on your distro)
-6. Go to your journal webpage, go to the manager, and click "Plugins" at the bottom of the side-bar on the left
-7. Find the plugin you are working on, click its link, and then enable it and click submit
+- Converts Word documents (`.docx`) and RTF manuscripts into Markdown, HTML, and optionally PDF format.
+- Automatically registers the generated HTML as galleys for the original article.
+- Optional inclusion of custom coversheets for PDF generation.
 
-### pandoc
+## How to Install
 
-*N. B. You must have pandoc installed on your server to use this plugin. Please see pandoc's installation documentation __[here](https://pandoc.org/installing.html)__.*
+1. SSH into the server and navigate to the plugins directory:
+   ```bash
+   cd /path/to/janeway/src/plugins
+   ```
 
-Most of the package managers for Linux distributions offer older versions of Pandoc, and you need at least 1.13 for full docx support. Luckily, pandoc offers a compiled distribution in .deb format:
+2. Use git to clone the plugin repository:
+   ```bash
+   git clone https://github.com/openlibhums/pandoc_plugin.git
+   ```
 
-``` sh
+3. Activate Janeway's virtual environment:
+   ```bash
+   source /path/to/janeway/venv/bin/activate
+   ```
+
+4. Return to the Janeway source directory and install the plugin:
+   ```bash
+   cd /path/to/janeway/src
+   pip3 install -r plugins/pandoc_plugin/requirements.txt
+   python manage.py install_plugins pandoc_plugin
+   ```
+
+5. Restart your webserver to apply the changes (command depends on your distro).
+
+6. Log in to your journal's admin interface:
+   - Go to the "Manager" section.
+   - Click "Plugins" at the bottom of the left-hand sidebar.
+   - Locate the Pandoc Plugin, enable it, and click submit.
+
+### Installing Pandoc and XeLaTeX
+
+#### Pandoc
+
+You must have Pandoc installed on your server to use this plugin. Most Linux distributions include older versions of Pandoc, but at least version 1.13 is required for full `.docx` support.
+
+To install a newer version of Pandoc:
+
+```bash
 wget 'https://github.com/jgm/pandoc/releases/download/2.19.2/pandoc-2.19.2-1-amd64.deb'
-dpkg -i pandoc-2.5-1-amd64.deb
-rm pandoc-2.5-1-amd64.deb
+sudo dpkg -i pandoc-2.19.2-1-amd64.deb
+rm pandoc-2.19.2-1-amd64.deb
 ```
-Pandoc should now be available for all users to run, ensuring the plugin will work
+
+Verify that Pandoc is installed and available:
+```bash
+pandoc --version
+```
+
+#### XeLaTeX
+
+The plugin uses `xelatex` as the PDF engine. If you encounter the error `xelatex not found`, follow these steps to install `xelatex` on your system.
+
+##### On Ubuntu/Debian:
+To install `xelatex` using the TeX Live distribution:
+```bash
+sudo apt update
+sudo apt install texlive-xetex
+```
+
+##### On CentOS/RHEL:
+```bash
+sudo yum install texlive-xetex
+```
+
+##### On macOS:
+```bash
+brew install --cask mactex
+```
+
+Verify `xelatex` is installed:
+```bash
+xelatex --version
+```
+
+## Configuration
+
+### Settings
+
+The plugin provides the following configurable settings:
+
+1. **Coversheet HTML (`cover_sheet`)**:
+   - Specifies the HTML template used for the stamped coversheet.
+   - Configured via the plugin manager interface.
+
+2. **Extract Images (`pandoc_extract_images`)**:
+   - Boolean setting to enable extraction of images.
+   - Configured via the plugin manager interface.
+
+
+### Updating Settings
+
+1. Go to the Janeway admin panel.
+2. Navigate to `Settings > Plugins > Pandoc Plugin`.
+3. Configure the above settings as needed.
+
+## Usage
+
+### Generating Files via Management Command
+
+To generate HTML or PDFs for specific articles or files, use the management command:
+
+```bash
+python manage.py generate_pdfs <article_id> --owner <email> --conversion_type <stamped|unstamped>
+```
+
+- `article_id`: ID of the article to process.
+- `--owner`: Email of the account owner for the generated file.
+- `--conversion_type`: Whether Janeway adds a cover sheet to the converted file (`stamped`) or not (`unstamped`).
+
+#### Example Command:
+
+```bash
+python manage.py generate_pdfs 123 456 --owner=editor@example.com
+```
+
+### File Generation in the User Interface
+
+- Editors can trigger HTML and PDF generation for individual files in typesetting from the Janeway interface by clicking the "Options" link.
+
+## License
+
+This plugin is licensed under the [GNU Affero General Public License](https://www.gnu.org/licenses/agpl-3.0.en.html) (AGPLv3).
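
Review note: the README's two verification steps (`pandoc --version`, `xelatex --version`) could also be checked from Python before a conversion is attempted. The snippet below is only a sketch of that idea; the helper name `missing_tools` is hypothetical and is not part of this PR or of Janeway.

```python
import shutil

# Tool names come from the README; everything else here is illustrative.
REQUIRED_TOOLS = ("pandoc", "xelatex")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("pandoc and xelatex are both available.")
```

A check like this would turn the `xelatex not found` failure mentioned in the README into an early, explicit message, assuming the plugin runs under the same PATH as the check.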
diff --git a/convert.py b/convert.py
@@ -3,11 +3,13 @@
 import subprocess
 import tempfile
 
+from django.conf import settings
+
 from bs4 import BeautifulSoup
+from PyPDF2 import PdfMerger
+
 from core.files import IMAGE_MIMETYPES
-from utils import models, setting_handler
 from utils.logger import get_logger
-
 from plugins.pandoc_plugin import plugin_settings
 
 logger = get_logger(__name__)
@@ -72,8 +74,143 @@ def generate_html_from_doc(doc_path, extract_images=False):
     return str(pandoc_soup), image_paths
 
 
+def generate_pdf_with_cover_sheet(doc_path, mime_type, cover_sheet_html):
+    if mime_type not in plugin_settings.PDF_CONVERSION_SUPPORTED_MIME_TYPES:
+        raise TypeError(f"File MIME type {mime_type} not supported")
+
+    images_temp_path = tempfile.mkdtemp()
+    cover_pdf_path = os.path.join(tempfile.gettempdir(), 'cover_sheet.pdf')
+    document_pdf_path = os.path.join(tempfile.gettempdir(),
+                                     f'{os.path.basename(doc_path)}_document.pdf')
+    merged_pdf_path = os.path.join(tempfile.gettempdir(),
+                                   f'{os.path.basename(doc_path)}_merged.pdf')
+
+    try:
+        subprocess.run(
+            [
+                'pandoc',
+                '-o',
+                cover_pdf_path,
+                '--from=html',
+                '--to=pdf',
+                '--pdf-engine=xelatex',
+            ],
+            input=cover_sheet_html.encode(),
+            check=True,
+        )
+    except subprocess.CalledProcessError as err:
+        raise PandocError(
+            f"Error during cover sheet PDF conversion: {str(err)}")
+
+    pandoc_command = (
+        PANDOC_CMD
+        + MEMORY_LIMIT_ARG
+        + [EXTRACT_MEDIA, images_temp_path]
+        + [doc_path, '-o', document_pdf_path, '--pdf-engine=xelatex']
+    )
+
+    try:
+        logger.info(f"[PANDOC] Running command: {pandoc_command}")
+        subprocess.run(pandoc_command, check=True, capture_output=True, text=True)
+    except subprocess.CalledProcessError as e:
+        raise PandocError(f"PandocError: {e.stderr}")
+
+    image_paths = [
+        os.path.join(base, f)
+        for base, _, files in os.walk(images_temp_path)
+        for f in files
+        if mimetypes.guess_type(f)[0] in IMAGE_MIMETYPES
+    ]
+
+    merger = PdfMerger()
+    merger.append(cover_pdf_path)
+    merger.append(document_pdf_path)
+    with open(merged_pdf_path, 'wb') as merged_file:
+        merger.write(merged_file)
+
+    return merged_pdf_path, image_paths
+
+
+def convert_word_to_pdf(doc_path, output_filename):
+    """Convert Word doc (docx, rtf, odt, etc.) to PDF using Pandoc with MIME type validation."""
+    mime_type, _ = mimetypes.guess_type(doc_path)
+
+    if mime_type not in plugin_settings.PDF_CONVERSION_SUPPORTED_MIME_TYPES:
+        raise TypeError(f"Unsupported file type for conversion: {mime_type}")
+
+    output_pdf_path = os.path.join(settings.BASE_DIR, 'files', 'temp',
+                                   output_filename)
+
+    pandoc_command = [
+        'pandoc',
+        doc_path,
+        '-o',
+        output_pdf_path,
+        '--pdf-engine=xelatex',
+        '-V',
+        'geometry:margin=1.5in',
+    ]
+
+    try:
+        subprocess.run(pandoc_command, check=True)
+    except subprocess.CalledProcessError as e:
+        raise Exception(f"Error converting Word document to PDF: {e}")
+
+    return output_pdf_path
+
+
+def convert_html_to_pdf(html_content, output_filename):
+    """Convert HTML content to PDF using Pandoc."""
+    output_pdf_path = os.path.join(
+        settings.BASE_DIR,
+        'files',
+        'temp',
+        output_filename,
+    )
+
+    try:
+        subprocess.run(
+            [
+                'pandoc',
+                '--from=html',
+                '--to=pdf',
+                '-o',
+                output_pdf_path,
+                '--pdf-engine=xelatex',
+                "-V",
+                "pagestyle=empty",
+                "-V",
+                "geometry:margin=1.5in",
+            ],
+            input=html_content.encode(),
+            check=True,
+        )
+    except subprocess.CalledProcessError as e:
+        raise Exception(f"Error converting HTML to PDF: {e}")
+
+    return output_pdf_path
+
+
+def merge_pdfs(cover_pdf, doc_pdf, output_filename):
+    """Merge two PDFs (cover sheet and document) into one."""
+    output_pdf_path = os.path.join(settings.BASE_DIR, 'files', 'temp', output_filename)
+
+    merger = PdfMerger()
+
+    try:
+        merger.append(cover_pdf)
+        merger.append(doc_pdf)
+        with open(output_pdf_path, 'wb') as merged_pdf:
+            merger.write(merged_pdf)
+    except Exception as e:
+        raise Exception(f"Error merging PDFs: {e}")
+    finally:
+        merger.close()
+
+    return output_pdf_path
+
+
 class PandocError(Exception):
     def __init__(self, msg, cmd=None):
         super().__init__(self, msg)
         self.cmd = cmd
-
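
Review note: `generate_pdf_with_cover_sheet` formats `e.stderr` into the `PandocError` message. `subprocess` only fills in that attribute when stderr is captured; otherwise it is `None`. The sketch below illustrates the difference with a stand-in command (a Python one-liner that fails, not a real pandoc call); the function name `stderr_on_failure` is hypothetical.

```python
import subprocess
import sys

# Stand-in for a failing pandoc call: write to stderr, then exit non-zero.
failing_command = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('boom'); sys.exit(2)",
]


def stderr_on_failure(command, capture):
    """Run `command` and return CalledProcessError.stderr if it fails."""
    try:
        subprocess.run(command, check=True, capture_output=capture, text=True)
    except subprocess.CalledProcessError as err:
        # Populated only when the child's stderr was captured.
        return err.stderr
    return ""


if __name__ == "__main__":
    print(repr(stderr_on_failure(failing_command, capture=True)))
    print(repr(stderr_on_failure(failing_command, capture=False)))
```

With `capture=True` the error text is available to put in the exception message; with `capture=False` the child writes straight to the parent's stderr and the attribute is `None`. Capturing means the pandoc output no longer reaches the console, so logging it explicitly may be worth considering.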
diff --git a/forms.py b/forms.py
@@ -1,5 +1,58 @@
 from django import forms
 
+from tinymce.widgets import TinyMCE
+
+from utils import setting_handler, models
+from plugins.pandoc_plugin import plugin_settings
+
+
 class PandocAdminForm(forms.Form):
-    pandoc_enabled = forms.BooleanField(required=False)
-    pandoc_extract_images = forms.BooleanField(required=False)
+    pandoc_enabled = forms.BooleanField(label="Enable Pandoc", required=False)
+    pandoc_extract_images = forms.BooleanField(label="Extract Images", required=False)
+    cover_sheet = forms.CharField(
+        label="Cover Sheet",
+        widget=TinyMCE(attrs={'cols': 80, 'rows': 30}),
+        required=False
+    )
+    def __init__(self, *args, journal=None, **kwargs):
+        """Initialize the form fields from the journal's current plugin settings."""
+        super().__init__(*args, **kwargs)
+        self.journal = journal
+        self.plugin = models.Plugin.objects.get(
+            name=plugin_settings.SHORT_NAME,
+        )
+
+        # Set each field's initial value from its setting (created if missing)
+        pandoc_enabled_setting = setting_handler.get_plugin_setting(
+            self.plugin, 'pandoc_enabled', self.journal, create=True,
+            pretty='Enable Pandoc', types='boolean'
+        )
+        self.fields[
+            'pandoc_enabled'
+        ].initial = pandoc_enabled_setting.processed_value
+
+        extract_images_setting = setting_handler.get_plugin_setting(
+            self.plugin, 'pandoc_extract_images', self.journal, create=True,
+            pretty='Pandoc extract images', types='boolean'
+        )
+        self.fields[
+            'pandoc_extract_images'
+        ].initial = extract_images_setting.processed_value
+
+        cover_sheet_setting = setting_handler.get_plugin_setting(
+            self.plugin, 'cover_sheet', self.journal, create=True,
+            pretty='Cover Sheet', types='text'
+        )
+        self.fields[
+            'cover_sheet'
+        ].initial = cover_sheet_setting.processed_value
+
+    def save(self):
+        """Save each setting in the cleaned data to the plugin settings."""
+        for setting_name, setting_value in self.cleaned_data.items():
+            setting_handler.save_plugin_setting(
+                plugin=self.plugin,
+                setting_name=setting_name,
+                value=setting_value,
+                journal=self.journal
+            )