143 changes: 125 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,28 +1,135 @@
# Pandoc Plugin

This is a plugin for [Janeway](https://github.com/BirkbeckCTP/janeway) that provides a button for typesetters to automatically generate html files from user article submissions in docx/rtf. These files are first converted to markdown, and from there to html, and then registered as galleys of the original article.
# Pandoc Plugin for Janeway

## How to install:
This plugin integrates [Pandoc](https://pandoc.org/) into the Janeway platform, enabling document conversion for articles. It supports generating HTML and PDF files from manuscripts and adding optional stamped coversheets.

(You can find plugin installation instructions in the README for the back_content plugin [here](https://github.com/BirkbeckCTP/back_content))
## Features

1. SSH into the server and navigate to: /path/to/janeway/src/plugins
2. Use git to clone the plugin's repository here. For example: `git clone https://github.com/hackman104/pandoc_plugin.git`
3. Make sure you have activated janeway's virtual environment
4. Return to /path/to/janeway/src and run `python manage.py install_plugins`
5. Restart apache (command will depend on your distro)
6. Go to your journal webpage, go to the manager, and click "Plugins" at the bottom of the side-bar on the left
7. Find the plugin you are working on, click its link, and then enable it and click submit
- Converts Word documents (`.docx`) and RTF manuscripts into Markdown, HTML, and optionally PDF format.
- Automatically registers the generated HTML as galleys for the original article.
- Optional inclusion of custom coversheets for PDF generation.

### pandoc
## How to Install

*N. B. You must have pandoc installed on your server to use this plugin. Please see pandoc's installation documentation __[here](https://pandoc.org/installing.html)__.*
1. SSH into the server and navigate to the plugins directory:
```bash
cd /path/to/janeway/src/plugins
```

Most of the package managers for Linux distributions offer older versions of Pandoc, and you need at least 1.13 for full docx support. Luckily, pandoc offers a compiled distribution in .deb format:
2. Use git to clone the plugin repository:
```bash
git clone https://github.com/openlibhums/pandoc_plugin.git
```

3. Activate Janeway's virtual environment:
```bash
source /path/to/janeway/venv/bin/activate
```

4. Return to the Janeway source directory and install the plugin:
```bash
cd /path/to/janeway/src
pip3 install -r plugins/pandoc_plugin/requirements.txt
python manage.py install_plugins pandoc_plugin
```

5. Restart your webserver to apply the changes (command depends on your distro).

6. Log in to your journal's admin interface:
- Go to the "Manager" section.
- Click "Plugins" at the bottom of the left-hand sidebar.
- Locate the Pandoc Plugin, enable it, and click submit.

### Installing Pandoc and XeLaTeX

#### Pandoc

You must have Pandoc installed on your server to use this plugin. Most Linux distributions include older versions of Pandoc, but at least version 1.13 is required for full `.docx` support.

To install a newer version of Pandoc:

```bash
wget 'https://github.com/jgm/pandoc/releases/download/2.19.2/pandoc-2.19.2-1-amd64.deb'
dpkg -i pandoc-2.19.2-1-amd64.deb
rm pandoc-2.19.2-1-amd64.deb
```
Pandoc should now be available to all users on the server, so the plugin can call it.

Verify that Pandoc is installed and available:
```bash
pandoc --version
```
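
The 1.13 minimum mentioned above can also be checked from Python. The following is a minimal sketch and not part of the plugin: `parse_pandoc_version`, `meets_minimum`, and `MINIMUM_VERSION` are hypothetical names, and the sketch assumes the first line of the `pandoc --version` output contains a dotted version number such as `pandoc 2.19.2`.

```python
import re

# Minimum version cited in this README for full .docx support.
MINIMUM_VERSION = (1, 13)


def parse_pandoc_version(version_output):
    """Return the version found on the first output line as a tuple of ints."""
    lines = version_output.splitlines()
    if not lines:
        raise ValueError("Empty version output")
    match = re.search(r"(\d+(?:\.\d+)+)", lines[0])
    if match is None:
        raise ValueError(f"No version number found in: {lines[0]!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version_output, minimum=MINIMUM_VERSION):
    """Compare the parsed version with the minimum, component by component."""
    return parse_pandoc_version(version_output) >= minimum
```

In practice the text passed in would come from something like `subprocess.run(['pandoc', '--version'], capture_output=True, text=True).stdout`.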

#### XeLaTeX

The plugin uses `xelatex` as the PDF engine. If you encounter the error `xelatex not found`, follow these steps to install `xelatex` on your system.

##### On Ubuntu/Debian:
To install `xelatex` using the TeX Live distribution:
```bash
sudo apt update
sudo apt install texlive-xetex
```

##### On CentOS/RHEL:
```bash
sudo yum install texlive-xetex
```

##### On macOS:
```bash
brew install --cask mactex
```

Verify `xelatex` is installed:
```bash
xelatex --version
```
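
Both executables need to be on the `PATH` of the user that runs Janeway. As a rough sketch (the names `REQUIRED_TOOLS` and `find_missing_tools` are illustrative and do not exist in the plugin), the same check can be done with the standard library's `shutil.which`:

```python
import shutil

# Executables this README says the plugin relies on.
REQUIRED_TOOLS = ("pandoc", "xelatex")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

An empty list means both tools were found. The `which` parameter exists only so the lookup can be replaced in tests.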

## Configuration

### Settings

The plugin provides the following configurable settings:

1. **Coversheet HTML (`cover_sheet`)**:
- Specifies the HTML template used for the stamped coversheet.
- Configured via the plugin manager interface.

2. **Extract Images**:
- Boolean setting to enable extraction of images.
- Configured via the plugin manager interface.


### Updating Settings

1. Go to the Janeway admin panel.
2. Navigate to `Settings > Plugins > Pandoc Plugin`.
3. Configure the above settings as needed.

## Usage

### Generating Files via Management Command

To generate HTML or PDFs for specific articles or files, use the management command:

```bash
python manage.py generate_pdfs <article_id> --owner <email> --conversion_type <stamped|unstamped>
```

- `article_id`: ID of the article to process.
- `--owner`: Email of the account owner for the generated file.
- `--conversion_type`: Either `stamped` or `unstamped`; tells Janeway whether to add a cover sheet to the converted file.

#### Example Command:

```bash
python manage.py generate_pdfs 123 456 --owner=editor@example.com
```
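
When the command is driven from a script, it can help to assemble and validate the arguments first. This is a hedged sketch only: `build_generate_pdfs_command` and `CONVERSION_TYPES` are hypothetical, and the sketch assumes the flags behave as documented above and that `stamped` and `unstamped` are the only accepted conversion types.

```python
# Assumed set of valid values, taken from the usage line in this README.
CONVERSION_TYPES = ("stamped", "unstamped")


def build_generate_pdfs_command(article_id, owner_email, conversion_type):
    """Build the argument list for the generate_pdfs management command."""
    if conversion_type not in CONVERSION_TYPES:
        raise ValueError(
            f"conversion_type must be one of {CONVERSION_TYPES}, "
            f"got {conversion_type!r}"
        )
    return [
        "python", "manage.py", "generate_pdfs", str(article_id),
        "--owner", owner_email,
        "--conversion_type", conversion_type,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` with the working directory set to the Janeway `src` directory.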

### File Generation in the User Interface

- Editors can trigger HTML and PDF generation for individual files in typesetting from the Janeway interface by clicking the "Options" link.

## License

This plugin is licensed under the [GNU Affero General Public License](https://www.gnu.org/licenses/agpl-3.0.en.html) (AGPLv3).
143 changes: 140 additions & 3 deletions convert.py
@@ -3,11 +3,13 @@
import subprocess
import tempfile

from django.conf import settings

from bs4 import BeautifulSoup
from PyPDF2 import PdfMerger

from core.files import IMAGE_MIMETYPES
from utils import models, setting_handler
from utils.logger import get_logger

from plugins.pandoc_plugin import plugin_settings

logger = get_logger(__name__)
@@ -72,8 +74,143 @@ def generate_html_from_doc(doc_path, extract_images=False):
return str(pandoc_soup), image_paths


def generate_pdf_with_cover_sheet(doc_path, mime_type, cover_sheet_html):
    if mime_type not in plugin_settings.PDF_CONVERSION_SUPPORTED_MIME_TYPES:
        raise TypeError(f"File MIME type {mime_type} not supported")

    images_temp_path = tempfile.mkdtemp()
    cover_pdf_path = os.path.join(tempfile.gettempdir(), 'cover_sheet.pdf')
    document_pdf_path = os.path.join(tempfile.gettempdir(),
                                     f'{os.path.basename(doc_path)}_document.pdf')
    merged_pdf_path = os.path.join(tempfile.gettempdir(),
                                   f'{os.path.basename(doc_path)}_merged.pdf')

    # Render the cover sheet HTML (passed on stdin) to its own PDF.
    try:
        subprocess.run(
            [
                'pandoc',
                '-o',
                cover_pdf_path,
                '--from=html',
                '--to=pdf',
                '--pdf-engine=xelatex',
            ],
            input=cover_sheet_html.encode(),
            check=True,
        )
    except subprocess.CalledProcessError as err:
        raise PandocError(
            f"Error during cover sheet PDF conversion: {str(err)}")

    pandoc_command = (
        PANDOC_CMD
        + MEMORY_LIMIT_ARG
        + [EXTRACT_MEDIA, images_temp_path]
        + [doc_path, '-o', document_pdf_path, '--pdf-engine=xelatex']
    )

    try:
        logger.info(f"[PANDOC] Running command: {pandoc_command}")
        # Capture stderr; without this, e.stderr below would always be None.
        subprocess.run(
            pandoc_command,
            check=True,
            stderr=subprocess.PIPE,
            text=True,
        )
    except subprocess.CalledProcessError as e:
        raise PandocError(f"PandocError: {e.stderr}")

    # Collect any images pandoc extracted while converting the document.
    image_paths = [
        os.path.join(base, f)
        for base, _, files in os.walk(images_temp_path)
        for f in files
        if mimetypes.guess_type(f)[0] in IMAGE_MIMETYPES
    ]

    # Cover sheet first, then the converted document.
    merger = PdfMerger()
    try:
        merger.append(cover_pdf_path)
        merger.append(document_pdf_path)
        with open(merged_pdf_path, 'wb') as merged_file:
            merger.write(merged_file)
    finally:
        merger.close()

    return merged_pdf_path, image_paths


def convert_word_to_pdf(doc_path, output_filename):
    """Convert Word doc (docx, rtf, odt, etc.) to PDF using Pandoc with MIME type validation."""
    mime_type, _ = mimetypes.guess_type(doc_path)

    if mime_type not in plugin_settings.PDF_CONVERSION_SUPPORTED_MIME_TYPES:
        raise TypeError(f"Unsupported file type for conversion: {mime_type}")

    output_pdf_path = os.path.join(settings.BASE_DIR, 'files', 'temp',
                                   output_filename)

    pandoc_command = [
        'pandoc',
        doc_path,
        '-o',
        output_pdf_path,
        '--pdf-engine=xelatex',
        '-V',
        'geometry:margin=1.5in',
    ]

    try:
        subprocess.run(pandoc_command, check=True)
    except subprocess.CalledProcessError as e:
        raise Exception(f"Error converting Word document to PDF: {e}")

    return output_pdf_path
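

The MIME check in `convert_word_to_pdf` relies on `mimetypes.guess_type`, which infers the type from the file extension and returns `None` for extensions it does not know. The sketch below illustrates that behaviour in isolation. `SUPPORTED_MIME_TYPES` is a hypothetical stand-in for `plugin_settings.PDF_CONVERSION_SUPPORTED_MIME_TYPES` (whose real contents are not shown in this diff), and `check_supported` is an illustrative name.

```python
import mimetypes

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
RTF = "application/rtf"

# Hypothetical stand-in for the plugin's supported MIME type setting.
SUPPORTED_MIME_TYPES = {DOCX, RTF}

# Register the extensions explicitly so the result does not depend on the
# host system's mime.types file.
mimetypes.add_type(DOCX, ".docx")
mimetypes.add_type(RTF, ".rtf")


def check_supported(doc_path, supported=SUPPORTED_MIME_TYPES):
    """Return the guessed MIME type, or raise TypeError if it is unsupported."""
    mime_type, _ = mimetypes.guess_type(doc_path)
    # An unknown extension gives mime_type None, which also fails this test.
    if mime_type not in supported:
        raise TypeError(f"Unsupported file type for conversion: {mime_type}")
    return mime_type
```

Because the guess is extension-based, a file with a misleading extension would pass this check and only fail later, inside Pandoc.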


def convert_html_to_pdf(html_content, output_filename):
    """Convert HTML content to PDF using Pandoc."""
    output_pdf_path = os.path.join(
        settings.BASE_DIR,
        'files',
        'temp',
        output_filename,
    )

    try:
        subprocess.run(
            [
                'pandoc',
                '--from=html',
                '--to=pdf',
                '-o',
                output_pdf_path,
                '--pdf-engine=xelatex',
                "-V",
                "pagestyle=empty",
                "-V",
                "geometry:margin=1.5in",
            ],
            input=html_content.encode(),
            check=True,
        )
    except subprocess.CalledProcessError as e:
        raise Exception(f"Error converting HTML to PDF: {e}")

    return output_pdf_path


def merge_pdfs(cover_pdf, doc_pdf, output_filename):
    """Merge two PDFs (cover sheet and document) into one."""
    output_pdf_path = os.path.join(settings.BASE_DIR, 'files', 'temp', output_filename)

    merger = PdfMerger()

    try:
        merger.append(cover_pdf)
        merger.append(doc_pdf)
        with open(output_pdf_path, 'wb') as merged_pdf:
            merger.write(merged_pdf)
    except Exception as e:
        raise Exception(f"Error merging PDFs: {e}")
    finally:
        merger.close()

    return output_pdf_path
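

One way the three helpers above might be combined into a stamped PDF is sketched below. The call site in the PR is not shown in this diff, so `build_stamped_pdf`, the `stem` parameter, and the file-name scheme are assumptions. The helpers are passed in as callables so the flow can be exercised without Pandoc or PyPDF2 installed.

```python
def build_stamped_pdf(doc_path, cover_html, html_to_pdf, word_to_pdf, merge,
                      stem="article"):
    """Render the cover sheet, convert the document, then merge the two.

    html_to_pdf, word_to_pdf and merge are expected to have the same
    signatures as convert_html_to_pdf, convert_word_to_pdf and merge_pdfs.
    """
    cover_pdf = html_to_pdf(cover_html, f"{stem}_cover.pdf")
    doc_pdf = word_to_pdf(doc_path, f"{stem}_document.pdf")
    # The cover sheet is passed first so it becomes the opening page(s).
    return merge(cover_pdf, doc_pdf, f"{stem}_stamped.pdf")
```

In the plugin itself this would presumably be called as `build_stamped_pdf(path, html, convert_html_to_pdf, convert_word_to_pdf, merge_pdfs)`.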


class PandocError(Exception):
    def __init__(self, msg, cmd=None):
        # Pass only the message to the base class so str(error) is the message.
        super().__init__(msg)
        self.cmd = cmd
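

The optional `cmd` attribute lets calling code report which command failed. A minimal, self-contained sketch of that idea follows. It uses a standalone copy of the exception that passes only the message to `Exception.__init__`, and `describe_failure` is a hypothetical helper, not something defined in the plugin.

```python
class PandocError(Exception):
    """Standalone copy of the exception, for illustration only."""

    def __init__(self, msg, cmd=None):
        super().__init__(msg)
        self.cmd = cmd


def describe_failure(error):
    """Build a log-friendly description, including the command if known."""
    if error.cmd:
        return f"{error} (command: {' '.join(error.cmd)})"
    return str(error)
```

A caller could wrap a conversion in `try`/`except PandocError as error` and pass `describe_failure(error)` to the logger.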

57 changes: 55 additions & 2 deletions forms.py
@@ -1,5 +1,58 @@
from django import forms

from tinymce.widgets import TinyMCE

from utils import setting_handler, models
from plugins.pandoc_plugin import plugin_settings


class PandocAdminForm(forms.Form):
    pandoc_enabled = forms.BooleanField(required=False)
    pandoc_extract_images = forms.BooleanField(required=False)
    pandoc_enabled = forms.BooleanField(label="Enable Pandoc", required=False)
    pandoc_extract_images = forms.BooleanField(label="Extract Images", required=False)
    cover_sheet = forms.CharField(
        label="Cover Sheet",
        widget=TinyMCE(attrs={'cols': 80, 'rows': 30}),
        required=False
    )

    def __init__(self, *args, journal=None, **kwargs):
        """Initialize form with current plugin settings and apply help text."""
        super().__init__(*args, **kwargs)
        self.journal = journal
        self.plugin = models.Plugin.objects.get(
            name=plugin_settings.SHORT_NAME,
        )

        # Initialize fields with settings values and help texts
        pandoc_enabled_setting = setting_handler.get_plugin_setting(
            self.plugin, 'pandoc_enabled', self.journal, create=True,
            pretty='Enable Pandoc', types='boolean'
        )
        self.fields[
            'pandoc_enabled'
        ].initial = pandoc_enabled_setting.processed_value

        extract_images_setting = setting_handler.get_plugin_setting(
            self.plugin, 'pandoc_extract_images', self.journal, create=True,
            pretty='Pandoc extract images', types='boolean'
        )
        self.fields[
            'pandoc_extract_images'
        ].initial = extract_images_setting.processed_value

        cover_sheet_setting = setting_handler.get_plugin_setting(
            self.plugin, 'cover_sheet', self.journal, create=True,
            pretty='Cover Sheet', types='text'
        )
        self.fields[
            'cover_sheet'
        ].initial = cover_sheet_setting.processed_value

    def save(self):
        """Save each setting in the cleaned data to the plugin settings."""
        for setting_name, setting_value in self.cleaned_data.items():
            setting_handler.save_plugin_setting(
                plugin=self.plugin,
                setting_name=setting_name,
                value=setting_value,
                journal=self.journal
            )