Modify an existing PDF based on content #1627
cognospaul
started this conversation in
Show and tell
-
Hey! That looks very interesting. We'd love to add this feature to pdf-lib. We're currently maintaining the most advanced fork of the lib at cantoo/pdf-lib. Would you agree to open a PR there with your work? We would need to define an API to interact with the existing content, but just start with something simple and we'll build the API as it goes. pdf-lib already depends on pako, so there is no problem if you use it.
-
I've been tasked with doing a project to modify an existing PDF based on content. There are a ton of posts with people asking how to extract data, only to be told that it's not currently supported. Well, I always believe that anything is possible, given enough time. So I figured it out.
To begin with, I'm using this against PDFs generated in Cognos. Your mileage may vary with PDFs generated by other sources. In addition to pdf-lib, extracting content needs a library like pako to decompress the data.
I'm going to skip various parts of my code to focus on actually extracting the content.
This is the standard bit.
let { PDFDocument, PDFName, PDFRawStream, decodePDFRawStream, arrayAsString, rgb, StandardFonts, degrees, setFontAndSize } = pdflib;
Getting the pdf from Cognos and converting it into an ArrayBuffer
Getting the document, form, and pages from the document.
let pdfDoc = await PDFDocument.load(pdfBlob), pdfForm = pdfDoc.getForm(), pages = pdfDoc.getPages(), page = pages[0];
Now we can actually start extracting the content for each page. For this I'm just pulling the first page.
We can extract the content node from the dict map using PDFName["Contents"]
Once we get the content node, we can pull it from the indirectObjects list. Now we need to use decodePDFRawStream to get the data.
The data returned isn't usable yet.
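It's actually compressed. Strictly speaking, PDF streams usually use the Flate (zlib/deflate) filter rather than gzip, but as far as I can tell pako.ungzip detects the format from the header, so in my solution I'm using pako to decompress it.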
string = new TextDecoder().decode(pako.ungzip(io.contents))
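To make the decompression step concrete without a real PDF, here is a minimal round-trip sketch. It assumes the stream uses the Flate filter, and it swaps in Node's built-in zlib module for pako so it runs with no dependencies. The sample operators and the inflateToString helper are made up for illustration and are not taken from a Cognos file.

```javascript
// Sketch only: Node's built-in zlib stands in for pako here.
const zlib = require("zlib");

// Hypothetical content stream: yellow fill, one rectangle, fill it.
const sampleOperators = "q\n1 1 0 rg\n72 700 120 18 re\nf\nQ\n";

// Simulate what a PDF writer does with the Flate filter:
// the stored bytes are zlib-wrapped deflate data.
const compressed = zlib.deflateSync(Buffer.from(sampleOperators, "latin1"));

// Reverse step: inflate the bytes, then turn them into a string.
// (Hypothetical helper, playing the role of pako.ungzip + TextDecoder.)
function inflateToString(bytes) {
  return zlib.inflateSync(bytes).toString("latin1");
}

const decoded = inflateToString(compressed);

// One operator per line, ready to be inspected.
console.log(decoded.split("\n"));
```

With pako in the browser, the equivalent of inflateToString would be the TextDecoder line above.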
Split out, it looks like this:
Now we actually have data we can work with.
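As a rough idea of what "working with the data" can mean, the decoded string can be grouped into operators with their operands. This is a simplified sketch, not a full PDF tokenizer: splitOperations is a hypothetical helper, and it assumes the stream has no string literals with spaces, no inline images, and no comments.

```javascript
// Split a decoded content stream into { operator, operands } records.
// Simplified: splits on whitespace only, so string literals containing
// spaces, inline images, and comments are NOT handled.
function splitOperations(content) {
  const operations = [];
  let operands = [];
  for (const token of content.split(/\s+/).filter(Boolean)) {
    const isNumber = /^[+-]?(\d+\.?\d*|\.\d+)$/.test(token);
    const isName = token.startsWith("/");
    if (isNumber || isName) {
      // Numbers and names (such as /F1) are collected as operands.
      operands.push(token);
    } else {
      // Anything else is treated as an operator that consumes the operands.
      operations.push({ operator: token, operands });
      operands = [];
    }
  }
  return operations;
}

// Hypothetical sample stream, not taken from a real Cognos PDF.
const sample = "q 1 1 0 rg 72 700 120 18 re f Q";
const operations = splitOperations(sample);
console.log(operations);
```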
This looks for any instance of a yellow rectangle and adds a field on top of it.
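The search for yellow rectangles could be sketched roughly like this. It is not my exact code: findYellowRectangles is a hypothetical helper, and it assumes yellow is set with the RGB fill operator as 1 1 0 rg, rectangles are drawn with re, and the rectangle is actually filled. Any other colour space is treated as "not yellow" here.

```javascript
// Collect rectangles that are filled while the fill colour is pure yellow.
// Assumptions: colour set via `1 1 0 rg`, rectangles drawn with `re`,
// and no string literals with embedded spaces in the stream.
function findYellowRectangles(content) {
  const rectangles = [];
  const savedStates = [];   // fill state saved by q, restored by Q
  let fillIsYellow = false;
  let operands = [];
  let pending = [];         // rectangles in the current path, not yet painted

  for (const token of content.split(/\s+/).filter(Boolean)) {
    const value = Number(token);
    if (!Number.isNaN(value)) {
      operands.push(value);
      continue;
    }
    switch (token) {
      case "q":
        savedStates.push(fillIsYellow);
        break;
      case "Q":
        if (savedStates.length) fillIsYellow = savedStates.pop();
        break;
      case "rg": {
        const [r, g, b] = operands.slice(-3);
        fillIsYellow = r === 1 && g === 1 && b === 0;
        break;
      }
      case "g":
      case "k":
        // Gray and CMYK fills are treated as not yellow in this sketch.
        fillIsYellow = false;
        break;
      case "re": {
        const [x, y, width, height] = operands.slice(-4);
        pending.push({ x, y, width, height });
        break;
      }
      case "f": case "F": case "f*":
      case "B": case "B*": case "b": case "b*":
        // Fill-type paint operators: keep the path's rectangles if yellow.
        if (fillIsYellow) rectangles.push(...pending);
        pending = [];
        break;
      case "S": case "s": case "n":
        // Stroke-only or no-op paint: the path was not filled.
        pending = [];
        break;
    }
    operands = [];
  }
  return rectangles;
}

// Hypothetical stream: a filled yellow box, a filled black box,
// and a yellow-fill box that is only stroked.
const stream =
  "q 1 1 0 rg 72 700 120 18 re f Q " +
  "q 0 0 0 rg 72 650 120 18 re f Q " +
  "1 1 0 rg 300 500 50 20 re S";
const yellow = findYellowRectangles(stream);
console.log(yellow);
```

Each result has x, y, width and height, which is the shape pdf-lib's addToPage options take for a form field. The numbers are in whatever coordinate space is active at that point in the stream, so if the PDF applies a cm transform first they would need converting.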
Before:
After:
I'm pretty sure it should be possible to use this technique to replace data in the PDF as well, but I haven't tested that yet.