Modify an existing PDF based on content #1627

cognospaul · 2024-05-08T14:39:16Z

I've been tasked with doing a project to modify an existing PDF based on content. There are a ton of posts with people asking how to extract data, only to be told that it's not currently supported. Well, I always believe that anything is possible, given enough time. So I figured it out.

To begin with, I'm using this against PDFs generated in Cognos. Your mileage may vary with PDFs generated by other sources. In addition to pdf-lib, extracting content requires a library like Pako to decompress the data.

I'm going to skip various parts of my code to focus on actually extracting the content.

This is the standard bit.
let { PDFDocument, PDFName, PDFRawStream, decodePDFRawStream, arrayAsString, rgb, StandardFonts, degrees, setFontAndSize } = pdflib;

Getting the pdf from Cognos and converting it into an ArrayBuffer

let blob = await getPDF(`${__glassAppController.glassContext.gateway}/v1/disp${content}`)
  , pdfBlob = await blob.arrayBuffer()

Getting the document, form, and pages from the document.
let pdfDoc = await PDFDocument.load(pdfBlob)
  , pdfForm = pdfDoc.getForm()
  , pages = pdfDoc.getPages()
  , page = pages[0]

Now we can actually start extracting the content for each page. For this I'm just pulling the first page.

 let contentRef = page.node.dict.get(PDFName["Contents"])
  , io = pdfDoc.context.indirectObjects.get(contentRef)
  , decodeIo = decodePDFRawStream(io)

We can extract the content node from the dict map using PDFName["Contents"].
Once we have the content node's reference, we can pull the object itself from the indirectObjects list. Now we need to use decodePDFRawStream to get the data.
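One caveat with this step: a page's /Contents entry may be a single stream or an array of streams, so a lookup that assumes one reference can miss content in PDFs from other generators. Below is a minimal sketch of handling both cases. `resolveContentStreams` is a hypothetical helper, not part of pdf-lib; it assumes `context.lookup` dereferences a `PDFRef` and hands back any other object unchanged, and that array objects expose `asArray()`.

```javascript
// Hypothetical helper (not part of pdf-lib): return every content stream of a page.
// `context` is assumed to be pdfDoc.context; `contents` is the raw /Contents value,
// e.g. page.node.dict.get(PDFName.Contents).
function resolveContentStreams(context, contents) {
  // Dereference an indirect reference; direct objects are assumed to pass through.
  const resolved = context.lookup(contents);
  if (!resolved) return [];
  // An array object exposes asArray(); a single stream does not.
  const items = typeof resolved.asArray === 'function' ? resolved.asArray() : [resolved];
  // Array entries are usually references themselves, so resolve each one.
  return items.map((item) => context.lookup(item)).filter(Boolean);
}
```

When several streams come back, the PDF format treats them as one content stream concatenated in order, so the decoded pieces should be joined before searching them.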

The data returned isn't usable yet.

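It's actually compressed with Deflate (PDF's FlateDecode filter, which wraps the data in zlib framing rather than gzip). In my solution I'm using pako to decompress it; pako.ungzip works here because it detects the wrapper from the header.
let string = new TextDecoder().decode(pako.ungzip(io.contents))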
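Before decompressing, it can help to check what kind of wrapper the raw bytes actually carry, since not every stream is compressed the same way. The sketch below is a hypothetical helper (`sniffCompression` is not part of pdf-lib or pako) that only inspects the first two bytes: gzip data starts with 0x1f 0x8b, while a zlib header has compression method 8 in its low nibble and a two-byte header value divisible by 31.

```javascript
// Hypothetical helper: guess the compression wrapper from the first two bytes.
function sniffCompression(bytes) {
  if (bytes.length < 2) return 'unknown';
  // gzip magic number
  if (bytes[0] === 0x1f && bytes[1] === 0x8b) return 'gzip';
  // zlib: low nibble of the first byte is 8 (deflate) and the
  // 16-bit header value is a multiple of 31
  const header = (bytes[0] << 8) | bytes[1];
  if ((bytes[0] & 0x0f) === 8 && header % 31 === 0) return 'zlib';
  return 'unknown';
}
```

For example, `sniffCompression(io.contents)` could be logged before calling pako; an `'unknown'` result suggests the stream is either uncompressed or uses a different filter.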

Split into lines, it looks like this:

Now we actually have data we can work with.

  let lines = string.split('\n')
  , fieldLines = lines.map((line,int)=>{if(line.includes('1 1 0 rg')) return lines[int+1]}).filter(x=>x)
;

 
fieldLines.forEach((line,int)=>{
  let arr = line.split(' ')
    , field = pdfForm.createTextField('field'+int)
    , x = parseFloat(arr[0])
    , y = parseFloat(arr[1])
    , w = parseFloat(arr[2])
    , h = parseFloat(arr[3])
    , da = field.acroField.getDefaultAppearance() ?? ''
    , newDa = da + '\n' + setFontAndSize('Courier', 8).toString()
  ;
  field.acroField.setDefaultAppearance(newDa);
  field.enableMultiline()
  field.setText('');

  // Using the placement settings from the PDF seems to leave a visible gap. I suspect there might be some scaling going on, but I'm adding this as a band-aid until I can figure it out. Also, in my test PDF the height and y values are negative for some reason.
  x = x<0?x+0.25:x-0.25;
  y = y<0?y+0.5:y-0.5;
  w = w>0?w+0.5:w-0.5;
  h = h>0?h+1:h-1;
  
  field.addToPage(page, { x: x, y: y, width: w, height:h,maxWidth: w - 0.5, wordBreaks: [" "] })
})

downloadBlob(await pdfDoc.save(), "test", "application/pdf");

This is looking for any instance of a rectangle colored yellow, and adds a field on top of it.
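The search for yellow rectangles can be isolated into a small pure function, which makes it easier to test against a decoded content stream without touching the PDF. This is a minimal sketch under a few assumptions: `findFilledRects` and `normalizeRect` are hypothetical names, the rectangle operator (`x y w h re`) sits on the line directly after the fill color as in the code above, and no `cm` transformation is taken into account. A negative width or height in `re` is legal; it means the rectangle extends left or down from the given corner, so the sketch converts it to a lower-left origin with positive size.

```javascript
// Convert a rectangle with negative width/height to a lower-left origin and positive size.
function normalizeRect({ x, y, w, h }) {
  return {
    x: w < 0 ? x + w : x,
    y: h < 0 ? y + h : y,
    w: Math.abs(w),
    h: Math.abs(h),
  };
}

// Find "x y w h re" rectangles on the line directly after a given fill-color operator.
function findFilledRects(content, fillOp = '1 1 0 rg') {
  const num = '([-+]?(?:\\d+\\.?\\d*|\\.\\d+))';
  const rectRe = new RegExp(`^${num}\\s+${num}\\s+${num}\\s+${num}\\s+re\\b`);
  const lines = content.split('\n').map((line) => line.trim());
  const rects = [];
  lines.forEach((line, i) => {
    // Pad with spaces so that "0.1 1 0 rg" does not match "1 1 0 rg".
    if (!` ${line} `.includes(` ${fillOp} `)) return;
    const match = rectRe.exec(lines[i + 1] ?? '');
    if (!match) return;
    const [x, y, w, h] = match.slice(1, 5).map(parseFloat);
    rects.push(normalizeRect({ x, y, w, h }));
  });
  return rects;
}
```

Each returned rectangle could then be passed to `field.addToPage` in place of the hand-adjusted values, although the visible gap mentioned in the code comment may still need its own correction.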

Before:

After

I'm pretty sure it should be possible to use this technique to replace data in the PDF as well, but I haven't tested that yet.

Sharcoux · 2024-05-20T07:57:53Z

Hey! That looks very interesting. We'd love to add this feature to pdf-lib.

We're currently maintaining the most advanced fork of the lib at cantoo/pdf-lib. Would you agree to open a PR there with your work?

We would need to define an API to interact with the existing content, but just start with something simple and we'll build the API as it goes.

pdf-lib already depends on pako, so there is no problem if you use it.

1 reply

alistair0adams Sep 4, 2024

I'm assuming this didn't go anywhere?

A few days ago I knew nothing about PDF formats. Now I've spent more time in the 1000 page spec, the 36,88 lines of pdf-lib JavaScript code and some pdfreader Python code than I really wanted!

I want to extract strings from a Form XObject. Got it working very quickly with the Python pdfreader but I needed JavaScript.

Here's the Python code, quite straightforward:

import pdfreader
from pdfreader import PDFDocument, SimplePDFViewer
pdf_path = '<path to pdf>'
fd = open(pdf_path, 'rb')
viewer = SimplePDFViewer(fd)
viewer.render()

for canvas in viewer:
    page_forms = canvas.forms
    for key in page_forms:
        print (f'keys: {key}: {page_forms[key].strings}')

Naive me thought I could do the same with pdf-lib, but I learned the hard way that it supports Interactive Forms but not Form XObjects, and these are very different things.

With the hint from this post, and https://pdfcrowd.com/, I managed to extract the stream string with this code:

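  let pdfDoc
  try {
    pdfDoc = await PDFLib.PDFDocument.load(arrayBuffer, { updateMetadata: true, password: "" })
  }
  catch (error) {
    console.error(`Couldn't load PDF: ${error}`);
    throw error; // nothing below can work without a document
  }

  const page      = pdfDoc.getPage(0)
  const resources = page.node.Resources()
  const xobject   = resources.dict.get(PDFLib.PDFName["XObject"])
  const xi2       = xobject.dict.get(PDFLib.PDFName.of("Xi2")) // "Xi2" is the Form XObject's name in this particular PDF
  const stream    = pdfDoc.context.indirectObjects.get(xi2)
  // decode() returns the decoded bytes; arrayAsString turns them into a string, one character per byte
  const string    = PDFLib.arrayAsString(PDFLib.decodePDFRawStream(stream).decode())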

Then I got this string

q Q /Tx BMC q 1 1 104.2 15.4 re W n q BT 1 0 0 1 2 4.12 Tm /F1 12.22 Tf 0 g (\x00\x07\x00$\x00/\x005\x00\x01\x00\b\x00 \x00-\x00\)\x00$\x00/\x00$\x00.)Tj 0 g ET Q Q EMC

That made no sense until I read Annex A. It mentioned PostScript, which I learned almost 40 years ago, so finally something familiar.

Am I right in thinking there's nothing in pdf-lib to decode this?
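Independently of pdf-lib, the string operands of `Tj` in a stream like the one above can be pulled out by hand. The following is a minimal sketch, not a full content-stream parser: `extractTjOperands` is a hypothetical name, the input is assumed to be the decoded stream as a one-character-per-byte string, and only literal strings in parentheses followed by `Tj` are handled (no hex strings, no `TJ` arrays, no comments). It resolves the backslash escapes and octal codes that PDF literal strings allow and returns the raw bytes, which still need the font's character mapping to become readable text.

```javascript
const SIMPLE_ESCAPES = { n: 10, r: 13, t: 9, b: 8, f: 12, '(': 40, ')': 41, '\\': 92 };

// Return the raw bytes of every literal string that is followed by the Tj operator.
function extractTjOperands(content) {
  const operands = [];
  let i = 0;
  while (i < content.length) {
    if (content[i] !== '(') { i++; continue; }
    const bytes = [];
    let depth = 1;
    i++; // step past the opening parenthesis
    while (i < content.length && depth > 0) {
      const ch = content[i];
      if (ch === '\\') {
        const next = content[i + 1];
        const octal = /^[0-7]{1,3}/.exec(content.slice(i + 1, i + 4));
        if (SIMPLE_ESCAPES[next] !== undefined) {
          bytes.push(SIMPLE_ESCAPES[next]);
          i += 2;
        } else if (octal) {
          bytes.push(parseInt(octal[0], 8) & 0xff);
          i += 1 + octal[0].length;
        } else {
          i += 1; // unknown escape: the backslash itself is ignored
        }
      } else {
        // Unescaped parentheses may nest inside a literal string.
        if (ch === '(') depth++;
        if (ch === ')') depth--;
        if (depth > 0) bytes.push(ch.charCodeAt(0) & 0xff);
        i++;
      }
    }
    if (/^\s*Tj\b/.test(content.slice(i))) operands.push(Uint8Array.from(bytes));
  }
  return operands;
}
```

In the stream shown above, every other byte of the operand is zero, which suggests two-byte character codes rather than plain ASCII.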

[Edit update]
I implemented a crude CMap decoder and now have the strings from the Form XObjects.
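For readers wondering what such a decoder involves, here is a minimal sketch of one possible approach; it is not the code used above. It assumes the font has a ToUnicode CMap whose text has already been extracted, that character codes are two bytes wide, and that each `bfrange` destination is a single hex value for one UTF-16 code unit (the array form of `bfrange` is skipped). `parseToUnicodeCMap` and `decodeTwoByteCodes` are hypothetical names, and the sample CMap in the usage note is made up for illustration.

```javascript
// Build a Map from character code to text using the bfchar and bfrange sections.
function parseToUnicodeCMap(cmapText) {
  const map = new Map();
  // Destination values are UTF-16BE code units written as hex digits.
  const hexToText = (hex) => {
    let out = '';
    for (let i = 0; i < hex.length; i += 4) {
      out += String.fromCharCode(parseInt(hex.slice(i, i + 4), 16));
    }
    return out;
  };
  for (const section of cmapText.matchAll(/beginbfchar([\s\S]*?)endbfchar/g)) {
    for (const m of section[1].matchAll(/<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g)) {
      map.set(parseInt(m[1], 16), hexToText(m[2]));
    }
  }
  for (const section of cmapText.matchAll(/beginbfrange([\s\S]*?)endbfrange/g)) {
    const triple = /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g;
    for (const m of section[1].matchAll(triple)) {
      const lo = parseInt(m[1], 16);
      const hi = parseInt(m[2], 16);
      const start = parseInt(m[3], 16);
      // Consecutive codes map to consecutive code units.
      for (let code = lo; code <= hi; code++) {
        map.set(code, String.fromCharCode(start + (code - lo)));
      }
    }
  }
  return map;
}

// Decode big-endian two-byte codes; unmapped codes become U+FFFD.
function decodeTwoByteCodes(bytes, map) {
  let out = '';
  for (let i = 0; i + 1 < bytes.length; i += 2) {
    const code = (bytes[i] << 8) | bytes[i + 1];
    out += map.get(code) ?? '\uFFFD';
  }
  return out;
}
```

As a usage example, a made-up CMap containing `<0007> <0044>` in a `bfchar` section and `<0001> <0003> <0041>` in a `bfrange` section maps code 7 to "D" and codes 1 to 3 to "A", "B" and "C"; the bytes of a `Tj` operand can then be passed to `decodeTwoByteCodes` together with the resulting map.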

BTW, I'm a crusty old sw guy who's more at home writing interrupt service routines in C for real-time embedded systems.
