A quick and dirty approach to redacting PDFs on Debian 11 Bullseye

2022-02-09

Warning: this is a “quick and dirty” approach. It is completely the wrong thing to use if you do not trust the PDF that you are redacting, because the process opens the PDF in order to convert it. In my testing, it does remove the redacted text, but that testing was limited, and I am not the NSA.

Occasionally, I want to redact text from a document, typically a PDF, before I share it with someone.

On macOS, there is the excellent PDFPen Pro. But I have not found a simple solution for Linux. So I came up with this.

Please see the warning above: it appears to work, but I am making no promises that it will work suitably for your use case. You might prefer dangerzone.

Xournal++ to block out the bits I want to redact

I use Xournal++ a lot, because it’s a great tool for marking up PDFs on the touchscreen of my Microsoft Surface computer. I can sit comfortably, and, a bit like marking up a printed document, I can scribble on it to my heart’s content.

For this scenario, I use the “Draw Rectangle” tool to block out the text I want to redact:

Xournal++’s draw rectangle tool

By default, this just adds an outline, so you need to select the “Fill” option (Tools / Pen Options / Fill):

Xournal++’s Pen Options menu

Then increase the opacity of the fill (Tools / Pen Options / Fill Opacity) to 100%:

Xournal++’s fill opacity menu

And finally select “Black” as your colour.

When you’ve done that, you can select the bits of the document you want to block out.

Screenshot of text with the middle line blocked out in black

No, it’s not as convenient as being able to highlight text and redact it, or to do search-based redaction (e.g. redacting all instances of someone’s name).

When you’re done, export the PDF (File / Export as PDF).

Warning: you have not redacted the PDF

At this stage, you do not have a redacted PDF. You have a PDF with a black box over bits of the text. The underlying text is still there.

If you want to test this, just open the blocked PDF, and select all the text. The text under the blocking is still visible / selectable:

Screenshot showing the text selected on the blocked-out PDF, showing you can see the underlying text

Redacting the blocked text using `ghostscript`, `imagemagick`, and `ocrmypdf`

I use the following quick and dirty script to redact the blocked text, and then to OCR the resulting file. (You could skip this last bit, if making the remaining text of the PDF searchable is of no benefit to you.)

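#!/bin/bash

# Usage: run from the directory containing the PDF, passing its file name.
INPUTPDF="$1"
REDACTEDPDF="Redacted_$INPUTPDF"
OCRPDF="OCR_$REDACTEDPDF"

# Rasterise the PDF to a black-and-white (G4) TIFF, then turn that image back into a PDF.
gs -o /tmp/RedactedPDFConvertedToTif.tif -sDEVICE=tiffg4 "$INPUTPDF" && convert /tmp/RedactedPDFConvertedToTif.tif "/tmp/$REDACTEDPDF"

# OCR the image-only PDF so that the remaining text is searchable.
ocrmypdf "/tmp/$REDACTEDPDF" "$OCRPDF"

# Remove the intermediate files.
rm /tmp/RedactedPDFConvertedToTif.tif
rm "/tmp/$REDACTEDPDF"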

Why “quick and dirty”?

It does not check the PDF is safe
It does not deal with complex or unusual PDF names
It does not check for naming collisions
It does not use full paths for the executables
It does not handle errors or show error messages (aside from those coming from ghostscript, imagemagick, ocrmypdf, or rm themselves)

But, assuming you are comfortable with those constraints, what it does is:

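uses ghostscript to convert the blocked PDF into the .tif image format, which flattens each page into pixels, so the text under the black boxes no longer exists as text
uses imagemagick’s convert command to turn the .tif image back into a PDF
uses ocrmypdf to OCR the newly-created PDF, so that the text which is still visible becomes searchable again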
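
For anyone less comfortable with those constraints, the same three steps can be wrapped with a few guard rails. The following is only a sketch, not a hardened tool: the function names `output_name` and `redact_pdf` and the scratch file names are made up for illustration, and it still does nothing to check that the PDF is safe. It uses the base name of the input (so a path with directories or spaces works), refuses to overwrite an existing output file, and works in a private temporary directory that is removed when it finishes.

```bash
#!/bin/bash
# Sketch only: the same gs / convert / ocrmypdf steps, with a few guard rails.

# Derive the output name from the input's base name, so that an input such
# as "some dir/My File.pdf" yields "OCR_Redacted_My File.pdf" in the
# current directory. (Hypothetical helper, not part of the original script.)
output_name() {
    printf 'OCR_Redacted_%s\n' "$(basename -- "$1")"
}

# The body is wrapped in parentheses, so it runs in a subshell: the trap
# and any "exit" affect only this function, not the calling shell.
redact_pdf() (
    input="$1"
    output="$(output_name "$input")"

    # Avoid naming collisions: refuse to overwrite an existing file.
    if [ -e "$output" ]; then
        echo "Refusing to overwrite existing file: $output" >&2
        exit 1
    fi

    # Private scratch directory instead of fixed file names in /tmp,
    # cleaned up however the subshell exits.
    workdir="$(mktemp -d)" || exit 1
    trap 'rm -rf -- "$workdir"' EXIT

    # Each step runs only if the previous one succeeded.
    gs -o "$workdir/pages.tif" -sDEVICE=tiffg4 "$input" &&
        convert "$workdir/pages.tif" "$workdir/flattened.pdf" &&
        ocrmypdf "$workdir/flattened.pdf" "$output"
)

# Only do any work when a file name was passed on the command line.
if [ "$#" -ge 1 ]; then
    redact_pdf "$1"
fi
```

Saved as, say, `redact.sh` (a hypothetical name), it would be run as `./redact.sh "some dir/My File.pdf"`, and should leave `OCR_Redacted_My File.pdf` in the current directory.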

What you should find is that you get a PDF with the blocked text removed, but the rest of the PDF is searchable:

Screenshot showing the redacted PDF with all text selected, but you cannot see the redacted text

If you open the output file (beginning OCR_Redacted…), and select all the text, and paste it into a text editor, you should not be able to see the blocked out text:

Screenshot showing pasted text, not showing the redacted text
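
The select-all-and-paste check can also be done from the command line. This is a rough sketch which assumes that `pdftotext` (from the poppler-utils package, which is not otherwise used in this post) is installed; the function names `text_contains` and `check_pdf` are made up for illustration. It only looks at the text that `pdftotext` can extract, and a phrase that is broken across lines will not match, so "not found" is a sanity check rather than proof.

```bash
#!/bin/bash
# Sketch only: check whether a phrase survives in a PDF's text layer.

# Succeeds (exit status 0) if the text on standard input contains the
# phrase, ignoring case. -F treats the phrase as a literal string rather
# than a regular expression; -q suppresses grep's own output.
text_contains() {
    grep -qiF -- "$1"
}

# Extract the text with pdftotext ("-" sends it to standard output) and
# report whether the phrase is present.
# Returns 1 if the phrase was found, 2 if the text could not be extracted.
check_pdf() {
    local text
    text="$(pdftotext "$1" -)" || {
        echo "Could not extract text from $1" >&2
        return 2
    }
    if printf '%s\n' "$text" | text_contains "$2"; then
        echo "FOUND in $1: $2"
        return 1
    fi
    echo "Not found in the text of $1: $2"
}

# Only do any work when a PDF and a phrase were passed on the command line.
if [ "$#" -eq 2 ]; then
    check_pdf "$1" "$2"
fi
```

Saved as, say, `check.sh` (a hypothetical name), running `./check.sh blocked.pdf "some phrase"` against the PDF exported from Xournal++ should report that the phrase was found, while running it against the file beginning OCR_Redacted… should report that it was not.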

Errors you might run into:

convert-im6.q16: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/421.

This is down to an imagemagick security policy. The solution described here feels a bit brutal, but it worked:

In /etc/ImageMagick-6/policy.xml, remove the lines:

  <policy domain="coder" rights="none" pattern="PS" />
  <policy domain="coder" rights="none" pattern="PS2" />
  <policy domain="coder" rights="none" pattern="PS3" />
  <policy domain="coder" rights="none" pattern="EPS" />
  <policy domain="coder" rights="none" pattern="PDF" />
  <policy domain="coder" rights="none" pattern="XPS" />
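
A possibly less brutal variation, given that the error message only mentions `PDF', is to leave the other five lines alone and change just the PDF rule from "none" to "read|write". This is an assumption rather than something tested in this post, and the function name `allow_pdf_coder` is made up for illustration. The sketch below does not touch the real file: it reads a policy file and prints an edited copy, so the change can be reviewed before anything is copied into /etc.

```bash
#!/bin/bash
# Sketch only: relax just the PDF coder rule instead of deleting six lines.

# Reads a policy file on standard input and writes a copy on standard
# output in which the PDF coder rule grants read and write access.
# Every other line (PS, PS2, PS3, EPS, XPS, ...) passes through unchanged.
allow_pdf_coder() {
    sed 's/rights="none" pattern="PDF"/rights="read|write" pattern="PDF"/'
}

# Only do any work when a policy file was named on the command line.
# The edited policy is printed; nothing is modified in place.
if [ "$#" -eq 1 ]; then
    allow_pdf_coder < "$1"
fi
```

Saved as, say, `allow-pdf.sh` (a hypothetical name), `./allow-pdf.sh /etc/ImageMagick-6/policy.xml > /tmp/policy.xml` writes the edited copy to /tmp, where it can be compared with the original using `diff` before being copied into place as root.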

Neil's blog
