Warning: this is a "quick and dirty" approach. It is completely the wrong thing to use if you do not trust the PDF that you are redacting, because the script has to open the PDF in order to convert it. In my testing, it does remove the redacted text, but my testing is limited, and I am not the NSA.
Occasionally, I want to redact text from a document, typically a PDF, before I share it with someone.
On macOS, there is the excellent PDFPen Pro. But I have not found a simple solution for Linux. So I came up with this.
Please see the warning above: it appears to work, but I am making no promises that it will work suitably for your use case. You might prefer dangerzone.
Xournal++ to block out the bits I want to redact
I use Xournal++ a lot, because it's a great tool for marking up PDFs on the touchscreen of my Microsoft Surface computer. I can sit comfortably, and, a bit like marking up a printed document, I can scribble on it to my heart's content.
For this scenario, I use the "Draw Rectangle" tool to block out the text I want to redact:
By default, this just adds an outline, so you need to select the "Fill" option (Tools / Pen Options / Fill):
Then increase the opacity of the fill (Tools / Pen Options / Fill Opacity) to 100%:
And finally select "Black" as your colour.
When you've done that, you can select the bits of the document you want to block out.
No, it's not as convenient as being able to highlight text and redact it, and it does not offer search-based redaction (e.g. redacting all instances of someone's name).
When you're done, export the PDF (File / Export as PDF).
Warning: you have not redacted the PDF
At this stage, you do not have a redacted PDF. You have a PDF with a black box over bits of the text. The underlying text is still there.
If you want to test this, just open the blocked PDF, and select all the text. The text under the blocking is still visible / selectable:
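The same check can be done from a terminal. This is a minimal sketch, assuming pdftotext (from poppler-utils) is installed; the function names, the file name and the phrase are made up for illustration.

```shell
#!/bin/bash
# phrase_found PHRASE: succeeds if PHRASE occurs in the text read from stdin.
# -F treats the phrase as a literal string, -i ignores case, -q prints nothing.
phrase_found() {
    grep -qiF -- "$1"
}

# pdf_contains FILE PHRASE: extract the text layer of a PDF with pdftotext
# ("-" sends the text to stdout) and search it for the phrase.
pdf_contains() {
    pdftotext "$1" - | phrase_found "$2"
}

# Usage (placeholders):
#   pdf_contains blocked.pdf "Jane Doe" && echo "text is still in the PDF"
```

A phrase that wraps across two lines in the PDF may not match, so short, distinctive phrases are the safer test.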
Redacting the blocked text
I use the following quick and dirty script to redact the blocked text, and then to OCR the remaining file. (You could skip this last bit, if making the remaining text of the PDF searchable was not of benefit to you.)
    #!/bin/bash

    INPUTPDF="$1"
    REDACTEDPDF="Redacted_$INPUTPDF"
    OCRPDF="OCR_$REDACTEDPDF"

    gs -o /tmp/RedactedPDFConvertedToTif.tif -sDEVICE=tiffg4 "$INPUTPDF" &&
    convert /tmp/RedactedPDFConvertedToTif.tif "/tmp/$REDACTEDPDF"
    ocrmypdf "/tmp/$REDACTEDPDF" "$OCRPDF"

    rm /tmp/RedactedPDFConvertedToTif.tif
    rm "/tmp/$REDACTEDPDF"
Why "quick and dirty"?
- It does not check the PDF is safe
- It does not deal with complex or unusual PDF names
- It does not check for naming collisions
- It does not use full paths for the executables
- It does not handle errors or show error messages (aside from those coming from the tools it calls)
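Some of these gaps could be narrowed with a little more shell. The following is a sketch of a more defensive variant, not the script I actually use: the function names and the temporary file names are invented, and it still does not check that the PDF is safe or use full paths for the executables.

```shell
#!/bin/bash
# output_name INPUT: build the output name from the input's base name, so
# that "docs/My File.pdf" becomes "OCR_Redacted_My File.pdf".
output_name() {
    printf 'OCR_Redacted_%s\n' "$(basename -- "$1")"
}

# redact INPUT: the same three steps (gs, convert, ocrmypdf), with basic checks.
redact() {
    local input="$1"
    local output
    output="$(output_name "$input")"

    if [ ! -f "$input" ]; then
        echo "Input file not found: $input" >&2
        return 1
    fi
    if [ -e "$output" ]; then
        echo "Refusing to overwrite existing file: $output" >&2
        return 1
    fi

    # Use a private temporary directory instead of fixed names in /tmp.
    local workdir
    workdir="$(mktemp -d)" || return 1

    # Stop at the first step that fails and remember its exit status.
    local status=0
    gs -o "$workdir/pages.tif" -sDEVICE=tiffg4 "$input" \
        && convert "$workdir/pages.tif" "$workdir/flattened.pdf" \
        && ocrmypdf "$workdir/flattened.pdf" "$output" \
        || status=$?

    rm -rf -- "$workdir"
    if [ "$status" -ne 0 ]; then
        echo "Redaction failed for: $input" >&2
    fi
    return "$status"
}

# Run only when a file name is given, e.g.: ./redact.sh "docs/My File.pdf"
if [ "$#" -gt 0 ]; then
    redact "$1"
fi
```

The output lands in the current directory, as before, but the name no longer depends on any directory part of the input path.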
But, assuming you are comfortable with those constraints, what it does is:
- uses ghostscript to convert the blocked PDF into the .tif image format
- uses imagemagick's convert function to turn the .tif image back into a PDF
- uses ocrmypdf to OCR the newly-created PDF
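All three steps depend on the relevant commands being on your PATH. As a small sketch (the function name is hypothetical), you could check for them before running anything:

```shell
#!/bin/bash
# missing_tools NAME...: print each command that cannot be found on PATH.
# Returns 0 if everything was found and 1 if at least one is missing.
missing_tools() {
    local tool
    local status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            status=1
        fi
    done
    return "$status"
}

# Usage, with the three commands the script calls:
#   missing_tools gs convert ocrmypdf || echo "install the tools listed above"
```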
What you should find is that you get a PDF with the blocked text removed, but the rest of the PDF is searchable:
If you open the output file (beginning OCR_Redacted...), and select all the text, and paste it into a text editor, you should not be able to see the blocked out text:
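If you have several blocked-out phrases to check, a loop saves some copying and pasting. This is a rough sketch, assuming you first dump the output PDF to a text file with pdftotext (from poppler-utils); the function name, file names and phrases are placeholders. It only tells you about the phrases you list, so it is a sanity check and not proof.

```shell
#!/bin/bash
# surviving_phrases TEXTFILE PHRASE...
# Prints every phrase that still occurs in TEXTFILE (literal,
# case-insensitive match). Returns 0 if none survive, 1 otherwise.
surviving_phrases() {
    local textfile="$1"
    local phrase
    local status=0
    shift
    for phrase in "$@"; do
        if grep -qiF -- "$phrase" "$textfile"; then
            printf '%s\n' "$phrase"
            status=1
        fi
    done
    return "$status"
}

# Usage (placeholders):
#   pdftotext "OCR_Redacted_blocked.pdf" output.txt
#   surviving_phrases output.txt "Jane Doe" "01234 567890" \
#       && echo "none of the listed phrases were found"
```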
Errors you might run into:
convert-im6.q16: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/421.
This is down to an imagemagick security policy. The solution described here feels a bit brutal, but it worked.
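To see what is doing the blocking, you can look for the PDF rule in the policy file. The sketch below is not necessarily the linked solution. It assumes a Debian/Ubuntu-style ImageMagick 6 install, where the file is /etc/ImageMagick-6/policy.xml and the rule has the form `<policy domain="coder" rights="none" pattern="PDF" />`; the path, the attribute order and the function names are all assumptions. Loosening the rule lets convert process PDFs again, which is exactly what the policy was there to prevent, so only do this if you are comfortable with that.

```shell
#!/bin/bash
# pdf_policy_lines FILE: print, with line numbers, the policy lines that
# mention the PDF coder.
pdf_policy_lines() {
    grep -n 'pattern="PDF"' "$1"
}

# allow_pdf_coder: read a policy file on stdin and write a copy to stdout in
# which a rights="none" rule for the PDF coder becomes rights="read|write".
# Nothing is modified in place, so the result can be reviewed first.
allow_pdf_coder() {
    sed 's/rights="none" pattern="PDF"/rights="read|write" pattern="PDF"/'
}

# Usage (the path is an assumption and varies by distribution and version):
#   pdf_policy_lines /etc/ImageMagick-6/policy.xml
#   allow_pdf_coder < /etc/ImageMagick-6/policy.xml > /tmp/policy.xml
#   diff /etc/ImageMagick-6/policy.xml /tmp/policy.xml
```

Copying the reviewed file back over the original needs root, and a package update may restore the stricter rule.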