The Hidden Metadata in Your PDFs
In 2005, the U.S. military accidentally revealed classified information in a PDF report on the shooting of Italian intelligence officer Nicola Calipari in Iraq. The visible text looked properly redacted: black boxes covered the sensitive content. But when readers copied the text and pasted it elsewhere, the original words came right out of the clipboard.
The investigators had used PDF editing software to draw black rectangles over text. The text itself was still there, hidden but extractable. This is just one example of how PDF metadata and structure can leak information you never intended to share.
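To make the failure concrete, here is a minimal sketch of why a drawn rectangle hides nothing. It assumes a hand-written, uncompressed page content stream (real files usually compress these streams, and the sample text is invented). The function name extract_shown_text is made up for this illustration, and its regex ignores escapes and nested parentheses.

```python
import re

# Hypothetical, hand-written content stream: it shows one line of text,
# then paints a filled black rectangle on top of it.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
72 700 Td
(Witness: Jane Example) Tj
ET
0 0 0 rg
70 696 200 16 re
f
"""


def extract_shown_text(stream: bytes) -> list[str]:
    """Return the literal strings passed to the Tj (show text) operator."""
    # Simplification: matches "(...) Tj" only; no escapes, no nested parentheses.
    matches = re.findall(rb"\(([^()]*)\)\s*Tj", stream)
    return [m.decode("latin-1") for m in matches]


# The rectangle is a separate drawing instruction; the text operator is untouched.
print(extract_shown_text(CONTENT_STREAM))
```

The black box and the text are two independent instructions in the same stream, which is why copy and paste (or any text extractor) still sees the words.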
What's Actually in a PDF?
PDF is a container format: a collection of numbered objects tied together by a cross-reference table, closer to an archive with internal structure than to a flat document. A typical PDF contains:
- Content streams: The actual visible text and graphics
- Embedded fonts: Subsets of fonts used in the document
- Images: Often stored in full resolution regardless of display size
- Metadata: Author name, creation date, editing software, and more
- Annotations: Comments, highlights, sometimes including deleted ones
- Previous versions: PDF supports incremental saves that keep old content
When you open the same PDF in different viewers, they all parse this container format and render the content. What you see is a reconstruction, not a static image.
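As a rough illustration of the container idea, the sketch below counts a few name tokens in the raw bytes of a file. It is a heuristic, not a parser: it assumes the objects are stored uncompressed (many modern files pack them into compressed object streams, which this will not see), and annotations that omit the optional /Type entry are missed. The MARKERS table, the inventory function, and the SAMPLE bytes are all invented for the example.

```python
import re

# Name tokens that hint at each kind of content (heuristic only).
MARKERS = {
    "pages": rb"/Type\s*/Page\b",          # \b keeps /Pages from matching
    "fonts": rb"/Type\s*/Font\b",          # \b keeps /FontDescriptor from matching
    "images": rb"/Subtype\s*/Image\b",
    "annotations": rb"/Type\s*/Annot\b",
    "xmp_metadata": rb"/Type\s*/Metadata\b",
}


def inventory(pdf_bytes: bytes) -> dict[str, int]:
    """Count how often each marker appears in the raw file bytes."""
    return {name: len(re.findall(pattern, pdf_bytes)) for name, pattern in MARKERS.items()}


# A schematic fragment, not a complete valid PDF.
SAMPLE = (
    b"%PDF-1.7\n"
    b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj\n"
    b"2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj\n"
    b"3 0 obj << /Type /Page /Parent 2 0 R >> endobj\n"
    b"4 0 obj << /Type /Font /Subtype /Type1 /BaseFont /Helvetica >> endobj\n"
    b"5 0 obj << /Type /XObject /Subtype /Image /Width 4000 /Height 3000 >> endobj\n"
    b"6 0 obj << /Type /Annot /Subtype /Text /Contents (check this) >> endobj\n"
)

print(inventory(SAMPLE))
```

For real files a proper PDF library is the better tool; the point here is only that a "document" is a bag of typed objects that a viewer assembles on the fly.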
The Metadata Problem
By default, most PDF creation tools embed metadata about the document and its creator:
- Author field: Often pulled from your user account name
- Creation and modification dates: Including precise timestamps
- Creator application: What software made the PDF
- Title and subject: Sometimes auto-filled from the first heading
This information seems harmless until it's not. Sending a proposal to a client? Maybe you don't want them to see it was created in Google Docs (not the expensive design software you implied you use). Submitting work anonymously? Your full name might be in the author field.
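The fields listed above usually live in the document information dictionary. A minimal sketch of reading them from raw bytes follows, under several assumptions: the dictionary is stored uncompressed, values are simple literal strings without escapes or nested parentheses, and there is no separate XMP metadata packet (which can repeat the same facts). The function name read_info_fields and all sample values are hypothetical.

```python
import re

# Standard keys of the document information dictionary.
INFO_KEYS = ("Title", "Author", "Subject", "Creator", "Producer", "CreationDate", "ModDate")


def read_info_fields(pdf_bytes: bytes) -> dict[str, str]:
    """Return the first literal-string value found for each known key."""
    found = {}
    for key in INFO_KEYS:
        # Simplification: "(...)" values only; hex strings and escapes are ignored.
        match = re.search(rb"/" + key.encode("ascii") + rb"\s*\(([^()]*)\)", pdf_bytes)
        if match:
            found[key] = match.group(1).decode("latin-1")
    return found


# Invented sample values, laid out the way an Info dictionary typically looks.
SAMPLE = (
    b"7 0 obj\n"
    b"<< /Author (jsmith) /Creator (ExampleWriter 1.0)\n"
    b"   /CreationDate (D:20240115093012Z) >>\n"
    b"endobj\n"
)

print(read_info_fields(SAMPLE))
```

Even this toy record names a user account, a piece of software, and a timestamp accurate to the second.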
Images Tell Stories
A photo you dropped into your PDF might carry:
- EXIF data: Camera model, settings, and potentially GPS coordinates
- Original resolution: Even if displayed small, the full image might be embedded
- Color profile: An embedded ICC profile that can hint at the device or software that produced the image
That casual photo of the product prototype might reveal the location of your secret manufacturing facility. A screenshot that was cropped or scaled down inside the document editor might still be embedded at full size, revealing the other windows that were visible.
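The resolution point comes down to simple arithmetic. PDF measures page space in points, 72 to the inch by default, so dividing an image's pixel width by its displayed width in inches gives the resolution a reader can actually recover by zooming or extracting. The sketch below works through one case; the function names and the 150 DPI target are assumptions chosen for illustration.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit


def effective_dpi(pixel_width: int, display_width_pt: float) -> float:
    """Pixels per inch actually stored for an image shown at the given width."""
    return pixel_width / (display_width_pt / POINTS_PER_INCH)


def pixels_needed(display_width_pt: float, target_dpi: int = 150) -> int:
    """Pixel width that would be enough for the displayed size (target is an assumption)."""
    return round(display_width_pt / POINTS_PER_INCH * target_dpi)


# A 4000-pixel-wide photo placed 144 pt (2 inches) wide on the page:
print(effective_dpi(4000, 144))  # 2000.0
print(pixels_needed(144))        # 300
```

In this example the file carries more than ten times the pixel width the layout needs, and all of it is available to anyone who extracts the image.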
The Incremental Save Trap
PDF supports "incremental saves"—instead of rewriting the entire file, changes append to the end. This makes saving faster but has a side effect: old content may persist in the file.
Delete a paragraph in Word, export to PDF? The deleted text is gone. Delete text in a PDF editor's "edit" mode? Depending on the software, the original might still be there, just hidden from the rendering engine.
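One way to spot an incrementally saved file is to count its end-of-file markers, since each appended update ends with its own %%EOF. The sketch below does that and also cuts the file at each earlier marker to recover candidate older versions. Two caveats: linearized ("fast web view") files can also contain more than one marker, so a count above one is a hint rather than proof, and the sample bytes are schematic rather than a valid PDF. The function names are invented for the example.

```python
def revision_offsets(pdf_bytes: bytes) -> list[int]:
    """Return the byte offset just past each %%EOF marker in the file."""
    marker = b"%%EOF"
    offsets, start = [], 0
    while (index := pdf_bytes.find(marker, start)) != -1:
        start = index + len(marker)
        offsets.append(start)
    return offsets


def earlier_revisions(pdf_bytes: bytes) -> list[bytes]:
    """Every prefix ending at a %%EOF marker except the last one."""
    return [pdf_bytes[:end] for end in revision_offsets(pdf_bytes)[:-1]]


# Schematic stand-ins for a first save and an appended update.
ORIGINAL = b"%PDF-1.7\n1 0 obj (Draft price: 40000) endobj\ntrailer\n%%EOF\n"
UPDATE = b"1 0 obj (Price on request) endobj\ntrailer\n%%EOF\n"
saved = ORIGINAL + UPDATE

print(len(revision_offsets(saved)))                   # 2
print(b"Draft price" in earlier_revisions(saved)[0])  # True
```

The viewer only follows the last update, but the bytes of the first save are still sitting at the front of the file.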
This is why proper redaction tools exist. They don't just draw black boxes—they remove the underlying content entirely and sometimes re-render the document to ensure nothing persists.
Merging and Splitting: Hidden Complexity
When you merge multiple PDFs, you're combining containers. Each source file might have different:
- Page sizes
- Embedded fonts (which may duplicate)
- Color profiles
- PDF versions
- Security settings
Naive merging can create bloated files where the same font is embedded multiple times, or where incompatible security settings cause viewer warnings.
When you split a PDF, you're extracting pages—but the resulting files might still contain fonts and resources from pages you didn't include, bloating file size.
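The duplicated-font problem can be sketched without a PDF library. Assume some earlier step has already pulled out each embedded font program as a (label, bytes) pair; that extraction step, the labels, and the placeholder byte strings below are all hypothetical. Hashing the bytes shows which copies are identical and how much space a naive merge wastes. Subsetted fonts from different sources usually differ byte for byte, so this only catches exact duplicates.

```python
import hashlib


def group_identical_fonts(fonts: list[tuple[str, bytes]]) -> dict[str, list[str]]:
    """Group font labels by the SHA-256 digest of their embedded program bytes."""
    groups: dict[str, list[str]] = {}
    for label, program in fonts:
        digest = hashlib.sha256(program).hexdigest()
        groups.setdefault(digest, []).append(label)
    return groups


def wasted_bytes(fonts: list[tuple[str, bytes]]) -> int:
    """Total size of every font program that repeats one already seen."""
    seen: set[str] = set()
    wasted = 0
    for _, program in fonts:
        digest = hashlib.sha256(program).hexdigest()
        if digest in seen:
            wasted += len(program)
        seen.add(digest)
    return wasted


# Placeholder bytes standing in for real font programs from three source files.
fonts = [
    ("cover.pdf:Helvetica", b"\x00FONTDATA-A" * 100),
    ("report.pdf:Helvetica", b"\x00FONTDATA-A" * 100),
    ("appendix.pdf:Courier", b"\x00FONTDATA-B" * 100),
]

for labels in group_identical_fonts(fonts).values():
    print(labels)
print(wasted_bytes(fonts))
```

A merge tool that is aware of this can keep one copy and point every page at it; a naive one writes all three.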
What You Can Do
For documents you're sharing externally:
- Check metadata before sending. Most PDF readers list it in a document properties dialog.
- Use proper redaction if hiding content. Don't just draw over it.
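- Consider "print to PDF" if you need a clean copy without history. This renders and rebuilds the document fresh, which drops earlier revisions. The new file still gets metadata of its own, and text hidden under drawn shapes can survive, so it is not a substitute for redaction.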
- Watch image sources. Strip EXIF data from photos before including them.
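For the last item, here is a minimal sketch of stripping EXIF from a JPEG before it goes into a document. It walks the segments that precede the compressed image data and drops any APP1 segment that starts with the Exif identifier. It assumes a well-formed file with no padding bytes between segments, and it does not touch XMP packets (also stored in APP1), which can carry similar details. The strip_exif and segment names are invented, and the sample bytes are synthetic: they follow the segment layout but are not a decodable image.

```python
import struct


def strip_exif(jpeg: bytes) -> bytes:
    """Return a copy of the JPEG bytes without APP1/Exif segments."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(jpeg[:2])
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError("unexpected byte where a marker should be")
        marker = jpeg[pos + 1]
        if marker == 0xDA:  # start of scan: compressed data follows, stop walking
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg[pos + 2 : pos + 4])
        seg = jpeg[pos : pos + 2 + length]
        is_exif = marker == 0xE1 and seg[4:10] == b"Exif\x00\x00"
        if not is_exif:
            out += seg
        pos += 2 + length
    out += jpeg[pos:]  # copy the scan data and everything after it unchanged
    return bytes(out)


def segment(marker: int, payload: bytes) -> bytes:
    """Build one marker segment (helper for the synthetic sample only)."""
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload


jfif = segment(0xE0, b"JFIF\x00" + b"\x01\x02\x00\x00\x01\x00\x01\x00\x00")
exif = segment(0xE1, b"Exif\x00\x00" + b"placeholder for camera and GPS tags")
scan = b"\xff\xda" + b"\x00\x02" + b"imagedata" + b"\xff\xd9"
sample = b"\xff\xd8" + jfif + exif + scan

cleaned = strip_exif(sample)
print(len(sample), len(cleaned))
print(b"Exif" in cleaned)
```

Image editors and dedicated metadata tools do the same job more robustly; the sketch just shows that EXIF is a separable block that can be removed without touching the picture data.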
The Upside: PDFs That Preserve
The same features that make PDFs leak information also serve legitimate purposes:
- Embed fonts so documents look identical everywhere
- Include high-resolution images that readers can zoom into
- Preserve edit history for compliance requirements
- Track document lineage for version control
PDF's complexity exists because it tries to be a complete, self-contained document representation. The trade-off is that it's very easy to include more than you intended.
Understanding what's in your PDFs gives you control over what you're actually sharing.