Privacy and PDF metadata

When you create a PDF file, what you see is not all you get. There is metadata embedded in the file that might be useful. It also might reveal information you’d rather not reveal.

The previous post looked at just the time stamp on a file. This post will look at more metadata, focusing on privacy implications.

Inspecting metadata

Here’s a little Python script we’ll use to inspect some of the metadata in a PDF. I say some because this does not pick out everything in every PDF.

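    from pypdf import PdfReader

    def print_metadata(filename):
        """Print each entry in the PDF's document information dictionary."""
        print("File: ", filename, "\n")
        reader = PdfReader(filename)
        meta = reader.metadata
        if meta is None:  # some PDFs carry no metadata at all
            print("(no metadata found)")
            return
        for m in meta:
            print(m, meta[m])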

Let’s run this on the “Hello world” example from the previous post.

    File:  humpty.pdf

    /Creator Writer
    /Producer LibreOffice 7.5
    /CreationDate D:20240208064322-06'00'

OK, so this shows that the file was created with LibreOffice Writer, version 7.5.

Time and location

It also shows when the file was written. As I discussed in the previous post, the file was written today at 6:43:22. But what I didn’t comment on before was the -06'00' at the end. This is my time zone, six hours behind GMT, i.e. US Central Standard Time.

Note that the time zone isn’t just time information, it’s also location information. It’s no secret that I live in Houston, but if I didn’t want to reveal my location, this time stamp would partially give away where I live. (Probably. Strictly speaking it reveals the time zone setting on my computer.)
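As a rough sketch, here is how the offset could be pulled out of a date string like the one above using only the Python standard library. The name parse_pdf_date and the regular expression are illustrative, not part of pypdf, and the pattern assumes the full-length layout seen in this post (all fields through seconds, optionally followed by Z or an offset). Shorter forms of the date string are not handled.

```python
import re
from datetime import datetime, timedelta, timezone

# Layout seen in this post: D:YYYYMMDDHHmmSS followed by Z or +HH'mm' / -HH'mm'.
# Assumption: every field through seconds is present.
PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)

def parse_pdf_date(s):
    """Convert a PDF date string to a datetime, keeping the UTC offset if any."""
    m = PDF_DATE.match(s)
    if m is None:
        raise ValueError(f"unrecognized PDF date: {s!r}")
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    zulu, sign, tz_h, tz_m = m.groups()[6:]
    if zulu:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(tz_h), minutes=int(tz_m or 0))
        tz = timezone(-offset if sign == "-" else offset)
    else:
        tz = None  # no time zone recorded in the string
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)

d = parse_pdf_date("D:20240208064322-06'00'")
print(d.isoformat())  # 2024-02-08T06:43:22-06:00
```

The offset is all the string carries. It does not name a time zone or a place, so any inference about location is a guess based on who uses that offset.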

Microsoft Word files

I repeated my “Hello world” file experiment with Microsoft Word on an old laptop. When I exported to PDF I got the following.

    /Author John Cook
    /Creator Microsoft® Word 2016
    /CreationDate D:20240208101055-06'00'
    /ModDate D:20240208101055-06'00'
    /Producer Microsoft® Word 2016

So this includes my name. The installation program for Microsoft Office asks for your name, and I must have provided it. Either LibreOffice doesn’t ask or I didn’t enter it.

When I print to PDF rather than export to PDF I get slightly different output.

    /Author John
    /CreationDate D:20240208101220-06'00'
    /ModDate D:20240208101220-06'00'
    /Producer Microsoft: Print To PDF
    /Title Microsoft Word - Document1

LaTeX files

Now let’s look at a PDF created from a LaTeX file. I created a file foo.tex with the following content

    \documentclass{article}
    \begin{document}
    Hello world.
    \end{document}

then compiled it with pdflatex foo.tex. Let’s see what metadata our Python code can find.

    /Producer pdfTeX-1.40.25
    /Creator TeX
    /CreationDate D:20240208075059-06'00'
    /ModDate D:20240208075059-06'00'
    /Trapped /False
    /PTEX.Fullbanner This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023/MacPorts 2023.66589_1) kpathsea version 6.3.5

Obviously the file was created with TeX [1]. You can usually identify TeX files by their appearance. You can make a TeX file look less distinctive by changing the default font and a few other things. But if you did so without changing the metadata, someone could still determine that the file was made using TeX.

I’m not trying to conceal that I use LaTeX. But if you create a PDF with an obscure program, maybe that reveals more than you’d like to reveal.

Operating system

You can see that the file was produced on a Mac: the /PTEX.Fullbanner field mentions MacPorts. When I compiled the same file on my Linux desktop, it showed the operating system as Debian but was not any more specific.
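That kind of detail is easy to extract mechanically. Here is a minimal sketch that pulls the parenthesized distribution string out of a /PTEX.Fullbanner value. It assumes the banner has the same layout as the one shown above, and tex_distribution is a hypothetical helper name, not something pdfTeX or pypdf provides.

```python
import re

def tex_distribution(banner):
    """Return the text inside the first pair of parentheses, or None."""
    m = re.search(r"\(([^)]*)\)", banner)
    return m.group(1) if m else None

# Banner copied from the pdflatex output earlier in this post.
banner = (
    "This is pdfTeX, Version 3.141592653-2.6-1.40.25 "
    "(TeX Live 2023/MacPorts 2023.66589_1) kpathsea version 6.3.5"
)
print(tex_distribution(banner))  # TeX Live 2023/MacPorts 2023.66589_1
```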

When you see that a file was created using Microsoft Word, it was probably created on Windows. I don’t have Word on my Mac, but I wouldn’t be surprised if the application was reported to be something like Office for macOS rather than just Word.

I created a document with Microsoft 365 online and it reported the following.

    /Author John Cook
    /Creator Microsoft Word
    /CreationDate D:20240208084209-08'00'
    /ModDate D:20240208084209-08'00'

The lack of an operating system in the Creator field may indicate that the document was created online. Note that the time zone is −8, i.e. Pacific Standard Time. This isn’t my time zone but the time zone of the server, perhaps in Seattle.
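To pull these observations together, here is a rough sketch of a checker that takes a metadata dictionary like the ones printed in this post and lists what it may give away. The function name privacy_hints and the three field lists are assumptions for illustration, based only on the fields that showed up in these examples. Other PDFs may carry other fields.

```python
import re

# Fields that carried identifying information in the examples in this post.
# Assumption: this is an illustrative list, not a complete inventory.
PERSONAL_FIELDS = ("/Author", "/Title")
SOFTWARE_FIELDS = ("/Creator", "/Producer", "/PTEX.Fullbanner")
DATE_FIELDS = ("/CreationDate", "/ModDate")

def privacy_hints(meta):
    """Return human-readable notes on what a metadata dict may reveal."""
    hints = []
    for key in PERSONAL_FIELDS:
        if meta.get(key):
            hints.append(f"{key} may identify a person or file: {meta[key]}")
    for key in SOFTWARE_FIELDS:
        if meta.get(key):
            hints.append(f"{key} identifies software: {meta[key]}")
    for key in DATE_FIELDS:
        # Look for a trailing UTC offset such as -06'00'
        m = re.search(r"([+-]\d{2})'(\d{2})'?$", str(meta.get(key, "")))
        if m:
            hints.append(f"{key} reveals UTC offset {m.group(1)}:{m.group(2)}")
    return hints

# Values copied from the Microsoft 365 example above.
sample = {
    "/Author": "John Cook",
    "/Creator": "Microsoft Word",
    "/CreationDate": "D:20240208084209-08'00'",
    "/ModDate": "D:20240208084209-08'00'",
}
for hint in privacy_hints(sample):
    print(hint)
```

A plain dict is used here so the sketch runs without a PDF on hand. The object returned by reader.metadata in the script above supports the same lookups.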

[1] LaTeX is written on top of TeX. The metadata says the file was created with TeX, because ultimately it really was.

5 thoughts on “Your PDF may reveal more than you intend”

Derek Jones

8 February 2024 at 12:10

A PDF also contains the identity of the program used to create any contained images, which is useful for reverse engineering data: https://shape-of-code.com/2013/12/19/converting-graphs-in-pdf-files-to-csv-format/

Nathan

8 February 2024 at 14:05

I find this stuff super fascinating; thanks for sharing!

I think about how some tools use heuristics to determine things that were not intended to be revealed. For instance, the popular networking tool nmap will send a variety of requests to a port or a server and based on the responses it gets back can figure out things like what operating system is running on that host or what version of the server software is running. This information is not necessarily intended to be exposed, but can be determined by examining known differences in output.

In the same way, it could be possible to fingerprint those PDF metadata by determining (for example) that Office 365 writes “Microsoft Word” as the creator, while Office 2016 writes “Microsoft® Word 2016”. The fact that Office 2016 identifies itself explicitly also implicitly identifies Office 365 because it doesn’t.

I assume that nobody has done this work because nobody cares, but as your post indicates, the very fact that it’s possible to do this might matter to someone.

Wayne Joubert

8 February 2024 at 18:51

Both MS Word and LibreOffice have a way to purge personal information from a document before saving. I wonder if this also purges the info from the PDF. This is a concern if, for example, you’re involved in a blind review of a publication or proposal, where identities are supposed to be anonymized.

Wayne Joubert

12 February 2024 at 09:35

Also, relatedly, the Gimp image editing application by default stores a lot of metadata in an image file, possibly a privacy hazard. The default behavior can be changed, but this is opt-in, and many users may not understand the implications. Likewise the purge of MS Word and LibreOffice personal metadata is opt-in, and many users might not know about this feature. Plus, not doing it is an easy mistake to make.

Tom T.

15 February 2024 at 10:05

There are specialized tools that can scrub PDF metadata, like BatchPurifier.
