Introducing Photocite

Recently I’ve been processing a number of family history photos and scans of old family artifacts such as letters. For historic images that I may share or distribute as part of my research, I want to be sure that I have a good citation, and I want to embed it in the image itself so that the sourcing information is less likely to be lost as the image gets passed around. I started off using Pixelmator Pro to painstakingly add text citations to the images, but this became arduous and somewhat inconsistent after a while, and if I needed to tweak a citation it got a bit fiddly.

I thought I could probably figure out a way to do this from the command line. It took more effort than I expected to find an efficient approach that behaved the way I wanted, so I thought I would share the tool and explain how it works.

Photocite is a Python script that chains together a few different tools to do this:

  • ImageMagick is a command-line Swiss Army knife for images; it’s a great tool and very fast. It even has some built-in captioning capability. I spent a bunch of time trying to get this to produce the type of captions I wanted, but I didn’t have any luck. I particularly had problems with mixing regular and italic text and with positioning the captions how I wanted.
  • LaTeX is a document preparation and typesetting system. With this I found I could get the consistency and formatting I wanted in the citations.
  • Pandoc is a universal document converter. I use this for converting Markdown to LaTeX.
  • PdfCrop comes with the LaTeX/TeX installation. I use this for trimming the citation PDF down to just the text.

The basic flow is as follows:

  1. Get the dimensions and DPI of the image
  2. If the image is a JPEG, get the quality of the JPEG
  3. Read in Markdown and use Pandoc and LaTeX to create a high resolution PDF of the citation
  4. Crop the PDF down to just encompass the text and some padding
  5. Convert the PDF to a PNG and resize it to be smaller than the original image
  6. Use ImageMagick to append the PNG to the original image.
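
To make the flow concrete, here is a minimal sketch of how those six steps could map onto command lines. This is not the actual Photocite source. The function names, the working directory, the 600 DPI rasterization density, the 10-point crop margin, and the 90% width are all assumptions I picked for illustration, and it assumes ImageMagick 7 (the `magick` command; on ImageMagick 6 you would use `identify` and `convert` instead).

```python
from pathlib import Path


def parse_identify(output: str) -> dict:
    """Parse the text printed by: magick identify -format "%w|%h|%x|%y|%Q" IMAGE

    Steps 1 and 2: width, height, DPI, and JPEG quality.
    Some ImageMagick versions append units to the resolution
    (e.g. "72 PixelsPerInch"), so only the leading number is kept.
    """
    width, height, x_dpi, y_dpi, quality = output.strip().split("|")
    return {
        "width": int(width),
        "height": int(height),
        "dpi": (float(x_dpi.split()[0]), float(y_dpi.split()[0])),
        "quality": int(quality),
    }


def build_commands(image, citation_md, info, workdir="build", width_fraction=0.9):
    """Return the argv lists for steps 3 to 6 (nothing is executed here).

    workdir and width_fraction are hypothetical parameters for this sketch.
    """
    src = Path(image)
    work = Path(workdir)
    pdf = str(work / "citation.pdf")
    cropped = str(work / "citation-crop.pdf")
    png = str(work / "citation.png")
    out = str(src.with_name(f"{src.stem} with citation{src.suffix}"))

    # Make the citation a bit narrower than the original image.
    target_width = round(info["width"] * width_fraction)

    # Step 6: stack the citation under the image on a white background.
    append = ["magick", image, png,
              "-gravity", "center", "-background", "white", "-append"]
    if src.suffix.lower() in {".jpg", ".jpeg"}:
        # Re-use the original JPEG quality so the output is comparable.
        append += ["-quality", str(info["quality"])]
    append.append(out)

    return [
        # Step 3: Markdown -> PDF via Pandoc and LaTeX, without a page number.
        ["pandoc", citation_md, "-o", pdf,
         "--pdf-engine=xelatex", "-V", "pagestyle=empty"],
        # Step 4: crop to the text plus some padding.
        ["pdfcrop", "--margins", "10", pdf, cropped],
        # Step 5: rasterize at high density, then scale down to the target width.
        ["magick", "-density", "600", cropped, "-resize", f"{target_width}x", png],
        append,
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` in order, after creating the working directory. Building the commands as plain lists keeps file names with spaces (like my citation files) safe from shell quoting problems.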

I mostly used Claude.ai to generate the Python code, but did some hand-tweaking as well.

Here’s an example:

Assuming I have an image and a Markdown file that contains the text of my citation:

$ cat "Charles and Rhoda and possibly Hubert Crane.md"
Photograph depicting Charles Irvin Crane, Rhoda Ellen (Jenkins) Crane, and possibly Hubert Crane, ca. late 1895. Original print, approx. 6 × 4.5 in.; privately held by Todd Wells, Seattle, Washington, 2025. Inscription in the cursive handwriting of Agnes Crane Wells on back reads: “Charles & Rhoda Crane (& Hubert??)”. 

Then I can execute

$ photocite crane.jpg -c "Charles and Rhoda and possibly Hubert Crane.md"
Created 'crane with citation.jpg' using citation text from file: Charles and Rhoda and possibly Hubert Crane.md
$ 

And I get a new image file with a citation embedded:


PDF Slurping

I’ve been using Advantage Archives for looking at the newspaper archives of a number of different libraries as a part of genealogy research.

The trouble is that each library has a slightly different UI for browsing these newspapers, and the experience can be fairly cumbersome. Ultimately, you can download a PDF if you want, but all the clicking around still makes the process slow and frustrating.

Of course the built-in macOS Preview tool can show you PDFs, too, and its navigation/zoom interface is easier (especially using pinch-to-zoom, etc).

TextExpander Citations

In my research I’ve been finding a lot of newspaper articles and transcribing them using automation. But when I’m documenting those articles and linking them to people, I also need to create a source citation. I try to use Evidence Explained-style citations to the best of my ability, but there’s a lot of repeated boilerplate text when writing these up, and there is formatting! Some parts are italicized, so that gets tedious, too.

Automating Transcription with ChatGPT

I recently discovered a trove of online newspaper archives for a region where a branch of my family research is focused. Advantage Archives partners with mostly small-town libraries, especially in the midwestern USA, to digitize old newspaper archives and put them online.

The discovery of all these newspaper archives has led me to want to transcribe and write source citations for hundreds of articles. You know what’s tedious? Transcribing and citing hundreds of articles! This is a job for automation!

Hello World

Welcome to Pedigree Pipeline, a place for me to share tips and techniques for automating genealogy and family history research — along with other insights or discoveries from my research itself. Much of my work is computer-centric these days, and I often find myself repeating the same tasks. As a software engineer, I’m always thinking about how to automate these workflows — primarily on a Mac, using the Unix command line and Mac-centric automation tools.