Yesterday, I wrote about how I cleaned up a lot of old mess here on this blog. This made me realize that I had not paid attention to creating alt text for all my images, that is, descriptive text that conveys the content of an image. That is an enormous job when you have several thousand images on a blog like this. So I decided to ask Copilot for help.
Getting AI help
Copilot first suggested creating alt text based on the image file names. That could have worked if the filenames had been descriptive, but in general they are not, so I asked for a solution that would actually analyse the content of the images. After a series of iterations, we (Copilot and I) ended up with a Python script that does the job.
This is what the script does:
- Scan all my blog posts (content/**/*.md) and check which images already have alt text.
- Generate descriptive alt text for an image using a local Hugging Face image-to-text model.
- Produce a machine-readable report with tagging suggestions.
- Add alt text to images and back up the old files.
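The scanning step might look roughly like the sketch below. I am not showing the actual script here, so the regex and the function names are my own assumptions about how one could find markdown images that lack alt text:

```python
import re
from pathlib import Path

# Markdown image syntax: ![alt text](path). An image with no alt
# text shows up as "![](path)".
IMG_RE = re.compile(r'!\[(?P<alt>[^\]]*)\]\((?P<path>[^)]+)\)')

def images_missing_alt(markdown: str) -> list[str]:
    """Return the paths of images whose alt text is empty."""
    return [m.group("path") for m in IMG_RE.finditer(markdown)
            if not m.group("alt").strip()]

def scan_posts(content_dir: str = "content") -> dict[str, list[str]]:
    """Map each post file to the images in it that lack alt text."""
    report = {}
    for post in Path(content_dir).glob("**/*.md"):
        missing = images_missing_alt(post.read_text(encoding="utf-8"))
        if missing:
            report[str(post)] = missing
    return report
```

This only handles standard markdown image syntax; posts that embed images through shortcodes or raw HTML would need extra patterns.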
It all ran remarkably quickly on my laptop. With a GPU, it swept through a few thousand images in just a couple of minutes.
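The last step above, writing the generated captions back into the posts and backing up the originals, could be sketched like this; the `.bak` convention and the `{image path: caption}` mapping are my assumptions, not details from the actual script:

```python
import re
import shutil
from pathlib import Path

# Match only images whose alt text is empty: ![](path)
EMPTY_ALT_RE = re.compile(r'!\[\]\((?P<path>[^)]+)\)')

def add_alt_text(markdown: str, captions: dict[str, str]) -> str:
    """Fill in empty alt texts from a {image path: caption} mapping."""
    def repl(m: re.Match) -> str:
        caption = captions.get(m.group("path"), "")
        return f'![{caption}]({m.group("path")})'
    return EMPTY_ALT_RE.sub(repl, markdown)

def update_post(post: Path, captions: dict[str, str]) -> None:
    """Back up the post, then rewrite it with alt text filled in."""
    original = post.read_text(encoding="utf-8")
    shutil.copy2(post, post.with_suffix(post.suffix + ".bak"))  # keep the old file
    post.write_text(add_alt_text(original, captions), encoding="utf-8")
```

Images that already have alt text are left untouched, since the regex only matches an empty pair of brackets.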
Transformers
The script uses the nlpconnect/vit-gpt2-image-captioning model: a transformer-based image‑to‑text model that uses a Vision Transformer (ViT) encoder to process the image and a GPT‑2 (transformer) decoder to generate the caption.
Transformers are a class of neural network architectures for processing sequences. They use self-attention mechanisms to compute pairwise attention weights between elements of a sequence, enabling the model to capture long-range dependencies and contextual relationships while allowing parallel computation.
You can think of a transformer as a smart reader that looks at every part of its input and decides which parts matter most. Instead of reading strictly left to right, as recurrent sequence models do, it compares all pieces to each other (self‑attention) to better understand context. It uses multiple “attention heads,” which lets the model look for different kinds of relationships at once.
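As a concrete illustration (not the captioning model itself, just the core mechanism the two paragraphs above describe), single-head scaled dot-product self-attention fits in a few lines of numpy:

```python
import numpy as np

def self_attention(x: np.ndarray,
                   w_q: np.ndarray,
                   w_k: np.ndarray,
                   w_v: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention.

    x is a (seq_len, d_model) input sequence; w_q, w_k, w_v project
    it to queries, keys, and values.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Pairwise attention logits between every element and every other
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax over each row: the attention weights for one element
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a context-weighted mixture of all the values
    return weights @ v
```

Multi-head attention simply runs this several times with different projection matrices and concatenates the results, which is what lets the model look for different relationships at once.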
Writing good alt text
I have previously written about an object-action-context approach to writing alt text. That was in the context of audio files, which may also need descriptive text for those who cannot hear them. The idea carries over: identify the objects in the scene (auditory or visual), the actions performed on or with them, and the environment in which it all happens.
In addition, there are some general guidelines for what makes up a good alt text:
- Be specific and concise
- Include important context
- Skip “image of” or “photo of” unless needed for context
- If an image is purely decorative (carries no meaning), leave the alt text empty (alt="") so screen readers skip it.
Following that last rule, I have left all the post cover images without alt text, since they don’t usually carry much meaning.
Of course, mass creation of alt text using an AI tool is not perfect. Ideally, I would have created them all manually. However, that is not going to happen for all my former posts. For now, at least I have some descriptive text for all images. That is better than nothing.
