Improving the PDF files in the NIME archive

This blog post summarizes my experimentation with improving the quality of the PDF files in the proceedings of the annual International Conference on New Interfaces for Musical Expression (NIME).

Centralized archive

We have, over the last few years, worked hard on getting the NIME proceedings adequately archived. Previously, the files were scattered across each year’s conference website. The first step was to create a central archive on nime.org. The list there is automagically generated from a collection of publicly available BibTeX files that serve as the master documents of the proceedings archive. The fact that the metadata is openly available on GitHub makes it possible for people to fix errors in the database. Yes, there are errors here and there, because the files were made by “scraping” the PDF content, and it is simply not possible to check everything manually for more than 1000 PDF files.

The archive points to all the PDF files, some media files (more are coming), and DOIs to archived PDFs in Zenodo. Together, this has turned out to be a stable and (we believe) future-proof solution.

PDF problems

However, as it turns out, the PDF files in the archive have various issues. They all open fine in regular PDF readers, but many of them have accessibility issues. There are (at least) three problems with this.

  1. Non-accessible PDFs do not work well for people using alternative readers. We need to strive for universal access at NIME, and this includes the archive.
  2. The files are not optimized for text mining, something more and more people have become interested in. Such an extensive collection of files is a great resource for understanding a community and how it has developed. This was something I tried myself in a NIME paper in which I analyzed the use of the word “gesture” in all NIME papers up until 2013.
  3. If machines have problems with the files, so do the Google crawlers and other robots looking at the content of the files. This, again, has implications for how the files can be read and indexed in various academic databases.

It is not strange that there are issues with the files. After all, there are a total of 1728 of them, produced from 2001 until today on a myriad of operating systems and with all sorts of software. During this time, the PDF standard itself has also evolved considerably. For that reason, we have found it necessary to do some optimization of the files.

Renaming

The first thing I did was to download the entire collection of PDFs. I quickly discovered that there were some inconsistencies in the file names. We did a large cleanup of the file names some years ago, so things were not entirely bad, but it was still necessary to settle on a single naming convention. I ended up renaming everything to a pattern like:

nime2001_paper001.pdf

This makes it possible to sort by year first, then by submission type (currently only paper and music, but there could be more), and finally by a three-digit unique number based on the submission number. Not all the numbers had leading 0s, so I added those as well for consistency. Since the conference year and ID are unique, it is easy to do a query-replace in the BibTeX database to correct the links there.
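
The zero padding can be added with a small shell loop. The following is only a sketch (assuming bash, and that the files already follow the nimeYYYY_type pattern so that only the trailing submission number needs padding):

for f in nime*_*.pdf; do
  base="${f%.pdf}"                        # e.g. nime2001_paper1
  num="${base##*[!0-9]}"                  # trailing digits, e.g. 1
  [ -n "$num" ] || continue               # skip files without a trailing number
  stem="${base%"$num"}"                   # e.g. nime2001_paper
  new="$(printf '%s%03d.pdf' "$stem" "$((10#$num))")"
  [ "$f" = "$new" ] || mv -n "$f" "$new"  # only rename if the name actually changes
done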

Acrobat testing

I usually don’t work much in Acrobat these days, but decided to start my testing there. I was able to get access to a copy of Acrobat XI on a university machine and started looking into different options. From the list of batch processes available, I found these to be particularly promising:

  • “Optimize scanned documents” (converting content into searchable text and reducing file size)
  • “Prepare for distribution” (removing hidden information and other oddities)
  • “Archive documents” (creating PDF/A-compliant documents)

I first tried to run a batch process using OCR. The aim here was to see if I could retrieve some text from files with images containing text. This did not work particularly well. It skipped most files and crashed on several. After the tenth crash, I gave up and moved on.

The “prepare for distribution” option worked better. It ran through the first 300 files or so with no problems and reduced the file sizes nicely. But then the problems started: for many of the files, it simply crashed. And when I came to the 2009 files, they turned out to be protected from editing. So I gave up again.

Finally, I tried the archiving function. Here it popped up a dialogue box asking me to fill in the title and authors for every single file. I agree that this information would be nice to have, but I do not have time to enter it manually for 1728 files.

All in all, my Acrobat exploration turned out to be quite unsuccessful. I therefore went back to my Ubuntu machine and decided to investigate what kind of command-line tools I could use to get the job done.

File integrity

After searching some forums about how to check whether PDF files are corrupted, I came across the useful qpdf application. Running it on the original NIME collection showed that the majority of files had issues:

find . -type f -iname '*.pdf' \( -exec sh -c 'qpdf --check "$1" > /dev/null && echo "$1: OK"' sh {} \; -o -exec echo "{}": FAILED \; \)

The check showed that only 794 of the files were labeled as OK, while the other 934 failed. I looked at the failing files, trying to figure out what was wrong, but I have been unable to find any consistency among the failing or passing files. Initially, I thought that there might be differences based on whether they were made in LaTeX or MS Word (or something else), the platform, and so on. But it turns out not to be that easy. This may also be because many of the files have been through several steps of updating along the way. For example, for many of the NIME editions, the paper chairs have added page numbers, watermarks, and so on.
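
A simple loop gives the same kind of summary as the one-liner above; here is a sketch, assuming all the PDFs sit in one folder rather than in subfolders:

ok=0; failed=0
for f in *.pdf; do
  # count each file as OK or FAILED based on the qpdf exit code
  if qpdf --check "$f" > /dev/null 2>&1; then ok=$((ok+1)); else failed=$((failed+1)); fi
done
echo "OK: $ok, FAILED: $failed"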

Rather than trying to fix the myriad of different problems individually, I hoped that a compression step and saving to a newer (and common) PDF version could fix most of them.

File compression

Several of the files were unnecessarily large. Some files were close to 100 MB, and too many were more than 2 MB. This should not be necessary for 4-6 page PDF files. Large files cause bandwidth issues on the server, which means extra cost for the organization and long download time for the user. Although we don’t think about it much, saving space also saves energy and helps reduce our carbon footprint on the planet.

To compress the PDF files, I turned to Ghostscript (the gs command-line utility). I experimented with different settings, but found that “screen” and “ebook” rendered the images pixelated, even on screen. So I went for the “printer” setting, which according to the Ghostscript manual means downsampling images to 300 DPI, so the files should also print well. The script I used was this:

for i in *.pdf; do gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.6 -dPDFSETTINGS=/printer -dNOPAUSE -dQUIET -dBATCH -sOutputFile="${i%.pdf}_printer.pdf" "$i"; done

The result was that the folder shrank from 3.8 GB to 1.0 GB, a quite lovely saving. The image quality also appears to be more or less preserved, although this is only based on visual inspection of some of the files.
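
To verify the saving, one option is to move the compressed copies into their own folder and compare sizes with du (the folder names are just placeholders):

mkdir -p originals compressed
mv *_printer.pdf compressed/   # the compressed copies made above
mv *.pdf originals/            # what remains are the originals
du -sh originals compressed    # compare the total folder sizes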

Re-running the file integrity check on these new files showed that all 1728 of them now passed!

PDF/A

I have been working with PDF files for years but had not really read up on the details of the different versions. What turns out to be important for long-term preservation is that the files comply with the PDF/A standard. The regular PDF format has gone through several versions (1.4, 1.5, 1.6), which were originally proprietary Adobe specifications. PDF/A, on the other hand, is an ISO standard and appears to be what people use for archiving.

Unfortunately, it turns out that creating PDF/A files using Ghostscript is not entirely straightforward. So more exploration needs to be done there.
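
A possible starting point is the recipe in the Ghostscript documentation, which relies on the PDFA_def.ps file that ships with Ghostscript and needs to be edited to point to a local ICC colour profile. It goes something along these lines (the file names are placeholders, and I have not yet verified that the output actually validates):

gs -dPDFA=2 -dBATCH -dNOPAUSE -dNOOUTERSAVE -sColorConversionStrategy=UseDeviceIndependentColor -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile=output_pdfa.pdf PDFA_def.ps input.pdf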

Metadata

Finally, one of the remaining challenges with the proceedings archive is getting it properly indexed by various search engines, and for that, having PDF metadata is important. Again, I wish we had the capacity to do this properly for all 1728 files, but that is currently out of scope.

However, adding some general metadata is better than nothing, so I turned to ExifTool, which can be used to set the metadata of PDF files:

for i in *.pdf; do exiftool -Title="Proceedings of the International Conference on New Interfaces for Musical Expression" -Author="International Conference on New Interfaces for Musical Expression" -Subject="A Peer Reviewed article presented at the International Conference on New Interfaces for Musical Expression" "$i"; done
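
To check that the tags actually ended up in a file, the same tool can read them back:

exiftool -Title -Author -Subject nime2001_paper001.pdf

Note that ExifTool keeps a backup of each modified file with an _original suffix; these can be removed with exiftool -delete_original *.pdf once everything looks right.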

Conclusion

I still need to figure out the PDF/A issue (help wanted!), but the above recipe has helped in improving the quality of the PDF files considerably. It will save us bandwidth, improve accessibility, and, hopefully, also lead to better indexing of the files.

Creating individual image files from presentation slides

How do you create full-screen images from each of the slides of a Google Docs presentation without too much manual work? For the previous blog post on my Munin keynote, I wanted to include some pictures from my 90-slide presentation. There is probably a point-and-click solution to this problem, but it is even more fun to use some command-line tools to help out. These commands have been tested on Ubuntu 19.10, but should work on many other systems as well, as long as you have pdfseparate (from poppler-utils) and convert (from ImageMagick) installed.

After exporting a PDF from the Google Presentation, I made a separate PDF file of each slide using this command:

pdfseparate input.pdf output%d.pdf

This creates a bunch of PDF files with a running number. Then I ran this little for loop:

for i in *.pdf; do convert -density 200 "$i" "${i%.pdf}.png"; done

And voilà, I had nice PNG files of all my slides. I found that the trick is to use the “-density 200” setting (choose the density that suits your needs), since the default resolution and quality are too low.

Split PDF files easily using Ubuntu scripts

One of the fun parts of reinstalling an OS (yes, I think it is fun!), is to discover new software and new ways of doing things. As such, it works as a “digital shower”, getting rid of unnecessary stuff that has piled up.

Trying to also get rid of some physical mess, I am scanning piles of paper documents. This leaves me with some large multi-page PDFs that I would like to split up easily. In the spirit of software carpentry, I looked for a simple solution for splitting up a PDF file and came across the “burst” command in the little terminal application pdftk. To use it on Ubuntu, you first need to install it with the terminal command:

sudo apt update && sudo apt install pdftk

Then this one-liner is all that is necessary to split a PDF file into a series of individual PDFs:

pdftk your-file.pdf burst

For convenience, I also made it into a small Ubuntu script.
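
The script is little more than a wrapper around the burst command. A minimal sketch, assuming it is saved (and made executable) in the Nautilus scripts folder described further down, could look like this:

#!/bin/sh
# Split each selected PDF into one file per page using pdftk's burst command
for f in "$@"; do
  pdftk "$f" burst output "${f%.pdf}_page_%02d.pdf"
done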

The script can be run by right-clicking on a file in the file manager, and the end result is a series of individual PDF files.

And then you can of course also combine the files again, either all PDFs:

pdftk *.pdf cat output newfile.pdf

or only the files you like:

pdftk file1.pdf file2.pdf cat output newfile.pdf
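
The cat operation can also pick out individual pages or page ranges from a file (the file names here are just placeholders):

pdftk A=scanned.pdf cat A1-3 A7 output selection.pdf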

Shell script for compressing PDF files on Ubuntu

Back on OSX, one of my favourite small programs was PDFCompress, which compressed a large PDF file into something more manageable. There are many ways of doing this on Ubuntu as well, but nothing quite as smooth as what I was used to on OSX.

Finally, I took the time to figure out how to make a small shell script based on Ghostscript. The whole script looks like this:

#!/bin/sh
# Compress each selected PDF with Ghostscript, writing compress_<filename> next to the original
for f in "$@"; do gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -dDetectDuplicateImages -dCompressFonts=true -r150 -sOutputFile="compress_$f" "$f"; done

and by saving it in the Nautilus scripts directory:

 ~/.local/share/nautilus/scripts
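
The script also needs to be executable for Nautilus to list it; assuming it is saved under a name like compress-pdf (just a placeholder), that is a one-liner:

chmod +x ~/.local/share/nautilus/scripts/compress-pdf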

It then shows up when I right-click on a file. For most of the files I have tried so far today (uncompressed PDF files), it compresses them to a tenth of the original size or less. Very useful, particularly when I only need screen resolution.

Reducing PDF file size

I am working on finalizing an electronic version of a large PDF file (a 600-page NIME proceedings) and have had some problems optimizing it. This may not be so strange, since the file is an assembly of 130 individual PDF files, all made by different people using all sorts of programs and operating systems.

Usually, PDFCompress works wonders when it comes to reducing PDF file sizes, but for the proceedings file it choked on some of the fonts. Strangely enough, Acrobat Pro also ran into problems, with no useful explanation of what went wrong.

Fortunately, OSX came to the rescue. When saving a PDF file in OSX, it is possible to apply a Quartz filter, and OSX had no problems saving a new, smaller version of the proceedings file. However, the built-in “Reduce File Size” filter reduced the image quality too much. But I found an explanation of how to create your own Quartz filters, where it is possible to choose the compression settings.

I find it strange that Acrobat Pro couldn’t do the job, but I am very happy to have found a solution.