Programming challenge: reformatting a log file

Part of a shell script that automates searches on PDFs

I don’t work in IT, I work in book publishing. Nevertheless, I’m finding more and more that automating tasks can have real benefits for my job, so I’m teaching myself to code. Sort of. My approach has been slightly haphazard, to be honest, dictated mainly by what I need to achieve in the real world rather than by what it might be most sensible to learn next. But I’m enjoying it and saving my future self a lot of time and bother.

Here’s what I’m working on now: I have written a shell script that automates a group of searches that I need to run from time to time on PDFs. Sometimes it’s one file, sometimes it’s a whole directory of files. Either way the shell script can handle it. The searches are run using pdfgrep and the results are output to a text file. It’s this log of results that I’m focussing on in the next step of the process.

At the moment, results in the log are grouped by search pattern, and within each pattern they are ordered by filename and then by page number. That’s OK, but it would be more convenient to reformat the log so that results are grouped by filename, then by page number, and only then by search pattern. That would allow me to take one pass through the file(s), correcting all errors in order, rather than having to make multiple passes to fix the errors found by each search pattern in turn.

First attempt at figuring out my approach

I suspect arrays might be useful here. Lines in the log begin with the filename, then a colon, then the page number and another colon. My idea is to capture one instance of each unique filename and page number combination and store them in an array. Having done this, I could loop through that array, and use that as a basis for outputting the lines from the log in the new desired order. To take it one step further I could perhaps use a multidimensional array, with each inner array containing all of the lines that start with a given filename and page number combination.
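The grouping idea above can be sketched in a few lines. This is a minimal sketch in Python rather than shell, chosen only for illustration and not as an answer to the language question. It assumes every result line has the shape filename:page:text described above and that filenames contain no colons. The function name group_by_file_and_page and the sample lines are made up for the example.

```python
from collections import defaultdict


def group_by_file_and_page(lines):
    """Collect result lines under a (filename, page) key.

    Assumes each line looks like 'filename:page:text'. This is a
    hypothetical helper, not part of pdfgrep or the existing script.
    """
    groups = defaultdict(list)
    for line in lines:
        # Split on the first two colons only, so colons inside the
        # matched text are left alone.
        parts = line.rstrip("\n").split(":", 2)
        if len(parts) < 3 or not parts[1].strip().isdigit():
            continue  # skip blank lines or anything not in the expected shape
        filename, page, text = parts
        groups[(filename, int(page))].append(text.strip())
    return groups


# Made-up sample lines, in the order a pattern-by-pattern log might list them.
sample = [
    "chapter2.pdf:3: the the cat",
    "chapter1.pdf:12: a  double space",
    "chapter1.pdf:2: teh typo",
    "chapter1.pdf:12: recieve",
]

groups = group_by_file_and_page(sample)

# Sorting the keys gives filename order, then numeric page order.
for (filename, page), texts in sorted(groups.items()):
    for text in texts:
        print(f"{filename}:{page}: {text}")
```

Here a dictionary of lists plays the role of the multidimensional array: each key is one filename and page combination, and each value holds all the lines that start with it. Converting the page to an integer means page 2 sorts before page 12, which a plain text sort would get wrong.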

Some questions I must ask myself:

  • Which language to use? Do I continue with shell script? How can I tell which is best suited to my needs?
  • How do I search for unique line beginnings? Do I capture them all and de-duplicate the array afterwards or test for uniqueness while capturing and storing the line beginnings?
  • I need an example of the relevant search pattern in each line to show what it is that has been picked up in the corresponding PDF. How do I do this?
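
The second and third questions can be explored with a small sketch, again in Python and again only as one possible approach. It tests for uniqueness while capturing, using a set alongside a list so that the first-seen order is kept. To show what was picked up, it re-runs each pattern against the text of the line and returns the matched fragment. The two patterns in PATTERNS are hypothetical stand-ins for the real searches, and Python's regular expression syntax is not guaranteed to behave identically to the patterns given to pdfgrep, so treat this as an assumption to check.

```python
import re

# Hypothetical search patterns standing in for the real ones:
# a repeated word, and a run of two or more spaces.
PATTERNS = [r"\b(\w+) \1\b", r" {2,}"]


def unique_keys(lines):
    """Return each (filename, page) line beginning once, in first-seen order."""
    seen = set()
    keys = []
    for line in lines:
        parts = line.split(":", 2)
        if len(parts) < 3:
            continue  # not in the assumed 'filename:page:text' shape
        key = (parts[0], parts[1])
        if key not in seen:  # test for uniqueness while capturing
            seen.add(key)
            keys.append(key)
    return keys


def matched_text(text, patterns):
    """Return the fragment matched by the first pattern that hits, else None."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(0)
    return None


# Made-up sample lines for illustration.
lines = [
    "chapter1.pdf:2: I saw the the cat",
    "chapter1.pdf:2: a  double space",
    "chapter2.pdf:7: nothing wrong here",
]

print(unique_keys(lines))
for line in lines:
    text = line.split(":", 2)[2]
    print(repr(matched_text(text, PATTERNS)))
```

Checking the set as each line is read avoids a separate de-duplication pass afterwards, and the list preserves the order in which the combinations first appear in the log.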

Why this post?

I thought it might be useful to keep a record of my thought process in approaching this problem. Also, if I decide I want to ask someone for advice, I can point them towards this post rather than having to type it all out again.

Everywhere else I write about beer, cider and spirits. This is where I put other stuff, mostly about coding.
