# Highext: A Tool to Create Structured Summaries from PDFs

Summary

Highext is a personal script I created to help me write college summaries from PDFs. It works by extracting highlighted text based on specific colors and formatting it into a structured document. This lets you quickly generate summaries with markup and even a table of contents, just by highlighting your PDF.

Tech Stack

The script is built entirely in Python. The core PDF processing is handled by the PyMuPDF (fitz) library.

I structured the project using principles from Clean Architecture (or Hexagonal Architecture) to keep the code maintainable and testable. This separates the application into distinct layers:

  • Domain: Contains the core business logic and entities (like a Highlight or Screenshot).
  • Application: Orchestrates the workflow, defining interfaces for external tools (like IPdfParser or IConfigProvider).
  • Infrastructure: Provides the concrete implementations for those interfaces (e.g., FitzPdfParser which uses PyMuPDF, or JsonConfigProvider which reads the config.json).
  • Presentation: The CLI entry point that wires everything together.
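
To make the layering concrete, here is a minimal sketch. Highlight and IPdfParser are names from the project, but the fields, the extract_highlights method, InMemoryPdfParser, and summarize are hypothetical and exist only for illustration; this is not the script's actual code.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Highlight:
    """Domain entity (fields are assumed, not taken from the project)."""
    page: int
    color: str  # hex string such as "#ffaaff"
    text: str


class IPdfParser(Protocol):
    """Application-layer interface; the method name is hypothetical."""

    def extract_highlights(self, path: str) -> list[Highlight]: ...


class InMemoryPdfParser:
    """Stand-in implementation for tests; a real one would wrap PyMuPDF."""

    def __init__(self, highlights: list[Highlight]) -> None:
        self._highlights = highlights

    def extract_highlights(self, path: str) -> list[Highlight]:
        return list(self._highlights)


def summarize(parser: IPdfParser, path: str) -> list[str]:
    """Use case: depends only on the interface, never on PyMuPDF itself."""
    return [h.text for h in parser.extract_highlights(path)]


if __name__ == "__main__":
    fake = InMemoryPdfParser([Highlight(1, "#ffaaff", "Processes")])
    print(summarize(fake, "notes.pdf"))
```

Because summarize only knows about the interface, a pytest test can pass in the in-memory parser and never touch a real PDF.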

Testing is handled using pytest.

Technical Challenges

While the idea sounds simple, I ran into a few interesting challenges:

  • Mapping Text Correctly: A highlight annotation doesn’t actually “contain” the text. It’s just a set of coordinates for rectangles on the page. The biggest challenge was to reliably find all the Word objects (which also have coordinates) that intersected with these highlight rectangles. This had to account for multi-line highlights and highlights that only covered part of a word.

  • Ensuring Correct Order: Annotations aren’t always stored in the PDF in the order they appear on the page. I had to implement a custom sorting key based on an annotation’s vertical (top-to-bottom) and then horizontal (left-to-right) coordinates. This ensures that a highlight at the top of the page is always processed before one at the bottom, making the final summary coherent.

  • Reliable Geometric Intersection Logic: A simple bounding box intersection check was too noisy. It often grabbed adjacent words that barely touched the highlight’s rectangle, which left stray or incorrect text in the output. I implemented a more robust check (HighlightLine) that only includes a word if the intersection covers a significant percentage of the word’s own width. This ensures only the text truly inside the highlight is extracted.

  • Handling Screenshots: I expanded the tool to also recognize “Square” annotations. The script treats these as screenshot requests. It uses PyMuPDF to render that specific region of the PDF as a high-resolution PNG, crops it, and generates a unique, sequential filename for it (like 1a.png, 1b.png) so you can have multiple screenshots on one page.
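
The ordering and screenshot naming ideas from the list above can be sketched roughly like this. It assumes rectangles are plain (x0, y0, x1, y1) tuples with y growing downward, which is how PyMuPDF reports coordinates. The function names reading_order_key and screenshot_names are hypothetical, and the real script may well handle edge cases (such as highlights on the same line with slightly different heights) differently.

```python
from __future__ import annotations

from collections import defaultdict
from string import ascii_lowercase
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def reading_order_key(rect: Rect) -> tuple[float, float]:
    """Sort by the top edge first, then by the left edge."""
    x0, y0, _x1, _y1 = rect
    return (y0, x0)


def screenshot_names(pages: list[int]) -> list[str]:
    """Name screenshots 1a.png, 1b.png, 2a.png, ... in the order given."""
    seen: dict[int, int] = defaultdict(int)
    names = []
    for page in pages:
        # Sketch limitation: only 26 screenshots per page are supported.
        suffix = ascii_lowercase[seen[page]]
        names.append(f"{page}{suffix}.png")
        seen[page] += 1
    return names


if __name__ == "__main__":
    rects = [(50, 300, 200, 312), (120, 100, 180, 112), (40, 100, 100, 112)]
    print(sorted(rects, key=reading_order_key))
    print(screenshot_names([1, 1, 2, 1]))
```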

How it works

Text extraction

Highlights in PDFs are commonly implemented as annotations, which are basically a series of rectangles (with a specific color and transparency) laid over the text. A single highlight can be one rectangle or, if it spans multiple lines, a series of rectangles.

For this to work, we also need the bounding box (the surrounding rectangle) for every word on the page.

As mentioned, the core logic is finding the intersection between the words’ bounding boxes and the highlights’ rectangles.

So, for any given page, we have a list of highlight annotations and a list of all words (with their coordinates).

We filter the annotations list for only highlights. Then, we iterate over every word on the page and check if its bounding box intersects with any of those highlight rectangles.
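
Here is a minimal sketch of that loop, including the width-coverage rule mentioned under Technical Challenges. It works on plain tuples so it runs without a PDF. With PyMuPDF, the word tuples would come from page.get_text("words"), whose entries start with x0, y0, x1, y1 and the word text. The function names and the 0.5 threshold are assumptions, not the values the script actually uses.

```python
from __future__ import annotations

from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Word = Tuple[float, float, float, float, str]  # bounding box plus text


def horizontal_coverage(word: Rect, highlight: Rect) -> float:
    """Fraction of the word's width covered by the highlight (0.0 to 1.0)."""
    wx0, wy0, wx1, wy1 = word
    hx0, hy0, hx1, hy1 = highlight
    # No vertical overlap at all: the word sits on a different line.
    if min(wy1, hy1) <= max(wy0, hy0):
        return 0.0
    overlap = min(wx1, hx1) - max(wx0, hx0)
    width = wx1 - wx0
    if width <= 0 or overlap <= 0:
        return 0.0
    return overlap / width


def words_in_highlight(
    words: Iterable[Word],
    highlight_rects: list[Rect],
    min_coverage: float = 0.5,  # assumed threshold, tune as needed
) -> str:
    """Join the words that any highlight rectangle covers sufficiently."""
    selected = []
    for x0, y0, x1, y1, text in words:
        box = (x0, y0, x1, y1)
        if any(horizontal_coverage(box, r) >= min_coverage for r in highlight_rects):
            selected.append(text)
    return " ".join(selected)


if __name__ == "__main__":
    page_words = [
        (10, 100, 40, 112, "The"),
        (45, 100, 95, 112, "kernel"),
        (100, 100, 150, 112, "manages"),
        (10, 120, 60, 132, "memory"),
    ]
    print(words_in_highlight(page_words, [(44, 99, 102, 113)]))
```

In the sample data the highlight clips only the first two points of "manages", so the word is dropped, while a plain intersection test would have kept it.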

Extracted text format

For each highlight color, a specific format can be defined. For example, this is the default configuration I use:

"string_template_by_color": {
  "#ffaaff": { "format": "\n# {}", "behaviour": "join", "type": "header" },
  "#55ffff": { "format": "\n## {}", "behaviour": "join", "type": "header" },
  "#00ff00": { "format": "\n### {}", "behaviour": "join", "type": "header" },
  "#55aaff": { "format": "\n#### {}", "behaviour": "join", "type": "header" },
  "#aaaaff": { "format": "\n##### {}", "behaviour": "join", "type": "header" },
  "#ffaa00": { "format": "- {}", "behaviour": "split", "type": "content" },
  "default": { "format": "- {}", "behaviour": "join", "type": "content" }
},

This config lets me map different colors to Markdown headings (which render as <h1>, <h2>, and so on) or bullet points. The behaviour key controls whether multiple highlights of the same color are joined into one item (join) or kept as separate items (split).
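
As a rough illustration of how such a config could be applied, the sketch below renders a list of (color, text) pairs. It assumes that "join" merges consecutive highlights of the same color with a space, which is my reading of the behaviour key rather than a confirmed detail. The render function and the trimmed-down TEMPLATES dict are hypothetical.

```python
from __future__ import annotations

from itertools import groupby

# A subset of the config above, written as a Python dict.
TEMPLATES = {
    "#55ffff": {"format": "\n## {}", "behaviour": "join", "type": "header"},
    "#ffaa00": {"format": "- {}", "behaviour": "split", "type": "content"},
    "default": {"format": "- {}", "behaviour": "join", "type": "content"},
}


def render(highlights: list[tuple[str, str]], templates: dict = TEMPLATES) -> str:
    """Format (color, text) pairs, grouping consecutive same-color items."""
    lines = []
    for color, group in groupby(highlights, key=lambda h: h[0]):
        # Colors without their own entry fall back to the "default" rule.
        rule = templates.get(color, templates["default"])
        texts = [text for _color, text in group]
        if rule["behaviour"] == "join":
            texts = [" ".join(texts)]  # assumed: join with a single space
        lines.extend(rule["format"].format(t) for t in texts)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        ("#55ffff", "Process"),
        ("#55ffff", "scheduling"),
        ("#ffaa00", "FIFO"),
        ("#ffaa00", "Round robin"),
    ]
    print(render(sample))
```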

Examples

Summary Generation

In this example, I’ll demonstrate how I used the script to generate a summary from some college bibliography for an Operating Systems course.

Highlighted text

In this case, we are using the following format:

| Color  | Markdown Element |
| ------ | ---------------- |
| Cyan   | Level 2 Header   |
| Green  | Level 3 Header   |
| Blue   | Level 4 Header   |
| Yellow | Normal text      |
| Orange | Bold text        |

Output

This is the Markdown output, rendered using Pandoc.

Naturally, this also has a table of contents:

Mind map generation

In this example, I’ll show the rendered mind map based on the text extracted from an AI chatbot’s presentation.

Input presentation

Generated mind map

Synergies with other scripts

The script’s Markdown output creates great synergies with custom-made scripts as well as existing programs.

It’s a great plain-text format for headings and lists, and tons of other programs already support it.

Use cases

Convert to Anki cards

If you haven’t heard about it, Anki is an open-source flashcard software that leverages spaced repetition and active recall.

It’s a great way to learn and memorize, and I’ve used it extensively during my college years to memorize math theorems and theoretical knowledge.

Moving on, how does my project synergize with Anki?

Well, I’ve created another script that generates Anki flashcards from Markdown documents. Here’s the link to the repo.

Be warned, in its current state it’s very opinionated and adapted to my past use case, which means that it might not work for you.

There are future plans to improve it and to integrate it better with this script.

Generate mind maps

If you think of each heading and bullet list as a node in a graph, you can easily create mind maps from the script’s output.

In fact, Xmind already does this, letting you import Markdown directly into a fully-fledged mind map.

I’ve used this capability extensively as well, when working with classes that require a great deal of theoretical knowledge in the shape of text, rather than math.

It’s a great way to visualize a whole topic from a bird’s eye view.

It also has the added benefit of skipping the dreadful task of reviewing hundreds of Anki cards in a row, providing more or less the same benefits if used correctly.
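
The idea of treating headings and bullets as nodes can be sketched in a few lines. This is not how Xmind imports Markdown, and the Node class and build_tree function are hypothetical; the sketch only handles "#" headings and "- " bullets, which is the subset the script's default templates produce.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BULLET_LEVEL = 7  # deeper than any Markdown heading (levels 1 to 6)


@dataclass
class Node:
    title: str
    level: int
    children: list["Node"] = field(default_factory=list)


def build_tree(markdown: str, root_title: str = "Summary") -> Node:
    """Nest headings by level and hang bullets under the open heading."""
    root = Node(root_title, 0)
    stack = [root]
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            hashes = len(line) - len(line.lstrip("#"))
            node = Node(line[hashes:].strip(), hashes)
        elif line.startswith("- "):
            node = Node(line[2:].strip(), BULLET_LEVEL)
        else:
            continue  # blank lines and plain paragraphs are skipped
        # Close every open node that is at the same depth or deeper.
        while stack[-1].level >= node.level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


if __name__ == "__main__":
    doc = "# OS\n## Processes\n- PCB\n- States\n## Memory\n- Paging"
    tree = build_tree(doc)
    for section in tree.children[0].children:
        print(section.title, [c.title for c in section.children])
```

From a tree like this, each parent-child pair is one edge of the mind map.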

History

One of my first programs was a basic script that allowed me to extract highlighted text from Word documents. It was rudimentary, but it saved me a lot of time back in high school.

This idea eventually evolved when I started using more and more PDFs in college.

I really missed the ability to just highlight text, extract it, and create a summary.

At first, I tried looking for libraries that already did what I was looking for, but at that time, I found none that suited my specific needs. Eventually, I came across a Stack Overflow answer about extracting highlighted text from PDFs, which offered a specific method using the MuPDF library.

The answer pointed to a simple concept: you just had to find the words that intersected the highlight’s rectangle annotation.

It was definitely more complicated than with Word documents, but it was nothing out of this world.

Thanks to this, I created my first prototype and started adding more and more features.

Future plans

  • Create a full-stack application to let users try this script online.
  • Improve the integration between this script and md2anki.
  • Build my own mind map generator instead of depending on Xmind.

How to try it?

Sadly, the script isn’t public yet.

But fret not, I’m working on polishing it so everyone can try it out.

Thanks for reading my blog post! Feel free to check out my other posts or contact me via the social links in the footer.

