# Highext: A Tool to Create Structured Summaries from PDFs
Table of Contents
Summary
Highext is a personal script I created to help me write college summaries from PDFs. It works by extracting highlighted text based on specific colors and formatting it into a structured document. This lets you quickly generate summaries with markup and even a table of contents, just by highlighting your PDF.
Tech Stack
The script is built entirely in Python. The core PDF processing is handled by the PyMuPDF (fitz) library.
I structured the project using principles from Clean Architecture (or Hexagonal Architecture) to keep the code maintainable and testable. This separates the application into distinct layers:
- Domain: Contains the core business logic and entities (like a
HighlightorScreenshot). - Application: Orchestrates the workflow, defining interfaces for external tools (like
IPdfParserorIConfigProvider). - Infrastructure: Provides the concrete implementations for those interfaces (e.g.,
FitzPdfParserwhich uses PyMuPDF, orJsonConfigProviderwhich reads theconfig.json). - Presentation: The CLI entry point that wires everything together.
Testing is handled using pytest.
Technical Challenges
While the idea sounds simple, I ran into a few interesting challenges:
-
Mapping Text Correctly: A highlight annotation doesn’t actually “contain” the text. It’s just a set of coordinates for rectangles on the page. The biggest challenge was to reliably find all the
Wordobjects (which also have coordinates) that intersected with these highlight rectangles. This had to account for multi-line highlights and highlights that only covered part of a word. -
Ensuring Correct Order: Annotations aren’t always stored in the PDF in the order they appear on the page. I had to implement a custom sorting key based on an annotation’s vertical (top-to-bottom) and then horizontal (left-to-right) coordinates. This ensures that a highlight at the top of the page is always processed before one at the bottom, making the final summary coherent.
-
Reliable Geometric Intersection Logic: A simple bounding box intersection check was too noisy. It often grabbed parts of adjacent words that barely touched the highlight’s rectangle, leading to dirty or incorrect text. I had to implement a more robust logic (HighlightLine) that only includes a word if the intersection covers a significant percentage of the word’s own width. This ensures only the text truly inside the highlight is extracted.
-
Handling Screenshots: I expanded the tool to also recognize “Square” annotations. The script treats these as screenshot requests. It uses PyMuPDF to render that specific region of the PDF as a high-resolution PNG, crops it, and generates a unique, sequential filename for it (like
1a.png,1b.png) so you can have multiple screenshots on one page.
How it works
Text extraction
Highlights in PDFs are commonly implemented as annotations, which are basically a series of rectangles (with a specific color and transparency) laid over the text. A single highlight can be one rectangle or, if it spans multiple lines, a series of rectangles.
For this to work, we also need the bounding box (the surrounding rectangle) for every word on the page.
As mentioned, the core logic is finding the intersection between the words bounding boxes and the highlights rectangles.
So, for any given page, we have a list of highlight annotations and a list of all words (with their coordinates).
We filter the annotations list for only highlights. Then, we iterate over every word on the page and check if its bounding box intersects with any of those highlight rectangles.
Extracted text format
For each highlight color, a specific format can be defined. For example, this is the default configuration I use:
"string_template_by_color": { "#ffaaff": { "format": "\n# {}", "behaviour": "join", "type": "header" }, "#55ffff": { "format": "\n## {}", "behaviour": "join", "type": "header" }, "#00ff00": { "format": "\n### {}", "behaviour": "join", "type": "header" }, "#55aaff": { "format": "\n#### {}", "behaviour": "join", "type": "header" }, "#aaaaff": { "format": "\n##### {}", "behaviour": "join", "type": "header" }, "#ffaa00": { "format": "- {}", "behaviour": "split", "type": "content" }, "default": { "format": "- {}", "behaviour": "join", "type": "content" }},This config lets me map different colors to Markdown headers (like <h1>, <h2>) or bullet points.
The behaviour key controls if multiple highlights of the same color are joined into one item (join)
or kept as separate items (split).
Examples
Summary Generation
In this example, I’ll demonstrate how I used the script to generate a summary from some college bibliography for an Operating Systems course.
Highlighted text
In this case, we are using the following format:
| Color | Markdown Element |
|---|---|
| Cyan | Level 2 Header |
| Green | Level 3 Header |
| Blue | Level 4 Header |
| Yellow | Normal text |
| Orange | Bold text |
Output
This is the markdown output rendered using pandoc.
Naturally, this also has a table of contents:
Mind map generation
In this example, I’ll show the rendered mind map based on the text extracted from an AI chatbot’s presentation.
Input presentation
Generated mind map
Synergies with other scripts
The script’s Markdown output creates great synergies with custom-made scripts as well as existing programs.
It’s a great plain-text format for headings and lists, and tons of other programs already support it.
Use cases
Convert to Anki cards
If you haven’t heard about it, Anki is an open-source flashcard software that leverages spaced repetition and active recall.
It’s a great way to learn and memorize, and I’ve used it extensively during my college years to memorize math theorems and theoretical knowledge.
Moving on, how does my project synergize with Anki?
Well, I’ve created another script that generates Anki flashcards from Markdown documents. Here’s the link to the repo.
Be warned, in its current state it’s very opinionated and adapted to my past use case, which means that it might not work for you.
There are future plans to improve it and to integrate it better with this script.
Generate mind maps
If you think of each heading and bullet list as a node in a graph, you can easily create mind maps from the script’s output.
In fact, Xmind already does this, letting you import Markdown directly into a fully-fledged mind map.
I’ve used this capability extensively as well, when working with classes that require a great deal of theoretical knowledge in the shape of text, rather than math.
It’s a great way to visualize a whole topic from a bird’s eye view.
It also has the added benefit of skipping the dreadful task of reviewing hundreds of Anki cards in a row, providing more or less the same benefits if used correctly.
History
One of my first programs was a basic script that allowed me to extract highlighted text from Word documents. It was rudimentary, but it saved me a lot of time back in high school.
This idea eventually evolved when I started using more and more PDFs in college.
I really missed the ability to just highlight text, extract it, and create a summary.
At first, I tried looking for libraries that already did what I was looking for, but at that time, I found none that suited my specific needs. Eventually, I came across a Stack Overflow answer about extracting highlighted text from PDFs, which offered a specific method using the MuPDF library.
The answer pointed to a simple concept: you just had to find the words that intersected the highlight’s rectangle annotation.
It was definitely more complicated than with Word documents, but it was nothing out of this world.
Thanks to this, I created my first prototype and started adding more and more features.
Future plans
- Create a full-stack application to let users try this script online.
- Improve the integration between this script and md2anki.
- Build my own mind map generator instead of depending on Xmind.
How to try it?
Sadly, the script isn’t public yet.
But fret not, I’m working on polishing it so everyone can try it out.