From 5395a746003ae0e2d4dace3068c5bcfc9a857d49 Mon Sep 17 00:00:00 2001
From: Marcelo Paiva
Date: Wed, 25 Feb 2026 12:55:57 -0500
Subject: [PATCH] =?UTF-8?q?Document=20architecture=20decision:=20keep=20di?=
 =?UTF-8?q?rect=20text=E2=86=92HTML=20in=20pdf=5Ftext=5Fto=5Fhtml=20parser?=
 =?UTF-8?q?s?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Analyzed whether emitting Markdown as an intermediate format would improve
lib/pdf_text_to_html.rb and app/javascript/template_builder/pdf_text_to_html.js.
Decision: no change warranted. Key reasons: no full Markdown gem on Ruby side,
snarkdown is inline-only (no block elements), dir="auto" RTL support can't be
expressed in Markdown, and PDF text contains raw * _ # characters that would
corrupt Markdown rendering.

Co-Authored-By: Claude Sonnet 4.6
---
 .../accessibility-implementation-progress.md  | 25 +++++++++
 .../pdf-text-html-vs-markdown-analysis.md     | 56 +++++++++++++++++++
 2 files changed, 81 insertions(+)
 create mode 100644 .reports/pdf-text-html-vs-markdown-analysis.md

diff --git a/.plans/accessibility-implementation-progress.md b/.plans/accessibility-implementation-progress.md
index a4ce45e1..1125dd93 100644
--- a/.plans/accessibility-implementation-progress.md
+++ b/.plans/accessibility-implementation-progress.md
@@ -262,3 +262,28 @@ a3109c63 - Add ARIA labels to icon-only buttons across the application
 4. **Gate test**: Use scanned PDF → verify no tab switcher shown
 5. **VoiceOver test**: Announce tabs and panel content
 6. **Next feature**: ARIA live regions for form validation errors (Phase 2 roadmap)
+
+---
+
+## Session Summary - 2026-02-25 (Architecture decision: Markdown intermediate)
+
+### Decision: Keep direct Text → HTML approach in pdf_text_to_html parsers
+
+**Analysis**: Evaluated whether `lib/pdf_text_to_html.rb` and `app/javascript/template_builder/pdf_text_to_html.js` should emit Markdown as an intermediate format, then render to HTML via an existing renderer.
+
+**Conclusion: No change warranted.** Reasons:
+- No full Markdown renderer on the Ruby side without adding a new gem (e.g. `kramdown`)
+- `snarkdown` (the only JS Markdown lib in the bundle) is inline-only — no block-level heading/list support
+- `<p dir="auto">` for RTL support cannot be expressed in standard Markdown
+- PDF text contains `*`, `_`, `[ref]`, `#3` naturally — a Markdown renderer would corrupt them
+- Heuristic detection logic is identical regardless of output format; no complexity reduction
+
+**Report**: `.reports/pdf-text-html-vs-markdown-analysis.md`
+**Code changes**: None
+**Commit**: n/a (documentation-only session)
+
+### Next Session Recommendations
+
+1. **Manual verification** of tab switcher (items 1–5 above)
+2. **Phase 2**: ARIA live regions for form validation errors
+3. **Future parser improvement**: Font-size–aware heading detection using Pdfium `text_nodes` bounding boxes (better than ALL_CAPS heuristic, works for non-Latin scripts)
diff --git a/.reports/pdf-text-html-vs-markdown-analysis.md b/.reports/pdf-text-html-vs-markdown-analysis.md
new file mode 100644
index 00000000..c48fde31
--- /dev/null
+++ b/.reports/pdf-text-html-vs-markdown-analysis.md
@@ -0,0 +1,56 @@
+# Analysis: Text → Markdown → HTML vs. Text → HTML Direct
+
+**Date:** 2026-02-25
+**Branch:** extract-content-from-pdf
+**Decision:** Keep current direct text → HTML approach — no code changes warranted
+
+---
+
+## Context
+
+We have two heuristic parsers that convert Pdfium-extracted page text to HTML for the Text View:
+- `lib/pdf_text_to_html.rb` (Ruby — used in ERB views)
+- `app/javascript/template_builder/pdf_text_to_html.js` (JS — used in document.vue)
+
+The question was whether it would be better to emit Markdown from the heuristics and render it to HTML using an existing renderer, rather than emitting HTML directly.
+
+---
+
+## Recommendation: Keep the current direct text → HTML approach
+
+### Why the Markdown intermediate doesn't help here
+
+**1. No full Markdown renderer in Ruby**
+`MarkdownToHtml.rb` is not a Markdown parser — it's a single-line link regex converter. Using Markdown on the Ruby side would require adding a new gem (`kramdown`, `redcarpet`, `commonmarker`). That's a meaningful dependency for no functional gain.
+
+**2. `snarkdown` is inline-only**
+The only Markdown library in the JS bundle (`snarkdown` v2.0.0) handles inline syntax (bold, italic, code, links) but has no block-level support — no headings rendered from `##`, no unordered list rendering from `- item`. It cannot replace the list/heading logic in the current heuristic.
+
+**3. `dir="auto"` can't be expressed in Markdown**
+The current parsers emit `<p dir="auto">` on every body paragraph for RTL language support. Standard Markdown has no mechanism for this HTML attribute. A Markdown renderer would produce `<p>` without it, breaking Arabic/Hebrew/Persian documents.
+
+**4. PDF text contains Markdown-significant characters**
+Legal and business PDFs routinely contain `*`, `_`, `[ref]`, `#3`, `&` in their natural text. Running these through a Markdown renderer would corrupt the output (e.g., `Clause *3* applies` → `Clause <em>3</em> applies`). Escaping all Markdown metacharacters before conversion would make the heuristic code more complex, not simpler.
+
+**5. No reduction in complexity**
+The heuristic logic (detect ALL_CAPS headings, numbered headings, bullet lines) is the same regardless of output format. Emitting `## HEADING` instead of `<h2>HEADING</h2>` saves a few characters but changes nothing meaningful. Two parallel implementations (Ruby + JS) remain necessary either way.
+
+---
+
+## What would actually improve the parsers
+
+Instead of changing the output format, future improvements should focus on detection quality:
+
+1. **Font-size–aware headings** — Pdfium exposes `text_nodes` with bounding-box metadata. Larger font → heading, regardless of ALL_CAPS or numbering. This is a future enhancement.
+
+2. **Numbered list items vs. section headings** — Currently `1. Item` always becomes a heading, even if it's a true numbered list item. This could be disambiguated by line length or context. Low priority.
+
+3. **Multi-language heading detection** — ALL_CAPS doesn't work in languages without case (Arabic, CJK). Font-size detection would fix this too.
+
+---
+
+## Decision
+
+No code changes. The current implementations in `lib/pdf_text_to_html.rb` and `app/javascript/template_builder/pdf_text_to_html.js` are correct and well-suited for this use case.
+
+If a future requirement emerges to store Markdown rather than raw text in the metadata (e.g. for integration with external tools), the conversion should happen at extraction time in `lib/templates/process_document.rb`, and a full Markdown gem would need to be added. That is out of scope.