From 5395a746003ae0e2d4dace3068c5bcfc9a857d49 Mon Sep 17 00:00:00 2001
From: Marcelo Paiva <mpaiva@gmail.com>
Date: Wed, 25 Feb 2026 12:55:57 -0500
Subject: [PATCH] =?UTF-8?q?Document=20architecture=20decision:=20keep=20di?=
 =?UTF-8?q?rect=20text=E2=86=92HTML=20in=20pdf=5Ftext=5Fto=5Fhtml=20parser?=
 =?UTF-8?q?s?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Analyzed whether emitting Markdown as an intermediate format would improve
lib/pdf_text_to_html.rb and app/javascript/template_builder/pdf_text_to_html.js.
Decision: no change warranted. Key reasons: no full Markdown gem on Ruby side,
snarkdown is inline-only (no block elements), dir="auto" RTL support can't be
expressed in Markdown, and PDF text contains raw * _ # characters that would
corrupt Markdown rendering.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 .../accessibility-implementation-progress.md  | 25 +++++++++
 .../pdf-text-html-vs-markdown-analysis.md     | 56 +++++++++++++++++++
 2 files changed, 81 insertions(+)
 create mode 100644 .reports/pdf-text-html-vs-markdown-analysis.md
diff --git a/.plans/accessibility-implementation-progress.md b/.plans/accessibility-implementation-progress.md
index a4ce45e1..1125dd93 100644
--- a/.plans/accessibility-implementation-progress.md
+++ b/.plans/accessibility-implementation-progress.md
@@ -262,3 +262,28 @@ a3109c63 - Add ARIA labels to icon-only buttons across the application
 4. **Gate test**: Use scanned PDF → verify no tab switcher shown
 5. **VoiceOver test**: Announce tabs and panel content
 6. **Next feature**: ARIA live regions for form validation errors (Phase 2 roadmap)
+
+---
+
+## Session Summary - 2026-02-25 (Architecture decision: Markdown intermediate)
+
+### Decision: Keep direct Text → HTML approach in pdf_text_to_html parsers
+
+**Analysis**: Evaluated whether `lib/pdf_text_to_html.rb` and `app/javascript/template_builder/pdf_text_to_html.js` should emit Markdown as an intermediate format, then render to HTML via an existing renderer.
+
+**Conclusion: No change warranted.** Reasons:
+- No full Markdown renderer on the Ruby side without adding a new gem (e.g. `kramdown`)
+- `snarkdown` (the only JS Markdown lib in the bundle) is inline-only — no block-level heading/list support
+- `<p dir="auto">` for RTL support cannot be expressed in standard Markdown
+- PDF text contains `*`, `_`, `[ref]`, `#3` naturally — a Markdown renderer would corrupt them
+- Heuristic detection logic is identical regardless of output format; no complexity reduction
+
+**Report**: `.reports/pdf-text-html-vs-markdown-analysis.md`
+**Code changes**: None
+**Commit**: n/a (documentation-only session)
+
+### Next Session Recommendations
+
+1. **Manual verification** of tab switcher (items 1–5 above)
+2. **Phase 2**: ARIA live regions for form validation errors
+3. **Future parser improvement**: Font-size–aware heading detection using Pdfium `text_nodes` bounding boxes (better than ALL_CAPS heuristic, works for non-Latin scripts)
diff --git a/.reports/pdf-text-html-vs-markdown-analysis.md b/.reports/pdf-text-html-vs-markdown-analysis.md
new file mode 100644
index 00000000..c48fde31
--- /dev/null
+++ b/.reports/pdf-text-html-vs-markdown-analysis.md
@@ -0,0 +1,56 @@
+# Analysis: Text → Markdown → HTML vs. Text → HTML Direct
+
+**Date:** 2026-02-25
+**Branch:** extract-content-from-pdf
+**Decision:** Keep current direct text → HTML approach — no code changes warranted
+
+---
+
+## Context
+
+We have two heuristic parsers that convert Pdfium-extracted page text to HTML for the Text View:
+- `lib/pdf_text_to_html.rb` (Ruby — used in ERB views)
+- `app/javascript/template_builder/pdf_text_to_html.js` (JS — used in document.vue)
+
+The question was whether it would be better to emit Markdown from the heuristics and render it to HTML using an existing renderer, rather than emitting HTML directly.
+
+---
+
+## Recommendation: Keep the current direct text → HTML approach
+
+### Why the Markdown intermediate doesn't help here
+
+**1. No full Markdown renderer in Ruby**
+`MarkdownToHtml.rb` is not a Markdown parser — it's a single-line link regex converter. Using Markdown on the Ruby side would require adding a new gem (`kramdown`, `redcarpet`, `commonmarker`). That's a meaningful dependency for no functional gain.
+
+**2. `snarkdown` is inline-only**
+The only Markdown library in the JS bundle (`snarkdown` v2.0.0) handles inline syntax (bold, italic, code, links) but has no block-level support — no headings rendered from `##`, no unordered list rendering from `- item`. It cannot replace the list/heading logic in the current heuristic.
+
+**3. `dir="auto"` can't be expressed in Markdown**
+The current parsers emit `<p dir="auto">` on every body paragraph for RTL language support. Standard Markdown syntax has no way to set this attribute on a paragraph; a Markdown renderer would produce a bare `<p>`, breaking text direction in Arabic/Hebrew/Persian documents.
+
+**4. PDF text contains Markdown-significant characters**
+Legal and business PDFs routinely contain `*`, `_`, `[ref]`, `#3`, `&` in their natural text. Running these through a Markdown renderer would corrupt the output (e.g., `Clause *3* applies` → `Clause <em>3</em> applies`). Escaping all Markdown metacharacters before conversion would make the heuristic code more complex, not simpler.
+
+**5. No reduction in complexity**
+The heuristic logic (detect ALL_CAPS headings, numbered headings, bullet lines) is the same regardless of output format. Emitting `## HEADING` instead of `<h2>HEADING</h2>` saves a few characters but changes nothing meaningful. Two parallel implementations (Ruby + JS) remain necessary either way.
+
+---
+
+## What would actually improve the parsers
+
+Instead of changing the output format, future improvements should focus on detection quality:
+
+1. **Font-size–aware headings**: Pdfium exposes `text_nodes` with bounding-box metadata. A line set in larger type than the body text can be treated as a heading, regardless of ALL_CAPS or numbering.
+
+2. **Numbered list items vs. section headings**: Currently `1. Item` always becomes `<h3>`, even when it is a genuine numbered list item. Line length or surrounding context could disambiguate the two cases. Low priority.
+
+3. **Multi-language heading detection**: The ALL_CAPS heuristic does not work for scripts without letter case (Arabic, CJK). Font-size detection would cover these as well.
+
+---
+
+## Decision
+
+No code changes. The current implementations in `lib/pdf_text_to_html.rb` and `app/javascript/template_builder/pdf_text_to_html.js` are correct and well-suited for this use case.
+
+If a future requirement emerges to store Markdown rather than raw text in the metadata (e.g. for integration with external tools), the conversion should happen at extraction time in `lib/templates/process_document.rb`, and a full Markdown gem would need to be added. That work is out of scope for this decision.