Document architecture decision: keep direct text→HTML in pdf_text_to_html parsers

Analyzed whether emitting Markdown as an intermediate format would improve lib/pdf_text_to_html.rb and app/javascript/template_builder/pdf_text_to_html.js. Decision: no change warranted. Key reasons: no full Markdown gem on Ruby side, snarkdown is inline-only (no block elements), dir="auto" RTL support can't be expressed in Markdown, and PDF text contains raw * _ # characters that would corrupt Markdown rendering.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2 months ago · 5395a74600
parent 797fb32a37
commit 5395a74600
2 changed files with 81 additions and 0 deletions
--- a/.plans/accessibility-implementation-progress.md
+++ b/.plans/accessibility-implementation-progress.md
@@ -262,3 +262,28 @@ a3109c63 - Add ARIA labels to icon-only buttons across the application
 4. **Gate test**: Use scanned PDF → verify no tab switcher shown
 5. **VoiceOver test**: Announce tabs and panel content
 6. **Next feature**: ARIA live regions for form validation errors (Phase 2 roadmap)
+
+---
+
+## Session Summary - 2026-02-25 (Architecture decision: Markdown intermediate)
+
+### Decision: Keep direct Text → HTML approach in pdf_text_to_html parsers
+
+**Analysis**: Evaluated whether `lib/pdf_text_to_html.rb` and `app/javascript/template_builder/pdf_text_to_html.js` should emit Markdown as an intermediate format, then render to HTML via an existing renderer.
+
+**Conclusion: No change warranted.** Reasons:
+- No full Markdown renderer on the Ruby side without adding a new gem (e.g. `kramdown`)
+- `snarkdown` (the only JS Markdown lib in the bundle) is inline-only — no block-level heading/list support
+- `<p dir="auto">` for RTL support cannot be expressed in standard Markdown
+- PDF text contains `*`, `_`, `[ref]`, `#3` naturally — a Markdown renderer would corrupt them
+- Heuristic detection logic is identical regardless of output format; no complexity reduction
+
+- **Report**: `.reports/pdf-text-html-vs-markdown-analysis.md`
+- **Code changes**: None
+- **Commit**: n/a (documentation-only session)
+
+### Next Session Recommendations
+
+1. **Manual verification** of tab switcher (items 1–5 above)
+2. **Phase 2**: ARIA live regions for form validation errors
+3. **Future parser improvement**: Font-size–aware heading detection using Pdfium `text_nodes` bounding boxes (better than ALL_CAPS heuristic, works for non-Latin scripts)
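
The corruption risk listed in the session summary can be reproduced with a toy example. This is a minimal sketch, not code from the repository: `toy_markdown_emphasis` is a hypothetical stand-in for a real renderer's emphasis rule, and `direct_paragraph` only assumes that the direct approach escapes HTML-significant characters and wraps the text in `<p dir="auto">`.

```ruby
# Hypothetical stand-in for one rule of a Markdown renderer:
# turns *text* into <em>text</em>. Real renderers apply many more rules.
def toy_markdown_emphasis(text)
  text.gsub(/\*([^*\n]+)\*/, '<em>\1</em>')
end

# Assumed shape of the direct text -> HTML path: escape the
# HTML-significant characters and leave everything else untouched.
HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

def direct_paragraph(text)
  %(<p dir="auto">#{text.gsub(/[&<>"]/, HTML_ESCAPES)}</p>)
end

line = "Clause *3* applies"
puts toy_markdown_emphasis(line) # Clause <em>3</em> applies
puts direct_paragraph(line)      # <p dir="auto">Clause *3* applies</p>
```

The asterisks survive only on the direct path, and the `dir="auto"` attribute is present without any extra post-processing step.
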
--- a/.reports/pdf-text-html-vs-markdown-analysis.md
+++ b/.reports/pdf-text-html-vs-markdown-analysis.md
@@ -0,0 +1,56 @@
+# Analysis: Text → Markdown → HTML vs. Text → HTML Direct
+
+- **Date:** 2026-02-25
+- **Branch:** extract-content-from-pdf
+- **Decision:** Keep current direct text → HTML approach; no code changes warranted
+
+---
+
+## Context
+
+We have two heuristic parsers that convert Pdfium-extracted page text to HTML for the Text View:
+- `lib/pdf_text_to_html.rb` (Ruby — used in ERB views)
+- `app/javascript/template_builder/pdf_text_to_html.js` (JS — used in document.vue)
+
+The question was whether it would be better to emit Markdown from the heuristics and render it to HTML using an existing renderer, rather than emitting HTML directly.
+
+---
+
+## Recommendation: Keep the current direct text → HTML approach
+
+### Why the Markdown intermediate doesn't help here
+
+**1. No full Markdown renderer in Ruby**
+`MarkdownToHtml.rb` is not a Markdown parser; it is a single-line regex converter for links. Using Markdown on the Ruby side would require adding a new gem such as `kramdown`, `redcarpet`, or `commonmarker`. That is a meaningful dependency for no functional gain.
+
+**2. `snarkdown` is inline-only**
+The only Markdown library in the JS bundle (`snarkdown` v2.0.0) handles inline syntax (bold, italic, code, links) but has no block-level support — no headings rendered from `##`, no unordered list rendering from `- item`. It cannot replace the list/heading logic in the current heuristic.
+
+**3. `dir="auto"` can't be expressed in Markdown**
+The current parsers emit `<p dir="auto">` on every body paragraph for RTL language support. Standard Markdown has no mechanism for this HTML attribute. A Markdown renderer would produce `<p>` without it, breaking Arabic/Hebrew/Persian documents.
+
+**4. PDF text contains Markdown-significant characters**
+Legal and business PDFs routinely contain `*`, `_`, `[ref]`, and `#3` in their natural text, along with HTML-significant characters such as `&`. Running this text through a Markdown renderer would corrupt the output (e.g., `Clause *3* applies` → `Clause <em>3</em> applies`). Escaping every Markdown metacharacter before conversion would make the heuristic code more complex, not simpler.
+
+**5. No reduction in complexity**
+The heuristic logic (detect ALL_CAPS headings, numbered headings, bullet lines) is the same regardless of output format. Emitting `## HEADING` instead of `<h2>HEADING</h2>` saves a few characters but changes nothing meaningful. Two parallel implementations (Ruby + JS) remain necessary either way.
+
+---
+
+## What would actually improve the parsers
+
+Instead of changing the output format, future improvements should focus on detection quality:
+
+1. **Font-size–aware headings** — Pdfium exposes `text_nodes` with bounding-box metadata. Larger font → heading, regardless of ALL_CAPS or numbering. This is a future enhancement.
+
+2. **Numbered list items vs. section headings** — Currently `1. Item` always becomes `<h3>`, even if it's a true numbered list item. This could be disambiguated by line length or context. Low priority.
+
+3. **Multi-language heading detection** — ALL_CAPS doesn't work in languages without case (Arabic, CJK). Font-size detection would fix this too.
+
+---
+
+## Decision
+
+No code changes. The current implementations in `lib/pdf_text_to_html.rb` and `app/javascript/template_builder/pdf_text_to_html.js` are correct and well-suited for this use case.
+
+If a future requirement emerges to store Markdown rather than raw text in the metadata (e.g. for integration with external tools), the conversion should happen at extraction time in `lib/templates/process_document.rb`, and a full Markdown gem would need to be added. That is out of scope.
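
Point 5 of the report (no reduction in complexity) can be made concrete with a small sketch. The rule names and regexes below are assumptions for illustration; the actual rules in `lib/pdf_text_to_html.rb` are not shown in this commit and may differ. The tag choices (`<h2>` for ALL_CAPS lines, `<h3>` for `1. Item` lines, `<p dir="auto">` for body text) follow the report.

```ruby
# Hypothetical bullet pattern: a bullet character or hyphen followed by whitespace.
BULLET = /\A[\u2022\-]\s+/

# Detection step: identical no matter which output format follows.
def classify(line)
  text = line.strip
  return :blank if text.empty?
  return :bullet if text.match?(BULLET)
  return :numbered_heading if text.match?(/\A\d+\.\s+\S/)
  # ALL_CAPS: at least one uppercase letter and no lowercase letters.
  return :caps_heading if text.match?(/[[:upper:]]/) && !text.match?(/[[:lower:]]/)
  :paragraph
end

# Emitter 1: direct HTML (HTML escaping omitted to keep the sketch short).
def to_html(line)
  text = line.strip
  case classify(line)
  when :caps_heading     then "<h2>#{text}</h2>"
  when :numbered_heading then "<h3>#{text}</h3>"
  when :bullet           then "<li>#{text.sub(BULLET, '')}</li>"
  when :paragraph        then %(<p dir="auto">#{text}</p>)
  end
end

# Emitter 2: Markdown. Same branches, different strings, and no place for dir="auto".
def to_markdown(line)
  text = line.strip
  case classify(line)
  when :caps_heading     then "## #{text}"
  when :numbered_heading then "### #{text}"
  when :bullet           then "- #{text.sub(BULLET, '')}"
  when :paragraph        then text
  end
end

["TERMS AND CONDITIONS", "1. Scope", "- first item", "Clause 3 applies"].each do |line|
  puts "#{classify(line)} | #{to_html(line)} | #{to_markdown(line)}"
end
```

Both emitters depend on the same `classify` step, so switching the output format only swaps the string templates. In this sketch a line written in a script without letter case never matches the ALL_CAPS rule and falls through to `:paragraph`, which is the limitation the font-size proposal is meant to address.
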