Document architecture decision: keep direct text→HTML in pdf_text_to_html parsers

Analyzed whether emitting Markdown as an intermediate format would improve
lib/pdf_text_to_html.rb and app/javascript/template_builder/pdf_text_to_html.js.
Decision: no change warranted. Key reasons: no full Markdown gem on Ruby side,
snarkdown is inline-only (no block elements), dir="auto" RTL support can't be
expressed in Markdown, and PDF text contains raw * _ # characters that would
corrupt Markdown rendering.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pull/599/head
Marcelo Paiva 3 weeks ago
parent 797fb32a37
commit 5395a74600

@ -262,3 +262,28 @@ a3109c63 - Add ARIA labels to icon-only buttons across the application
4. **Gate test**: Use scanned PDF → verify no tab switcher shown
5. **VoiceOver test**: Announce tabs and panel content
6. **Next feature**: ARIA live regions for form validation errors (Phase 2 roadmap)
---
## Session Summary - 2026-02-25 (Architecture decision: Markdown intermediate)
### Decision: Keep direct Text → HTML approach in pdf_text_to_html parsers
**Analysis**: Evaluated whether `lib/pdf_text_to_html.rb` and `app/javascript/template_builder/pdf_text_to_html.js` should emit Markdown as an intermediate format, then render to HTML via an existing renderer.
**Conclusion: No change warranted.** Reasons:
- No full Markdown renderer on the Ruby side without adding a new gem (e.g. `kramdown`)
- `snarkdown` (the only JS Markdown lib in the bundle) is inline-only — no block-level heading/list support
- `<p dir="auto">` for RTL support cannot be expressed in standard Markdown
- PDF text contains `*`, `_`, `[ref]`, `#3` naturally — a Markdown renderer would corrupt them
- Heuristic detection logic is identical regardless of output format; no complexity reduction
**Report**: `.reports/pdf-text-html-vs-markdown-analysis.md`
**Code changes**: None
**Commit**: n/a (documentation-only session)
### Next Session Recommendations
1. **Manual verification** of tab switcher (items 1-5 above)
2. **Phase 2**: ARIA live regions for form validation errors
3. **Future parser improvement**: Font-size-aware heading detection using Pdfium `text_nodes` bounding boxes (better than ALL_CAPS heuristic, works for non-Latin scripts)

@ -0,0 +1,56 @@
# Analysis: Text → Markdown → HTML vs. Text → HTML Direct
**Date:** 2026-02-25
**Branch:** extract-content-from-pdf
**Decision:** Keep current direct text → HTML approach — no code changes warranted
---
## Context
We have two heuristic parsers that convert Pdfium-extracted page text to HTML for the Text View:
- `lib/pdf_text_to_html.rb` (Ruby — used in ERB views)
- `app/javascript/template_builder/pdf_text_to_html.js` (JS — used in document.vue)
The question was whether it would be better to emit Markdown from the heuristics and render it to HTML using an existing renderer, rather than emitting HTML directly.
---
## Recommendation: Keep the current direct text → HTML approach
### Why the Markdown intermediate doesn't help here
**1. No full Markdown renderer in Ruby**
`MarkdownToHtml.rb` is not a Markdown parser — it's a single-line link regex converter. Using Markdown on the Ruby side would require adding a new gem (`kramdown`, `redcarpet`, `commonmarker`). That's a meaningful dependency for no functional gain.
**2. `snarkdown` is inline-only**
The only Markdown library in the JS bundle (`snarkdown` v2.0.0) handles inline syntax (bold, italic, code, links) but has no block-level support — no headings rendered from `##`, no unordered list rendering from `- item`. It cannot replace the list/heading logic in the current heuristic.
**3. `dir="auto"` can't be expressed in Markdown**
The current parsers emit `<p dir="auto">` on every body paragraph for RTL language support. Standard Markdown has no mechanism for this HTML attribute. A Markdown renderer would produce `<p>` without it, breaking Arabic/Hebrew/Persian documents.
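
As a minimal sketch of what the direct approach gives us, the following shows a paragraph emitter that sets `dir="auto"` itself. The helper name `paragraph_html` is hypothetical and is not taken from `lib/pdf_text_to_html.rb`; it only illustrates that the attribute is trivial to emit when we write the HTML ourselves.

```ruby
require 'erb'

# Hypothetical helper, not the actual code in lib/pdf_text_to_html.rb.
# Wraps one paragraph of extracted text in <p dir="auto"> so the browser
# chooses the text direction from the paragraph's own content.
def paragraph_html(text)
  %(<p dir="auto">#{ERB::Util.html_escape(text.strip)}</p>)
end

puts paragraph_html('Clause 3 & 4 apply')
# prints: <p dir="auto">Clause 3 &amp; 4 apply</p>
```

A Markdown renderer would own the `<p>` tag, so getting the attribute back would mean post-processing its output or switching to raw HTML blocks inside the Markdown.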
**4. PDF text contains Markdown-significant characters**
Legal and business PDFs routinely contain `*`, `_`, `[ref]`, `#3`, `&` in their natural text. Running these through a Markdown renderer would corrupt the output (e.g., `Clause *3* applies` → `Clause <em>3</em> applies`). Escaping all Markdown metacharacters before conversion would make the heuristic code more complex, not simpler.
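
To illustrate the extra step a Markdown intermediate would require, here is a rough sketch of a pre-escaping pass. The helper `escape_markdown` and the character set in `MARKDOWN_SPECIALS` are assumptions for illustration only; the set is not exhaustive, and position-dependent syntax (for example `1.` or `-` at the start of a line) would need additional handling.

```ruby
# Hypothetical helper: backslash-escapes characters that a Markdown
# renderer could interpret as syntax. The character set is an
# illustrative assumption, not a complete list for any Markdown dialect.
MARKDOWN_SPECIALS = /[\\`*_\[\]#<>&]/

def escape_markdown(text)
  text.gsub(MARKDOWN_SPECIALS) { |char| "\\#{char}" }
end

puts escape_markdown('Clause *3* applies, see [ref] and item #3')
# prints: Clause \*3\* applies, see \[ref\] and item \#3
```

This pass would have to run on every line in both the Ruby and JS parsers, and it would have to stay in sync with whichever renderer is chosen.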
**5. No reduction in complexity**
The heuristic logic (detect ALL_CAPS headings, numbered headings, bullet lines) is the same regardless of output format. Emitting `## HEADING` instead of `<h2>HEADING</h2>` saves a few characters but changes nothing meaningful. Two parallel implementations (Ruby + JS) remain necessary either way.
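
The point can be sketched with a deliberately simplified classifier and two emitters. None of these method names or rules come from the real parsers; `classify`, `emit_html`, and `emit_markdown` are hypothetical and cover only three line types.

```ruby
require 'erb'

BULLET_PREFIX = /\A[-*]\s+/

# Simplified, hypothetical detection rules. The real heuristics differ.
def classify(line)
  stripped = line.strip
  return :bullet if stripped.match?(BULLET_PREFIX)
  return :heading if stripped.match?(/[A-Z]/) && stripped == stripped.upcase
  :paragraph
end

# Direct HTML output, including the dir="auto" attribute on paragraphs.
def emit_html(line)
  stripped = line.strip
  case classify(line)
  when :heading then "<h2>#{ERB::Util.html_escape(stripped)}</h2>"
  when :bullet then "<li>#{ERB::Util.html_escape(stripped.sub(BULLET_PREFIX, ''))}</li>"
  else %(<p dir="auto">#{ERB::Util.html_escape(stripped)}</p>)
  end
end

# Markdown output: the same classification, only the wrapper strings change.
def emit_markdown(line)
  stripped = line.strip
  case classify(line)
  when :heading then "## #{stripped}"
  when :bullet then "- #{stripped.sub(BULLET_PREFIX, '')}"
  else stripped
  end
end

puts emit_html('TERMS AND CONDITIONS')     # prints: <h2>TERMS AND CONDITIONS</h2>
puts emit_markdown('TERMS AND CONDITIONS') # prints: ## TERMS AND CONDITIONS
```

All of the detection work lives in `classify`, which both emitters share unchanged. Swapping the emitter does not remove any of it.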
---
## What would actually improve the parsers
Instead of changing the output format, future improvements should focus on detection quality:
1. **Font-size-aware headings** — Pdfium exposes `text_nodes` with bounding-box metadata. Larger font → heading, regardless of ALL_CAPS or numbering. This is a future enhancement.
2. **Numbered list items vs. section headings** — Currently `1. Item` always becomes `<h3>`, even if it's a true numbered list item. This could be disambiguated by line length or context. Low priority.
3. **Multi-language heading detection** — ALL_CAPS doesn't work in languages without case (Arabic, CJK). Font-size detection would fix this too.
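
A rough sketch of how font-size-aware detection might look, under heavy assumptions: the `TextLine` struct, its `height` field, the median baseline, and the 1.3 ratio are all hypothetical. The real shape of Pdfium `text_nodes` has not been checked for this sketch, and nodes would first need to be grouped into lines.

```ruby
# Hypothetical line shape: the real Pdfium text_nodes data may differ.
TextLine = Struct.new(:text, :height)

# Returns the lines whose bounding-box height exceeds the page's typical
# line height by `ratio`. The baseline is the upper median of all heights.
# Both the baseline choice and the default ratio are illustrative guesses.
def heading_candidates(lines, ratio: 1.3)
  return [] if lines.empty?

  heights = lines.map(&:height).sort
  median = heights[heights.size / 2]
  lines.select { |line| line.height > median * ratio }
end

lines = [
  TextLine.new('introduction', 18.0),
  TextLine.new('first body line', 11.0),
  TextLine.new('second body line', 11.0),
  TextLine.new('third body line', 11.0)
]

puts heading_candidates(lines).map(&:text).inspect
# prints: ["introduction"]
```

Because the check uses geometry rather than letter case, the same rule would apply to scripts without upper and lower case, which is the gap item 3 describes.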
---
## Decision
No code changes. The current implementations in `lib/pdf_text_to_html.rb` and `app/javascript/template_builder/pdf_text_to_html.js` are correct and well-suited for this use case.
If a future requirement emerges to store Markdown rather than raw text in the metadata (e.g. for integration with external tools), the conversion should happen at extraction time in `lib/templates/process_document.rb`, and a full Markdown gem would need to be added. That is out of scope.