# Analysis: Text → Markdown → HTML vs. Text → HTML Direct

**Date:** 2026-02-25

**Branch:** extract-content-from-pdf

**Decision:** Keep current direct text → HTML approach; no code changes warranted

---

## Context

We have two heuristic parsers that convert Pdfium-extracted page text to HTML for the Text View:

- `lib/pdf_text_to_html.rb` (Ruby, used in ERB views)
- `app/javascript/template_builder/pdf_text_to_html.js` (JS, used in document.vue)

The question was whether it would be better to emit Markdown from the heuristics and render it to HTML using an existing renderer, rather than emitting HTML directly.

---

## Recommendation: Keep the current direct text → HTML approach

### Why the Markdown intermediate doesn't help here

**1. No full Markdown renderer in Ruby**

`MarkdownToHtml.rb` is not a Markdown parser; it's a single-line link regex converter. Using Markdown on the Ruby side would require adding a new gem (`kramdown`, `redcarpet`, `commonmarker`). That's a meaningful dependency for no functional gain.

**2. `snarkdown` is inline-only**

The only Markdown library in the JS bundle (`snarkdown` v2.0.0) handles inline syntax (bold, italic, code, links) but has no block-level support: no headings rendered from `##`, no unordered list rendering from `- item`. It cannot replace the list/heading logic in the current heuristic.

**3. `dir="auto"` can't be expressed in Markdown**

The current parsers emit `<p dir="auto">` on every body paragraph for RTL language support. Standard Markdown has no mechanism for this HTML attribute. A Markdown renderer would produce `<p>` without it, breaking Arabic/Hebrew/Persian documents.

**4. PDF text contains Markdown-significant characters**

Legal and business PDFs routinely contain `*`, `_`, `[ref]`, `#3`, `&` in their natural text. Running these through a Markdown renderer would corrupt the output (e.g., `Clause *3* applies` → `Clause 3 applies`). Escaping all Markdown metacharacters before conversion would make the heuristic code more complex, not simpler.

**5. No reduction in complexity**

The heuristic logic (detect ALL_CAPS headings, numbered headings, bullet lines) is the same regardless of output format. Emitting `## HEADING` instead of `