A developer shipped something quietly useful this week. It’s called LiteDoc, and it converts PDFs to clean Markdown entirely inside your browser. No Python. No pip install. No server touching your files.
That part sounds straightforward. Here’s the twist: it doesn’t just extract text. It fingerprints fonts.
Some PDFs use private encoding schemes that make the text layer look like gibberish. Most tools choke on this silently and spit out garbage. LiteDoc detects those corrupted font sections and automatically falls back to canvas-based image rendering for just those pages. The rest stays as clean text. It knows when to give up on text and switch strategies mid-document.
That’s not a small detail. That’s actually hard to get right.
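To make the fallback idea concrete, here is a minimal sketch of one way a per-page check could work. It is not LiteDoc's actual code: the function names, the 0.3 threshold, and the heuristic itself (counting characters that land in Unicode's private-use, unassigned, or control ranges) are all assumptions for illustration.

```python
import unicodedata


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical heuristic: True when most visible characters are unmappable."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # An empty text layer (e.g. a scan) is a separate case, not handled here.
        return False
    suspicious = sum(
        1
        for ch in visible
        # U+FFFD is the replacement character; Co = private use,
        # Cn = unassigned, Cc = control.
        if ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cn", "Cc")
    )
    return suspicious / len(visible) > threshold


def choose_strategy(page_texts: list[str]) -> list[str]:
    """Pick 'text' or 'image' independently for each page."""
    return ["image" if looks_garbled(t) else "text" for t in page_texts]
```

The point of deciding page by page is that one bad font only costs you image tokens for the pages that use it.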
Why this matters for your LLM workflow
Uploading a raw PDF to Claude or GPT-4o triggers internal rasterization. That costs roughly 850 tokens per page. A 50-page research paper eats 40,000+ tokens before you even ask a question. LiteDoc bypasses that entirely. You get raw text plus only the images you actually need, packed in a ZIP.
How to use it 🔧
- Go to litedoc.xyz (runs 100% in your browser)
- Drop your PDF. It unpacks locally, no upload to any server
- It extracts text, handles LaTeX math natively, strips repeating headers and footers
- 📄 Download the ZIP: a clean .md file plus an optimized image folder
- Paste the Markdown text into Claude. Attach only the images you need
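The header and footer stripping in the steps above can be approximated with a simple frequency check. This is a sketch under assumptions, not LiteDoc's implementation: it only looks at the first and last line of each page, treats digits as interchangeable so "Page 1" and "Page 2" match, and uses a made-up `min_share` cutoff of 0.6.

```python
import re
from collections import Counter


def _key(line: str) -> str:
    """Normalize a line so page numbers do not hide repetition."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(
    pages: list[list[str]], min_share: float = 0.6
) -> list[list[str]]:
    """Drop first/last lines that recur on at least min_share of pages."""
    if len(pages) < 2:
        return pages
    counts: Counter[str] = Counter()
    for lines in pages:
        # A set keeps a one-line page from being counted twice.
        edges = {_key(l) for l in (lines[:1] + lines[-1:]) if l.strip()}
        counts.update(edges)
    cutoff = min_share * len(pages)
    repeated = {k for k, n in counts.items() if n >= cutoff}

    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and _key(body[0]) in repeated:
            body = body[1:]
        if body and _key(body[-1]) in repeated:
            body = body[:-1]
        cleaned.append(body)
    return cleaned
```

Feeding it three pages that each start with a journal name and end with "Page N" leaves only the body lines.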
Pro tip
For academic papers with heavy math notation, the LaTeX-aware extraction is the real win here. The equations come out as proper $$...$$ blocks instead of broken symbol soup. Claude reads those cleanly without any preprocessing on your end.
The token drop on a dense technical PDF is not marginal. At roughly 850 tokens per rasterized page versus a few hundred as text, it can cut the per-page cost by more than half. 🚀
If you’re doing any kind of document analysis, contract review, or research summarization with LLMs, this one’s worth bookmarking.
Try it at litedoc.xyz and see what your PDFs actually look like underneath.
Frequently Asked Questions
Q: How accurate is LiteDoc? Where does it struggle?
LiteDoc handles academic papers, reports, and structured documents well. It struggles with scanned PDFs (image-only), heavily designed layouts like brochures, and docs mixing languages with nonstandard fonts. When extraction fails, it falls back to image rendering, which is readable but still costs tokens at the LLM.
Q: How is LiteDoc different from MarkItDown?
MarkItDown is a Python library that handles Word, PowerPoint, and Excel too. LiteDoc is browser-only and PDF-specific, with zero install friction and built-in smarts for corrupted fonts. Pick MarkItDown if you’re in a Python workflow already; pick LiteDoc if you want a quick, token-efficient PDF converter in your browser.
Q: What happens if a PDF has corrupted or custom fonts?
If LiteDoc detects a corrupted font, it stops trying to extract text and renders that part as an image instead. You get readable output, but that section costs tokens at the LLM. Still better than corrupted gibberish text.
Q: How much do you actually save in tokens?
Claude charges roughly 850 tokens per page for vision. LiteDoc can extract the same content as text for roughly 300 to 500 tokens per page. The biggest wins come on text-dense PDFs; image-heavy docs save less, since images still cost tokens.
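For a rough sense of scale, here is the arithmetic using only the figures quoted in this post, reading the text figure as a range of 300 to 500 tokens per page. The function name and the returned keys are made up for illustration, and real counts will vary by document and model.

```python
VISION_TOKENS_PER_PAGE = 850       # per-page vision figure quoted in this post
TEXT_TOKENS_PER_PAGE = (300, 500)  # per-page text range quoted in this post


def estimate(pages: int) -> dict[str, int]:
    """Compare rasterized vs. text token cost for a document of `pages` pages."""
    vision = pages * VISION_TOKENS_PER_PAGE
    low = pages * TEXT_TOKENS_PER_PAGE[0]
    high = pages * TEXT_TOKENS_PER_PAGE[1]
    return {
        "vision": vision,
        "text_low": low,
        "text_high": high,
        # Percentage saved in the worst and best case of the text range.
        "saved_min_pct": round(100 * (vision - high) / vision),
        "saved_max_pct": round(100 * (vision - low) / vision),
    }


print(estimate(50))
```

For the 50-page paper from earlier, that is 42,500 tokens as images against 15,000 to 25,000 as text, a saving of about 41 to 65 percent before counting any images you choose to attach.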
Source: "I built a local PDF-to-Markdown converter so you don’t have to burn LLM tokens." by u/mxsus in PromptEngineering