Show HN: Smelt – Extract structured data from PDFs and HTML using LLM
I built a CLI tool in Go that extracts structured data (JSON, CSV, Parquet) from messy PDFs and HTML pages.

The core idea: LLMs are great at understanding structure but wasteful for bulk data extraction. So smelt uses a two-pass architecture:

1. A fast Go capture layer parses the document and detects table-like regions.
2. Those regions (not the whole document) get sent to Claude for schema inference: column names, types, nesting.
3. The Go layer then does deterministic extraction using the inferred schema.

This means the LLM is never in the hot path of actual data processing.
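To make the third step concrete, here is a minimal sketch of deterministic extraction driven by an inferred schema. It is not smelt's actual code: the `Schema` and `Column` types, the JSON shape of the schema response, and the `ParseSchema` and `ExtractRows` helpers are all assumptions made for illustration, and nesting is left out to keep it short.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Column is one field of a hypothetical inferred schema.
type Column struct {
	Name string `json:"name"`
	Type string `json:"type"` // assumed vocabulary: "string", "int", "float"
}

// Schema is an assumed shape for what the LLM returns for one region.
type Schema struct {
	Columns []Column `json:"columns"`
}

// ParseSchema decodes the (assumed) JSON schema response.
func ParseSchema(raw string) (Schema, error) {
	var s Schema
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return Schema{}, fmt.Errorf("decode schema: %w", err)
	}
	return s, nil
}

// ExtractRows applies the schema to every row of a table-like region.
// No model call happens here: each cell is converted by plain Go code.
func ExtractRows(s Schema, region [][]string) ([]map[string]any, error) {
	out := make([]map[string]any, 0, len(region))
	for i, cells := range region {
		if len(cells) != len(s.Columns) {
			return nil, fmt.Errorf("row %d: got %d cells, want %d", i, len(cells), len(s.Columns))
		}
		row := make(map[string]any, len(cells))
		for j, col := range s.Columns {
			v, err := convert(strings.TrimSpace(cells[j]), col.Type)
			if err != nil {
				return nil, fmt.Errorf("row %d, column %q: %w", i, col.Name, err)
			}
			row[col.Name] = v
		}
		out = append(out, row)
	}
	return out, nil
}

// convert turns one trimmed cell into the Go value named by typ.
func convert(cell, typ string) (any, error) {
	switch typ {
	case "string":
		return cell, nil
	case "int":
		return strconv.ParseInt(cell, 10, 64)
	case "float":
		return strconv.ParseFloat(cell, 64)
	default:
		return nil, fmt.Errorf("unsupported type %q", typ)
	}
}

func main() {
	// Stand-in for the schema inference response; the format is made up.
	schema, err := ParseSchema(`{"columns":[
		{"name":"item","type":"string"},
		{"name":"qty","type":"int"},
		{"name":"price","type":"float"}]}`)
	if err != nil {
		panic(err)
	}
	// Stand-in for a region found by the capture layer.
	region := [][]string{{"widget", " 3", "9.50"}, {"gadget", "12", "101.25"}}
	rows, err := ExtractRows(schema, region)
	if err != nil {
		panic(err)
	}
	enc := json.NewEncoder(os.Stdout) // one JSON object per extracted row
	for _, row := range rows {
		if err := enc.Encode(row); err != nil {
			panic(err)
		}
	}
}
```

In a layout like this, the schema is fetched once per region and every row after that costs only a string parse, which is the sense in which the LLM stays out of the hot path. A cell that does not match its inferred type surfaces as an error instead of being silently coerced.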