
Table formats under the lens: CSV, Markdown, YAML, JSON—Which do LLMs actually understand best across models?


How well LLMs understand tabular data varies more than you might expect by encoding: CSV, Markdown tables, KV-Markdown, YAML, JSON, HTML, and XML all land differently. A focused discussion tracks how these tabular encodings compare in accuracy and token cost across models [1].

Format-by-format snapshot:

- KV-Markdown – a dict-like wrapper with high semantic context; a top performer in the thread [1].
- INI – scores similarly high, per the same discussion [1].
- CSV and Markdown tables – among the weakest, described as index-based formats [1].
- JSON – middle ground: high context, but more syntactic noise and fewer clear record labels [1].
- HTML – surprisingly decent: uses th and td tags, yet lands better than JSON in this comparison [1].
- XML – like JSON but with even more noise; places above INI, and short element names can save tokens (a noted tactic) [1].
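To make the trade-offs above concrete, here is a minimal Python sketch rendering the same records in three of the formats: CSV and a Markdown table (the index-based encodings) versus KV-Markdown, taken here to mean one labeled block of `key: value` lines per record, which is how the thread characterizes it. The records and field names are invented for illustration.

```python
import csv
import io

rows = [
    {"name": "Ada", "role": "engineer", "score": 91},
    {"name": "Grace", "role": "admiral", "score": 88},
]

def to_csv(rows):
    # Index-based: column meaning lives only in the header row.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_markdown(rows):
    # Also index-based: values align to columns by position.
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for r in rows:
        lines.append("| " + " | ".join(str(r[h]) for h in headers) + " |")
    return "\n".join(lines)

def to_kv_markdown(rows):
    # Dict-like: every value is restated next to its key,
    # trading extra tokens for per-record semantic context.
    blocks = []
    for i, r in enumerate(rows, 1):
        lines = [f"## record {i}"] + [f"{k}: {v}" for k, v in r.items()]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

print(to_csv(rows))
print(to_markdown(rows))
print(to_kv_markdown(rows))
```

The KV-Markdown output repeats every field name per record, which is why it tends to cost more tokens but gives the model an explicit label for each value.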

Practical prompts and tips surfaced in the discussion:

- Short XML element names (for example, f instead of function, c instead of class) can trim token use; top/bottom legends help mapping without overloading context [1].
- The idea of testing with the OpenAI tokenizer is raised, underscoring token-count awareness [1].
- Some notes touch on tools like Tree-sitter for project structure tasks, hinting at how tooling choices intersect with data formatting [1].
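A small sketch of the short-element-name tactic, with invented XML content. Character counts are used here as a rough proxy only; the discussion's suggestion is to measure real token counts with the OpenAI tokenizer (e.g. the tiktoken library), since token boundaries do not map one-to-one to characters.

```python
# Verbose vs. abbreviated XML encodings of the same (hypothetical) record.
verbose = "<function><name>parse</name><class>Lexer</class></function>"
short = "<f><n>parse</n><c>Lexer</c></f>"

# A legend placed at the top (or bottom) of the prompt maps the
# abbreviations back to their meanings without bloating every record:
legend = "legend: f=function, n=name, c=class"

# Rough proxy for savings; exact figures require an actual tokenizer.
savings = len(verbose) - len(short)
print(f"{savings} fewer characters per record")
```

The legend is paid for once per prompt, while the per-record savings compound with every row, so the tactic wins once the table is more than a few records long.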

Model variability and caveats:

- Results are described as highly parameter- and architecture-dependent; the same formats can shift in accuracy and cost across model families [1].
- Tests referenced GPT-4.1 nano; the authors warn that results would differ with other models such as Claude [1].

Closing thought: there’s no one-size-fits-all—experiment with formats per model and keep an eye on token budgets as models evolve [1].

References

[1] HackerNews, "Which table format do LLMs understand best?" Explores tabular formats (CSV, Markdown, KV, YAML, JSON/HTML/XML) for LLM understanding; discusses model variation, accuracy, token cost, and comparisons across models.

