Datasets

Purpose

Define, curate, and document datasets used to evaluate FerroTeX.

This document is intentionally kept separate from evaluation-plan.md so that:

  • dataset definitions stay stable while the methodology evolves
  • evaluation methodology stays explicit in its own document

Dataset Categories

1) Real-World Projects

Target characteristics:

  • multi-file structure
  • diverse package usage
  • presence of common warnings and occasional errors

Required metadata per project:

  • source origin and license
  • build instructions
  • engine/distribution versions used
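
One way to keep this metadata checkable is a small record per project. The sketch below is only an illustration: the class name, the field names, and the completeness check are assumptions, not an existing FerroTeX schema.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProjectMetadata:
    """Hypothetical per-project record; every field name is an assumption."""

    name: str
    source_origin: str       # where the project was obtained
    license: str             # license under which it may be redistributed
    build_instructions: str  # command(s) needed to reproduce the build
    engine: str              # TeX engine and version used
    distribution: str        # TeX distribution and version used

    def missing_fields(self) -> list[str]:
        # A field counts as missing when it is empty or whitespace-only.
        return [key for key, value in asdict(self).items() if not value.strip()]
```

A project whose `missing_fields()` result is non-empty could be rejected before it enters the dataset, so that incomplete entries are caught at curation time rather than during evaluation.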

2) Synthetic Stress Fixtures

Synthetic fixtures should be generated to isolate failure modes:

  • deep nesting of \input
  • parentheses and spaces in filenames
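  • forced line-wrap scenarios (log lines longer than the engine's maximum print width, commonly 79 columns)
  • noise that resembles log tokens (for example, user messages containing parentheses or file-like names)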
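
As a rough illustration of the first two items, a fixture generator might look like the sketch below. The function name, the file naming scheme, and the choice of an undefined macro in the innermost file are all assumptions made for this example. Whether a given engine accepts such filenames without extra quoting is not assumed here; probing that is part of what the fixture is for.

```python
from pathlib import Path


def write_nested_inputs(root: Path, depth: int) -> Path:
    """Write main.tex plus a chain of `depth` files, each \\input-ing the next.

    Hypothetical helper: returns the path of the generated main.tex.
    """
    root.mkdir(parents=True, exist_ok=True)
    # Spaces and parentheses in the names are deliberate (assumed scheme).
    names = [f"level {i} (nested).tex" for i in range(1, depth + 1)]
    for i, name in enumerate(names):
        if i + 1 < depth:
            body = f"\\input{{{names[i + 1]}}}\n"
        else:
            # Innermost file: an undefined macro, intended to make the
            # engine report an error at the deepest nesting level.
            body = "\\undefinedmacro\n"
        (root / name).write_text(body, encoding="utf-8")
    first_input = f"\\input{{{names[0]}}}\n" if names else ""
    main = root / "main.tex"
    main.write_text(
        "\\documentclass{article}\n\\begin{document}\n"
        + first_input
        + "\\end{document}\n",
        encoding="utf-8",
    )
    return main
```

Generating fixtures from a script, rather than committing hand-written files, keeps the nesting depth and naming scheme adjustable as new failure modes are identified.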

3) Labeled Diagnostic Subset

For correctness scoring, maintain a labeled subset in which each entry records:

  • log excerpt
  • expected file
  • expected line (if applicable)
  • notes on ambiguity
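
The fields above could be represented as a small record, with a comparison helper for scoring. This is a minimal sketch: the names are hypothetical, and the matching rule (exact file match, with the line compared only when one is labeled) is an assumption, since the actual scoring methodology belongs in evaluation-plan.md.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LabeledDiagnostic:
    """Hypothetical labeled entry; field names mirror the list above."""

    log_excerpt: str
    expected_file: str
    expected_line: Optional[int] = None  # None when no line applies
    ambiguity_notes: str = ""


def is_correct(
    label: LabeledDiagnostic,
    predicted_file: str,
    predicted_line: Optional[int],
) -> bool:
    # Assumed rule: the file must match exactly; the line is compared
    # only when the label specifies one.
    if predicted_file != label.expected_file:
        return False
    if label.expected_line is None:
        return True
    return predicted_line == label.expected_line


# Illustrative entry (the excerpt and paths are made up for this example).
example = LabeledDiagnostic(
    log_excerpt="! Undefined control sequence.\nl.12 \\foo",
    expected_file="chapters/intro.tex",
    expected_line=12,
)
```

Keeping `expected_line` optional lets file-level diagnostics be scored without inventing a line number for them.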

Redaction Policy

Real-world logs may contain absolute paths and usernames. Before adding a log to a dataset:

  • Replace absolute paths with workspace-relative placeholders.
  • Preserve structure needed for parsing (parentheses nesting, line refs).
  • Document any semantic changes introduced by redaction.
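
A redaction pass along these lines might look like the following sketch. The function name, the placeholder strings, and the home-directory pattern are assumptions for illustration. The sketch does not handle paths that the engine has split across wrapped lines.

```python
import re


def redact_paths(
    log_text: str,
    workspace_root: str,
    placeholder: str = "<WORKSPACE>",
) -> str:
    """Replace the workspace root and home-directory usernames in a log.

    Hypothetical helper: only the path prefix is rewritten, so the
    parentheses and line references in the log are left untouched.
    """
    root = workspace_root.rstrip("/\\")
    redacted = log_text.replace(root, placeholder)
    # Assumed pattern for remaining home directories outside the workspace.
    redacted = re.sub(r"/(home|Users)/[^/\s()]+", r"/\1/<USER>", redacted)
    return redacted
```

Because a placeholder rarely has the same length as the path it replaces, redacted lines no longer reflect where the engine originally wrapped them. That is the kind of semantic change the last item asks to be documented.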