Evaluation Plan (Research-Grade)
Purpose
This document defines how FerroTeX will be evaluated as both:
- an engineering system (correctness, performance, UX)
- a research contribution (methods, reproducibility, validity)
Hypotheses
- H1 (Localization accuracy): FerroTeX improves file/line mapping accuracy over regex baselines.
- H2 (Robustness): FerroTeX maintains parsing correctness across engine/distribution variations.
- H3 (Performance): FerroTeX achieves lower latency to usable diagnostics and lower incremental update latency than the baselines.
Datasets
Real-World Corpora
- multi-file theses and dissertations
- package-heavy projects (TikZ, minted, biblatex)
- arXiv-style sources (when licensing permits)
Synthetic Corpora
Generate fixtures targeting known failure modes:
- long path segments forcing wrap
- parentheses in filenames
- deep nesting of \input
- interleaved warnings and errors
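A minimal sketch of a fixture generator covering three of the failure modes above (nested \input, parentheses in filenames, long path segments). The function names, file names, and the 60-character segment length are illustrative assumptions, not part of FerroTeX. The length is chosen so the full path exceeds 79 characters, a common default log line width.

```python
import tempfile
from pathlib import Path


def make_nested_inputs(root: Path, depth: int) -> Path:
    """Write main.tex -> level1.tex -> ... -> level<depth>.tex.

    The deepest file contains a deliberate undefined-macro error, so the
    expected (file, line) of the diagnostic is known by construction.
    """
    for level in range(depth + 1):
        name = "main.tex" if level == 0 else f"level{level}.tex"
        if level < depth:
            body = f"\\input{{level{level + 1}}}\n"
        else:
            body = "\\undefinedmacro\n"  # deliberate error at the deepest level
        if level == 0:
            body = (
                "\\documentclass{article}\n\\begin{document}\n"
                + body
                + "\\end{document}\n"
            )
        (root / name).write_text(body, encoding="utf-8")
    return root / "main.tex"


def make_awkward_path(root: Path, segment_len: int = 60) -> Path:
    """Create a file with a long directory segment and parentheses in its name."""
    directory = root / ("d" * segment_len)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / "chapter(draft).tex"  # hypothetical file name
    target.write_text("\\undefinedmacro\n", encoding="utf-8")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(make_nested_inputs(Path(tmp), depth=5))
        print(make_awkward_path(Path(tmp)))
```

Because each fixture is generated, its ground-truth file and line are known without human labeling.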
Ground Truth
Ground truth is defined as:
- the correct file containing the line referenced by the engine (when present)
- the correct line number (from l.<n>)
For diagnostics without explicit line references, ground truth may be:
- human-labeled
- or excluded from strict line-level scoring
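The ground-truth definition above could be recorded roughly as follows. This is a sketch under assumptions: the GroundTruth record, its field names, and the "engine" / "human" / "excluded" labels are hypothetical, and the regular expression only recognizes the l.<n> form at the start of a log line.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches an engine context line such as "l.42 \undefinedmacro".
LINE_REF = re.compile(r"^l\.(\d+)(?:\s|$)")


@dataclass(frozen=True)
class GroundTruth:
    diagnostic_id: str
    file: Optional[str]   # correct source file, when known
    line: Optional[int]   # correct line number, when known
    source: str           # assumed labels: "engine", "human", or "excluded"


def line_from_log(log_line: str) -> Optional[int]:
    """Return <n> from an l.<n> context line, or None if the line has no reference."""
    match = LINE_REF.match(log_line)
    return int(match.group(1)) if match else None


def label(diagnostic_id: str, file: Optional[str], log_line: str) -> GroundTruth:
    """Use the engine's line reference when present; otherwise mark as excluded."""
    line = line_from_log(log_line)
    if file is not None and line is not None:
        return GroundTruth(diagnostic_id, file, line, "engine")
    return GroundTruth(diagnostic_id, file, None, "excluded")
```

A human-labeled record would be constructed directly with source set to "human" rather than through label.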
Metrics
Localization
- File@1 accuracy
- Line exact accuracy
- Line±k accuracy (k = 1, 2, 5)
- Unmapped rate (how often the system declines to guess)
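The localization metrics can be computed along these lines. The sketch makes two scoring choices that the plan does not specify and that should be treated as assumptions: unmapped cases stay in every denominator, and a line counts as correct only when the file is also correct. The Case record and the metric key names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Case:
    true_file: str
    true_line: int
    pred_file: Optional[str]  # None means the system declined to guess
    pred_line: Optional[int]


def _line_within(case: Case, k: int) -> bool:
    """True when the file is right and the line is within k of the truth."""
    return (
        case.pred_file == case.true_file
        and case.pred_line is not None
        and abs(case.pred_line - case.true_line) <= k
    )


def localization_metrics(
    cases: Sequence[Case], ks: Sequence[int] = (1, 2, 5)
) -> Dict[str, float]:
    n = len(cases)
    if n == 0:
        raise ValueError("no cases to score")
    metrics = {
        "file_at_1": sum(c.pred_file == c.true_file for c in cases) / n,
        "line_exact": sum(_line_within(c, 0) for c in cases) / n,
        "unmapped_rate": sum(c.pred_file is None for c in cases) / n,
    }
    for k in ks:
        metrics[f"line_within_{k}"] = sum(_line_within(c, k) for c in cases) / n
    return metrics
```

Reporting the unmapped rate next to accuracy makes it visible when a tool gains accuracy simply by declining to answer more often.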
Calibration
- Compare reported confidence to empirical correctness: high-confidence predictions should be correct more often than low-confidence predictions.
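One way to make this comparison concrete is a reliability table with equal-width confidence bins, optionally summarized as expected calibration error (ECE). ECE is a common summary and is used here as an illustrative choice; the plan does not prescribe it. The sketch assumes confidences are reported in [0, 1], and the number of bins is arbitrary.

```python
from typing import List, Sequence, Tuple

Prediction = Tuple[float, bool]  # (confidence in [0, 1], was the mapping correct?)


def reliability_bins(
    preds: Sequence[Prediction], n_bins: int = 5
) -> List[Tuple[int, int, float, float]]:
    """Return (bin_index, count, mean_confidence, accuracy) for each non-empty bin."""
    bins: List[List[Prediction]] = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, correct))
    rows = []
    for index, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((index, len(items), mean_conf, accuracy))
    return rows


def expected_calibration_error(preds: Sequence[Prediction], n_bins: int = 5) -> float:
    """Count-weighted mean gap between confidence and accuracy across bins."""
    total = len(preds)
    return sum(
        count / total * abs(accuracy - mean_conf)
        for _, count, mean_conf, accuracy in reliability_bins(preds, n_bins)
    )
```

The hypothesis stated above corresponds to accuracy increasing from the low-confidence rows to the high-confidence rows of the table.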
Performance
- time to first diagnostic
- time to stable diagnostic set
- incremental update latency as a function of appended bytes
- peak RSS and allocation counts
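A rough harness for the two latency metrics (time to first diagnostic, and update latency per appended chunk) might look like the following. The feed callback interface is an assumption: it takes appended bytes and returns the number of diagnostics seen so far. CountingParser is only a stand-in that counts log lines beginning with "!" and is not FerroTeX's parser. Memory metrics are not covered here.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class IncrementalTiming:
    samples: List[Tuple[int, float]] = field(default_factory=list)  # (bytes, seconds)
    time_to_first: Optional[float] = None
    first_chunk_index: Optional[int] = None


def measure_incremental(
    feed: Callable[[bytes], int], chunks: Iterable[bytes]
) -> IncrementalTiming:
    """Time each appended chunk and record when the first diagnostic appears."""
    result = IncrementalTiming()
    run_start = time.perf_counter()
    for index, chunk in enumerate(chunks):
        chunk_start = time.perf_counter()
        diagnostics_so_far = feed(chunk)
        now = time.perf_counter()
        result.samples.append((len(chunk), now - chunk_start))
        if result.time_to_first is None and diagnostics_so_far > 0:
            result.time_to_first = now - run_start
            result.first_chunk_index = index
    return result


class CountingParser:
    """Stand-in parser: counts complete lines that start with '!'."""

    def __init__(self) -> None:
        self.buffer = b""
        self.count = 0

    def feed(self, chunk: bytes) -> int:
        self.buffer += chunk
        # Keep the trailing partial line in the buffer until it is completed.
        *complete, self.buffer = self.buffer.split(b"\n")
        self.count += sum(1 for line in complete if line.startswith(b"!"))
        return self.count
```

Plotting the samples as latency against chunk size gives the "latency as a function of appended bytes" curve. Repeated runs would be needed to smooth timer noise.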
UX (optional study)
- time-to-fix on controlled tasks
- subjective usability questionnaires
Baselines
At least two baselines should be used:
- a regex-based parser from common editor tooling
- a widely used build tool's parsing output (where applicable)
Experimental Design
- Same corpus across all tools.
- Multiple TeX engines (pdfTeX/XeTeX/LuaTeX) when feasible.
- Multiple distributions (TeX Live, MiKTeX) when feasible.
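The design above amounts to a full cross of corpus, engine, distribution, and tool. A small sketch of enumerating that matrix follows. The identifier strings, including the tool labels, are placeholders rather than real command names.

```python
from itertools import product
from typing import Dict, List, Sequence

# Placeholder identifiers; real runs would map these to installed toolchains.
ENGINES = ("pdftex", "xetex", "luatex")
DISTRIBUTIONS = ("texlive", "miktex")
TOOLS = ("ferrotex", "regex-baseline", "build-tool-baseline")


def run_matrix(project_ids: Sequence[str]) -> List[Dict[str, str]]:
    """Enumerate every (project, engine, distribution, tool) combination."""
    return [
        {"project": p, "engine": e, "distribution": d, "tool": t}
        for p, e, d, t in product(project_ids, ENGINES, DISTRIBUTIONS, TOOLS)
    ]
```

Generating the matrix from one project list keeps the corpus identical across tools. Combinations that are not feasible can be dropped afterwards and reported as such.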
Threats to Validity
- Engine logs are not a complete execution trace.
- Packages can emit text that mimics log tokens.
- Human labeling can introduce bias.
Mitigations:
- keep datasets and labeling criteria explicit
- report failure modes and unmapped cases