Evaluation Plan (Research-Grade)

Purpose

This document defines how FerroTeX will be evaluated as both:

an engineering system (correctness, performance, UX)
a research contribution (methods, reproducibility, validity)

Hypotheses

H1 (Localization accuracy): FerroTeX improves file/line mapping accuracy over regex baselines.
H2 (Robustness): FerroTeX maintains parsing correctness across engine/distribution variations.
H3 (Performance): FerroTeX achieves lower latency to usable diagnostics and better incremental behavior.

Datasets

Real-World Corpora

multi-file theses and dissertations
package-heavy projects (TikZ, minted, biblatex)
arXiv-style sources (when licensing permits)

Synthetic Corpora

Generate fixtures targeting known failure modes:

long path segments forcing wrap
parentheses in filenames
deep nesting of \input
interleaved warnings and errors

Ground Truth

Ground truth is defined as:

the correct file containing the line referenced by the engine (when present)
the correct line number (from l.<n>)

For diagnostics without explicit line references, ground truth may be:

human-labeled
or excluded from strict line-level scoring

Metrics

Localization

File@1 accuracy
Line exact accuracy
Line±k accuracy (k = 1, 2, 5)
Unmapped rate (how often the system declines to guess)

Calibration

Compare confidence to empirical correctness:
- high-confidence predictions should be correct more often than low-confidence predictions

Performance

time to first diagnostic
time to stable diagnostic set
incremental update latency as a function of appended bytes
peak RSS and allocation counts

UX (optional study)

time-to-fix on controlled tasks
subjective usability questionnaires

Baselines

At least two baselines should be used:

a regex-based parser from common editor tooling
a widely-used build tool’s parsing output (where applicable)

Experimental Design

Same corpus across all tools.
Multiple TeX engines (pdfTeX/XeTeX/LuaTeX) when feasible.
Multiple distributions (TeX Live, MiKTeX) when feasible.

Threats to Validity

Engine logs are not a complete execution trace.
Packages can emit text that mimics log tokens.
Human labeling can introduce bias.

Mitigations:

keep datasets and labeling criteria explicit
report failure modes and unmapped cases