Notes2Tone

Open-Source Optical Music Recognition (OMR) in Direct Benchmark Comparison

- 685 SMB samples
- 4 system variants
- OMR-NED as primary metric (lower is better)

Comparison Setup

Dataset

The SMB dataset: 685 sheet music images across multiple texture categories, with ground truth in kern format.

Models

Audiveris, Audiveris-Scaled, HOMR, and OeMeR. Audiveris-Scaled is Audiveris with an image-upscaling pre-processing step.
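The exact upscaling method behind Audiveris-Scaled is not specified here; as a minimal sketch, a pre-processing step could enlarge each page image before handing it to the engine. The nearest-neighbor approach below (pixel repetition via NumPy) is an assumption for illustration, not the benchmark's actual pipeline:

```python
import numpy as np

def upscale_nearest(page: np.ndarray, factor: int = 2) -> np.ndarray:
    """Enlarge a grayscale page image by repeating each pixel factor times per axis."""
    return page.repeat(factor, axis=0).repeat(factor, axis=1)

# Tiny stand-in for a scanned page (rows x cols).
page = np.zeros((300, 200), dtype=np.uint8)
upscaled = upscale_nearest(page, factor=2)  # shape becomes (600, 400)
```

In practice an interpolating resampler (bicubic, Lanczos) would likely be preferred over pixel repetition, since smoother strokes help symbol detection.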

Evaluation

OMR-NED, a normalized edit distance between prediction and ground truth, complemented by an error-type analysis.
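The precise OMR-NED definition lives in the paper; as a generic sketch, a normalized edit distance over token sequences can be computed as the Levenshtein distance divided by the length of the longer sequence, yielding a score in [0, 1] where 0 is a perfect match:

```python
def edit_distance(pred, truth):
    """Levenshtein distance between two token sequences (row-wise DP)."""
    m, n = len(pred), len(truth)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == truth[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n]

def normalized_edit_distance(pred, truth):
    """Edit distance scaled by the longer sequence; 0.0 means identical."""
    if not pred and not truth:
        return 0.0
    return edit_distance(pred, truth) / max(len(pred), len(truth))
```

Under this reading, HOMR's median of 0.243 means that roughly a quarter of the ground-truth tokens need edits to reproduce the prediction.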

Key Findings

Four-model comparison (n=104)

Median OMR-NED: HOMR 0.243, Audiveris-Scaled 0.250, Audiveris 0.321, OeMeR 0.715.

HOMR leads, while OeMeR clearly trails.

Three-model comparison (n=453)

With the more challenging Pianoform samples included, all values increase: HOMR 0.603, Audiveris-Scaled 0.660, OeMeR 0.753.

Performance ranking remains stable: HOMR > Audiveris-Scaled > OeMeR.

Coverage

Successfully processed SMB samples: Audiveris 29.3%, Audiveris-Scaled 78.5%, HOMR 99.3%, OeMeR 84.8%.

Upscaling makes Audiveris usable on significantly more pages.
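Coverage here is simply the share of the 685 SMB samples a system processed without failure. A minimal sketch (the absolute counts 201 and 680 are assumptions chosen only to be consistent with the reported percentages):

```python
def coverage_percent(processed: int, total: int) -> float:
    """Share of benchmark samples processed without failure, in percent (1 decimal)."""
    return round(100 * processed / total, 1)

# Assumed counts: 201/685 pages would reproduce Audiveris' reported 29.3 %,
# 680/685 would reproduce HOMR's 99.3 %.
audiveris = coverage_percent(201, 685)
homr = coverage_percent(680, 685)
```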

Result Plots

[Plot: OMR-NED distributions, 4 models (n=104)]

[Plot: OMR-NED distributions, 3 models (n=453)]

Conclusion

HOMR delivers the best overall benchmark performance. Audiveris-Scaled is a practical compromise with significantly higher coverage than Audiveris. In this setup, OeMeR shows consistently weaker results.

Key takeaway for future work: error profiles are model-specific, so improvements should be designed per model rather than with a single global fix.
