Testing suite
Tests verify that the code does what was written; physical correctness is judged by data, not by tests. The suite catches regressions in equilibrium chemistry, solubility laws, oxygen-fugacity buffers, and the hybrid solver, but it cannot certify that those formulae match nature. That judgement belongs to the validation runs and the published comparisons against measured magma-ocean outgassing data.
This page describes the suite as a whole. Contributors writing or modifying tests should read it together with the Build a new test how-to.
Test quality contract
Five layers enforce test rigor across the suite:
1. A four-marker tier scheme (unit, smoke, integration, slow) selects what runs in the PR gate versus the nightly.
2. Two validation markers (physics_invariant, reference_pinned) tag tests that carry physical meaning beyond pure code coverage.
3. A 1:1 mirroring rule pairs every physics source with a same-named test file.
4. An AST linter (tools/check_test_quality.py) rejects seven weak-test patterns on every PR.
5. A coverage ratchet capped at 90 % keeps the gate moving upward over time.
Layers 1, 4, and 5 are blocking on PRs. Layers 2 and 3 are advisory: the linter reports gaps but does not fail the build.
The four-marker tier scheme
Every test in the suite carries exactly one tier marker, applied either at module level (pytestmark = pytest.mark.X) or per class (@pytest.mark.X).
| Marker | What it tests | Per-test budget | CI surface |
|---|---|---|---|
| unit | Python logic, individual helpers, equilibrium-constant fits, solubility laws, structure formulae. No real equilibrium_atmosphere call. | < 100 ms | PR + nightly |
| smoke | Real equilibrium_atmosphere invocations on minimal configurations (single composition, default species set, one solve). | < 30 s | PR + nightly |
| integration | Full multi-species CHNS solves with mass-conservation invariants, all eleven species active. | minutes | Nightly only |
| slow | Long parameter sweeps and convergence studies (authoritative-O monotonicity regimes, hypothesis fuzz, cross-buffer property checks). | up to an hour | Nightly only |
| skip | Placeholder, deliberately disabled. | n/a | Never |
The tests badge at the top of this page shows the live total; per-tier counts are available locally via pytest -m <tier> --collect-only -q.
Tests without a tier marker are invisible to CI.
The PR gate runs pytest -m "(unit or smoke) and not skip"; the nightly runs everything in (unit or smoke or integration or slow) and not skip.
Module-level marker and timeout
Every test file declares its tier and its wall-time ceiling at module top:
pytestmark = [pytest.mark.unit, pytest.mark.timeout(30)]
| Tier | Timeout |
|---|---|
| unit | 30 s |
| smoke | 60 s |
| integration | 300 s |
| slow | 3600 s |
The timeout is a defensive ceiling, not a target.
A unit test that takes 25 s of wall time has either picked the wrong tier or has a leak somewhere; the ceiling catches future regressions that introduce a hang.
Per-function markers (for example @pytest.mark.skip on one stale parametrisation) are additive and do not replace the module-level marker.
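The module-level declaration described above can be checked mechanically. The following is a minimal sketch, using only the standard-library ast module, of how a validator might confirm that a test file names a tier in its pytestmark assignment. The function name module_tier_markers is hypothetical and is not taken from tools/validate_test_structure.sh or tools/check_test_quality.py.

```python
import ast

TIER_MARKERS = {"unit", "smoke", "integration", "slow"}


def module_tier_markers(source: str) -> set[str]:
    """Return the tier markers named in a module-level ``pytestmark`` assignment."""
    found = set()
    for node in ast.parse(source).body:  # top-level statements only
        if not isinstance(node, ast.Assign):
            continue
        is_pytestmark = any(
            isinstance(target, ast.Name) and target.id == "pytestmark"
            for target in node.targets
        )
        if not is_pytestmark:
            continue
        # Collect attribute names in the assigned value, e.g. the ``unit``
        # in ``pytest.mark.unit``; ``mark`` and ``timeout`` are ignored.
        for sub in ast.walk(node.value):
            if isinstance(sub, ast.Attribute) and sub.attr in TIER_MARKERS:
                found.add(sub.attr)
    return found


example = "import pytest\npytestmark = [pytest.mark.unit, pytest.mark.timeout(30)]\n"
print(module_tier_markers(example))
```

A file for which this returns an empty set has no tier, and a file for which it returns more than one would break the exactly-one-tier rule.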
Physics-invariant tiering
A unit test on any physics source must assert at least one of four invariant families:
- Conservation: per-species mass closure (kg_atm + kg_liquid + kg_solid ≈ kg_total within rtol), stoichiometric closure (sum(mole_fractions) == 1.0), element closure across the H/C/N/O/S budgets.
- Positivity or boundedness: T > 0, P > 0, mole fractions in [0, 1], log10(fO2) finite over the calibration window, partial pressures non-negative.
- Monotonicity or symmetry: log10(fO2) decreasing with 1/T along an isobaric buffer; CO\(_2\) solubility increasing with \(P\) at fixed \(T\); swapping two non-reacting species leaving outputs unchanged.
- Pinned numeric value with a discrimination guard: a closed-form value or published table entry pinned via pytest.approx, plus an explicit follow-up assertion showing that the most plausible wrong formula would differ from the correct one by more than the tolerance.
Tests that meet one or more of these are tagged @pytest.mark.physics_invariant.
The marker is per-function, not module-level: structural tests in the same file (for example, an ordering or pass-through-assignment check) should not carry it.
Physics sources in CALLIOPE are the five files chemistry.py, oxygen_fugacity.py, solubility.py, solve.py, structure.py.
Utility sources (__init__.py, _version.py, constants.py) are exempt from the invariant requirement but remain subject to the anti-happy-path rules below.
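As an illustration of the fourth invariant family (a pinned value with a discrimination guard), here is a small sketch built on the Stefan-Boltzmann law. The helper blackbody_flux is hypothetical and is not part of CALLIOPE, and math.isclose stands in for pytest.approx so that the example runs with the standard library alone.

```python
import math

SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4 (CODATA 2018)


def blackbody_flux(temperature_k: float) -> float:
    """Hypothetical helper: black-body flux sigma * T^4 in W m^-2."""
    return SIGMA * temperature_k**4


def test_blackbody_flux_matches_closed_form_at_1000K():
    """Pin sigma * T^4 at 1000 K and show that a T^3 slip could not pass."""
    rel_tol = 1e-9
    expected = 56703.74419  # sigma * (1000 K)^4
    result = blackbody_flux(1000.0)
    assert math.isclose(result, expected, rel_tol=rel_tol)
    # Discrimination guard: the most plausible wrong formula (T^3 instead
    # of T^4) must miss the pinned value by far more than the tolerance.
    wrong = SIGMA * 1000.0**3
    assert abs(wrong - expected) / expected > 100 * rel_tol
```

The second assertion is what separates this from a happy-path test: it records which mistake the pinned value is meant to catch.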
Reference-pinned validation
Tests that pin behaviour against an external anchor are tagged @pytest.mark.reference_pinned.
The anchor is one of:
- a published benchmark (cite paper + figure + table),
- an analytical limit (for example, the ideal-gas limit at low pressure, or the Stefan-Boltzmann black-body limit),
- a cross-implementation check (CALLIOPE vs atmodeller at a shared Earth fiducial).
Each of the five physics sources carries at least one reference-pinned test, recorded on the matching docs/Validation/<file>.md page.
The current anchors:
| Source | Anchor | Test |
|---|---|---|
| chemistry.py | JANAF Thermochemical Tables [1] (4th ed.), \(K_{eq}\) for \(\mathrm{H_2O} \to \mathrm{H_2} + 0.5\,\mathrm{O_2}\) at 2000 K with the O'Neill & Eggins 2002 [4] buffer | tests/test_chemistry.py::test_modified_keq_janaf_H2_matches_closed_form_at_2000K_with_oneill |
| oxygen_fugacity.py | Fischer et al. 2011 [2] (EPSL 304, 496) IW buffer at 2000 K | tests/test_oxygen_fugacity.py::test_oxygen_fugacity_fischer_value_at_2000K_matches_published_fit |
| solubility.py | Sossi et al. 2023 [5] peridotite H\(_2\)O fit; Gaillard et al. 2022 [3] (EPSL 117255) S\(_2\) fit | tests/test_solubility.py::TestSolubilityH2O::test_peridotite_default_matches_sossi_2023_fit and tests/test_solubility.py::TestSolubilityS2_xFeO::test_default_call_matches_gaillard_2022_earth_mantle_value |
| solve.py | Self-consistency between equilibrium_atmosphere (buffered) and equilibrium_atmosphere_authoritative_O (the authoritative-O entry point) at the Earth fiducial | tests/test_solve.py::test_round_trip_self_consistency_at_earth_fiducial |
| structure.py | Wang, Lineweaver & Ireland 2018 [6] (arxiv:1708.08718) Earth core mass fraction 0.325 | tests/test_structure.py::test_calculate_mantle_mass_recovers_wang_2018_earth_core_fraction |
The marker is not the same thing as physical correctness: a reference-pinned test certifies that this implementation reproduces that anchor; it does not certify that the anchor is the right physics for every astrophysical regime.
One-to-one source-to-test mirroring
Each source file in src/calliope/ has a same-named companion in tests/:
| Source | Test |
|---|---|
| src/calliope/chemistry.py | tests/test_chemistry.py |
| src/calliope/oxygen_fugacity.py | tests/test_oxygen_fugacity.py |
| src/calliope/solubility.py | tests/test_solubility.py |
| src/calliope/solve.py | tests/test_solve.py |
| src/calliope/structure.py | tests/test_structure.py |
| src/calliope/__init__.py | tests/test_init.py |
| src/calliope/constants.py | (covered in tests/test_core.py; constants are imported across many sources) |
Cross-cutting tests are the documented exception, not the rule:
- tests/test_invariants.py, tests/test_invariants_hypothesis.py: contract clauses that span multiple sources (mass closure end-to-end, partial-species behaviour, stoichiometry of an arbitrary recipe).
- tests/test_authoritative_O.py, tests/test_authoritative_O_validation.py, tests/test_authoritative_O_monotonicity.py: the authoritative-O entry point, which touches solve.py, chemistry.py, and oxygen_fugacity.py together.
- tests/test_equilibrium_paths.py, tests/test_partial_species.py, tests/test_stoichiometry.py, tests/test_targets.py: solver-architecture tests that span the same three files.
When a new physics source is added, its 1:1 test file is created at the same time; the matching docs/Validation/<file>.md page is added when the first reference-pinned test for that source lands.
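The mirroring rule lends itself to a mechanical check. The sketch below is one way it could be expressed; the function names and the EXEMPT set are assumptions made for illustration, not the project's actual tooling.

```python
from pathlib import PurePosixPath

# Sources treated as exempt in this sketch (an assumption): constants.py is
# covered in tests/test_core.py, and _version.py carries no logic of its own.
EXEMPT = {"constants.py", "_version.py"}


def expected_test_name(source_name: str) -> str:
    """Map chemistry.py -> test_chemistry.py and __init__.py -> test_init.py."""
    stem = PurePosixPath(source_name).stem.strip("_")
    return f"test_{stem}.py"


def missing_mirrors(source_names, test_names):
    """Return the sources that have no same-named test file."""
    present = set(test_names)
    return sorted(
        name
        for name in source_names
        if name not in EXEMPT and expected_test_name(name) not in present
    )


sources = ["chemistry.py", "solve.py", "__init__.py", "constants.py"]
tests = ["test_chemistry.py", "test_init.py"]
print(missing_mirrors(sources, tests))  # ['solve.py']
```

In a real check the two lists would come from directory listings of src/calliope/ and tests/.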
AST test-quality linter
tools/check_test_quality.py walks tests/test_*.py as an AST and enforces seven rules:
| Rule | What it flags |
|---|---|
| missing_module_pytestmark | Test file with no module-level pytestmark (a tier marker is required). |
| missing_docstring | Test function with no docstring. |
| single_assert | Function with exactly one assertion (anti-happy-path: a single assert is rarely enough to discriminate the correct formula from plausible wrong ones). |
| no_assertions | Function with zero assertions (only valid for tests that exercise an exception path with pytest.raises). |
| weak_assert_* (three sub-rules) | Standalone assert result is not None, assert result > 0, assert len(result) > 0, etc., as the sole meaningful check. A weak assertion alongside a strong primary one (the sign guard in a discrimination pattern, for example) is not flagged. |
| float_eq_literal | == adjacent to a numeric literal in a test body (use pytest.approx). |
| missing_importorskip | An optional dependency (hypothesis, atmodeller) imported at module top without a preceding pytest.importorskip('<name>'). The PR Docker image uses pip install --no-deps; without importorskip, collection fails. |
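To give a feel for how rules such as single_assert and no_assertions can be detected from the syntax tree, here is a simplified sketch. It is not the code in tools/check_test_quality.py: the function name is hypothetical, and it ignores the pytest.raises exemption the real rule allows.

```python
import ast


def assertion_findings(source: str) -> dict[str, str]:
    """Map test-function name -> 'no_assertions' or 'single_assert'."""
    findings = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef) or not node.name.startswith("test_"):
            continue
        # Count every ``assert`` statement anywhere inside the function body.
        n_asserts = sum(isinstance(sub, ast.Assert) for sub in ast.walk(node))
        if n_asserts == 0:
            findings[node.name] = "no_assertions"
        elif n_asserts == 1:
            findings[node.name] = "single_assert"
    return findings


SAMPLE = '''
def test_empty():
    compute()

def test_one():
    assert compute() == 3

def test_two():
    result = compute()
    assert result >= 0
    assert result <= 1
'''
print(assertion_findings(SAMPLE))
```

Only test_empty and test_one are reported; test_two carries two assertions and passes both rules. The sample is parsed, never executed, so the undefined compute() is harmless.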
The linter runs in two modes:
- python tools/check_test_quality.py --baseline walks the suite and writes the per-rule violation counts to tools/test_quality_baseline.json. This is the floor. Regenerate the baseline only after a deliberate sweep that reduced violations.
- python tools/check_test_quality.py --check (CI mode) walks the suite, compares the current counts to the baseline, and exits non-zero if any rule's count exceeds the baseline. The CI workflow runs this and blocks the PR on regression.
The baseline ratchets one way: the linter refuses to regenerate it if the new total exceeds the old.
Override with CALLIOPE_TEST_QUALITY_ALLOW_REGRESS=1 only when a new rule was added that surfaces pre-existing violations.
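The comparison logic behind the two modes can be sketched in a few lines. The flat {rule: count} layout of the baseline, the treatment of a rule missing from the baseline as zero, and both function names are assumptions made for illustration; the real file format may differ.

```python
def regressions(current: dict[str, int], baseline: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {rule: (baseline_count, current_count)} for rules that got worse."""
    return {
        rule: (baseline.get(rule, 0), count)
        for rule, count in current.items()
        if count > baseline.get(rule, 0)
    }


def may_regenerate(new_counts: dict[str, int], old_counts: dict[str, int],
                   allow_regress: bool = False) -> bool:
    """One-way ratchet: refuse a new baseline whose total exceeds the old one."""
    return allow_regress or sum(new_counts.values()) <= sum(old_counts.values())


baseline = {"single_assert": 4, "missing_docstring": 2}
current = {"single_assert": 5, "missing_docstring": 1}
print(regressions(current, baseline))     # {'single_assert': (4, 5)}
print(may_regenerate(current, baseline))  # True: totals are 6 and 6
```

The example shows why the check is per rule while the regeneration guard is on the total: the total is unchanged here, yet single_assert has regressed and would fail the check.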
Two advisory modes report gaps without failing CI:
- python tools/check_test_quality.py --reference-pinned-status: lists physics sources whose matching tests/test_<source>.py has no @pytest.mark.reference_pinned test.
- python tools/check_test_quality.py --physics-invariant-status: lists physics-source tests that assert no invariant and are not tagged @pytest.mark.physics_invariant.
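A status report of this kind comes down to finding decorated functions in the syntax tree. The sketch below shows one plausible way to do that with the standard library; functions_with_marker is a hypothetical name, and the real linter may resolve markers differently (for example through class-level decorators).

```python
import ast


def functions_with_marker(source: str, marker: str) -> list[str]:
    """Names of functions decorated with ``@pytest.mark.<marker>``."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        for dec in node.decorator_list:
            # Unwrap call-style decorators such as ``@pytest.mark.timeout(30)``.
            target = dec.func if isinstance(dec, ast.Call) else dec
            if isinstance(target, ast.Attribute) and target.attr == marker:
                names.append(node.name)
                break
    return names


SAMPLE_TESTS = '''
import pytest

@pytest.mark.reference_pinned
def test_matches_published_fit():
    ...

@pytest.mark.physics_invariant
def test_mass_closure():
    ...
'''
print(functions_with_marker(SAMPLE_TESTS, "reference_pinned"))
```

A physics source whose test file yields an empty list for reference_pinned would be listed by the advisory report.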
Local commands
pytest -m unit # fast unit tests
pytest -m smoke # minimal-config solver tests
pytest -m integration # full multi-species CHNS solves
pytest -m slow # sweeps and hypothesis fuzz
pytest -m "(unit or smoke) and not skip" # PR-gate selection
pytest -m "not skip" # everything that should ever run
Coverage:
pytest --cov=calliope --cov-report=term -m "not skip"
pytest --cov=calliope --cov-report=html -m "not skip" # htmlcov/
Lint and structure:
bash tools/validate_test_structure.sh # module-level marker validator
python tools/check_test_quality.py --check # AST linter against baseline
python tools/check_test_quality.py --reference-pinned-status
python tools/check_test_quality.py --physics-invariant-status
Public-facing badges versus internal taxonomy
Public-facing badges (README, project website) collapse smoke + integration + slow into a single Integration Tests category, because a four-way taxonomy is confusing to non-developer readers.
The four-marker internal scheme remains for CI infrastructure granularity: the PR gate runs (unit or smoke), the nightly runs everything, and the test-count badge fetches the JSON files written into the documentation site during the docs deploy.
Badge system
The documentation deploy (.github/workflows/docs.yaml) regenerates three JSON files, tests-{total,unit,integration}.json, from pytest --collect-only and writes them into the published site under badges/.
Shields.io fetches them live from the site via the endpoint URL embedded in the test-count badge.
The counts refresh on every documentation build, so they track the suite without running it.
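For orientation, a Shields.io endpoint badge reads a small JSON document with the keys schemaVersion, label, and message (color is optional). The sketch below builds such payloads; the counts, label text, and colour are invented for illustration, and the code is not taken from .github/workflows/docs.yaml.

```python
import json


def badge_payload(label: str, count: int, color: str = "blue") -> dict:
    """Build a Shields.io endpoint-badge payload for a test count."""
    return {"schemaVersion": 1, "label": label, "message": str(count), "color": color}


# Hypothetical counts; the real ones would come from ``pytest --collect-only``.
counts = {"total": 120, "unit": 90, "integration": 30}
for tier, n in counts.items():
    print(f"tests-{tier}.json", json.dumps(badge_payload(f"{tier} tests", n)))
```

Because the payload is generated from collection alone, producing it costs seconds rather than the runtime of the suite.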
Coverage gates
Two gates are declared in pyproject.toml:
| Gate | Tests included | Threshold key | Where it runs |
|---|---|---|---|
| Fast | unit + smoke | [tool.calliope.coverage_fast].fail_under | PR "Run unit + smoke tests" step |
| Full | unit + smoke + integration + slow | [tool.coverage.report].fail_under | Nightly coverage report |
Both gates ratchet toward 90 %, capped at 90 % (tools/update_coverage_threshold.py enforces ECOSYSTEM_CEILING = 90.0); neither may be manually decreased.
The PR gate has a pre-flight step that fetches the base branch's pyproject.toml and rejects any PR that drops [tool.coverage.report].fail_under below min(base, 90.0).
If the base value sits above the ceiling, a one-time reduction to 90 % is allowed; any drop below min(base, 90.0) is blocked.
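The pre-flight rule reduces to a single comparison. A minimal sketch follows, reusing the ECOSYSTEM_CEILING name cited above; the function preflight_ok is hypothetical.

```python
ECOSYSTEM_CEILING = 90.0  # ceiling value quoted on this page


def preflight_ok(base: float, proposed: float, ceiling: float = ECOSYSTEM_CEILING) -> bool:
    """Accept the PR only if fail_under stays at or above min(base, ceiling)."""
    return proposed >= min(base, ceiling)


print(preflight_ok(85.0, 86.0))  # True: the threshold was raised
print(preflight_ok(85.0, 84.0))  # False: drops below the base value
print(preflight_ok(93.0, 90.0))  # True: reduction down to the ceiling
print(preflight_ok(93.0, 89.5))  # False: drops below the ceiling
```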
Coverage union estimation
The PR gate downloads the most recent nightly-coverage artifact from main (coverage.xml + coverage.json + nightly-timestamp.txt, 14-day retention) and line-ORs the unit-tier coverage with the nightly's full coverage to produce a union estimate.
The result is written to $GITHUB_STEP_SUMMARY as informational output; it does not gate the PR.
A staleness threshold of 48 hours and a grace band of 0.3 % apply to the warn / fail / ok decision.
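The line-OR itself is a per-file set union. The sketch below assumes the coverage data has already been reduced to {filename: set of line numbers} maps, which is a simplification of the coverage.xml and coverage.json artifacts; the function name is hypothetical, and the staleness and grace-band logic is left out.

```python
def union_coverage(fast: dict, nightly: dict, executable: dict) -> float:
    """Line-OR two coverage maps and return the union percentage.

    ``fast`` and ``nightly`` map filename -> set of covered line numbers;
    ``executable`` maps filename -> set of executable line numbers.
    """
    covered = 0
    total = 0
    for filename, lines in executable.items():
        hit = (fast.get(filename, set()) | nightly.get(filename, set())) & lines
        covered += len(hit)
        total += len(lines)
    return 100.0 * covered / total if total else 100.0


executable = {"solve.py": {1, 2, 3, 4}, "chemistry.py": {1, 2, 3, 4}}
fast = {"solve.py": {1, 2}}
nightly = {"solve.py": {2, 3, 4}, "chemistry.py": {1, 2}}
print(union_coverage(fast, nightly, executable))  # 75.0
```

A line counts as covered if either run touched it, so the union can never be lower than the larger of the two inputs.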
PR validation pipeline
.github/workflows/tests.yaml runs on every PR (if: draft == false) over a 2-OS by 2-Python matrix (ubuntu-latest and macos-latest, each on Python 3.12 and 3.13).
The full step sequence:
1. Validate test markers (bash tools/validate_test_structure.sh): rejects any test file without a module-level pytestmark.
2. Run test-quality lint (python tools/check_test_quality.py --check): blocking; rejects regression against tools/test_quality_baseline.json.
3. Pre-flight fail_under ratchet check: rejects any PR that drops [tool.coverage.report].fail_under below min(base, 90.0).
4. Run unit + smoke tests: pytest -m "(unit or smoke) and not skip" --cov=calliope --cov-fail-under=${FAST_FAIL_UNDER}, where FAST_FAIL_UNDER is read live from [tool.calliope.coverage_fast].fail_under.
Steps 1, 2, and 3 are gated to ubuntu-latest, Python 3.12, so they run once rather than on all four matrix jobs; step 4 runs across the full matrix.
Nightly (.github/workflows/nightly.yml) runs the full suite, uploads coverage to Codecov, and enforces --cov-fail-under=90.
Canonical specification
The repository-wide rules that every PROTEUS-ecosystem submodule follows are at proteus-framework.org/PROTEUS/Explanations/ecosystem_testing_standard/.
References
1. M. W. Chase, NIST-JANAF Thermochemical Tables, 4th edition, Journal of Physical and Chemical Reference Data Monograph 9, 1998.
2. R. A. Fischer, A. J. Campbell, G. A. Shofner, O. T. Lord, P. Dera, V. B. Prakapenka, Equation of state and phase diagram of FeO, Earth and Planetary Science Letters, 304, 496-502, 2011.
3. F. Gaillard, F. Bernadou, M. Roskosz, M. A. Bouhifd, Y. Marrocchi, G. Iacono-Marziano, M. Moreira, B. Scaillet, G. Rogerie, Redox controls during magma ocean degassing, Earth and Planetary Science Letters, 577, 117255, 2022.
4. H. St. C. O'Neill, S. M. Eggins, The effect of melt composition on trace element partitioning: an experimental investigation of the activity coefficients of FeO, NiO, CoO, MoO\(_2\) and MoO\(_3\) in silicate melts, Chemical Geology, 186, 151-181, 2002.
5. P. A. Sossi, P. M. E. Tollan, J. Badro, D. J. Bower, Solubility of water in peridotite liquids and the prevalence of steam atmospheres on rocky planets, Earth and Planetary Science Letters, 601, 117894, 2023.
6. H. S. Wang, C. H. Lineweaver, T. R. Ireland, The elemental abundances (with uncertainties) of the most Earth-like planet, Icarus, 299, 460-474, 2018.