Skip to content

MRG: adjust warnings around tax abund and provide v5 upgrades to tax metagenome #3711

Merged
ctb merged 71 commits intolatestfrom
update_tax_abund
Jul 31, 2025
Merged

MRG: adjust warnings around tax abund and provide v5 upgrades to tax metagenome #3711
ctb merged 71 commits intolatestfrom
update_tax_abund

Conversation

@ctb
Copy link
Copy Markdown
Contributor

@ctb ctb commented Jun 27, 2025

Add new command line arguments to tax metagenome around abundances. --use-abundances turns on the use of sketch abundances for krona and maybe other output formats, while --ignore-abundances/--no-abundances turns them off. To add to the complexity, by default, the human and kreport output formats do use abundances when present 😭 .

Here's a chart to disambiguate 😆

output format v4 behavior v5 default behavior notes
human abund abund default output in v5
csv_summary provides both provides both default output in v4
lineage_summary no abund abund
krona no abund abund
kreport abund abund

This PR provides a warning if no abundances are available for tax metagenome to use, or if --use-abundances/--ignore-abundances is not explicitly set (for sourmash v4).

For sourmash v5, --use-abundances becomes default and it is an error to not have abundances. This can be overridden with --ignore-abundances.

For sourmash v5, this PR also changes the default output format for tax genome and tax metagenome to -F human from -F csv_summary (#2162).

Uses --v4/--v5 command line switches and test infrastructure, ref #3076.

As of this PR,

sourmash v4 - all tax metagenome commands warn if abundances are not available.
sourmash v5 - all tax metagenome commands require abund, unless --ignore-abund is set.

TODO:

  • finish writing tests
  • test: warning if no abundances provided/used (v4)
  • test: no warning if --ignore-abund, and no abundances used (v4/v5)
  • test: error if no abundances provided & no ignore (v5)
  • test: error if no abundances provided and --use-abund
  • test: abundances used if --use-abund (v4/v5)
  • test: abundances used if v5
  • test: default output format for metagenome is human (v5)
  • test: default output format for genome is csv_summary (v4)
  • test: abundances not used with tax genome
  • fix @ctb and XXX notes stuff
  • actually test & examine the krona and kreport output formats ;)
  • test: default output format for metagenome is human (v5)
  • test: default output format for genome is csv_summary (v4)
  • add tests and checks for human output format: --use-abund, --ignore-abund, and defaults (use abund).
  • add comments to Weirdly high unclassified proportion in defined metagenome #3577 and sourmash tax metagenome krona and kreports are vastly different? #3598
  • add documentation
  • check bioboxes and lingroup output formats to see if changes are required

Fixes #2162
Fixes #3577
Fixes #3598

UPDATE: I created a validation workflow here: https://github.com/sourmash-bio/2025-validate-sourmash-tax-formats. It converts everything to the new taxburst JSON format and then does a tree comparison. The new krona format passes 🎉 !

ctb and others added 23 commits March 9, 2024 10:50
@codecov
Copy link
Copy Markdown

codecov bot commented Jun 27, 2025

Codecov Report

❌ Patch coverage is 96.10390% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.19%. Comparing base (c94f143) to head (7ec27ae).
⚠️ Report is 54 commits behind head on latest.

Files with missing lines Patch % Lines
src/sourmash/tax/tax_utils.py 92.85% 1 Missing and 1 partial ⚠️
src/sourmash/tax/__main__.py 97.56% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #3711      +/-   ##
==========================================
+ Coverage   88.17%   88.19%   +0.01%     
==========================================
  Files         137      137              
  Lines       22491    22552      +61     
  Branches     2281     2298      +17     
==========================================
+ Hits        19831    19889      +58     
- Misses       2347     2348       +1     
- Partials      313      315       +2     
Flag Coverage Δ
hypothesis-py 25.27% <9.09%> (-0.12%) ⬇️
python 92.66% <96.10%> (+0.01%) ⬆️
rust 81.62% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bluegenes
Copy link
Copy Markdown
Contributor

  • tax metagenome in v5 (and with --v5) should REQUIRE abundances unless an option (--no-abund?) is provided;
  • tax metagenome in v4 should WARN or ERROR if (a) no abundances are provided (b) abundances are available but not being used

I like it!

@ctb ctb changed the base branch from latest to v5_sig_manifest June 28, 2025 15:11
@ctb
Copy link
Copy Markdown
Contributor Author

ctb commented Jul 29, 2025

See new validation workflow here: https://github.com/sourmash-bio/2025-validate-sourmash-tax-formats.

@ctb ctb changed the title WIP: adjust warnings around tax abund and provide v5 upgrades to tax metagenome MRG: adjust warnings around tax abund and provide v5 upgrades to tax metagenome Jul 31, 2025
@ctb
Copy link
Copy Markdown
Contributor Author

ctb commented Jul 31, 2025

Ready for review @bluegenes!

This got really complicated from a code and testing perspective. My suggestion for detailed review efforts are to confirm that the various test_metagenome_abund_reporting_* tests match the PR descr and docs... Those tests should be comprehensive for the abundance-related changes in this PR.

Copy link
Copy Markdown
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ctb ctb merged commit 33222d1 into latest Jul 31, 2025
43 checks passed
@ctb ctb deleted the update_tax_abund branch July 31, 2025 15:44
ctb added a commit that referenced this pull request Aug 7, 2025
Major new features:

* start writing v4->v5 migration docs (#3721)
* adjust warnings around tax abund and provide v5 upgrades to `tax
metagenome` (#3711)

Minor new features:

* try setting up --v4 and --v5 behavior differences for `sig check`
(#3072)
* update `sig manifest` default rebuilding behavior for v5. (#3074)
* handle (ignore) empty taxids for `bioboxes` format (#3748)
* improve summary_csv for lingroups (#3758)

Cleanup and documentation updates:

* use auto-generated database list (#3754)

Developer updates:

* CI: fix dependabot config syntax, and clippy beta lints (#3762)
* CI: update to cibuildwheel 3.1.1 (#3738)
* ci: group dependabot updates by language (#3749)
* Remove docutils dep (#3769)
* bump version to 4.9.4-dev (#3715)
* disable WebAssembly builds, for now (#3724)

Dependabot updates:

* Build(ci): Bump actions/download-artifact from 4 to 5 (#3766)
* Build(deps): Bump DeterminateSystems/nix-installer-action from 17 to
18 (#3727)
* Build(deps): Bump DeterminateSystems/nix-installer-action from 18 to
19 (#3746)
* Build(deps): Bump criterion from 0.6.0 to 0.7.0 (#3741)
* Build(deps): Bump md5 from 0.7.0 to 0.8.0 (#3719)
* Build(deps): Bump memmap2 from 0.9.5 to 0.9.7 (#3732)
* Build(deps): Bump prefix-dev/setup-pixi from 0.8.10 to 0.8.11 (#3733)
* Build(deps): Bump prefix-dev/setup-pixi from 0.8.11 to 0.8.14 (#3747)
* Build(deps): Bump rand from 0.9.1 to 0.9.2 (#3743)
* Build(deps): Bump serde_json from 1.0.140 to 1.0.141 (#3742)
* [pre-commit.ci] pre-commit autoupdate (#3718)
* [pre-commit.ci] pre-commit autoupdate (#3725)
* [pre-commit.ci] pre-commit autoupdate (#3731)
* [pre-commit.ci] pre-commit autoupdate (#3737)
* [pre-commit.ci] pre-commit autoupdate (#3740)
* [pre-commit.ci] pre-commit autoupdate (#3756)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants