Skip to content

Conversation

@kloetzl
Copy link
Contributor

@kloetzl kloetzl commented Aug 10, 2025

This commit fixes two small issues.

  • The lowercase b might be interpreted as bit.
  • Be slightly more precise that the 64KB limit applies to compressed and uncompressed data both.

@kloetzl kloetzl changed the title Fix BGZF block to be 64 kilobyte Fix BGZF block size to be 64 kilobyte Aug 10, 2025
@github-actions
Copy link

Changed PDFs as of e58c902: SAMv1 (diff).

@jmarshall
Copy link
Member

jmarshall commented Aug 10, 2025

The maximum uncompressed and compressed block sizes are clearly described in the paragraph after the table (“The random access method to be described next […]”), so this is just a clarification rather than a fix.

We use such abbreviations in a number of these specifications:

  • CRAMv2.1.tex and CRAMv3.tex correctly distinguish MB (megabytes, though could use MiB) and kb/Mb (kilobases and megabases, correctly decimal, and the correct abbreviation for “bases” in a bioinformatics context).
  • This occurrence is the only one in SAMv1.tex, and you are correct that it should be uppercase B. The resulting 64KB would be correct JEDEC notation but personally I'd rather use IEC notation (KiB) or spell it out as kilobytes (sigh) or kibibytes (probably not this one, as it's even less familiar than KiB!).
  • VCFv4.[123].tex each use Kb once which should be corrected to kb as it's referring to (decimal) kilobases.
  • VCFv4.[1-5].tex also use GB, which could be changed to GiB or just spelt out as gigabytes as it is in informal prose.
  • tabix.tex has one occurrence of kb which should be corrected to KB or KiB as per the SAMv1.tex occurrence correctly refers to kilobases.
  • crypt4gh.tex uses KiB (alongside 65536-byte segments, thus explaining the notation).

Here in 2025, do we feel that the kibi/mebi/etc prefixes are familiar enough for these specifications' readers that we can use them without comment? Then we can be very clear in distinguishing the radix difference between binary mebibytes and decimal megabases.

IMHO “and” and “or” are equivalent in this prose. But happy to change it for clarity, perhaps to “each no larger
than 64 KB both before and after compression“ for even more clarity.

@jmarshall jmarshall added the sam label Aug 10, 2025
@kloetzl
Copy link
Contributor Author

kloetzl commented Aug 11, 2025

Thanks for your effort of checking all of the other specifications, too. KiB is probably an even more precise notation. I'd be happy to change it.

@jkbonfield
Copy link
Contributor

I'm happy with KiB notations where relevant. Also thanks for this PR, but like John I'm less sure on the "or" / "and" text. I'm easy on that though.

@kloetzl
Copy link
Contributor Author

kloetzl commented Aug 11, 2025

Thank you for your comments; I have integrated the suggestions.

@jmarshall
Copy link
Member

jmarshall commented Aug 12, 2025

That was more of a commentary for the maintainers of the other specifications than a request for you to make all those changes… 😄

It's actually easier to consider changes to the different format specifications separately (in particular, because they need to be reviewed and merged by different people), so I have split this into additional #840 and #841 PRs accordingly.

I"ll change this PR back to contain only the SAMv1.tex changes; after this, I believe we have enough agreement from both @jkbonfield and myself to merge this trivial wording change.

@jmarshall jmarshall force-pushed the bam-64byte branch 2 times, most recently from 4e71cd1 to b5341fb Compare August 12, 2025 11:53
@jmarshall jmarshall merged commit b5341fb into samtools:master Aug 12, 2025
@jkbonfield
Copy link
Contributor

Given the confusion of bases vs bytes, some people use bp for base-pairs. I accept Kb vs kb is different, so maybe it's not an issue. However see https://en.wikipedia.org/wiki/Base_pair#As_a_unit_of_length which suggest bp. Note technically it's not true for single stranded things such as RNA and they suggest nt there, but I think that just makes it even more confusing and generally we're talking about DNA here.

@kloetzl
Copy link
Contributor Author

kloetzl commented Aug 12, 2025

so I have split this into additional #840 and #841 PRs accordingly.

Sure, fine with me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants