fastx_quality_stats bug in median calculation #15

@winni2k

Hi, when I run the command

fastx_quality_stats -i input.fastq -o output.stats

where input.fastq consists of

@0 <unknown description>
A
+
]
@1 <unknown description>
A
+
]

then output.stats consists of

column  count   min     max     sum     mean    Q1      med     Q3      IQR     lW      rW      A_Count C_Count G_Count T_Count N_Count Max_count
1       2       60      60      120     60.00   60      50      50      -10     75      35      2       0       0       0       0       2

Note that the med column has a value of 50, whereas the mean column has a value of 60, even though the two quality scores in input.fastq are identical (]). Q3 is also 50, which is lower than Q1 (60) and gives a negative IQR of -10.

I believe this is a bug?
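
For comparison, here is a minimal Python sketch of the values one would expect for this input. It is not the toolkit's code: the function name `column_quality_stats` is hypothetical, and the Phred+33 offset is an assumption, chosen because it maps `]` (ASCII 93) to the 60 reported in the min, max and mean columns.

```python
import statistics

def column_quality_stats(quality_strings, column=0, offset=33):
    """Summarise the quality scores found at one column (cycle).

    `offset` is an assumption (Phred+33); it is not taken from the toolkit.
    """
    scores = [ord(q[column]) - offset for q in quality_strings if len(q) > column]
    return {
        "count": len(scores),
        "min": min(scores),
        "max": max(scores),
        "sum": sum(scores),
        "mean": statistics.mean(scores),
        "med": statistics.median(scores),
    }

# The two quality lines from input.fastq above.
stats = column_quality_stats(["]", "]"])
print(stats)
```

With two identical scores, the median has to equal the mean, so a med of 60 rather than 50 is what this sketch produces.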

PS fastx_quality_stats -h prints:

usage: fastx_quality_stats [-h] [-N] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.14 by A. Gordon ([email protected])

   [-h] = This helpful help screen.
   [-i INFILE]  = FASTQ input file. default is STDIN.
   [-o OUTFILE] = TEXT output file. default is STDOUT.
   [-N]         = New output format (with more information per nucleotide/cycle).

The *OLD* output TEXT file will have the following fields (one row per column):
	column	= column number (1 to 36 for a 36-cycles read solexa file)
	count   = number of bases found in this column.
	min     = Lowest quality score value found in this column.
	max     = Highest quality score value found in this column.
	sum     = Sum of quality score values for this column.
	mean    = Mean quality score value for this column.
	Q1	= 1st quartile quality score.
	med	= Median quality score.
	Q3	= 3rd quartile quality score.
	IQR	= Inter-Quartile range (Q3-Q1).
	lW	= 'Left-Whisker' value (for boxplotting).
	rW	= 'Right-Whisker' value (for boxplotting).
	A_Count	= Count of 'A' nucleotides found in this column.
	C_Count	= Count of 'C' nucleotides found in this column.
	G_Count	= Count of 'G' nucleotides found in this column.
	T_Count	= Count of 'T' nucleotides found in this column.
	N_Count = Count of 'N' nucleotides found in this column.
	max-count = max. number of bases (in all cycles)


The *NEW* output format:
	cycle (previously called 'column') = cycle number
	max-count
	For each nucleotide in the cycle (ALL/A/C/G/T/N):
		count   = number of bases found in this column.
		min     = Lowest quality score value found in this column.
		max     = Highest quality score value found in this column.
		sum     = Sum of quality score values for this column.
		mean    = Mean quality score value for this column.
		Q1	= 1st quartile quality score.
		med	= Median quality score.
		Q3	= 3rd quartile quality score.
		IQR	= Inter-Quartile range (Q3-Q1).
		lW	= 'Left-Whisker' value (for boxplotting).
		rW	= 'Right-Whisker' value (for boxplotting).

