Histfit doc #258

jonas-kleck · 2025-09-18T09:45:20Z

Added documentation for histogram fits. Also added an example that shows why fitting a histogram with XYFit is not straightforward, and therefore why HistFit should be used for histogrammed data, since @GuenterQuast suggested that histogram fits are often problematic for students. I am a bit unsure about the example: should I include more or other problems?


GuenterQuast · 2025-09-18T10:23:26Z

Thank you for addressing this important issue! In addition to the problems of an xy-fit, which you mention correctly, there is another one: taking the uncertainty in a bin as sqrt(n_i_observed) leads to a bias, because downward fluctuations are assigned smaller uncertainties than upward fluctuations. As a remedy, the uncertainties may be taken from the prediction of a pre-fit if one insists on using an xy-fit. Using the Poisson likelihood of HistFit also fixes this issue. You may want to address and illustrate this point as well?
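The bias described above can be illustrated with a small toy study. The following is only a sketch under simplifying assumptions that are not part of the PR: the model is a single constant rate, the true rate of 10 and the seed are arbitrary, and the fits are replaced by their closed-form minima. For a constant model, a chi-square fit with sigma_i = sqrt(n_i) has its minimum at the harmonic mean of the counts, while the Poisson likelihood has its maximum at the arithmetic mean.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary fixed seed for reproducibility
true_rate = 10.0      # assumed true expected count per bin
n_bins = 20
n_experiments = 5000

chi2_estimates = []
ml_estimates = []
for _ in range(n_experiments):
    counts = rng.poisson(true_rate, size=n_bins)
    # Empty bins would get zero uncertainty, so they are dropped here.
    filled = counts[counts > 0]
    # Minimum of sum((n_i - mu)^2 / n_i): the harmonic mean of the counts.
    chi2_estimates.append(len(filled) / np.sum(1.0 / filled))
    # Maximum of the Poisson likelihood for a constant rate: the arithmetic mean.
    ml_estimates.append(np.mean(counts))

chi2_mean = np.mean(chi2_estimates)
ml_mean = np.mean(ml_estimates)
print(f"true rate:                     {true_rate}")
print(f"chi2 with sqrt(n_i) errors:    {chi2_mean:.2f}")
print(f"Poisson maximum likelihood:    {ml_mean:.2f}")
```

Averaged over many toy experiments, the estimate based on sqrt(n_i) uncertainties comes out below the true rate, whereas the Poisson estimate does not show this shift.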


JohannesGaessler

I think it's fine to demonstrate the prefit technique but in the context of kafe2 I think we should recommend the use of either a Poisson likelihood or its Gaussian approximation (depending on whether additional y errors are needed).

JohannesGaessler · 2025-09-22T12:16:56Z

doc/src/parts/user_guide.rst

+Histogram Fits
+---------------
+
+A very common fit type is the histogram fit. *kafe2* provides a dedicated fitting class for histogram fits,


Suggested change

-A very common fit type is the histogram fit. *kafe2* provides a dedicated fitting class for histogram fits,
+In physics experiments data is frequently histogrammed in order to reduce the data to a manageable number of bins.
+*kafe2* provides a dedicated fitting class for histogram fits,

JohannesGaessler · 2025-09-22T12:27:31Z

doc/src/parts/user_guide.rst

+Depending on whether the modelfunction is already normalised, or has a normalisation constant,
+that is also supposed to be estimated in the fit the `density` keyword can be used, during creation
+of the `HistFit` object.


Suggested change

-Depending on whether the modelfunction is already normalised, or has a normalisation constant,
-that is also supposed to be estimated in the fit the `density` keyword can be used, during creation
-of the `HistFit` object.
+By default it is assumed that the model function for a `HistFit` object is a probability density function normalised to 1.
+As a consequence the bin contents are also being normalized to 1.
+To disable this behavior, set `density=False` in the `HistFit` constructor.

I think this explains the behavior more explicitly. We can still mention the fitting of normalization constants, though you could in principle also have a model function with a constant normalization that you're not fitting.
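To make the normalization statement concrete, here is a minimal NumPy sketch of what a model normalized to 1 predicts per bin. It does not call kafe2 and is not a description of its internals; the standard normal model and the helper `normal_cdf` are assumptions made for illustration, while the bin edges and the 30 entries are taken from the example under review.

```python
import math
import numpy as np


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


bin_edges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])
n_entries = 30  # number of raw data points in the example

# A density normalized to 1 only predicts the fraction of entries per bin.
bin_probabilities = np.array(
    [normal_cdf(hi) - normal_cdf(lo) for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
)

# The absolute scale has to come from somewhere else, here the number of entries.
expected_counts = n_entries * bin_probabilities

print("fraction of the density inside the bin range:", bin_probabilities.sum())
print("expected counts per bin:", np.round(expected_counts, 2))
```

If the model function already carries its own normalization (fitted or constant), this rescaling by the number of entries is not wanted, which is the situation `density=False` is meant for.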

JohannesGaessler · 2025-09-22T12:49:04Z

examples/009_histogram_fit/04_pitfalls.py

+This example demonstrates why it is more convenient to use the HistFit class
+instead of XYFit when dealing with histogrammed data.


Suggested change

-This example demonstrates why it is more convenient to use the HistFit class
-instead of XYFit when dealing with histogrammed data.
+This example demonstrates a scenario in which it is more convenient to use the
+HistFit class rather than the XYFit class.

JohannesGaessler · 2025-09-22T12:52:15Z

examples/009_histogram_fit/04_pitfalls.py

+While it is in principle possible to perform such fits correctly using XYFit,
+it requires much more care. This example shows common mistakes that can occur
+when that necessary care is not taken and how this makes the fit results worse.


Suggested change

-While it is in principle possible to perform such fits correctly using XYFit,
-it requires much more care. This example shows common mistakes that can occur
-when that necessary care is not taken and how this makes the fit results worse.
+While it is in principle possible to perform such fits correctly using XYFit,
+more care must be taken to avoid unusable or biased results.
+This example shows common problems that can occur when using XYFit and how to fix them.
+HistFit handles these problems automatically.

JohannesGaessler · 2025-09-22T12:54:04Z

examples/009_histogram_fit/04_pitfalls.py

+x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values
+y_data = bincounts/(np.sum(bincounts)*np.diff(binedges))
+print(y_data) #use normalized histogram as y data
+x_error = 0.25 #use half binwidht as x_error


Suggested change

-x_error = 0.25 #use half binwidht as x_error
+x_error = 0.25 #use half binwidth as x_error

cverstege

Just some quick comments. Also, formatting is missing. You can run the black formatter to automatically fix the issues.

cverstege · 2025-09-22T14:44:50Z

examples/009_histogram_fit/04_pitfalls.py

+We will especially look at two scenarios that can arise from having only small amounts of data.
+This is, for example, a problem in the search for new rare processes in high-energy physics.
+Assume we are looking for a Gaussian signal peak, free from background for simplicity
+(the treatment of a signal over background is explained in example 03_SpluBfit.py).


Suggested change

-(the treatment of a signal over background is explained in example 03_SpluBfit.py).
+(the treatment of a signal over background is explained in example 03_SplusBfit.py).

cverstege · 2025-09-22T14:45:49Z

examples/009_histogram_fit/04_pitfalls.py

+x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values
+y_data = bincounts/(np.sum(bincounts)*np.diff(binedges))
+print(y_data) #use normalized histogram as y data
+x_error = 0.25 #use half binwidht as x_error


Suggested change

-x_error = 0.25 #use half binwidht as x_error
+x_error = 0.25 #use half binwidth as x_error

cverstege · 2025-09-22T14:46:31Z

examples/009_histogram_fit/04_pitfalls.py

+binedges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])
+
+#Now naively perform a default XYFit
+x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values


I'd rather call it bincenters than binmids

cverstege · 2025-09-22T14:47:51Z

examples/009_histogram_fit/04_pitfalls.py

+
+
+"""
+This first fit should give you warnings about the cost function being evaluated as infinite, which comes from the empty bin.


Suggested change

-This first fit should give you warnings about the cost function being evaluated as infinite, which comes from the empty bin.
+This first fit raises warnings about the cost function being evaluated as infinite, which originates from the empty bin.

cverstege · 2025-09-22T14:48:08Z

examples/009_histogram_fit/04_pitfalls.py

+
+"""
+This first fit should give you warnings about the cost function being evaluated as infinite, which comes from the empty bin.
+Also, the result of the fit should not return any good values. This is the result of the empty bin.


Suggested change

-Also, the result of the fit should not return any good values. This is the result of the empty bin.
+Also, the result of the fit does not return any good values. This is the result of the empty bin.

cverstege · 2025-09-22T14:48:50Z

examples/009_histogram_fit/04_pitfalls.py

+This first fit should give you warnings about the cost function being evaluated as infinite, which comes from the empty bin.
+Also, the result of the fit should not return any good values. This is the result of the empty bin.
+
+It can be seen in the following that even if we combine bins in order to get rid of the empty bin,


Suggested change

-It can be seen in the following that even if we combine bins in order to get rid of the empty bin,
+It can be seen in the following that even if the bins are combined in order to get rid of the empty bin,

cverstege · 2025-09-22T14:49:16Z

examples/009_histogram_fit/04_pitfalls.py

+binedges = np.array([-2, -1, -0.5, 0, 0.5, 1, 2])
+
+#Now again perform a default XYFit
+x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values


bincenters instead of binmids

cverstege · 2025-09-22T16:03:47Z

examples/009_histogram_fit/04_pitfalls.py

+#Now again perform a default XYFit
+x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values
+y_data = bincounts/(np.sum(bincounts)*np.diff(binedges)) #use normalized histogram as y data
+x_error = np.diff(binedges)/2 #use half binwidht as x_error


Suggested change

-x_error = np.diff(binedges)/2 #use half binwidht as x_error
+x_error = np.diff(binedges)/2 #use half binwidth as x_error

cverstege · 2025-09-22T16:05:18Z

examples/009_histogram_fit/04_pitfalls.py

+plt.show()
+
+"""
+The uncertainties on the result were already reduced slightly in comparison to the prefit. And by looking at the printed


Suggested change

-The uncertainties on the result were already reduced slightly in comparison to the prefit. And by looking at the printed
+The uncertainties on the result were already reduced slightly in comparison to the prefit. And by looking at the


jonas-kleck · 2025-10-01T08:08:07Z

After talking to @JohannesGaessler, I changed the example a bit. It should now show more clearly how to recreate a HistFit with XYFit and what HistFit does internally. I also added a short part about when to use the Gaussian approximation cost function.

JohannesGaessler

I think the example should be clearer that the "Gaussian approximation" in the beginning and the "Gaussian approximation" cost function do different things: one uses the square root of the data, the other uses the square root of the model.

The behavior that we should aim for is to use the Poisson likelihood cost function by default and to automatically switch to the Gaussian approximation if the user specifies Gaussian errors (I think I already implemented this).
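The distinction between the two approximations and the Poisson likelihood can be written down in a few lines. This is a hedged sketch only: the function names, bin counts, and model prediction are hypothetical, constant terms of the likelihood are dropped, and the exact form of the cost functions implemented in kafe2 may differ (for example by additional terms).

```python
import numpy as np


def chi2_data_errors(counts, prediction):
    """Chi-square with sigma_i = sqrt(observed count n_i)."""
    # An empty bin has sigma_i = 0, so any deviation makes the sum infinite.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.sum((counts - prediction) ** 2 / counts)


def chi2_model_errors(counts, prediction):
    """Chi-square with sigma_i = sqrt(predicted count m_i)."""
    return np.sum((counts - prediction) ** 2 / prediction)


def poisson_nll(counts, prediction):
    """Twice the negative Poisson log-likelihood, constant terms dropped."""
    return 2.0 * np.sum(prediction - counts * np.log(prediction))


# Hypothetical bin counts with one empty bin and a hypothetical model prediction.
counts = np.array([1.0, 3.0, 0.0, 7.0, 8.0, 5.0, 3.0, 1.0])
prediction = np.array([0.9, 2.5, 4.8, 6.7, 6.7, 4.8, 2.5, 0.9])

cost_data = chi2_data_errors(counts, prediction)
cost_model = chi2_model_errors(counts, prediction)
cost_poisson = poisson_nll(counts, prediction)

print("chi2 with sqrt(data) errors: ", cost_data)
print("chi2 with sqrt(model) errors:", cost_model)
print("Poisson NLL (times two):     ", cost_poisson)
```

With an empty bin, only the variant based on the square root of the data becomes infinite; the other two stay finite as long as the model prediction is positive in every bin.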

JohannesGaessler · 2025-10-06T11:57:54Z

doc/src/parts/user_guide.rst

+---------------
+
+In physics experiments data is frequently histogrammed in order to reduce the data to a manageable number of bins.
+*kafe2* provides a dedicated fitting class for histogram fits, intendet to be used when datapoints are obtained


Suggested change

-*kafe2* provides a dedicated fitting class for histogram fits, intendet to be used when datapoints are obtained
+*kafe2* provides a dedicated fitting class for histogram fits, intended to be used when datapoints are obtained

JohannesGaessler · 2025-10-06T11:58:13Z

doc/src/parts/user_guide.rst

+
+In physics experiments data is frequently histogrammed in order to reduce the data to a manageable number of bins.
+*kafe2* provides a dedicated fitting class for histogram fits, intendet to be used when datapoints are obtained
+from a random distribution. Especiallywhen large numbers of datapoints are present it is more efficient to treat


Suggested change

-from a random distribution. Especiallywhen large numbers of datapoints are present it is more efficient to treat
+from a random distribution. Especially when large numbers of datapoints are present it is more efficient to treat

JohannesGaessler · 2025-10-06T12:08:07Z

doc/src/parts/user_guide.rst

+In physics experiments data is frequently histogrammed in order to reduce the data to a manageable number of bins.
+*kafe2* provides a dedicated fitting class for histogram fits, intendet to be used when datapoints are obtained
+from a random distribution. Especiallywhen large numbers of datapoints are present it is more efficient to treat
+the data as a histogram. To perform a histogram fit, the datapoints have to be filled into a :py:obj:`HistContainer`.


Suggested change

-the data as a histogram. To perform a histogram fit, the datapoints have to be filled into a :py:obj:`HistContainer`.
+the data as a histogram. To perform a histogram fit using raw data, the datapoints have to be filled into a :py:obj:`HistContainer`.

JohannesGaessler · 2025-10-06T12:12:27Z

doc/src/parts/user_guide.rst

+By default it is assumed that the model function for a `HistFit` object is a probability density function normalised to 1.
+As a consequence the bin contents are also being normalized to 1.


Suggested change

-By default it is assumed that the model function for a `HistFit` object is a probability density function normalised to 1.
-As a consequence the bin contents are also being normalized to 1.
+By default it is assumed that the model function for a `HistFit` object is a probability density function normalized to 1.
+As a consequence the bin contents are also being normalized to 1.

I don't particularly care which spelling convention we use but we should be consistent.

JohannesGaessler · 2025-10-06T12:13:10Z

examples/009_histogram_fit/04_pitfalls.py

+HistFit class rather than the XYFit class.
+
+While it is in principle possible to perform such fits correctly using XYFit,
+more care must be taken to avoid unusable or biased results.


Suggested change

-more care must be taken to avoid unusable or biased results.
+more care must be taken to avoid unusable or biased fit results.

JohannesGaessler · 2025-10-06T12:23:39Z

examples/009_histogram_fit/04_pitfalls.py

+The default XYFit has some problems that have to be addressed, when histogrammed data is
+processed. First of all, when using an XYFit, the data is passed to the Fit object in a
+XYContainer. This container does not automatically fill the datapoints in bins. So this first step
+has to be done manually. In the following, 30 datapoints, sampled from a normal distribution, are used.


I think we should just start with something like "The XYFit does not have any built-in functionality for transforming raw data into a histogram. So if we want to use it we need to do this step manually."
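For reference, the manual histogramming step could look roughly like the following. This is a sketch with plain NumPy, not the code from the example: the raw data are generated here as a stand-in for the 30 data points, the seed is arbitrary, and the variable names follow the naming suggested in this review.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, stand-in data only
raw_data = rng.normal(loc=0.0, scale=1.0, size=30)

bin_edges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])

# Fill the raw data into bins; entries outside the bin range are ignored.
bin_counts, _ = np.histogram(raw_data, bins=bin_edges)

bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])  # x values for the fit
bin_widths = np.diff(bin_edges)

# Normalize to a density so that the histogram integrates to 1.
y_data = bin_counts / (np.sum(bin_counts) * bin_widths)
x_error = bin_widths / 2  # half the bin width as x uncertainty

print("bin centers:", bin_centers)
print("bin counts: ", bin_counts)
print("density:    ", np.round(y_data, 3))
```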

JohannesGaessler · 2025-10-06T12:37:39Z

examples/009_histogram_fit/04_pitfalls.py

+    ]
+)
+
+binedges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])


Suggested change

-binedges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])
+bin_edges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])

I think it would make sense to put underscores between words for better legibility.

JohannesGaessler · 2025-10-06T12:40:05Z

examples/009_histogram_fit/04_pitfalls.py

+When performing the fit, errors like "The cost function was evaluated as infinite" appear in the output. 
+Furthermore, when looking at the plot of our fit result, it is clear that this fit didn't return the result we expected.
+The starting values for the fit are returned as best fit value, and the uncertainties are reported as NaN.
+What happened? The problem arises because Poisson uncertainties were assumed for the bin counts.
+If a bin is empty, the uncertainty is treated as zero by the fit. Thus the model function is forced to pass this datapoint
+exactly or the cost function will be infinite.
+This occures because the XYFit uses a χ²-cost function by default, which is only valid 
+for Gaussian uncertainties, but not in the case of Poisson uncertainties. To fix this, the
+cost function of the fit is changed to a Poisson negative log-likelhoood (NLL).


I think it would make sense to print a warning when using a cost function that needs errors and some of the errors are also == 0.

JohannesGaessler · 2025-10-06T12:43:33Z

examples/009_histogram_fit/04_pitfalls.py

+
+Another subtlety is the definition of the y_data: So far, simply the midpoint of each bin was used. This is only
+a linear approximation of the behaviour of the model function between the bin edges. The HistFit class of kafe2
+on the other hand uses "Simpsons rule", a method to approximate the behaviour quadratically for more accuracy.


Suggested change

-on the other hand uses "Simpsons rule", a method to approximate the behaviour quadratically for more accuracy.
+on the other hand uses "Simpson's rule", a method to approximate the behaviour quadratically for more accuracy.

JohannesGaessler · 2025-10-06T12:43:51Z

examples/009_histogram_fit/04_pitfalls.py

+on the other hand uses "Simpsons rule", a method to approximate the behaviour quadratically for more accuracy.
+The most accurate albeit computationally expensive method would be to integrate the model function over each bin.
+
+The implementation of Simpsons rule in our procedure using a XYFit will not be done here,


Suggested change

-The implementation of Simpsons rule in our procedure using a XYFit will not be done here,
+The implementation of Simpson's rule in our procedure using a XYFit will not be done here,
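As a rough illustration of the three bin evaluation options discussed here, the sketch below compares the midpoint value, a single-interval Simpson's rule, and the exact integral for a standard normal density. The model, the helper names, and the use of one Simpson interval per bin are assumptions for illustration; how kafe2 applies Simpson's rule internally is not shown in this thread.

```python
import math
import numpy as np


def normal_pdf(x):
    """Standard normal probability density."""
    return np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


bin_edges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])
lo, hi = bin_edges[:-1], bin_edges[1:]
mid = 0.5 * (lo + hi)
width = hi - lo

# Midpoint rule: density at the bin center times the bin width.
midpoint = width * normal_pdf(mid)

# Simpson's rule on one interval per bin: quadratic interpolation through
# the lower edge, the center, and the upper edge.
simpson = width / 6.0 * (normal_pdf(lo) + 4.0 * normal_pdf(mid) + normal_pdf(hi))

# Exact integral of the density over each bin.
exact = np.array([normal_cdf(float(b)) - normal_cdf(float(a)) for a, b in zip(lo, hi)])

midpoint_error = np.max(np.abs(midpoint - exact))
simpson_error = np.max(np.abs(simpson - exact))
print("largest midpoint error:", midpoint_error)
print("largest Simpson error: ", simpson_error)
```

For this smooth density and these bin widths, Simpson's rule is considerably closer to the exact bin integrals than the midpoint value, at the cost of three function evaluations per bin instead of one.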

jonas-kleck added 2 commits September 18, 2025 11:32

added documentation for histogram fits, added an example showing why …

29d82ea

…to use histogram fits

edited Histogram Fit section

3dcaf2c

jonas-kleck added 2 commits September 18, 2025 20:07

Added prefit as solution to the incorrectly specified uncertainties t…

a548b9b

…o example

corrected documentation

d283fbc

JohannesGaessler reviewed Sep 22, 2025


cverstege reviewed Sep 22, 2025


jonas-kleck added 4 commits September 22, 2025 19:36

reformatting and reformulation of histogram fit example and documenta…

222df55

…tion

incorpoate feedback into histogram fit section section

ca710c8

changed the example. The focus lies more on 'recreating' a HistFit us…

337e427

…ing XYFit

changed the data compatibility check for XYFits using the poisson NLL…

46c9e44

… to accept also non integer binedges

JohannesGaessler reviewed Oct 6, 2025


jonas-kleck added 3 commits October 7, 2025 11:02

revised histogram fit section

c68cbe4

improved explanations in example

a0958b7

edited warning message for zero errors in chi2 cost function

08779ce

JohannesGaessler force-pushed the dev branch from 39ebd41 to 0cb478c on October 23, 2025 08:58

