Skip to content

Conversation

@jonas-kleck
Copy link
Collaborator

Added documentation for Histogram Fits. Added an example that shows, why we can't fit a histogram using XYFit so easily, so why to use HistFit for histogrammed data, since @GuenterQuast suggested that histogram fits are often problematic for students. I am a bit unsure about the example, should I include more/other problems?

@GuenterQuast
Copy link
Member

GuenterQuast commented Sep 18, 2025 via email

Copy link
Collaborator

@JohannesGaessler JohannesGaessler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to demonstrate the prefit technique but in the context of kafe2 I think we should recommend the use of either a Poisson likelihood or its Gaussian approximation (depending on whether additional y errors are needed).

Histogram Fits
---------------

A very common fit type is the histogram fit. *kafe2* provides a dedicated fitting class for histogram fits,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A very common fit type is the histogram fit. *kafe2* provides a dedicated fitting class for histogram fits,
In physics experiments data is frequently histogrammed in order to reduce the data to a manageable number of bins.
*kafe2* provides a dedicated fitting class for histogram fits,

Comment on lines 412 to 414
Depending on whether the modelfunction is already normalised, or has a normalisation constant,
that is also supposed to be estimated in the fit the `density` keyword can be used, during creation
of the `HistFit` object.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Depending on whether the modelfunction is already normalised, or has a normalisation constant,
that is also supposed to be estimated in the fit the `density` keyword can be used, during creation
of the `HistFit` object.
By default it is assumed that the model function for a `HistFit` object is a probability density function normalised to 1.
As a consequence the bin contents are also being normalized to 1.
To disable this behavior, set `density=False` in the `HistFit` constructor.

I think this explains the behavior more explicitly. We can still mention the fitting of normalization constants, though you could in principle also have a model function with a constant normalization that you're not fitting.

Comment on lines 7 to 8
This example demonstrates why it is more convenient to use the HistFit class
instead of XYFit when dealing with histogrammed data.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This example demonstrates why it is more convenient to use the HistFit class
instead of XYFit when dealing with histogrammed data.
This example demonstrates a scenario in which it is more convenient to use the
HistFit class rather than the XYFit class.

Comment on lines 10 to 12
While it is in principle possible to perform such fits correctly using XYFit,
it requires much more care. This example shows common mistakes that can occur
when that necessary care is not taken and how this makes the fit results worse.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
While it is in principle possible to perform such fits correctly using XYFit,
it requires much more care. This example shows common mistakes that can occur
when that necessary care is not taken and how this makes the fit results worse.
While it is in principle possible to perform such fits correctly using XYFit,
more care must be taken to avoid unusable or biased results.
This example shows common problems that can occur when using XYFit and how to fix them.
HistFit handles these problems automatically.

x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values
y_data = bincounts/(np.sum(bincounts)*np.diff(binedges))
print(y_data) #use normalized histogram as y data
x_error = 0.25 #use half binwidht as x_error
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
x_error = 0.25 #use half binwidht as x_error
x_error = 0.25 #use half binwidth as x_error

Copy link
Member

@cverstege cverstege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some quick comments. Also, formatting is missing. You can run the black formatter to automatically fix the issues.

We will especially look at two scenarios that can arise from having only small amounts of data.
This is, for example, a problem in the search for new rare processes in high-energy physics.
Assume we are looking for a Gaussian signal peak, free from background for simplicity
(the treatment of a signal over background is explained in example 03_SpluBfit.py).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(the treatment of a signal over background is explained in example 03_SpluBfit.py).
(the treatment of a signal over background is explained in example 03_SplusBfit.py).

x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values
y_data = bincounts/(np.sum(bincounts)*np.diff(binedges))
print(y_data) #use normalized histogram as y data
x_error = 0.25 #use half binwidht as x_error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
x_error = 0.25 #use half binwidht as x_error
x_error = 0.25 #use half binwidth as x_error

binedges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])

#Now naively perform a default XYFit
x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather call it bincenters than binmids



"""
This first fit should give you warnings about the cost function being evaluated as infinite, which comes from the empty bin.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This first fit should give you warnings about the cost function being evaluated as infinite, which comes from the empty bin.
This first fit raises warnings about the cost function being evaluated as infinite, which originates from the empty bin.


"""
This first fit should give you warnings about the cost function being evaluated as infinite, which comes from the empty bin.
Also, the result of the fit should not return any good values. This is the result of the empty bin.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Also, the result of the fit should not return any good values. This is the result of the empty bin.
Also, the result of the fit does not return any good values. This is the result of the empty bin.

This first fit should give you warnings about the cost function being evaluated as infinite, which comes from the empty bin.
Also, the result of the fit should not return any good values. This is the result of the empty bin.
It can be seen in the following that even if we combine bins in order to get rid of the empty bin,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It can be seen in the following that even if we combine bins in order to get rid of the empty bin,
It can be seen in the following that even if the bins are combined in order to get rid of the empty bin,

binedges = np.array([-2, -1, -0.5, 0, 0.5, 1, 2])

#Now again perform a default XYFit
x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bincenters instead of binmids

#Now again perform a default XYFit
x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values
y_data = bincounts/(np.sum(bincounts)*np.diff(binedges)) #use normalized histogram as y data
x_error = np.diff(binedges)/2 #use half binwidht as x_error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
x_error = np.diff(binedges)/2 #use half binwidht as x_error
x_error = np.diff(binedges)/2 #use half binwidth as x_error

plt.show()

"""
The uncertainties on the result were already reduced slightly in comparison to the prefit. And by looking at the printed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The uncertainties on the result were already reduced slightly in comparison to the prefit. And by looking at the printed
The uncertainties on the result were already reduced slightly in comparison to the prefit. And by looking at the

@jonas-kleck
Copy link
Collaborator Author

After talking to @JohannesGaessler, I changed the example a bit. It should show more, how to recreate the Histfit and what the HistFit does internally. I also added a short part about when to use the gaussian approximation cost function.

Copy link
Collaborator

@JohannesGaessler JohannesGaessler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the example should be more clear that the "Guassian approximation" in the beginning and the "Gaussian approximation" cost function do different things: one is using the square root of the data, the other one is using the square root of the model.

The behavior that we should aim for is to use the Poisson likelihood cost function by default and to automatically switch to the Gaussian approximation if the user specifies Gaussian errors (I think I already implemented this).

---------------

In physics experiments data is frequently histogrammed in order to reduce the data to a manageable number of bins.
*kafe2* provides a dedicated fitting class for histogram fits, intendet to be used when datapoints are obtained
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
*kafe2* provides a dedicated fitting class for histogram fits, intendet to be used when datapoints are obtained
*kafe2* provides a dedicated fitting class for histogram fits, intended to be used when datapoints are obtained


In physics experiments data is frequently histogrammed in order to reduce the data to a manageable number of bins.
*kafe2* provides a dedicated fitting class for histogram fits, intendet to be used when datapoints are obtained
from a random distribution. Especiallywhen large numbers of datapoints are present it is more efficient to treat
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from a random distribution. Especiallywhen large numbers of datapoints are present it is more efficient to treat
from a random distribution. Especially when large numbers of datapoints are present it is more efficient to treat

In physics experiments data is frequently histogrammed in order to reduce the data to a manageable number of bins.
*kafe2* provides a dedicated fitting class for histogram fits, intendet to be used when datapoints are obtained
from a random distribution. Especiallywhen large numbers of datapoints are present it is more efficient to treat
the data as a histogram. To perform a histogram fit, the datapoints have to be filled into a :py:obj:`HistContainer`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the data as a histogram. To perform a histogram fit, the datapoints have to be filled into a :py:obj:`HistContainer`.
the data as a histogram. To perform a histogram fit using raw data, the datapoints have to be filled into a :py:obj:`HistContainer`.

Comment on lines 412 to 413
By default it is assumed that the model function for a `HistFit` object is a probability density function normalised to 1.
As a consequence the bin contents are also being normalized to 1.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
By default it is assumed that the model function for a `HistFit` object is a probability density function normalised to 1.
As a consequence the bin contents are also being normalized to 1.
By default it is assumed that the model function for a `HistFit` object is a probability density function normalized to 1.
As a consequence the bin contents are also being normalized to 1.

I don't particularly care which spelling convention we use but we should be consistent.

HistFit class rather than the XYFit class.
While it is in principle possible to perform such fits correctly using XYFit,
more care must be taken to avoid unusable or biased results.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
more care must be taken to avoid unusable or biased results.
more care must be taken to avoid unusable or biased fit results.

Comment on lines 19 to 22
The default XYFit has some problems that have to be addressed, when histogrammed data is
processed. First of all, when using an XYFit, the data is passed to the Fit object in a
XYContainer. This container does not automatically fill the datapoints in bins. So this first step
has to be done manually. In the following, 30 datapoints, sampled from a normal distribution, are used.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just start with something like "The XYFit does not have any built-in functionality for transforming raw data into a histogram. So if we want to use it we need to do this step manually."

]
)

binedges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
binedges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])
bin_edges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])

I think it would make sense to put underscores between words for better legibility.

Comment on lines +87 to +95
When performing the fit, errors like "The cost function was evaluated as infinite" appear in the output.
Furthermore, when looking at the plot of our fit result, it is clear that this fit didn't return the result we expected.
The starting values for the fit are returned as best fit value, and the uncertainties are reported as NaN.
What happened? The problem arises because Poisson uncertainties were assumed for the bin counts.
If a bin is empty, the uncertainty is treated as zero by the fit. Thus the model function is forced to pass this datapoint
exactly or the cost function will be infinite.
This occures because the XYFit uses a χ²-cost function by default, which is only valid
for Gaussian uncertainties, but not in the case of Poisson uncertainties. To fix this, the
cost function of the fit is changed to a Poisson negative log-likelhoood (NLL).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to print a warning when using a cost function that needs errors and some of the errors are also == 0.

Another subtlety is the definition of the y_data: So far, simply the midpoint of each bin was used. This is only
a linear approximation of the behaviour of the model function between the bin edges. The HistFit class of kafe2
on the other hand uses "Simpsons rule", a method to approximate the behaviour quadratically for more accuracy.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
on the other hand uses "Simpsons rule", a method to approximate the behaviour quadratically for more accuracy.
on the other hand uses "Simpson's rule", a method to approximate the behaviour quadratically for more accuracy.

on the other hand uses "Simpsons rule", a method to approximate the behaviour quadratically for more accuracy.
The most accurate albeit computationally expensive method would be to integrate the model function over each bin.
The implementation of Simpsons rule in our procedure using a XYFit will not be done here,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The implementation of Simpsons rule in our procedure using a XYFit will not be done here,
The implementation of Simpson's rule in our procedure using a XYFit will not be done here,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants