Histfit doc #258
base: dev
Conversation
Thank you for addressing this important issue!
In addition to the problems of an xy-fit, which you correctly mention, there is another one: taking the uncertainty in a bin as sqrt(n_i_observed) leads to a bias, because downward fluctuations are assigned smaller uncertainties than upward fluctuations. As a remedy, if one insists on using an xy-fit, the uncertainties may be taken from the prediction of a pre-fit. Using the Poisson likelihood of HistFit also fixes this issue.
You may want to address and illustrate this point as well?
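The bias mentioned above can be demonstrated with a short toy study (a standalone sketch with made-up numbers, not code from this PR): for a constant model, minimizing a χ² with per-bin uncertainties sqrt(n_i_observed) yields the harmonic mean of the counts, which lies systematically below the true rate.

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy bin counts: many bins, all with the same true expectation of 100.
n = rng.poisson(lam=100.0, size=100_000)
n = n[n > 0]  # sigma = sqrt(n) is ill-defined for empty bins anyway

# Minimizing chi2 = sum((n_i - mu)^2 / n_i) over a constant mu gives the
# harmonic mean: downward fluctuations get smaller errors, hence more weight.
mu_sqrt_n = len(n) / np.sum(1.0 / n)

# With uniform (e.g. model-based) weights the estimate is the plain mean.
mu_uniform = np.mean(n)

print(f"sqrt(n_observed) weights: {mu_sqrt_n:.2f}  (biased low)")
print(f"uniform weights:          {mu_uniform:.2f}")
```

The Poisson likelihood (or uncertainties taken from a pre-fit prediction, as suggested above) avoids this weighting asymmetry.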
On 18.09.25 at 11:45, jonas-kleck wrote:
…
Added documentation for histogram fits. Added an example that shows why we can't easily fit a histogram using XYFit, and hence why to use HistFit for histogrammed data, since @GuenterQuast
<https://github.com/GuenterQuast> suggested that histogram fits are often problematic for students. I am a bit unsure about the example: should I include more/other problems?
------------------------------------------------------------------------
You can view, comment on, or merge this pull request online at: #258

Commit Summary
* 29d82ea: added documentation for histogram fits, added an example showing why to use histogram fits
* 3dcaf2c: edited Histogram Fit section

File Changes (2 files <https://github.com/PhiLFitters/kafe2/pull/258/files>)
* M doc/src/parts/user_guide.rst (23)
* A examples/009_histogram_fit/04_pitfalls.py (115)

Patch Links:
* https://github.com/PhiLFitters/kafe2/pull/258.patch
* https://github.com/PhiLFitters/kafe2/pull/258.diff
JohannesGaessler
left a comment
I think it's fine to demonstrate the prefit technique but in the context of kafe2 I think we should recommend the use of either a Poisson likelihood or its Gaussian approximation (depending on whether additional y errors are needed).
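For reference, the two options can be sketched in plain NumPy (function names here are made up for illustration and are not the kafe2 API): the Poisson negative log-likelihood needs no per-bin uncertainties at all, while the Gaussian χ² takes them as an input, which is where additional y errors would enter.

```python
import numpy as np

def nll_poisson(n, mu):
    # Poisson negative log-likelihood, dropping the mu-independent log(n!) term.
    # Finite even for empty bins (n == 0), as long as mu > 0.
    return np.sum(mu - n * np.log(mu))

def chi2_gauss(n, mu, sigma):
    # Gaussian chi-squared with externally supplied per-bin uncertainties.
    return np.sum((n - mu) ** 2 / sigma**2)

# Toy check: for a constant model the Poisson MLE is the sample mean.
n = np.random.default_rng(0).poisson(20.0, size=1000)
grid = np.linspace(15.0, 25.0, 2001)
mu_hat = grid[np.argmin([nll_poisson(n, g) for g in grid])]
print(mu_hat, n.mean())
```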
doc/src/parts/user_guide.rst
Outdated
    Histogram Fits
    ---------------

    A very common fit type is the histogram fit. *kafe2* provides a dedicated fitting class for histogram fits,
Suggested change:
- A very common fit type is the histogram fit. *kafe2* provides a dedicated fitting class for histogram fits,
+ In physics experiments data is frequently histogrammed in order to reduce the data to a manageable number of bins.
+ *kafe2* provides a dedicated fitting class for histogram fits,
doc/src/parts/user_guide.rst
Outdated
    Depending on whether the modelfunction is already normalised, or has a normalisation constant,
    that is also supposed to be estimated in the fit the `density` keyword can be used, during creation
    of the `HistFit` object.
Suggested change:
- Depending on whether the modelfunction is already normalised, or has a normalisation constant,
- that is also supposed to be estimated in the fit the `density` keyword can be used, during creation
- of the `HistFit` object.
+ By default it is assumed that the model function for a `HistFit` object is a probability density function normalised to 1.
+ As a consequence the bin contents are also being normalized to 1.
+ To disable this behavior, set `density=False` in the `HistFit` constructor.
I think this explains the behavior more explicitly. We can still mention the fitting of normalization constants, though you could in principle also have a model function with a constant normalization that you're not fitting.
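As a plain-NumPy illustration of the suggested wording (toy numbers, not from the PR): with the default density convention the bin contents are scaled so that the histogram integrates to 1, matching a model that is a normalized pdf.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=300)
bin_edges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])

counts, _ = np.histogram(data, bins=bin_edges)
# Density normalization: divide by (total in-range count * bin width),
# so the histogram integrates to 1 over the binned range.
density = counts / (counts.sum() * np.diff(bin_edges))

integral = np.sum(density * np.diff(bin_edges))
print(integral)  # 1.0 up to float rounding
```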
    This example demonstrates why it is more convenient to use the HistFit class
    instead of XYFit when dealing with histogrammed data.
Suggested change:
- This example demonstrates why it is more convenient to use the HistFit class
- instead of XYFit when dealing with histogrammed data.
+ This example demonstrates a scenario in which it is more convenient to use the
+ HistFit class rather than the XYFit class.
    While it is in principle possible to perform such fits correctly using XYFit,
    it requires much more care. This example shows common mistakes that can occur
    when that necessary care is not taken and how this makes the fit results worse.
Suggested change:
- While it is in principle possible to perform such fits correctly using XYFit,
- it requires much more care. This example shows common mistakes that can occur
- when that necessary care is not taken and how this makes the fit results worse.
+ While it is in principle possible to perform such fits correctly using XYFit,
+ more care must be taken to avoid unusable or biased results.
+ This example shows common problems that can occur when using XYFit and how to fix them.
+ HistFit handles these problems automatically.
    x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values
    y_data = bincounts/(np.sum(bincounts)*np.diff(binedges))
    print(y_data) #use normalized histogram as y data
    x_error = 0.25 #use half binwidht as x_error
Suggested change:
- x_error = 0.25 #use half binwidht as x_error
+ x_error = 0.25 #use half binwidth as x_error
cverstege
left a comment
Just some quick comments. Also, formatting is missing. You can run the black formatter to automatically fix the issues.
    We will especially look at two scenarios that can arise from having only small amounts of data.
    This is, for example, a problem in the search for new rare processes in high-energy physics.
    Assume we are looking for a Gaussian signal peak, free from background for simplicity
    (the treatment of a signal over background is explained in example 03_SpluBfit.py).
Suggested change:
- (the treatment of a signal over background is explained in example 03_SpluBfit.py).
+ (the treatment of a signal over background is explained in example 03_SplusBfit.py).
    x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values
    y_data = bincounts/(np.sum(bincounts)*np.diff(binedges))
    print(y_data) #use normalized histogram as y data
    x_error = 0.25 #use half binwidht as x_error
Suggested change:
- x_error = 0.25 #use half binwidht as x_error
+ x_error = 0.25 #use half binwidth as x_error
    binedges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])

    #Now naively perform a default XYFit
    x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values
I'd rather call it bincenters than binmids
    """
    This first fit should give you warnings about the cost function being evaluated as infinite, which comes from the empty bin.
Suggested change:
- This first fit should give you warnings about the cost function being evaluated as infinite, which comes from the empty bin.
+ This first fit raises warnings about the cost function being evaluated as infinite, which originates from the empty bin.
    """
    This first fit should give you warnings about the cost function being evaluated as infinite, which comes from the empty bin.
    Also, the result of the fit should not return any good values. This is the result of the empty bin.
Suggested change:
- Also, the result of the fit should not return any good values. This is the result of the empty bin.
+ Also, the result of the fit does not return any good values. This is the result of the empty bin.
    This first fit should give you warnings about the cost function being evaluated as infinite, which comes from the empty bin.
    Also, the result of the fit should not return any good values. This is the result of the empty bin.
    It can be seen in the following that even if we combine bins in order to get rid of the empty bin,
Suggested change:
- It can be seen in the following that even if we combine bins in order to get rid of the empty bin,
+ It can be seen in the following that even if the bins are combined in order to get rid of the empty bin,
    binedges = np.array([-2, -1, -0.5, 0, 0.5, 1, 2])

    #Now again perform a default XYFit
    x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values
bincenters instead of binmids
    #Now again perform a default XYFit
    x_data = binmids = np.mean([binedges[:-1], binedges[1:]], axis=0) #use binmids as x values
    y_data = bincounts/(np.sum(bincounts)*np.diff(binedges)) #use normalized histogram as y data
    x_error = np.diff(binedges)/2 #use half binwidht as x_error
Suggested change:
- x_error = np.diff(binedges)/2 #use half binwidht as x_error
+ x_error = np.diff(binedges)/2 #use half binwidth as x_error
    plt.show()

    """
    The uncertainties on the result were already reduced slightly in comparison to the prefit. And by looking at the printed
Suggested change:
- The uncertainties on the result were already reduced slightly in comparison to the prefit. And by looking at the printed
+ The uncertainties on the result were already reduced slightly in comparison to the prefit. And by looking at the
… to accept also non integer binedges
After talking to @JohannesGaessler, I changed the example a bit. It should show more clearly how to recreate the HistFit and what the HistFit does internally. I also added a short part about when to use the Gaussian approximation cost function.
JohannesGaessler
left a comment
I think the example should make it more clear that the "Gaussian approximation" in the beginning and the "Gaussian approximation" cost function do different things: one uses the square root of the data, the other uses the square root of the model.
The behavior that we should aim for is to use the Poisson likelihood cost function by default and to automatically switch to the Gaussian approximation if the user specifies Gaussian errors (I think I already implemented this).
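To spell out the difference with a toy example (numbers made up for illustration): taking σ_i = sqrt(n_i) from the data breaks down for an empty bin and over-weights downward fluctuations, while the Gaussian-approximation cost function's σ_i = sqrt(μ_i) from the model stays finite wherever the model is positive.

```python
import numpy as np

n = np.array([0.0, 2.0, 5.0, 9.0, 12.0, 9.0, 5.0, 2.0])   # observed counts
mu = np.array([0.4, 1.9, 5.3, 9.5, 11.6, 9.5, 5.3, 1.9])  # model prediction

# "Gaussian approximation" on the data side: sigma_i^2 = n_i.
with np.errstate(divide="ignore"):
    chi2_data_errors = np.sum((n - mu) ** 2 / n)   # empty bin -> division by zero

# Gaussian-approximation cost function: sigma_i^2 = mu_i (the model).
chi2_model_errors = np.sum((n - mu) ** 2 / mu)

print(np.isinf(chi2_data_errors))      # True: the empty bin makes the cost infinite
print(np.isfinite(chi2_model_errors))  # True
```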
doc/src/parts/user_guide.rst
Outdated
    ---------------

    In physics experiments data is frequently histogrammed in order to reduce the data to a manageable number of bins.
    *kafe2* provides a dedicated fitting class for histogram fits, intendet to be used when datapoints are obtained
Suggested change:
- *kafe2* provides a dedicated fitting class for histogram fits, intendet to be used when datapoints are obtained
+ *kafe2* provides a dedicated fitting class for histogram fits, intended to be used when datapoints are obtained
doc/src/parts/user_guide.rst
Outdated
    In physics experiments data is frequently histogrammed in order to reduce the data to a manageable number of bins.
    *kafe2* provides a dedicated fitting class for histogram fits, intendet to be used when datapoints are obtained
    from a random distribution. Especiallywhen large numbers of datapoints are present it is more efficient to treat
Suggested change:
- from a random distribution. Especiallywhen large numbers of datapoints are present it is more efficient to treat
+ from a random distribution. Especially when large numbers of datapoints are present it is more efficient to treat
doc/src/parts/user_guide.rst
Outdated
    In physics experiments data is frequently histogrammed in order to reduce the data to a manageable number of bins.
    *kafe2* provides a dedicated fitting class for histogram fits, intendet to be used when datapoints are obtained
    from a random distribution. Especiallywhen large numbers of datapoints are present it is more efficient to treat
    the data as a histogram. To perform a histogram fit, the datapoints have to be filled into a :py:obj:`HistContainer`.
Suggested change:
- the data as a histogram. To perform a histogram fit, the datapoints have to be filled into a :py:obj:`HistContainer`.
+ the data as a histogram. To perform a histogram fit using raw data, the datapoints have to be filled into a :py:obj:`HistContainer`.
doc/src/parts/user_guide.rst
Outdated
    By default it is assumed that the model function for a `HistFit` object is a probability density function normalised to 1.
    As a consequence the bin contents are also being normalized to 1.
Suggested change:
- By default it is assumed that the model function for a `HistFit` object is a probability density function normalised to 1.
- As a consequence the bin contents are also being normalized to 1.
+ By default it is assumed that the model function for a `HistFit` object is a probability density function normalized to 1.
+ As a consequence the bin contents are also being normalized to 1.
I don't particularly care which spelling convention we use but we should be consistent.
    HistFit class rather than the XYFit class.
    While it is in principle possible to perform such fits correctly using XYFit,
    more care must be taken to avoid unusable or biased results.
Suggested change:
- more care must be taken to avoid unusable or biased results.
+ more care must be taken to avoid unusable or biased fit results.
    The default XYFit has some problems that have to be addressed, when histogrammed data is
    processed. First of all, when using an XYFit, the data is passed to the Fit object in a
    XYContainer. This container does not automatically fill the datapoints in bins. So this first step
    has to be done manually. In the following, 30 datapoints, sampled from a normal distribution, are used.
I think we should just start with something like "The XYFit does not have any built-in functionality for transforming raw data into a histogram. So if we want to use it we need to do this step manually."
    ]
    )

    binedges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])
Suggested change:
- binedges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])
+ bin_edges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])
I think it would make sense to put underscores between words for better legibility.
    When performing the fit, errors like "The cost function was evaluated as infinite" appear in the output.
    Furthermore, when looking at the plot of our fit result, it is clear that this fit didn't return the result we expected.
    The starting values for the fit are returned as best fit value, and the uncertainties are reported as NaN.
    What happened? The problem arises because Poisson uncertainties were assumed for the bin counts.
    If a bin is empty, the uncertainty is treated as zero by the fit. Thus the model function is forced to pass this datapoint
    exactly or the cost function will be infinite.
    This occures because the XYFit uses a χ²-cost function by default, which is only valid
    for Gaussian uncertainties, but not in the case of Poisson uncertainties. To fix this, the
    cost function of the fit is changed to a Poisson negative log-likelhoood (NLL).
I think it would make sense to print a warning when using a cost function that needs errors and some of the errors are also == 0.
    Another subtlety is the definition of the y_data: So far, simply the midpoint of each bin was used. This is only
    a linear approximation of the behaviour of the model function between the bin edges. The HistFit class of kafe2
    on the other hand uses "Simpsons rule", a method to approximate the behaviour quadratically for more accuracy.
Suggested change:
- on the other hand uses "Simpsons rule", a method to approximate the behaviour quadratically for more accuracy.
+ on the other hand uses "Simpson's rule", a method to approximate the behaviour quadratically for more accuracy.
    on the other hand uses "Simpsons rule", a method to approximate the behaviour quadratically for more accuracy.
    The most accurate albeit computationally expensive method would be to integrate the model function over each bin.
    The implementation of Simpsons rule in our procedure using a XYFit will not be done here,
Suggested change:
- The implementation of Simpsons rule in our procedure using a XYFit will not be done here,
+ The implementation of Simpson's rule in our procedure using a XYFit will not be done here,
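To illustrate the point about Simpson's rule (a standalone sketch, not code from the PR): for a standard normal model over a single wide bin, Simpson's rule reproduces the exact bin integral much more closely than the midpoint value used in the example.

```python
import math

def pdf(x):
    # Standard normal density, the model used in the example.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

a, b = 1.0, 2.0  # one histogram bin
mid = 0.5 * (a + b)

# Exact integral of the pdf over [a, b] via the error function.
exact = 0.5 * (math.erf(b / math.sqrt(2.0)) - math.erf(a / math.sqrt(2.0)))

midpoint = (b - a) * pdf(mid)                                 # the bin-midpoint shortcut
simpson = (b - a) / 6.0 * (pdf(a) + 4.0 * pdf(mid) + pdf(b))  # Simpson's rule

print(abs(midpoint - exact), abs(simpson - exact))
```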