Dynamical Factor Models (DFM) Implementation (GSOC 2025) #446
Conversation
Looks interesting! Just say when you think it's ready for review.
Thanks for the feedback! I'm still exploring the best approach for implementing Dynamic Factor Models.
Some tests are failing due to missing constants. You might have lost some changes during the reset/rebase process.
Left some comments. I didn't look over the tests because they still seem like a WIP, but you seem to be on the right track!
I did a deeper pass on everything except the build_symbolic_graph method. I need to spend more time on that because it's gotten quite complex. I'll finish ASAP.
The notebook adds a comparison between the custom DFM and the implemented DFM (which uses a hardcoded version of make_symbolic_graph that works only in this case).
…pymc_extras/statespace/models/structural/components/regression.py
pymc_extras/statespace/models/DFM.py (outdated)

        pt.zeros((self.k_endog, self.k_endog * (self.error_order - 1)), dtype=floatX)
    )
    if len(matrix_parts) == 1:
        design_matrix = factor_loadings * 1.0
What does this line do?
Yes, this line is quite messy. I was running into a parameter-naming issue in the tests. When error_order=0 and factor_order=0, the design matrix consists only of the first matrix block, "factor_loadings". For some reason, during model construction, a parameter named "design" is expected instead of "factor_loadings", which of course leads to an error.
Do you have any suggestions? I forgot to mention this issue when I first encountered it.
    self.ssm["state_cov", :, :] = factor_cov

    # Observation covariance matrix (H)
    if self.error_order > 0:
This seems backwards? The first case looks like it assumes error_order == 0. (This comment is based on error_sigma appearing in the else branch.)
    sigma_obs = self.make_and_register_variable(
        "sigma_obs", shape=(self.k_endog,), dtype=floatX
    )
    total_obs_var = error_sigma**2 + sigma_obs**2
Are you sure error_sigma should appear here? It's already appearing in Q, so is this double-counting?
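A quick simulation illustrates the double-counting concern (the variance values are hypothetical, not taken from the model):

```python
import numpy as np

rng = np.random.default_rng(0)
error_sigma, sigma_obs = 0.8, 0.5

# If the idiosyncratic error already enters as a state innovation (via Q),
# the observed series variance already includes error_sigma**2
state_err = rng.normal(0.0, error_sigma, size=200_000)
obs_err = rng.normal(0.0, sigma_obs, size=200_000)
y = state_err + obs_err

# Adding error_sigma**2 to H on top of this would count it twice
print(round(y.var(), 2))  # ≈ error_sigma**2 + sigma_obs**2 = 0.89
```

If error_sigma also appeared in H, the implied observation variance would be roughly 2 * error_sigma**2 + sigma_obs**2 instead.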
I think your last two comments relate to the same doubt: how should we handle error_sigma/error_cov when error_order = 0?
In our implementation, which replicates what statsmodels does, when error_order = 0 there is no error term included in the state vector. This means I cannot add the error_sigma term to state_cov, since its shape does not account for error terms. My first thought was to add it to the observation equation instead, but I don't think that's the right approach.
Conceptually, when error_order = 0, we should interpret it as having only the standard innovations (i.e., normal errors) on each endogenous variable, without introducing an explicit term in the state.
I have seen that in statsmodels, when error_order = 0, no additional states or innovations are introduced for the error term.
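A small sketch of how the matrix shapes would work out in that case (hypothetical dimensions, and the idiosyncratic variance value is illustrative):

```python
import numpy as np

k_endog, k_factors = 3, 1   # hypothetical: 3 observed series, 1 factor
error_order = 0

# With error_order = 0, no error states are appended:
k_states = k_factors        # factor states only (assuming factor_order = 1)
k_posdef = k_factors        # innovations only for the factor

Q = np.eye(k_posdef)                   # state covariance: factor innovation only
H = np.diag(np.full(k_endog, 0.5**2))  # idiosyncratic variance lives in H instead

print(Q.shape, H.shape)  # (1, 1) (3, 3)
```

This matches the interpretation above: the per-series normal errors appear as measurement noise in H, never as extra states or entries in Q.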
# Calculate the number of states
k_states = self._factor_order
k_posdef = self.k_factors
if self.error_order > 0:
k_states += self._error_order
k_posdef += k_endog
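As a worked example of the counting logic above (hypothetical dimensions, and assuming the statsmodels-style conventions _factor_order = k_factors * max(factor_order, 1) and _error_order = k_endog * error_order):

```python
k_factors, factor_order = 2, 2   # hypothetical: 2 factors with VAR(2) dynamics
k_endog, error_order = 4, 3      # 4 observed series, AR(3) idiosyncratic errors

_factor_order = k_factors * max(factor_order, 1)  # 4 factor states (companion form)
_error_order = k_endog * error_order              # 12 error states

k_states = _factor_order
k_posdef = k_factors
if error_order > 0:
    k_states += _error_order
    k_posdef += k_endog

print(k_states, k_posdef)  # 16 6
```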
* First pass on exogenous variables in VARMA
* Adjust state names for API consistency
* Allow exogenous variables in BayesianVARMAX
* Eagerly simplify model where possible
* Typo fix
Final tiny comments. This looks great!
    design_matrix = pt.concatenate([design_matrix_time, Z_exog], axis=2)

    self.ssm["design"] = design_matrix

    # Transition matrix
    # auxiliary function to build transition matrix block
    # Transition matrix (T)
Can you make a little ASCII diagram of how A, B, and C fit together, or just write T = BlockDiag(A, B, C) before you introduce the names "block A" and "block B"? Reading it, I felt like "hey, wait, did I miss something?"
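For instance, the suggested structure could be sketched like this (the block values are hypothetical, and scipy's block_diag stands in for however the model actually assembles T):

```python
import numpy as np
from scipy.linalg import block_diag

# Hypothetical blocks:
A = np.array([[0.5, 0.2],
              [1.0, 0.0]])   # companion form for one factor with VAR(2) dynamics
B = np.array([[0.3]])        # AR(1) block for one idiosyncratic error
C = np.array([[0.1]])        # AR(1) block for another idiosyncratic error

T = block_diag(A, B, C)      # T = BlockDiag(A, B, C)
print(T.shape)  # (4, 4)
```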
jessegrabowski commented on 2025-08-30T09:48:40Z: Ip Man was also a pretty good movie, but the sequels sucked.
jessegrabowski commented on 2025-08-30T09:48:41Z: Is this why you chose factor_order = 2 later? It would be nice to make a more obvious bridge between this section and the final statistical model. You don't actually model the data that this analysis is based on (you ultimately use the log differences), so it's a bit of a loose connection.

andreacate commented on 2025-08-30T09:55:06Z: Yes, sure, you are right. No, I have not made decisions about parameters, since I just wanted to replicate what was done in statsmodels. Cointegration was not present in statsmodels; I was just curious at the beginning. I can just delete the cells about cointegration.
jessegrabowski commented on 2025-08-30T09:48:42Z: You can consider cutting this section, IMO, and just say "Looking at the graphs, these time series are obviously non-stationary."
jessegrabowski commented on 2025-08-30T09:48:42Z: Here's a fancier ADF test function that I use, if you want. It matches the output of STATA and gives all 3 variants of the ADF test:
import pandas as pd
import statsmodels.api as sm

def ADF_test_summary(df, maxlag=None, autolag='BIC', missing='error'):
    # NOTE: relies on a make_var_names helper (defined elsewhere) that builds
    # coefficient labels for the lag and trend terms of each specification.
    if missing == 'error':
        if df.isna().any().any():
            raise ValueError("df has missing data; handle it or pass missing='drop' to automatically drop it.")
    if isinstance(df, pd.Series):
        df = df.to_frame()
    for series in df.columns:
        data = df[series].copy()
        if missing == 'drop':
            data.dropna(inplace=True)
        print(series.center(110))
        print('=' * 110)
        line = 'Specification' + ' ' * 15 + 'Coeff' + ' ' * 10 + 'Statistic' + ' ' * 5 + 'P-value' + ' ' * 6 + 'Lags' + ' ' * 6 + '1%'
        line += ' ' * 10 + '5%' + ' ' * 8 + '10%'
        print(line)
        print('-' * 110)
        for name, reg in zip(['Constant and Trend', 'Constant Only', 'No Constant'], ['ct', 'c', 'n']):
            stat, p, crit, regresult = sm.tsa.adfuller(data, regression=reg, regresults=True, maxlag=maxlag,
                                                       autolag=autolag)
            n_lag = regresult.usedlag
            gamma = regresult.resols.params[0]
            names = make_var_names(series, n_lag, reg)
            reg_coefs = pd.Series(regresult.resols.params, index=names)
            reg_tstat = pd.Series(regresult.resols.tvalues, index=names)
            reg_pvals = pd.Series(regresult.resols.pvalues, index=names)
            line = f'{name:<21}{gamma:13.3f}{stat:15.3f}{p:13.3f}{n_lag:11}{crit["1%"]:10.3f}{crit["5%"]:12.3f}{crit["10%"]:11.3f}'
            print(line)
            # Print only the deterministic terms present in this specification
            for coef in reg_coefs.index:
                if coef in name:
                    line = f"\t{coef:<13}{reg_coefs[coef]:13.3f}{reg_tstat[coef]:15.3f}{reg_pvals[coef]:13.3f}"
                    print(line)
jessegrabowski commented on 2025-08-30T09:48:43Z: Plot the transformed data and comment before you run the stationarity test.

jessegrabowski commented on 2025-08-30T09:48:43Z: They're not that simple, because you constrained the sign. You should comment on the prior choices here, in addition to in the comments.

jessegrabowski commented on 2025-08-30T09:48:45Z: Add some commentary on what is being shown here.

jessegrabowski commented on 2025-08-30T09:48:45Z: Show this before sampling.

jessegrabowski commented on 2025-08-30T09:48:46Z: Use

jessegrabowski commented on 2025-08-30T09:48:46Z: The legend is wrong -- the gray are recessions, not HDI. Is the HDI plotted here, but it's just really tight? If so, comment on this.

jessegrabowski commented on 2025-08-30T09:48:47Z: Commentary? What is state 0 (consider renaming the title to be clear, like "Estimated Latent Factor 1")? Add recessions?

jessegrabowski commented on 2025-08-30T09:48:48Z: Typo: Statsmodels.
The notebook also looks great! Could you add some more headings and motivate all the analysis you're doing with some commentary? Make sure the pieces connect together clearly. I'd move the Bayesian latent factor graph above the coincident index. I would also suggest you comment on the fact that the optimizer doesn't converge in the statsmodels model. It ends up being "close enough", but make some hay out of the fact that MCMC doesn't "converge": it explores all equiprobable solutions, which in this case is important because the model is only weakly identified.
Dynamical Factor Models (DFM) Implementation

This PR provides a first draft implementation of Dynamic Factor Models as part of my application proposal for the PyMC GSoC 2025 project. A draft of my application report can be found at this link.

Overview
- DFM.py with initial functionality

Current Status
This implementation is a work in progress, and I welcome any feedback.

Next Steps