Question About Automatic Object Prediction in XMem/XMem++ #154

Open

Lililiu116 opened this issue Feb 5, 2025 · 6 comments

@Lililiu116

Dear authors,

Thank you for your great work on XMem/XMem++! I have a question regarding its capabilities.

Is XMem or XMem++ able to predict a specific type of object without manual interaction? In my research, I aim to generate object-wise segmentation for surgical tools, and I am wondering whether this process can be fully automated. Would fine-tuning the model on surgical tool data help achieve this?

I would really appreciate any insights or recommendations on this.

Thank you for your time!

Best regards,
Lily

@hkchengrex
Owner

Hello, thanks for your interest in our project!

Unfortunately, they cannot. We have a follow-up work, DEVA, that attempts automatic segmentation. There are also a lot of recent works from other folks that try to automate this with language interaction, such as https://github.com/magic-research/Sa2VA

@Lililiu116
Author

Thank you so much for your quick response! I really appreciate the information.
I will take a look at them.

Best,
Lily

@Lililiu116
Author

Lililiu116 commented Feb 14, 2025

Hello,

Thank you again for your work on XMem! After trying other methods, I find XMem to be the most suitable for my needs.

My goal is to predict masks for a specific domain and object with the highest possible accuracy while minimizing manual interaction. I am wondering if the following approach would be feasible:

  1. Fine-tune the model with a related dataset.
  2. For the target dataset, manually label a subset of frames and use them as reference frames (see the mask-format sketch after this list).
  3. Perform inference on the remaining frames of the target dataset while providing the reference frames, but without requiring further manual interaction.
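
For step 2, I was planning to store the manual labels as DAVIS-style indexed PNGs, where the pixel value is the object ID (please correct me if XMem expects a different format; the folder name and mask size below are just placeholders):

```python
import os
import numpy as np
from PIL import Image

# One manually labeled reference frame: 0 = background, 1..N = tool instance IDs.
mask = np.zeros((480, 854), dtype=np.uint8)
mask[100:200, 300:500] = 1  # hypothetical tool region

# Save as an indexed (palette-mode) PNG, the usual DAVIS-style annotation format.
os.makedirs("Annotations/surgery_01", exist_ok=True)
Image.fromarray(mask, mode="P").save("Annotations/surgery_01/00000.png")
```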

The key question is whether it is possible to provide reference frames in this manner. Could you share any insights or advice on this approach?

Thank you for your time!

@Lililiu116 reopened this Feb 14, 2025
@hkchengrex
Owner

You can run XMem/XMem++ via propagation as long as there is at least one labeled frame per video. Does this answer your question?
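
For concreteness, one way to set this up is a DAVIS-style folder where your labeled reference frames sit in Annotations next to the JPEG frames. This is a rough sketch from memory (folder names are placeholders), so double-check the README and eval.py for the exact flags of the generic-dataset mode:

```
my_videos/
  JPEGImages/
    surgery_01/
      00000.jpg
      00001.jpg
      ...
  Annotations/
    surgery_01/
      00000.png   # indexed PNG, pixel value = object ID (0 = background)
```

Then run something like:

```bash
python eval.py --model ./saves/XMem.pth --generic_path ./my_videos --dataset G --output ./output
```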

@Lililiu116
Author

Thank you for your response!
Just to clarify, can I provide reference frames beforehand and let XMem/XMem++ propagate automatically across the entire video without requiring additional manual interaction during inference?
In other words, once I set my reference frames at the beginning, does the model allow me to run propagation without needing to interact with it further?

@hkchengrex
Owner

Right, this is the standard setup for VOS.
Of course, there might be (will be) propagation errors, but this is pretty much unavoidable with current technologies.
There are newer algorithms like Cutie or SAM2, but all of them still make mistakes.
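
To spell out what "no further interaction" looks like, the loop is roughly the sketch below. This is pseudocode rather than the exact API in the repo (`propagate` and `model.step` are placeholder names; see inference_core.py for the real entry points), but the key point is that masks are read only at the reference frames and every other frame is predicted automatically.

```python
def propagate(frames, reference_masks, model):
    """frames: ordered list of images for one video.
    reference_masks: dict mapping frame index -> mask, for the hand-labeled frames only.
    model: a memory-based VOS model (XMem-style); `step` is a placeholder method name.
    """
    outputs = {}
    for i, frame in enumerate(frames):
        given_mask = reference_masks.get(i)          # a mask only on reference frames, else None
        outputs[i] = model.step(frame, given_mask)   # propagation handles all other frames
    return outputs
```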
