Question About Automatic Object Prediction in XMem/XMem++ #154

Open

Lililiu116 opened this issue Feb 5, 2025 · 6 comments

@Lililiu116

Dear authors,

Thank you for your great work on XMem/XMem++! I have a question regarding its capabilities.

Is XMem or XMem++ able to predict a specific type of object without manual interaction? In my research, I aim to generate object-wise segmentation for surgical tools, and I am wondering whether this process can be fully automated. Would fine-tuning the model on surgical tool data help achieve this?

I would really appreciate any insights or recommendations on this.

Thank you for your time!

Best regards,
Lily

@hkchengrex
Owner

Hello, thanks for your interest in our project!

Unfortunately, they cannot. We have a follow-up work, DEVA, that attempts automatic segmentation. There are also a lot of recent works from other folks that try to automate this with language interaction, such as https://github.com/magic-research/Sa2VA

@Lililiu116
Author

Thank you so much for your quick response! I really appreciate the information.
I will take a look at them.

Best,
Lily

@Lililiu116
Author

Lililiu116 commented Feb 14, 2025

Hello,

Thank you again for your work on XMem! After trying other methods, I find XMem to be the most suitable for my needs.

My goal is to predict masks for a specific domain and object with the highest possible accuracy while minimizing manual interaction. I am wondering if the following approach would be feasible:

  1. Fine-tune the model with a related dataset.
  2. For the target dataset, manually label a subset of frames and use them as reference frames (see the mask-format sketch after this list).
  3. Perform inference on the remaining frames of the target dataset while providing the reference frames, but without requiring further manual interaction.
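
For step 2, I was planning to store the manual labels as DAVIS-style indexed PNGs, where the pixel value is the object ID (please correct me if XMem expects a different format; the folder name and mask size below are just placeholders):

```python
import os
import numpy as np
from PIL import Image

# One manually labeled reference frame: 0 = background, 1..N = tool instance IDs.
mask = np.zeros((480, 854), dtype=np.uint8)
mask[100:200, 300:500] = 1  # hypothetical tool region

# Save as an indexed (palette-mode) PNG, the usual DAVIS-style annotation format.
os.makedirs("Annotations/surgery_01", exist_ok=True)
Image.fromarray(mask, mode="P").save("Annotations/surgery_01/00000.png")
```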

The key question is whether it is possible to provide reference frames in this manner. Could you share any insights or advice on this approach?

Thank you for your time!

@Lililiu116 reopened this Feb 14, 2025
@hkchengrex
Owner

You can run XMem/XMem++ via propagation as long as there is at least one labeled frame per video. Does this answer your question?
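
For concreteness, one way to set this up is a DAVIS-style folder where your labeled reference frames sit in Annotations next to the JPEG frames. This is a rough sketch from memory (folder names are placeholders), so double-check the README and eval.py for the exact flags of the generic-dataset mode:

```
my_videos/
  JPEGImages/
    surgery_01/
      00000.jpg
      00001.jpg
      ...
  Annotations/
    surgery_01/
      00000.png   # indexed PNG, pixel value = object ID (0 = background)
```

Then run something like:

```bash
python eval.py --model ./saves/XMem.pth --generic_path ./my_videos --dataset G --output ./output
```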

@Lililiu116
Author

Thank you for your response!
Just to clarify, can I provide reference frames beforehand and let XMem/XMem++ propagate automatically across the entire video without requiring additional manual interaction during inference?
In other words, once I set my reference frames at the beginning, does the model allow me to run propagation without needing to interact with it further?

@hkchengrex
Owner

Right, this is the standard setup for VOS.
Of course, there might be (will be) propagation errors, but this is pretty much unavoidable with current technologies.
There are newer algorithms like Cutie or SAM2, but all of them still make mistakes.
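
To spell out what "no further interaction" looks like, the loop is roughly the sketch below. This is pseudocode rather than the exact API in the repo (`propagate` and `model.step` are placeholder names; see inference_core.py for the real entry points), but the key point is that masks are read only at the reference frames and every other frame is predicted automatically.

```python
def propagate(frames, reference_masks, model):
    """frames: ordered list of images for one video.
    reference_masks: dict mapping frame index -> mask, for the hand-labeled frames only.
    model: a memory-based VOS model (XMem-style); `step` is a placeholder method name.
    """
    outputs = {}
    for i, frame in enumerate(frames):
        given_mask = reference_masks.get(i)          # a mask only on reference frames, else None
        outputs[i] = model.step(frame, given_mask)   # propagation handles all other frames
    return outputs
```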
