Choices of Implementation wrt your Original Paper #7
Hi Olivier, Thanks for looking through the implementation. You are spot-on about the differences between the code and the algorithm in the paper. The differences are mainly optimizations, which we deemed too tedious to explain in a 6-page workshop paper. Let me address each of the points you brought up.
If you look closely, the workings are not very different from what is described in the paper:
Each context-local worklist uses reverse post-order (i.e. topological sort) and the worklist of contexts is a set of ordered items, from which the newest item is removed for processing. This makes the two worklist implementations equivalent.
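To make the equivalence concrete, here is a minimal sketch of the two-level worklist, using hypothetical Node/Context/Solver types rather than VASCO's actual classes:

import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical sketch; these are NOT VASCO's actual classes.
class Node {
    final int rpo;                // position of this CFG node in reverse post-order
    Node(int rpo) { this.rpo = rpo; }
}

class Context {
    final int id;                 // creation order: a larger id means a newer context
    // Context-local worklist, ordered by reverse post-order of the method's CFG.
    final TreeSet<Node> nodes = new TreeSet<>(Comparator.comparingInt((Node n) -> n.rpo));
    Context(int id) { this.id = id; }
}

class Solver {
    // Global worklist of contexts; polling the last element removes the newest context,
    // which makes this equivalent to a single worklist keyed by (context, node) pairs.
    final TreeSet<Context> pending = new TreeSet<>(Comparator.comparingInt((Context c) -> c.id));

    void solve() {
        while (!pending.isEmpty()) {
            Context c = pending.pollLast();          // newest context first
            while (!c.nodes.isEmpty()) {
                Node n = c.nodes.pollFirst();        // reverse post-order within the context
                process(c, n);                       // apply the flow function, propagate to successors
            }
        }
    }

    void process(Context c, Node n) {
        // A client analysis would compute IN/OUT here and re-add affected nodes and contexts.
    }
}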
The analyzed flag:
You are absolutely right about this one. We did not use this flag in the paper at all. However, it is an optimization and not crucial to the correctness of the algorithm. Consider a mutually recursive program with call chain
For convergence of data-flow analysis, the OUT value at each node should only change in one direction (i.e. it can only become weaker -- change towards BOTTOM). This monotonicity is crucial to proving convergence, and is described in the book you referenced. If the flow functions are themselves monotonic, this operation doesn't change anything, because the newly computed OUT is already weaker than the previous one and the meet simply returns it. However, statically proving that flow functions implemented in Java are monotonic is infeasible. By forcing the meet with the previous OUT value, we guarantee that OUT values can only move towards BOTTOM, even if a client's flow functions are not monotonic.

Hope this helps. Let me know if something was not clear or if something that I said seems incorrect.

Rohan
Hi Rohan, thanks for your swift, detailed answer!

I definitely did not notice the description of the worklist removal order in the example description. I understand now this is an implementation of the "interprocedurally eager" constraint.

The "forced monotonicity" meet is an interesting trade-off, considering that data-flow problems are arbitrarily specified in Java. I suppose this would not have been needed had you provided a closed set of provably monotonic problems, but then this would not be a general framework anymore... Out of curiosity, did you evaluate the cost of those extra meets when using an actually monotonic Java specification (which, as you describe, should be the general case)?

My understanding of the [...] I suppose an "on demand" (instead of "newest") evaluation criterion for contexts would not cut it either, due to the recursive nature of the example you are providing. By this I mean performing an analysis of [...]

I have an extra question with respect to call strings. The "Efficiency, Precision, Simplicity, and Generality..." paper and the "Data Flow Analysis" book go to great lengths to describe criteria for "representing" (factorizing at entries) and "regenerating" (expanding at exits) related call strings, by taking advantage of value-based equivalence classes. Even if the number of call strings is massively reduced by this approach, it seems like they are still kept as actual strings (i.e., explicit lists of call nodes). Interestingly, no such explicit lists seem present in your paper and implementation. Instead, you are tabulating [...]

Olivier.
Hi Olivier,
No, I have not done that. It would be interesting to see what the performance impact of that would be. I should point out that there is a subtle case that actually requires the self-meet on return values. The problem arises when the value context just before a function invocation changes (perhaps the invocation was inside a loop). The new context could have a stronger value at its exit (TOP by default, or something else even if some path to its exit was analyzed) than the exit value of the old context. I'm not sure I can come up with a detailed example off the top of my head, but the general pattern is something like:
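One hypothetical shape such a pattern could take, assuming constant propagation as the client analysis (the program and the concrete values are made up for illustration):

// Hypothetical program being analyzed; constant propagation is assumed as the client analysis.
class LoopedCall {
    static int callee(int p) {
        return p + 1;
    }

    static void caller() {
        int x = 0;
        while (someCondition()) {
            // On the first pass over the loop body, the analysis reaches this call with x = 0 and
            // creates a value context for callee with entry {p = 1}, whose exit eventually becomes
            // {return = 2}.
            // After the back edge, the merge at the loop head makes x non-constant, so a new,
            // weaker value context with entry {p = NAC} is created for callee. Its exit value
            // starts out at TOP, which is *stronger* than the old exit {return = 2}; without a
            // self-meet on the return value, OUT after this call site could move back towards TOP.
            x = callee(x + 1);
        }
    }

    static boolean someCondition() { return true; }
}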
To summarize, there is likely an error on line 32 in Figure 1 of the VASCO paper. The correct statement should include a meet with the previous OUT value. This meet can be avoided by creating new contexts using a clone operation instead of fresh construction whenever the calling context at a given invocation node changes (i.e. becomes weaker). The clone operation would initialize all nodes in a value context's control-flow graph with data-flow values from its source context, thereby preserving monotonicity throughout.
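For concreteness, a minimal sketch of that corrected statement, with hypothetical names (A stands for the client's data-flow value type; this is not VASCO's actual code):

// A minimal sketch, not VASCO's actual code. A is the client's data-flow value type.
interface Lattice<A> {
    A meet(A x, A y);    // greatest lower bound in the client's lattice
}

class ReturnFlow<A> {
    // The value observed at a call site after a call returns: the exit value of the current
    // callee context, met with the previously recorded OUT. If the callee context has just
    // changed (e.g. the call sits in a loop and its entry value became weaker), the new
    // context's exit may still be TOP, i.e. stronger than before; the self-meet keeps OUT
    // from moving upwards.
    A outAfterCall(Lattice<A> lattice, A calleeExitValue, A previousOut) {
        return lattice.meet(calleeExitValue, previousOut);
    }
}

With the clone-based alternative described above, the new context's per-node values would start from the source context's values instead of TOP, and this meet would become a no-op.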
Possibly. The resulting effect on the order of pulling nodes out of the worklist appears to be the same as in the current implementation of VASCO. However, you would now need client analyses to be aware of these two separate values in their definition of the data-flow value domain (generic type [...]).
You are absolutely right. The value-contexts formulation is an implementation of the functional approach, which is a dual of the call-strings approach, as outlined by Sharir and Pnueli in the classic (aptly-named) 1978 article "Two Approaches to Interprocedural Data Flow Analysis".

However, the reason I prefer the value-contexts approach is that it gives you the best of both worlds. The context-transition graph constructed at the end of the analysis is a data structure that allows querying for the context of any call string, which can be arbitrarily long. Think of the graph as a finite state machine where each node represents a regular expression over all the call strings to which it corresponds. I believe this is more powerful (in terms of querying power) than the extended call strings approach (Khedker & Karkare, 2008). You don't necessarily have to do a meet over all interprocedurally valid paths -- you can keep the contexts and their analysis results separate until you really need them. For example, if you were to use the results of pointer analysis in a subsequent analysis, you don't have to ask "do [...]

This ability to query the context for any call string can also be powerful if you were to use the results of data-flow analysis at run-time. For example, if you were to pause a program in a debugger and query its call stack, you would know exactly which value context that stack frame belongs to, because you could walk the context-transition graph starting from the context of the program's entry point.
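A rough sketch of what such a call-stack query could look like, using hypothetical types rather than VASCO's actual API:

import java.util.List;

// Hypothetical types for illustration; VASCO's actual context-transition graph API differs.
interface ContextTransitionGraph<M, N, C> {
    C entryContext(M entryMethod);                 // the context of the program's entry method
    C transition(C callerContext, N callSite);     // callee context targeted from this call site, or null
}

class CallStackQuery<M, N, C> {
    // Walk the context-transition graph along a concrete call stack (outermost call first).
    // Each frame is the call site through which the next method was entered. The context
    // reached at the end is the value context whose results apply to the innermost frame.
    C contextForStack(ContextTransitionGraph<M, N, C> ctg, M entryMethod, List<N> callSites) {
        C current = ctg.entryContext(entryMethod);
        for (N callSite : callSites) {
            current = ctg.transition(current, callSite);
            if (current == null) {
                return null; // this call string was never observed by the analysis
            }
        }
        return current;
    }
}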
Rohan, thanks for having taken the time to detail those points for me.
Now that I have represented basic problems in Vasco (#8), I can definitely appreciate the simplicity of the framework, and the fact that optimizations are not a concern when implementing solvers.
Does this mean that the Vasco algorithm is equivalent to the "full" call strings approach, and, as a consequence, is able to solve the non-separable problems that "restricted contexts" approaches cannot handle (as discussed on page 301 of "Data Flow Analysis")? I am asking this because, superficially, the "value contexts" approach feels like "restricting the length of calling context [sigma] to 1" ("Data Flow Analysis", p. 296), but I assume the [...]
Frankly, those results feel like a significant improvement over previous functional and call-string approaches (especially from a usability/ease of implementation point of view), and would deserve more exposure. I don't mean to pry, but is there any chance you may publish more about those? Or are you considering other approaches/problems now?

Olivier.
Thanks.
I don't have the book with me at the moment to check those references, but I can tell you that the value-contexts approach is definitely not the 1-calling-context approach (also known as 1-CFA in functional programming), which is much simpler and far more imprecise. The value-contexts approach is indeed fully context-sensitive, equivalent to a hypothetical ∞-CFA, or to unbounded call strings in the classical case, or to full call strings when using value-based termination. It is designed to work with non-separable problems; if your data-flow problems have distributive flow functions, you are much better off using an implementation of the IFDS framework, such as Heros, since it terminates in asymptotically fewer steps.
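For reference, a generic textbook-style illustration of that distinction (this is not code from VASCO or Heros): a flow function f is distributive when f(a ⊓ b) = f(a) ⊓ f(b), and constant propagation -- a classic non-separable problem -- violates this:

import java.util.HashMap;
import java.util.Map;

// Textbook-style illustration of (non-)distributivity for the statement "z = x * y".
class Distributivity {
    static final Integer NAC = null;   // "not a constant", represented as null for brevity

    // Flow function for "z = x * y" over maps from variable name to constant value.
    static Map<String, Integer> flow(Map<String, Integer> in) {
        Map<String, Integer> out = new HashMap<>(in);
        Integer x = in.get("x"), y = in.get("y");
        out.put("z", (x == null || y == null) ? NAC : Integer.valueOf(x * y));
        return out;
    }

    // Meet of two values: a variable stays constant only if both inputs agree on its value.
    static Map<String, Integer> meet(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> out = new HashMap<>();
        for (String v : a.keySet()) {
            Integer va = a.get(v), vb = b.get(v);
            out.put(v, (va != null && va.equals(vb)) ? va : NAC);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> path1 = Map.of("x", 2, "y", 3);   // facts reaching along one path
        Map<String, Integer> path2 = Map.of("x", 3, "y", 2);   // facts reaching along another path

        // meet(f(path1), f(path2)): both paths compute z = 6, so z remains the constant 6.
        System.out.println(meet(flow(path1), flow(path2)).get("z"));   // prints 6

        // f(meet(path1, path2)): x and y are merged to NAC first, so z becomes NAC (null).
        System.out.println(flow(meet(path1, path2)).get("z"));         // prints null
    }
}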
You are right in that the value-contexts formulation makes the functional approach more usable, but that is subjective and more of an analysis implementor's concern than a theoretical contribution; therefore, I do not have plans to publish more than the material that is already out there (i.e. the SOAP paper and this GitHub repo). In my opinion, the main limitation of the value-contexts, functional, and full call-strings approaches alike is that, in the worst case, the run-time is proportional to the size of the data-flow lattice, which is often exponential in the size of the program (or its variables). They also require the whole program to be analyzed at once, which is sometimes infeasible when using dynamic linking, reflection, or partial JIT compilation. This type of framework is suitable for performing heavy-weight offline analysis (e.g. where you don't mind spending several hours analyzing important security properties) for small, self-contained programs (e.g. Android apps or WebAssembly modules).
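As a purely illustrative back-of-the-envelope (assumed numbers, not measurements): for a points-to abstraction that maps each of p pointer variables to a subset of a allocation sites, the lattice has (2^a)^p = 2^(p*a) elements, and the number of value contexts per method is bounded by the number of distinct data-flow values that can reach its entry, i.e. by the lattice size. Even p = a = 8 allows up to 2^64 distinct entry values in the worst case.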
Nice, thanks for the confirmation.
Oh! Sadly I tried Heros before, and I am investigating other interprocedural analysis approaches since it did not scale well on the kind of problems I tried to solve. I'll see how it goes...

Regards,
Hi Rohan, I too have a doubt about the case when we do not have the results for a method because it hasn't been analyzed completely yet (i.e., we are in an SCC involving the method [...]). While analyzing [...], I am not sure, but can this create a problem (as such an analysis might make unsound assumptions for the successors)? Or is it the case that we expect that [...]

Regards,
Hi Manas,

You were right when you said: [...]
A cycle in the context-transition graph is of course quite unfortunate from a performance perspective, but there is really nothing we can do other than start with an optimistic assumption in one of the nodes and propagate information back and forth, weakening the data-flow values, until a fixed point is reached. To see why this is still sound, consider the only case where it might appear to be unsound, which is something like the following:

function f(x) {
    g(x+1);
    // everything is TOP here
}

function g(y) {
    f(y-1);
    // everything is TOP here
}

Here, the call chain on a call to either function is infinite (f calls g, which calls f, and so on), so the statements after the recursive calls can never execute; a data-flow value of TOP at those unreachable points is therefore sound.

Hope this makes sense.
Thanks a lot for the clear explanation, Rohan! This really helps.
While studying your "Interprocedural Data Flow Analysis in Soot using Value Contexts" paper, I took the liberty of comparing the algorithm described in Figure 1 with the Java implementation you provided in this repository.
Most differences are Java implementation details, extensions (such as the multiple entry/exit points, hinted at in the original paper), or classically implied behaviors (such as the topological sort to ensure fast convergence). Two of them struck me as more fundamental, though.
The first one is the choice of associating a worklist with each context, while retaining the global worklist of contexts. This significantly impacts the algorithm by introducing a null guard object and an analyzed flag on contexts. This flag, as I understand it, alters the criterion on line 28 of the original algorithm by excluding X' contexts whose worklist has not already been processed.

Is this change merely for implementation convenience, or does it have a deeper meaning? It seems to relate to the "intraprocedurally eager" criterion of the "Efficiency, Precision, Simplicity and Generality in Interprocedural Data Flow Analysis" paper.
Could you shed some light on this?
A second, minor difference is in the lines of code stating "Merge with previous OUT to force monotonicity". I am not familiar with this technique (at least I did not notice it in the "Data Flow Analysis: Theory and Practice" book).
Do you have any pointer to a paper describing this technique?
Regards,
Olivier.