Out-of-the-box protection from deep nested and complex queries attacks #1310
Comments
@mxstbr PTAL. |
The full text of the final version of the “Semantics and Complexity of GraphQL” paper can be found here: https://dl.acm.org/citation.cfm?id=3186014 |
I am one of the two authors of the paper mentioned above. I always wanted to write a less math-heavy blog post that describes the contributions of the paper (including the algorithm) informally. Unfortunately, I have not found the time yet, and at the moment I am on vacation and I will have a considerable backlog to deal with when I come back. Anyways, in the meantime you may want to see the video of my recent talk at GraphQL Europe: Towards the end of the talk I go through an example that illustrates how the algorithm works. If you want to check the slides of the talk, find them at: Regarding an actual implementation of the algorithm, I recently had a student who built a prototype for the purpose of conducting some experiments. I am still waiting for him to document the code, and then we will make the implementation available as open source. However, it really is just a prototype rather than a well-engineered, production-ready library. By the way, the preprint of the paper and the final version published in the ACM DL are equivalent in terms of content. |
Unless I'm missing something, both my library (graphql-validation-complexity) and @pa-bru's graphql-cost-analysis already address this. That said, I don't know that there's a good one-size-fits-all solution; it's pretty natural to want to tweak the costs, and the two libraries mentioned above have APIs that focus on sufficiently different things – mine on working well with the programmatic API, @pa-bru's on working well with the SDL. |
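(For context, here is a minimal sketch of how such a static cost limit typically plugs into Apollo Server's validationRules, using graphql-validation-complexity; the toy schema and the limit of 1000 are arbitrary illustrative values, not a recommendation.)

import { ApolloServer, gql } from 'apollo-server';
import { createComplexityLimitRule } from 'graphql-validation-complexity';

// Toy schema purely for illustration.
const typeDefs = gql`
  type Query {
    hello: String
  }
`;
const resolvers = { Query: { hello: () => 'world' } };

const server = new ApolloServer({
  typeDefs,
  resolvers,
  // Reject any operation whose statically estimated cost exceeds 1000.
  validationRules: [createComplexityLimitRule(1000)],
});

server.listen();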
That's what I've also found: https://github.com/LiUGraphQL/apollo-server-core. |
Let me add a bit more context to the recent comments by @FluorescentHallucinogen. First, I have now finished the blog post I promised. This post provides an informal summary of our research paper, including an example that demonstrates how our proposed algorithm works. Find the post at: Moreover, a student of mine has written a prototypical implementation of the algorithm. That's what you find at: In addition to creating this implementation, the student has also done an initial experimental evaluation. The code for the GraphQL server used in the experiment is at: This GraphQL server uses a couple of packages that had to be extended for the experiments. These extensions can be found at |
@taion the main difference between the libraries that you mention in your comment above and our algorithm is that these libraries can only return an estimate of the query result size, whereas our algorithm returns the exact size. |
@hartig What about a plugin for https://github.com/prisma/graphql-middleware? See https://www.prisma.io/blog/graphql-middleware-zie3iphithxy/ for more info. @maticzav Is middleware (resolvers) the right place to calculate query result size? |
Conceptually, the result-size check comes in between the validation step and the execution step. I will read the post about the middleware. |
Okay, yes, but the point of the heuristics is that it's possibly the calls to the backend that are expensive, i.e. in practice we don't want our GraphQL server to make requests to all the backend services in the first place. In your implementation, you need to actually query the database to get the result size estimation. If you replace that with a heuristic, you recover the same thing as the already-implemented complexity cost estimation approaches. With the current implementation, you end up essentially doubling the load to the database, and still don't protect back ends from malicious, expensive queries. |
It computes the size of a GraphQL query result (response) without generating this result. IIUC, it doesn't need to actually query the database. @hartig Please correct me if I'm wrong. @hartig If it really doesn't need to query the database, can it calculate the size on the client, not the server? (This is a theoretical question, because checking on the client does not protect the server.) @hartig How is it possible to calculate the exact size (in bytes?) without access to the actual data from the database? |
You trivially can't calculate the exact size without knowing e.g. how long a list actually is, or without knowing whether specific nodes are null or not. Those things aren't possible to determine statically because they depend on the actual data. The point of the static complexity calculation validation is to do this fully statically, from just the query and possibly the variables, without using any actual data. |
@FluorescentHallucinogen, @taion is right, our algorithm needs to access the database back-end. @taion Some comments on your sentence "With the current implementation, you end up essentially doubling the load to the database, and still don't protect back ends from malicious, expensive queries."
|
@hartig What about the time (speed) of calculation? Does the check add latency to every request? Do we really need high accuracy (exact size)? |
Yes, the check adds latency. In the experiments with our prototype (and I should emphasize again that this prototype is a very straightforward implementation and can certainly be improved) the latency roughly doubles for "safe queries"; that is, queries whose results do not blow up exponentially. On the other hand, for non-safe queries (whose results do blow up exponentially), the added latency is negligible because producing these large query results takes significantly more time. So, the goal of our new ongoing work that I mentioned is to find a way to identify whether a given query is safe/non-safe by looking only at the schema (and the query itself of course); then, our size-calculation algorithm would have to be run only for the safe queries (which, again, are the cases in which running our algorithm doubles the latency rather than being negligible).
It depends I guess. Estimates can always be way off, and I believe that there is no study of the accuracy of the estimation-based approaches. |
@hartig I have looked at the implementation of the algorithm. As already pointed out by @taion, this significantly adds load to the database backend(s). I also think that it is theoretically impossible to calculate the exact size of the query result without resolving the actual data. If you don't know which values are null, or how many items a list actually returns, you cannot know the size of the result. Take the following schema for example:

type Query {
  listNodes(limit: Int): [MyUnion]
}

type ExpensiveType {
  # Complexity 1000
  complexField: String
}

type CheapType {
  # Complexity 1
  cheapField: String
}

union MyUnion = ExpensiveType | CheapType

Then try to get the size for this query without loading the data:

{
  listNodes(limit: 10) {
    ...on ExpensiveType {
      complexField
    }
    ...on CheapType {
      cheapField
    }
  }
}

This query could have a complexity of 0 if the array has no results; it could also be 10000 if all results are of type ExpensiveType. You would have to essentially duplicate / execute the data resolving logic for the validation rule, which defeats the purpose of protecting the server. Even if you duplicate the data resolving logic or replace it with some other, more performant data loading operation, the check still depends on querying the actual data. Correct me if I am wrong, but I am 99.9 % sure that it is not possible to get the exact result size without resolving the data. |
For complexity calculation heuristics you care about worst cases anyway, or realistic worst cases. Queries get exponentially more expensive if and only if there are nested lists, and as long as you make some assumption that the list returns more than 1 node, you're okay. For a well-behaved paginating client, taking the requested page size as the assumed return length is a pretty good but conservative assumption. Otherwise you can likely derive a good-enough static estimate for the length of the list. The contribution of your approach is that, instead of using a heuristic to estimate the length of lists (and whatever it does for nullable values), it queries the backing data store (which in many cases will not just be a database!) – but that's exactly what we do not want. |
You are right. As I have mentioned above, our algorithm does look into the data. |
I see. |
To make this more concrete, imagine you're building a product like Facebook, and you're faced with a query like:

viewer {
  friends(first: 10) {
    friends(first: 10) {
      friends(first: 10) {
        friends(first: 10) {
          friends(first: 10) {
            friends(first: 10) {
              friends(first: 10) {
                name
              }
            }
          }
        }
      }
    }
  }
}

"Database size" here could be enormous, so actually running the queries is likely quite unattractive. On the other hand, the simple heuristic of assuming that a list query for first: 10 returns 10 items immediately flags this query as far too expensive (on the order of 10^7 leaf fields), without ever touching the data store. |
+1 for integrating this functionality directly into apollo-server. Sangria also has this built in. While there are already solutions like graphql-cost-analysis, they do not work with Apollo Server 2 without some hacks. |
FYI, query cost/complexity validation/analysis is not enough. See the following thread for more info: ravangen/graphql-rate-limit#48. TL;DR: Cost analysis won't care if you receive one request per second or a thousand. A malicious actor could craft an operation that remains under a cost limit, but send a large number of requests. You need fixed-window rate limiting in addition to query cost/complexity validation/analysis. PTAL at https://github.com/ravangen/graphql-rate-limit and https://github.com/teamplanes/graphql-rate-limit. |
👍🏻 One should also validate that, when fetching a list of nodes, a limiting argument is always provided. The GitHub API enforces all of this. It would be great to have all of this built into apollo-server, as it is crucial for any public GraphQL server. |
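(As a sketch of what "require a limiting argument on every list field" could look like as a standard graphql-js validation rule — this is illustrative, not an existing apollo-server feature, and the accepted argument names first/last/limit are assumptions.)

import {
  ASTVisitor,
  GraphQLError,
  ValidationContext,
  isListType,
  isNonNullType,
} from 'graphql';

const LIMIT_ARGS = ['first', 'last', 'limit'];

export function requireListLimitRule(context: ValidationContext): ASTVisitor {
  return {
    Field(node) {
      const fieldDef = context.getFieldDef();
      if (!fieldDef) return;

      // Unwrap a possible non-null wrapper to see whether the field returns a list.
      let type = fieldDef.type;
      if (isNonNullType(type)) type = type.ofType;
      if (!isListType(type)) return;

      const hasLimit = (node.arguments ?? []).some((arg) =>
        LIMIT_ARGS.includes(arg.name.value),
      );
      if (!hasLimit) {
        context.reportError(
          new GraphQLError(
            `List field "${fieldDef.name}" must be queried with one of: ${LIMIT_ARGS.join(', ')}.`,
            node,
          ),
        );
      }
    },
  };
}

Such a rule could then be passed alongside a complexity rule via Apollo Server's validationRules option.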
If you're doing the cost controls in your GraphQL server (as opposed to in some proxy layer upstream), you ideally want to combine both – assign some fixed query cost budget per time window, and limit based on the total cost of the queries executed. The difference, of course, is that doing rate limiting well requires something like Redis, while complexity validation can be done purely within each instance of your GraphQL server. Additionally, there's often a good argument for doing the fixed rate limiting upstream of your GraphQL server anyway, in e.g. Nginx or whatever. But that's definitely a good point. See also 4Catalyzer/graphql-validation-complexity#6 (comment). The extension APIs available in Apollo Server v2 allow a nicer way of incorporating variable values into cost calculations. I haven't had the time to update things to take advantage of that, though. |
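(A rough sketch of the "fixed cost budget per time window" combination — in-process only, with arbitrary numbers; as noted above, a real deployment would typically back this with Redis or enforce it upstream.)

// Fixed-window cost budget per client, kept in process memory.
const WINDOW_MS = 60_000; // 1-minute window (arbitrary)
const BUDGET = 10_000;    // max total query cost per window (arbitrary)

interface CostWindow {
  start: number;
  spent: number;
}

const windows = new Map<string, CostWindow>();

// Returns true if a query with the given statically computed cost still fits
// into the client's remaining budget for the current window.
function trySpend(clientId: string, queryCost: number): boolean {
  const now = Date.now();
  let w = windows.get(clientId);
  if (!w || now - w.start >= WINDOW_MS) {
    w = { start: now, spent: 0 };
    windows.set(clientId, w);
  }
  if (w.spent + queryCost > BUDGET) return false;
  w.spent += queryCost;
  return true;
}

The queryCost passed in would come from whichever complexity estimate is in use, so over-budget requests can be rejected before execution.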
Hi. Thanks for creating this issue. Apollo Server implements query whitelisting, which is a method through which developers can protect against malicious queries. In the future, you can integrate the above heuristics with the request pipeline, which is discussed in the Apollo Server 3.0 Roadmap. |
To calculate query complexity we need cost values. But where can we find them? Today I came across the https://blog.apollographql.com/the-new-trace-view-in-apollo-engine-566b25bdfdb0 blog post, and one idea came to mind: what about automatically getting data from Apollo Managed Graph via its API and using it as cost values in the query cost/complexity validation/analysis npm packages? Is anyone interested in integrating the Apollo Managed Graph API? I've found the API URL: https://engine-graphql.apollographql.com/api/graphql. |
sorta neat. i think you'd have to pull down the values ahead of time as part of the build process, though. you wouldn't want the validation e.g. making API/db calls to get that value. but that would be a really cool idea |
@taion Just remembered that the Apollo VSCode extension shows field stats: After a little research, I've found this: |
@taion Agree, caching the stats data and periodically updating the cache would be a great idea! What about collaborating on implementing this idea into your npm package? I'll help as much as I can. 😉 |
so... how much do you care about list queries? without 4Catalyzer/graphql-validation-complexity#6, we're not going to be able to meaningfully account for query args
so to have something really good, someone is going to have to make something that (i think) works as an apollo extension, so we can observe the variable values when running the cost analysis. normally i'd say that it'd be better to have a base library that can integrate into this sort of timing/cost data, but... if we need to assume apollo anyway for things to work for batched queries in apollo server, then maybe the cost tracing stats should be built in. thoughts? |
cc @pa-bru – i don't mean to misrepresent what graphql-cost-analysis can do, in case there is batch support that i just happened to miss |
This should work with slicknode/graphql-query-complexity without changes to the core lib. You can just create a custom estimator that contains and processes the latencies: This also works as an apollo plugin as implemented in type-graphql: slicknode/graphql-query-complexity#7 (comment) |
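(Sketching what such a latency-driven custom estimator might look like with graphql-query-complexity — the per-field latency map is hypothetical stand-in data, e.g. something pulled from Apollo Graph Manager at build time, and the schema is a toy example.)

import { buildSchema, parse } from 'graphql';
import { getComplexity, simpleEstimator } from 'graphql-query-complexity';

const schema = buildSchema(`
  type Query {
    listNodes(limit: Int): [String]
  }
`);

// Hypothetical average field latencies in ms, bundled with the server.
const fieldLatenciesMs: Record<string, number> = {
  'Query.listNodes': 12.4,
};

// Use the recorded latency as the field's own cost plus the cost of its
// children; return undefined to defer to the next estimator in the chain.
const latencyEstimator = (options: any) => {
  const key = `${options.type.name}.${options.field.name}`;
  const latency = fieldLatenciesMs[key];
  if (latency === undefined) return;
  return latency + options.childComplexity;
};

const complexity = getComplexity({
  schema,
  query: parse('{ listNodes(limit: 10) }'),
  variables: {},
  estimators: [latencyEstimator, simpleEstimator({ defaultComplexity: 1 })],
});

console.log(complexity);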
how do you get the variables when running a batch in apollo? |
After additional research I've found that for development purposes:
|
A more interesting question: should we use the duration in milliseconds as cost values "as is"? Is this correct? |
It is exciting to see some more ideas shared around this topic! 😄 If Apollo Graph Manager data were to be used to inform a complexity analysis mechanism, it is probably reasonable (at least initially) to have a solution that targets Apollo Server, with setup documentation. One notable thing with >= Apollo Server 2.4 is its use of an in-memory store that caches (LRU) successfully parsed and validated documents for future requests (#2111). See the warning at pa-bru/graphql-cost-analysis#32. In withspectrum/spectrum#2874 (comment), Lee Byron notes that validation rules tend to apply in both server and client (tooling) scenarios. He suggests not implementing rate limiting as a validation rule, but instead as its own step. This echoes the graphql.org validation docs' introduction, where validation focuses on the type system and conformity to the specification, not runtime checks (such as min/max values). I would recommend exploring the usage of Apollo Server Plugins instead of a validation rule. I am open to collaborating, as I am not happy with how I currently apply cost analysis. |
I mean the values from Apollo Graph Manager may be less than 1 (e.g. 0.01926215781919789 ms). Should we normalize these values somehow? 🤔 |
I see a few problems with the approach of using the latencies directly:
I'd rather use the latencies to make an informed decision about how to implement the maximum query complexity, which I would do on a worst-case basis to have a stable query validation. |
@FluorescentHallucinogen If the timings are in e.g. fractional seconds, then it seems reasonable to impose a maximum cost in terms of e.g. total amount of time a query is allowed to take, rather than in arbitrary cost units. @ivome I think the idea is to use some suitable long-running average to set per-field costs more intelligently, though. It's a nice way to, say, measure which objects require extra REST calls or whatever. |
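(One simple way to turn fractional-millisecond averages into whole-number cost units, as a sketch — the base unit of 0.1 ms is an arbitrary choice; the maximum allowed cost then corresponds directly to a time budget, e.g. 2000 units ≈ 200 ms of estimated work.)

const BASE_UNIT_MS = 0.1;

// Round each average latency up to a whole number of base units,
// so every field costs at least 1.
function latencyToCost(avgLatencyMs: number): number {
  return Math.max(1, Math.ceil(avgLatencyMs / BASE_UNIT_MS));
}

latencyToCost(0.01926215781919789); // 1
latencyToCost(3.5);                 // 35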
IMO, Apollo Server should have protection from deeply nested and complex query attacks by default (out of the box), especially since many developers might not be aware of these concerns.
The algorithm described in the “Semantics and Complexity of GraphQL” paper by @hartig and @jorgeperezrojas (see the preprint at: http://olafhartig.de/files/HartigPerez_WWW2018_Preprint.pdf) looks very promising. It computes the size of a GraphQL query result (response) without generating this result. The good thing: this algorithm is polynomial in the size of the data and the query/request.
See also: