memory leak #32
cool, I was doing this with the npm data set, which is only 30k, and at first it didn't work, then I fixed it so that it did. https://github.com/dominictarr/JSONStream/blob/master/index.js#L76 made it work, but to be honest I was only guessing.
I'm currently running 0.6.4, which includes that line, but I'm still having serious memory leak issues. I did run an interesting experiment, though. I think the biggest culprit is the underlying jsonparse library. When running the same data set against a custom streamer plus the parser, I observed similar memory leak behavior:

```js
var request = require('request'),
    Stream = require('stream').Stream,
    Parser = require('jsonparse');

var p = new Parser();
p.onValue = function (value) {
  console.log(value)
}

var down = new Stream()
down.writable = true
down.write = function (data) {
  p.write(data)
  return true
}
down.end = function () {
  console.log('end')
}

var host = process.env.DB_HOST
var path = '_all_docs?include_docs=true'
var url = host + '/' + dbName + '/' + path  // dbName: defined elsewhere, not shown in this snippet

request(url).pipe(down)
```

Memory usage steadily increases from 50 to 260 MB, then midway through it jumps to 500-600 MB. I feel like the JSON parser queues data up for something and then does something else with the queued data afterwards. I'm going to post a ticket with that project, but do you have any ideas why this might be happening?
well, that is by design actually! json-parse is trying to collect the whole object! we usually want to throw out the other stuff...
Do you know why it was implemented this way? Bountysource.com looks really awesome! Browsing it now.. :)
hmm, I guess that just seemed like the right idea at the time...
BTW, I'm looking at the sax parser gist referenced on the jsonparse project README. From my initial testing, it has much better performance characteristics. Do you think it would be possible to port JSONStream to use the sax parser instead?
hmm, you'd just have to match the paths... noting when you go into or come out of an object, and whether that is into a path you are interested in... (and only collecting a whole object when it matches the end of your path!) seems reasonable - if that had been exposed at the time I probably would have done it like that... Will be happy to merge if you were to undertake this rewrite!
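A rough sketch of that path-matching idea, assuming generic SAX-style callbacks (the event names, the pushDownstream helper, and the simplified handling that ignores arrays are all placeholders, not the API of sax, clarinet, or jsonparse):

```js
var target = ['rows', '*', 'doc']   // the path we care about, e.g. rows.*.doc
var path = []                       // keys of the objects we are currently inside
var current = null                  // object being collected, only while on the target path

function pathMatches () {
  return path.length === target.length &&
    path.every(function (key, i) { return target[i] === '*' || target[i] === key })
}

function onOpenObject (firstKey) {  // going into an object
  path.push(firstKey)
  if (!current && pathMatches()) current = {}          // start collecting here
}
function onKey (key) { path[path.length - 1] = key }   // moving to the next key
function onValue (value) {
  if (current) current[path[path.length - 1]] = value  // keep only matching data
}
function onCloseObject () {                            // coming out of an object
  if (current && pathMatches()) {
    pushDownstream(current)          // emit the matched object...
    current = null                   // ...then drop it, so memory stays flat
  }
  path.pop()
}
```

Everything outside the matching subtree is discarded as soon as it is seen, which is the opposite of json-parse's collect-the-whole-object behaviour.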
Looking into this right now. Will keep you posted!
BTW, do you think it would be useful to have another fixture that has maybe 20k objects or more, to more easily expose big-data memory leaks?
absolutely, but it's best to generate it rather than check it in, though. this script generates a large test object: https://github.com/dominictarr/JSONStream/blob/master/test/gen.js
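For reference, a minimal generator along those lines might look like this (the row shape, the row count, and the output file name are arbitrary choices for this sketch, not what gen.js actually produces):

```js
var fs = require('fs')

// Stream a large JSON document to disk instead of checking a huge fixture into git.
var out = fs.createWriteStream('big-fixture.json')
var rows = 20000

out.write('{"rows":[\n')
for (var i = 0; i < rows; i++) {
  var row = { id: i, doc: { name: 'doc-' + i, value: Math.random() } }
  out.write(JSON.stringify(row) + (i < rows - 1 ? ',\n' : '\n'))
}
out.write(']}\n')
out.end()
```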
I just updated all the dependencies and devDependencies in package.json and ran the gen.js test at 5 million and it passed without running out of memory.
@nickpoorman would you mind making a pull request with the updated dependencies? thanks
GOOD NEWS. I have a fix to the memory leak on the memory-leak branch. To test:

```sh
curl 'http://isaacs.iriscouch.com/registry/_all_docs?include_docs=true' > npm.json
```

then, if you pipe that to JSONStream and filter out a deep path (this part is important):

```sh
cd JSONStream
./index.js rows.*.doc.versions.*._npmUser < npm.json
```

and then use … The only problem is that the tests for @PaulMougel's #33 …
Is this still an issue with JSONStream?
there is still a little bit of work to do here, as described in my last comment.
Sorry about that, let me look into it.
It's been a long time since I looked at this chunk of code; I don't quite remember why … This dumb diff

```diff
diff --git a/index.js b/index.js
index 6ecd87b..9d1bd5c 100755
--- a/index.js
+++ b/index.js
@@ -68,7 +68,8 @@ exports.parse = function (path, map) {
       if (!c) return
       if (check(nextKey, c.key)) {
         i++;
-        this.stack[j].value = null
+        if (j < this.stack.length)
+          this.stack[j].value = null
         break
       }
       j++
```

makes the tests on …
@dominictarr any idea?
I just had this issue, any workarounds more than welcome!
Since this issue is quite old, still not fixed, and caused by a badly-documented and buggy dependency (creationix/jsonparse#8), I would recommend using a more tested library, like https://www.npmjs.com/package/clarinet or https://www.npmjs.com/package/stream-json
Okay, so I have decided that the memory leak is more important than some of the little use-cases (multiple objects and the descent operator) people have contributed. In the interest of parsing large objects I have merged my memory-leak-fixing changes, but disabled the tests for the features #33 and #19, and bumped the major version. Of course I would be happy to merge a pull request that brings these features back.
Oh never mind... I can't publish a new npm version because of npm/npm#7260
@dominictarr I've been investigating why …
@santigimeno great work. I think we should drop features (like root object) if we need to solve the memory problem. People who need those features can just use JSONStream@0
@dominictarr agreed
- After releasing v1.0.0 this functionality is not working correctly anymore.
- As commented in dominictarr#32 (comment), people needing this feature should stick to JSONStream 0.x versions.
I tried to start using JSONStream in an application where I'm passing ~100 MB of JSON around, but I would always run into out-of-memory crashes. I've switched away from JSONStream by changing the JSON to be newline-separated; I use readline and JSON.parse on each line and haven't had any more issues.
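That workaround looks roughly like this (the file name and the handle function are assumptions for this sketch; each line of the file holds one complete JSON object):

```js
var fs = require('fs')
var readline = require('readline')

// One JSON document per line, so each line is parsed independently and
// no single JSON.parse call ever has to hold the whole ~100 MB payload.
var rl = readline.createInterface({ input: fs.createReadStream('data.ndjson') })

rl.on('line', function (line) {
  if (!line.trim()) return      // skip blank lines
  var obj = JSON.parse(line)
  handle(obj)                   // placeholder for the per-object processing
})

rl.on('close', function () {
  console.log('done')
})
```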
Wait for writes to finish before doing more writes.
Hey @dominictarr any news on this? I would love to be able to keep using JSONStream but it's unfortunately not feasible because of the memory leak. It seems like everything is almost in place to be able to have a better memory footprint? Or should we all move to https://github.com/uhop/stream-json?
@vvo this is fixed as well as it can be. it depends on the path being used, it's not really a memory leak if you ask it to parse literally everything, so there are still certain ways to run out of memory with JSONStream, but it's not JSONStream's fault. JSONStream doesn't even parse anything, it just wraps a nicer API around json-parse. There have been other streaming parsers implemented since then - as linked above, but I don't actually know whether they have better memory/cpu performance. What I would love to see is to make JSONStream take pluggable back ends (json-parse, clarinet, stream-json would be options) then we could actually do a side-by-side comparison, and choose the best one. Would very gladly take a PR for this!
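One possible shape for such a pluggable backend, sketched purely to illustrate the idea (none of these names are part of JSONStream's current API):

```js
// Hypothetical adapter: each backend wraps a different low-level parser
// behind the same { write, end } surface plus a single value callback.
function jsonparseBackend (onValue) {
  var Parser = require('jsonparse')
  var p = new Parser()
  p.onValue = function (value) {
    onValue(this.stack.length, this.key, value)   // depth, key, value
  }
  return {
    write: function (chunk) { p.write(chunk) },
    end: function () {}    // jsonparse needs no explicit flush here
  }
}

// A clarinet- or stream-json-based backend would expose the same surface,
// so the path-matching layer could swap parsers for a side-by-side benchmark.
```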
@dominictarr I was completely doing the "wrong" thing on my end. Basically I have an HTTP request that I pipe into JSONStream, then for every object I would queue it to be synced to a backend. But the backend is slower than the HTTP request, so I quickly run out of memory. I guess I will have to wrap my mind around how to handle this difference in processing power using streams. Anyway, thanks a lot for this module, very efficient!
@vvo sounds like you need some backpressure! this is what streams are all about! you just need to pause the stream when there are too many items in the processing queue. I usually use https://github.com/pull-stream/pull-paramap for something like that (although sorry to drop a completely different stream api on you, I've switched to using pull-streams because they make things like back pressure somewhat easier to reason about). I'm sure you can find a node-stream version if you dig around a bit.
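With plain node streams, the pause-when-the-queue-is-full idea might look roughly like this (the URL, the limit of 10, and syncToBackend are assumptions for this sketch, not code from this thread):

```js
var request = require('request')
var JSONStream = require('JSONStream')

var url = 'http://localhost:5984/mydb/_all_docs?include_docs=true'  // example URL
var parser = JSONStream.parse('rows.*.doc')

var inFlight = 0
var LIMIT = 10                       // arbitrary cap on queued work

request(url).pipe(parser)

parser.on('data', function (doc) {
  inFlight++
  if (inFlight >= LIMIT) parser.pause()        // stop pulling more JSON for now

  syncToBackend(doc, function (err) {          // placeholder for the slow backend call
    if (err) console.error(err)                // error handling kept minimal
    inFlight--
    if (inFlight < LIMIT) parser.resume()      // let the stream flow again
  })
})
```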
@vvo https://github.com/mafintosh/parallel-transform should work for your use case. You just have to be wary of what the …
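For example, something along these lines (the parallelism limit of 10, the URL, and syncToBackend are assumptions; parallel-transform emits results in order by default):

```js
var request = require('request')
var JSONStream = require('JSONStream')
var transform = require('parallel-transform')

var url = 'http://localhost:5984/mydb/_all_docs?include_docs=true'  // example URL

// At most 10 docs are processed concurrently; pipe() applies backpressure
// upstream whenever the transform's internal buffer fills up.
var sync = transform(10, function (doc, callback) {
  syncToBackend(doc, function (err) {   // placeholder for the slow backend call
    callback(err, doc)
  })
})

request(url)
  .pipe(JSONStream.parse('rows.*.doc'))
  .pipe(sync)
  .on('data', function (doc) { /* already synced, in order */ })
  .on('error', console.error)
```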
This may be relevant here; it's not the same leak, but there does seem to be an excessive memory use bug in jsonparse that might be affecting some people reading this thread. See creationix/jsonparse#31
I have not been able to figure out where exactly in the chain I get this memory leak, but whenever I pipe request's response through JSONStream from CouchDB's _all_docs (about 85k docs), it eats up my memory like crazy. Has anyone else experienced this? From my limited experience creating custom stream objects, I think memory usage should stay fairly consistent while parsing.
Example code:
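A minimal pipeline of the shape described above (the URL, database name, and the rows.*.doc path are assumptions for illustration, not the author's original snippet):

```js
var request = require('request')
var JSONStream = require('JSONStream')

// ~85k rows from CouchDB's _all_docs, streamed and parsed incrementally.
var url = 'http://localhost:5984/mydb/_all_docs?include_docs=true'  // assumed URL

request(url)
  .pipe(JSONStream.parse('rows.*.doc'))
  .on('data', function (doc) {
    // per-doc processing; memory usage should stay roughly flat here
  })
  .on('error', console.error)
```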