tl;dr
I did some digging into the challenges with the Bluesky data using jq as an independent JSON parsing tool. My conclusion is that the current data set consists of only 999,999,600 JSON documents.
Details
The blog posts about JSONBench disclose the "challenges" with the data set, and the results show that all tested systems reported ingesting fewer than the expected 1 billion "rows". However, once I saw how widely the systems differed in their final counts, I was curious to better understand the validity of the data itself versus any given system's ability to cope with what might be legal JSON but happens to be difficult to parse/store.
jq has been around since 2012 and is widely respected for its JSON handling. Since it's not one of the systems actually tested by JSONBench, it seems like a fair choice to independently assess the JSON data.
My findings with the JSONBench test data ultimately fall into two buckets:
There's a handful of objects that are legal JSON but happen to have an extreme depth that even out-of-the-box jq struggles to parse in their original form. To observe one of them (attached here as file_0178_line_388828.json):
$ gzcat file_0178.json.gz | jq . > /dev/null
jq: parse error: Exceeds depth limit for parsing at line 388828, column 532
$ gzcat file_0178.json.gz | sed -n "388828p" > file_0178_line_388828.json
This "challenge" is actually easy to work around though, as per increase depth limit jqlang/jq#2846 it's easy to recompile jq with an increased value of MAX_PARSING_DEPTH, so I did that before proceeding with isolating the other challenging data.
The remaining "challenges" are all situations where a JSON object is actually split across multiple lines (i.e., un-escaped newlines in the middle), which is a violation of the JSON spec. Going in numeric order through the data, the first place where this occurs is in file_0005.json.gz.
$ gzcat file_0005.json.gz | jq . > /dev/null
jq: parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 91841, column 80
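For reference, the offending record and its continuation can be inspected in place with the same sed trick used above, plugging in the line number from the jq error (widen the range if the object turns out to span more than two physical lines):

$ gzcat file_0005.json.gz | sed -n '91841,91842p'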
The specific error message from jq varies depending on whether the split happens in the middle of a string value, a numeric value, etc., but all of them can be fixed the same way: concatenate the multiple lines while removing the newlines. I put together the attached fix-json-errors.sh, which does precisely this and creates a "healed" version of a given test file that then parses 100% successfully with jq.
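To make the line-joining idea concrete, here's a deliberately simplified sketch of the same approach. It is not the attached fix-json-errors.sh; the _healed output name is just for illustration, and it assumes the depth-limit fix above is already in place so that every fully joined record actually parses:

#!/usr/bin/env bash
# Illustrative sketch only (not the attached fix-json-errors.sh).
# Accumulate physical lines until the buffer parses as one JSON value,
# then emit it on a single line; this repairs objects split by stray newlines.
set -euo pipefail

infile="$1"                                  # e.g. file_0005.json.gz
outfile="${infile%.json.gz}_healed.json"     # hypothetical output name
: > "$outfile"

buffer=""
joins=0
while IFS= read -r line; do
  if [ -z "$buffer" ]; then
    buffer="$line"
  else
    buffer="${buffer}${line}"                # join, dropping the stray newline
    joins=$((joins + 1))
  fi
  # `jq empty` exits non-zero while the buffer is still an incomplete JSON value
  if printf '%s' "$buffer" | jq empty > /dev/null 2>&1; then
    printf '%s\n' "$buffer" >> "$outfile"
    buffer=""
  fi
done < <(gzcat "$infile")

echo "$infile: performed $joins line joins"

Spawning a jq process per input line makes this far too slow for million-line files; it's only meant to show the joining logic, not to stand in for the attached script.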
I've attached an output.log that came from running the fix script across the entire data set.
$ for num in $(seq 1 1000); do ./fix-json-errors.sh "file_$(echo "000${num}" | tail -c5).json.gz" | tee -a output.log; done
The bottom line is that it found one or more JSON objects to "heal" in 188 out of the 1000 original files. After GZIP'ing the 188 healed files and getting a final count:
$ for file in *.gz; do echo "$file: $(gzcat $file | jq -c -s '. | length')" | tee -a counts.txt; done
$ cat counts.txt | cut -f2 -d: | awk '{sum+=$1} END {print sum}'
999999600
Next Steps
I'd be curious to know what action, if any, the project maintainers think is justified based on what's shown here. As I have an interest in the project and have just gone through the exercise of cleaning up the data, I'll start by offering some opinions of my own.
For the "line split" class of problems, I do feel like it would be worth weeding these out of the data set. For starters, strictly speaking, as long as those are in there, the data set comes up just short of hitting the stated "billion docs JSON challenge", i.e., there's actually 999,999,600 JSON docs. One might argue that a system-under-test's ability to cope with such "bad data" and still successfully ingest the remaining "good data" is a valid test. Furthermore, if a system-under-test actually had code that specifically recognized this particular violation of the JSON spec (e.g., if it's common enough of a problem to justify a built-in workaround) and "auto-healed" the data during ingest that might be a useful enhancement. However, I'd argue that all of these are orthogonal to the goals of a benchmark project for which observing performance is the primary motivation.
For this reason I'd be in favor of fixing the data set to remove the problem data. If there's interest in using the 188 "healed" files, I'd be happy to make them available somewhere for download. After that, the project maintainers could hopefully find an additional 400 "good" Bluesky events to add to the data set, perhaps selectively appending them to the "healed" files that are now each a few events short of the original million-JSON-objects-per-file count.
I'm a little less sure whether action is necessary for the "parsing depth" objects in the data set. I've not gone through the exercise of seeing how every system-under-test reacts to these JSON objects, but considering that out-of-the-box jq choked on them, I could see a case for eliminating those as well, once again on the argument that using such degenerate data is orthogonal to the goal of benchmarking. However, other challenges with the data have already been identified (e.g., inconsistent event structure) and the project maintainers seem content to let those ride; I'm personally of the same mindset when it comes to the "parsing depth" challenge. That said, if there's interest in isolating/eliminating all of those, I do have logic in some scripts that could identify each file and line number where they exist and produce "healed" files that drop them. I don't have it close at hand, so let me know if there's interest and I can provide it.
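In the meantime, a crude way to locate them with stock jq, whose error message already includes the line number, would be something along these lines (a sketch, not the scripts mentioned above):

# Report where out-of-the-box jq hits the depth limit. jq aborts at the first
# parse error of any kind, so this surfaces at most one offending line per file,
# and an earlier split-line error can mask a later depth error; it's therefore
# best run against already-healed copies of the files.
$ for file in file_*.json.gz; do gzcat "$file" | jq . 2>&1 >/dev/null | grep 'Exceeds depth limit' | sed "s|^|$file: |"; done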
Acknowledgement
Just want to close by saying "thanks!" for making JSONBench and the test data available. Very useful!
Thanks @philrz, this is an impressive investigation.
The original author @tom-clickhouse had some thoughts about the ability of each database-under-test to ingest the data: https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#approximate-dataset-counts-are-allowed. I would say that, except for MongoDB and Postgres, the "data quality" (rate of ingested documents) is pretty good. The metric reflects how well the databases are able to deal with "unusual" JSONs (extreme nesting, unexpected line breaks), and comparing databases against it has some value in itself (in my view), even if that makes it a "one billion JSON minus a delta of non-parseable documents" challenge.
My personal opinion is that we can keep things as is but perhaps other people have different views?