
Isolating challenging/invalid input data #16

Open · philrz opened this issue Mar 10, 2025 · 1 comment

philrz commented Mar 10, 2025

tl;dr

I did some digging into the challenges with the Bluesky data, using jq as an independent JSON parsing tool. My conclusion is that the current data set contains only 999,999,600 valid JSON documents.

Details

The blog posts about JSONBench disclose the "challenges" with the data set, and the results show that all tested systems reported ingesting fewer than the expected 1 billion "rows". However, after seeing how widely the systems differed in their final counts, I was curious to better understand the validity of the data itself versus any given system's ability to cope with JSON that may be legal but is difficult to parse/store.

jq has been around since 2012 and is pretty universally respected for its JSON handling. Since it's not one of the tools actually tested by JSONBench, it seems like a fair choice for independently assessing the JSON data.

My findings with the JSONBench test data ultimately fall into two buckets:


  1. There are a handful of objects that are legal JSON but are so deeply nested that even out-of-the-box jq struggles to parse them in their original form.

    To observe one of them, attached here as file_0178_line_388828.json:

    $ gzcat file_0178.json.gz | jq . > /dev/null
    jq: parse error: Exceeds depth limit for parsing at line 388828, column 532
    
    $ gzcat file_0178.json.gz | sed -n "388828p" > file_0178_line_388828.json
    

    This "challenge" is actually easy to work around though, as per increase depth limit jqlang/jq#2846 it's easy to recompile jq with an increased value of MAX_PARSING_DEPTH, so I did that before proceeding with isolating the other challenging data.

  2. The remaining "challenges" are all situations where a JSON object is split across multiple lines (i.e., unescaped newlines in the middle), which is a violation of the JSON spec. Going in numeric order through the data, the first place this occurs is in file_0005.json.gz.

    $ gzcat file_0005.json.gz | jq . > /dev/null
    jq: parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 91841, column 80
    

    The specific error message from jq varies depending on whether the split happens in the middle of a string value, a numeric value, etc., but all of these cases can be fixed in the same way: concatenate the multiple lines back together while removing the newlines. I put together the attached fix-json-errors.sh, which does precisely this and creates a "healed" version of a given test file that then parses 100% successfully with jq (a sketch of the general approach appears after this list).
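
As a footnote to the depth-limit workaround in item 1 above, here's a minimal sketch of rebuilding jq with a larger limit. It assumes MAX_PARSING_DEPTH is a plain #define in src/jv_parse.c (per the discussion in jqlang/jq#2846); the exact file, default value, and sed flavor (GNU vs. BSD) may differ on your system, so treat this as illustrative rather than exact.

$ git clone https://github.com/jqlang/jq.git && cd jq
$ git submodule update --init          # pull in the bundled oniguruma
$ # bump the hard-coded depth limit (10000 is an arbitrary, generous choice)
$ sed -i 's/#define MAX_PARSING_DEPTH.*/#define MAX_PARSING_DEPTH (10000)/' src/jv_parse.c
$ autoreconf -i && ./configure --with-oniguruma=builtin && make -j
$ ./jq --version                       # use this locally built binary for the checks above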
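
For reference, here is a minimal sketch of the line-rejoining idea (this is not the attached fix-json-errors.sh, just an illustration under a hypothetical name, heal-json.sh): buffer physical lines until the buffer parses as a complete JSON document, then emit it as a single line. It is correct but slow, since it runs one jq invocation per buffered candidate.

#!/usr/bin/env bash
# heal-json.sh (hypothetical): emit one valid JSON document per output line by
# rejoining documents that were split across multiple physical lines.
# Usage: ./heal-json.sh file_0005.json.gz > file_0005.healed.json
set -euo pipefail

buf=""
while IFS= read -r line; do
    buf="${buf}${line}"                        # concatenate, dropping the newline
    if printf '%s' "$buf" | jq -e . >/dev/null 2>&1; then
        printf '%s\n' "$buf"                   # buffer now parses: emit it as one doc
        buf=""
    fi
done < <(gzcat "$1")

[ -z "$buf" ] || echo "warning: trailing data never parsed" >&2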


I've attached an output.log that came from running the fix script across the entire data set.

$ for num in $(seq 1 1000); do ./fix-json-errors.sh "file_$(echo "000${num}" | tail -c5).json.gz" | tee -a output.log; done

The bottom line is that it found one or more JSON objects to "heal" in 188 out of the 1,000 original files. After gzip'ing the 188 healed files and getting a final count:

$ for file in *.gz; do   echo "$file: $(gzcat $file | jq -c -s '. | length')" | tee -a counts.txt; done

$ cat counts.txt | cut -f2 -d: | awk '{sum+=$1} END {print sum}'
999999600

Next Steps

I'd be curious to know what action, if any, the project maintainers think is justified based on what's shown here. Since I have an interest in the project and have just gone through the exercise of cleaning up the data, I'll start by offering some opinions of my own.

For the "line split" class of problems, I do feel like it would be worth weeding these out of the data set. For starters, strictly speaking, as long as those are in there, the data set comes up just short of hitting the stated "billion docs JSON challenge", i.e., there's actually 999,999,600 JSON docs. One might argue that a system-under-test's ability to cope with such "bad data" and still successfully ingest the remaining "good data" is a valid test. Furthermore, if a system-under-test actually had code that specifically recognized this particular violation of the JSON spec (e.g., if it's common enough of a problem to justify a built-in workaround) and "auto-healed" the data during ingest that might be a useful enhancement. However, I'd argue that all of these are orthogonal to the goals of a benchmark project for which observing performance is the primary motivation.

For this reason I'd be in favor of fixing the data set to remove the problem data. If there's interest in using the 188 "healed" files, I'd be happy to make them available somewhere for download, after which hopefully the project maintainers could find an additional 400 "good" Bluesky events to add to the data set, perhaps selectively appending them to the "healed" files that are now each a few events short of the original million-JSON-objects-per-file count.

I'm a little less sure whether action is necessary for the "parsing depth" objects in the data set. I've not gone through the exercise of seeing how every system-under-test reacts to these JSON objects, but considering that out-of-the-box jq choked on them, I could see a case for eliminating those as well, again on the argument that such degenerate data is orthogonal to the goal of benchmarking. However, other challenges with the data have already been identified (e.g., inconsistent event structure) and the project maintainers seem content to let those ride, and I'm personally of the same mindset when it comes to the "parsing depth" challenge. That said, if there's interest in isolating/eliminating all of those, I do have logic in some scripts that could identify each file and line number where they occur and produce "healed" files that drop them, but I don't have it close at hand, so let me know if there's interest and I can provide it.

Acknowledgement

Just want to close by saying "thanks!" for making JSONBench and the test data available. Very useful!

rschu1ze (Member) commented

Thanks @philrz, this is an impressive investigation.

The original author @tom-clickhouse had some thoughts about the ability of each database-under-test to ingest the data: https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#approximate-dataset-counts-are-allowed . I would say that except for MongoDB and Postgres, the "data quality" (rate of ingested documents) is pretty good. The metric reflects how well the databases are able to deal with "unusual" JSONs (extreme nesting, unexpected line breaks), and comparing databases against it has some value in itself (in my view). Even if that makes it a "one billion JSON minus a delta of non-parseable documents" challenge.

My personal opinion is that we can keep things as is but perhaps other people have different views?
