
Isolating challenging/invalid input data #16

Open · philrz opened this issue Mar 10, 2025 · 1 comment

philrz commented Mar 10, 2025

tl;dr

I did some digging into the challenges with the Bluesky data, using jq as an independent JSON parsing tool. My conclusion is that the current data set contains only 999,999,600 valid JSON documents.

Details

The blog posts about JSONBench disclose the "challenges" with the data set, and the results show that all tested systems reported ingesting fewer than the expected 1 billion "rows". However, after seeing how widely the systems differed in their final counts, I was curious to better understand the validity of the data itself versus any given system's ability to cope with JSON that may be legal but is difficult to parse/store.

jq has been around since 2012 and is pretty universally respected for its JSON handling. Since it's not one of the tools actually tested by JSONBench, it seems like a fair choice for independently assessing the JSON data.

My findings with the JSONBench test data ultimately fall into two buckets:


  1. There are a handful of objects that are legal JSON but are so deeply nested that even out-of-the-box jq struggles to parse them in their original form.

    To observe one of them, attached here as file_0178_line_388828.json:

    $ gzcat file_0178.json.gz | jq . > /dev/null
    jq: parse error: Exceeds depth limit for parsing at line 388828, column 532
    
    $ gzcat file_0178.json.gz | sed -n "388828p" > file_0178_line_388828.json
    

    This "challenge" is actually easy to work around though, as per increase depth limit jqlang/jq#2846 it's easy to recompile jq with an increased value of MAX_PARSING_DEPTH, so I did that before proceeding with isolating the other challenging data.

  2. The remaining "challenges" are all situations where a JSON object is split across multiple lines (i.e., unescaped newlines in the middle), which is a violation of the JSON spec. Going in numeric order through the data, the first place this occurs is in file_0005.json.gz.

    $ gzcat file_0005.json.gz | jq . > /dev/null
    jq: parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 91841, column 80
    

    The specific error message from jq varies depending on whether the split happens in the middle of a string value, a numeric value, etc., but all of these cases can be fixed in the same way: concatenate the multiple lines back together while removing the newlines. I put together the attached fix-json-errors.sh, which does precisely this and creates a "healed" version of a given test file that then parses 100% successfully with jq (a sketch of the general approach appears after this list).
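
As a footnote to the depth-limit workaround in item 1 above, here's a minimal sketch of rebuilding jq with a larger limit. It assumes MAX_PARSING_DEPTH is a plain #define in src/jv_parse.c (per the discussion in jqlang/jq#2846); the exact file, default value, and sed flavor (GNU vs. BSD) may differ on your system, so treat this as illustrative rather than exact.

$ git clone https://github.com/jqlang/jq.git && cd jq
$ git submodule update --init          # pull in the bundled oniguruma
$ # bump the hard-coded depth limit (10000 is an arbitrary, generous choice)
$ sed -i 's/#define MAX_PARSING_DEPTH.*/#define MAX_PARSING_DEPTH (10000)/' src/jv_parse.c
$ autoreconf -i && ./configure --with-oniguruma=builtin && make -j
$ ./jq --version                       # use this locally built binary for the checks above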
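
For reference, here is a minimal sketch of the line-rejoining idea (this is not the attached fix-json-errors.sh, just an illustration under a hypothetical name, heal-json.sh): buffer physical lines until the buffer parses as a complete JSON document, then emit it as a single line. It is correct but slow, since it runs one jq invocation per buffered candidate.

#!/usr/bin/env bash
# heal-json.sh (hypothetical): emit one valid JSON document per output line by
# rejoining documents that were split across multiple physical lines.
# Usage: ./heal-json.sh file_0005.json.gz > file_0005.healed.json
set -euo pipefail

buf=""
while IFS= read -r line; do
    buf="${buf}${line}"                        # concatenate, dropping the newline
    if printf '%s' "$buf" | jq -e . >/dev/null 2>&1; then
        printf '%s\n' "$buf"                   # buffer now parses: emit it as one doc
        buf=""
    fi
done < <(gzcat "$1")

[ -z "$buf" ] || echo "warning: trailing data never parsed" >&2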


I've attached an output.log that came from running the fix script across the entire data set.

$ for num in $(seq 1 1000); do ./fix-json-errors.sh "file_$(echo "000${num}" | tail -c5).json.gz" | tee -a output.log; done

The bottom line is that it found one or more JSON objects to "heal" in 188 out of the 1,000 original files. After gzip'ing the 188 healed files and getting a final count:

$ for file in *.gz; do   echo "$file: $(gzcat $file | jq -c -s '. | length')" | tee -a counts.txt; done

$ cat counts.txt | cut -f2 -d: | awk '{sum+=$1} END {print sum}'
999999600

Next Steps

I'd be curious to know what action, if any, the project maintainers think is justified based on what's shown here. Since I have an interest in the project and have just gone through the exercise of cleaning up the data, I'll start by offering some opinions of my own.

For the "line split" class of problems, I do feel like it would be worth weeding these out of the data set. For starters, strictly speaking, as long as those are in there, the data set comes up just short of hitting the stated "billion docs JSON challenge", i.e., there's actually 999,999,600 JSON docs. One might argue that a system-under-test's ability to cope with such "bad data" and still successfully ingest the remaining "good data" is a valid test. Furthermore, if a system-under-test actually had code that specifically recognized this particular violation of the JSON spec (e.g., if it's common enough of a problem to justify a built-in workaround) and "auto-healed" the data during ingest that might be a useful enhancement. However, I'd argue that all of these are orthogonal to the goals of a benchmark project for which observing performance is the primary motivation.

For this reason I'd be in favor of fixing the data set to remove the problem data. If there's interest in using the 188 "healed" files, I'd be happy to make them available somewhere for download, after which hopefully the project maintainers could find an additional 400 "good" Bluesky events to add to the data set, perhaps selectively appending them to the "healed" files that are now each a few events short of the original million-JSON-objects-per-file count.

I'm a little less sure whether action is necessary for the "parsing depth" objects in the data set. I've not gone through the exercise of seeing how every system-under-test reacts to these JSON objects, but considering that out-of-the-box jq choked on them, I could see a case for eliminating those as well, again on the argument that such degenerate data is orthogonal to the goal of benchmarking. However, other challenges with the data have already been identified (e.g., inconsistent event structure) and the project maintainers seem content to let those ride, and I'm personally of the same mindset when it comes to the "parsing depth" challenge. That said, if there's interest in isolating/eliminating all of those, I do have logic in some scripts that could identify each file and line number where they occur and produce "healed" files that drop them, but I don't have it close at hand, so let me know if there's interest and I can provide it.

Acknowledgement

Just want to close by saying "thanks!" for making JSONBench and the test data available. Very useful!

rschu1ze (Member) commented

Thanks @philrz, this is an impressive investigation.

The original author @tom-clickhouse had some thoughts about the ability of each database-under-test to ingest the data: https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#approximate-dataset-counts-are-allowed . I would say that except for MongoDB and Postgres, the "data quality" (rate of ingested documents) is pretty good. The metric reflects how well the databases are able to deal with "unusual" JSONs (extreme nesting, unexpected line breaks), and comparing databases against it has some value in itself (in my view). Even if that makes it a "one billion JSON minus a delta of non-parseable documents" challenge.

My personal opinion is that we can keep things as is but perhaps other people have different views?
