Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

json unfolding: parse incrementally instead of using cheshire #42638

Merged
merged 5 commits into from
May 15, 2024

Conversation

piranha
Copy link
Contributor

@piranha piranha commented May 14, 2024

This lowers memory and time usage 100x for 50kb strings (our previous limit). For example, those are results for parsing 42kb JSON string (already in memory):

Before: Time per call: 6.32 ms    Alloc per call: 8,732,850b
After:  Time per call: 55.16 us   Alloc per call: 83,254b

Nuances:

  • I've dropped 50k limit, but not 100 keys limit; so in theory some weird really big JSON with low amount of keys could take a bit of time; in practice, it takes about 455ms and 147mb of allocations to go through 13Mb JSON file. Seems acceptable.
  • We could descend into (top-level only?) arrays, but I have no understanding on how to represent them in a resulting sequence; any pointers?
  • It seems that outside of enabling this option specifically, there is no way for Jackson to read float from JSON as BigDecimal; OTOH I don't see cheshire reading them as such, so I'm not too sure which way to go.

Resolves #34798

this lowers memory usage 100x for 50kb strings (our previous limit) and speeds up parsing by an order of magnitude
@piranha piranha added the backport Automatically create PR on current release branch on merge label May 14, 2024
@piranha piranha requested a review from a team May 14, 2024 12:38
@piranha piranha self-assigned this May 14, 2024
@piranha piranha requested a review from camsaul as a code owner May 14, 2024 12:38
@metabase-bot metabase-bot bot added the .Team/BackendComponents also known as BEC label May 14, 2024
Copy link

replay-io bot commented May 14, 2024

Status Complete ↗︎
Commit fcf7a89
Results
⚠️ 2 Flaky
2506 Passed

@crisptrutski
Copy link
Contributor

We could descend into (top-level only?) arrays, but I have no understanding on how to represent them in a resulting sequence; any pointers?

If you're not worried about representing empty arrays, then you can just represent them implicitly by having index integers in the paths of their values.

Copy link
Contributor

@crisptrutski crisptrutski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add some guard rails to the case macro, but don't want to hold this up.

Comment on lines 376 to 381
JsonParser$NumberType/INT Integer
JsonParser$NumberType/LONG Long
JsonParser$NumberType/FLOAT Float
JsonParser$NumberType/DOUBLE Double
JsonParser$NumberType/BIG_DECIMAL BigDecimal
JsonParser$NumberType/BIG_INTEGER clojure.lang.BigInt))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we care about exposing Integer and Float? We could be opinionated like Clojure and just promote them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, maybe; I've tried to repeat whatever behavior we had in place and was just eager to finally get PR up. 😁 I'll look into making this simpler.

~@(concat
(mapcat (fn [[test result]]
#_ {:clj-kondo/ignore [:discouraged-var]}
[(eval (enum-ordinal test)) result])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn this is a pity. Wonder whether we can't use reflection on the class instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably also ensure that all the dispatch enums and the input enum have the same type.

Probably jumping the shark a bit, but it would be nice if we could effectively generate a Java switch statement - especially cool if we enforce exhaustiveness too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah generating switch would be cool, but absolutely no idea how to get there. As for dropping eval - if you have any ideas, I'm all ears. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some checks/tests

Comment on lines 398 to 400
;; we could be more precise here and issue warning about nested fields (the one in `describe-json-fields`),
;; but this limit could be hit by multiple json fields rather than only by this one. So for the sake of
;; issuing only a single warning in logs we'll spill over limit by a single entry (instead of doing `<=`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a some subtle stuff - I had to read it twice to get the gist and, this is a bit embarrassing, I still don't see how we can hit it multiple times. Don't we stop reading tokens and then pop off the entire loop's tail call stack?

I think it's worth explicitly saying that you're leaving the item to trigger the warning later, and that the last1 item will be lopped off again then too.

It's a bit hacky, but I think you could avoid this trick by only logging if res is still a transient, although that means reflection. Alternatively you could use another loop parameter as a latch.

Footnotes

  1. Well, a random item will be removed. A pity that we forget the order by then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't see how we can hit it multiple times

If there are a few fields with JSON, even if you got them under the limit - they could be over 100 keys in total. So not in this function directly, I'll probably add some more words here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha. I hadn't looked much at the surrounding code.

(let [data (into {} (for [i (range (* 2 @#'sql-jdbc.describe-table/max-nested-field-columns))]
[(str "key" i) i]))]
;; +2 to limit since we go 1 over the limit, see comment in `json->types`
(is (> (+ 2 @#'sql-jdbc.describe-table/max-nested-field-columns)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not test exactly what size it is? I think it's fine to update the test if we fix the "1 over" thing later.

Comment on lines 516 to 519
(is (not (case Month/MAY
Month/APRIL false
Month/MAY true
false))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd prefer this test with distinct values for all three branches so it's clearer.

Suggested change
(is (not (case Month/MAY
Month/APRIL false
Month/MAY true
false))))
(is (= 3 (case Month/MAY
Month/APRIL 1
Month/MAY 2
3))))

Copy link
Contributor

@crisptrutski crisptrutski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvements, double ticked

"Like `case`, but explicitly dispatch on Java enum ordinals."
[value & clauses]
#_ {:clj-kondo/ignore [:discouraged-var]}
(let [types (map (comp type eval first) (partition 2 clauses))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there another way to look up the type from a symbol? 😢

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this 🤢

(let [sym 'JsonToken/VALUE_STRING]
  (clojure.lang.Reflector/invokeStaticMethod (resolve (symbol (namespace sym))) 
                                             "valueOf" 
                                             (to-array [(name sym)])))

Still better than eval though, I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(defn sym->enum [s] 🤮)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as they say, be wrong on the internet and learn 😁

Month/MAY true
false)))
(testing "checks for type"
(is (thrown? Exception #"`case-enum` only works.*"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great that we check this, but it still leaves the slightly more likely bug which is having a mismatch just in runtime value we take, e.g.

(u/case-enum DayOfWeek/SUNDAY
  Month/JANUARY 1)

We can't protect against this without reflection or a more advanced technique, so I think this is fine, but perhaps worth a test documenting the sad as well.

~@(concat
(mapcat (fn [[test result]]
#_ {:clj-kondo/ignore [:discouraged-var]}
[(eval (enum-ordinal test)) result])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think this can then be

[(.ordinal (sym->enum test)) result])

PS: do you know how to make inline code fragments use Clojure syntax highlighting?


Passing the same enum type as the ones you're checking in is on you, this is not checked."
[value & clauses]
(let [types (map (comp type sym->enum first) (partition 2 clauses))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You could calculate the type a bit more directly with just (comp resolve symbol namespace first)

@piranha piranha merged commit 5cfc079 into master May 15, 2024
109 checks passed
@piranha piranha deleted the json-unfolding branch May 15, 2024 10:08
Copy link

@piranha Did you forget to add a milestone to the issue for this PR? When and where should I add a milestone?

piranha added a commit that referenced this pull request May 15, 2024
this lowers memory usage 100x for 50kb strings (our limit before this change) and speeds up parsing by an order of magnitude.

For example, results for parsing 42kb string (already in memory):

Before: Time per call: 6.32 ms    Alloc per call: 8,732,850b
After:  Time per call: 55.16 us   Alloc per call: 83,254b
piranha pushed a commit that referenced this pull request May 15, 2024
#42682)

this lowers memory usage 100x for 50kb strings (our limit before this change) and speeds up parsing by an order of magnitude.

For example, results for parsing 42kb string (already in memory):

Before: Time per call: 6.32 ms    Alloc per call: 8,732,850b
After:  Time per call: 55.16 us   Alloc per call: 83,254b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Automatically create PR on current release branch on merge .Team/BackendComponents also known as BEC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make json parsing more efficient on the JSON unfolding feature
2 participants