Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Faster unions (vs z.switch()) #3407

Open
colinhacks opened this issue Apr 16, 2024 · 13 comments
Open

RFC: Faster unions (vs z.switch()) #3407

colinhacks opened this issue Apr 16, 2024 · 13 comments

Comments

@colinhacks
Copy link
Owner

colinhacks commented Apr 16, 2024

This is a followup discussion to #2106

Okay, I'm glad I didn't rush a z.switch() implementation, because I now think there's a better way forward. This is definitely my own brilliant idea and not something @gcanti suggested in a Twitter DM.

It's pretty simple, we just...make z.union() better. Here's the case for sticking with plain z.union():

  • It's introspectable, unlike z.switch(). Code generation tools need a way to enumerate all elements of a union, and that isn't possible with z.switch().
  • It hews more closely to TypeScript conceptually (remember, TypeScript has no special concept of a "discriminated union")
  • Implemented well, it can approach the performance of z.discriminatedUnion

How? The idea is for Zod to do some "pre-computation" at the schema creation time to make parsing faster.

  • All Zod schemas will implement a .getLiterals(): Set<Primitive> method that returns the set of known literal values that will pass validation. For something like z.string(), this will be undefined (there are infinitely many valid inputs). For something like a ZodEnum or ZodLiteral, this will be a finite set of values.
    • This method can also be implemented by user-defined subclasses, so this doesn't break extensibility!
  • When you create a z.union(), Zod uses getLiterals() to extract a set of "fast checks" for each union element in the form {[k: string]: Set<Primitive>.
  • During parsing, Zod will do use these "fast checks" to quickly determine if the input has any chance of parsing properly against the schema.
  • Ideally in most cases, only one schema will remain after the fast checks. Zod will use that schema to validate the input.
  • If multiple union items pass the "fast check", the input will be validated against each one in succession (like a regular union).
  • Relatedly: Zod will implement a much-requested abortEarly mode to bail out ASAP in the event of validation errors. This mode will speed up ZodUnion performance even if there's no discrimination to be done. This was already on the Zod 4 roadmap but is made even more relevant now.

The API could also accept a "discriminator hint" to point the parser in the right direction.

z.union([ ... ], {
  discriminator: "someKey"
})

In retrospect, this is what the API always should have been. Discriminated unions are not a type unto themselves, just an optimization over unions, and the API should have reflected that.

@igalklebanov
Copy link
Contributor

igalklebanov commented Apr 16, 2024

LGTM! 💪

All Zod schemas will implement a .getLiterals(): Set method that returns the set of known literal values that will pass validation. For something like z.string(), this will be undefined (there are infinitely many valid inputs). For something like a ZodEnum or ZodLiteral, this will be a finite set of values.

Wondering if .getLiterals() should be optional (it's only relevant for types that should return something meaningful and not undefined), or inherited from base (returning undefined) and only overriden when it matters.

Am I missing something?

P.S. abort early! 🎉

@ssalbdivad
Copy link

ssalbdivad commented Apr 16, 2024

I think this makes a lot of sense! ArkType goes pretty far down the "let's create a runtime type system and handle discrimination robustly" rabbit hole, but I think using heuristics like checking only literals at the root of the object can get you most of the benefit and save a ton of complexity.

The other category of disjoints besides literals that might be worth considering is typeof, e.g. if a string, number and boolean are at the same path. This often occurs incidentally, so it's a nice way to discriminate more unions without users having to think about it.

If it's of any use, this is the algorithm AT uses to determine the "best" discriminator (or sequence of discriminators, if it requires multiple steps) for a union:

https://github.com/arktypeio/arktype/blob/d000f90e8950917db19f2af437eafb22e390485b/ark/type/types/discriminate.ts#L64

Happy to discuss in more detail any time if it would be useful! Definitely one of my favorite subjects if you don't mind hearing me nerd out a bit over it 😄

@jrysana
Copy link

jrysana commented Apr 16, 2024

Am I understanding this right? Before:

z.discriminatedUnion("kind", [
  z.object({ kind: z.literal("number"), amount: z.number() }),
  z.object({ kind: z.literal("boolean"), value: z.boolean() })
])

After:

z.union([
  z.object({ kind: z.literal("number"), amount: z.number() }),
  z.object({ kind: z.literal("boolean"), value: z.boolean() })
])

Optional:

z.union([
  z.object({ kind: z.literal("number"), amount: z.number() }),
  z.object({ kind: z.literal("boolean"), value: z.boolean() })
], {
  discriminator: "kind"
})

The "after" schema will initialize a bit slower(?) than it would today, but parse much faster than it would today and up to around as fast as the "before" schema does today? If so, I like it a lot, discriminatedUnion always felt a bit redundant to write out by comparison and the parse speed boost for union schemas that happen to be discriminated would be great.

@colinhacks
Copy link
Owner Author

Wondering if .getLiterals() should be optional

Yep this would be defined on ZodType with a default implementation, as you say. It would probably also be protected...not sure this is an internal we should expose. It's true that this method would apply to very few schema types.

If it's of any use, this is the algorithm AT uses to determine the "best" discriminator (or sequence of discriminators, if it requires multiple steps) for a union:

Thanks David I'll look into this! I haven't thought through how discriminator selection would work, so this is super useful.

The other category of disjoints besides literals that might be worth considering is typeof

I was thinking about this. Maybe a special symbol (z.STRING) to indicate that any string would pass.

The "after" schema will initialize a bit slower(?) than it would today, but parse much faster than it would today and up to around as fast as the "before" schema does today?

Precisely 💯

@rexfordessilfie
Copy link

LGTM and thanks for the breakdown of this neat pre-computation strategy 🚀 !

I have a few clarifications for better understanding:

When you create a z.union(), Zod uses getLiterals() to extract a set of "fast checks" for each union element in the form {[k: string]: Set

How would the fast-checks generalize for unions where the "discriminator" is a non-primitive e.g an object?

Here's an example building on top of @jrysana example but nesting kind one level deeper in an info object:

z.union([
  z.object({ info: z.object({ kind: z.literal("number") }), amount: z.number() }),
  z.object({ info: z.object({ kind: z.literal("boolean") }), value: z.boolean() })
])

This question is based on my understanding of k in {[k: string]: Set<Primitive>} as referring to a field of an object in the discriminated union - I may be misunderstanding.

Side note: this schema may not be the best, but it seems like something that could exist out in the wild.

Lastly if this has already been thought through or it'd be figured out at a later time, I'm happy to wait and see what it would look like! Maybe this is what you are referring to as discriminator selection by:

Thanks David I'll look into this! I haven't thought through how discriminator selection would work, so this is super useful.


The other category of disjoints besides literals that might be worth considering is typeof

I was thinking about this. Maybe a special symbol (z.STRING) to indicate that any string would pass.

This seems like a nice idea to use special symbols (like z.STRING), in order to avoid conflicts with (string) literals as typeof would evaluate to.


Also, PS. I am glad that z.union is going to be preferred over z.switch as code generation is much easier with the former!

@s-cerevisiae
Copy link

s-cerevisiae commented Apr 18, 2024

Edit: Ok now I see it will solve my problem automatically. Thanks again for your great work, looking forward to the implementation.


Nice to see this is going to be addressed. Thank you for your hard work and ideas.

One thing that I'd like to see being introduced to union is precise error messages.
Currently when using union for discriminated unions (as it's required for recursively defined data), the parser tries each branch thus gives a big error message:

const u = z.union([
  z.object({ type: z.literal("a"), value: z.number() }),
  z.object({ type: z.literal("b"), value: z.string() }),
]);

console.log(JSON.stringify(u.safeParse({ type: "a", value: "b" })));
long error message given by z.union
{
  "success": false,
  "error": {
    "issues": [{
      "code": "invalid_union",
      "unionErrors": [{
        "issues": [{
          "code": "invalid_type",
          "expected": "number",
          "received": "string",
          "path": ["value"],
          "message": "Expected number, received string",
        }],
        "name": "ZodError",
      }, {
        "issues": [{
          "received": "a",
          "code": "invalid_literal",
          "expected": "b",
          "path": ["type"],
          "message": 'Invalid literal value, expected "b"',
        }],
        "name": "ZodError",
      }],
      "path": [],
      "message": "Invalid input",
    }],
    "name": "ZodError",
  },
}

which is hard to digress and turns into wall of text when shown to the users.

OTOH since discriminatedUnion has an explicit discriminator assigned, it can precisely give the actual branch where the error happens:

const du = z.discriminatedUnion("type", [
  z.object({ type: z.literal("a"), value: z.number() }),
  z.object({ type: z.literal("b"), value: z.string() }),
]);

console.log(JSON.stringify(du.safeParse({ type: "a", value: "b" })));
short and precise error given by z.discriminatedUnion
{
  "success": false,
  "error": {
    "issues": [{
      "code": "invalid_type",
      "expected": "number",
      "received": "string",
      "path": ["value"],
      "message": "Expected number, received string",
    }],
    "name": "ZodError",
  },
}

Will the "discriminator hint" solve this problem? Thank you.

@lo1tuma
Copy link
Contributor

lo1tuma commented Apr 26, 2024

One thing that I'd like to see being introduced to union is precise error messages.

I second that. I use discriminatedUnion quite extensively and that‘s not because of its performance advantages, the most important reason is the precise error messages.

So far I really like the proposal, but I hope it addresses also the error messages, otherwise it will be unusable for me.

@wKovacs64
Copy link

Forgive my ignorance, but would this allow for nested discriminated unions unlike z.discriminatedUnion?

@arthurvanl
Copy link

arthurvanl commented Apr 30, 2024

I am using this discriminatedUnion but I require that all objects have multiple shared keys but by only 1 shared key you can know the difference between the unions.

example:

const myUnion = z.discriminatedUnion("status", [
  z.object({ status: z.literal("success"), data: z.string(), timestamp: z.date() }),
  z.object({ status: z.literal("failed"), error: z.instanceof(Error), timestamp: z.date() }),
]);

myUnion.parse({ status: "success", data: "yippie ki yay" });

This is maybe a dumb example with the timestamp but that's what I would need to do if I want multiple shared keys. I thought I could maybe use a merge or extend, except discriminated unions do not allow this.

EDIT

This can actually be done with .and()

@arthurvanl
Copy link

New problem:

The status value must be dynamically allowed to. This throws an error now because the value is of course not found in the union (other than success or failed). I would like that it will not throw an error and is parsed with the default properties that are added in the schema

@qraynaud
Copy link

qraynaud commented Apr 30, 2024

Why not instead of getLiterals() just make a more simple fastCheck(val): true | false | "unavailable" (or an enum) method instead?
By default it would return "unavailable" but for literal you could return val === this.value.
And it would allow an easy implementation on objects too with something like:

fastCheck(val) {
  let res: true | false | "unavailable" = "unavailable";
  for (const attr of this.attributes) {
     const fastCheck = attr.fastCheck(val[attr]);
     if (fastCheck === "unavailable") continue;
     else if (!fastCheck) return false;
     res = true;
  }
  return res;
}

Then to collect all possibilities in an union you could just do something like:

const possibleSchemas = allSchemas.filter(s => s.fastCheck(val) !== false);

Maybe there is not much value to "unavailable" and it could even be merged with true.

@arthurvanl
Copy link

I made my schema like this now:

export const store_order_item_schema = z.object({
    session_token: z.string().length(128),
    product_type: z.string(),
    product_name: z.string().max(200),
    value: z.string().max(200),
    period: z.number().positive().default(1),
    quantity: z.number().positive().default(1),
}).and(z.discriminatedUnion("product_type", [
    z.object({
        product_type: z.literal('domain_extension'),
        extension: z.string().refine((v) => !v.startsWith('.'), 'Domain extension must not start with a dot'),
        status: z.enum(['free', 'active']),
        transfer_code: z.string().optional(),
        //! you can not order multiple domains with the same name
        quantity: z.literal(1),
        period: z.literal(12).or(z.literal(24)).or(z.literal(36)).or(z.literal(48)).or(z.literal(60))
    }),
    z.object({
        product_type: z.literal('pax8_license'),
        term: z.enum(['Annual', 'Monthly']),
        license_id: z.string().uuid(),
        commitment_term_id: z.string().uuid(),
    }),
    z.object({
        product_type: z.literal('hosting'),
        hostname: z.string()
    }),
    z.object({product_type: z.literal('other')})
]))

The only thing what is bothering me is that I cannot use other values in this schema
I would like to have the props session_token, product_type, product_name, value, period, quantity defaulty be added to any product_type I set.

@AoiYamada
Copy link

How about multiple discriminators that cause nesting issue like #1884?

Would it be possible to accept an array of discriminators? for example:

z.union([ ... ], {
  discriminator: ["someKey1", "someKey2"]
})

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests