
refactor: ModelKind with strum + derive_more #335

Merged
merged 2 commits into master on May 23, 2024

Conversation

polarathene
Contributor

This is something to consider, but it would be a bit awkward without replacing ModelKind, which would be a breaking change.

It was something I explored after #334, and it depends on that PR for strum; likewise, I've not added the derive-more crate here either. Going forward with such a change for this type would be rather large, but it can be done if deemed worthwhile.

I noticed that quite a bit of the logic supporting the different kinds is mostly repeated, with very little if any difference. Meanwhile, some logic looks like it's already falling out of sync, so reducing the noise throughout the codebase may help focus on minimizing that concern.


I'm not sure if the Lora quantized kinds were intentionally excluded here; this could simply be a call to kind.is_quantized()?:

if use_flash_attn
    && matches!(
        loader.get_kind(),
        ModelKind::QuantizedGGML
            | ModelKind::QuantizedGGUF
            | ModelKind::XLoraGGML
            | ModelKind::XLoraGGUF
    )
{
    warn!("Using flash attention with a quantized model has no effect!")
}
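For illustration, a direct variant check could look something like this (the method shape is just a suggestion; a Speculative kind would need separate handling):

impl ModelKind {
    // Hypothetical helper: true for any GGML/GGUF-backed variant.
    // Speculative { target, draft } is not handled here.
    pub fn is_quantized(&self) -> bool {
        matches!(
            self,
            Self::QuantizedGGML
                | Self::QuantizedGGUF
                | Self::XLoraGGML
                | Self::XLoraGGUF
                | Self::LoraGGML
                | Self::LoraGGUF
        )
    }
}

The warning above would then reduce to if use_flash_attn && loader.get_kind().is_quantized().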

There seems to be some logic to detect a difference between lora and xlora types? (the ModelKind could better indicate that or similar?):

ModelKind::LoraGGUF => {
    is_lora = true;

let is_xlora = match &model {
    Model::Llama(_) | Model::Phi2(_) | Model::Phi3(_) => false,
    Model::XLoraLlama(_) | Model::XLoraPhi3(_) => !is_lora,
};

metadata: GeneralMetadata {
    max_seq_len,
    repeat_last_n: self.config.repeat_last_n,
    tok_trie,
    has_no_kv_cache: self.no_kv_cache,
    is_xlora,
    num_hidden_layers,
    eos_tok: eos,
    is_lora,
},

let is_xlora = model.is_xlora() && !is_lora;
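As a rough illustration of the direction (names made up for the example; the compound type this PR proposes may differ), the adapter information could live on the kind itself:

enum AdapterKind {
    Lora,
    XLora,
}

enum QuantizationKind {
    Ggml,
    Gguf,
}

enum ModelKind {
    Normal,
    Quantized { quant: QuantizationKind },
    Adapter { adapter: AdapterKind, quant: Option<QuantizationKind> },
}

The is_lora / is_xlora flags could then be read off the kind instead of being inferred from which model variant was constructed.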


Many of these would be collapsed to the compound type proposed by the PR:

let mut is_lora = false;
let mut model = match self.kind {
    ModelKind::QuantizedGGUF => unreachable!(),
    ModelKind::QuantizedGGML => unreachable!(),

    ModelKind::XLoraGGUF => unreachable!(),
    ModelKind::XLoraGGML => unreachable!(),
    ModelKind::LoraGGUF => unreachable!(),
    ModelKind::LoraGGML => unreachable!(),
    ModelKind::Speculative {
        target: _,
        draft: _,
    } => unreachable!(),

These, on the other hand, would need additional refactoring, but they all have the same shape of parameters, so they could be passed in as a struct?

ModelKind::Normal => normal_model_loader!(
    paths,
    dtype,
    default_dtype,
    &load_device,
    config,
    self.inner,
    self.config.use_flash_attn,
    silent,
    mapper,
    in_situ_quant.is_some(),
    device.clone()
),
ModelKind::XLoraNormal => xlora_model_loader!(
    paths,
    dtype,
    default_dtype,
    &load_device,
    config,
    self.inner,
    self.config.use_flash_attn,
    silent,
    mapper,
    in_situ_quant.is_some(),
    device.clone()
),
ModelKind::LoraNormal => {
    is_lora = true;
    lora_model_loader!(
        paths,
        dtype,
        default_dtype,
        &load_device,
        config,
        self.inner,
        self.config.use_flash_attn,
        silent,
        mapper,
        in_situ_quant.is_some(),
        device.clone()
    )
}
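For example, the shared arguments could be bundled into something like the following (field names mirror the macro arguments above; the generic parameters are placeholders rather than the crate's concrete types):

struct LoaderParams<P, D, Dev, C, I, M> {
    paths: P,
    dtype: D,
    default_dtype: D,
    load_device: Dev,
    config: C,
    inner: I,
    use_flash_attn: bool,
    silent: bool,
    mapper: M,
    in_situ_quant: bool,
    device: Dev,
}

Each loader macro (or a plain function) would then take a single params value instead of a dozen positional arguments.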


The builder methods also seem rather suitable for deferring to a crate that derives them, and that would work better alongside a breaking change away from these large parameterized methods funneling data through them.
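Purely as an illustration (the struct and fields here are made up, and derive_builder is just one option, not something this PR adds), deriving a builder could look like:

use derive_builder::Builder;

#[derive(Builder, Debug)]
struct LoaderConfig {
    repeat_last_n: usize,
    #[builder(default)]
    no_kv_cache: bool,
}

fn example() {
    // The derive generates LoaderConfigBuilder with per-field setters.
    let cfg = LoaderConfigBuilder::default()
        .repeat_last_n(64)
        .build()
        .expect("all required fields were set");
    println!("{cfg:?}");
}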

I won't have time to tackle that any time soon, but it's something to consider before this sprawls further as more models and features are added 😅

@polarathene polarathene marked this pull request as draft May 19, 2024 14:04

github-actions bot commented May 19, 2024

Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    5            9            9            0            0
 Python                 21          741          622           21           98
 TOML                   15          390          353            1           36
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               15         1026            0          758          268
 |- BASH                 6          205          192            0           13
 |- Python               6          121          110            0           11
 |- Rust                 3          185          172            9            4
 (Total)                           1537          474          767          296
-------------------------------------------------------------------------------
 Rust                   84        27822        25504          356         1962
 |- Markdown            40          419            0          407           12
 (Total)                          28241        25504          763         1974
===============================================================================
 Total                 144        30464        26882         1136         2446
===============================================================================
  

@EricLBuehler
Owner

@polarathene, thanks for proposing these changes. Currently, this code is a bit obscure, so any changes to make it more idiomatic are much appreciated. I left some comments below. It also looks like the Clippy lints are failing as in #334; can you please take a look?

I'm not sure if the Lora quantized kinds were intentionally excluded here; this could simply be a call to kind.is_quantized()?:

Yes, that is a mistake. Can you please add a kind.is_quantized() function?

There seems to be some logic to detect a difference between lora and xlora types? (the ModelKind could better indicate that or similar?):

I think some methods on ModelKind would make this possible.

Many of these would be collapsed to the compound type proposed by the PR:

Sounds good!

These, on the other hand, would need additional refactoring, but they all have the same shape of parameters, so they could be passed in as a struct?

Yes, I think we could do that. Please feel free to draft something here!

@polarathene
Contributor Author

Sure, I'll tackle the feedback later today. My current tasks for this project are:

  • Wrap up another refactor PR (a larger one).
  • Get the earlier strum PR merged.
  • Rebase here and resolve concerns + additions.

@EricLBuehler
Owner

Great! I think we should make the code more idiomatic now, before I merge #309 (Idefics 2 vision model) or #324 (Nomic embedding model), which would add lots of complexity.

@polarathene
Contributor Author

Can you please add a kind.is_quantized() function?

@EricLBuehler how should this work for Speculative? At a glance it seems both the target and draft fields can vary in which model is used; the toml-selectors examples I saw have one that is base + quantized and another that is quantized + quantized.

I don't know how this type/feature works, so what would be the expectation if is_quantized() were called on that kind variant? Likewise, I was thinking of adding a quantized_type() and adapter_type() that return an Option of the quantized/adapter kind when valid, else None. Since target and draft can differ in ModelKind, I'm not sure how I should approach that beyond None 🤷‍♂️

@EricLBuehler
Owner

I think for these methods, we should return a vector of the resulting data or perhaps some sort of struct. That way, there is much more type-safe control, which really fits the spirit of this PR. For example, you can check if any are quantized with .any().
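A minimal sketch of that direction (QuantizedKind and quantized_kinds are placeholder names, Speculative's target/draft are assumed to hold nested kinds, and the PR would presumably expose these as methods on ModelKind rather than free functions):

#[derive(Clone, Copy)]
enum QuantizedKind {
    Ggml,
    Gguf,
}

// One entry per underlying model: Speculative contributes both its target
// and draft, every other variant contributes a single entry.
fn quantized_kinds(kind: &ModelKind) -> Vec<Option<QuantizedKind>> {
    match kind {
        ModelKind::Speculative { target, draft } => {
            [quantized_kinds(target), quantized_kinds(draft)].concat()
        }
        ModelKind::QuantizedGGML | ModelKind::XLoraGGML | ModelKind::LoraGGML => {
            vec![Some(QuantizedKind::Ggml)]
        }
        ModelKind::QuantizedGGUF | ModelKind::XLoraGGUF | ModelKind::LoraGGUF => {
            vec![Some(QuantizedKind::Gguf)]
        }
        _ => vec![None],
    }
}

// "Any of the underlying models is quantized".
fn is_quantized(kind: &ModelKind) -> bool {
    quantized_kinds(kind).iter().any(Option::is_some)
}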

@polarathene polarathene marked this pull request as ready for review May 22, 2024 04:33
Contributor Author

@polarathene polarathene left a comment


Context for review added, hope it's helpful 👍

(Resolved review comments on mistralrs-core/src/pipeline/mod.rs and mistralrs-core/src/pipeline/normal.rs.)
-    | ModelKind::XLoraGGML
-    | ModelKind::XLoraGGUF
-)
+if use_flash_attn && !loader.get_kind().is_quantized()
Contributor Author


This is now simpler and covers both adapter kinds as a bugfix.

NOTE: The Speculative kind will also trigger this now; I'm not 100% sure that's desired. This is due to the any() call: I assume that if either target or draft is quantized it may not be compatible, but likewise this does not guarantee that is_quantized == true means both of Speculative's models are quantized.

@polarathene

This comment was marked as resolved.

Owner

@EricLBuehler EricLBuehler left a comment


Thanks for the PR. I think the logic is inverted on the checks for flash attention, but other than that, I am happy to accept a breaking change for the ModelKind enum and look forward to the quantized pipeline refactor.

(Resolved review comments on mistralrs-bench/src/main.rs, mistralrs-core/src/pipeline/mod.rs, mistralrs-core/src/pipeline/normal.rs, and mistralrs-server/src/main.rs.)
@polarathene
Contributor Author

polarathene commented May 23, 2024

I think the logic is inverted?

Good spotting! 😬

I'll resolve that shortly 👍 (done!)


Just to confirm this concern I raised (as it'll be hidden in the resolved review comment):

#335 (comment)

NOTE: The Speculative kind will also trigger this now; I'm not 100% sure that's desired.
This is due to the any() call: I assume that if either target or draft is quantized it may not be compatible, but likewise this does not guarantee that is_quantized == true means both of Speculative's models are quantized.

ModelKind::Speculative { .. } may only be partially quantized (from what I've seen, the draft is always quantized, while the target may or may not be).

  • I assume it's better for is_quantized() to return true when there is any hint of quantization in the kind?
  • Unlike every other variant, there's a caveat to keep in mind: it does not guarantee that a kind which stores multiple kinds has all of them quantized.
  • Same applies for is_adapted() and the _and(|| { ... }) methods (which also use .any()).

Using .all() instead doesn't resolve the concern either, so it might need to be documented with some context that variants like Speculative are prone to bugs?


Looks good. It seems like there are a few compat impls here which perhaps you can remove in a future PR by using ModelKindB? It seems much cleaner.

Yes, since I wasn't sure breaking changes would be approved, I've been a bit cautious. A follow-up PR can remove the compat layers here when ModelKindB becomes the new ModelKind 👍

I think this breaking change is acceptable in a future PR. We're pre v1, so it is fine.

I'll tackle that after my next PR (dependent upon changes here) is sorted out 👍

@polarathene
Contributor Author

I decided to mirror is_some + is_some_and, which exist for the Option type; this way you have a bit more flexibility, and under the hood it handles the vec iteration without you having to think about it.

I like this API. The old way with the boolean wasn't very idiomatic anyway.

NOTE: I have not applied this to gguf.rs and ggml.rs as I have a large refactoring PR already touching that. I can apply the same changes there after this PR is merged.

Sounds like a good idea.

In my upcoming PR it was a little awkward/noisy with this new approach to check whether the kind has an Adapter and then set the is_lora + is_xlora variables.

Speculative isn't supported in the match arms for GGUF/GGML pipeline (probably a good reason for that?), so I can probably just assume only the first element returned from adapter_type() is relevant and use if let Some(a) with a.is_lora() + a.is_xlora() called directly instead of as a closure 👍
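Roughly what I have in mind (AdapterKind, adapter_type() and the helper methods below are assumed names following the earlier sketches, not necessarily the final API):

#[derive(Clone, Copy, PartialEq)]
enum AdapterKind {
    Lora,
    XLora,
}

impl AdapterKind {
    fn is_lora(self) -> bool {
        self == Self::Lora
    }
    fn is_xlora(self) -> bool {
        self == Self::XLora
    }
}

// In the GGUF/GGML pipeline, where Speculative never appears, only the
// first entry returned by adapter_type() matters:
let adapters: Vec<Option<AdapterKind>> = self.kind.adapter_type();
let (is_lora, is_xlora) = match adapters.first() {
    Some(Some(a)) => (a.is_lora(), a.is_xlora()),
    _ => (false, false),
};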


As I'm not really familiar with these features yet, am I right to assume that future adapters may be supported that aren't "LoRA" specific?

I am new to this domain, so I only have a surface understanding that LoRA is a technique to steer output towards a preferred style (I've noticed it can be applied to image, audio, and text generation models). I think it's often described as a lightweight approach to "fine-tuning" via separate file(s), as opposed to duplicating the model (weights?) data with the adjustments merged in (which requires much more disk space and is inefficient for changing styles/adapters at runtime, as the LoRAX project describes).

I have no idea about X-LoRA, and haven't found time to look into it yet.

Not too important to resolve for now, I'll continue with my refactoring and adjust when appropriate 😅

EDIT: I just took a glance at LoRAX docs for Adapters and noticed an Adapter type called Medusa that mentions "Speculative Decoding", which I assume is what Speculative is meant to represent?

The `is_lora` example is only applied to `pipeline/normal.rs` for now as a sibling PR is refactoring `pipeline/{ggml,gguf}.rs` files.
@polarathene
Contributor Author

I rebased and squashed the commit suggestions into the prior commit.

The two main.rs conditions for flash attention have been inverted. Should be good to merge 👍


If you're new to squash merging on GitHub, it should by default include the content of each commit message I wrote and use the PR name as the squash commit title (with a link ref to this PR).

@EricLBuehler
Owner

Speculative isn't supported in the match arms for GGUF/GGML pipeline (probably a good reason for that?), so I can probably just assume only the first element returned from adapter_type() is relevant and use if let Some(a) with a.is_lora() + a.is_xlora() called directly instead of as a closure 👍

Speculative needs 2 models, so all other pipelines can just ignore the possibility of loading it.

EDIT: I just took a glance at LoRAX docs for Adapters and noticed an Adapter type called Medusa that mentions "Speculative Decoding", which I assume is what Speculative is meant to represent?

Yes, but the Medusa method is a bit different. Speculative Decoding is a sampling technique which uses a (smaller, faster, less precise) "draft" model to accelerate inference of a (larger, slower, more precise) "target" model.
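A very rough conceptual sketch of that loop (the trait and method names here are invented purely for illustration, not mistral.rs APIs): the draft proposes a few tokens cheaply, and the target verifies them in one batched pass, keeping only the accepted prefix.

type Token = u32;

trait SpecModel {
    fn sample_next(&self, ctx: &[Token]) -> Token;
    /// How many of `proposed` the target accepts when appended after `ctx`.
    fn accept_len(&self, ctx: &[Token], proposed: &[Token]) -> usize;
}

fn speculative_step(target: &dyn SpecModel, draft: &dyn SpecModel, ctx: &mut Vec<Token>, k: usize) {
    // Draft model proposes k tokens autoregressively (cheap).
    let mut scratch = ctx.clone();
    let mut proposed = Vec::with_capacity(k);
    for _ in 0..k {
        let t = draft.sample_next(&scratch);
        scratch.push(t);
        proposed.push(t);
    }
    // Target model verifies the proposals in one pass (precise),
    // and only the accepted prefix is kept.
    let accepted = target.accept_len(ctx.as_slice(), &proposed);
    ctx.extend_from_slice(&proposed[..accepted]);
}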

Owner

@EricLBuehler EricLBuehler left a comment


Looks good, thank you for this change! I think removing the ModelKind enum is not even a breaking change, actually, as it is not exported publicly. Perhaps you could remove it in the GGUF/GGML pipeline refactor?

@EricLBuehler EricLBuehler merged commit 05f6ab8 into EricLBuehler:master May 23, 2024
11 checks passed
@polarathene
Contributor Author

Perhaps you could remove it in the GGUF/GGML pipeline refactor?

Sure, I can look into throwing that change in as a final commit 👍
