
refactor: ModelKind with strum + derive_more #335

Merged
merged 2 commits into master on May 23, 2024

Conversation

polarathene
Contributor

This is something to consider, but it would be a bit awkward without replacing ModelKind, which would be a breaking change.

It was something I explored after #334, and it depends on that PR for strum; likewise, I've not added the derive-more crate here either. Going forward with such a change for this type would be rather large, but it can be done if deemed worthwhile.

I noticed that quite a bit of the logic supporting the different kinds is mostly repeated, with very little if any difference. Meanwhile, some logic looks like it's already falling out of sync, so reducing the noise throughout the codebase may help focus on minimizing that concern.


I'm not sure if the Lora quantized kinds were intentionally excluded here; this could simply be a call to kind.is_quantized()?:

if use_flash_attn
    && matches!(
        loader.get_kind(),
        ModelKind::QuantizedGGML
            | ModelKind::QuantizedGGUF
            | ModelKind::XLoraGGML
            | ModelKind::XLoraGGUF
    )
{
    warn!("Using flash attention with a quantized model has no effect!")
}
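For illustration, a direct variant check could look something like this (the method shape is just a suggestion; a Speculative kind would need separate handling):

impl ModelKind {
    // Hypothetical helper: true for any GGML/GGUF-backed variant.
    // Speculative { target, draft } is not handled here.
    pub fn is_quantized(&self) -> bool {
        matches!(
            self,
            Self::QuantizedGGML
                | Self::QuantizedGGUF
                | Self::XLoraGGML
                | Self::XLoraGGUF
                | Self::LoraGGML
                | Self::LoraGGUF
        )
    }
}

The warning above would then reduce to if use_flash_attn && loader.get_kind().is_quantized().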

There seems to be some logic to detect a difference between lora and xlora types? (the ModelKind could better indicate that or similar?):

ModelKind::LoraGGUF => {
    is_lora = true;

let is_xlora = match &model {
    Model::Llama(_) | Model::Phi2(_) | Model::Phi3(_) => false,
    Model::XLoraLlama(_) | Model::XLoraPhi3(_) => !is_lora,
};

metadata: GeneralMetadata {
    max_seq_len,
    repeat_last_n: self.config.repeat_last_n,
    tok_trie,
    has_no_kv_cache: self.no_kv_cache,
    is_xlora,
    num_hidden_layers,
    eos_tok: eos,
    is_lora,
},

let is_xlora = model.is_xlora() && !is_lora;
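As a rough illustration of the direction (names made up for the example; the compound type this PR proposes may differ), the adapter information could live on the kind itself:

enum AdapterKind {
    Lora,
    XLora,
}

enum QuantizationKind {
    Ggml,
    Gguf,
}

enum ModelKind {
    Normal,
    Quantized { quant: QuantizationKind },
    Adapter { adapter: AdapterKind, quant: Option<QuantizationKind> },
}

The is_lora / is_xlora flags could then be read off the kind instead of being inferred from which model variant was constructed.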


Many of these would be collapsed to the compound type proposed by the PR:

let mut is_lora = false;
let mut model = match self.kind {
    ModelKind::QuantizedGGUF => unreachable!(),
    ModelKind::QuantizedGGML => unreachable!(),

    ModelKind::XLoraGGUF => unreachable!(),
    ModelKind::XLoraGGML => unreachable!(),
    ModelKind::LoraGGUF => unreachable!(),
    ModelKind::LoraGGML => unreachable!(),
    ModelKind::Speculative {
        target: _,
        draft: _,
    } => unreachable!(),

These, on the other hand, would need additional refactoring, but they all have the same shape of parameters, so they could be passed in as a struct?

ModelKind::Normal => normal_model_loader!(
    paths,
    dtype,
    default_dtype,
    &load_device,
    config,
    self.inner,
    self.config.use_flash_attn,
    silent,
    mapper,
    in_situ_quant.is_some(),
    device.clone()
),
ModelKind::XLoraNormal => xlora_model_loader!(
    paths,
    dtype,
    default_dtype,
    &load_device,
    config,
    self.inner,
    self.config.use_flash_attn,
    silent,
    mapper,
    in_situ_quant.is_some(),
    device.clone()
),
ModelKind::LoraNormal => {
    is_lora = true;
    lora_model_loader!(
        paths,
        dtype,
        default_dtype,
        &load_device,
        config,
        self.inner,
        self.config.use_flash_attn,
        silent,
        mapper,
        in_situ_quant.is_some(),
        device.clone()
    )
}
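For example, the shared arguments could be bundled into something like the following (field names mirror the macro arguments above; the generic parameters are placeholders rather than the crate's concrete types):

struct LoaderParams<P, D, Dev, C, I, M> {
    paths: P,
    dtype: D,
    default_dtype: D,
    load_device: Dev,
    config: C,
    inner: I,
    use_flash_attn: bool,
    silent: bool,
    mapper: M,
    in_situ_quant: bool,
    device: Dev,
}

Each loader macro (or a plain function) would then take a single params value instead of a dozen positional arguments.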


The builder methods also seem rather suitable for deferring to a crate that derives them, and that would work better alongside a breaking change away from these large parameterized methods funneling data through them.
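Purely as an illustration (the struct and fields here are made up, and derive_builder is just one option, not something this PR adds), deriving a builder could look like:

use derive_builder::Builder;

#[derive(Builder, Debug)]
struct LoaderConfig {
    repeat_last_n: usize,
    #[builder(default)]
    no_kv_cache: bool,
}

fn example() {
    // The derive generates LoaderConfigBuilder with per-field setters.
    let cfg = LoaderConfigBuilder::default()
        .repeat_last_n(64)
        .build()
        .expect("all required fields were set");
    println!("{cfg:?}");
}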

I won't have time to tackle that any time soon, but it's something to consider before this sprawls further as more models and features are added 😅

@polarathene polarathene marked this pull request as draft May 19, 2024 14:04

github-actions bot commented May 19, 2024

Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    5            9            9            0            0
 Python                 21          741          622           21           98
 TOML                   15          390          353            1           36
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               15         1026            0          758          268
 |- BASH                 6          205          192            0           13
 |- Python               6          121          110            0           11
 |- Rust                 3          185          172            9            4
 (Total)                           1537          474          767          296
-------------------------------------------------------------------------------
 Rust                   84        27822        25504          356         1962
 |- Markdown            40          419            0          407           12
 (Total)                          28241        25504          763         1974
===============================================================================
 Total                 144        30464        26882         1136         2446
===============================================================================
  

@EricLBuehler
Owner

@polarathene, thanks for proposing these changes. Currently, this code is a bit obscure, so any changes to make it more idiomatic are much appreciated. I left some comments below. It also looks like the Clippy lints are failing as in #334; can you please take a look?

I'm not sure if the Lora quantized kinds were intentionally excluded here; this could simply be a call to kind.is_quantized()?:

Yes, that is a mistake. Can you please add a kind.is_quantized() function?

There seems to be some logic to detect a difference between lora and xlora types? (the ModelKind could better indicate that or similar?):

I think some methods on ModelKind would make this possible.

Many of these would be collapsed to the compound type proposed by the PR:

Sounds good!

These, on the other hand, would need additional refactoring, but they all have the same shape of parameters, so they could be passed in as a struct?

Yes, I think we could do that. Please feel free to draft something here!

@polarathene
Contributor Author

Sure, I'll tackle the feedback later today. My current tasks for this project are:

  • Wrap up another refactor PR (a larger one).
  • Get the earlier strum PR merged.
  • Rebase here and resolve concerns + additions.

@EricLBuehler
Owner

Great! I think we should make the code more idiomatic now, before I merge #309 (Idefics 2 vision model) or #324 (Nomic embedding model), which would add lots of complexity.

@polarathene
Contributor Author

Can you please add a kind.is_quantized() function?

@EricLBuehler how should this work for Speculative? At a glance it seems both the target and draft fields can vary in which model is used; the toml-selectors examples I saw have one that is base + quantized and another that is quantized + quantized.

I don't know how this type/feature works, so what would be the expectation if is_quantized() were called on that kind variant? Likewise, I was thinking of adding a quantized_type() and adapter_type() that return an Option of the quantized/adapter kind when valid, else None. Since target and draft can differ in ModelKind, I'm not sure how I should approach that beyond None 🤷‍♂️

@EricLBuehler
Owner

I think for these methods, we should return a vector of the resulting data or perhaps some sort of struct. That way, there is much more type-safe control, which really fits the spirit of this PR. For example, you can check if any are quantized with .any().
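A minimal sketch of that direction (QuantizedKind and quantized_kinds are placeholder names, Speculative's target/draft are assumed to hold nested kinds, and the PR would presumably expose these as methods on ModelKind rather than free functions):

#[derive(Clone, Copy)]
enum QuantizedKind {
    Ggml,
    Gguf,
}

// One entry per underlying model: Speculative contributes both its target
// and draft, every other variant contributes a single entry.
fn quantized_kinds(kind: &ModelKind) -> Vec<Option<QuantizedKind>> {
    match kind {
        ModelKind::Speculative { target, draft } => {
            [quantized_kinds(target), quantized_kinds(draft)].concat()
        }
        ModelKind::QuantizedGGML | ModelKind::XLoraGGML | ModelKind::LoraGGML => {
            vec![Some(QuantizedKind::Ggml)]
        }
        ModelKind::QuantizedGGUF | ModelKind::XLoraGGUF | ModelKind::LoraGGUF => {
            vec![Some(QuantizedKind::Gguf)]
        }
        _ => vec![None],
    }
}

// "Any of the underlying models is quantized".
fn is_quantized(kind: &ModelKind) -> bool {
    quantized_kinds(kind).iter().any(Option::is_some)
}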

@polarathene polarathene marked this pull request as ready for review May 22, 2024 04:33
Contributor Author

@polarathene polarathene left a comment


Context for review added, hope it's helpful 👍

(Resolved review comments on mistralrs-core/src/pipeline/mod.rs and mistralrs-core/src/pipeline/normal.rs.)
-    | ModelKind::XLoraGGML
-    | ModelKind::XLoraGGUF
-)
+if use_flash_attn && !loader.get_kind().is_quantized()
Contributor Author


This is now simpler and covers both adapter kinds as a bugfix.

NOTE: The Speculative kind will also trigger this now; I'm not 100% sure that's desired. This is due to the any() call: I assume that if either target or draft is quantized it may not be compatible, but likewise this does not guarantee that is_quantized == true means both of Speculative's models are quantized.

@polarathene

This comment was marked as resolved.

Owner

@EricLBuehler EricLBuehler left a comment


Thanks for the PR. I think the logic is inverted on the checks for flash attention, but other than that, I am happy to accept a breaking change for the ModelKind enum and look forward to the quantized pipeline refactor.

(Resolved review comments on mistralrs-bench/src/main.rs, mistralrs-core/src/pipeline/mod.rs, mistralrs-core/src/pipeline/normal.rs, and mistralrs-server/src/main.rs.)
@polarathene
Contributor Author

polarathene commented May 23, 2024

I think the logic is inverted?

Good spotting! 😬

I'll resolve that shortly 👍 (done!)


Just to confirm this concern I raised (as it'll be hidden in the resolved review comment):

#335 (comment)

NOTE: The Speculative kind will also trigger this now; I'm not 100% sure that's desired.
This is due to the any() call: I assume that if either target or draft is quantized it may not be compatible, but likewise this does not guarantee that is_quantized == true means both of Speculative's models are quantized.

ModelKind::Speculative { .. } may only be partially quantized (from what I've seen, the draft is always quantized, while the target may or may not be).

  • I assume it's better for is_quantized() to return true when there is any hint of quantization in the kind?
  • Unlike every other variant, there's a caveat to keep in mind: it does not guarantee that a kind which stores multiple kinds has all of them quantized.
  • Same applies for is_adapted() and the _and(|| { ... }) methods (which also use .any()).

Using .all() instead doesn't resolve the concern either, so it might need to be documented with some context that variants like Speculative are prone to bugs?


Looks good. It seems like there are a few compat impls here which perhaps you can remove in a future PR by using ModelKindB? It seems much cleaner.

Yes, since I wasn't sure breaking changes would be approved, I've been a bit cautious. A follow-up PR can remove the compat layers here when ModelKindB becomes the new ModelKind 👍

I think this breaking change is acceptable in a future PR. We're pre v1, so it is fine.

I'll tackle that after my next PR (dependent upon changes here) is sorted out 👍

@polarathene
Contributor Author

I decided to mirror is_some + is_some_and, which exist for the Option type; this way you have a bit more flexibility, and under the hood it handles the vec iteration without you having to think about it.

I like this API. The old way with the boolean wasn't very idiomatic anyway.

NOTE: I have not applied this to gguf.rs and ggml.rs as I have a large refactoring PR already touching that. I can apply the same changes there after this PR is merged.

Sounds like a good idea.

In my upcoming PR it was a little awkward/noisy with this new approach to check whether the kind has an Adapter and then set the is_lora + is_xlora variables.

Speculative isn't supported in the match arms for GGUF/GGML pipeline (probably a good reason for that?), so I can probably just assume only the first element returned from adapter_type() is relevant and use if let Some(a) with a.is_lora() + a.is_xlora() called directly instead of as a closure 👍
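Roughly what I have in mind (AdapterKind, adapter_type() and the helper methods below are assumed names following the earlier sketches, not necessarily the final API):

#[derive(Clone, Copy, PartialEq)]
enum AdapterKind {
    Lora,
    XLora,
}

impl AdapterKind {
    fn is_lora(self) -> bool {
        self == Self::Lora
    }
    fn is_xlora(self) -> bool {
        self == Self::XLora
    }
}

// In the GGUF/GGML pipeline, where Speculative never appears, only the
// first entry returned by adapter_type() matters:
let adapters: Vec<Option<AdapterKind>> = self.kind.adapter_type();
let (is_lora, is_xlora) = match adapters.first() {
    Some(Some(a)) => (a.is_lora(), a.is_xlora()),
    _ => (false, false),
};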


As I'm not really familiar with these features yet, am I right to assume that future adapters may be supported that aren't "LoRA" specific?

I am new to this domain, so I only have a surface understanding that LoRA is a technique to steer output towards a preferred style (I've noticed it can be applied to image, audio, and text generation models). I think it's often described as a lightweight approach to "fine-tuning" via separate file(s), as opposed to duplicating the model (weights?) data with the adjustments merged in (which requires much more disk space and is inefficient for changing styles/adapters at runtime, as the LoRAX project describes).

I have no idea about X-LoRA, and haven't found time to look into it yet.

Not too important to resolve for now, I'll continue with my refactoring and adjust when appropriate 😅

EDIT: I just took a glance at LoRAX docs for Adapters and noticed an Adapter type called Medusa that mentions "Speculative Decoding", which I assume is what Speculative is meant to represent?

The `is_lora` example is only applied to `pipeline/normal.rs` for now as a sibling PR is refactoring `pipeline/{ggml,gguf}.rs` files.
@polarathene
Contributor Author

I rebased and squashed the commit suggestions into the prior commit.

The two main.rs conditions for flash attention have been inverted. Should be good to merge 👍


If you're new to squash merging on GitHub, it should by default include the content of each commit message I wrote and use the PR name as the squash commit title (with a link ref to this PR).

@EricLBuehler
Owner

Speculative isn't supported in the match arms for GGUF/GGML pipeline (probably a good reason for that?), so I can probably just assume only the first element returned from adapter_type() is relevant and use if let Some(a) with a.is_lora() + a.is_xlora() called directly instead of as a closure 👍

Speculative needs 2 models, so all other pipelines can just ignore the possibility of loading it.

EDIT: I just took a glance at LoRAX docs for Adapters and noticed an Adapter type called Medusa that mentions "Speculative Decoding", which I assume is what Speculative is meant to represent?

Yes, but the Medusa method is a bit different. Speculative Decoding is a sampling technique which uses a (smaller, faster, less precise) "draft" model to accelerate inference of a (larger, slower, more precise) "target" model.
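A very rough conceptual sketch of that loop (the trait and method names here are invented purely for illustration, not mistral.rs APIs): the draft proposes a few tokens cheaply, and the target verifies them in one batched pass, keeping only the accepted prefix.

type Token = u32;

trait SpecModel {
    fn sample_next(&self, ctx: &[Token]) -> Token;
    /// How many of `proposed` the target accepts when appended after `ctx`.
    fn accept_len(&self, ctx: &[Token], proposed: &[Token]) -> usize;
}

fn speculative_step(target: &dyn SpecModel, draft: &dyn SpecModel, ctx: &mut Vec<Token>, k: usize) {
    // Draft model proposes k tokens autoregressively (cheap).
    let mut scratch = ctx.clone();
    let mut proposed = Vec::with_capacity(k);
    for _ in 0..k {
        let t = draft.sample_next(&scratch);
        scratch.push(t);
        proposed.push(t);
    }
    // Target model verifies the proposals in one pass (precise),
    // and only the accepted prefix is kept.
    let accepted = target.accept_len(ctx.as_slice(), &proposed);
    ctx.extend_from_slice(&proposed[..accepted]);
}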

Owner

@EricLBuehler EricLBuehler left a comment


Looks good, thank you for this change! I think removing the ModelKind enum is not even a breaking change, actually, as it is not exported publicly. Perhaps you could remove it in the GGUF/GGML pipeline refactor?

@EricLBuehler EricLBuehler merged commit 05f6ab8 into EricLBuehler:master May 23, 2024
11 checks passed
@polarathene
Contributor Author

Perhaps you could remove it in the GGUF/GGML pipeline refactor?

Sure, I can look into throwing that change in as a final commit 👍
