Update README.md

jmschrei committed Apr 16, 2023 · commit 3be4acc (1 parent 9004e21)
> **Note**
> IMPORTANT: pomegranate v1.0.0 is a ground-up rewrite of pomegranate using PyTorch as the computational backend instead of Cython. Although the same functionality is supported, the API is significantly different. Please see the tutorials and examples folders for help rewriting your code.

pomegranate is a library of probabilistic models defined by its modular implementation and its treatment of all probabilistic models as the probability distributions they are. Together, these design choices allow one to drop two normal distributions into a mixture model to create a Gaussian mixture model just as easily as dropping in a Poisson distribution and an exponential distribution to create a heterogeneous mixture, or dropping in two Bayesian networks, or two hidden Markov models to create a mixture over sequence models.
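
As a minimal sketch of this modularity (assuming the v1.0.0 import paths `pomegranate.distributions` and `pomegranate.gmm`, and that uninitialized distributions are initialized during fitting, as in the tutorials):

```python
import torch
from pomegranate.distributions import Exponential, Normal
from pomegranate.gmm import GeneralMixtureModel

X = torch.exp(torch.randn(1000, 3))

# Two normal distributions in a mixture model: a Gaussian mixture model
gmm = GeneralMixtureModel([Normal(), Normal()]).fit(X)

# An exponential and a normal in the same mixture: a heterogeneous
# mixture, with no extra code needed
mixed = GeneralMixtureModel([Exponential(), Normal()]).fit(X)
```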

Recently, pomegranate (v1.0.0) was rewritten from the ground up using PyTorch to replace the outdated Cython backend. This rewrite gave me an opportunity to fix many bad design choices that I made while I was still a bb software engineer. Unfortunately, many of these changes are not backwards compatible and will disrupt everyone's workflows. However, the changes have significantly improved and simplified the code, fixed many issues raised by the community over the years, made it easier to contribute, and made the code significantly faster. I've written more below, but you're likely here now because your code is broken and this is the tl;dr.

Special shout-out to [NumFOCUS](https://numfocus.org/) for supporting this work with a special development grant.

### Installation

`pip install pomegranate`

If you need the last Cython release before the rewrite, use `pip install pomegranate==0.14.8`.

### Why a Rewrite?

This rewrite was motivated by four main reasons:

- <b>Speed</b>: Native PyTorch is usually significantly faster than the hand-tuned Cython code that I wrote.
- <b>Features</b>: PyTorch has many features, such as serialization, mixed precision, and GPU support, that can now be directly used in pomegranate without additional work on my end.
- <b>Community Contribution</b>: A challenge that many people faced when using pomegranate was that they could not extend it because they did not know Cython, and even if they did know it, coding in Cython is a pain. I felt this pain every time I tried adding a new feature or fixing a bug. Using PyTorch as the backend significantly reduces the amount of effort needed to add in new features.
- <b>Interoperability</b>: Libraries like PyTorch offer a unique ability to not just use their computational backend but to integrate more closely with existing deep learning resources. This rewrite should make it easier for people to merge probabilistic models with deep learning models.

### Features

> **Note**
> Please see the [tutorials](https://github.com/jmschrei/pomegranate/tree/master/tutorials) folder for code examples.

Switching from a Cython backend to a PyTorch backend has enabled or expanded a large number of features. Because the rewrite is a thin wrapper over PyTorch, new features released for PyTorch can be applied to pomegranate models without the need for a new release from me.
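
For instance, a hedged sketch of one such inherited feature, automatic mixed precision (assuming the underlying operations support half precision and a GPU is available):

```python
import torch
from pomegranate.distributions import Normal

X = torch.randn(1000, 5).cuda()
d = Normal().fit(X)

# Mixed precision comes from PyTorch itself; no pomegranate-specific
# code is involved
with torch.autocast('cuda', dtype=torch.float16):
    logp = d.log_probability(X)
```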

### GPU Support

All distributions and methods in pomegranate now have GPU support. Because each distribution is a `torch.nn.Module` object, the use is identical to other code written in PyTorch. This means that both the model and the data have to be moved to the GPU by the user. For instance:

```python
>>> X = torch.exp(torch.randn(50, 4))
# A reconstructed sketch of the elided steps: both the model and the
# data are moved to the GPU (the `scales` attribute name is assumed)
>>> d = Exponential().cuda()
>>> d.fit(X.cuda())
>>> d.scales
tensor([1.9902, 2.3871, 0.8984, 1.2215], device='cuda:0')
```

### Serialization

pomegranate objects are all instances of `torch.nn.Module`, so serialization works the same as for any other PyTorch model and can use all of the built-in functionality.

Saving:
```python
# A minimal sketch; `model` is any pomegranate object
>>> torch.save(model, "test.torch")
```

Loading:
```python
>>> model = torch.load("test.torch")
```

### torch.compile

> **Note**
> `torch.compile` is under active development by the PyTorch team and may rapidly improve. For now, you may need to pass in `check_data=False` when initializing models to avoid one compatibility issue.

In PyTorch v2.0.0, `torch.compile` was introduced as a flexible wrapper around tools that fuse operations together, use CUDA graphs, and generally try to remove I/O bottlenecks in GPU execution. Because these bottlenecks can be extremely significant in the small-to-medium sized data settings that many pomegranate users face, `torch.compile` seems like it will be extremely valuable. Rather than targeting entire models, which mostly just compiles the `forward` method, you should compile individual methods from your objects.

```python
# Create your object as normal
>>> mu = torch.exp(torch.randn(100))
>>> d = Exponential(mu).cuda()

# Create some data
>>> X = torch.exp(torch.randn(1000, 100)).cuda()  # data must be on the same device as the model
>>> d.log_probability(X)

# Compile the `log_probability` method!
>>> d.log_probability = torch.compile(d.log_probability, mode='reduce-overhead', fullgraph=True)
>>> d.log_probability(X)
```

Unfortunately, I have had difficulty getting `torch.compile` to work when methods are called in a nested manner, e.g., when compiling the `predict` method of a mixture model, which internally calls the `log_probability` method of each distribution. I have tried to organize the code in a manner that avoids some of these errors, but the error messages are currently too opaque to make much progress.

### Missing Values

pomegranate supports handling data with missing values through `torch.masked.MaskedTensor` objects. One simply needs to put a mask over the values that are missing.

```python
>>> X = <your tensor with NaN for the missing values>
# A reconstructed sketch of the masking step: mark the observed entries
>>> mask = ~torch.isnan(X)
>>> X_masked = torch.masked.MaskedTensor(X, mask=mask)
```

Because not all operations are yet available for MaskedTensors, some distributions and models do not yet support missing values.
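
A hedged end-to-end sketch (assuming, per the caveat above, that `Normal` with diagonal covariance supports missing values):

```python
import torch
from pomegranate.distributions import Normal

# Create data and knock out some entries
X = torch.randn(500, 3)
X[torch.rand_like(X) < 0.1] = float('nan')

# Mask the missing entries and fit as usual
X_masked = torch.masked.MaskedTensor(X, mask=~torch.isnan(X))
d = Normal(covariance_type='diag').fit(X_masked)
```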

`torch.distributions` is a great implementation of the statistical characteristics of many distributions, but it does not implement fitting those distributions to data or using them as components of larger models. If all you need to do is calculate log probabilities, or sample, given parameters (perhaps as output from neural network components), `torch.distributions` is a great, simple alternative.
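
As a hedged side-by-side sketch, `torch.distributions` scores data under fixed parameters, while the pomegranate equivalent (assuming the v1.0.0 `Exponential`) learns its parameters from the data first:

```python
import torch
from pomegranate.distributions import Exponential

X = torch.exp(torch.randn(100, 2))

# torch.distributions: log probabilities under fixed, user-supplied rates
p = torch.distributions.Exponential(rate=torch.tensor([0.5, 2.0]))
logp_fixed = p.log_prob(X).sum(dim=-1)

# pomegranate: the same distribution, but fit to the data before scoring
d = Exponential().fit(X)
logp_fit = d.log_probability(X)
```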

> What models are implemented in pomegranate?

Currently, implementations of many distributions are included, as well as general mixture models, Bayes classifiers (including naive Bayes), hidden Markov models, Markov chains, Bayesian networks, and factor graphs.

> How much faster is v1.0.0 than previous versions?

It depends on the method being used. Most individual distributions are approximately 2-3x faster. Some distributions, such as the categorical distributions, can be over 10x faster. These will be even faster if a GPU is used.
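
Exact speedups depend on data shape and hardware; a minimal sketch for timing a method on your own setup (method names from the v1.0.0 API):

```python
import time

import torch
from pomegranate.distributions import Exponential

X = torch.exp(torch.randn(100000, 20))
d = Exponential().fit(X)

# Average over repeats to smooth out timing noise
start = time.time()
for _ in range(100):
    d.log_probability(X)
print(f"log_probability: {(time.time() - start) / 100 * 1000:.2f} ms per call")
```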
