Zero-Shot Topic Modelling and Topics Over Time #1999

Open
LopezBanos opened this issue May 20, 2024 · 1 comment

@LopezBanos

I created a zero-shot BERTopic model with some predefined topics, plus some additional topics that the model found on its own.

# BERTopic Model

topic_model = BERTopic(
    embedding_model="thenlper/gte-small",  # https://huggingface.co/thenlper/gte-small
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.80,
    representation_model=KeyBERTInspired()
)
# Results
topics, probs = topic_model.fit_transform(docs)

When I try to plot topics over time, I get an error:

# Topics Over Time (docs was a pd.Series, so I convert it to a list; both docs.to_list() and timestamps have length 161)
topics_over_time = topic_model.topics_over_time(docs.to_list(), timestamps) # Error Happens in this line
model.visualize_topics_over_time(topics_over_time, topics=[0,1,2,3,4,5,6,7,8,9])

The error I get is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 2
      1 # Topics Over Time
----> 2 topics_over_time = topic_model.topics_over_time(docs.to_list(), timestamps)
      3 model.visualize_topics_over_time(topics_over_time, topics=[0,1,2,3,4,5,6,7,8,9])

File ~/.conda/envs/BerTopicOctis/lib/python3.10/site-packages/bertopic/_bertopic.py:768, in BERTopic.topics_over_time(self, docs, timestamps, topics, nr_bins, datetime_format, evolution_tuning, global_tuning)
    766 selected_topics = topics if topics else self.topics_
    767 documents = pd.DataFrame({"Document": docs, "Topic": selected_topics, "Timestamps": timestamps})
--> 768 global_c_tf_idf = normalize(self.c_tf_idf_, axis=1, norm='l1', copy=False)
    770 all_topics = sorted(list(documents.Topic.unique()))
    771 all_topics_indices = {topic: index for index, topic in enumerate(all_topics)}

File ~/.conda/envs/BerTopicOctis/lib/python3.10/site-packages/sklearn/preprocessing/_data.py:1786, in normalize(X, norm, axis, copy, return_norm)
   1783 else:
   1784     raise ValueError("'%d' is not a supported axis" % axis)
-> 1786 X = check_array(
   1787     X,
   1788     accept_sparse=sparse_format,
   1789     copy=copy,
   1790     estimator="the normalize function",
   1791     dtype=FLOAT_DTYPES,
   1792 )
   1793 if axis == 0:
   1794     X = X.T

File ~/.conda/envs/BerTopicOctis/lib/python3.10/site-packages/sklearn/utils/validation.py:867, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    864 if ensure_2d:
    865     # If input is scalar raise error
    866     if array.ndim == 0:
--> 867         raise ValueError(
    868             "Expected 2D array, got scalar array instead:\narray={}.\n"
    869             "Reshape your data either using array.reshape(-1, 1) if "
    870             "your data has a single feature or array.reshape(1, -1) "
    871             "if it contains a single sample.".format(array)
    872         )
    873     # If input is 1D raise error
    874     if array.ndim == 1:

ValueError: Expected 2D array, got scalar array instead:
array=nan.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
@MaartenGr
Owner

Sorry for this! Zero-shot topic modeling currently cannot be combined with topics over time because the c-TF-IDF matrix is missing after fitting. As a workaround, you can call .update_topics first so that the underlying c-TF-IDF matrices are created. After that, .topics_over_time should work.
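
A minimal sketch of that workaround, for illustration only. The helper name `topics_over_time_with_ctfidf` is hypothetical and not part of BERTopic; it only assumes a fitted model that exposes `update_topics(docs, ...)` and `topics_over_time(docs, timestamps)`, as used in this thread. Any extra keyword arguments are forwarded to `update_topics` unchanged.

```python
from typing import Any, Sequence


def topics_over_time_with_ctfidf(
    topic_model: Any,
    docs: Sequence[str],
    timestamps: Sequence[Any],
    **update_kwargs: Any,
):
    """Hypothetical helper: rebuild c-TF-IDF first, then compute topics over time."""
    # Accept a pd.Series or any other sequence; both calls below get plain lists.
    docs = list(docs)
    timestamps = list(timestamps)
    if len(docs) != len(timestamps):
        raise ValueError("docs and timestamps must have the same length")

    # Step 1: recreate the c-TF-IDF matrices that are missing after a
    # zero-shot fit, using the topics already assigned to the documents.
    topic_model.update_topics(docs, **update_kwargs)

    # Step 2: with the matrices in place, compute the topics-over-time table.
    return topic_model.topics_over_time(docs, timestamps)
```

With the model from the original post this could be called as `topics_over_time = topics_over_time_with_ctfidf(topic_model, docs, timestamps)`, and the result passed to `topic_model.visualize_topics_over_time(topics_over_time)`. One caveat, stated as an assumption rather than a guarantee: `update_topics` may fall back to default settings for arguments you do not pass, so if you want to keep the KeyBERTInspired representation you may need to pass `representation_model=KeyBERTInspired()` again.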
