suggestion: incorporating document-level covariates #360

Open
mimshiran opened this issue Dec 7, 2021 · 22 comments
@mimshiran

Hi

I'd like to suggest adding the option of document-level covariates (similar to STM in R). Basically, it allows the user to investigate the relationship between document-level covariates, such as source (political affiliation, for example) or country, and topical prevalence. Is it feasible? It would make the package much more useful for social research.

Thank you!

@MaartenGr
Owner

Thank you for the suggestion! I am not very familiar with document-level covariates as used in STM. However, reading through the STM documentation, there seems to be some overlap in specific cases. For example, to explore the effect of certain document-level variables, you can model topics per variable by following this guide using topics_per_class. Any quantification of that effect, however, would have to be done by the user.
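For reference, a rough sketch of that approach (the 20 newsgroups data and the newsgroup label used as the class are just stand-ins here, and the exact topics_per_class signature may differ between BERTopic versions):

from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Stand-in data: use the newsgroup a post belongs to as the document-level variable
data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
docs = data["data"]
classes = [data["target_names"][i] for i in data["target"]]

# Train a topic model and compute topic frequencies per class
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)
topics_per_class = topic_model.topics_per_class(docs, topics, classes=classes)
topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10)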

Do you feel that topics_per_class sufficiently covers the use case? If not, what would you like to see added?

@mimshiran
Author

Thank you for your response. topics_per_class is very helpful, but in STM the user can investigate the interaction between covariates and their relationship with topical prevalence, for example the interaction of gender with education on the topics written by different people, or include more than one covariate (like a regression), so it is not just the difference in topics across categories. I think this link does a better job of explaining: https://scholar.princeton.edu/files/bstewart/files/stmnips2013.pdf

@MaartenGr
Owner

Thank you for sharing the paper, will make sure to read it through!

It does seem like it would be an interesting and useful extension to BERTopic. I would love to implement it, but it seems like it would take quite some time. Having said that, I'll make sure to put it on the list and see if I can implement a basic version of it.

@MaartenGr
Owner

MaartenGr commented Jan 4, 2022

A quick update:

I have definitely not forgotten about this feature! I took some time in the last weeks to read through the paper and the R code of the STM model and it seems that it would be difficult to replicate the exact procedure in BERTopic. However, I do believe I can perhaps create a proxy for that procedure. It seems that the main difficulty here is preparing the data for a meaningful OLS model for two output variables: topic prevalence and topical content.

Topic Prevalence

We can proxy topic prevalence by leveraging the probability of each document belonging to a specific topic. That way, we can see if a covariate or combination of covariates influences the probability, or prevalence, of a topic across documents. To give you an example, this is a visualization of using the probabilities to see the differences in prevalence between Democrats and Republicans talking about American politics in 2008:

[Image: boxplots of topic prevalence per topic, split by party affiliation]

And then a basic GLM probs ~ rating where probs is the topic prevalence and rating is either Conservative or Liberal for topic 0 (iraq | iraqi | troops):

[Image: GLM regression summary for topic 0 (probs ~ rating)]

However, the results tend to be either significant very quickly or not significant at all, so I might have to do some normalization there.
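For illustration, a minimal sketch of such a model with statsmodels, assuming a metadata dataframe with a rating column and the document-topic probabilities obtained with calculate_probabilities=True (names are placeholders, not a final API):

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Proxy the prevalence of topic 0 by its document-topic probability
data = metadata.copy()                 # assumed dataframe with a "rating" column
data["prevalence"] = probs[:, 0]       # probabilities returned by fit_transform
data = data.loc[data.prevalence < 1]   # exclude exact 1s for the Gamma/log model

# GLM prevalence ~ rating with a Gamma family and log link
model = smf.glm("prevalence ~ rating", data=data,
                family=sm.families.Gamma(link=sm.families.links.log())).fit()
print(model.summary())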

Topical Content

The topical content is a bit more difficult and is one that I am currently working on. I believe I can proxy this by using localized c-TF-IDF representations but there is quite a bit of testing and optimization involved.

Concluding

Note that both implementations can take a bit of time as I am currently doing some experiments for the BERTopic paper that is coming.

What do you think? Does it all make sense? Does it seem useful to your use cases?

@mimshiran
Author

This looks great! Thank you so much for taking the time and working on this. This would be really useful for a more nuanced analysis of text. I don't know how I might be of help, but if there's anything you need help with, please reach out and I'll see if it's within my capabilities.

@MaartenGr
Owner

Although there might be some more experimentation needed, I think users can start testing out the first version of performing covariate analyses within BERTopic. It is not ready to be added to BERTopic as there are some assumptions to the statistical models that I am not entirely convinced of. In the future, I might also consider more elegant approaches but for the time being this is something to experiment with.

Covariates

Within the structural topic model, the covariates and their impact on the topics are modeled during the creation of the topics. This is not the case with BERTopic as it assumes that topics are generated independently from any covariates that might exist. Technically, we can generate embeddings based on the metadata but I do not believe it to be necessary at this moment in order to improve upon the topic generation process. However, it is something to take into account as STM does assume that covariates influence the topic generation process. Do note that both models do assume that covariates might influence both topic content and prevalence.

Topic Prevalence

The topic prevalence is modeled using the document-topic probability matrix as a proxy. This means we assume that the higher the probability of a document belonging to topic t, the higher that topic's prevalence in the document. Again, this assumption does not necessarily hold true, but from some experiments I did, it seems like a strong proxy for topic prevalence.

Topic Content

The topic content is a bit more difficult to implement as a dependent variable, in contrast to topic prevalence where we can directly access the document-topic probability matrix. To do this, we calculate the c-TF-IDF representation of each document instead of the entire topic, which gives us a very localized representation of a topic. We then calculate the cosine similarity between the local c-TF-IDF representation of each document and the c-TF-IDF representation of the topic the document belongs to. This results in a set of similarity scores that we can use to model the topic content and, in turn, calculate the effect of covariates on it. We assume that when a covariate changes the way a topic is represented, the similarity scores will vary, which should be captured in the resulting statistical model.

Code

As mentioned before, I am not at the point of including this in BERTopic, but I am very curious whether this is something users are interested in and what their experience is using this extension. So, I will share the code here in a way that should be easy to use on top of BERTopic.

Minimal Example

We start with a minimal example of how to measure the effect of covariates on topic prevalence and topic content. We are going to be using a corpus consisting of political blogs in 2008 (more info here about the data) with two possible covariates:

  • rating
    • a factor variable giving the partisan affiliation of the blog (based on who they supported for president)
  • day
    • the day of the year (1 to 365). All entries are from 2008.

First, we need to pip install statsmodels and then we can load the data and train a basic BERTopic model:

import pandas as pd
from bertopic import BERTopic

# Load data
df = pd.read_csv("http://scholar.princeton.edu/sites/default/files/bstewart/files/poliblogs2008.csv")
docs = df.documents.tolist()
metadata = df.loc[:, ["rating", "day"]].copy()

# Fit BERTopic
topic_model = BERTopic(calculate_probabilities=True, min_topic_size=50)
topics, probs = topic_model.fit_transform(docs)

In the above example, nothing special is happening thus far except for one thing: we need to put all the metadata into a single dataframe. Here, we are only using rating and day. Before we start extracting the effect of covariates on topic content and prevalence, we first need to load in our function for doing so:

import numpy as np
import pandas as pd
from bertopic import BERTopic
from typing import Union, Callable, List, Mapping, Any
from sklearn.metrics.pairwise import cosine_similarity

import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.base.wrapper as wrap


def estimate_effect(topic_model, 
                    docs: List[str], 
                    probs: np.ndarray, 
                    topics: Union[int, List[int]], 
                    metadata: pd.DataFrame, 
                    y: str = "prevalence", 
                    estimator: Union[str, Callable] = None,
                    estimator_kwargs: Mapping[str, Any] = None) -> List[wrap.ResultsWrapper]:
    
    """ Estimate the effect of metadata on topic prevalence and topic content
    
    Arguments:
        docs: The original list of documents on which the model was trained on
        probs: An m x n probability matrix, where *m* is the number of documents and
               *n* the number of topics. It represents the probabilities of all topics
               across all documents.
        topics: The topic(s) for which you want to estimate the effect of metadata on
        metadata: The metadata in a dataframe. Make sure that the columns have the exact same 
                  name as the elements in the estimator
        y: The target, either "prevalence" (topic prevalence) or "content" (topic content)
        estimator: Either the formula used in the estimator or a custom estimator. 
                   When it is used as a formula, it follows R-style formulas, for example:
                      * 'prevalence ~ rating'
                      * 'prevalence ~ rating + day + rating:day'
                   Make sure that the target is either 'prevalence' or 'content'
                   The custom estimator should be a `statsmodels.formula.api`, currently, 
                   `statsmodels.api` is not supported.
        estimator_kwargs: The arguments needed within the estimator, needs at 
                          least a "formula" argument
                          
    Returns:
        fitted_estimators: List of fitted estimators for either topic prevalence or topic content
    """

    data = metadata.loc[::] 
    data["topics"] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
    fitted_estimators = []
    
    if isinstance(topics, int):
        topics = [topics]
    
    # As a proxy for the topic prevalence, we take the probability of a document
    # belonging to a specific topic. We assume that a higher probability of a document
    # belonging to that topic also means that the document talks more about that topic
    if y == "prevalence":
        for topic in topics:
            # Prepare topic prevalence;
            # exclude probs == 1 as no zero-one inflated beta regressions are currently available
            data["prevalence"] = list(probs[:, topic])
            data_filtered = data.loc[data.prevalence < 1, :]

            # Either use a custom estimator or a pre-set model
            if callable(estimator):
                est = estimator(data=data_filtered, **estimator_kwargs).fit()
            else:
                est = smf.glm(estimator, data=data_filtered, 
                             family=sm.families.Gamma(link=sm.families.links.log())).fit()
            fitted_estimators.append(est)

    # Topic content is modeled on a document-level by calculating the document cTFIDF 
    # representation. Based on that representation, we calculate its cosine similarity 
    # with its topic cTFIDF representation. The assumption here, is that we expect different 
    # similarity scores if a covariate changes the topic content.
    elif y == "content":
        
        # Extract topic content and prevalence
        data=data.loc[data.topic == topic, :]
        c_tf_idf_per_doc, _ = topic_model._c_tf_idf(pd.DataFrame({"Document": docs}), fit=False)
        sim_matrix = cosine_similarity(c_tf_idf_per_doc, topic_model.c_tf_idf)
        data["content"] = sim_matrix[:, topic+1]
        data["prevalence"] = list(probs[:, topic])              
        
        # Either use a custom estimator or a pre-set model
        if callable(estimator):
            est = estimator(data=data, **estimator_kwargs).fit()
        else:
            est = smf.glm(estimator, data=data, 
                          family=sm.families.Gamma(link=sm.families.links.log())).fit()  
        fitted_estimators.append(est)

    return fitted_estimators

The above is the main function for running our statistical models. The only interesting part is the arguments and their documentation, but we will get to that in the next step.

To run the analyses, we can simply call the above function with the appropriate parameters:

ests = estimate_effect(topic_model=topic_model, 
                      topics=[1, 2],
                      metadata=metadata, 
                      docs=docs, 
                      probs=probs, 
                      estimator="prevalence ~ rating",
                      y="prevalence")
print([est.summary() for est in ests])

In the code above there are three parameters that are important, namely topics, estimator, and y:

  • topics
    • A list of topics (or a single topic) for which you want to calculate the effect of covariates. Here, we choose topics 1 and 2.
  • estimator
    • The R-styled formula that you will be using in the model
    • We choose to model the effect of "rating" on "prevalence"
  • y
    • This needs to be the exact same value as the dependent variable, here "prevalence"

To model the topic content, we can simply run:

ests = estimate_effect(topic_model=topic_model, 
                      topics=[1, 2],
                      metadata=metadata, 
                      docs=docs, 
                      probs=probs, 
                      estimator="content ~ rating",
                      y="content")
print([est.summary() for est in ests])

Note that the value of "rating" in estimator corresponds to the column name in metadata.
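Since the estimator follows R-style formulas, you can also include multiple covariates and interactions, which was the original use case raised in this issue. For example, modeling rating, day, and their interaction:

ests = estimate_effect(topic_model=topic_model, 
                      topics=[1, 2],
                      metadata=metadata, 
                      docs=docs, 
                      probs=probs, 
                      estimator="prevalence ~ rating + day + rating:day",
                      y="prevalence")
print([est.summary() for est in ests])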

Custom Statistical Model

We can extend the above by defining a custom statistical model if we expect the data to follow a different distribution or if you simply do not agree with the rough defaults I have set:

estimator = smf.glm
estimator_kwargs = {"formula": 'prevalence ~ rating',
                    "family": sm.families.Gamma(link=sm.families.links.log())}

ests = estimate_effect(topic_model=topic_model, 
                      topics=[1],
                      metadata=metadata, 
                      docs=docs, 
                      probs=probs, 
                      y="prevalence",
                      estimator=estimator,
                      estimator_kwargs=estimator_kwargs)
print([est.summary() for est in ests])

In the example above, you can see that the estimator is now used to select a glm model, whereas the estimator_kwargs are used to define its specific parameters. Here, the formula parameter in estimator_kwargs is necessary, as the estimator otherwise does not know what the formula is.
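As another sketch, any statsmodels.formula.api model that accepts a formula and a data argument should work here; for instance, a plain OLS if you would rather not assume a Gamma distribution (whether that is appropriate depends on your data):

estimator = smf.ols
estimator_kwargs = {"formula": "prevalence ~ rating + day"}

ests = estimate_effect(topic_model=topic_model, 
                      topics=[1],
                      metadata=metadata, 
                      docs=docs, 
                      probs=probs, 
                      y="prevalence",
                      estimator=estimator,
                      estimator_kwargs=estimator_kwargs)
print([est.summary() for est in ests])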

Feedback

This is, hopefully, a fairly straightforward example of analyzing the effects of covariates on topic content and prevalence. It should work on any data. I would advise following along with the above minimal example and perhaps looking at some of the variables to see how they work.

As mentioned before, you can view this as a proof of concept but still usable as it is right now. This does mean, however, that things might be subject to change and this will be improved as more feedback comes in.

In other words, any and all feedback is highly appreciated!

@MaartenGr MaartenGr pinned this issue Jan 10, 2022
@mimshiran
Author

This is so exciting!! I'll test it as soon as possible and let you know if I have any comments. Thank you so much for your work on this.

@simonfelding
Contributor

Very interesting. Thank you for the code example and for pinning this.

@drob-xx

drob-xx commented Feb 2, 2022

I've started looking at the covariates code and really appreciate the willingness to extend the code base and move in this direction. One issue jumps out at me, however. As has been pointed out many times, setting calculate_probabilities=True will greatly increase the processing time to create a model on a corpus of any size. From reading through the issues queue, my understanding is that the underlying issue is with hdbscan.membership_vector and is not directly addressable within BERTopic. Is the thinking that at some point this will be addressed (as hinted in #367)?

I'm bringing this up because I've essentially looked outside of BERTopic to address this whole issue (covariates). I am using BERTopic for an initial round of topic identification and then using vocabularies built upon that initial pass to arrive at cosine similarity scores to determine the impact of a covariate. In my case I identify a relevant vocabulary and then weigh it against party affiliation (Democrat, Republican). If I'm understanding correctly, the solution being pursued here would provide a substitute path entirely within BERTopic. A great idea, but the processing time issue seems like a practical limitation.

@MaartenGr
Owner

@drob-xx You are completely right in stating that setting calculate_probabilities=True can increase the processing time. This, however, only becomes an issue once the dataset is large enough; if you run BERTopic on millions of records, then it is indeed suggested not to set that value to True.

There are two ways of circumventing this. First, you can use a smaller sample of your data to train BERTopic. This is often a valid approach, as millions of documents are typically not necessary to generate a global representation of the topics that exist in the data. However, if you are looking for very specific and small topics among those millions of documents, then sampling would not work. Second, as you suggested, you can look outside of BERTopic. The STM model works quite well with covariates and is often used in these kinds of use cases. The main downside of the STM model is that, as far as I know, there is currently no implementation in Python, although I am not entirely sure of that.
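As a rough illustration of the first option (the sample size here is an arbitrary choice and the dataframe refers back to the minimal example above):

# Train on a random sample so that calculate_probabilities=True stays tractable
sample_df = df.sample(n=20000, random_state=42)
sample_docs = sample_df.documents.tolist()

topic_model = BERTopic(calculate_probabilities=True, min_topic_size=50)
topics, probs = topic_model.fit_transform(sample_docs)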

I am using BERTopic for an initial round of topic identification and then using vocabularies built upon that initial pass to arrive at cosine similarity scores to determine the impact of a covariate. In my case I identify a relevant vocabulary and then weigh it against party affiliation (Democrat, Republican).

Hmmm, I have a bit of trouble wrapping my head around this. Which two variables are you comparing with those cosine similarity scores? Are you building those vocabularies on a document level?

Just to clarify things, the probabilities are only necessary to calculate the effect of covariates on the topic prevalence. Here, we proxy the prevalence of a topic by looking at the probability distributions of topics in a document. We assume that a higher probability of Topic A in Document X would mean that Topic A appears more frequently and that a lower probability of Topic B in Document X would mean that Topic B appears less frequently in that document.

The topic content, on the other hand, does not need the probabilities to be generated. We are calculating the similarity of the c-TF-IDF representation of each document with the c-TF-IDF representation of each topic. The resulting similarity scores are then the dependent variables (sliced by each topic).

@drob-xx

drob-xx commented Feb 6, 2022

@MaartenGr Thanks as always for a quick response. My post was long and likely confusing. I'm not sure how much of my question has to do with your new code as opposed to how I'm approaching my own project. However, I'll keep going as it will likely result in a better understanding on my part of what I'm trying to accomplish.

My project is not calculating a p-value to determine the impact of a covariate on topic relevance. However, I am interested in determining the effect of party identification on the use of vocabularies within US Congressional press releases. I am interested in your code as it would be another way of determining the relevance of party affiliation in press release language. Right now I am using BERTopic to identify relevant topics. I then use TF-IDF calculations to choose (by hand) vocabularies that are relevant to a topic but differ between parties. For example, in press releases dealing with healthcare issues I identified two vocabularies:

RepWords = ['small_business obamacare patient physician hospital option market rural veteran prescription tax law bipartisan medical businesses treatment veterans_affairs medical_center business']
DemWords = ['affordable_care insurance coverage affordable aca trumpcare price cost families information americans communities community pre_existing ']

Then I calculate TF-IDF scores for each press release. Since I of course have party affiliation (and other data) I can then look at those scores in relation to party affiliation.

I've started to use your code, but as I wrote, the first issue that jumped out was the length of time needed to compute the probabilities. Calculating them increased the processing time for some 100K documents of about 300-1500 words from roughly 20-30 minutes to several hours running on a Colab+ account. Since I've gotten used to relatively fast processing times after moving off my desktop system, it reminded me of how CPU-intensive this work is. At base, I was wondering how long that would remain an issue, as it may affect my overall approach.

@MaartenGr
Owner

MaartenGr commented Feb 6, 2022

@drob-xx If my understanding is correct, you are interested in whether the topic representation for a specific topic, for example, "business", may differ based on party affiliation, is that correct? In which case, it is indeed highly relevant to the code I shared with respect to calculating the covariates.

But first, when you want to calculate the differences in vocabulary within a topic between party affiliations, I believe there is no need to calculate the probabilities. They do not seem to be relevant to your use case, so I would advise setting calculate_probabilities=False. This would solve your issue with the number of hours running on a Colab+ account.

My project is not calculating a p-value to determine the impact of a covariate on topic relevance.

And how about calculating a p-value to determine the impact of a covariate on topic content? As mentioned before, it seems that topic content (i.e., vocabulary) is exactly what you are describing, namely, the vocabulary used within a topic for different covariates (e.g., party affiliation). By calculating the p-value, you can explore in which topics the party affiliation has a significant effect on the vocabulary used. This might help you find those topics instead of having to go through them manually.

Right now I am using BERTopic to identify relevant topics. I then use TF-IDF calculations to choose (by hand) vocabularies that are relevant to a topic but differ between parties.

I would advise not using the classical TF-IDF calculations here if you are interested in the differences in vocabulary; c-TF-IDF is much more optimized for such tasks and typically generates better results.

Code

Based on all the above, the procedure for you would then be something as follows:

  • Run BERTopic without calculating the probabilities as they do not directly influence the vocabularies (topic content)
  • Use estimate_effect to calculate the effect of party affiliation on the vocabularies (topic content) in certain topics
  • Based on the output of estimate_effect, calculate the c-TF-IDF representations for all topics sliced by party affiliation to get the vocabularies for each party

We first run BERTopic:

import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Load data
df = pd.read_csv("http://scholar.princeton.edu/sites/default/files/bstewart/files/poliblogs2008.csv")
docs = df.documents.tolist()
metadata = df.loc[:, ["rating", "day"]].copy()

# Fit BERTopic and remove stopwords
vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(min_topic_size=25, vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)

The original estimate_effect had some bugs that I recently worked out, so make sure to use the version below:

import numpy as np
import pandas as pd
from bertopic import BERTopic
from typing import Union, Callable, List, Mapping, Any
from sklearn.metrics.pairwise import cosine_similarity

import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.base.wrapper as wrap


def estimate_effect(topic_model, 
                    docs: List[str], 
                    topics: Union[int, List[int]], 
                    metadata: pd.DataFrame, 
                    y: str = "prevalence", 
                    probs: np.ndarray = None, 
                    estimator: Union[str, Callable] = None,
                    estimator_kwargs: Mapping[str, Any] = None) -> List[wrap.ResultsWrapper]:
    
    """ Estimate the effect of metadata on topic prevalence and topic content
    
    Arguments:
        docs: The original list of documents on which the model was trained on
        probs: An m x n probability matrix, where *m* is the number of documents and
               *n* the number of topics. It represents the probabilities of all topics
               across all documents.
        topics: The topic(s) for which you want to estimate the effect of metadata on
        metadata: The metadata in a dataframe. Make sure that the columns have the exact same 
                  name as the elements in the estimator
        y: The target, either "prevalence" (topic prevalence) or "content" (topic content)
        estimator: Either the formula used in the estimator or a custom estimator. 
                   When it is used as a formula, it follows R-style formulas, for example:
                      * 'prevalence ~ rating'
                      * 'prevalence ~ rating + day + rating:day'
                   Make sure that the target is either 'prevalence' or 'content'
                   The custom estimator should be a `statsmodels.formula.api`, currently, 
                   `statsmodels.api` is not supported.
        estimator_kwargs: The arguments needed within the estimator, needs at 
                          least a "formula" argument
                          
    Returns:
        fitted_estimators: List of fitted estimators for either topic prevalence or topic content
    """

    data = metadata.loc[::] 
    data["topics"] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
    data["docs"] = docs
    fitted_estimators = []
    
    if isinstance(topics, int):
        topics = [topics]
    
    # As a proxy for the topic prevalence, we take the probability of a document
    # belonging to a specific topic. We assume that a higher probability of a document
    # belonging to that topic also means that the document talks more about that topic
    if y == "prevalence":
        for topic in topics:
            # Prepare topic prevalence;
            # exclude probs == 1 as no zero-one inflated beta regressions are currently available
            data["prevalence"] = list(probs[:, topic])
            data_filtered = data.loc[data.prevalence < 1, :]

            # Either use a custom estimator or a pre-set model
            if callable(estimator):
                est = estimator(data=data_filtered, **estimator_kwargs).fit()
            else:
                est = smf.glm(estimator, data=data_filtered, family=sm.families.Gamma(link=sm.families.links.log())).fit()
            fitted_estimators.append(est)

    # Topic content is modeled on a document-level by calculating the document cTFIDF 
    # representation. Based on that representation, we calculate its cosine similarity 
    # with its topic cTFIDF representation. The assumption here, is that we expect different 
    # similarity scores if a covariate changes the topic content.
    elif y == "content":
        for topic in topics:
            # Extract topic content and prevalence
            selected_data = data.loc[data.topics == topic, :]
            c_tf_idf_per_doc, _ = topic_model._c_tf_idf(pd.DataFrame({"Document": selected_data.docs.tolist()}), fit=False)
            sim_matrix = cosine_similarity(c_tf_idf_per_doc, topic_model.c_tf_idf)
            selected_data["content"] = sim_matrix[:, topic+1]

            # Either use a custom estimator or a pre-set model
            if callable(estimator):
                est = estimator(data=selected_data, **estimator_kwargs).fit()
            else:
                est = smf.glm(estimator, data=selected_data, 
                              family=sm.families.Gamma(link=sm.families.links.log())).fit()  # perhaps remove the gamma + link?
            fitted_estimators.append(est)

    return fitted_estimators

Then, using the updated estimate_effect function, we can calculate the effect of party affiliation on the vocabularies:

ests = estimate_effect(topic_model=topic_model, 
                      topics=[1, 2],
                      metadata=metadata, 
                      docs=docs, 
                      probs=None, 
                      estimator="content ~ rating",
                      y="content")
print([est.summary() for est in ests])
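To scan several topics at once for a significant effect of rating on the vocabulary, a rough sketch (using the Gamma GLM defaults above) could pull the p-values out of each fitted estimator:

topics_to_check = [1, 2, 3, 4, 5]
ests = estimate_effect(topic_model=topic_model, 
                      topics=topics_to_check,
                      metadata=metadata, 
                      docs=docs, 
                      probs=None, 
                      estimator="content ~ rating",
                      y="content")

# Small p-values suggest that the party affiliation changes the vocabulary
# used within that topic
for topic, est in zip(topics_to_check, ests):
    pvalues = est.pvalues.drop("Intercept")
    print(topic, pvalues.round(4).to_dict())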

Now, we can calculate the vocabularies for each party affiliation and each topic:

def calculate_ctfidf_representation(topic_model, df, rating):
    # Select only the documents written with the given rating (party affiliation)
    selected_data = df.loc[df.rating == rating, :]

    # Merge all documents per topic so that a topic-level c-TF-IDF can be calculated
    documents_per_topic = selected_data.groupby(["Topic"], as_index=False).agg({"Document": " ".join, "blog": "count"})
    ctfidf, words = topic_model._c_tf_idf(documents_per_topic, fit=False)

    # Extract the top words per topic from the sliced c-TF-IDF matrix
    labels = sorted(list(documents_per_topic.Topic.unique()))
    sliced_topics = topic_model._extract_words_per_topic(words=words, c_tf_idf=ctfidf, labels=labels)
    return sliced_topics

# Make sure that the original dataframe is in the correct format
df = pd.read_csv("http://scholar.princeton.edu/sites/default/files/bstewart/files/poliblogs2008.csv")
df["Topic"] = topics
df.rename({"documents": "Document"}, axis=1, inplace=True)

# Calculate topic vocabularies
conservative_topics = calculate_ctfidf_representation(topic_model, df, "Conservative")
liberal_topics = calculate_ctfidf_representation(topic_model, df, "Liberal")

The conservative_topics and liberal_topics contain the topic representations for each topic, sliced by party affiliation. To compare the vocabularies of topic 20 between party affiliations, simply run conservative_topics[20] and liberal_topics[20], as in the sketch below.
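For a quick side-by-side look, assuming (as with topic_model.get_topic) that each entry is a list of (word, score) tuples:

# Compare the top-10 words of topic 20 between party affiliations
for name, sliced_topics in [("Conservative", conservative_topics),
                            ("Liberal", liberal_topics)]:
    top_words = [word for word, _ in sliced_topics[20][:10]]
    print(f"{name}: {top_words}")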

EDIT: Forgot to add some processing

@drob-xx

drob-xx commented Feb 6, 2022

This is all very interesting. I will continue to dig. Thanks very much, as always, for taking the time!

@drob-xx

drob-xx commented May 31, 2022

@MaartenGr I finally got back to this and wanted to report my experience. Both approaches you have presented here, one for calculating the covariates and the other for "sub-selecting" topic vocabularies from subsets of documents, are very cool and seem quite powerful. I am summarizing here to close the loop, as well as to make sure I understand what these techniques are doing.

My corpus is a near complete set of U.S. Congressional press releases from 2017 to 2020 (the 115th and 116th Congresses). I am interested in the overall topic composition as well as differences in the subject matter Republicans and Democrats talk about and differences in how they talk about the same topics.

I have not come up with the final tuning of BERTopic that I want to use for this project but the settings I've used seem more than adequate for this stage. Here is what I ran:

vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words="english")

FortyModelDiv8 = BERTopic(top_n_words=20, 
                       vectorizer_model=vectorizer_model,
                       nr_topics=40,
                       calculate_probabilities=False,
                       verbose=True,
                       low_memory=True,
                       diversity=.8,
                       )
topics, _ = FortyModelDiv8.fit_transform(AllPRs['PRText'].to_list())

I then ran the second version of estimate_effect presented in this thread on two topics where the topic clusters are:

26 border wall emergency security national declaration border security southern border immigration 
29 army river project corps engineers everglades infrastructure funding harbor restoration dams

I did a comparison of content based on the party affiliation of the author of the press releases. Here are the outputs:

[<class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                content   No. Observations:                  802
Model:                            GLM   Df Residuals:                      799
Model Family:                   Gamma   Df Model:                            2
Link Function:                    log   Scale:                        0.088110
Method:                          IRLS   Log-Likelihood:                 1236.2
Date:                Mon, 30 May 2022   Deviance:                       75.742
Time:                        21:06:22   Pearson chi2:                     70.4
No. Iterations:                     6                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               -1.7422      0.014   -127.376      0.000      -1.769      -1.715
Party[T.Independent]     0.3865      0.297      1.301      0.193      -0.196       0.969
Party[T.Republican]     -0.0004      0.021     -0.019      0.985      -0.042       0.041
========================================================================================
""", <class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                content   No. Observations:                  753
Model:                            GLM   Df Residuals:                      750
Model Family:                   Gamma   Df Model:                            2
Link Function:                    log   Scale:                         0.12986
Method:                          IRLS   Log-Likelihood:                 1078.2
Date:                Mon, 30 May 2022   Deviance:                       112.65
Time:                        21:06:22   Pearson chi2:                     97.4
No. Iterations:                     6                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               -1.7571      0.021    -83.606      0.000      -1.798      -1.716
Party[T.Independent]    -0.0200      0.138     -0.145      0.885      -0.290       0.250
Party[T.Republican]     -0.1414      0.027     -5.235      0.000      -0.194      -0.088
========================================================================================
"""]

My understanding of how estimate_effect works when configured to measure content is that it takes the cosine similarity of the c-TF-IDF representation of each document against the model's overall c-TF-IDF values, producing a value for each document/topic pairing. This value, compared in this case to the party affiliation (Republican or Democrat), is used to calculate a p-value.

In these cases, the first result for T.Republican of 0.985 and the second of 0.000 indicate that party membership probably does not influence the vocabulary used when discussing border/immigration issues, but does when talking about public works/water control related topics. This would seemingly be explained by the fact that immigration issues are national, high profile, and addressed by both parties, whereas appropriations and projects having to do with water control, presumably centered in rural, southern districts, are more of an issue for Republicans than for Democrats. (The number of press releases from Independents is very, very small, which is why I'm not commenting on them.)

The second suggestion you had for extracting the cTFIDF results for subsets of the corpus (split by party) was very interesting. Here is the output from two different topics - the first line is the overall topic list, the second for Democrats and the third for Republicans:

('violence background safety checks shooting act schools prevention background check firearm legislation concealed carry check law enforcement gun violence prevention weapons mass school violence hr',
 'violence background background checks checks school safety prevention gun violence prevention weapons check act mass firearm legislation las vegas lives law shootings people house',
 'violence concealed schools carry law enforcement school safety stop school safety school violence act act reciprocity stop concealed carry reciprocity firearms lawabiding amendment students legislation second background')

('jerusalem peace palestinian united resolution middle east embassy east ally security states middle twostate solution support house solution statement president capital bds movement',
 'peace twostate twostate solution solution united west bank resolution jerusalem security annexation israelis palestinians united states statement conflict east state middle east congress support ally',
 'jerusalem embassy united resolution middle east peace ally east middle capital hamas antiisrael states bds movement security support house congressman trump rep')

As always thanks so much for this excellent package and the patience and dedication you show here.

@MaartenGr
Owner

@drob-xx Thank you for sharing your experience and thoughts in such an extensive way! Definitely helps to understand how this is being used and what the potential bottlenecks are when using this.

@SoranHD

SoranHD commented Jul 12, 2023

This thread has been extremely useful, so thanks to everyone who has contributed!
@MaartenGr would it be possible to share the code you used to create the topic prevalence plot you shared above?

@MaartenGr
Owner

@SoranHD It's been a while and I do not think I have that code around anymore. It should be reproducible though based on the code I have shared above for calculating and approaching topic prevalence.

@justin-boldsen

Thank you for your response. topics_per_class is very helpful, but in STM the user can investigate the interaction between covariates and their relationship with topical prevalence, for example the interaction of gender with education on the topics written by different people, or include more than one covariate (like a regression), so it is not just the difference in topics across categories. I think this link does a better job of explaining: https://scholar.princeton.edu/files/bstewart/files/stmnips2013.pdf

The above link isn't working for me but for anyone looking, I think this is the correct paper: https://projects.iq.harvard.edu/files/wcfia/files/stmnips2013.pdf

@calvinchengyx

Hi @MaartenGr, thanks very much for this feature; I found it super useful! Is it possible you could also share the code of the example, "the visualization of using the probabilities to see the differences in prevalence between democrats and republicans talking about American politics in 2008"? Thanks so much.

@MaartenGr
Owner

@calvinchengyx Have you checked my comment above?

It's been a while and I do not think I have that code around anymore. It should be reproducible though based on the code I have shared above for calculating and approaching topic prevalence.

@calvinchengyx

calvinchengyx commented Feb 21, 2024

@calvinchengyx Have you checked my comment above?

It's been a while and I do not think I have that code around anymore. It should be reproducible though based on the code I have shared above for calculating and approaching topic prevalence.

Yes! The calculation is all reproducible, and thanks again for sharing it! I already used the GLM table outputs as shared above. Just wondering if there is still the code for the boxplot, which would be super helpful for the presentation.

@MaartenGr
Owner

My message above was in response to a user asking for code to reproduce that specific plot. Unfortunately, I do not have that code available. I believe it was simply some matplotlib code, so it should be straightforward to create.
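For anyone who still wants a starting point, a minimal sketch (not the original code), assuming the metadata and probs from the minimal example earlier in this thread:

import matplotlib.pyplot as plt

# Boxplot of topic prevalence (document-topic probability) split by rating
plot_df = metadata.copy()
plot_df["prevalence"] = probs[:, 0]   # topic 0 as an example
plot_df.boxplot(column="prevalence", by="rating")
plt.title("Topic 0 prevalence by rating")
plt.suptitle("")
plt.show()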
