Filterable Data Source

Bryan Van de Ven edited this page May 25, 2019 · 3 revisions

Proposal for a new filterable data source

Current pain point

Right now, Bokeh (version 0.12.4) requires glyphs to use full columns of data from a ColumnDataSource (CDS). This makes it difficult to link plots using row-wise subsets of data.

For example, consider the following use case, taken from a question on the Discourse:

I have a case where I show a full set of records on a plot, and then list a subset of those records, with additional details, in a datatable. I want users to be able to select a row of the datatable, and have the corresponding data point show as selected in the plot.

[Animated GIF: example subset selection]

Because there is no way in Bokeh to specify subsets of a data source, a user has to use two data sources, one for the scatter plot and one for the data table, even though the underlying data for the two is the same. Users also have to write a CustomJS callback and keep track of the indices themselves to produce the simple linked selection shown in the gif above. The unusual structure of the selected property on data sources adds to the confusion: it has 0d, 1d, and 2d sub-properties, and the indices live inside those.

from bokeh.plotting import figure, output_file, show
from bokeh.models import CustomJS
from bokeh.models.sources import ColumnDataSource
from bokeh.layouts import column
from bokeh.models.widgets import DataTable, TableColumn, Button

output_file("subset_example.html")

data = dict(
        index = list(range(10)),
        x = list(range(10)),
        y = list(range(10)),
        z = ['some other data'] * 10
    )

# Rows of the full data that the table should show (y > 5).
filtered_index = [i for i, y in enumerate(data['y']) if y > 5]
filtered_data = dict(
        index = filtered_index,
        x = [data['x'][i] for i in filtered_index],
        y = [data['y'][i] for i in filtered_index],
        z = ['some other data'] * len(filtered_index)
    )

source1 = ColumnDataSource(data)
source2 = ColumnDataSource(filtered_data)

fig1 = figure(plot_width=300, plot_height=300)
fig1.circle(x='x', y='y', size=10, source=source1)

columns = [
        TableColumn(field="y", title="Y"),
        TableColumn(field="z", title="Text"),
    ]
data_table = DataTable(source=source2, columns=columns, width=400, height=280)

button = Button(label="Select")
button.callback = CustomJS(args=dict(source1=source1, source2=source2), code="""
        // Row positions selected in the data table (the subset source).
        var inds_in_source2 = source2['selected']['1d'].indices;

        var d = source2['data'];
        var inds = [];

        if (inds_in_source2.length == 0) { return; }

        // Map each subset row back to its row in the full data source.
        for (var i = 0; i < inds_in_source2.length; i++) {
            var ind2 = inds_in_source2[i];
            inds.push(d['index'][ind2]);
        }

        source1['selected']['1d'].indices = inds;
        source1.trigger('change');
    """)

show(column(fig1, data_table, button))
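
The index bookkeeping that the callback performs can be shown in isolation. The following is a minimal plain-Python sketch, not Bokeh code; the helper name to_full_indices is hypothetical and only illustrates how a selection made on the subset has to be translated back to rows of the full data.

```python
# Same shape of data as the example above, reduced to the columns that matter.
data = dict(x=list(range(10)), y=list(range(10)))

# Row positions in the full data that pass the filter (y > 5).
filtered_index = [i for i, y in enumerate(data['y']) if y > 5]


def to_full_indices(subset_selection, filtered_index):
    """Translate row positions in the subset into row positions in the full data.

    Hypothetical helper: it mirrors the loop in the CustomJS callback.
    """
    return [filtered_index[i] for i in subset_selection]


# Selecting the first and third table rows selects full rows 6 and 8.
full_selection = to_full_indices([0, 2], filtered_index)
print(full_selection)
```

Every application that links a subset to its parent data has to repeat this translation by hand, which is the duplication the proposal aims to remove.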

Solution: a filterable data source

My proposal is to add a filterable data source that keeps track of which rows to provide to each renderer that is associated with it. This would allow users to specify subsets of data (e.g. filtered by the value of some column) for individual glyphs. Applications with multiple plots, each using a subset of the same data, would share the data in a similar way to how Bokeh allows plots to share full CDSs now. Linked selection between these plots would be automatic, so that users don't have to write a CustomJS callback to get the functionality shown above.

Constraints

We don't want to break any of the API on the CDS.

Proposed implementation: introducing TableDataSource

Instead of changing the CDS to make it filterable, we can introduce a new data source, potentially called TableDataSource.

The TableDataSource would keep track of filters and indices for each renderer that uses it. The filters could be (as suggested by @bryedev here) either None, a Seq(Int) which lists the subset indices, a Seq(Bool) for boolean filtering, or a function that returns either a Seq(Int) or Seq(Bool).
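
One way to handle the four filter forms is to normalize all of them to a list of row indices. The sketch below is plain Python under stated assumptions: resolve_filter is a hypothetical name, the data is a dict of equal-length columns, and a filter function receives that dict (the proposal passes the data source itself).

```python
def resolve_filter(flt, data):
    """Normalize a filter to a list of row indices into `data`.

    Hypothetical helper. `flt` may be None, a sequence of ints, a sequence
    of bools, or a callable returning either kind of sequence.
    """
    n = len(next(iter(data.values()))) if data else 0
    if flt is None:
        # No filter: every row is visible.
        return list(range(n))
    if callable(flt):
        flt = flt(data)
    flt = list(flt)
    # bool is a subclass of int, so test for a boolean mask first.
    if flt and all(isinstance(v, bool) for v in flt):
        if len(flt) != n:
            raise ValueError("boolean filter must have one entry per row")
        return [i for i, keep in enumerate(flt) if keep]
    # Otherwise treat the values as explicit row indices.
    return [int(i) for i in flt]


data = dict(weather=['sunny', 'rainy', 'sunny'], x=[1, 2, 3])
print(resolve_filter(None, data))
print(resolve_filter([0, 2], data))
print(resolve_filter([True, False, True], data))
print(resolve_filter(lambda d: [w == 'sunny' for w in d['weather']], data))
```

Normalizing to indices up front would let the rest of the machinery (rendering, selection) deal with a single representation, whatever form the user supplied.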

The TableDataSource would also implement __getitem__, so that the following syntax would be possible:

tds = TableDataSource(df)
r = fig.circle(x='x', y='y', source=tds[tds['weather'] == 'sunny'])
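
To make the indexing syntax concrete, here is a minimal plain-Python sketch of how __getitem__ could serve both purposes: a string key returns a column, and a boolean mask returns a filtered view. The classes Column and SketchTableSource are hypothetical stand-ins, not part of Bokeh, and the real design may differ.

```python
class Column(list):
    """A list whose comparisons are element-wise and return a boolean mask."""

    def __eq__(self, other):
        return [v == other for v in self]

    def __gt__(self, other):
        return [v > other for v in self]


class SketchTableSource:
    """Hypothetical stand-in for the proposed TableDataSource."""

    def __init__(self, data, indices=None):
        self.data = data  # shared with every view, never copied
        n = len(next(iter(data.values()))) if data else 0
        self.indices = list(range(n)) if indices is None else list(indices)

    def __getitem__(self, key):
        if isinstance(key, str):
            # Column lookup over the full data, so masks line up with full rows.
            return Column(self.data[key])
        # Anything else is treated as a boolean mask over the full data.
        mask = list(key)
        kept = [i for i in self.indices if mask[i]]
        return SketchTableSource(self.data, kept)


tds = SketchTableSource(dict(weather=['sunny', 'rainy', 'sunny'], x=[1, 2, 3]))
sunny = tds[tds['weather'] == 'sunny']
print(sunny.indices)
```

The filtered view keeps a reference to the same data and records only which rows it exposes, so no column is duplicated.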

The TableDataSource would subclass DataSource and inherit the selected property, which keeps track of the selection on the full dataset. With some work in the glyph renderer and selection manager (sort of like the changes in this commit, though the details would be different), linked selection would just work.

Unlike the CDS, which holds the data itself in a data property, the TDS would have a cds property: a data source object that can be shared among multiple TDSs. This would separate the data from the filters and allow each subset to be represented by a filter on that shared data.

cds = ColumnDataSource(data)
tds1 = TableDataSource(cds, filter=lambda tds: tds['weather'] == 'sunny')
tds2 = TableDataSource(cds, filter=[0, 1, 2])

# Sources created from the same cds are all automatically linked
r0 = fig0.circle(x='x', y='y', source=cds)
r1 = fig1.circle(x='x', y='y', source=tds1)
r2 = fig2.circle(x='x', y='y', source=tds2)
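
The automatic linking described above can be sketched as follows: the selection is stored once, in full-data row indices, and each filtered view translates it to and from its own row positions. This is a plain-Python illustration under that assumption; SharedSelection and View are hypothetical names, and the actual glyph renderer and selection manager changes would look different.

```python
class SharedSelection:
    """Selection stored once, as row indices into the full data."""

    def __init__(self):
        self.indices = []


class View:
    """Hypothetical filtered view: a list of full-data rows plus the shared selection."""

    def __init__(self, selection, indices):
        self.selection = selection
        self.indices = list(indices)

    def select_local(self, local_rows):
        # Translate this view's row positions to full-data rows and store them.
        self.selection.indices = sorted(self.indices[i] for i in local_rows)

    def selected_local(self):
        # Translate the shared selection back to this view's row positions,
        # dropping rows that the view does not expose.
        position = {full: local for local, full in enumerate(self.indices)}
        return [position[i] for i in self.selection.indices if i in position]


selection = SharedSelection()
plot_view = View(selection, range(10))       # the scatter plot shows every row
table_view = View(selection, [6, 7, 8, 9])   # the table shows rows with y > 5

table_view.select_local([1, 3])              # user picks table rows 1 and 3
print(plot_view.selected_local())            # rows highlighted in the plot
```

Because both views read from the same selection object, a selection made in either one is visible in the other without any user-written callback.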

End result

No Button or CustomJS necessary!

[Animated GIF: better subset selection!]

from bokeh.plotting import figure, output_file, show
from bokeh.models.sources import TableDataSource  # proposed class, not yet in Bokeh
from bokeh.layouts import column
from bokeh.models.widgets import DataTable, TableColumn

output_file("subset_example_tds.html")

data = dict(
        x=list(range(10)),
        y=list(range(10)),
        z=['some other data'] * 10
    )

source = TableDataSource(data)

fig1 = figure(plot_width=300, plot_height=300)
fig1.circle(x='x', y='y', size=10, source=source)

columns = [
        TableColumn(field="y", title="Y"),
        TableColumn(field="z", title="Text"),
    ]
data_table = DataTable(source=source[source['y'] > 5], columns=columns, width=400, height=280)

show(column(fig1, data_table))