Pipeline#

Datashader provides a flexible series of processing stages that map from raw data into viewable images. As shown in the Introduction, using datashader can be as simple as calling datashade(), but understanding each of these stages will help you get the most out of the library.

The stages in a datashader pipeline are similar to those in a 3D graphics shading pipeline:

pipeline diagram

Here the computational steps are listed across the top of the diagram, while the data structures or objects are listed along the bottom. Breaking up the computations in this way is what makes Datashader able to handle arbitrarily large datasets, because only one stage (Aggregation) requires access to the entire dataset. The remaining stages use a fixed-sized data structure regardless of the input dataset, allowing you to use any visualization or embedding methods you prefer without running into performance limitations.

In this notebook, we’ll first put together a simple, artificial example to get some data, and then show how to configure and customize each of the data-processing stages involved:

  1. Projection

  2. Aggregation

  3. Transformation

  4. Colormapping

  5. Embedding

Data#

For an example, we’ll construct a dataset made of five overlapping 2D Gaussian distributions with different σs (spatial scales). By default we’ll have 10,000 datapoints from each category, but you should see sub-second response times even for 1 million datapoints per category if you increase num.

import pandas as pd
import numpy as np

num=10000
np.random.seed(1)

dists = {cat: pd.DataFrame(dict([('x',np.random.normal(x,s,num)), 
                                 ('y',np.random.normal(y,s,num)), 
                                 ('val',val), 
                                 ('cat',cat)]))      
         for x,  y,  s,  val, cat in 
         [(  2,  2, 0.03, 10, "d1"), 
          (  2, -2, 0.10, 20, "d2"), 
          ( -2, -2, 0.50, 30, "d3"), 
          ( -2,  2, 1.00, 40, "d4"), 
          (  0,  0, 3.00, 50, "d5")] }

df = pd.concat(dists,ignore_index=True)
df["cat"]=df["cat"].astype("category")

Datashader can work many different data objects provided by different data libraries depending on the type of data involved, such as columnar data in Pandas or Dask dataframes, gridded multidimensional array data using xarray, columnar data on GPUs using cuDF, multidimensional arrays on GPUs using CuPy, and ragged arrays using SpatialPandas (see the Performance User Guide for a guide to selecting an appropriate library). Here, we’re using a Pandas dataframe, with 50,000 rows by default:

df.tail()
x y val cat
49995 -1.397579 0.610189 50 d5
49996 -2.649610 3.080821 50 d5
49997 1.933360 0.243676 50 d5
49998 4.306374 1.032139 50 d5
49999 -0.493567 -2.242669 50 d5

To illustrate this dataset, we’ll make a quick-and-dirty Datashader plot that dumps these x,y coordinates into an image:

import datashader as ds
import datashader.transfer_functions as tf

%time tf.shade(ds.Canvas().points(df,'x','y'))
CPU times: user 519 ms, sys: 27.6 ms, total: 546 ms
Wall time: 551 ms

Without any special tweaking, datashader is able to reveal the overall shape of this distribution faithfully: four summed 2D normal distributions of different variances, arranged at the corners of a square, overlapping another very high-variance 2D normal distribution centered in the square. This immediately obvious structure makes a great starting point for exploring the data, and you can then customize each of the various stages involved as described below.

Of course, this is just a static plot, and you can’t see what the axes are, so we can instead embed this data into an interactive plot if we prefer:

import holoviews as hv
from holoviews.operation.datashader import datashade
hv.extension("bokeh")

datashade(hv.Points(df)).opts(width=700, height=700)

Here, if you are running a live Python process, you can enable the “wheel zoom” tool on the right, zoom in anywhere in the distribution, and datashader will render a new image that shows the full distribution at that new location. If you are viewing this on a static web site, zooming will simply make the existing set of pixels larger, because this dynamic updating requires Python.

Now that you can see the overall result, we’ll unpack each of the steps in the Datashader pipeline and show how this image is constructed from the data.

Projection#

Datashader is designed to render datasets projected on to a 2D rectangular grid, eventually generating an image where each pixel corresponds to one cell in that grid. The Projection stage is primarily conceptual, as it consists of you deciding what you want to plot and how you want to plot it:

  • Variables: Select which variable you want to have on the x axis, and which one for the y axis. If those variables are not already columns in your dataframe (e.g. if you want to do a coordinate transformation), you’ll need to create suitable columns mapping directly to x and y for use in the next step. For this example, the “x” and “y” columns are conveniently named x and y already, but any column name can be used for these axes.

  • Ranges: Decide what ranges of those values you want to map onto the scene. If you omit the ranges, datashader will calculate the ranges from the data values, but you will often wish to supply explicit ranges for three reasons:

    1. Calculating the ranges requires a complete pass over the data, which takes nearly as much time as actually aggregating the data, so your plots will be about twice as fast if you specify the ranges.

    2. Real-world datasets often have some outliers with invalid values, which can make it difficult to see the real data, so after your first plot you will often want to specify only the range that appears to have valid data.

    3. Over the valid range of data, you will often be mainly interested in a specific region, allowing you to zoom in to that area (though with an interactive plot you can always do that as needed).

  • Axis types: Decide whether you want 'linear' or 'log' axes.

  • Resolution: Decide what size of aggregate array you are going to want.

Here’s an example of specifying a Canvas (a.k.a. “Scene”) object for a 200x200-pixel image covering the range +/-8.0 on both axes:

canvas = ds.Canvas(plot_width=300, plot_height=300, 
                   x_range=(-8,8), y_range=(-8,8), 
                   x_axis_type='linear', y_axis_type='linear')

At this stage, no computation has actually been done – the canvas object is a purely declarative, recording your preferences to be applied in the next stage.

Aggregation#

Once a Canvas object has been specified, it can then be used to guide aggregating the data into a fixed-sized grid. Data is assumed to consist of a series of items, each of which has some visible representation (its rendering as a “glyph”) that is combined with the representation of other items to produce an aggregate representation of the whole set of items in the rectangular grid. The available glyph types for representing a data item are currently:

  • Canvas.points: each data item is a coordinate location (an x,y pair), mapping into the single closest grid cell to that datapoint’s location.

  • Canvas.line: each data item is a coordinate location, mapping into every grid cell falling between this point’s location and the next in a straight line segment.

  • Canvas.area: each data item is a coordinate location, rendered as a shape filling the axis-aligned area between this point, the next point, and a baseline (e.g. zero, filling the area between a line and a base).

  • Canvas.trimesh: each data item is a triple of coordinate locations specifying a triangle, filling in the region bounded by that triangle.

  • Canvas.polygons: each data item is a sequence of coordinate locations specifying a polygon, filling in the region bounded by that polygon (minus holes if specified separately).

  • Canvas.raster: the collection of data items is an array specifying regularly spaced axis-aligned rectangles forming a regular grid; each cell in this array is rendered as a filled rectangle.

  • Canvas.quadmesh: the collection of data items is an array specifying irregularly spaced quadrilaterals forming a grid that is regular in the input space but can have arbitrary rectilinear or curvilinear shapes in the aggregate grid; each cell in this array is rendered as a filled quadrilateral.

These types are each covered in detail in the User Guide. Many other plots like time series and network graphs can be constructed out of these basic primitives. Datashader can also be extended to add additional types here and in each section below; see Extending Datashader for more details.

2D Reductions#

Once you have determined your mapping, you’ll next need to choose a reduction operator to use when aggregating multiple datapoints into a given pixel. For points, each datapoint is mapped into a single pixel, while the other glyphs have spatial extent and can thus map into multiple pixels, each of which operates the same way. All glyphs act like points if the entire glyph is contained within that pixel. Here we will talk only about “datapoints” for simplicity, which for an area-based glyph should be interpreted as “the part of that glyph that falls into this pixel”.

All of the currently supported reduction operators are incremental, which means that we can efficiently process datasets in a single pass. Given an aggregate bin to update (typically corresponding to one eventual pixel) and a new datapoint, the reduction operator updates the state of the bin in some way. (Actually, datapoints are normally processed in batches for efficiency, but it’s simplest to think about the operator as being applied per data point, and the mathematical result should be the same.) A large number of useful reduction operators are supplied in ds.reductions, including:

count(column=None): increment an integer count each time a datapoint maps to this bin. The resulting aggregate array will be an unsigned integer type, allowing counts to be distinguished from the other types that are normally floating point.

any(column=None): the bin is set to 1 if any datapoint maps to it, and 0 otherwise.

sum(column): add the value of the given column for this datapoint to a running total for this bin.

summary(name1=op1,name2=op2,...): allows multiple reduction operators to be computed in a single pass over the data; just provide a name for each resulting aggregate and the corresponding reduction operator to use when creating that aggregate. If multiple aggregates are needed for the same dataset and the same Canvas, using summary will generally be much more efficient than making multiple separate passes over the dataset.

The API documentation contains the complete list of reduction operators provided, including mean, min, max, var (variance), std (standard deviation). The reductions are also imported into the datashader namespace for convenience, so that they can be accessed like ds.mean() here.

For the operators above, those accepting a column argument will only do the operation if the value of that column for this datapoint is not NaN. E.g. count with a column specified will count the datapoints having non-NaN values for that column.

Once you have selected your reduction operator, you can compute the aggregation for each pixel-sized aggregate bin:

canvas.points(df, 'x', 'y', agg=ds.count())
<xarray.DataArray (y: 300, x: 300)> Size: 360kB
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(300, 300), dtype=uint32)
Coordinates:
  * y        (y) float64 2kB -7.973 -7.92 -7.867 -7.813 ... 7.867 7.92 7.973
  * x        (x) float64 2kB -7.973 -7.92 -7.867 -7.813 ... 7.867 7.92 7.973
Attributes:
    x_range:  (-8, 8)
    y_range:  (-8, 8)

The result of will be an xarray DataArray data structure containing the bin values (typically one value per bin, but more for multiple category or multiple-aggregate operators) along with axis range and type information.

We can visualize this array in many different ways by customizing the pipeline stages described in the following sections, but for now we’ll simply render images using the default parameters to show the effects of a few different aggregate operators:

tf.Images(tf.shade(   canvas.points(df,'x','y', ds.count()),     name="count()"),
          tf.shade(   canvas.points(df,'x','y', ds.any()),       name="any()"),
          tf.shade(   canvas.points(df,'x','y', ds.mean('y')),   name="mean('y')"),
          tf.shade(50-canvas.points(df,'x','y', ds.mean('val')), name="50- mean('val')"))
count()

any()

mean('y')

50- mean('val')

Here count() renders each possible count in a different color, to show the statistical distribution by count, while any() turns on a pixel if any point lands in that bin, and mean('y') averages the y column for every datapoint that falls in that bin. Of course, since every datapoint falling into a bin happens to have the same y value, the mean reduction with y simply scales each pixel by its y location.

For the last image above, we specified that the val column should be used for the mean reduction, which in this case results in each category being assigned a different shade of blue, because in our dataset all items in the same category happen to have the same val. Here we also manipulated the result of the aggregation before displaying it by subtracting it from 50, a Transformation as described in more detail below.

3D Reductions#

The above examples are 2D reductions that generate a single x × y aggregate array when given a particular reduction operator. You can instead supply a by reduction operator to do 3D reductions, resulting in a stack of aggregate arrays, x × y × c, where c is some other column:

by(column, reduction): create a 3D stack of aggregates, with each datapoint contributing to one of the aggregate arrays according to the value of the indicated categorical column or categorizer object and using the indicated 2D reduction operator to create each 2D array in the 3D stack.

The resulting 3D stack can later be selected or collapsed into a 2D array for visualization, or the third dimension can be used for colormappping to generate an x × y × color image.

For instance, to aggregate counts by our categorical cat dimension:

aggc = canvas.points(df, 'x', 'y', ds.by('cat', ds.count()))
aggc
<xarray.DataArray (y: 300, x: 300, cat: 5)> Size: 2MB
array([[[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        ...,
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        ...,
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        ...,
...
        ...,
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        ...,
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        ...,
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]], shape=(300, 300, 5), dtype=uint32)
Coordinates:
  * y        (y) float64 2kB -7.973 -7.92 -7.867 -7.813 ... 7.867 7.92 7.973
  * x        (x) float64 2kB -7.973 -7.92 -7.867 -7.813 ... 7.867 7.92 7.973
  * cat      (cat) <U2 40B 'd1' 'd2' 'd3' 'd4' 'd5'
Attributes:
    x_range:  (-8, 8)
    y_range:  (-8, 8)

Here the count() aggregate has been collected into not just one 2D aggregate array, but a whole stack of aggregate arrays, one per cat value, making the aggregate be three dimensional x × y × cat rather than just two x × y.

With this 3D aggregate of counts per category, you can then select a specific category or subset of them for further processing, where .sum(dim='cat') will collapse across such a subset to give a single aggregate array:

agg_d3_d5=aggc.sel(cat=['d3', 'd5']).sum(dim='cat').astype('uint32')

tf.Images(tf.shade(aggc.sel(cat='d3'), name="Category d3"),
          tf.shade(agg_d3_d5,          name="Categories d3 and d5"))          
Category d3

Categories d3 and d5

by also works with numerical (non-categorical) axes when given a “categorizer” object for the column, allowing binning over three numerical dimensions:

categorizer = ds.category_binning('val', lower=10, upper=50, nbins=4, include_under=False, include_over=False)

agg3D = canvas.points(df, 'x', 'y', ds.by(categorizer, ds.count()))
agg3D
<xarray.DataArray (y: 300, x: 300, val: 5)> Size: 2MB
array([[[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        ...,
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        ...,
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        ...,
...
        ...,
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        ...,
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        ...,
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]], shape=(300, 300, 5), dtype=uint32)
Coordinates:
  * y        (y) float64 2kB -7.973 -7.92 -7.867 -7.813 ... 7.867 7.92 7.973
  * x        (x) float64 2kB -7.973 -7.92 -7.867 -7.813 ... 7.867 7.92 7.973
  * val      (val) int64 40B 0 1 2 3 4
Attributes:
    x_range:  (-8, 8)
    y_range:  (-8, 8)

Here agg3D does a 3D histogram over the x × y × val space, counting how many data points fall in each bin of this three-dimensional cube. We requested that the val range (10,50) be divided into 4 bins, which results in 5 aggregate arrays by val:

  • 0: val values in the range [10,20) (including values below 10, if include_under==True (the default)

  • 1: val values in the range [20,30)

  • 2: val values in the range [30,40)

  • 3: val values in the range [40,50) (including values above 50, if include_over==True (the default)

  • 4: val values of NaN or other values not falling into any of the above bins (depending on include_under and include_over)

category_modulo works similarly, but wrapping the given range cyclicly, for cyclic quantities like the day of the week:

modulo_categorizer = ds.category_modulo('val', modulo=7, offset=0)

agg3D_modulo = canvas.points(df, 'x', 'y', ds.by(modulo_categorizer, ds.mean('x')))
agg3D_modulo
<xarray.DataArray (y: 300, x: 300, val: 7)> Size: 5MB
array([[[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
...
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]]], shape=(300, 300, 7))
Coordinates:
  * y        (y) float64 2kB -7.973 -7.92 -7.867 -7.813 ... 7.867 7.92 7.973
  * x        (x) float64 2kB -7.973 -7.92 -7.867 -7.813 ... 7.867 7.92 7.973
  * val      (val) int64 56B 0 1 2 3 4 5 6
Attributes:
    x_range:  (-8, 8)
    y_range:  (-8, 8)

Here the resulting aggregate arrays bin by (val-0)%7:

  • 0: val values 0, 7, 14, etc.

  • 1: val values 1, 8, 15, etc.

  • 2: val values 2, 9, 16, etc.

  • 3: val values 3, 10, 17, etc.

  • 4: val values 4, 11, 18, etc.

  • 5: val values 5, 12, 19, etc.

  • 6: val values 6, 13, 20, etc.

Other custom categorizers can also be defined as new Python classes; see category_codes in reductions.py.

As for aggc, the agg3D and agg3D_modulo arrays can then be selected or collapsed to get a 2D aggregate array for display, or each val can be mapped to a different color for display in the colormapping stage.

Transformation#

Now that the data has been projected and aggregated into a 2D or 3D gridded data structure, it can be processed in any way you like, before converting it to an image as will be described in the following section. At this stage, the data is still stored as bin data, not pixels, which makes a very wide variety of operations and transformations simple to express.

For instance, instead of plotting all the data, we can easily plot only those bins in the 99th percentile by count (left), or apply any NumPy ufunc to the bin values (whether or not it makes any sense!):

agg  = canvas.points(df, 'x', 'y')

tf.Images(tf.shade(agg.where(agg>=np.percentile(agg,99)), name="99th Percentile"),
          tf.shade(np.power(agg,2),                       name="Numpy square ufunc"),
          tf.shade(np.sin(agg),                           name="Numpy sin ufunc"))
99th Percentile

Numpy square ufunc

Numpy sin ufunc