# Continuous Aggregates¶

One of the fundamental goals of PipelineDB is to **facilitate high-performance continuous aggregation**, so not suprisingly aggregates are a central component of PipelineDB’s utility. Continuous aggregates can be very powerful–in the most general sense they make it possible to keep the amount of data persisted in PipelineDB constant relative to the amount of data that has been pushed through it. This can enable sustainable and very high data throughput on modest hardware.

Continuous aggregates are **incrementally updated** in real time as new events are read by the the continuous view that they’re a part of. For simple aggregates such as count and sum, it is easy to see how their results can be incrementally updated–just add the new value to the existing result.

But for more complicated aggregates, such as avg, stddev, percentile_cont, etc., more advanced infrastructure is required to support efficient incremental updates, and PipelineDB handles all of that complexity for you transparently.

Below you’ll find a description of all the aggregates that PipelineDB supports. A few of them behave slightly differently than their standard counterparts in order to efficiently operate on infinite streams of data. Such aggregates have been annotated with an explanation of how exactly their behavior differs.

Note

It may be helpful for you to consult the excellent PostgreSQL aggregates documentation.

## Bloom Filter Aggregates¶

**bloom_agg ( expression )**

Adds all input values to a Bloom Filter

**bloom_agg ( expression , p , n )**

Adds all input values to a Bloom filter and sizes it according to the given parameters.

pis the desired false-positive rate, andnis the expected number of unique elements to add.

**bloom_union_agg ( bloom filter )**

Takes the union of all input Bloom filters, resulting in a single Bloom filter containing all of the input Bloom filters’ information.

**bloom_intersection_agg ( bloom filter )**

Takes the intersection of all input Bloom filters, resulting in a single Bloom filter containing only the information shared by all of the input Bloom filters.

See Bloom Filter Functions for functionality that can be used to manipulate Bloom filters.

## Frequency Tracking Aggregates¶

**freq_agg ( expression )**

Adds all input values to an internal Count-Min Sketch, enabling efficient online computation of the frequency of each input expression.

**freq_agg ( expression, epsilon, p )**

Same as above, but accepts

epsilonandpas parameters for the underlyingcmsketch.epsilondetermines the acceptable error rate of thecmsketch, and defaults to0.002(0.2%).pdetermines the confidence, and defaults to0.995(99.5%). Lowerepsilonandpwill result in smallercmsketchstructures, and vice versa.

**freq_merge_agg ( count-min sketch )**

Merges all input Count-min sketches into a single one containing all of the information of the input Count-min sketches.

See Frequency Functions for functionality that can be used to manipulate Count-Min sketches.

## Top-K Aggregates¶

**topk_agg ( expression , k )**

Tracks the top k input expressions by adding all input values to a Top-K data structure sized for the given

k, incrementing each value’s count by1each time it is added.

**topk_agg (expression, k, weight )**

Same as above, but associates the given weight to the input expression (rather than a default weight of 1).

**topk_merge_agg ( topk )**

Merges all

topkinputs into a singletopkdata structure.

See Top-K Functions for functionality that can be used to manipulate **topk** objects.

## HyperLogLog Aggregates¶

**hll_agg ( expression )**

Adds all input values to a HyperLogLog.

**hll_agg ( expression, p )**

Adds all input values to a HyperLogLog with the given

p. A largerpreduces the HyperLogLog’s error rate, at the expense of a larger size.

**hll_union_agg ( hyperloglog )**

Takes the union of all input HyperLogLogs, resulting in a single HyperLogLog that contains all of the information of the input HyperLogLogs.

See HyperLogLog Functions for functionality that can be used to manipulate HyperLogLog objects.

## Distribution Aggregates¶

**dist_agg ( expression )**

Adds all input values to a T-Digest in order to track the distribution of all input expressions.

**dist_agg ( expression, compression )**

Same as above, but builds the underyling

tdigestusing the givencompression.compressionmust be an integer in the range`[20, 1000]`

. A higher value forcompressionwill yield a largertdigestwith but with more precision than a smallertdigestwith a lowercompressionvalue.

See Distribution Functions for functionality that can be used to manipulate **tdigest** objects.

## Miscellaneous Aggregates¶

**bucket_agg ( expression , bucket_id )**

Adds 4-byte hashes of each input expression to the bucket with the given

bucket_id. Each hash may only be present precisely once in one bucket at any given time. Buckets can therefore be thought of as exclusive sets of hashes of the input expressions.

**bucket_agg ( expression , bucket_id , timestamp )**

Same as above, but allows a

timestampexpression to determine bucket entry order. That is, only a value’slatestentry will cause it to change buckets.

See Miscellaneous Functions for functionality that can be used to manipulate **bucket_agg** objects.

**exact_count_distinct ( expression )**

Counts the exact number of distinct values for the given expression. Since

count distinctused in continuous views implicitly uses HyperLogLog for efficiency,exact_count_distinctcan be used when the small margin of error inherent to using HyperLogLog is not acceptable.

Important

**exact_count_distinct** must store all unique values observed in order to determine uniqueness, so it is not recommended for use when many unique values are expected.

**first_values ( n ) WITHIN GROUP (ORDER BY sort_expression)**

An ordered-set aggregate that stores the first

nvalues ordered by the provided sort expression.

Note

See also: PipelineDB-specific Functions, which explains some of the PipelineDB’s non-aggregate functionality for manipulating Bloom filters, Count-min sketches, HyperLogLogs and T-Digests. Also, check out Probabilistic Data Structures & Algorithms for more information about what they are and how you can leverage them.

**keyed_max ( key, value )**

Returns the

valueassociated with the “highest”key.

**keyed_min ( key, value )**

Returns the

valueassociated with the “lowest”key.

**set_agg ( expression )**

Adds all input values to a set.

See Miscellaneous Functions for functionality that can be used to manipulate sets.

## Combine¶

Since PipelineDB can incrementally update aggregate values, it has the capability to combine existing aggregates using more information than simply their current raw values. For example, combining multiple averages isn’t simply a matter of taking the average of the averages. Their weights must be taken into account.

For this type of operation, PipelineDB exposes the special **combine** aggregate. Its description is as follows:

**combine ( aggregate column )**

Given an aggregate column, combines all values into a single value as if all of the individual aggregates’ inputs were aggregated a single time.

Note

**combine** only works on aggregate columns that belong to continuous views.

Let’s look at an example:

```
postgres=# CREATE VIEW v AS SELECT g::integer, AVG(x::integer) FROM stream GROUP BY g;
CREATE VIEW
postgres=# INSERT INTO stream (g, x) VALUES (0, 10), (0, 10), (0, 10), (0, 10), (0, 10);
INSERT 0 5
postgres=# INSERT INTO stream (g, x) VALUES (1, 20);
INSERT 0 1
postgres=# SELECT * FROM v;
g | avg
---+---------------------
0 | 10.0000000000000000
1 | 20.0000000000000000
(2 rows)
postgres=# SELECT avg(avg) FROM v;
avg
---------------------
15.0000000000000000
(1 row)
postgres=# -- But that didn't take into account that the value of 10 weighs much more,
postgres=# -- because it was inserted 5 times, whereas 20 was only inserted once.
postgres=# -- combine() will take this weight into account
postgres=#
postgres=# SELECT combine(avg) FROM v;
combine
---------------------
11.6666666666666667
(1 row)
postgres=# -- There we go! This is the same average we would have gotten if we ran
postgres=# -- a single average on all 6 of the above inserted values, yet we only
postgres=# -- needed two rows to do it.
```

## General Aggregates¶

**array_agg ( expression )**

Input values, including nulls, concatenated into an array

**avg ( expression )**

The average of all input values

**bit_and ( expression )**

The bitwise AND of all non-null input values, or null if none

**bit_or ( expression )**

The bitwise OR of all non-null input values, or null if none

**bool_and ( expression )**

True if all input values are true, otherwise false

**bool_or ( expression )**

True if at least one input value is true, otherwise false

**count ( * )**

Number of input rows

**count ( DISTINCT expression)**

Number of rows for which

expressionis distinct.Note

Counting the distinct number of expressions on an infinite stream would require infinite memory, so continuous views use HyperLogLog to accomplish distinct counting in constant space and time, at the expense of a small margin of error. Empirically, PipelineDB’s implementation of HyperLogLog has an error rate of ~0.81%. For example,

count distinctmight show`1008`

when the actual number of unique expressions was`1000`

.

**count ( expression )**

Number of rows for which

expressionis non-null.

**every ( expression )**

Equivalent to

bool_and

**json_agg ( expression )**

Aggregates values as a JSON array

**json_object_agg ( key, value )**

Aggregates

key-valuepairs as a JSON object

**jsonb_agg ( expression )**

Aggregates values as a JSONB array

**jsonb_object_agg ( key, value )**

Aggregates

key-valuepairs as a JSONB object

**max ( expression )**

Maximum value of expression across all input values

**min ( expression )**

Minimum value of expression across all input values

**string_agg ( expression, delimiter )**

Input values concatenated into a string, separated by

delimiter

**sum ( expression )**

Sum of

expressionacross all input values

## Statistical Aggregates¶

**corr ( y, x )**

Correlation coefficient

**covar_pop ( y, x )**

Population covariance

**covar_samp ( y, x )**

Sample covariance

**regr_avgx ( y, x )**

Average of the independent variable

`(sum(x)/N)`

**regr_avgy ( y, x )**

Average of the independent variable

`(sum(y)/N)`

**regr_count ( y, x )**

Number of input rows in which both expressions are non-null

**regr_intercept ( y, x )**

y-intercept of the least-squares-fit linear equation determined by the (x, y) pairs

**regr_r2 ( y, x )**

Square of the correlation coefficient

**regr_slope ( y, x )**

Slope of the least-squares-fit linear equation determined by the (x, y) pairs

**regr_sxx ( y, x )**

`sum(X^2) - sum(X)^2/N`

– sum of squares of the independent variable

**regr_sxy ( y, x )**

`sum(X*Y) - sum(X) * sum(Y)/N`

– sum of products of independent times dependent variable

**regr_syy ( y, x )**

`sum(Y^2) - sum(Y)^2/N`

– sum of squares of the independent variable

**stddev ( expression )**

Sample standard deviation of the input values

**stddev_pop ( expression )**

Population standard deviation of the input values

**variance ( expression )**

Sample variance of the input values (square of the sample standard deviation)

**var_pop ( expression )**

Population variance of the input values (square of the population standard deviation)

## Ordered-set Aggregates¶

**ordered-set** aggregates apply ordering to their input in order to obtain their results, so they use the `WITHIN GROUP`

clause. Its syntax is as follows:

```
aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause )
```

Let’s look at a couple examples.

Compute the 99th percentile of **value**:

```
SELECT percentile_cont(0.99) WITHIN GROUP (ORDER BY value) FROM some_table;
```

Or with a continuous view:

```
CREATE VIEW percentile AS
SELECT percentile_cont(0.99) WITHIN GROUP (ORDER BY value::float8)
FROM some_stream;
```

**percentile_cont ( fraction )**

Continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed

**percentile_cont ( array of fractions )**

Multiple continuous percentile: returns an array of results matching the shape of the fractions parameter, with each non-null element replaced by the value corresponding to that percentile

Note

Computing percentiles on infinite streams would require infinite memory, so both forms of

percentile_cont, when used by continuous views, use T-Digest as a way to estimate percentiles with a very high degree of accuracy. In general, percentiles in continuous views are more accurate the closer they are to the upper or lower bounds of`[0, 1)`

.

## Hypothetical-set Aggregates¶

**hypothetical-set** aggregates take an expression and compute something about it within the context of a set of input rows. For example, **rank(2)** computes the `rank`

of `2`

within the context of whatever the input rows end up being.

The hypothetical-set aggregates use the `WITHIN GROUP`

clause to define the input rows. Its syntax is as follows:

```
aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause )
```

Here is an example of of a hypothetical-set aggregate being used by a continuous view:

```
CREATE VIEW continuous_rank AS
SELECT rank(42) WITHIN GROUP (ORDER BY value::float8)
FROM some_stream;
```

This continuous view will continuously update the rank of `42`

given all of the events it has read.

**rank ( arguments )**

Rank of the hypothetical row, with gaps for duplicate rows

**dense_rank ( arguments )**

Rank of the hypothetical row, without gaps

Note

Computing the hypothetical

dense_rankof a value given an infinite stream of values would require infinite memory, so continuous views use HyperLogLog to do it in constant time and space, at the expense of a small margin of error. Empirically, PipelineDB’s implementation of HyperLogLog has an error rate of ~0.2%. In other words,dense_rank (1000)in a continuous view might show 998 when the actual number of unique lower-ranking values seen was`1000`

.

**percent_rank ( arguments )**

Relative rank of the hypothetical row, ranging from 0 to 1

**cume_dist ( arguments )**

Relative rank of the hypothetical row, ranging from 1/N to 1

## Unsupported Aggregates¶

**mode ( )**

Future releases of PipelineDB will include an implementation of an online mode estimation algorithm, but for now it’s not supported

**percentile_disc ( arguments )**

Given an input percentile (such as 0.99),

percentile_discreturns the very first value in the input set that is within that percentile. This requires actually sorting the input set, which is obviously impractical on an infinite stream, and doesn’t even allow for a highly accurate estimation algorithm such as the one we use forpercentile_cont.

**xmlagg ( xml )**

:(

**<aggregate_name> (DISTINCT expression)**

Only the

`count`

aggregate function is supported with a`DISTINCT`

expression as noted above in the General Aggregates section. In future releases, we might leverage Bloom Filter to allow`DISTINCT`

expressions for all aggregate functions.