Quickstart

First, complete the PipelineDB Installation process.

Wikipedia Traffic

In this example we’ll compute some basic statistics on a day’s worth of Wikipedia page view data. Each record in the dataset contains one hour of page view statistics for a single Wikipedia page. The record format is as follows:

hour | project | page title | view count | bytes served
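
For illustration, a single record might look like this (the values are made up):

2015-06-01 01:00:00 | en | Main_Page | 42 | 9001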

First, let’s create a stream for the incoming records and a continuous view over it, using psql:

psql -c "
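-- Streams are declared as foreign tables served by pipelinedb; events
-- written to them are not stored, only consumed by continuous views.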
CREATE FOREIGN TABLE wiki_stream (
        hour timestamp,
        project text,
        title text,
        view_count bigint,
        size bigint)
SERVER pipelinedb;
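-- action=materialize makes this a continuous view: its aggregates are
-- updated incrementally as events arrive on wiki_stream.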
CREATE VIEW wiki_stats WITH (action=materialize) AS
SELECT hour, project,
        count(*) AS total_pages,
        sum(view_count) AS total_views,
        min(view_count) AS min_views,
        max(view_count) AS max_views,
        avg(view_count) AS avg_views,
        percentile_cont(0.99) WITHIN GROUP (ORDER BY view_count) AS p99_views,
        sum(size) AS total_bytes_served
FROM wiki_stream
GROUP BY hour, project;"
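
At this point the continuous view exists but is empty, since nothing has been written to the stream yet. If you’d like a quick sanity check before loading any data, selecting from it should return zero rows:

psql -c "
SELECT * FROM wiki_stats;"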

Now we’ll decompress the dataset and stream it to stdin, which COPY can read as its input:

curl -sL http://pipelinedb.com/data/wiki-pagecounts | gunzip | \
        psql -c "
        COPY wiki_stream (hour, project, title, view_count, size) FROM STDIN"
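
COPY is just one way to write events; streams also accept ordinary INSERT statements. For example, a single event with made-up values could be written like this:

psql -c "
INSERT INTO wiki_stream (hour, project, title, view_count, size)
        VALUES ('2015-06-01 01:00:00', 'en', 'Main_Page', 42, 9001);"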

Note that this dataset is large, so the above command will run for quite a while (cancel it whenever you’d like). While it runs, query the continuous view to watch its results update as data is ingested from the stream:

psql -c "
SELECT * FROM wiki_stats ORDER BY total_views DESC;"
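
Continuous views can be queried like any other view, so the usual filtering and limiting applies. For instance, to see only the ten busiest hours for a single project (the project name here is just an example):

psql -c "
SELECT hour, total_views FROM wiki_stats
WHERE project = 'en'
ORDER BY total_views DESC LIMIT 10;"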