Encodes each time series as a feature vector and clusters the series based
on similarity. The feature representation is controlled by method and
normalise, and should match the value used in make_plot_bouquet() so
that cluster assignments align visually with the rendered paths. The
resulting cluster column can be passed directly to make_plot_bouquet()
via stem_colors, flower_colors, facet_by, or cluster_hull.
The returned object is both the original data tibble (with an added
cluster column) and a cluster_bouquet S3 object. Call
summary.cluster_bouquet() on it to print a structured report of cluster
quality, sizes, and member series.
Usage
cluster_bouquet(
data,
time_col = 1,
series_col = 2,
value_col = 3,
k = "auto",
k_max = 8L,
method = c("coords_hclust", "coords_kmeans", "coords_pam", "heading_hclust",
"heading_kmeans", "heading_pam", "area_hclust", "area_pam"),
resolution = 0.5,
cluster_col = "cluster",
normalise = FALSE,
ceiling_pct = 0.8,
launch_deg = 90,
seed = NULL,
verbose = FALSE
)Arguments
- data
A long-format data frame or tibble – the same format expected by
make_plot_bouquet().- time_col
<
tidy-select> The time index column. Defaults to the first column.- series_col
<
tidy-select> The column identifying each individual time series. Defaults to the second column.- value_col
<
tidy-select> The numeric value column. Defaults to the third column.- k
Either a positive integer specifying the number of clusters, or
"auto"(default) to select the optimal number by maximising the composite silhouette score acrossk = 2, ..., k_max.- k_max
A single positive integer. Upper bound considered when
k = "auto". Capped atn_series - 1. Defaults to8L.- method
The clustering algorithm and feature representation. One of:
"coords_hclust"(Default) Clusters on the actual (x, y) path coordinates. Paths that look similar in the plot cluster together. Uses Ward's D2 hierarchical clustering. Deterministic.
"coords_kmeans"As
"coords_hclust"but k-means. Setseedfor reproducibility."coords_pam"As
"coords_hclust"but Partitioning Around Medoids. More robust to outliers. Requires cluster."heading_hclust"Clusters on the cumulative heading angle sequence. Captures turning patterns independently of absolute position; rotation-invariant. Uses Ward's D2.
"heading_kmeans"As
"heading_hclust"but k-means."heading_pam"As
"heading_hclust"but PAM. Requires cluster."area_hclust"Pairwise area distance: the area of the polygon enclosed by two paths from the shared origin, computed via the cross-product sum \(\frac{1}{2}|\sum_i (x^{(1)}_i y^{(2)}_i - x^{(2)}_i y^{(1)}_i)|\). Paths spatially close and similarly shaped get small distances. Uses Ward's D2.
"area_pam"As
"area_hclust"but PAM. Requires cluster.
- resolution
A non-negative number. Composite k-selection score: \(\bar{s}(k) \times \min_c \bar{s}_c(k) \times k^{\text{resolution}}\). Higher values favour finer clusters.
0= plain mean silhouette. Default0.5.- cluster_col
String. Name of the cluster column added to
data. Default"cluster".- normalise
Logical. Controls path building for clustering. Should match the value passed to
make_plot_bouquet()so that cluster assignments align visually with the rendered paths.FALSE(default) uses a single global \(\theta\);TRUEnormalises per series. Seemake_plot_bouquet()for full details.- ceiling_pct
A single number in
(0, 1]. Passed to path building for path-geometry methods. Should match the value used inmake_plot_bouquet(). Default0.80.- launch_deg
Initial heading in degrees for path building. Only used by path-geometry methods. Should match
make_plot_bouquet(). Default90.- seed
Integer or
NULL. Random seed passed tobase::set.seed()before k-means initialisation. Ensures reproducible results for"coords_kmeans"and"heading_kmeans". DefaultNULL= no seed (non-reproducible).- verbose
Logical. Print k selection table, chosen k, silhouette score, and cluster sizes. Default
FALSE.
Value
The original data tibble with one additional factor column (named
by cluster_col, levels "C1", "C2", ... ordered by decreasing size)
plus a cluster_bouquet S3 class. The object also carries a bq_meta
attribute containing all clustering diagnostics; use
summary.cluster_bouquet() to print them.
Details
Feature representations
Path coordinates ("coords_*"): each series is encoded as
\(2(n-1)\) values \((x_1, \ldots, x_{n-1}, y_1, \ldots, y_{n-1})\).
Directly reflects the visual plot – paths close together cluster together.
Most intuitive for visual interpretation. This is the default.
Heading sequence ("heading_*"): cumulative heading angle in
degrees at each step. Captures the turning pattern independently of
absolute position; rotation-invariant.
Area distance ("area_*"): pairwise distance is the area of the
polygon enclosed by two paths (shoelace formula). Captures both shape
difference and spatial separation in a single scalar with a clean
geometric interpretation.
Composite k-selection
Plain mean silhouette consistently selects k = 2. The composite score adds a worst-cluster penalty and a resolution exponent to favour finer structure.
Pairing with make_plot_bouquet()
Pass the same normalise, ceiling_pct, and launch_deg to both
functions. When a cluster_bouquet object is piped into
make_plot_bouquet(), normalise is inherited automatically.
See also
make_plot_bouquet() for visualising the result.
summary.cluster_bouquet() for a structured quality report.
plot_cluster_quality() to plot silhouette scores across k.
Examples
set.seed(42)
n <- 52L
weeks <- seq(as.Date("2023-01-01"), by = "week", length.out = n)
season <- sin(seq(0, 2 * pi, length.out = n))
gw_long <- tibble::tibble(
week = rep(weeks, 6L),
station = rep(paste0("S", 1:6), each = n),
level_m = c(
8.5 + 0.8 * season + cumsum(rnorm(n, 0.00, 0.2)),
8.3 + 0.7 * season + cumsum(rnorm(n, 0.01, 0.2)),
7.2 + 0.5 * season + cumsum(rnorm(n, 0.02, 0.2)),
7.0 + 0.6 * season + cumsum(rnorm(n, 0.00, 0.2)),
9.1 + 1.1 * season + cumsum(rnorm(n, -0.01, 0.2)),
9.3 + 1.0 * season + cumsum(rnorm(n, -0.02, 0.2))
)
)
# Default: path-coordinate clustering (coords_hclust), auto k
clustered <- cluster_bouquet(gw_long,
time_col = week, series_col = station, value_col = level_m)
summary(clustered)
#> -- cluster_bouquet summary ----------------------------------------------
#> Method : coords_hclust
#> Normalise : FALSE
#> Seed : none
#> Series : 6 k = 3 Resolution = 0.50
#> Mean silhouette : 0.364 (reasonable structure)
#>
#> Auto k selection (composite silhouette score):
#> k = 2 : 0.2077
#> k = 3 : 0.3641 <-- selected
#> k = 4 : 0.2998
#> k = 5 : 0.1471
#>
#> Cluster sizes and members:
#> C1 (3 series, mean sil = 0.399)
#> S1, S4, S5
#> C2 (2 series, mean sil = 0.493)
#> S2, S6
#> C3 (1 series, mean sil = 0.000)
#> S3
#>
#> Per-series silhouette widths:
#> S5 C1 0.553 |||||||||||
#> S1 C1 0.375 ||||||||
#> S4 C1 0.269 |||||
#> S2 C2 0.590 ||||||||||||
#> S6 C2 0.396 ||||||||
#> S3 C3 0.000
#> ------------------------------------------------------------------------
# Path-coordinate clustering -- most visually intuitive
# \donttest{
cluster_bouquet(gw_long,
time_col = week, series_col = station, value_col = level_m,
method = "coords_hclust")
#> # A tibble: 312 × 4
#> week station level_m cluster
#> <date> <chr> <dbl> <fct>
#> 1 2023-01-01 S1 8.77 C1
#> 2 2023-01-08 S1 8.76 C1
#> 3 2023-01-15 S1 8.93 C1
#> 4 2023-01-22 S1 9.15 C1
#> 5 2023-01-29 S1 9.32 C1
#> 6 2023-02-05 S1 9.38 C1
#> 7 2023-02-12 S1 9.76 C1
#> 8 2023-02-19 S1 9.81 C1
#> 9 2023-02-26 S1 10.3 C1
#> 10 2023-03-05 S1 10.3 C1
#> # ℹ 302 more rows
# Area-based clustering
cluster_bouquet(gw_long,
time_col = week, series_col = station, value_col = level_m,
method = "area_hclust")
#> # A tibble: 312 × 4
#> week station level_m cluster
#> <date> <chr> <dbl> <fct>
#> 1 2023-01-01 S1 8.77 C1
#> 2 2023-01-08 S1 8.76 C1
#> 3 2023-01-15 S1 8.93 C1
#> 4 2023-01-22 S1 9.15 C1
#> 5 2023-01-29 S1 9.32 C1
#> 6 2023-02-05 S1 9.38 C1
#> 7 2023-02-12 S1 9.76 C1
#> 8 2023-02-19 S1 9.81 C1
#> 9 2023-02-26 S1 10.3 C1
#> 10 2023-03-05 S1 10.3 C1
#> # ℹ 302 more rows
# }