Commit a54dcdb

release: v2.5.0
2 parents: b5a4b0f + cf52c29 · commit a54dcdb

108 files changed (+4113, -195 lines)


.circleci/config.yml

Lines changed: 2 additions & 2 deletions
@@ -46,7 +46,7 @@ jobs:
       name: python/default
     steps:
       - coveralls/upload:
-          carryforward: 3.11, 3.12
+          carryforward: 3.11, 3.12, 3.13
           parallel_finished: true

 workflows:
@@ -56,7 +56,7 @@ workflows:
       - tests:
          matrix:
            parameters:
-             version: ["3.11", "3.12"]
+             version: ["3.11", "3.12", "3.13"]
      - coverage:
          requires:
            - tests

.coveragerc

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+[run]
+omit = "sequentia/model_selection/_validation.py"

.gitattributes

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+*.ipynb linguist-documentation

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -94,3 +94,6 @@ venv.bak/

 # Changelog entry
 ENTRY.md
+
+# Jupyter Notebook checkpoints
+*.ipynb_checkpoints/

CHANGELOG.md

Lines changed: 15 additions & 0 deletions
@@ -388,6 +388,21 @@ Nothing, initial release!

 </details>

+## [v2.5.0](https://github.com/eonu/sequentia/releases/tag/v2.5.0) - 2024-12-27
+
+### Documentation
+
+- update copyright notice ([#255](https://github.com/eonu/sequentia/issues/255))
+
+### Features
+
+- add `mise.toml` and support `numpy>=2` ([#254](https://github.com/eonu/sequentia/issues/254))
+- add python v3.13 support ([#253](https://github.com/eonu/sequentia/issues/253))
+- add library benchmarks ([#256](https://github.com/eonu/sequentia/issues/256))
+- add `model_selection` sub-package for hyper-parameters ([#257](https://github.com/eonu/sequentia/issues/257))
+- add model spec support to `HMMClassifier.__init__` ([#258](https://github.com/eonu/sequentia/issues/258))
+- add `HMMClassifier.fit` multiprocessing ([#259](https://github.com/eonu/sequentia/issues/259))
+
 ## [v2.0.2](https://github.com/eonu/sequentia/releases/tag/v2.0.2) - 2024-04-13

 ### Bug Fixes

CODE_OF_CONDUCT.md

Lines changed: 1 addition & 1 deletion
@@ -50,6 +50,6 @@ We are thankful for their work and all the communities who have paved the way wi
 ---

 <p align="center">
-  <b>Sequentia</b> &copy; 2019-2025, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
+  <b>Sequentia</b> &copy; 2019, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
   <em>Authored and maintained by Edwin Onuonga.</em>
 </p>

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -105,6 +105,6 @@ By contributing, you agree that your contributions will be licensed under the re
 ---

 <p align="center">
-  <b>Sequentia</b> &copy; 2019-2025, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
+  <b>Sequentia</b> &copy; 2019, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
   <em>Authored and maintained by Edwin Onuonga.</em>
 </p>

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2019-2025 Edwin Onuonga (eonu) <ed@eonu.net>
+Copyright (c) 2019 Edwin Onuonga (eonu) <ed@eonu.net>

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

README.md

Lines changed: 171 additions & 33 deletions
@@ -34,6 +34,7 @@
   <a href="#about">About</a> ·
   <a href="#build-status">Build Status</a> ·
   <a href="#features">Features</a> ·
+  <a href="#installation">Installation</a> ·
   <a href="#documentation">Documentation</a> ·
   <a href="#examples">Examples</a> ·
   <a href="#acknowledgments">Acknowledgments</a> ·
@@ -68,12 +69,15 @@ Some examples of how Sequentia can be used on sequence data include:

 ### Models

-The following models provided by Sequentia all support variable length sequences.
-
 #### [Dynamic Time Warping + k-Nearest Neighbors](https://sequentia.readthedocs.io/en/latest/sections/models/knn/index.html) (via [`dtaidistance`](https://github.com/wannesm/dtaidistance))

+Dynamic Time Warping (DTW) is a distance measure that can be applied to two sequences of different length.
+When used as a distance measure for the k-Nearest Neighbors (kNN) algorithm, this results in a simple yet
+effective inference algorithm.
+
 - [x] Classification
 - [x] Regression
+- [x] Variable length sequences
 - [x] Multivariate real-valued observations
 - [x] Sakoe–Chiba band global warping constraint
 - [x] Dependent and independent feature warping (DTWD/DTWI)
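
As an illustrative aside to the DTW description above (not part of this commit's diff), the pairwise distance that the k-NN models build on can be computed directly with `dtaidistance`. A minimal sketch, assuming its documented `dtw.distance` API and `window` parameter:

```python
import numpy as np
from dtaidistance import dtw

# Two univariate sequences of different length.
s1 = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0])
s2 = np.array([0.0, 2.0, 3.0, 1.0])

# Unconstrained DTW distance between the two sequences.
d = dtw.distance(s1, s2)

# DTW distance with a Sakoe-Chiba band (window) constraint,
# which limits how far the alignment may stray from the diagonal.
d_banded = dtw.distance(s1, s2, window=2)

print(d, d_banded)
```
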
@@ -82,19 +86,82 @@ The following models provided by Sequentia all support variable length sequences

 #### [Hidden Markov Models](https://sequentia.readthedocs.io/en/latest/sections/models/hmm/index.html) (via [`hmmlearn`](https://github.com/hmmlearn/hmmlearn))

-Parameter estimation with the Baum-Welch algorithm and prediction with the forward algorithm [[1]](#references)
+A Hidden Markov Model (HMM) is a state-based statistical model which represents a sequence as
+a series of observations that are emitted from a collection of latent hidden states which form
+an underlying Markov chain. Each hidden state has an emission distribution that models its observations.
+
+Expectation-maximization via the Baum-Welch algorithm (or forward-backward algorithm) [[1]](#references) is used to
+derive a maximum likelihood estimate of the Markov chain probabilities and emission distribution parameters
+based on the provided training sequence data.

 - [x] Classification
-- [x] Multivariate real-valued observations (Gaussian mixture model emissions)
-- [x] Univariate categorical observations (discrete emissions)
+- [x] Variable length sequences
+- [x] Multivariate real-valued observations (modeled with Gaussian mixture emissions)
+- [x] Univariate categorical observations (modeled with discrete emissions)
 - [x] Linear, left-right and ergodic topologies
 - [x] Multi-processed predictions
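
As an illustrative aside to the HMM description above (not part of this commit's diff), a minimal classification sketch follows. The `GaussianMixtureHMM`/`HMMClassifier` names are real, but the `add_model()` call and the `n_states` parameter are assumptions based on the Sequentia documentation; the model-spec and multiprocessing changes listed in this release's changelog are not shown.

```python
import numpy as np
from sequentia.models import GaussianMixtureHMM, HMMClassifier

# Concatenated training sequences (2 features), their lengths, and class labels.
X = np.random.default_rng(0).normal(size=(31, 2))
lengths = np.array([10, 12, 9])
y = np.array([0, 1, 1])

# One HMM per class; the classifier scores a sequence under each
# class HMM and predicts the class with the highest likelihood.
clf = HMMClassifier()
for label in (0, 1):
    clf.add_model(GaussianMixtureHMM(n_states=2), label=label)

clf.fit(X, y, lengths=lengths)
y_pred = clf.predict(X, lengths=lengths)
```
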

 ### Scikit-Learn compatibility

-**Sequentia (≥2.0) is fully compatible with the Scikit-Learn API (≥1.4), enabling for rapid development and prototyping of sequential models.**
+**Sequentia (≥2.0) is compatible with the Scikit-Learn API (≥1.4), enabling rapid development and prototyping of sequential models.**
+
+The integration relies on the use of [metadata routing](https://scikit-learn.org/stable/metadata_routing.html),
+which means that in most cases, the only necessary change is to add a `lengths` keyword argument to provide
+sequence length information, e.g. `fit(X, y, lengths=lengths)` instead of `fit(X, y)`.
+
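
As an illustrative aside (not part of this commit's diff), the pattern boils down to the following minimal sketch, reusing the `KNNClassifier` and the concatenated-array convention shown in the examples further down this README:

```python
import numpy as np
from sequentia.models import KNNClassifier

# Three concatenated sequences (two features) with their lengths and labels.
X = np.random.default_rng(0).normal(size=(10, 2))
lengths = np.array([3, 5, 2])
y = np.array([0, 1, 1])

clf = KNNClassifier(k=1)

# Metadata routing: `lengths` is simply passed as an extra keyword argument
# to the usual Scikit-Learn estimator methods.
clf.fit(X, y, lengths=lengths)
y_pred = clf.predict(X, lengths=lengths)
acc = clf.score(X, y, lengths=lengths)
```
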
+### Similar libraries
+
+As DTW k-nearest neighbors is the core algorithm offered by Sequentia, below is a comparison of the DTW k-nearest neighbors algorithm features supported by Sequentia and similar libraries.
+
+||**`sequentia`**|[`aeon`](https://github.com/aeon-toolkit/aeon)|[`tslearn`](https://github.com/tslearn-team/tslearn)|[`sktime`](https://github.com/sktime/sktime)|[`pyts`](https://github.com/johannfaouzi/pyts)|
+|-|:-:|:-:|:-:|:-:|:-:|
+|Scikit-Learn compatible||||||
+|Multivariate sequences||||||
+|Variable length sequences|||➖<sup>1</sup>|❌<sup>2</sup>|❌<sup>3</sup>|
+|No padding required|||➖<sup>1</sup>|❌<sup>2</sup>|❌<sup>3</sup>|
+|Classification||||||
+|Regression||||||
+|Preprocessing||||||
+|Multiprocessing||||||
+|Custom weighting||||||
+|Sakoe-Chiba band constraint||||||
+|Itakura parallelogram constraint||||||
+|Dependent DTW (DTWD)||||||
+|Independent DTW (DTWI)||||||
+|Custom DTW measures|❌<sup>4</sup>|||||
+
+- <sup>1</sup>`tslearn` supports variable length sequences with padding, but doesn't seem to mask the padding.
+- <sup>2</sup>`sktime` does not support variable length sequences, so they are padded (and padding is not masked).
+- <sup>3</sup>`pyts` does not support variable length sequences, so they are padded (and padding is not masked).
+- <sup>4</sup>`sequentia` only supports [`dtaidistance`](https://github.com/wannesm/dtaidistance), which is one of the fastest DTW libraries as it is written in C.
+
+### Benchmarks
+
+To compare the above libraries in runtime performance on dynamic time warping k-nearest neighbors classification tasks, a simple benchmark was performed on a univariate sequence dataset.
+
+The [Free Spoken Digit Dataset](https://sequentia.readthedocs.io/en/latest/sections/datasets/digits.html) was used for benchmarking and consists of:
+
+- 3000 recordings of 10 spoken digits (0-9)
+- 50 recordings of each digit for each of 6 speakers
+- 1500 used for training, 1500 used for testing (split via label stratification)
+- 13 features ([MFCCs](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum))
+- Only the first feature was used as not all of the above libraries support multivariate sequences
+- Sequence length statistics: (min 6, median 17, max 92)
+
+Each result measures the total time taken to complete training and prediction repeated 10 times.
+
+All of the above libraries support multiprocessing, and prediction was performed using 16 workers.

-In most cases, the only necessary change is to add a `lengths` key-word argument to provide sequence length information, e.g. `fit(X, y, lengths=lengths)` instead of `fit(X, y)`.
+<sup>*</sup>: `sktime`, `tslearn` and `pyts` seem to not mask padding, which may result in incorrect predictions.
+
+<img src="benchmarks/benchmark.svg" width="100%"/>
+
+> **Device information**:
+> - Product: ThinkPad T14s (Gen 6)
+> - Processor: AMD Ryzen™ AI 7 PRO 360 (8 cores, 16 threads, 2-5GHz)
+> - Memory: 64 GB LPDDR5X-7500MHz
+> - Solid State Drive: 1 TB SSD M.2 2280 PCIe Gen4 Performance TLC Opal
+> - Operating system: Fedora Linux 41 (Workstation Edition)
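
As an illustrative aside (not part of this commit's diff), the procedure described above can be approximated with a small timing script. This is only a sketch: the `load_digits` loader, the dataset's `split()` helper and `X`/`y`/`lengths` attributes, and the `use_c`/`n_jobs` parameters are assumptions based on the Sequentia documentation, and absolute timings will differ on other hardware.

```python
import timeit

from sequentia.datasets import load_digits  # assumed loader for the Free Spoken Digit Dataset
from sequentia.models import KNNClassifier

# Load the dataset and split it 50/50 with label stratification,
# mirroring the setup described above (signature assumed).
data = load_digits()
train, test = data.split(test_size=0.5, random_state=0, stratify=True)

# Univariate setup: keep only the first MFCC feature.
X_train, X_test = train.X[:, :1], test.X[:, :1]

# n_jobs is assumed to control the number of prediction workers.
clf = KNNClassifier(k=1, use_c=True, n_jobs=16)

def run() -> None:
    clf.fit(X_train, train.y, lengths=train.lengths)
    clf.predict(X_test, lengths=test.lengths)

# Total time for 10 repetitions of training + prediction.
print(timeit.timeit(run, number=10))
```
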

 ## Installation

@@ -104,13 +171,13 @@ The latest stable version of Sequentia can be installed with the following comma
 pip install sequentia
 ```

-### C library compilation
+### C libraries

-For optimal performance when using any of the k-NN based models, it is important that `dtaidistance` C libraries are compiled correctly.
+For optimal performance when using any of the k-NN based models, it is important that the correct `dtaidistance` C libraries are accessible.

 Please see the [`dtaidistance` installation guide](https://dtaidistance.readthedocs.io/en/latest/usage/installation.html) for troubleshooting if you run into C compilation issues, or if setting `use_c=True` on k-NN based models results in a warning.

-You can use the following to check if the appropriate C libraries have been installed.
+You can use the following to check if the appropriate C libraries are available.

 ```python
 from dtaidistance import dtw
@@ -127,26 +194,25 @@ Documentation for the package is available on [Read The Docs](https://sequentia.

 ## Examples

-Demonstration of classifying multivariate sequences with two features into two classes using the `KNNClassifier`.
+Demonstration of classifying multivariate sequences into two classes using the `KNNClassifier`.

-This example also shows a typical preprocessing workflow, as well as compatibility with Scikit-Learn.
+This example also shows a typical preprocessing workflow, as well as compatibility with
+Scikit-Learn for pipelining and hyper-parameter optimization.

-```python
-import numpy as np
+---

-from sklearn.preprocessing import scale
-from sklearn.decomposition import PCA
-from sklearn.pipeline import Pipeline
+First, we create some sample multivariate input data consisting of three sequences with two features.

-from sequentia.models import KNNClassifier
-from sequentia.preprocessing import IndependentFunctionTransformer, median_filter
+- Sequentia expects sequences to be concatenated and represented as a single NumPy array.
+- Sequence lengths are provided separately and used to decode the sequences when needed.

-# Create input data
-# - Sequentia expects sequences to be concatenated into a single array
-# - Sequence lengths are provided separately and used to decode the sequences when needed
-# - This avoids the need for complex structures such as lists of arrays with different lengths
+This avoids the need for complex structures such as lists of nested arrays with different lengths,
+or a 3D array with wasteful and annoying padding.
+
+```python
+import numpy as np

-# Sequences
+# Sequence data
 X = np.array([
     # Sequence 1 - Length 3
     [1.2 , 7.91],
@@ -168,27 +234,99 @@ lengths = np.array([3, 5, 2])

 # Sequence classes
 y = np.array([0, 1, 1])
+```
+
+With this data, we can train a `KNNClassifier` and use it for prediction and scoring.
+
+**Note**: Each of the `fit()`, `predict()` and `score()` methods requires the sequence lengths
+to be provided in addition to the sequence data `X` and labels `y`.
+
+```python
+from sequentia.models import KNNClassifier
+
+# Initialize and fit the classifier
+clf = KNNClassifier(k=1)
+clf.fit(X, y, lengths=lengths)
+
+# Make predictions based on the provided sequences
+y_pred = clf.predict(X, lengths=lengths)
+
+# Make predictions based on the provided sequences and calculate accuracy
+acc = clf.score(X, y, lengths=lengths)
+```
+
+Alternatively, we can use [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html) to build a more complex preprocessing pipeline:
+
+1. Individually denoise each sequence by applying a [median filter](https://sequentia.readthedocs.io/en/latest/sections/preprocessing/transforms/filters.html#sequentia.preprocessing.transforms.median_filter).
+2. Individually [standardize](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html) each sequence by subtracting the mean and dividing by the s.d. for each feature.
+3. Reduce the dimensionality of the data to a single feature by using [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
+4. Pass the resulting transformed data into a `KNNClassifier`.
+
+**Note**: Steps 1 and 2 use [`IndependentFunctionTransformer`](https://sequentia.readthedocs.io/en/latest/sections/preprocessing/transforms/function_transformer.html#sequentia.preprocessing.transforms.IndependentFunctionTransformer) provided by Sequentia to
+apply the specified transformation to each sequence in `X` individually, rather than using
+[`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer) from Scikit-Learn, which would transform the entire `X`
+array once, treating it as a single sequence.

-# Create a transformation pipeline that feeds into a KNNClassifier
-# 1. Individually denoise each sequence by applying a median filter for each feature
-# 2. Individually standardize each sequence by subtracting the mean and dividing the s.d. for each feature
-# 3. Reduce the dimensionality of the data to a single feature by using PCA
-# 4. Pass the resulting transformed data into a KNNClassifier
+```python
+from sklearn.preprocessing import scale
+from sklearn.decomposition import PCA
+from sklearn.pipeline import Pipeline
+
+from sequentia.preprocessing import IndependentFunctionTransformer, median_filter
+
+# Create a preprocessing pipeline that feeds into a KNNClassifier
 pipeline = Pipeline([
     ('denoise', IndependentFunctionTransformer(median_filter)),
     ('scale', IndependentFunctionTransformer(scale)),
     ('pca', PCA(n_components=1)),
     ('knn', KNNClassifier(k=1))
 ])

-# Fit the pipeline to the data - lengths must be provided
+# Fit the pipeline to the data
 pipeline.fit(X, y, lengths=lengths)

-# Predict classes for the sequences and calculate accuracy - lengths must be provided
+# Predict classes for the sequences and calculate accuracy
 y_pred = pipeline.predict(X, lengths=lengths)
+
+# Make predictions based on the provided sequences and calculate accuracy
 acc = pipeline.score(X, y, lengths=lengths)
 ```

+For hyper-parameter optimization, Sequentia provides a `sequentia.model_selection` sub-package
+that includes most of the hyper-parameter search and cross-validation methods provided by
+[`sklearn.model_selection`](https://scikit-learn.org/stable/api/sklearn.model_selection.html),
+but adapted to work with sequences.
+
+For instance, we can perform a grid search with k-fold cross-validation stratifying over labels
+in order to find an optimal value for the number of neighbors in `KNNClassifier` for the
+above pipeline.
+
+```python
+from sequentia.model_selection import StratifiedKFold, GridSearchCV
+
+# Define hyper-parameter search and specify cross-validation method
+search = GridSearchCV(
+    # Re-use the above pipeline
+    estimator=Pipeline([
+        ('denoise', IndependentFunctionTransformer(median_filter)),
+        ('scale', IndependentFunctionTransformer(scale)),
+        ('pca', PCA(n_components=1)),
+        ('knn', KNNClassifier(k=1))
+    ]),
+    # Try a range of values of k
+    param_grid={"knn__k": [1, 2, 3, 4, 5]},
+    # Specify k-fold cross-validation with label stratification using 4 splits
+    cv=StratifiedKFold(n_splits=4),
+)
+
+# Perform cross-validation over accuracy and retrieve the best model
+search.fit(X, y, lengths=lengths)
+clf = search.best_estimator_
+
+# Make predictions using the best model and calculate accuracy
+acc = clf.score(X, y, lengths=lengths)
+```
+
 ## Acknowledgments

 In earlier versions of the package, an approximate DTW implementation [`fastdtw`](https://github.com/slaypni/fastdtw) was used in hopes of speeding up k-NN predictions, as the authors of the original FastDTW paper [[2]](#references) claim that approximated DTW alignments can be computed in linear memory and time, compared to the O(N<sup>2</sup>) runtime complexity of the usual exact DTW implementation.
@@ -262,12 +400,12 @@ All contributions to this repository are greatly appreciated. Contribution guide

 Sequentia is released under the [MIT](https://opensource.org/licenses/MIT) license.

-Certain parts of the source code are heavily adapted from [Scikit-Learn](scikit-learn.org/).
+Certain parts of source code are heavily adapted from [Scikit-Learn](scikit-learn.org/).
 Such files contain a copy of [their license](https://github.com/scikit-learn/scikit-learn/blob/main/COPYING).

 ---

 <p align="center">
-  <b>Sequentia</b> &copy; 2019-2025, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
+  <b>Sequentia</b> &copy; 2019, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
   <em>Authored and maintained by Edwin Onuonga.</em>
 </p>

benchmarks/__init__.py

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# Copyright (c) 2019 Sequentia Developers.
+# Distributed under the terms of the MIT License (see the LICENSE file).
+# SPDX-License-Identifier: MIT
+# This source code is part of the Sequentia project (https://github.com/eonu/sequentia).
+
+"""Collection of runtime benchmarks for Python packages
+providing dynamic time warping k-nearest neighbors algorithms.
+"""
