- [x] Multivariate real-valued observations (modeled with Gaussian mixture emissions)
- [x] Univariate categorical observations (modeled with discrete emissions)
- [x] Linear, left-right and ergodic topologies
- [x] Multi-processed predictions
### Scikit-Learn compatibility
**Sequentia (≥2.0) is compatible with the Scikit-Learn API (≥1.4), enabling rapid development and prototyping of sequential models.**
The integration relies on the use of [metadata routing](https://scikit-learn.org/stable/metadata_routing.html),
which means that in most cases, the only necessary change is to add a `lengths` keyword argument to provide
sequence length information, e.g. `fit(X, y, lengths=lengths)` instead of `fit(X, y)`.
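To make the `lengths` convention concrete, here is a NumPy-only sketch (with made-up values, independent of Sequentia's internals) of how a single flat array plus a lengths vector encodes variable-length sequences:

```python
import numpy as np

# Three variable-length univariate sequences, concatenated into one array
seqs = [
    np.array([1.0, 2.0, 3.0]),
    np.array([4.0, 5.0]),
    np.array([6.0, 7.0, 8.0, 9.0]),
]
X = np.concatenate(seqs)
lengths = np.array([len(s) for s in seqs])  # [3, 2, 4]

# Recover the individual sequences: cumulative lengths give the split points
decoded = np.split(X, np.cumsum(lengths)[:-1])
```

Keeping the data in one contiguous array avoids ragged structures and fits the single-`X` convention that the Scikit-Learn API expects.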
### Similar libraries
As DTW k-nearest neighbors is the core algorithm offered by Sequentia, below is a comparison of the DTW k-nearest neighbors features supported by Sequentia and similar libraries.
- <sup>1</sup>`tslearn` supports variable length sequences with padding, but doesn't seem to mask the padding.
- <sup>2</sup>`sktime` does not support variable length sequences, so they are padded (and padding is not masked).
- <sup>3</sup>`pyts` does not support variable length sequences, so they are padded (and padding is not masked).
- <sup>4</sup>`sequentia` only supports [`dtaidistance`](https://github.com/wannesm/dtaidistance), which is one of the fastest DTW libraries as it is written in C.
### Benchmarks
To compare the above libraries in runtime performance on dynamic time warping k-nearest neighbors classification tasks, a simple benchmark was performed on a univariate sequence dataset.

The [Free Spoken Digit Dataset](https://sequentia.readthedocs.io/en/latest/sections/datasets/digits.html) was used for benchmarking and consists of:

- 3000 recordings of 10 spoken digits (0-9)
- 50 recordings of each digit for each of 6 speakers
- 1500 used for training, 1500 used for testing (split via label stratification)
- 13 features ([MFCCs](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum))
  - Only the first feature was used, as not all of the above libraries support multivariate sequences
- Sequence length statistics: min 6, median 17, max 92

Each result measures the total time taken to complete training and prediction, repeated 10 times.

All of the above libraries support multiprocessing, and prediction was performed using 16 workers.
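The benchmark harness itself is not reproduced here, but the measurement protocol described above (total train-plus-predict time over 10 repeats) can be sketched with the standard library alone. The workload below is a stand-in, not any of the benchmarked libraries:

```python
import timeit

def train_and_predict():
    # Stand-in workload; a real benchmark would call fit() and predict()
    # on the library under test here
    sum(i * i for i in range(1000))

# Total time taken to complete training and prediction, repeated 10 times
total_seconds = timeit.timeit(train_and_predict, number=10)
```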
<sup>*</sup>: `sktime`, `tslearn` and `pyts` seem to not mask padding, which may result in incorrect predictions.
<img src="benchmarks/benchmark.svg" width="100%"/>
> **Device information**:
> - Product: ThinkPad T14s (Gen 6)
> - Processor: AMD Ryzen™ AI 7 PRO 360 (8 cores, 16 threads, 2-5GHz)
> - Operating system: Fedora Linux 41 (Workstation Edition)
## Installation
The latest stable version of Sequentia can be installed with the following command:

```
pip install sequentia
```
### C libraries
For optimal performance when using any of the k-NN based models, it is important that the correct `dtaidistance` C libraries are accessible.
Please see the [`dtaidistance` installation guide](https://dtaidistance.readthedocs.io/en/latest/usage/installation.html) for troubleshooting if you run into C compilation issues, or if setting `use_c=True` on k-NN based models results in a warning.
You can use the following to check if the appropriate C libraries are available.
```python
from dtaidistance import dtw
dtw.try_import_c()
```

## Documentation

Documentation for the package is available on [Read The Docs](https://sequentia.readthedocs.io/en/latest).
## Examples
Demonstration of classifying multivariate sequences into two classes using the `KNNClassifier`.
This example also shows a typical preprocessing workflow, as well as compatibility with
Scikit-Learn for pipelining and hyper-parameter optimization.
---
First, we create some sample multivariate input data consisting of three sequences with two features.
- Sequentia expects sequences to be concatenated and represented as a single NumPy array.
- Sequence lengths are provided separately and used to decode the sequences when needed.
This avoids the need for complex structures such as lists of nested arrays with different lengths.
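The snippet constructing the sample data is not reproduced above, so here is a minimal sketch with made-up values: three sequences of different lengths, each with two features, concatenated into a single array:

```python
import numpy as np

# Three sequences with two features each, concatenated along the first axis
X = np.array(
    [
        # sequence 1 (length 3)
        [1.2, 0.4], [0.9, 0.7], [1.1, 0.6],
        # sequence 2 (length 5)
        [4.5, 2.1], [4.2, 2.3], [4.8, 2.0], [4.6, 2.2], [4.4, 2.4],
        # sequence 3 (length 2)
        [2.0, 1.0], [2.1, 0.9],
    ]
)

# Lengths used to decode the individual sequences from X
lengths = np.array([3, 5, 2])

# One class label per sequence
y = np.array([0, 1, 0])
```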
With this data, we can train a `KNNClassifier` and use it for prediction and scoring.
**Note**: Each of the `fit()`, `predict()` and `score()` methods requires the sequence lengths
to be provided in addition to the sequence data `X` and labels `y`.
```python
from sequentia.models import KNNClassifier

# Initialize and fit the classifier
clf = KNNClassifier(k=1)
clf.fit(X, y, lengths=lengths)

# Make predictions based on the provided sequences
y_pred = clf.predict(X, lengths=lengths)

# Make predictions based on the provided sequences and calculate accuracy
acc = clf.score(X, y, lengths=lengths)
```
Alternatively, we can use [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html) to build a more complex preprocessing pipeline:

1. Individually denoise each sequence by applying a [median filter](https://sequentia.readthedocs.io/en/latest/sections/preprocessing/transforms/filters.html#sequentia.preprocessing.transforms.median_filter) to each feature.
2. Individually [standardize](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html) each sequence by subtracting the mean and dividing by the standard deviation for each feature.
3. Reduce the dimensionality of the data to a single feature by using [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
4. Pass the resulting transformed data into a `KNNClassifier`.

**Note**: Steps 1 and 2 use [`IndependentFunctionTransformer`](https://sequentia.readthedocs.io/en/latest/sections/preprocessing/transforms/function_transformer.html#sequentia.preprocessing.transforms.IndependentFunctionTransformer) provided by Sequentia to
apply the specified transformation to each sequence in `X` individually, rather than using
[`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer) from Scikit-Learn, which would transform the entire `X`
array once, treating it as a single sequence.
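To see why per-sequence transformation matters, here is a NumPy-only sketch (independent of Sequentia's API) contrasting standardizing each sequence individually against standardizing the concatenated array as a whole:

```python
import numpy as np

# Two concatenated univariate sequences with very different scales
X = np.array([[1.0], [2.0], [3.0], [100.0], [200.0], [300.0]])
lengths = np.array([3, 3])

def standardize(x):
    # Subtract the mean and divide by the standard deviation, per feature
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Whole-array standardization: one mean/std shared across both sequences
whole = standardize(X)

# Per-sequence standardization: each sequence gets its own mean/std,
# which is the behavior IndependentFunctionTransformer provides
per_seq = np.concatenate(
    [standardize(s) for s in np.split(X, np.cumsum(lengths)[:-1])]
)
```

With per-sequence scaling the two sequences become identical (their scale difference is removed), whereas whole-array scaling preserves it.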
```python
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import scale

from sequentia.models import KNNClassifier
from sequentia.preprocessing import IndependentFunctionTransformer, median_filter

# Create a preprocessing pipeline that feeds into a KNNClassifier
pipeline = Pipeline(
    [
        ("denoise", IndependentFunctionTransformer(median_filter)),
        ("scale", IndependentFunctionTransformer(scale)),
        ("pca", PCA(n_components=1)),
        ("knn", KNNClassifier(k=1)),
    ]
)

# Search over hyper-parameters (this grid is illustrative)
search = GridSearchCV(
    pipeline,
    param_grid={"knn__k": [1, 3, 5]},
    # Specify k-fold cross-validation with label stratification using 4 splits
    cv=StratifiedKFold(n_splits=4),
)

# Perform cross-validation over accuracy and retrieve the best model
search.fit(X, y, lengths=lengths)
clf = search.best_estimator_

# Make predictions using the best model and calculate accuracy
acc = clf.score(X, y, lengths=lengths)
```
## Acknowledgments
In earlier versions of the package, an approximate DTW implementation [`fastdtw`](https://github.com/slaypni/fastdtw) was used in hopes of speeding up k-NN predictions, as the authors of the original FastDTW paper [[2]](#references) claim that approximated DTW alignments can be computed in linear memory and time, compared to the O(N<sup>2</sup>) runtime complexity of the usual exact DTW implementation.
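For reference, the O(N<sup>2</sup>) dynamic program mentioned above can be illustrated with a minimal exact DTW implementation (a teaching sketch, not the `dtaidistance` implementation that Sequentia actually uses):

```python
def dtw_distance(a, b):
    """Exact DTW between two univariate sequences.

    Runs in O(len(a) * len(b)) time and memory.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignment moves
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Approximate methods such as FastDTW aim to avoid filling this full N×M cost matrix.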
All contributions to this repository are greatly appreciated. Contribution guidelines can be found in the repository.
Sequentia is released under the [MIT](https://opensource.org/licenses/MIT) license.
Certain parts of the source code are heavily adapted from [Scikit-Learn](https://scikit-learn.org/).
Such files contain a copy of [their license](https://github.com/scikit-learn/scikit-learn/blob/main/COPYING).