Release 0.3.0

Published on January 21st, 2021

Linfa 0.3.0 concentrates on polishing the existing implementations and adds three new algorithms. A new feature system allows selecting the BLAS/LAPACK backend in the base crate. The Dataset interface has been reworked to follow the ndarray model more closely, and the new linfa-datasets crate gives easy access to sample datasets for testing.
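The backend is chosen through Cargo feature flags on the base crate. A minimal sketch of what this looks like in a downstream Cargo.toml (the feature name shown here is one of the backend features documented for this release; consult the crate documentation for the full list):

[dependencies]
# statically linked Intel MKL as the BLAS/LAPACK backend
linfa = { version = "0.3.0", features = ["intel-mkl-static"] }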

New algorithms

  • Approximated DBSCAN has been added to linfa-clustering by [@Sauro98]
  • Gaussian Naive Bayes has been added to linfa-bayes by [@VasanthakumarV]
  • Elastic Net linear regression has been added to linfa-elasticnet by [@paulkoerbitz] and [@bytesnake] (see the sketch below)
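To give a taste of the new linfa-elasticnet crate, the following sketch fits an elastic net on the diabetes sample dataset. The penalty and l1_ratio values are arbitrary, and the builder-style API is sketched from the pattern used across Linfa; check the crate documentation for the exact interface:

use linfa::prelude::*;
use linfa_elasticnet::ElasticNet;

// load the diabetes regression dataset from linfa-datasets
let dataset = linfa_datasets::diabetes();

// fit an elastic net combining l1 and l2 regularization
// (penalty and l1_ratio are arbitrary example values)
let model = ElasticNet::params()
    .penalty(0.3)
    .l1_ratio(0.8)
    .fit(&dataset)?;

// inspect the fitted model
println!("intercept: {}", model.intercept());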

Changes

  • Added a benchmark for Gaussian mixture models (a3eede55)
  • Fixed bugs in linear decision trees, added a generator for TikZ trees (bfa5aebe7)
  • Implemented serde support for all crates behind a feature flag (4f0b63bb)
  • Implemented new backend features (7296c9ec4)
  • Introduced linfa-datasets for easier testing (3cec12b4f)
  • Renamed Dataset to DatasetBase and introduced the Dataset and DatasetView aliases (21dd579cf); see the sketch after this list
  • Improved kernel tests and documentation (8e81a6d)
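The rename makes ownership explicit: DatasetBase is generic over its records and targets, while Dataset and DatasetView cover the owned and borrowed cases. A minimal sketch of building an owned dataset from ndarray structures (assuming the Dataset::new constructor; see the linfa API documentation for details):

use linfa::dataset::Dataset;
use ndarray::array;

// records are a 2-d array of features, targets a 1-d array of labels
let records = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];
let targets = array![0, 1, 1];

// an owned Dataset couples the records with their targets;
// DatasetView is the borrowing counterpart for array views
let dataset = Dataset::new(records, targets);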

Example

The following section shows a small example of how datasets interact with the training and testing of a linear decision tree.

You can load a dataset, shuffle it and then split it into training and validation sets:

use linfa::prelude::*;
use linfa_trees::{DecisionTree, SplitQuality};
use rand::SeedableRng;
use rand_isaac::Isaac64Rng;

// initialize the pseudo random number generator with seed 42
let mut rng = Isaac64Rng::seed_from_u64(42);

// load the Iris dataset, shuffle and split with ratio 0.8
let (train, test) = linfa_datasets::iris()
    .shuffle(&mut rng)
    .split_with_ratio(0.8);

With the training dataset, a linear decision tree model can be fitted. Entropy is used here as the metric for finding the optimal split:

let entropy_model = DecisionTree::params()
    .split_quality(SplitQuality::Entropy)
    .max_depth(Some(100))
    .min_weight_split(10.0)
    .min_weight_leaf(10.0)
    .fit(&train);
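For reference, the entropy criterion measures the impurity of the class distribution at a node; with class proportions p_k it is the familiar

$$ H = -\sum_k p_k \log_2 p_k $$

and the tree greedily picks the split that reduces this impurity the most.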

The validation dataset is now used to estimate the error. For this, labels are predicted on the test records and a confusion matrix reveals what kinds of errors the model makes:

let cm = entropy_model
    .predict(test.records().view())
    .confusion_matrix(&test);

println!("{:?}", cm);

println!(
    "Test accuracy with Entropy criterion: {:.2}%",
    100.0 * cm.accuracy()
);

Finally, you can analyze which features were used in the decision and export the whole tree to a TeX file. The file contains a TikZ tree annotated with each splitting decision and its impurity improvement:

let feats = entropy_model.features();
println!("Features trained in this tree {:?}", feats);

use std::fs::File;
use std::io::Write;

let mut tikz = File::create("decision_tree_example.tex").unwrap();
tikz.write_all(entropy_model.export_to_tikz().to_string().as_bytes())
    .unwrap();

The whole example can be found in linfa-trees/examples/decision_tree.rs.