sklearn_utils


Utility functions, preprocessing steps, and classes I need in my research and development projects with scikit-learn.

Installation

You can install sklearn-utils with pip:

pip install sklearn-utils

Examples

If you want to scale your data based on reference values, you can use StandardScalerByLabel. For example, to scale all blood samples by the healthy samples:

from sklearn_utils.preprocessing import StandardScalerByLabel

preprocessing = StandardScalerByLabel('healthy')
X_t = preprocessing.fit_transform(X, y)
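The core idea can be sketched in pure Python: fit the mean and standard deviation on the reference-label samples only, then apply them to every sample. The `scale_by_label` helper below is an illustrative assumption about the semantics, not the library's implementation.

```python
from statistics import mean, pstdev

def scale_by_label(X, y, reference_label):
    """Standard-scale every sample using mean/std computed only
    from the reference-label samples (illustrative sketch)."""
    ref = [x for x, label in zip(X, y) if label == reference_label]
    cols = list(zip(*ref))                     # per-feature reference values
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against zero variance
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in X]

X = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
y = ['healthy', 'healthy', 'sick']
X_t = scale_by_label(X, y, 'healthy')  # sick sample scaled by healthy stats
```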

Or you may want your data back as a list of dicts at the end of a scikit-learn pipeline, after a set of operations and feature selection.

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn_utils.preprocessing import InverseDictVectorizer

vect = DictVectorizer(sparse=False)
skb = SelectKBest(k=100)
pipe = Pipeline([
    ('vect', vect),
    ('skb', skb),
    ('inv_vect', InverseDictVectorizer(vect, skb))
])

X_t = pipe.fit_transform(X, y)
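The inverse step at the end of that pipeline amounts to mapping each transformed row back onto the names of the surviving features. A minimal sketch, assuming the feature names and a boolean selection mask are available (`rows_to_dicts` is a hypothetical helper, not part of the library):

```python
def rows_to_dicts(X_t, feature_names, support):
    """Map transformed rows back to dicts, keeping only the features
    that survived selection (sketch of the inverse-vectorize step)."""
    kept = [name for name, keep in zip(feature_names, support) if keep]
    return [dict(zip(kept, row)) for row in X_t]

names = ['age', 'bmi', 'glucose']
support = [True, False, True]   # e.g. a SelectKBest(k=2) support mask
X_t = [[35.0, 5.5], [41.0, 6.1]]
out = rows_to_dicts(X_t, names, support)
```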

For more features, you can check the documentation.

Documentation

The documentation of the project is available at http://sklearn-utils.rtfd.io .

API Documentation

Preprocessing

class sklearn_utils.preprocessing.DictInput(transformer, feature_selection=False, sparse=False)[source]

Bases: sklearn.base.TransformerMixin

Converts a preprocessing step to accept a list of dicts.

__init__(transformer, feature_selection=False, sparse=False)[source]
Parameters:
  • transformer – sklearn transformer
  • feature_selection – whether this transformer performs feature selection.
fit(X, y=None)[source]
transform(X)[source]
Parameters:X – features.
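The wrapping idea can be sketched as: vectorize the dicts into a matrix, run the matrix-based transform, and rebuild dicts from the result. The `dict_input` function below is an illustrative assumption about the behavior, not the class's actual code.

```python
def dict_input(transform, X):
    """Run a matrix-based transform on list-of-dict data by
    vectorizing, transforming, and rebuilding dicts (sketch)."""
    keys = sorted({k for d in X for k in d})
    matrix = [[d.get(k, 0.0) for k in keys] for d in X]
    out = transform(matrix)
    return [dict(zip(keys, row)) for row in out]

double = lambda M: [[v * 2 for v in row] for row in M]  # stand-in transform
X_t = dict_input(double, [{'a': 1.0, 'b': 2.0}])
```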
class sklearn_utils.preprocessing.FoldChangeScaler(reference_label, bounds=(-10, 10))[source]

Bases: sklearn.base.TransformerMixin

Scales each measured value by its distance to the reference mean. Useful when you want standard scaling but have no reliable variance.

__init__(reference_label, bounds=(-10, 10))[source]
Reference_label:
 the label whose samples the scaling is performed by.
Bounds:min and max values the fold change scaler can produce.

There are bounds because the scaling can provide unstable results.

fit(X, y)[source]
X:list of dict
Y:labels
transform(X)[source]
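A minimal sketch of fold-change scaling on list-of-dict data: divide each value by the reference-label mean for that feature, then clip to the bounds. The `fold_change_scale` function is an illustrative assumption, not the library's implementation.

```python
from statistics import mean

def fold_change_scale(X, y, reference_label, bounds=(-10, 10)):
    """Scale each feature by its fold change against the reference-label
    mean, clipped to bounds (illustrative sketch)."""
    ref = [x for x, label in zip(X, y) if label == reference_label]
    ref_mean = {k: mean(d[k] for d in ref) for k in ref[0]}
    lo, hi = bounds
    return [{k: max(lo, min(hi, v / ref_mean[k])) for k, v in d.items()}
            for d in X]

X = [{'m1': 2.0}, {'m1': 4.0}, {'m1': 100.0}]
y = ['healthy', 'healthy', 'sick']
X_t = fold_change_scale(X, y, 'healthy')  # 100/3 would be 33.3, clipped to 10
```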
class sklearn_utils.preprocessing.FeatureRenaming(names, case_sensetive=False)[source]

Bases: sklearn.base.TransformerMixin

Preprocessing to re-name features.

__init__(names, case_sensetive=False)[source]
Names:dict which contains old feature names as keys and new names as values.
Case_sensetive:
 whether matching is performed case-sensitively.
fit(X, y=None)[source]
transform(X, y=None)[source]
X:list of dict
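The renaming step can be sketched as a key lookup with optional case folding. The `rename_features` function below is an illustrative assumption about the semantics, not the class's code.

```python
def rename_features(X, names, case_sensitive=False):
    """Rename dict keys via a mapping; optionally match the old
    names case-insensitively (illustrative sketch)."""
    if not case_sensitive:
        names = {k.lower(): v for k, v in names.items()}

    def new_key(k):
        lookup = k if case_sensitive else k.lower()
        return names.get(lookup, k)  # unknown keys pass through unchanged

    return [{new_key(k): v for k, v in d.items()} for d in X]

X = [{'GLC': 1.2, 'other': 0.5}]
X_t = rename_features(X, {'glc': 'glucose'})
```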
class sklearn_utils.preprocessing.DynamicPipeline[source]

Bases: object

Dynamic Pipeline

classmethod make_pipeline(selected_steps)[source]
class sklearn_utils.preprocessing.StandardScalerByLabel(reference_label)[source]

Bases: sklearn.preprocessing.data.StandardScaler

StandardScaler fit using only the samples with the given label.

__init__(reference_label)[source]
Reference_label:
 the label whose samples the scaling is performed by.
partial_fit(X, y)[source]
X:{array-like, sparse matrix}, shape [n_samples, n_features] The data used to compute the mean and standard deviation used for later scaling along the features axis.
Y:labels, e.g. 'h' for healthy or a sickness name
class sklearn_utils.preprocessing.FunctionalEnrichmentAnalysis(reference_label, feature_groups, method='fisher_exact', alternative='two-sided', filter_func=None)[source]

Bases: sklearn.base.TransformerMixin

Functional Enrichment Analysis

__init__(reference_label, feature_groups, method='fisher_exact', alternative='two-sided', filter_func=None)[source]
Reference_label:
 label of the reference values in the calculation
Method:only Fisher's exact test is available so far
Feature_groups:dict where keys are new features and values are lists of old features
Filter_func:function returning True or False
fit(X, y)[source]
transform(X, y=None)[source]
X:list of dict
Y:labels
class sklearn_utils.preprocessing.FeatureMerger(features, strategy='mean')[source]

Bases: sklearn.base.TransformerMixin

Merge some features based on given strategy.

__init__(features, strategy='mean')[source]
Features:dict which contains each new feature as a key and a list of old features as its value.
Strategy:strategy to merge features; 'mean', 'sum', and lambda functions are accepted.

Lambda function accepts list of values as input.

fit(X, y=None)[source]
transform(X, y=None)[source]
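Merging can be sketched as: drop the old features, then aggregate each group with the chosen strategy. The `merge_features` function is an illustrative assumption, not the library's implementation.

```python
from statistics import mean

def merge_features(X, features, strategy='mean'):
    """Merge old features into new ones by a strategy
    ('mean', 'sum', or a callable) -- illustrative sketch."""
    agg = {'mean': mean, 'sum': sum}.get(strategy, strategy)
    merged = []
    for d in X:
        # keep features that belong to no merge group
        row = {k: v for k, v in d.items()
               if not any(k in old for old in features.values())}
        for new, old in features.items():
            values = [d[k] for k in old if k in d]
            if values:
                row[new] = agg(values)
        merged.append(row)
    return merged

X = [{'a': 1.0, 'b': 3.0, 'c': 5.0}]
X_t = merge_features(X, {'ab': ['a', 'b']})  # 'a' and 'b' averaged into 'ab'
```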

Utils

sklearn_utils.utils.filter_by_label(X, y, ref_label, reverse=False)[source]

Selects the items with the given label from the dataset.

Parameters:
  • X – dataset
  • y – labels
  • ref_label – reference label
  • reverse (bool) – if False, selects samples with ref_label; if True, eliminates them
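A self-contained sketch of the documented semantics (the real function lives in `sklearn_utils.utils`; this re-implementation is for illustration only):

```python
def filter_by_label(X, y, ref_label, reverse=False):
    """Keep (or, with reverse=True, drop) the samples whose
    label equals ref_label -- sketch of the documented behavior."""
    pairs = [(x, label) for x, label in zip(X, y)
             if (label == ref_label) != reverse]
    X_f = [x for x, _ in pairs]
    y_f = [label for _, label in pairs]
    return X_f, y_f

X = [{'m': 1}, {'m': 2}, {'m': 3}]
y = ['healthy', 'sick', 'healthy']
X_h, y_h = filter_by_label(X, y, 'healthy')
```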
sklearn_utils.utils.average_by_label(X, y, ref_label)[source]

Calculates the average dictionary from a list of dictionaries for the given label.

Parameters:
  • X (List[Dict]) – dataset
  • y (list) – labels
  • ref_label – reference label
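The averaging step can be sketched as follows; this is an illustrative re-implementation of the documented behavior, not the library's code:

```python
from statistics import mean

def average_by_label(X, y, ref_label):
    """Average each key across the dicts carrying the given label
    (sketch of the documented semantics)."""
    selected = [x for x, label in zip(X, y) if label == ref_label]
    keys = {k for d in selected for k in d}
    return {k: mean(d[k] for d in selected if k in d) for k in keys}

X = [{'m': 2.0}, {'m': 4.0}, {'m': 9.0}]
y = ['h', 'h', 's']
avg = average_by_label(X, y, 'h')  # only the 'h' samples contribute
```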
sklearn_utils.utils.map_dict(d, key_func=None, value_func=None, if_func=None)[source]
Parameters:
  • d (dict) – dictionary
  • key_func (func) – func which will run on key.
  • value_func (func) – func which will run on values.
sklearn_utils.utils.map_dict_list(ds, key_func=None, value_func=None, if_func=None)[source]
Parameters:
  • ds (List[Dict]) – list of dict
  • key_func (func) – func which will run on key.
  • value_func (func) – func which will run on values.
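Both helpers can be sketched together; the pass-through defaults and the `(key, value)` signature of `if_func` are assumptions, not confirmed by the docs above:

```python
def map_dict(d, key_func=None, value_func=None, if_func=None):
    """Apply key_func/value_func to the items that pass if_func
    (illustrative sketch of the documented behavior)."""
    key_func = key_func or (lambda k: k)
    value_func = value_func or (lambda v: v)
    if_func = if_func or (lambda k, v: True)
    return {key_func(k): value_func(v) for k, v in d.items() if if_func(k, v)}

def map_dict_list(ds, **kwargs):
    """Apply map_dict to every dict in a list."""
    return [map_dict(d, **kwargs) for d in ds]

out = map_dict_list([{'A': 1}, {'B': 2}],
                    key_func=str.lower, value_func=lambda v: v * 10)
```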
class sklearn_utils.utils.SkUtilsIO(path, gz=False)[source]

Bases: object

IO class to read and write dataset to file.

__init__(path, gz=False)[source]
Path:file name with path.
from_csv(label_column='labels')[source]

Read dataset from csv.

from_json()[source]

Reads dataset from json.

from_pickle()[source]

Reads dataset from pickle.

to_csv(X, y)[source]

Writes dataset to csv.

to_json(X, y)[source]

Writes dataset to json.

Parameters:
  • X – dataset as list of dict.
  • y – labels.
to_pickle(X, y)[source]

Writes dataset to pickle.

Parameters:
  • X – dataset as list of dict.
  • y – labels.
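The CSV layout can be sketched with the stdlib: one column per feature plus a label column. The `to_csv` helper below writes to any file-like object and is an illustrative assumption about the format, not the class's actual output.

```python
import csv
import io

def to_csv(f, X, y, label_column='labels'):
    """Write a list-of-dict dataset plus labels as CSV (sketch)."""
    keys = sorted({k for d in X for k in d})
    writer = csv.DictWriter(f, fieldnames=keys + [label_column])
    writer.writeheader()
    for d, label in zip(X, y):
        writer.writerow({**d, label_column: label})

buf = io.StringIO()
to_csv(buf, [{'m': 1}], ['healthy'])
```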

Noise

class sklearn_utils.noise.SelectNotKBest(**kwargs)[source]

Bases: sklearn.base.TransformerMixin

Selects all features except the K best.

__init__(**kwargs)[source]
fit(X, y)[source]
get_support()[source]
transform(X)[source]

Transform to select all but the k best features :param X: np.matrix
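The selection amounts to inverting a k-best support mask and keeping the remaining columns. The `select_not_k_best` function is an illustrative sketch; in the real class the mask would come from a fitted `SelectKBest.get_support()`.

```python
def select_not_k_best(X, support):
    """Keep the columns NOT flagged by a k-best support mask
    (illustrative sketch)."""
    keep = [i for i, s in enumerate(support) if not s]
    return [[row[i] for i in keep] for row in X]

X = [[1, 2, 3], [4, 5, 6]]
support = [False, True, False]   # column 1 is among the k best
X_noise = select_not_k_best(X, support)
```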

class sklearn_utils.noise.NoiseGenerator(noise_func, noise_func_args)[source]

Bases: sklearn.base.TransformerMixin

Add noise to dataset

__init__(noise_func, noise_func_args)[source]

Add noise to data :noise_func: a function which generates noise with the same shape as the data :noise_func_args: arguments of the noise function

fit(X, y)[source]
relative_noise_size(data, noise)[source]
Data:original data as numpy matrix
Noise:noise matrix as numpy matrix
transform(X)[source]
X:numpy ndarray
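The two operations can be sketched in pure Python: generate noise elementwise with the same shape as the data, and report the ratio of the noise norm to the data norm. Both `add_noise` and this definition of `relative_noise_size` are illustrative assumptions; the library may compute them differently.

```python
import math
import random

def relative_noise_size(data, noise):
    """Ratio of the Frobenius norm of the noise to that of the data
    (a plausible definition, sketched for illustration)."""
    norm = lambda m: math.sqrt(sum(v * v for row in m for v in row))
    return norm(noise) / norm(data)

def add_noise(X, noise_func, **noise_func_args):
    """Add noise of the same shape as X, one draw per element."""
    noise = [[noise_func(**noise_func_args) for _ in row] for row in X]
    noisy = [[v + n for v, n in zip(row, nrow)]
             for row, nrow in zip(X, noise)]
    return noisy, noise

random.seed(0)
X = [[3.0, 4.0]]
noisy, noise = add_noise(X, random.gauss, mu=0.0, sigma=0.1)
```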