sklearn_utils


Utility functions, preprocessing steps, and classes I need in my research and development projects with scikit-learn.

Installation

You can install sklearn-utils with pip:

pip install sklearn-utils

Examples

If you want to scale your data based on reference values, you can use StandardScalerByLabel. For example, to scale all blood samples by the healthy samples:

from sklearn_utils.preprocessing import StandardScalerByLabel

preprocessing = StandardScalerByLabel('healthy')
X_t = preprocessing.fit_transform(X, y)
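The core idea can be sketched in pure Python: fit the mean and standard deviation on the reference-label samples only, then apply them to every sample. The `scale_by_label` helper below is an illustrative assumption about the semantics, not the library's implementation.

```python
from statistics import mean, pstdev

def scale_by_label(X, y, reference_label):
    """Standard-scale every sample using mean/std computed only
    from the reference-label samples (illustrative sketch)."""
    ref = [x for x, label in zip(X, y) if label == reference_label]
    cols = list(zip(*ref))                     # per-feature reference values
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against zero variance
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in X]

X = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
y = ['healthy', 'healthy', 'sick']
X_t = scale_by_label(X, y, 'healthy')  # sick sample scaled by healthy stats
```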

Or you may want your data back as a list of dicts at the end of a scikit-learn pipeline, after a set of operations and feature selection.

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn_utils.preprocessing import InverseDictVectorizer

vect = DictVectorizer(sparse=False)
skb = SelectKBest(k=100)
pipe = Pipeline([
    ('vect', vect),
    ('skb', skb),
    ('inv_vect', InverseDictVectorizer(vect, skb))
])

X_t = pipe.fit_transform(X, y)
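The inverse step at the end of that pipeline amounts to mapping each transformed row back onto the names of the surviving features. A minimal sketch, assuming the feature names and a boolean selection mask are available (`rows_to_dicts` is a hypothetical helper, not part of the library):

```python
def rows_to_dicts(X_t, feature_names, support):
    """Map transformed rows back to dicts, keeping only the features
    that survived selection (sketch of the inverse-vectorize step)."""
    kept = [name for name, keep in zip(feature_names, support) if keep]
    return [dict(zip(kept, row)) for row in X_t]

names = ['age', 'bmi', 'glucose']
support = [True, False, True]   # e.g. a SelectKBest(k=2) support mask
X_t = [[35.0, 5.5], [41.0, 6.1]]
out = rows_to_dicts(X_t, names, support)
```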

For more features, you can check the documentation.

Documentation

The documentation of the project is available at http://sklearn-utils.rtfd.io .

API Documentation

Preprocessing

class sklearn_utils.preprocessing.DictInput(transformer, feature_selection=False, sparse=False)[source]

Bases: sklearn.base.TransformerMixin

Converts a preprocessing step to accept a list of dicts.

__init__(transformer, feature_selection=False, sparse=False)[source]
Parameters:
  • transformer – sklearn transformer
  • feature_selection – whether this transformer performs feature selection.
fit(X, y=None)[source]
transform(X)[source]
Parameters:X – features.
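The wrapping idea can be sketched as: vectorize the dicts into a matrix, run the matrix-based transform, and rebuild dicts from the result. The `dict_input` function below is an illustrative assumption about the behavior, not the class's actual code.

```python
def dict_input(transform, X):
    """Run a matrix-based transform on list-of-dict data by
    vectorizing, transforming, and rebuilding dicts (sketch)."""
    keys = sorted({k for d in X for k in d})
    matrix = [[d.get(k, 0.0) for k in keys] for d in X]
    out = transform(matrix)
    return [dict(zip(keys, row)) for row in out]

double = lambda M: [[v * 2 for v in row] for row in M]  # stand-in transform
X_t = dict_input(double, [{'a': 1.0, 'b': 2.0}])
```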
class sklearn_utils.preprocessing.FoldChangeScaler(reference_label, bounds=(-10, 10))[source]

Bases: sklearn.base.TransformerMixin

Scales each measured value by its distance to the reference mean. Useful when you want standard scaling but have no reliable variance.

__init__(reference_label, bounds=(-10, 10))[source]
Reference_label:
 the label whose samples the scaling is performed by.
Bounds:min and max values the fold change scaler can produce.

There are bounds because the scaling can provide unstable results.

fit(X, y)[source]
X:list of dict
Y:labels
transform(X)[source]
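A minimal sketch of fold-change scaling on list-of-dict data: divide each value by the reference-label mean for that feature, then clip to the bounds. The `fold_change_scale` function is an illustrative assumption, not the library's implementation.

```python
from statistics import mean

def fold_change_scale(X, y, reference_label, bounds=(-10, 10)):
    """Scale each feature by its fold change against the reference-label
    mean, clipped to bounds (illustrative sketch)."""
    ref = [x for x, label in zip(X, y) if label == reference_label]
    ref_mean = {k: mean(d[k] for d in ref) for k in ref[0]}
    lo, hi = bounds
    return [{k: max(lo, min(hi, v / ref_mean[k])) for k, v in d.items()}
            for d in X]

X = [{'m1': 2.0}, {'m1': 4.0}, {'m1': 100.0}]
y = ['healthy', 'healthy', 'sick']
X_t = fold_change_scale(X, y, 'healthy')  # 100/3 would be 33.3, clipped to 10
```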
class sklearn_utils.preprocessing.FeatureRenaming(names, case_sensetive=False)[source]

Bases: sklearn.base.TransformerMixin

Preprocessing to re-name features.

__init__(names, case_sensetive=False)[source]
Names:dict which contains old feature names as keys and new names as values.
Case_sensetive:
 whether matching is performed case-sensitively.
fit(X, y=None)[source]
transform(X, y=None)[source]
X:list of dict
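The renaming step can be sketched as a key lookup with optional case folding. The `rename_features` function below is an illustrative assumption about the semantics, not the class's code.

```python
def rename_features(X, names, case_sensitive=False):
    """Rename dict keys via a mapping; optionally match the old
    names case-insensitively (illustrative sketch)."""
    if not case_sensitive:
        names = {k.lower(): v for k, v in names.items()}

    def new_key(k):
        lookup = k if case_sensitive else k.lower()
        return names.get(lookup, k)  # unknown keys pass through unchanged

    return [{new_key(k): v for k, v in d.items()} for d in X]

X = [{'GLC': 1.2, 'other': 0.5}]
X_t = rename_features(X, {'glc': 'glucose'})
```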
class sklearn_utils.preprocessing.DynamicPipeline[source]

Bases: object

Dynamic Pipeline

classmethod make_pipeline(selected_steps)[source]
class sklearn_utils.preprocessing.StandardScalerByLabel(reference_label)[source]

Bases: sklearn.preprocessing.data.StandardScaler

StandardScaler fit using only the samples with the given label.

__init__(reference_label)[source]
Reference_label:
 the label whose samples the scaling is performed by.
partial_fit(X, y)[source]
X:{array-like, sparse matrix}, shape [n_samples, n_features] The data used to compute the mean and standard deviation used for later scaling along the features axis.
Y:labels, e.g. 'h' for healthy or a sickness name
class sklearn_utils.preprocessing.FunctionalEnrichmentAnalysis(reference_label, feature_groups, method='fisher_exact', alternative='two-sided', filter_func=None)[source]

Bases: sklearn.base.TransformerMixin

Functional Enrichment Analysis

__init__(reference_label, feature_groups, method='fisher_exact', alternative='two-sided', filter_func=None)[source]
Reference_label:
 label of the reference values in the calculation
Method:only Fisher's exact test is available so far
Feature_groups:dict where keys are new features and values are lists of old features
Filter_func:function returning True or False
fit(X, y)[source]
transform(X, y=None)[source]
X:list of dict
Y:labels
class sklearn_utils.preprocessing.FeatureMerger(features, strategy='mean')[source]

Bases: sklearn.base.TransformerMixin

Merge some features based on given strategy.

__init__(features, strategy='mean')[source]
Features:dict which contains each new feature as a key and a list of old features as its value.
Strategy:strategy to merge features; 'mean', 'sum', and lambda functions are accepted.

Lambda function accepts list of values as input.

fit(X, y=None)[source]
transform(X, y=None)[source]
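Merging can be sketched as: drop the old features, then aggregate each group with the chosen strategy. The `merge_features` function is an illustrative assumption, not the library's implementation.

```python
from statistics import mean

def merge_features(X, features, strategy='mean'):
    """Merge old features into new ones by a strategy
    ('mean', 'sum', or a callable) -- illustrative sketch."""
    agg = {'mean': mean, 'sum': sum}.get(strategy, strategy)
    merged = []
    for d in X:
        # keep features that belong to no merge group
        row = {k: v for k, v in d.items()
               if not any(k in old for old in features.values())}
        for new, old in features.items():
            values = [d[k] for k in old if k in d]
            if values:
                row[new] = agg(values)
        merged.append(row)
    return merged

X = [{'a': 1.0, 'b': 3.0, 'c': 5.0}]
X_t = merge_features(X, {'ab': ['a', 'b']})  # 'a' and 'b' averaged into 'ab'
```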

Utils

sklearn_utils.utils.filter_by_label(X, y, ref_label, reverse=False)[source]

Selects the items with the given label from the dataset.

Parameters:
  • X – dataset
  • y – labels
  • ref_label – reference label
  • reverse (bool) – if False, selects samples with ref_label; if True, eliminates them
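A self-contained sketch of the documented semantics (the real function lives in `sklearn_utils.utils`; this re-implementation is for illustration only):

```python
def filter_by_label(X, y, ref_label, reverse=False):
    """Keep (or, with reverse=True, drop) the samples whose
    label equals ref_label -- sketch of the documented behavior."""
    pairs = [(x, label) for x, label in zip(X, y)
             if (label == ref_label) != reverse]
    X_f = [x for x, _ in pairs]
    y_f = [label for _, label in pairs]
    return X_f, y_f

X = [{'m': 1}, {'m': 2}, {'m': 3}]
y = ['healthy', 'sick', 'healthy']
X_h, y_h = filter_by_label(X, y, 'healthy')
```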
sklearn_utils.utils.average_by_label(X, y, ref_label)[source]

Calculates the average dictionary from a list of dictionaries for the given label.

Parameters:
  • X (List[Dict]) – dataset
  • y (list) – labels
  • ref_label – reference label
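The averaging step can be sketched as follows; this is an illustrative re-implementation of the documented behavior, not the library's code:

```python
from statistics import mean

def average_by_label(X, y, ref_label):
    """Average each key across the dicts carrying the given label
    (sketch of the documented semantics)."""
    selected = [x for x, label in zip(X, y) if label == ref_label]
    keys = {k for d in selected for k in d}
    return {k: mean(d[k] for d in selected if k in d) for k in keys}

X = [{'m': 2.0}, {'m': 4.0}, {'m': 9.0}]
y = ['h', 'h', 's']
avg = average_by_label(X, y, 'h')  # only the 'h' samples contribute
```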
sklearn_utils.utils.map_dict(d, key_func=None, value_func=None, if_func=None)[source]
Parameters:
  • d (dict) – dictionary
  • key_func (func) – func which will run on key.
  • value_func (func) – func which will run on values.
sklearn_utils.utils.map_dict_list(ds, key_func=None, value_func=None, if_func=None)[source]
Parameters:
  • ds (List[Dict]) – list of dict
  • key_func (func) – func which will run on key.
  • value_func (func) – func which will run on values.
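Both helpers can be sketched together; the pass-through defaults and the `(key, value)` signature of `if_func` are assumptions, not confirmed by the docs above:

```python
def map_dict(d, key_func=None, value_func=None, if_func=None):
    """Apply key_func/value_func to the items that pass if_func
    (illustrative sketch of the documented behavior)."""
    key_func = key_func or (lambda k: k)
    value_func = value_func or (lambda v: v)
    if_func = if_func or (lambda k, v: True)
    return {key_func(k): value_func(v) for k, v in d.items() if if_func(k, v)}

def map_dict_list(ds, **kwargs):
    """Apply map_dict to every dict in a list."""
    return [map_dict(d, **kwargs) for d in ds]

out = map_dict_list([{'A': 1}, {'B': 2}],
                    key_func=str.lower, value_func=lambda v: v * 10)
```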
class sklearn_utils.utils.SkUtilsIO(path, gz=False)[source]

Bases: object

IO class to read and write dataset to file.

__init__(path, gz=False)[source]
Path:file name with path.
from_csv(label_column='labels')[source]

Read dataset from csv.

from_json()[source]

Reads dataset from json.

from_pickle()[source]

Reads dataset from pickle.

to_csv(X, y)[source]

Writes dataset to csv.

to_json(X, y)[source]

Writes dataset to json.

Parameters:
  • X – dataset as list of dict.
  • y – labels.
to_pickle(X, y)[source]

Writes dataset to pickle.

Parameters:
  • X – dataset as list of dict.
  • y – labels.
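The CSV layout can be sketched with the stdlib: one column per feature plus a label column. The `to_csv` helper below writes to any file-like object and is an illustrative assumption about the format, not the class's actual output.

```python
import csv
import io

def to_csv(f, X, y, label_column='labels'):
    """Write a list-of-dict dataset plus labels as CSV (sketch)."""
    keys = sorted({k for d in X for k in d})
    writer = csv.DictWriter(f, fieldnames=keys + [label_column])
    writer.writeheader()
    for d, label in zip(X, y):
        writer.writerow({**d, label_column: label})

buf = io.StringIO()
to_csv(buf, [{'m': 1}], ['healthy'])
```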

Noise

class sklearn_utils.noise.SelectNotKBest(**kwargs)[source]

Bases: sklearn.base.TransformerMixin

Selects all features except the K best.

__init__(**kwargs)[source]
fit(X, y)[source]
get_support()[source]
transform(X)[source]

Transform to select all but the k best features :param X: np.matrix
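The selection amounts to inverting a k-best support mask and keeping the remaining columns. The `select_not_k_best` function is an illustrative sketch; in the real class the mask would come from a fitted `SelectKBest.get_support()`.

```python
def select_not_k_best(X, support):
    """Keep the columns NOT flagged by a k-best support mask
    (illustrative sketch)."""
    keep = [i for i, s in enumerate(support) if not s]
    return [[row[i] for i in keep] for row in X]

X = [[1, 2, 3], [4, 5, 6]]
support = [False, True, False]   # column 1 is among the k best
X_noise = select_not_k_best(X, support)
```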

class sklearn_utils.noise.NoiseGenerator(noise_func, noise_func_args)[source]

Bases: sklearn.base.TransformerMixin

Add noise to dataset

__init__(noise_func, noise_func_args)[source]

Add noise to data :noise_func: a function which generates noise with the same shape as the data :noise_func_args: arguments of the noise function

fit(X, y)[source]
relative_noise_size(data, noise)[source]
Data:original data as numpy matrix
Noise:noise matrix as numpy matrix
transform(X)[source]
X:numpy ndarray
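The two operations can be sketched in pure Python: generate noise elementwise with the same shape as the data, and report the ratio of the noise norm to the data norm. Both `add_noise` and this definition of `relative_noise_size` are illustrative assumptions; the library may compute them differently.

```python
import math
import random

def relative_noise_size(data, noise):
    """Ratio of the Frobenius norm of the noise to that of the data
    (a plausible definition, sketched for illustration)."""
    norm = lambda m: math.sqrt(sum(v * v for row in m for v in row))
    return norm(noise) / norm(data)

def add_noise(X, noise_func, **noise_func_args):
    """Add noise of the same shape as X, one draw per element."""
    noise = [[noise_func(**noise_func_args) for _ in row] for row in X]
    noisy = [[v + n for v, n in zip(row, nrow)]
             for row, nrow in zip(X, noise)]
    return noisy, noise

random.seed(0)
X = [[3.0, 4.0]]
noisy, noise = add_noise(X, random.gauss, mu=0.0, sigma=0.1)
```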