Module imodelsx.linear_ngram

Simple scikit-learn interface for fitting a linear model on top of n-gram features (count or TF-IDF vectors).

Classes

class LinearNgram (checkpoint: str = 'tfidfvectorizer',
tokenizer=None,
ngrams=2,
all_ngrams=True,
random_state=None)
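import warnings

import numpy as np
from numpy.typing import ArrayLike
from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV, RidgeCV
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.validation import check_is_fitted


class LinearNgram(BaseEstimator):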
    def __init__(
        self,
        checkpoint: str = "tfidfvectorizer",
        tokenizer=None,
        ngrams=2,
        all_ngrams=True,
        random_state=None,
    ):
        """LinearNgram Class - use either LinearNgramClassifier or LinearNgramRegressor rather than initializing this class directly.

        Parameters
        ----------
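        checkpoint: str
            Name of vectorizer checkpoint: "countvectorizer" or "tfidfvectorizer"
        tokenizer
            Optional tokenizer callable passed to the vectorizer (None keeps the vectorizer's default)
        ngrams
            Order of ngrams to extract. 1 for unigrams, 2 for bigrams, etc.
        all_ngrams
            Whether to use ngrams of every order from 1 up to `ngrams`; if False, only order `ngrams` is used
        random_state
            Random seed for fitting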

        Example
        -------
        ```
        from imodelsx import LinearNgramClassifier
        import datasets
        import numpy as np

        # load data
        dset = datasets.load_dataset('rotten_tomatoes')['train']
        dset = dset.select(np.random.choice(len(dset), size=300, replace=False))
        dset_val = datasets.load_dataset('rotten_tomatoes')['validation']
        dset_val = dset_val.select(np.random.choice(len(dset_val), size=300, replace=False))


        # fit a simple ngram model
        m = LinearNgramClassifier()
        m.fit(dset['text'], dset['label'])
        preds = m.predict(dset_val['text'])
        acc = (preds == dset_val['label']).mean()
        print('validation acc', acc)
        ```
        """
        assert checkpoint in ["countvectorizer", "tfidfvectorizer"]
        self.checkpoint = checkpoint
        self.tokenizer = tokenizer
        self.ngrams = ngrams
        self.all_ngrams = all_ngrams
        self.random_state = random_state

    def fit(
        self,
        X_text: ArrayLike,
        y: ArrayLike,
        verbose=True,
    ):
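        """Extract n-gram features, then fit a linear model.

        Parameters
        ----------
        X_text: ArrayLike[str]
            Input documents, one string per sample.
        y: ArrayLike
            Targets: class labels for the classifier, continuous values for the regressor.
        verbose: bool
            Whether to print progress messages.
        """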

        # metadata
        if isinstance(self, ClassifierMixin):
            self.classes_ = unique_labels(y)
        if self.random_state is not None:
            np.random.seed(self.random_state)

        # set up model
        if verbose:
            print("initializing model...")

        # get embs
        if verbose:
            print("calculating embeddings...")
        if self.all_ngrams:
            lower_ngram = 1
        else:
            lower_ngram = self.ngrams

        # get vectorizer
        if self.checkpoint == "countvectorizer":
            self.vectorizer = CountVectorizer(
                tokenizer=self.tokenizer, ngram_range=(
                    lower_ngram, self.ngrams)
            )
        elif self.checkpoint == "tfidfvectorizer":
            self.vectorizer = TfidfVectorizer(
                tokenizer=self.tokenizer, ngram_range=(
                    lower_ngram, self.ngrams)
            )

        # get embs
        embs = self.vectorizer.fit_transform(X_text)

        # train linear
        warnings.filterwarnings("ignore", category=ConvergenceWarning)
        if verbose:
            print("training linear model...")
        if isinstance(self, ClassifierMixin):
            self.linear = LogisticRegressionCV()
        elif isinstance(self, RegressorMixin):
            self.linear = RidgeCV()
        self.linear.fit(embs, y)

        return self

    def predict(self, X_text):
        """For regression returns continuous output.
        For classification, returns discrete output.
        """
        check_is_fitted(self)
        embs = self.vectorizer.transform(X_text)
        return self.linear.predict(embs)

    def predict_proba(self, X_text):
        check_is_fitted(self)
        embs = self.vectorizer.transform(X_text)
        return self.linear.predict_proba(embs)

Base class for all estimators in scikit-learn.

Inheriting from this class provides default implementations of:

  • setting and getting parameters used by GridSearchCV and friends;
  • textual and HTML representation displayed in terminals and IDEs;
  • estimator serialization;
  • parameters validation;
  • data validation;
  • feature names validation.

Read more in the scikit-learn User Guide (section "Rolling your own estimator").

Notes

All estimators should specify all the parameters that can be set at the class level in their __init__ as explicit keyword arguments (no *args or **kwargs).

Examples

>>> import numpy as np
>>> from sklearn.base import BaseEstimator
>>> class MyEstimator(BaseEstimator):
...     def __init__(self, *, param=1):
...         self.param = param
...     def fit(self, X, y=None):
...         self.is_fitted_ = True
...         return self
...     def predict(self, X):
...         return np.full(shape=X.shape[0], fill_value=self.param)
>>> estimator = MyEstimator(param=2)
>>> estimator.get_params()
{'param': 2}
>>> X = np.array([[1, 2], [2, 3], [3, 4]])
>>> y = np.array([1, 0, 1])
>>> estimator.fit(X, y).predict(X)
array([2, 2, 2])
>>> estimator.set_params(param=3).fit(X, y).predict(X)
array([3, 3, 3])

LinearNgram Class - use either LinearNgramClassifier or LinearNgramRegressor rather than initializing this class directly.

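Parameters

checkpoint : str
Name of vectorizer checkpoint: "countvectorizer" or "tfidfvectorizer".
tokenizer
Optional tokenizer callable passed to the vectorizer. None keeps the vectorizer's default tokenization.
ngrams
Order of n-grams to extract: 1 for unigrams, 2 for bigrams, etc.
all_ngrams
Whether to use n-grams of every order from 1 up to ngrams. If False, only n-grams of order exactly ngrams are used.
random_state
Random seed for fitting.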
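
As a rough illustration of how ngrams and all_ngrams interact, the sketch below mirrors the lower/upper range that fit passes to the vectorizer as ngram_range. The helper names `ngram_range` and `extract_ngrams` are hypothetical, and the whitespace tokenization is a simplification: the scikit-learn vectorizers apply their own token pattern, so their vocabulary may differ.

```python
def ngram_range(ngrams: int, all_ngrams: bool) -> tuple:
    """Return the (lower, upper) n-gram orders selected by the two parameters."""
    lower = 1 if all_ngrams else ngrams
    return (lower, ngrams)


def extract_ngrams(text: str, ngrams: int = 2, all_ngrams: bool = True) -> list:
    """Split `text` on whitespace and list every n-gram in the selected range."""
    tokens = text.lower().split()
    lower, upper = ngram_range(ngrams, all_ngrams)
    out = []
    for n in range(lower, upper + 1):
        # Slide a window of length n over the token sequence.
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


print(ngram_range(2, True))    # unigrams and bigrams
print(ngram_range(2, False))   # bigrams only
print(extract_ngrams("a great movie", ngrams=2, all_ngrams=True))
print(extract_ngrams("a great movie", ngrams=2, all_ngrams=False))
```

With all_ngrams=True the feature space grows with every lower order as well, which is usually what you want for short texts where unigrams carry most of the signal.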

Example

from imodelsx import LinearNgramClassifier
import datasets
import numpy as np

# load data
dset = datasets.load_dataset('rotten_tomatoes')['train']
dset = dset.select(np.random.choice(len(dset), size=300, replace=False))
dset_val = datasets.load_dataset('rotten_tomatoes')['validation']
dset_val = dset_val.select(np.random.choice(len(dset_val), size=300, replace=False))


# fit a simple ngram model
m = LinearNgramClassifier()
m.fit(dset['text'], dset['label'])
preds = m.predict(dset_val['text'])
acc = (preds == dset_val['label']).mean()
print('validation acc', acc)

Ancestors

  • sklearn.base.BaseEstimator
  • sklearn.utils._repr_html.base.ReprHTMLMixin
  • sklearn.utils._repr_html.base._HTMLDocumentationLinkMixin
  • sklearn.utils._metadata_requests._MetadataRequester

Subclasses

  • LinearNgramClassifier
  • LinearNgramRegressor

Methods

def fit(self,
X_text: ArrayLike,
y: ArrayLike,
verbose=True)

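Extract n-gram features, then fit a linear model.

Parameters

X_text : ArrayLike[str]
Input documents, one string per sample.
y : ArrayLike
Targets: class labels for the classifier, continuous values for the regressor.
verbose : bool
Whether to print progress messages.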
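
The two steps of fit (vectorize, then fit a cross-validated linear model) can be sketched directly with scikit-learn. This is a minimal approximation of the classifier path with the default "tfidfvectorizer" checkpoint, not the library's code: the toy corpus is made up, and `cv=3` is an assumption chosen only because the corpus is tiny (the class itself uses the `LogisticRegressionCV` defaults).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Toy corpus (made up for illustration): 6 positive and 6 negative reviews.
texts = [
    "a great movie", "great acting and a great plot", "truly wonderful film",
    "wonderful and moving", "great fun", "a wonderful great story",
    "a terrible movie", "terrible acting and a boring plot", "truly boring film",
    "boring and dull", "terrible mess", "a dull boring story",
]
labels = np.array([1] * 6 + [0] * 6)

# Step 1: n-gram features (unigrams + bigrams, as with ngrams=2, all_ngrams=True).
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
embs = vectorizer.fit_transform(texts)

# Step 2: cross-validated logistic regression on the sparse feature matrix.
linear = LogisticRegressionCV(cv=3)
linear.fit(embs, labels)

# Prediction reuses the fitted vocabulary via transform(), not fit_transform().
new_embs = vectorizer.transform(["a great film", "a boring film"])
preds = linear.predict(new_embs)
print(embs.shape, preds)
```

Keeping the fitted vectorizer around matters: calling fit_transform again at prediction time would build a different vocabulary, and the learned coefficients would no longer line up with the feature columns.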
 
def predict(self, X_text)

Returns continuous predictions for regression and discrete class labels for classification.
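
To make the regression case concrete, here is a small sketch of the same vectorize-then-predict flow using `CountVectorizer` and `RidgeCV`, the model the regressor path fits. The texts and ratings are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeCV

# Invented toy data: short texts with a continuous rating as the target.
texts = ["bad", "bad bad", "fine", "good", "good good", "very good good"]
ratings = np.array([1.0, 0.5, 2.5, 4.0, 4.5, 5.0])

# Count-based unigram and bigram features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
embs = vectorizer.fit_transform(texts)

# Ridge regression with built-in cross-validation over the regularization strength.
reg = RidgeCV().fit(embs, ratings)

# predict returns one floating-point value per input text.
preds = reg.predict(vectorizer.transform(["good", "bad"]))
print(preds.dtype, preds)
```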

def predict_proba(self, X_text)
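
predict_proba delegates to the fitted linear model, so it is only meaningful for the classifier: `LogisticRegressionCV` provides `predict_proba`, while `RidgeCV` does not define it. The sketch below shows the shape of the output on an invented toy corpus; `cv=3` is an assumption made for the small sample size.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV, RidgeCV

# Invented toy corpus: 6 positive and 6 negative two-word reviews.
nouns = ["film", "plot", "cast", "fun", "story", "score"]
texts = [f"good {n}" for n in nouns] + [f"bad {n}" for n in nouns]
labels = np.array([1] * 6 + [0] * 6)

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegressionCV(cv=3).fit(vectorizer.fit_transform(texts), labels)

# One row per input text, one column per class in clf.classes_ order.
proba = clf.predict_proba(vectorizer.transform(["good film", "bad film"]))
print(clf.classes_)
print(proba.sum(axis=1))  # each row sums to 1

# The regression model has no probability output.
print(hasattr(RidgeCV(), "predict_proba"))
```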
def set_fit_request(self: LinearNgram,
*,
X_text: bool | str | None = '$UNCHANGED$',
verbose: bool | str | None = '$UNCHANGED$') -> LinearNgram
def func(*args, **kw):
    """Updates the `_metadata_request` attribute of the consumer (`instance`)
    for the parameters provided as `**kw`.

    This docstring is overwritten below.
    See REQUESTER_DOC for expected functionality.
    """
    if not _routing_enabled():
        raise RuntimeError(
            "This method is only available when metadata routing is enabled."
            " You can enable it using"
            " sklearn.set_config(enable_metadata_routing=True)."
        )

    if self.validate_keys and (set(kw) - set(self.keys)):
        raise TypeError(
            f"Unexpected args: {set(kw) - set(self.keys)} in {self.name}. "
            f"Accepted arguments are: {set(self.keys)}"
        )

    # This makes it possible to use the decorated method as an unbound method,
    # for instance when monkeypatching.
    # https://github.com/scikit-learn/scikit-learn/issues/28632
    if instance is None:
        _instance = args[0]
        args = args[1:]
    else:
        _instance = instance

    # Replicating python's behavior when positional args are given other than
    # `self`, and `self` is only allowed if this method is unbound.
    if args:
        raise TypeError(
            f"set_{self.name}_request() takes 0 positional argument but"
            f" {len(args)} were given"
        )

    requests = _instance._get_metadata_request()
    method_metadata_request = getattr(requests, self.name)

    for prop, alias in kw.items():
        if alias is not UNCHANGED:
            method_metadata_request.add_request(param=prop, alias=alias)
    _instance._metadata_request = requests

    return _instance

Configure whether metadata should be requested to be passed to the `fit` method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with `enable_metadata_routing=True` (see `sklearn.set_config`). Please check the scikit-learn User Guide on metadata routing for how the routing mechanism works.

The options for each parameter are:

- `True`: metadata is requested, and passed to `fit` if provided. The request is ignored if metadata is not provided.
- `False`: metadata is not requested and the meta-estimator will not pass it to `fit`.
- `None`: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- `str`: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (`sklearn.utils.metadata_routing.UNCHANGED`) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

X_text : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the `X_text` parameter in `fit`.
verbose : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the `verbose` parameter in `fit`.

Returns

self : object
The updated object.
def set_predict_proba_request(self: LinearNgram,
*,
X_text: bool | str | None = '$UNCHANGED$') -> LinearNgram

Configure whether metadata should be requested to be passed to the `predict_proba` method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with `enable_metadata_routing=True` (see `sklearn.set_config`). Please check the scikit-learn User Guide on metadata routing for how the routing mechanism works.

The options for each parameter are:

- `True`: metadata is requested, and passed to `predict_proba` if provided. The request is ignored if metadata is not provided.
- `False`: metadata is not requested and the meta-estimator will not pass it to `predict_proba`.
- `None`: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- `str`: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (`sklearn.utils.metadata_routing.UNCHANGED`) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

X_text : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the `X_text` parameter in `predict_proba`.

Returns

self : object
The updated object.
def set_predict_request(self: LinearNgram,
*,
X_text: bool | str | None = '$UNCHANGED$') -> LinearNgram

Configure whether metadata should be requested to be passed to the `predict` method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with `enable_metadata_routing=True` (see `sklearn.set_config`). Please check the scikit-learn User Guide on metadata routing for how the routing mechanism works.

The options for each parameter are:

- `True`: metadata is requested, and passed to `predict` if provided. The request is ignored if metadata is not provided.
- `False`: metadata is not requested and the meta-estimator will not pass it to `predict`.
- `None`: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- `str`: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (`sklearn.utils.metadata_routing.UNCHANGED`) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

X_text : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the `X_text` parameter in `predict`.

Returns

self : object
The updated object.
class LinearNgramClassifier (checkpoint: str = 'tfidfvectorizer',
tokenizer=None,
ngrams=2,
all_ngrams=True,
random_state=None)
class LinearNgramClassifier(LinearNgram, ClassifierMixin):
    ...
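
The class body is empty because the behavior is selected by the mixin: LinearNgram.fit checks `isinstance(self, ClassifierMixin)` or `isinstance(self, RegressorMixin)` to decide which linear model to create. The sketch below reproduces that dispatch pattern in plain Python. All class and method names here are hypothetical stand-ins, and the explicit `TypeError` for the bare base class is an addition for clarity rather than something the original code does.

```python
class ClassifierMarker:
    """Hypothetical stand-in for sklearn.base.ClassifierMixin."""


class RegressorMarker:
    """Hypothetical stand-in for sklearn.base.RegressorMixin."""


class Base:
    def pick_linear_model(self) -> str:
        # Same branching pattern as LinearNgram.fit: inspect the mixin on `self`.
        if isinstance(self, ClassifierMarker):
            return "LogisticRegressionCV"
        elif isinstance(self, RegressorMarker):
            return "RidgeCV"
        # Added for illustration: fail loudly when no mixin is present.
        raise TypeError("use a Classifier or Regressor subclass")


class Classifier(Base, ClassifierMarker):
    ...


class Regressor(Base, RegressorMarker):
    ...


print(Classifier().pick_linear_model())
print(Regressor().pick_linear_model())
```

This is why the documentation advises against instantiating LinearNgram directly: without one of the mixins, neither branch applies and no linear model is created.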

Ancestors

  • LinearNgram
  • sklearn.base.BaseEstimator
  • sklearn.utils._repr_html.base.ReprHTMLMixin
  • sklearn.utils._repr_html.base._HTMLDocumentationLinkMixin
  • sklearn.utils._metadata_requests._MetadataRequester
  • sklearn.base.ClassifierMixin

Methods

def set_score_request(self: LinearNgramClassifier,
*,
sample_weight: bool | str | None = '$UNCHANGED$') -> LinearNgramClassifier

Configure whether metadata should be requested to be passed to the `score` method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with `enable_metadata_routing=True` (see `sklearn.set_config`). Please check the scikit-learn User Guide on metadata routing for how the routing mechanism works.

The options for each parameter are:

- `True`: metadata is requested, and passed to `score` if provided. The request is ignored if metadata is not provided.
- `False`: metadata is not requested and the meta-estimator will not pass it to `score`.
- `None`: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- `str`: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (`sklearn.utils.metadata_routing.UNCHANGED`) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the `sample_weight` parameter in `score`.

Returns

self : object
The updated object.

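Inherited members

  • LinearNgram: fit, predict, predict_proba, set_fit_request, set_predict_proba_request, set_predict_request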

class LinearNgramRegressor (checkpoint: str = 'tfidfvectorizer',
tokenizer=None,
ngrams=2,
all_ngrams=True,
random_state=None)
class LinearNgramRegressor(LinearNgram, RegressorMixin):
    ...

Ancestors

  • LinearNgram
  • sklearn.base.BaseEstimator
  • sklearn.utils._repr_html.base.ReprHTMLMixin
  • sklearn.utils._repr_html.base._HTMLDocumentationLinkMixin
  • sklearn.utils._metadata_requests._MetadataRequester
  • sklearn.base.RegressorMixin

Methods

def set_score_request(self: LinearNgramRegressor,
*,
sample_weight: bool | str | None = '$UNCHANGED$') -> LinearNgramRegressor

Configure whether metadata should be requested to be passed to the `score` method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with `enable_metadata_routing=True` (see `sklearn.set_config`). Please check the scikit-learn User Guide on metadata routing for how the routing mechanism works.

The options for each parameter are:

- `True`: metadata is requested, and passed to `score` if provided. The request is ignored if metadata is not provided.
- `False`: metadata is not requested and the meta-estimator will not pass it to `score`.
- `None`: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- `str`: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (`sklearn.utils.metadata_routing.UNCHANGED`) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the `sample_weight` parameter in `score`.

Returns

self : object
The updated object.

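Inherited members

  • LinearNgram: fit, predict, predict_proba, set_fit_request, set_predict_proba_request, set_predict_request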