Julian Mehne

More sanity in jupyter notebooks with pandas.

python, pandas, jupyter

Pandas code often evolves from exploration in a notebook into polished, "regular" Python code. To speed this process up, it is key to strike the right balance in the notebook between quick & dirty code and "production" code.

IMO, the following building blocks go a long way and can be used easily, directly in a notebook:

  1. Write small functions that take your dataframe(s) and return new dataframes. If you skip this step, your notebook will become a mess. (I would also vote for pandera dataframe models with column validators on top; see the sketch after this list.)
  2. Prefer method chaining when writing pandas transformations.
  3. Use a logging decorator to add basic logging to functions, where appropriate.
  4. For any row filter operation, log the number of rows before/after and how many rows were removed/kept.
  5. For any merge operation, log information about the sizes of the left, right and the merged dataframes and follow the conventions below.
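
For item 1, a rough sketch of what a pandera dataframe model could look like (assuming pandera is installed; the schema name, columns, and checks are made-up placeholders and are not used by the later examples):

import pandas as pd
import pandera as pa

from pandera.typing import DataFrame, Series


class FooBarSchema(pa.DataFrameModel):
    """Hypothetical schema: adapt column names, dtypes and checks to your data."""

    foo: Series[int] = pa.Field(ge=0)
    bar: Series[str]


def double_foo(df: DataFrame[FooBarSchema]) -> pd.DataFrame:
    # Validate early so schema violations fail loudly in the notebook.
    df = FooBarSchema.validate(df)
    return df.assign(foo_doubled=lambda df_: df_["foo"] * 2)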

Note: This article assumes that any overhead introduced by these wrappers is negligible, because you are processing a lot of data at once.

Using a logging decorator

Decorate your transformation functions with a simple log decorator. For example (this assumes that you initialize your logging somewhere else, e.g. with logging.basicConfig(level=logging.INFO)):

import datetime as dt
import logging
import typing

from functools import wraps

def log_step(func: typing.Callable) -> typing.Callable:
    """Log the shape of the returned dataframe and the runtime of the wrapped function."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        tic = dt.datetime.now()
        df = func(*args, **kwargs)
        toc = dt.datetime.now()
        # total_seconds() also covers durations longer than a second, unlike .microseconds.
        logging.getLogger(func.__module__).info(
            f"{func.__name__}: returned shape={df.shape}, {(toc - tic).total_seconds():.6f} s"
        )
        return df

    return wrapper

Of course, we can extend this to log the shape of any dataframe that is returned, log the difference in shapes, and so on. If we want to pass parameters to the decorator itself, it gets a bit more involved. For a ready-made log step decorator, see sklego's log_step.
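
As a minimal sketch of a parameterized variant (the extra decorator-factory layer and the `level` parameter are my own illustration, not the sklego implementation):

import datetime as dt
import logging
import typing

from functools import wraps


def log_step_with_level(level: int = logging.INFO) -> typing.Callable:
    """Decorator factory: takes parameters and returns the actual decorator."""

    def decorator(func: typing.Callable) -> typing.Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            tic = dt.datetime.now()
            df = func(*args, **kwargs)
            toc = dt.datetime.now()
            logging.getLogger(func.__module__).log(
                level,
                f"{func.__name__}: returned shape={df.shape}, "
                f"{(toc - tic).total_seconds():.6f} s",
            )
            return df

        return wrapper

    return decorator


# Note the parentheses: log_step_with_level(...) returns the decorator.
@log_step_with_level(level=logging.DEBUG)
def drop_last_two_rows(df):
    return df.iloc[:-2, :]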

Filter operations

Logging the impact of any row filter operation makes it easier to follow the flow in a notebook. It also ensures that when you re-run the notebook half a year later with new data, you can easily check the results for plausibility:

import logging
import typing

import pandas as pd

logger = logging.getLogger(__name__)


def filter_rows(
    df: pd.DataFrame,
    name: str,
    bool_filter_keep: typing.Callable | None = None,
    bool_filter_remove: typing.Callable | None = None,
    only_log_changes: bool = False,
) -> pd.DataFrame:
    """
    Parameters:
        bool_filter_keep: usually a lambda that takes a dataframe as the only parameter.
            When called with a df, must return a boolean Series.
        bool_filter_remove: usually a lambda that takes a dataframe as the only parameter.
            When called with a df, must return a boolean Series.

    Example:
        df.pipe(filter_rows, bool_filter_keep=lambda df_: df_["foobar"] == 3)
    """
    if bool_filter_keep is None and bool_filter_remove is None:
        raise ValueError("Either `bool_filter_keep` or `bool_filter_remove` must be passed.")
    elif bool_filter_keep is not None and bool_filter_remove is not None:
        raise ValueError("Only `bool_filter_keep` or `bool_filter_remove` must be passed.")

    shape_in = df.shape
    if bool_filter_keep:
        mask_keep = bool_filter_keep(df)
    elif bool_filter_remove:
        mask_keep = ~bool_filter_remove(df)
    else:
        raise ValueError("At this point, one filter must exist.")

    df = df.loc[mask_keep, :]

    shape_out = df.shape
    n_in, n_out = shape_in[0], shape_out[0]
    if n_in != n_out or not only_log_changes:
        logger.info(
            f"Row filter ({name}): {n_out - n_in:+} rows, "
            f"(removed: {n_in - mask_keep.sum()}, {(n_in - mask_keep.sum()) / n_in:.2f}) "
            f"(kept: {mask_keep.sum()}, {mask_keep.sum() / n_in:.2f}) "
            f"({shape_in=}, {shape_out=})"
        )
    return df

This seems like a lot of bloat for a single line of logic (df = df.loc[mask_keep, :]), but it comes with clear benefits as soon as the filters become more complex. A simplistic example:

@log_step
def transform(df):
    return df.iloc[:-2,:]

df = pd.DataFrame.from_dict(
    dict(
        foo=[1, 2, 3, 4, 5, 6, 7],
        bar=["a", "b", "c", "d", "e", "f", "g"],
    )
)

transform(df).pipe(
    filter_rows,
    bool_filter_keep=lambda df_: df_["foo"] <= 3,
    name="valid-foo",
).pipe(
    filter_rows,
    bool_filter_remove=lambda df_: df_["foo"] == 3,
    name="no-three",
)

Output:

INFO:transform:transform: returned shape=(5, 2), 0.000109 s
INFO:minimal:Row filter (valid-foo): -2 rows, (removed: 2, 0.40) (kept: 3, 0.60) (shape_in=(5, 2), shape_out=(3, 2))
INFO:minimal:Row filter (no-three): -1 rows, (removed: 1, 0.33) (kept: 2, 0.67) (shape_in=(3, 2), shape_out=(2, 2))

Note: this will always log with the logger of the module that contains the filter_rows function. It should be possible to log with the logger of the caller (see here, untested).
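
One possible (untested) sketch of that idea, using Python's inspect module; the helper name and the depth handling are my own assumptions:

import inspect
import logging


def get_caller_logger(depth: int = 2) -> logging.Logger:
    """Return a logger named after the module of the calling code.

    depth=2 skips this helper and the function calling it (e.g. filter_rows),
    landing on the frame that invoked filter_rows. When filter_rows is called
    via df.pipe(...), pandas adds an extra frame, so the depth may need adjusting.
    """
    frame = inspect.stack()[depth].frame
    module_name = frame.f_globals.get("__name__", __name__)
    return logging.getLogger(module_name)


# Inside filter_rows, one could then replace the module-level logger with:
#     logger = get_caller_logger()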

Merge operations

The next fragile operations in pandas codebases are merges. Make sure to follow the conventions from the docstring below: use pandas' `validate` parameter and always pass `on` (or `left_on` and `right_on`) explicitly.

For logging, the approach is similar to the previous one:

def merge_dfs(
    left: pd.DataFrame,
    right: pd.DataFrame,
    name: str,
    **kwargs,
) -> pd.DataFrame:
    """
    Consider following these conventions:
    - Use pandas' `validate` parameter.
    - Always pass `on`, or `left_on` and `right_on`.
    """
    shape_left, shape_right = left.shape, right.shape

    df = left.merge(
        right=right,
        **kwargs
    )

    shape_out = df.shape
    n_left, n_out = shape_left[0], shape_out[0]
    logger.info(
        f"Merging ({name}): {n_out - n_left:+} rows, "
        f"({shape_left=}, {shape_out=}, {shape_right=})"
    )
    return df

# Simplistic example when doing method chaining:
df_a.pipe(
    merge_dfs,
    name="Merging a with b",
    right=df_b,
    on=["foo"],
    how="left",
)

Output:

INFO:minimal:Merging (Merging a with b): +1 rows, (shape_left=(7, 2), shape_out=(8, 3), shape_right=(4, 2))
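
To also exercise the `validate` convention from the docstring, one can forward it through `**kwargs`. The cardinality below is only an illustration and has to match your data; with the duplicated key implied by the "+1 rows" above, this particular check would raise a pandas.errors.MergeError:

df_a.pipe(
    merge_dfs,
    name="Merging a with b",
    right=df_b,
    on=["foo"],
    how="left",
    validate="many_to_one",  # assert that `foo` is unique in df_b
)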

Conclusion

Writing pandas transformations in a chained-method style lends itself to logging additional information along the way, which keeps you sane and lets you move fast when whipping up your PoC of the day.

Open points and references