More sanity in Jupyter notebooks with pandas.
Writing pandas code often evolves from exploration in a notebook to polished, "regular" Python code. To speed this process up, it is key to find the right balance in the notebook between quick & dirty code and "production" code:
- Shitting out code in a notebook without any thought towards basic standards will quickly lead to the dreaded spaghetti notebooks no one understands any more after two weeks. Maybe LLMs will help with that in the future.
- Writing everything to your polished standards will slow you down too much and you will waste time on polishing code that will just sit in one of your playground repositories.
IMO, the following building blocks go a long way and can be used easily, directly in a notebook:
- Write small functions that take your dataframe(s) and return new dataframes. If you skip this step, your notebook will become a mess (I would also vote for pandera dataframe models on top, with column validators; see the sketch after this list).
- Prefer method chaining when writing pandas transformations.
- Use a logging decorator to add basic logging to functions, where appropriate.
- For any row filter operation, log the number of rows before/after and how many rows were removed/kept.
- For any merge operation, log information about the sizes of the left, right and the merged dataframes and follow the conventions below.
Note: This article assumes that any overhead introduced by these wrappers is negligible, because you are processing a lot of data at once.
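To illustrate the pandera idea from the first bullet: a dataframe model declares the expected columns plus simple validators, and a decorator validates inputs and outputs of a function at call time. A minimal sketch, assuming pandera is installed (the model and column names here are made up):
import pandera as pa
from pandera.typing import DataFrame, Series

class CleanData(pa.DataFrameModel):
    foo: Series[int] = pa.Field(ge=0)  # column must exist, be int, and be non-negative
    bar: Series[str]

@pa.check_types
def drop_zero_foo(df: DataFrame[CleanData]) -> DataFrame[CleanData]:
    # Input and output are validated against the model when the function is called.
    return df.loc[df["foo"] > 0, :]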
Using a logging decorator
Decorate your transformation functions with a simple log decorator.
For example (this assumes that you initialize your logging somewhere else, e.g. with logging.basicConfig(level=logging.INFO)):
import datetime as dt
import logging
import typing
from functools import wraps

def log_step(func: typing.Callable) -> typing.Callable:
    """Wraps a function that returns a single dataframe, logging its shape and runtime."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        tic = dt.datetime.now()
        df = func(*args, **kwargs)
        toc = dt.datetime.now()
        # Use total_seconds(): .microseconds only holds the sub-second component
        # and would report wrong durations for anything that runs longer than a second.
        logging.getLogger(func.__module__).info(
            f"{func.__name__}: returned shape={df.shape}, {(toc - tic).total_seconds()} s"
        )
        return df

    return wrapper
We can of course extend this to log the shape of any dataframe that is returned, to log the difference in shapes, and so on. If we want to pass parameters to the decorator itself, it's a bit more involved (see the sketch below). For a ready-made log step decorator, see sklego's log_step.
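For a configurable log level, for example, we need a decorator factory: a function that takes the parameters and returns the actual decorator, adding one more level of nesting. A minimal sketch (the name log_step_with and the level parameter are my own, not from sklego):
def log_step_with(level: int = logging.INFO) -> typing.Callable:
    """Decorator factory: like log_step, but with a configurable log level."""

    def decorator(func: typing.Callable) -> typing.Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            tic = dt.datetime.now()
            df = func(*args, **kwargs)
            toc = dt.datetime.now()
            logging.getLogger(func.__module__).log(
                level,
                f"{func.__name__}: returned shape={df.shape}, {(toc - tic).total_seconds()} s",
            )
            return df

        return wrapper

    return decorator

# Note the parentheses: log_step_with(...) returns the decorator.
@log_step_with(level=logging.DEBUG)
def transform(df):
    return df.dropna()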
Filter operations
Logging the impact of every row filter operation makes it easier to follow the flow in a notebook, and ensures that when you re-run the notebook half a year later with new data, you can easily check plausibility:
import logging
import typing

import pandas as pd

logger = logging.getLogger(__name__)

def filter_rows(
    df: pd.DataFrame,
    name: str,
    bool_filter_keep: typing.Callable | None = None,
    bool_filter_remove: typing.Callable | None = None,
    only_log_changes: bool = False,
) -> pd.DataFrame:
    """Filter the rows of a dataframe and log the impact.

    Parameters:
        bool_filter_keep: usually a lambda that takes a dataframe as the only parameter.
            When called with a df, must return a boolean Series; rows where it is True are kept.
        bool_filter_remove: usually a lambda that takes a dataframe as the only parameter.
            When called with a df, must return a boolean Series; rows where it is True are removed.
        only_log_changes: if True, only log when the filter actually changed the number of rows.

    Example:
        df.pipe(filter_rows, name="foobar-is-3", bool_filter_keep=lambda df_: df_["foobar"] == 3)
    """
    if bool_filter_keep is None and bool_filter_remove is None:
        raise ValueError("Either `bool_filter_keep` or `bool_filter_remove` must be passed.")
    elif bool_filter_keep is not None and bool_filter_remove is not None:
        raise ValueError("Pass only one of `bool_filter_keep` and `bool_filter_remove`.")

    shape_in = df.shape
    if bool_filter_keep is not None:
        mask_keep = bool_filter_keep(df)
    else:
        mask_keep = ~bool_filter_remove(df)
    df = df.loc[mask_keep, :]
    shape_out = df.shape

    n_in, n_out = shape_in[0], shape_out[0]
    n_kept = int(mask_keep.sum())
    if n_in != n_out or not only_log_changes:
        logger.info(
            f"Row filter ({name}): {n_out - n_in:+} rows, "
            f"(removed: {n_in - n_kept}, {(n_in - n_kept) / n_in:.2f}) "
            f"(kept: {n_kept}, {n_kept / n_in:.2f}) "
            f"({shape_in=}, {shape_out=})"
        )
    return df
This seems like a lot of bloat for a single line of logic (df = df.loc[mask_keep, :]), but it comes with clear benefits as soon as the filters become more complex.
Simplistic example:
@log_step
def transform(df):
    return df.iloc[:-2, :]

df = pd.DataFrame.from_dict(
    dict(
        foo=[1, 2, 3, 4, 5, 6, 7],
        bar=["a", "b", "c", "d", "e", "f", "g"],
    )
)

transform(df).pipe(
    filter_rows,
    bool_filter_keep=lambda df_: df_["foo"] <= 3,
    name="valid-foo",
).pipe(
    filter_rows,
    bool_filter_remove=lambda df_: df_["foo"] == 3,
    name="no-three",
)
Output:
INFO:minimal:transform: returned shape=(5, 2), 0.000109 s
INFO:minimal:Row filter (valid-foo): -2 rows, (removed: 2, 0.40) (kept: 3, 0.60) (shape_in=(5, 2), shape_out=(3, 2))
INFO:minimal:Row filter (no-three): -1 rows, (removed: 1, 0.33) (kept: 2, 0.67) (shape_in=(3, 2), shape_out=(2, 2))
Note: this will always log with the logger of the module that contains the filter_rows function. It should be possible to log with the logger of the caller instead (see here, untested).
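An untested sketch of that idea, using inspect to resolve the caller's module name (the helper name and the stack depth are assumptions; calling through df.pipe inserts pandas-internal frames, so the depth would need adjusting):
import inspect

def get_caller_logger(depth: int = 2) -> logging.Logger:
    # stack()[0] is this helper, [1] would be filter_rows, [2] its direct caller.
    frame = inspect.stack()[depth].frame
    return logging.getLogger(frame.f_globals.get("__name__", "__main__"))
Inside filter_rows, logger.info(...) would then become get_caller_logger().info(...).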
Merge operations
The next fragile operation in pandas codebases is the merge. Make sure to:
- Consider specifying the `validate` parameter, so that you document and check the expected merge cardinality (1:1, 1:n, n:1, n:n).
- Always explicitly list the columns to merge on (i.e., pass `on`, or `left_on` and `right_on`).
- Always log input and output shapes.
- Often it makes sense to explicitly filter the list of columns to use in the left/right dataframes, or at least to do it for one side, usually the right one (a combined example follows at the end of this section).
For logging, similar to the previous approach:
def merge_dfs(
    left: pd.DataFrame,
    right: pd.DataFrame,
    name: str,
    **kwargs,
) -> pd.DataFrame:
    """
    Consider following these conventions:
    - Use pandas' `validate` parameter.
    - Always pass `on`, or `left_on` and `right_on`.
    """
    shape_left, shape_right = left.shape, right.shape
    df = left.merge(right=right, **kwargs)
    shape_out = df.shape

    n_left, n_out = shape_left[0], shape_out[0]
    logger.info(
        f"Merging ({name}): {n_out - n_left:+} rows, "
        f"({shape_left=}, {shape_out=}, {shape_right=})"
    )
    return df
# Simplistic example when doing method chaining:
df_a.pipe(
    merge_dfs,
    name="Merging a with b",
    right=df_b,
    on=["foo"],
    how="left",
)
Output:
INFO:minimal:Merging (Merging a with b): +1 rows, (shape_left=(7, 2), shape_out=(8, 3), shape_right=(4, 2))
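Since merge_dfs forwards **kwargs to DataFrame.merge, the conventions from the list above combine naturally with the logging. A sketch, assuming df_b has columns foo and baz:
df_a.pipe(
    merge_dfs,
    name="attach-baz",
    # Keep only the join key and the columns we actually need from the right side.
    right=df_b.loc[:, ["foo", "baz"]],
    on=["foo"],
    how="left",
    validate="m:1",  # raises pandas.errors.MergeError if `foo` is not unique in df_b
)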
Conclusion
Writing pandas operations in the method-chaining style lends itself to logging additional information along the way, which keeps you sane and lets you go fast when whipping up your PoC-of-the-day.
Open points and references
- There are existing libraries that I need to explore further. See, e.g., pandas-log.