Dataset

Loading and converting soccer tracking data.

class unravel.soccer.KloppyPolarsDataset[source]

Bases: DefaultDataset

Convert Kloppy soccer tracking data to Polars DataFrame format.

This class takes tracking data loaded via Kloppy (supporting providers like Sportec, SkillCorner, Tracab, SecondSpectrum, etc.) and converts it into a fast, efficient Polars DataFrame with computed velocities, accelerations, and ball carrier inference.

The conversion process includes:

- Coordinate system standardization
- Velocity and acceleration computation with optional smoothing
- Ball carrier and ball owning team inference
- Goalkeeper position identification
- Speed and acceleration filtering to remove outliers
- Optional orientation normalization (attacking left-to-right)

Parameters:

kloppy_dataset – A Kloppy TrackingDataset instance containing the raw tracking data.
ball_carrier_threshold – Maximum distance (in meters) between player and ball to be considered the ball carrier. Defaults to 25.0.
max_player_speed – Maximum realistic player speed in m/s. Values above this are capped to limit the effect of sensor errors. Defaults to 12.0 m/s.
max_ball_speed – Maximum realistic ball speed in m/s. Values above this are capped. Defaults to 28.0 m/s.
max_player_acceleration – Maximum realistic player acceleration in m/s². Values above this are capped. Defaults to 6.0 m/s².
max_ball_acceleration – Maximum realistic ball acceleration in m/s². Values above this are capped. Defaults to 13.5 m/s².
orient_ball_owning – If True, normalize coordinates so the team with possession always attacks from left to right. Defaults to True.
add_smoothing – If True, apply Savitzky-Golay smoothing to velocities to reduce noise. Defaults to True.
**kwargs – Additional keyword arguments passed to DefaultDataset.

data

The converted Polars DataFrame with all tracking data.

Type: pl.DataFrame

settings

Configuration and metadata for the dataset.

Type: DefaultSettings

home_players

List of home team player objects.

Type: List[SoccerObject]

away_players

List of away team player objects.

Type: List[SoccerObject]

kloppy_dataset

The original Kloppy dataset.

Type: TrackingDataset

Raises:

Exception – If kloppy_dataset is not a TrackingDataset instance.
Exception – If ball_carrier_threshold is not a float.
ValueError – If the dataset orientation is NOT_SET.
ValueError – If ball owning team must be inferred but ball_carrier_threshold is None.

Example

>>> from kloppy import sportec
>>> from unravel.soccer import KloppyPolarsDataset
>>>
>>> # Load tracking data with Kloppy
>>> kloppy_dataset = sportec.load_open_tracking_data(only_alive=True)
>>>
>>> # Convert to Polars format
>>> polars_dataset = KloppyPolarsDataset(
...     kloppy_dataset=kloppy_dataset,
...     ball_carrier_threshold=25.0,
...     max_player_speed=12.0,
...     orient_ball_owning=True
... )
>>>
>>> # Access the DataFrame
>>> df = polars_dataset.data
>>> print(df.head())
>>>
>>> # Add dummy labels for training
>>> polars_dataset.add_dummy_labels(by=["frame_id"])
>>>
>>> # Add graph IDs for grouping
>>> polars_dataset.add_graph_ids(by=["frame_id"])

Note

For non-Sportec providers, always use only_alive=True or include_empty_frames=False when loading data with Kloppy to avoid frames without ball tracking data.

Warning

If the dataset doesn’t include ball owning team information, it will be inferred using distance to ball. This may cause unexpected results in situations where the ball is contested or in the air.

See also

SoccerGraphConverter: Convert to graph structures.
add_dummy_label_column(): Add labels for training.
add_graph_id_column(): Add graph IDs for grouping.

__init__(kloppy_dataset, ball_carrier_threshold=25.0, max_player_speed=12.0, max_ball_speed=28.0, max_player_acceleration=6.0, max_ball_acceleration=13.5, orient_ball_owning=True, add_smoothing=True, **kwargs)[source]

Parameters:

kloppy_dataset (TrackingDataset)
ball_carrier_threshold (float)
max_player_speed (float)
max_ball_speed (float)
max_player_acceleration (float)
max_ball_acceleration (float)
orient_ball_owning (bool)
add_smoothing (bool)

convert_orientation_to_ball_owning(df)[source]

Convert field orientation so attacking team always goes left-to-right.

This method normalizes the coordinate system so that the team with possession always attacks from left to right, regardless of which half they’re in. This helps machine learning models by providing consistent attacking directionality.

When the away team has possession, all spatial coordinates (x, y) and their derivatives (vx, vy, ax, ay) are multiplied by -1.

Parameters: df (DataFrame) – The DataFrame with STATIC_HOME_AWAY orientation.
Returns: DataFrame with BALL_OWNING_TEAM orientation.
Return type: pl.DataFrame
Raises: ValueError – If orientation is already BALL_OWNING_TEAM.

Example

>>> # Typically called automatically if orient_ball_owning=True
>>> # But can be called manually:
>>> df = dataset.convert_orientation_to_ball_owning(dataset.data)

Note

This is called automatically during load() if the orient_ball_owning parameter is set to True in __init__.

The following columns are flipped when the away team has possession:

- x, y: Position coordinates
- vx, vy: Velocity components
- ax, ay: Acceleration components

See also

Kloppy Orientation documentation for more details on coordinate systems.

load()[source]

Load and process the Kloppy tracking dataset into Polars DataFrame.

This method performs the complete data transformation pipeline:

Transform coordinate system to SecondSpectrum standard
Extract player and ball metadata
Convert wide format (columns per player) to long format
Compute velocities with optional Savitzky-Golay smoothing
Compute accelerations
Filter unrealistic speed/acceleration values
Infer ball carrier and ball owning team (if not provided)
Optionally normalize orientation to ball-owning team
Infer goalkeeper positions (if position data unavailable)

The resulting DataFrame is stored in self.data and contains columns:

- period_id, timestamp, frame_id: Temporal identifiers
- id, team_id, position_name: Object identifiers
- x, y, z: Positions
- vx, vy, vz, speed: Velocities
- ax, ay, az, acceleration: Accelerations
- ball_state: Ball in/out of play
- ball_owning_team_id: Team with possession
- is_ball_carrier: Boolean flag for ball carrier
- game_id: Match identifier

Returns:

Self, for method chaining.

Return type:

KloppyPolarsDataset

Raises:

ValueError – If dataset orientation is NOT_SET.
ValueError – If ball owning team inference is needed but ball_carrier_threshold is None.

Example

>>> # Typically called automatically in __init__
>>> # But can be called manually to reload:
>>> dataset.load()

Note

This method is called automatically during __init__, so you typically don’t need to call it manually unless reloading data.

Warning

If ball owning team is not provided in the data, it will be inferred using distance thresholds, which may be inaccurate during contested ball situations.
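The optional Savitzky-Golay smoothing in the pipeline can be illustrated with SciPy's savgol_filter. The window length and polynomial order below are assumed values chosen for the example, not the settings load() uses, and smooth_velocity is a hypothetical helper name.

```python
import numpy as np
from scipy.signal import savgol_filter


def smooth_velocity(values, window_length: int = 7, polyorder: int = 2) -> np.ndarray:
    """Hypothetical helper: Savitzky-Golay smoothing of a 1-D velocity signal."""
    values = np.asarray(values, dtype=float)
    # The filter needs at least one full window; leave short signals untouched.
    if values.size < window_length:
        return values
    return savgol_filter(values, window_length=window_length, polyorder=polyorder)


# A steady acceleration (linear ramp) passes through unchanged ...
ramp = np.arange(20.0)
smoothed_ramp = smooth_velocity(ramp)

# ... while a single-frame spike, typical of a tracking glitch, is damped.
spike = np.zeros(21)
spike[10] = 10.0
smoothed_spike = smooth_velocity(spike)
```

The filter fits a low-order polynomial in each window, so smooth trends are preserved while isolated jumps are reduced. This is why smoothing before computing accelerations tends to give more stable values.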

add_dummy_labels(by=['game_id', 'frame_id'], random_seed=None)[source]

Add a column of random binary labels for testing/demonstration purposes.

This method adds a ‘label’ column with random 0/1 values to the dataset. Useful for testing graph neural network pipelines before you have real labels.

Parameters:

by (List[str]) – Column names to group by before assigning labels. Each unique combination gets the same random label. Defaults to ["game_id", "frame_id"].
random_seed (int | None) – Random seed for reproducibility. If None, labels will be different each time. Defaults to None.

Returns:

The updated DataFrame with ‘label’ column added.

Return type:

pl.DataFrame

Example

>>> # Add random labels, one per frame
>>> dataset.add_dummy_labels(by=["frame_id"])
>>>
>>> # Add labels grouped by possession
>>> dataset.add_dummy_labels(by=["ball_owning_team_id", "period_id"])
>>>
>>> # Reproducible labels
>>> dataset.add_dummy_labels(by=["frame_id"], random_seed=42)

Note

In real applications, replace this with actual labels from your data:

>>> import polars as pl
>>> labels = pl.DataFrame({"frame_id": [...], "label": [...]})
>>> dataset.data = dataset.data.join(labels, on="frame_id")

See also

add_dummy_label_column(): Underlying utility function.

add_graph_ids(by=['game_id', 'period_id'])[source]

Add a graph_id column for grouping frames into graph samples.

This method adds a ‘graph_id’ column that groups tracking frames into distinct graph samples for GNN training. This is crucial for proper train/test splitting to avoid data leakage.

Parameters: by (List[str]) – Column names to group by. Each unique combination gets a unique graph_id. Defaults to ["game_id", "period_id"].
Returns: The updated DataFrame with 'graph_id' column added.
Return type: pl.DataFrame

Example

>>> # Each frame is a separate graph
>>> dataset.add_graph_ids(by=["frame_id"])
>>>
>>> # Group by possession (all frames in same possession = one graph)
>>> dataset.add_graph_ids(by=["ball_owning_team_id", "period_id"])
>>>
>>> # Group by 10-frame sequences
>>> import polars as pl
>>> dataset.data = dataset.data.with_columns(
...     (pl.col("frame_id") // 10).alias("sequence_id")
... )
>>> dataset.add_graph_ids(by=["sequence_id"])

Important

When splitting data for training, always split by graph_id to avoid data leakage. Never split by row index:

>>> # CORRECT: Split by graph_id
>>> train, test, val = dataset.split_test_train_validation(4, 1, 1)
>>>
>>> # WRONG: Don't split by index
>>> train = dataset[:800]  # May have same game in train and test!

See also

add_graph_id_column(): Underlying utility function. split_test_train_validation(): Splitting method.

get_player_by_id(player_id)[source]

get_team_id_by_player_id(player_id)[source]

sample(sample_rate)[source]

Downsample the dataset by keeping every Nth frame.

This method reduces the temporal resolution of the data by keeping only a subset of frames. Useful for faster experimentation or when full temporal resolution is not needed.

Parameters: sample_rate (float) – Sampling rate. For example:

- 2.0 keeps every 2nd frame (halves data size)
- 5.0 keeps every 5th frame (reduces to 20% of original)
- 10.0 keeps every 10th frame (reduces to 10% of original)

Returns: Self, for method chaining.
Return type: KloppyPolarsDataset

Example

>>> # Keep every 2nd frame (50% of data)
>>> dataset.sample(sample_rate=2.0)
>>>
>>> # Keep every 5th frame (20% of data)
>>> dataset.sample(sample_rate=5.0)
>>>
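>>> # sample() returns self, so one further call can be chained.
>>> # add_dummy_labels() and add_graph_ids() return a DataFrame,
>>> # so call them one at a time:
>>> dataset.sample(5.0).add_dummy_labels()
>>> dataset.add_graph_ids()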

Note

This modifies self.data in place; the original data is not preserved.

Warning

Downsampling may affect velocity and acceleration calculations if you recalculate them after sampling. It’s recommended to downsample before conversion to graphs.

from kloppy import sportec
from unravel.soccer import KloppyPolarsDataset

# Load tracking data
kloppy_dataset = sportec.load_open_tracking_data(only_alive=True)

# Convert to Polars
polars_dataset = KloppyPolarsDataset(kloppy_dataset=kloppy_dataset)

# Access the DataFrame (stored in the `data` attribute)
df = polars_dataset.data
print(df.head())