process_mapping.discover_dfg

process_mapping.discover_dfg(
    log,
    case_col='entity_id',
    activity_col='event',
    timestamp_col='timestamp',
    time_unit='minutes',
)

Discover a Directly-Follows Graph (DFG) from an event log.

The event log must represent a single simulation run or process execution. Logs containing multiple independent runs should be filtered down to one run before calling this function.
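As a minimal sketch of that filtering step, assume the log carries a run identifier in a column named `run` (a hypothetical name; use whatever your log provides). If entity IDs restart in each run, events from different runs would otherwise be interleaved within the same case.

```python
import pandas as pd

# Hypothetical log holding two runs; the `run` column name is an assumption.
multi_run_log = pd.DataFrame(
    {
        "run": [0, 0, 1, 1],
        "entity_id": [1, 1, 1, 1],  # the same entity ID recurs in both runs
        "event": ["arrival", "depart", "arrival", "depart"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 09:00",
                "2024-01-01 09:30",
                "2024-01-01 09:10",
                "2024-01-01 09:20",
            ]
        ),
    }
)

# Keep a single run so that events from different runs are not mixed
# within one case.
single_run_log = multi_run_log[multi_run_log["run"] == 0].copy()
```

The filtered frame can then be passed as `log`.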

The graph is built from a case-based event log by pairing each event with the event that immediately follows it within the same case. The function returns two tables:

- A node table containing activity occurrence counts.
- An edge table containing directly-follows relations with frequency, transition time statistics, and transition probabilities.
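The pairing of consecutive activities can be sketched in plain pandas. This is an illustration on toy data using the default column names, not the function's actual implementation:

```python
import pandas as pd

# Toy event log (hypothetical data) using the default column names.
log = pd.DataFrame(
    {
        "entity_id": [1, 1, 1, 2, 2],
        "event": ["arrival", "triage", "depart", "arrival", "depart"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 09:00",
                "2024-01-01 09:10",
                "2024-01-01 09:40",
                "2024-01-01 09:05",
                "2024-01-01 09:25",
            ]
        ),
    }
)

# Order events within each case, then pair each event with its successor.
ordered = log.sort_values(["entity_id", "timestamp"], kind="stable")
ordered["target"] = ordered.groupby("entity_id")["event"].shift(-1)

# The last event of each case has no successor, so it yields no edge.
pairs = ordered.dropna(subset=["target"]).rename(columns={"event": "source"})

# Activity occurrence counts (nodes) and directly-follows frequencies (edges).
node_counts = log["event"].value_counts()
edge_counts = pairs.groupby(["source", "target"]).size()
```

Here `node_counts` corresponds to the node table's counts and `edge_counts` to the edge table's frequencies.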

Parameters

Name	Type	Description	Default
log	pandas.DataFrame	Event log in tabular form. Each row represents an event and must include a case identifier, an activity name, and a timestamp. The timestamp column must be of a datetime-like dtype.	required
case_col	str	Name of the column identifying cases (process instances). Events are ordered and analysed independently within each case. The default matches the column name generated by vidigi’s EventLogger.	`"entity_id"`
activity_col	str	Name of the column containing activity or event labels. The default matches the column name generated by vidigi’s EventLogger.	`"event"`
timestamp_col	str	Name of the column containing event timestamps. Values must be timezone-consistent and convertible to `datetime64`. The helper function `vidigi.process_mapping.add_sim_timestamp()` can be used to add this column to a dataframe if provided with a sim-start-relative time column. The default matches the name of the column added by that helper function.	`"timestamp"`
time_unit	(seconds, minutes, hours, days, weeks)	Time unit used when computing the duration between consecutive events. Determines the scale of all time-based edge statistics. This should match the time unit used in your simulation.	`"minutes"`
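To illustrate how `time_unit` scales durations, the sketch below builds a datetime column from a simulation-relative clock and expresses the gap between consecutive events in minutes. The column name `time`, the start date, and the `unit_seconds` lookup are assumptions made for illustration; they are not taken from vidigi's implementation.

```python
import pandas as pd

# Hypothetical log with a simulation-relative clock measured in minutes.
log = pd.DataFrame(
    {
        "entity_id": [1, 1, 1],
        "event": ["arrival", "triage", "depart"],
        "time": [0.0, 12.5, 45.0],
    }
)

# Assumed simulation start; any fixed datetime works for relative durations.
sim_start = pd.Timestamp("2024-01-01 00:00:00")
log["timestamp"] = sim_start + pd.to_timedelta(log["time"], unit="min")

# Seconds per supported unit (an illustrative lookup, not vidigi's own).
unit_seconds = {
    "seconds": 1,
    "minutes": 60,
    "hours": 3600,
    "days": 86400,
    "weeks": 604800,
}
time_unit = "minutes"

# Gap to the previous event in the same case, scaled to the chosen unit.
gaps = log.groupby("entity_id")["timestamp"].diff()
log["gap"] = gaps.dt.total_seconds() / unit_seconds[time_unit]
```

Choosing a unit that differs from the simulation's own clock would rescale every time-based statistic accordingly, for example 12.5 minutes would be reported as 750 seconds.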

Returns

Name	Type	Description
nodes	pandas.DataFrame	Node table with one row per activity and the following columns: - `activity` : str Activity label. - `count` : int Total number of times the activity appears in the log.
edges	pandas.DataFrame	Edge table describing directly-follows relations between activities. Each row corresponds to a directed edge `source -> target` with the following columns: - `source` : str Preceding activity. - `target` : str Succeeding activity. - `frequency` : int Number of times `target` directly follows `source`. - `mean_time` : float Mean transition time between `source` and `target`. - `median_time` : float Median transition time between `source` and `target`. - `max_time` : float Maximum observed transition time. - `min_time` : float Minimum observed transition time. - `standard_deviation_time` : float Standard deviation of transition times. - `probability` : float Conditional probability of transitioning to `target` given `source`. Computed as the edge frequency divided by the total outgoing frequency from `source`.
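The time-based edge columns can be approximated with a grouped aggregation. The sketch below uses hypothetical transition times and assumes pandas' default sample standard deviation (`ddof=1`); the function itself may compute the standard deviation differently.

```python
import pandas as pd

# Hypothetical transition times (in minutes) for observed directly-follows pairs.
pairs = pd.DataFrame(
    {
        "source": ["arrival", "arrival", "arrival", "triage"],
        "target": ["triage", "triage", "triage", "depart"],
        "duration": [10.0, 20.0, 30.0, 5.0],
    }
)

# One row per directed edge, with the documented statistic column names.
stats = (
    pairs.groupby(["source", "target"])["duration"]
    .agg(
        frequency="count",
        mean_time="mean",
        median_time="median",
        max_time="max",
        min_time="min",
        standard_deviation_time="std",  # sample std (ddof=1); an assumption
    )
    .reset_index()
)
```

With this choice of estimator, an edge observed only once has an undefined (NaN) standard deviation.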

Raises

Name	Type	Description
	ValueError	If `time_unit` is not one of the supported values.

Notes

- Case endings (i.e., events without a successor) are excluded from the edge table.
- The input log is internally sorted by `case_col` and `timestamp_col` before analysis.
- Transition probabilities are computed independently for each source activity and therefore sum to 1 per source (up to floating-point precision).
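The probability definition can be checked by hand on a small edge table. The frequencies below are made up; only the `source`, `target`, `frequency`, and `probability` column names come from the documented edge table.

```python
import pandas as pd

# Hypothetical edge frequencies using the documented column names.
edges = pd.DataFrame(
    {
        "source": ["arrival", "arrival", "triage"],
        "target": ["triage", "depart", "depart"],
        "frequency": [3, 1, 3],
    }
)

# Edge frequency divided by the total outgoing frequency of its source.
outgoing = edges.groupby("source")["frequency"].transform("sum")
edges["probability"] = edges["frequency"] / outgoing

# Outgoing probabilities summed per source activity.
totals = edges.groupby("source")["probability"].sum()
```

Because case endings produce no edge, an activity that only ever ends a case does not appear as a `source` and has no outgoing probabilities.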

Examples

>>> from vidigi.process_mapping import discover_dfg
>>> nodes, edges = discover_dfg(
...     log=event_log,
...     case_col="case_id",
...     activity_col="activity",
...     timestamp_col="time",
...     time_unit="minutes",
... )
>>> nodes.head()
>>> edges.head()