process_mapping.discover_dfg

process_mapping.discover_dfg(
    log,
    case_col='entity_id',
    activity_col='event',
    timestamp_col='timestamp',
    time_unit='minutes',
)

Discover a Directly-Follows Graph (DFG) from an event log.

The event log must represent a single simulation run or process execution. Logs containing multiple independent runs should be filtered prior to calling this function.
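
As a minimal sketch of that filtering step, assume the combined log identifies each run in a column called run. That column name is hypothetical; substitute whatever distinguishes runs in your own data.

```python
import pandas as pd

# Toy log holding two independent runs. The "run" column name is an
# assumption made for this illustration only.
combined_log = pd.DataFrame({
    "run": [1, 1, 2, 2],
    "entity_id": [1, 1, 1, 1],
    "event": ["arrival", "depart", "arrival", "depart"],
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 09:30",
        "2024-01-01 09:00", "2024-01-01 09:45",
    ]),
})

# Keep a single run so that entity 1 from run 1 is not mixed with
# entity 1 from run 2 when events are grouped by case.
single_run_log = combined_log[combined_log["run"] == 1].copy()
```

The filtered frame, rather than the combined one, would then be passed as log.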

This function constructs a Directly-Follows Graph (DFG) from a case-based event log by identifying pairs of consecutive activities within each case. It returns two tables:

  1. A node table containing activity occurrence counts.
  2. An edge table containing directly-follows relations with frequency, transition time statistics, and transition probabilities.

Parameters

Name Type Description Default
log pandas.DataFrame Event log in tabular form. Each row represents an event and must include a case identifier, an activity name, and a timestamp. The timestamp column must be of a datetime-like dtype. required
case_col str Name of the column identifying cases (process instances). Events are ordered and analysed independently within each case. The default matches the column name generated by vidigi’s EventLogger. "entity_id"
activity_col str Name of the column containing activity or event labels. The default matches the column name generated by vidigi’s EventLogger. "event"
timestamp_col str Name of the column containing event timestamps. Values must be timezone-consistent and convertible to datetime64. The helper function vidigi.process_mapping.add_sim_timestamp() can add this column to a dataframe that has a column of times relative to the simulation start. The default matches the name of the column added by that helper. "timestamp"
time_unit {"seconds", "minutes", "hours", "days", "weeks"} Time unit used when computing the duration between consecutive events. Determines the scale of all time-based edge statistics. This should match the time unit used in your simulation. "minutes"
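
If a log only has simulation-relative times, a datetime column can also be built by hand. The sketch below assumes a hypothetical column named time holding minutes since the simulation start, and an arbitrary start date. It does not reproduce add_sim_timestamp(), whose signature is not shown on this page.

```python
import pandas as pd

log = pd.DataFrame({
    "entity_id": [1, 1],
    "event": ["arrival", "depart"],
    "time": [0.0, 12.5],  # hypothetical column: minutes since simulation start
})

# Arbitrary illustrative start; any fixed datetime works because only
# the differences between timestamps affect the edge statistics.
sim_start = pd.Timestamp("2024-01-01 08:00")
log["timestamp"] = sim_start + pd.to_timedelta(log["time"], unit="min")
```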

Returns

Name Type Description
nodes pandas.DataFrame Node table with one row per activity and the following columns:
  • activity : str Activity label.
  • count : int Total number of times the activity appears in the log.
edges pandas.DataFrame Edge table describing directly-follows relations between activities. Each row corresponds to a directed edge source -> target with the following columns:
  • source : str Preceding activity.
  • target : str Succeeding activity.
  • frequency : int Number of times target directly follows source.
  • mean_time : float Mean transition time between source and target.
  • median_time : float Median transition time between source and target.
  • max_time : float Maximum observed transition time.
  • min_time : float Minimum observed transition time.
  • standard_deviation_time : float Standard deviation of transition times.
  • probability : float Conditional probability of transitioning to target given source. Computed as the edge frequency divided by the total outgoing frequency from source.
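
To make the time-based columns concrete, the sketch below aggregates a hypothetical table of per-transition durations into the statistics listed above. It uses the pandas default for the standard deviation (sample, ddof=1); whether the library uses the same convention is an assumption.

```python
import pandas as pd

# Hypothetical durations (in minutes), one row per observed transition.
transitions = pd.DataFrame({
    "source": ["arrival", "arrival", "arrival", "triage"],
    "target": ["triage", "triage", "depart", "depart"],
    "duration": [5.0, 7.0, 30.0, 15.0],
})

edge_stats = (
    transitions.groupby(["source", "target"])["duration"]
    .agg(
        frequency="count",
        mean_time="mean",
        median_time="median",
        max_time="max",
        min_time="min",
        standard_deviation_time="std",  # sample std; NaN for a single observation
    )
    .reset_index()
)
```

Edges observed only once have an undefined sample standard deviation in this sketch, which is worth remembering when reading that column for rare transitions.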

Raises

Name Type Description
ValueError If time_unit is not one of the supported values.
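
One way such a check could work is sketched below. The helper name to_time_unit and the conversion table are hypothetical and are not taken from the library's source.

```python
from datetime import timedelta

# Seconds per supported unit (illustrative mapping, assumed for this sketch).
SECONDS_PER_UNIT = {
    "seconds": 1,
    "minutes": 60,
    "hours": 3600,
    "days": 86400,
    "weeks": 604800,
}


def to_time_unit(delta: timedelta, time_unit: str) -> float:
    """Express a duration in the requested unit, rejecting unknown units."""
    if time_unit not in SECONDS_PER_UNIT:
        supported = ", ".join(SECONDS_PER_UNIT)
        raise ValueError(
            f"Unsupported time_unit {time_unit!r}; expected one of: {supported}"
        )
    return delta.total_seconds() / SECONDS_PER_UNIT[time_unit]
```

Because pandas.Timedelta subclasses datetime.timedelta, the same helper would accept differences between pandas timestamps.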

Notes

  • Case endings (i.e., events without a successor) are excluded from the edge table.
  • The input log is internally sorted by case_col and timestamp_col before analysis.
  • Transition probabilities are computed independently for each source activity and therefore sum to 1 per source (up to floating-point precision).
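
The per-source normalisation in the last note can be illustrated with a small hand-built edge table. The frequencies are made up for the example.

```python
import pandas as pd

edges = pd.DataFrame({
    "source": ["arrival", "arrival", "triage"],
    "target": ["triage", "depart", "depart"],
    "frequency": [3, 1, 3],
})

# Divide each edge frequency by the total outgoing frequency of its source.
outgoing = edges.groupby("source")["frequency"].transform("sum")
edges["probability"] = edges["frequency"] / outgoing

# Sanity check: probabilities should sum to 1 for every source activity.
totals = edges.groupby("source")["probability"].sum()
```

The same groupby-and-sum check can be run on the edge table returned by the function to confirm the per-source totals.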

Examples

>>> from vidigi.process_mapping import discover_dfg
>>> nodes, edges = discover_dfg(
...     log=event_log,
...     case_col="case_id",
...     activity_col="activity",
...     timestamp_col="time",
...     time_unit="minutes",
... )
>>> nodes.head()
>>> edges.head()