process_mapping.discover_dfg
process_mapping.discover_dfg(
log,
case_col='entity_id',
activity_col='event',
timestamp_col='timestamp',
time_unit='minutes',
)
Discover a Directly-Follows Graph (DFG) from an event log.
The event log must represent a single simulation run or process execution. Logs containing multiple independent runs should be filtered prior to calling this function.
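If a log holds several runs, one way to isolate a single run before calling the function is a boolean filter. This is a minimal sketch that assumes a hypothetical `run` column identifying each run; the column name and layout in your own log may differ.

```python
import pandas as pd

# Toy log with two independent runs. The "run" column is a hypothetical name.
log = pd.DataFrame(
    {
        "run": [1, 1, 2, 2],
        "entity_id": [1, 1, 1, 1],
        "event": ["arrival", "depart", "arrival", "depart"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 00:00",
                "2024-01-01 00:10",
                "2024-01-01 00:00",
                "2024-01-01 00:25",
            ]
        ),
    }
)

# Keep only run 1 so that entity IDs reused across runs are not interleaved.
single_run_log = log[log["run"] == 1].reset_index(drop=True)
```

Without the filter, entity 1 from both runs would be treated as one case, and spurious transitions would appear between events from different runs.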
This function constructs a Directly-Follows Graph (DFG) from a case-based event log by identifying pairs of consecutive activities within each case. It returns two tables:
- A node table containing activity occurrence counts.
- An edge table containing directly-follows relations with frequency, transition time statistics, and transition probabilities.
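The idea of pairing consecutive activities within each case can be sketched in plain pandas. This is an illustration of the general technique on toy data, not necessarily how the function is implemented internally.

```python
import pandas as pd

# Toy event log using the default column names described on this page.
log = pd.DataFrame(
    {
        "entity_id": [1, 1, 1, 2, 2],
        "event": ["arrival", "triage", "depart", "arrival", "depart"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 09:00",
                "2024-01-01 09:05",
                "2024-01-01 09:30",
                "2024-01-01 09:10",
                "2024-01-01 09:20",
            ]
        ),
    }
)

# Order events within each case, then pair each event with its successor.
ordered = log.sort_values(["entity_id", "timestamp"], kind="stable")
pairs = pd.DataFrame(
    {
        "source": ordered["event"],
        "target": ordered.groupby("entity_id")["event"].shift(-1),
    }
).dropna(subset=["target"])  # the last event of each case has no successor

# Node table: occurrences per activity.
nodes = (
    ordered.groupby("event")
    .size()
    .reset_index(name="count")
    .rename(columns={"event": "activity"})
)

# Edge table: occurrences per directly-follows pair.
edges = pairs.groupby(["source", "target"]).size().reset_index(name="frequency")
```

Here the two cases produce three distinct edges: `arrival -> triage`, `triage -> depart`, and `arrival -> depart`.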
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| log | pandas.DataFrame | Event log in tabular form. Each row represents an event and must include a case identifier, an activity name, and a timestamp. The timestamp column must be of a datetime-like dtype. | required |
| case_col | str | Name of the column identifying cases (process instances). Events are ordered and analysed independently within each case. The default reflects the default column names generated by vidigi’s EventLogger. | "entity_id" |
| activity_col | str | Name of the column containing activity or event labels. The default reflects the default column names generated by vidigi’s EventLogger. | "event" |
| timestamp_col | str | Name of the column containing event timestamps. Values must be timezone-consistent and convertible to datetime64. The helper function vidigi.process_mapping.add_sim_timestamp() can be used to add this column to a dataframe if provided with a sim-start-relative time column. The default reflects the default name of the column added by that helper function. | "timestamp" |
| time_unit | (seconds, minutes, hours, days, weeks) | Time unit used when computing the duration between consecutive events. Determines the scale of all time-based edge statistics. This should reflect the time unit used in your simulation. | "minutes" |
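
For readers who want to see what a datetime-like timestamp column derived from simulation time looks like, here is a minimal plain-pandas sketch. It does not use the helper function mentioned above; the `time` column (in minutes) and the start date are assumptions chosen for illustration.

```python
import pandas as pd

# Toy log whose "time" column holds minutes since the start of the simulation.
log = pd.DataFrame(
    {
        "entity_id": [1, 1],
        "event": ["arrival", "depart"],
        "time": [0.0, 12.5],
    }
)

# Arbitrary anchor date; only the differences between timestamps matter for a DFG.
sim_start = pd.Timestamp("2024-01-01 00:00:00")

# Convert the relative time to an absolute datetime64 column.
log["timestamp"] = sim_start + pd.to_timedelta(log["time"], unit="minutes")
```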
Returns
| Name | Type | Description |
|---|---|---|
| nodes | pandas.DataFrame | Node table with one row per activity and the following columns: activity (str): activity label; count (int): total number of times the activity appears in the log. |
| edges | pandas.DataFrame | Edge table describing directly-follows relations between activities. Each row corresponds to a directed edge source -> target with the following columns: source (str): preceding activity; target (str): succeeding activity; frequency (int): number of times target directly follows source; mean_time (float): mean transition time between source and target; median_time (float): median transition time between source and target; max_time (float): maximum observed transition time; min_time (float): minimum observed transition time; standard_deviation_time (float): standard deviation of transition times; probability (float): conditional probability of transitioning to target given source, computed as the edge frequency divided by the total outgoing frequency from source. |
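
The edge columns described above can be reproduced on toy data with a grouped aggregation. This sketch assumes a pre-computed table of transitions with a hypothetical `duration` column, and it uses pandas' default sample standard deviation, which is NaN for edges observed only once; the function itself may handle that case differently.

```python
import pandas as pd

# Toy transitions: one row per observed source -> target step, duration in minutes.
transitions = pd.DataFrame(
    {
        "source": ["arrival", "arrival", "arrival", "triage"],
        "target": ["triage", "triage", "depart", "depart"],
        "duration": [5.0, 7.0, 10.0, 20.0],
    }
)

# One row per edge, with frequency and transition time statistics.
edges = (
    transitions.groupby(["source", "target"])["duration"]
    .agg(
        frequency="count",
        mean_time="mean",
        median_time="median",
        max_time="max",
        min_time="min",
        standard_deviation_time="std",
    )
    .reset_index()
)

# Probability = edge frequency / total outgoing frequency of the same source.
outgoing = edges.groupby("source")["frequency"].transform("sum")
edges["probability"] = edges["frequency"] / outgoing
```

In this toy data, `arrival` has three outgoing transitions, two of which go to `triage`, so that edge has a probability of 2/3.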
Raises
| Name | Type | Description |
|---|---|---|
| | ValueError | If time_unit is not one of the supported values. |
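
To illustrate how a time unit can scale durations and how an unsupported value might be rejected, here is a small sketch. The helper `duration_in_unit` and the lookup table are hypothetical names introduced for this example and are not part of the library.

```python
import pandas as pd

# Seconds per supported unit (the units listed for time_unit above).
SECONDS_PER_UNIT = {
    "seconds": 1,
    "minutes": 60,
    "hours": 3600,
    "days": 86400,
    "weeks": 604800,
}


def duration_in_unit(delta: pd.Timedelta, time_unit: str) -> float:
    """Hypothetical helper: express a Timedelta in the requested unit."""
    if time_unit not in SECONDS_PER_UNIT:
        raise ValueError(
            f"time_unit must be one of {sorted(SECONDS_PER_UNIT)}, got {time_unit!r}"
        )
    return delta.total_seconds() / SECONDS_PER_UNIT[time_unit]


ninety_minutes_in_hours = duration_in_unit(pd.Timedelta(minutes=90), "hours")
```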
Notes
- Case endings (i.e., events without a successor) are excluded from the edge table.
- The input log is internally sorted by case_col and timestamp_col before analysis.
- Transition probabilities are computed independently for each source activity and therefore sum to 1 per source (up to floating-point precision).
Examples
>>> from vidigi.process_mapping import discover_dfg
>>> nodes, edges = discover_dfg(
... log=event_log,
... case_col="case_id",
... activity_col="activity",
... timestamp_col="time",
... time_unit="minutes",
... )
>>> nodes.head()
>>> edges.head()