Resolving Duplicate Incident Reports Across Jurisdictions

At 02:47 a structure fire is called in from a cell phone routed through a county Public Safety Answering Point (PSAP) and dispatched as INC-44021 in Computer-Aided Dispatch (CAD). Ninety seconds later the same column of smoke trips a city traffic-camera analytic, and a neighboring mutual-aid agency self-dispatches an engine, writing CTY-9930 to its own CAD with coordinates 110 metres west and a clock skewed four minutes by an un-synced NTP server. The common operating picture now shows two fires on the same block. The resource-allocation model counts two engines committed where there is one, the after-action metrics inflate, and an operations chief stares at a map that lies. This page addresses exactly that failure: two incident records that describe one real-world event but never match on ID, exact coordinate, or timestamp, and the Python pattern that collapses them without ever silently merging two events that were genuinely distinct. It is the deduplication concern that sits on top of Geopandas vs PyShp for Field Operations, because the library you reach for decides whether this runs in a command-center ETL node or on an offline tablet at the post.

Root Cause and Operational Impact

Cross-boundary duplicates are not a data-entry mistake to be scolded out of existence — they are structural. Overlapping CAD dispatch zones, mutual-aid auto-dispatch, parallel sensor triggers (traffic cameras, gunshot detectors, smoke analytics), and multi-agency radio traffic all generate independent records for one event by design. None of them share a primary key, and three field-specific noise sources guarantee the records never line up exactly: GPS drift and differing Coordinate Reference System (CRS) implementations move the coordinate by tens of metres, un-synced dispatch clocks skew the timestamp by minutes, and free-text incident descriptions diverge entirely between agencies. Exact-match or raw-coordinate-equality filtering catches none of it.

In an office system a duplicate row is an inconvenience you fix at month-end. In an active incident it is operationally dangerous. A double-counted event corrupts real-time situational awareness, so the operations section commits or holds resources against a phantom. It poisons the deployment metrics that drive mutual-aid billing and the FEMA Incident Status Summary (NIMS ICS-209) rollup. And it breaks every downstream spatial join — buffer math, parcel intersection, hydrant assignment — because two geometries sit where one belongs. The fix must therefore be deterministic and conservative: a wrong merge that hides a second real fire is worse than a missed merge that leaves a visible duplicate, so the pattern biases toward flagging the ambiguous case for a human rather than auto-collapsing it.

Tiered Resolution Strategy

Treat a candidate duplicate the way the CRS resolver treats a missing datum: resolve through an ordered chain from the most definitive evidence to a flagged safe default, and never let an automatic merge rest on a silent guess. Evaluate three concurrent dimensions (spatial proximity, temporal overlap, and attribute similarity), combine them into one composite score, and route by confidence band:

Definitive identity match. If both records carry a shared correlation key — a CAD-to-CAD exchange ID, a NIEM-XML IncidentTrackingIdentification, or a deduplicated 911 ANI/ALI reference — collapse them immediately. This is the only tier allowed to merge without spatial scoring.
High-confidence spatial-temporal-attribute match. Within a tight time window and a metric proximity buffer, with matching NENA-compliant incident type codes, auto-merge and assign a master_id. Every merge emits an audit record naming both source IDs.
Ambiguous match → review queue. A composite score in the uncertain band (here 0.65–0.85) routes to a human-review queue rather than auto-merging, preserving chain-of-custody for forensic audit. Resource counts treat the pair as possibly one and surface it for the ops chief, not as a settled merge.
Distinct → keep both with audit flag. Below the threshold, retain both records, tag each with the evaluation provenance, and log the near-miss so a post-incident reviewer can see the deduplicator considered and rejected the pair.
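
The four tiers above can be sketched as a single routing function. This is a minimal illustration under stated assumptions, not the resolver's own code: the `correlation_key` field and the `route_pair` helper are hypothetical names, and the band edges simply reuse the example values 0.65 and 0.85.

```python
from typing import Any, Mapping, Optional

AUTO_MERGE_SCORE = 0.85    # example value from the text; tune per jurisdiction
REVIEW_FLOOR_SCORE = 0.65  # example value from the text


def route_pair(
    a: Mapping[str, Any],
    b: Mapping[str, Any],
    composite_score: Optional[float],
) -> str:
    """Return 'identity', 'merge', 'review', or 'distinct' for one candidate pair.

    `correlation_key` is a hypothetical field standing in for whatever shared
    identifier the two feeds actually exchange.
    """
    key_a = a.get("correlation_key")
    key_b = b.get("correlation_key")
    # Tier 1: a shared, non-empty key merges without any spatial scoring.
    if key_a and key_a == key_b:
        return "identity"
    if composite_score is None:
        raise ValueError("a composite score is required when no shared key exists")
    # Tier 2: high-confidence composite match.
    if composite_score >= AUTO_MERGE_SCORE:
        return "merge"
    # Tier 3: the ambiguous band goes to a human, never to an automatic merge.
    if composite_score >= REVIEW_FLOOR_SCORE:
        return "review"
    # Tier 4: keep both records and log the rejected comparison.
    return "distinct"
```

Checking the key first means a pair with a shared identifier never depends on coordinates or clocks, which is the property the first tier is meant to guarantee.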

Production Python Implementation

The resolver below runs the scoring path in one place: it harmonizes CRS to a metric projection, pre-filters candidates by time window and spatial index so that only nearby, near-simultaneous pairs are scored, scores each candidate pair on spatial, temporal, and attribute axes, then routes by confidence band and logs every decision, whether merge, review, or keep. It does not implement the definitive identity tier; apply any shared correlation key upstream, before records reach the scorer. It uses print-free logging, explicit exception handling, and type hints on its public interface. It assumes geopandas >= 0.14 (Shapely 2.x / GEOS), a unique index on the input frame, and a projected CRS appropriate to the operational area. Pick the Universal Transverse Mercator (UTM) zone for the incident, not a continental default, so buffer distances are true metres. The metric reprojection step depends on a correctly resolved input CRS; recovering one when the upstream feed omits it is the job of handling missing CRS in field-collected GPS logs.

python

import logging
from dataclasses import dataclass
from datetime import timedelta
from difflib import SequenceMatcher
from typing import Optional

import geopandas as gpd
from shapely.strtree import STRtree

logger = logging.getLogger("dedup.incidents")


@dataclass(frozen=True)
class DedupConfig:
    metric_crs: str = "EPSG:32610"     # WGS 84 / UTM zone 10N; use the zone for your operational area
    spatial_threshold_m: float = 150.0  # proximity buffer in true metres
    time_window_min: int = 10           # ± dispatch-to-dispatch skew tolerance
    auto_merge_score: float = 0.85      # >= this: collapse automatically
    review_floor_score: float = 0.65    # [floor, auto): route to human review


def _score_pair(row, cand, cfg: DedupConfig) -> float:
    """Composite 0..1 similarity across space, time, and incident type."""
    dist = row.geometry.distance(cand.geometry)
    spatial = max(0.0, 1.0 - dist / cfg.spatial_threshold_m)

    dt_min = abs((row["dispatch_time"] - cand["dispatch_time"]).total_seconds()) / 60.0
    temporal = max(0.0, 1.0 - dt_min / cfg.time_window_min)

    # NENA-style code equality is a hard signal; fall back to description fuzz.
    if row.get("incident_code") and row["incident_code"] == cand.get("incident_code"):
        attribute = 1.0
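    else:
        desc_a = row.get("description")
        desc_b = cand.get("description")
        if isinstance(desc_a, str) and isinstance(desc_b, str) and desc_a and desc_b:
            attribute = SequenceMatcher(None, desc_a.lower(), desc_b.lower()).ratio()
        else:
            # Missing text is no evidence of sameness; "" vs "" would otherwise score 1.0.
            attribute = 0.0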

    return 0.5 * spatial + 0.3 * temporal + 0.2 * attribute


def resolve_duplicate_incidents(
    incidents: gpd.GeoDataFrame, cfg: Optional[DedupConfig] = None
) -> gpd.GeoDataFrame:
    """Collapse cross-jurisdiction duplicates, emitting an audit record per decision.

    Returns the original frame with three added columns:
      master_id   – id of the surviving record (self for masters/distinct)
      dedup_state – 'master' | 'merged' | 'review' (distinct records stay 'master')
      dedup_score – composite score against the assigned master (NaN for masters)
    """
    cfg = cfg or DedupConfig()
    if incidents.empty:
        logger.info("No incidents to deduplicate; returning empty frame")
        return incidents

    try:
        work = incidents.to_crs(cfg.metric_crs)
    except Exception:
        logger.exception("CRS reprojection to %s failed; aborting dedup", cfg.metric_crs)
        raise  # a wrong projection silently corrupts every distance; never proceed

    work = work.sort_values("dispatch_time").copy()
    work["master_id"] = work.index
    work["dedup_state"] = "master"
    work["dedup_score"] = float("nan")

    geoms = list(work.geometry.values)
    tree = STRtree(geoms)
    settled: set = set()

    for idx, row in work.iterrows():
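        if idx in settled:
            continue
        # Score each pair once: without this, a later row can relabel this
        # record as its own 'review' or 'merged' candidate and form a cycle.
        settled.add(idx)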

        # Temporal pre-filter limits pair scoring to records inside the window.
        # The mask itself is a full scan per row; slice the sorted times on large frames.
        lo = row["dispatch_time"] - timedelta(minutes=cfg.time_window_min)
        hi = row["dispatch_time"] + timedelta(minutes=cfg.time_window_min)
        in_window = work[(work["dispatch_time"] >= lo) & (work["dispatch_time"] <= hi)]

        # Spatial pre-filter via the index, then score survivors precisely.
        near_pos = tree.query(row.geometry.buffer(cfg.spatial_threshold_m))
        near_idx = set(work.iloc[near_pos].index) & set(in_window.index)
        near_idx.discard(idx)

        for cand_idx in near_idx:
            if cand_idx in settled:
                continue
            score = _score_pair(row, work.loc[cand_idx], cfg)

            if score >= cfg.auto_merge_score:
                work.loc[cand_idx, ["master_id", "dedup_state", "dedup_score"]] = (
                    idx, "merged", score,
                )
                settled.add(cand_idx)
                logger.info(
                    "MERGE master=%s merged=%s score=%.3f", idx, cand_idx, score
                )
            elif score >= cfg.review_floor_score:
                work.loc[cand_idx, ["master_id", "dedup_state", "dedup_score"]] = (
                    idx, "review", score,
                )
                logger.warning(
                    "REVIEW pair=(%s,%s) score=%.3f routed to manual queue",
                    idx, cand_idx, score,
                )
            else:
                logger.debug(
                    "DISTINCT pair=(%s,%s) score=%.3f kept separate", idx, cand_idx, score
                )

    # Re-attach results to the caller's original CRS frame by index.
    out = incidents.copy()
    out[["master_id", "dedup_state", "dedup_score"]] = work[
        ["master_id", "dedup_state", "dedup_score"]
    ]
    return out

Persist master_id, dedup_state, and dedup_score straight to the audit store; they are the chain-of-custody record that lets a reviewer reconstruct why two records became one. Never overwrite the source IDs in place — masters and merged records must both survive so a mistaken merge is reversible. The same scoring philosophy underpins automated attribute validation rules: normalize the incident-type codes there before they reach the attribute axis here, or the fuzzy fallback will carry the whole decision.
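
As a sanity check on the weights, the opening scenario can be scored by hand. The sketch below reimplements only the arithmetic of `_score_pair` on plain numbers. The 5.5-minute recorded gap is an assumption (90 seconds of real delay plus a clock running four minutes fast), as is the premise that both agencies logged the same incident code.

```python
def composite_score(
    dist_m: float,
    dt_min: float,
    attribute: float,
    spatial_threshold_m: float = 150.0,
    time_window_min: float = 10.0,
) -> float:
    """Same weighting as _score_pair: 0.5 spatial + 0.3 temporal + 0.2 attribute."""
    spatial = max(0.0, 1.0 - dist_m / spatial_threshold_m)
    temporal = max(0.0, 1.0 - dt_min / time_window_min)
    return 0.5 * spatial + 0.3 * temporal + 0.2 * attribute


# Opening scenario: 110 m apart, assumed 5.5 min recorded gap, matching codes.
score = composite_score(dist_m=110.0, dt_min=5.5, attribute=1.0)
print(round(score, 3))  # 0.468
```

Under these assumptions the pair scores about 0.47, below the 0.65 review floor, so with the example defaults it would be logged as distinct and never reach the review queue. That is a concrete reason to tune the thresholds and weights against labeled local data before deployment.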

Validation Checklist

Confirm each item before a deduplication build is cleared for field deployment:

The metric CRS in DedupConfig is the UTM zone for the actual incident area, not a continental or web-mercator default — buffer distances must be true metres.
Every merged and review record carries master_id and dedup_score written to the audit log alongside both source IDs.
auto_merge_score and review_floor_score are tuned against a labeled sample from this jurisdiction, not left at the example defaults.
A definitive identity tier (CAD exchange ID / NIEM IncidentTrackingIdentification) short-circuits scoring when a shared correlation key exists.
Records in the review band route to a human queue and are excluded from auto-merge resource counts until cleared.
Source records are never overwritten in place — both master and merged rows survive so any merge is reversible.
Dispatch timestamps are normalized to UTC before windowing, so time_window_min only has to absorb clock skew between agencies, not timezone offsets.
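
For the first checklist item, the UTM zone can be derived from a representative incident coordinate instead of being hard-coded. This is a minimal sketch using the standard 6-degree zone formula; the helper name `utm_epsg_for` is hypothetical, and it ignores the special zone exceptions around Norway and Svalbard. GeoPandas also provides `GeoDataFrame.estimate_utm_crs()`, which may be the better choice when the whole frame is already loaded.

```python
import math


def utm_epsg_for(lon: float, lat: float) -> str:
    """Return the WGS 84 / UTM EPSG code covering a lon/lat point in degrees."""
    if not (-180.0 <= lon <= 180.0 and -80.0 <= lat <= 84.0):
        raise ValueError(f"coordinate outside UTM coverage: lon={lon}, lat={lat}")
    # Zones are 6 degrees wide starting at -180; lon=180 folds into zone 60.
    zone = min(int(math.floor((lon + 180.0) / 6.0)) + 1, 60)
    base = 32600 if lat >= 0 else 32700  # 326xx is northern, 327xx is southern
    return f"EPSG:{base + zone}"


print(utm_epsg_for(-122.4, 37.8))  # EPSG:32610
```

The result can be passed as `metric_crs` when constructing `DedupConfig`, so the projection follows the incident rather than a default.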

Edge Cases and Gotchas

Axis-order inversion. A neighboring agency exporting GeoJSON as lat/lon while yours is lon/lat transposes a point thousands of kilometres away, so it never enters any candidate set and the duplicate survives unmerged. Enforce always_xy=True on every transform and bounds-check coordinates against the incident extent before scoring.
Null-island drift. A (0, 0) fix from a sensor that failed to acquire is within spatial_threshold_m of every other (0, 0) failure, so naive proximity merges unrelated events into one phantom incident at the equator. Filter and quarantine null-island coordinates before the spatial index is built.
Clock skew beyond the window. Un-synced agency CAD clocks can exceed time_window_min, pushing a true duplicate outside the temporal pre-filter so it is never compared. Normalize all timestamps to UTC on ingest, or genuine duplicates fall through silently. If the window must be widened, raise auto_merge_score to compensate, because a wider window admits more unrelated pairs.
Agency-specific datum anomalies. A mutual-aid feed still publishing NAD27 or a State Plane grid lands 10–100 m off your WGS 84 basemap, inflating the measured distance so a real duplicate scores below threshold. Reproject with a real datum-transformation grid per the coordinate reference systems for disaster zones workflow before the records ever reach this resolver.
Offline device replay. A tablet syncing after an outage replays a batch of already-merged records; without idempotency the resolver merges them a second time and corrupts the master_id chain. Key the audit store on source ID so a re-seen record is recognized as settled, not re-scored.
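
Several of these cases can be screened before any record reaches the spatial index. The following is a minimal pre-flight sketch on plain dictionaries, not part of the resolver above. The field names (`source_id`, `lon`, `lat`), the `preflight` helper, and the bounding-box tuple are all hypothetical, and a real pipeline would read the already-settled IDs from its audit store.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


def _inside(lon: float, lat: float, extent: BBox) -> bool:
    min_lon, min_lat, max_lon, max_lat = extent
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def preflight(
    records: Iterable[Dict[str, Any]],
    extent: BBox,
    settled_ids: Set[str],
) -> Tuple[List[Dict[str, Any]], List[Tuple[str, str]]]:
    """Split records into (clean, quarantined); quarantined holds (source_id, reason)."""
    clean: List[Dict[str, Any]] = []
    quarantined: List[Tuple[str, str]] = []
    for rec in records:
        sid = str(rec["source_id"])
        lon, lat = float(rec["lon"]), float(rec["lat"])
        if sid in settled_ids:
            quarantined.append((sid, "replay"))  # already decided; do not re-score
        elif lon == 0.0 and lat == 0.0:
            quarantined.append((sid, "null_island"))  # failed fix, not a location
        elif _inside(lon, lat, extent):
            clean.append(rec)
        elif _inside(lat, lon, extent):
            quarantined.append((sid, "axis_swapped"))  # fits only when transposed
        else:
            quarantined.append((sid, "out_of_extent"))
    return clean, quarantined
```

Quarantined records are returned with a reason rather than dropped, so a reviewer can still see what was withheld from scoring and why.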

Geopandas vs PyShp for Field Operations — choosing the spatial library that decides whether this runs on a command-center node or an offline tablet.
Coordinate Reference Systems for Disaster Zones — datum-aware reprojection so cross-agency feeds align before scoring.
Automated Attribute Validation Rules — normalizing incident-type codes that feed the attribute-similarity axis.
