Data Quality

Understanding limitations and quality issues in our data

Overview

The NJFOG Court Data project aggregates data from multiple government sources, each with its own formatting conventions and limitations. This page documents known data quality issues to help users interpret the data correctly.

We display data as-is from source systems rather than attempting to guess or reconstruct missing information. This preserves accuracy while making limitations visible.

Court Cases (ACMS Data)

Court case data comes from the NJ Judiciary's Automated Case Management System (ACMS) via the PAB0231 fixed-width export format. This mainframe-era format has significant limitations.
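To make the idea of a fixed-width export concrete, here is a minimal sketch of reading one record by column position. The field names, column offsets, and sample record are assumptions made up for illustration; they are not the actual PAB0231 layout.

```python
# Hypothetical layout: (field name, start column, end column).
# These offsets are illustrative only, not the real PAB0231 specification.
LAYOUT = [
    ("docket", 0, 12),
    ("caption", 12, 32),
    ("filed", 32, 40),
]


def parse_record(line: str) -> dict:
    """Slice one fixed-width line into named fields, trimming the space padding."""
    return {name: line[start:end].strip() for name, start, end in LAYOUT}


# Made-up sample record built from three padded fields.
record = "L-001234-20 " + "TOWNSHIP OF MOUNT LA" + "20200115"
print(parse_record(record))
```

Because every field has a fixed number of columns, any value longer than its field is cut off before the export is written, which is the source of the truncation described below.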

Known Issues

Name Truncation

Case captions and party names are stored in fixed-width fields that truncate long names. For example, "TOWNSHIP OF MOUNT LAUREL" might appear as "TOWNSHIP OF MOUNT LA". This affects approximately 15-20% of case titles.
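One way to flag possible truncation without guessing at the missing text is to check whether a value fills its entire field. The sketch below assumes a 20-character caption field, chosen only because it matches the example above; the real field width and the function name are hypothetical.

```python
# Assumed field width for illustration; the real PAB0231 width may differ.
CAPTION_WIDTH = 20


def possibly_truncated(raw_field: str, width: int = CAPTION_WIDTH) -> bool:
    """Return True when the value uses every column of the field.

    Fixed-width fields are padded with trailing spaces, so a value with no
    padding left may have been cut off at the field boundary.
    """
    return len(raw_field.rstrip()) >= width


print(possibly_truncated("TOWNSHIP OF MOUNT LA"))   # fills all 20 columns
print(possibly_truncated("CITY OF NEWARK      "))   # padded, so it fit
```

This check is a heuristic: a name that happens to be exactly as long as the field is flagged even though nothing was lost, so a flag of this kind can only say "may be truncated".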

Inconsistent Formatting

Party names may appear in different formats (e.g., "JOHN SMITH" vs "SMITH, JOHN"). Government agency names may use abbreviations or full names inconsistently.

Missing Data

Some fields may be blank or contain placeholder values. Disposition dates are only available for closed cases.

What We Do

Display data exactly as received from the source
Show hover tooltips explaining potential truncation
Do NOT guess at or complete truncated names

GRC Complaints

GRC complaint data is scraped from the Government Records Council's public decision search system. Names and details are extracted from HTML pages and PDF documents.

Known Issues

Name Variations

The same person or agency may appear with different spellings or abbreviations across complaints (e.g., "Newark PD" vs "Newark Police Department" vs "City of Newark Police").

Extraction Errors

Automated extraction from PDF documents may occasionally miss or misparse information, particularly in older complaints whose PDFs have poor OCR quality.

Missing Documents

Some older GRC complaints have broken PDF links. We've identified and flagged these cases (e.g., 2005-69, 2005-125).

Data Updates

GRC data is scraped daily by an automated GitHub Actions workflow. New decisions are typically added within 24 hours of publication.

OAL Contested Cases

Office of Administrative Law data was obtained via OPRA request and covers matters closed since January 1, 2019.

Known Issues

Privacy Abbreviations

Party names in OAL cases may be abbreviated or redacted for privacy, particularly in education, child welfare, and medical cases.

Date Range

Only cases closed since January 1, 2019 are included. Pending cases and older closed cases are not in this dataset.

Entity Normalization

We extract and normalize entity names from case titles and GRC complaints to enable cross-referencing and analysis. This is an imperfect process.

How It Works

1. Extraction

Party names are extracted from case titles using pattern matching (e.g., splitting on "VS"). Government entities are identified using keyword patterns.
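The extraction step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the regular expression, the keyword list, and both function names are hypothetical.

```python
import re

# Split on "VS" (optionally "VS.") surrounded by whitespace. Assumed pattern.
VS_PATTERN = re.compile(r"\s+VS\.?\s+", re.IGNORECASE)

# Assumed keywords that suggest a government entity; a real list would be longer.
GOV_KEYWORDS = ("TOWNSHIP", "CITY OF", "BOROUGH", "COUNTY", "BOARD OF", "POLICE")


def split_parties(title: str):
    """Return (plaintiff, defendant), or None when the title has no "VS" separator."""
    parts = VS_PATTERN.split(title.strip(), maxsplit=1)
    if len(parts) != 2:
        return None
    return parts[0].strip(), parts[1].strip()


def looks_like_government(name: str) -> bool:
    """Keyword check only; it can misclassify individuals and agencies alike."""
    upper = name.upper()
    return any(keyword in upper for keyword in GOV_KEYWORDS)


parties = split_parties("JOHN SMITH VS TOWNSHIP OF MOUNT LA")
print(parties)
print(looks_like_government(parties[1]))
```

Titles that do not contain a "VS" separator return None in this sketch rather than a guessed split.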

2. Cleaning

Names are cleaned by removing titles, standardizing spacing, and handling common abbreviations.
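A minimal sketch of the cleaning step is shown below. The list of titles and the abbreviation table are assumptions for illustration; the project's actual rules are not documented on this page.

```python
import re

# Assumed honorifics and titles to drop, with an optional trailing period.
TITLES = re.compile(r"\b(MR|MRS|MS|DR|HON|ESQ)\b\.?", re.IGNORECASE)

# Assumed abbreviation table; entries are examples only.
ABBREVIATIONS = {
    "PD": "POLICE DEPARTMENT",
    "TWP": "TOWNSHIP",
    "DEPT": "DEPARTMENT",
}


def clean_name(name: str) -> str:
    """Uppercase, drop titles, expand known abbreviations, collapse spacing."""
    text = TITLES.sub(" ", name.upper())
    words = []
    for word in text.split():          # split() also collapses repeated spaces
        word = word.rstrip(".")        # "TWP." and "TWP" are treated the same
        if word:
            words.append(ABBREVIATIONS.get(word, word))
    return " ".join(words)


print(clean_name("Dr.  John   Smith"))
print(clean_name("Newark PD"))
```

Expanding abbreviations this way is itself a source of error: an abbreviation with more than one meaning would be expanded the same way everywhere.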

3. Deduplication

Similar names are merged using fuzzy matching and manual rules. Known truncated names are mapped to complete versions where possible.
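The deduplication step could look something like the sketch below, which checks a table of manual rules first and then falls back to fuzzy matching with Python's standard difflib module. The alias table, the 0.9 similarity threshold, and the function name are assumptions; the page does not say which fuzzy-matching method the project uses.

```python
from difflib import SequenceMatcher

# Assumed manual rule: a known truncated name mapped to its complete version.
MANUAL_ALIASES = {"TOWNSHIP OF MOUNT LA": "TOWNSHIP OF MOUNT LAUREL"}


def canonical_name(name: str, known: list, threshold: float = 0.9) -> str:
    """Map a name to an existing entity, or return it unchanged if none is close."""
    if name in MANUAL_ALIASES:
        return MANUAL_ALIASES[name]

    best, best_score = None, 0.0
    for candidate in known:
        score = SequenceMatcher(None, name, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score

    # Only merge when the closest known entity is similar enough.
    if best is not None and best_score >= threshold:
        return best
    return name


known_entities = ["NEWARK POLICE DEPARTMENT", "TOWNSHIP OF MOUNT LAUREL"]
print(canonical_name("NEWARK POLICE DEPARTMNT", known_entities))
print(canonical_name("JOHN SMITH", known_entities))
```

The threshold is a trade-off: a lower value merges more spelling variants but also risks merging distinct entities with similar names.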

Limitations

Some entities may still be duplicated under slightly different names
Truncated source data means some entities can never be fully identified
Entity type classification (government vs individual) may be incorrect

Report Data Issues

If you notice data quality issues or incorrect information, or have corrections to suggest, please let us know:

Contact NJFOG
View Full Methodology
