Official Haryana source, mirrored and documented

Research delivery page for the Sonipat polling-station dataset.

Built from the official district PDF, preserved in multiple formats, and wrapped with clear provenance, previews, and validation notes.

Browse the data

How to confirm it

Source lock

8-Sonipat.pdf

Download source PDF

What’s here

Raw booth-wise CSV
Page-level and line-level PDF exports
Assembly stats and descriptive tests
Exact scripts used to build the bundle

Designed for

A shareable research microsite where the PDF, the extracted files, the method, and the verification path all live in one place.

Project framing

What this website is doing

1. Mirror the research folder

The site ships the actual project files inside a mirrored folder tree so the website is not just a summary. It is the bundle itself.

2. Explain how the data was built

The extraction and analysis scripts are downloadable, and the steps from official PDF to CSV are shown in plain language.

3. Make verification easy

The final section is devoted to source checks, hashes, page references, and the safest way to audit the outputs.

Source route

Where the PDF came from

This section shows both the original Haryana CEO source and the mirrored local copy used in the project.

Original website page

Haryana CEO polling-stations page

The main source website page where the district polling-station lists are hosted.

Open source website

Direct file used

Exact PDF extracted in this project

The direct official Sonipat PDF URL used for extraction.

Open direct PDF URL

How to navigate there

Manual retrieval steps

Download mirrored PDF copy

Research bundle

The mirrored folder structure

Everything below is deployable on Cloudflare Pages as static files, including the downloadable data and scripts.

Hosting strategy

The data is hosted directly with the website. On Cloudflare Pages that means the CSVs, PDF, and scripts are just static assets under the same domain, so links stay simple and nothing depends on a separate storage backend.

That is enough for this project: the largest user-facing file is the line-level export, and it is still small enough to serve as an ordinary static download.

Method

How the data was obtained and transformed

Download the official district PDF

The starting point is the official Haryana CEO polling-station PDF for Sonipat district, preserved here without modification.

Extract source-faithful text

Two preservation layers were created: one row per PDF page and one row per text line, so the original wording remains inspectable.
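
The two layers can be pictured with a minimal sketch. It assumes the text of each PDF page has already been extracted into a list of strings (the extraction library is not named on this page), and the function names and the column names `page`, `line`, and `text` are illustrative, not the project's actual schema.

```python
import csv
import io


def build_layers(page_texts):
    """Return (page_rows, line_rows) from a list of per-page text strings.

    Assumption: page_texts[0] is the text of PDF page 1, and so on.
    """
    page_rows = []
    line_rows = []
    for page_number, text in enumerate(page_texts, start=1):
        # Page layer: one row per page, text kept exactly as extracted.
        page_rows.append({"page": page_number, "text": text})
        # Line layer: one row per text line, blank lines included,
        # so the page can be reconstructed from its lines.
        for line_number, line in enumerate(text.splitlines(), start=1):
            line_rows.append(
                {"page": page_number, "line": line_number, "text": line}
            )
    return page_rows, line_rows


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping blank lines and avoiding any cleanup at this stage is what lets the later booth-wise CSV be traced back to the exact page and line it came from.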

Build a booth-wise raw CSV

The PDF was then turned into one row per polling station with constituency and page references, while keeping the extracted booth text raw rather than over-cleaned.

Run descriptive tests

A stats-ready feature layer was created from the raw text to support structural comparisons across assembly constituencies.

Data room

Downloads and live previews

The previews below are curated windows into the files. The downloads remain the authoritative deliverables.

Booth-wise raw CSV preview

First rows from the station-level dataset with page references.

Full PDF text by page

Page-level preservation export for near-verbatim review.

Full PDF text by line

Line-level export for manual auditing and reconstruction.

Findings

What the descriptive tests surfaced

These are structure-level findings from the extracted polling-station text. They are useful for orientation and comparison, not as causal claims.

Station counts by assembly

How the raw polling-station rows distribute across the six assembly segments.
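
A count like this is a simple tally over the station rows. The sketch below assumes each row is a dict with an `assembly` column; that column name is hypothetical and should be replaced with whatever the raw CSV actually uses.

```python
from collections import Counter


def station_counts(station_rows, key="assembly"):
    """Count polling-station rows per assembly segment.

    Assumption: `key` names the column holding the assembly label.
    """
    return Counter(row[key] for row in station_rows)
```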

Urban-coded booth text share

Proxy based on mentions like ward, sector, colony, apartment, tower, or city.
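
One way such a proxy could be computed is sketched here. The keyword list comes from the description above; the whole-word, case-insensitive matching rule and the function names are assumptions, and the project's own script may match differently.

```python
import re

# Keywords named in the description above.
URBAN_TERMS = ("ward", "sector", "colony", "apartment", "tower", "city")

# Assumption: match whole words only, ignoring case.
URBAN_PATTERN = re.compile(
    r"\b(?:" + "|".join(URBAN_TERMS) + r")\b", re.IGNORECASE
)


def is_urban_coded(booth_text):
    """Return True if the booth text mentions any urban keyword."""
    return URBAN_PATTERN.search(booth_text) is not None


def urban_share(booth_texts):
    """Return the fraction of booth texts that are urban-coded."""
    texts = list(booth_texts)
    if not texts:
        return 0.0
    return sum(is_urban_coded(text) for text in texts) / len(texts)
```

Whole-word matching avoids false hits such as "toward", but it also misses plurals like "wards", so the share is best read as a rough indicator.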

Scheduled Caste or chaupal references

Text features that frequently mark localized or caste-specific site references.

Government-building mention share

How often booth text explicitly references government institutions.

Verification

How to confirm the outputs yourself

Check the source first

Download the official PDF from the mirrored source folder.
Compare its SHA-256 hash with the value shown on this page.
Confirm the PDF page count and the constituency page starts.
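
The hash comparison can be done with the Python standard library alone. This is a minimal sketch: the function names are illustrative, and the expected value must be copied from this page, since it is not reproduced here.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """Compare a file's digest with a published hex string.

    Surrounding whitespace and letter case in the published value
    are ignored.
    """
    return sha256_of_file(path) == published_hex.strip().lower()
```

For example, `matches_published_hash("8-Sonipat.pdf", expected)` would return True only if the downloaded file is byte-for-byte identical to the one that was hashed.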

Audit the CSVs

Open the raw station CSV and inspect any polling station row you care about.
Use the page-range columns to jump back into the source PDF.
Cross-check the same content against the page-level or line-level exports.
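
The cross-check can also be scripted. The sketch below assumes the station row carries `page_start` and `page_end` columns and that the line-level export has a `page` column; all three names are hypothetical stand-ins for the real headers.

```python
def lines_for_station(station_row, line_rows):
    """Return the line-export rows that fall inside a station's page range.

    Assumptions: station_row has 'page_start' and 'page_end', each
    line row has 'page', and all hold integer-like values.
    """
    first = int(station_row["page_start"])
    last = int(station_row["page_end"])
    return [row for row in line_rows if first <= int(row["page"]) <= last]
```

Reading the returned lines next to the station's raw text shows quickly whether the booth row reflects what the PDF says on those pages.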

Rebuild it

Download the scripts from the mirrored research folder.
Run the extraction and analysis scripts locally on the same PDF.
Confirm that the CSVs and test outputs reproduce the same numbers.
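
To confirm that a rebuilt CSV matches the published one, a row-by-row comparison is usually more informative than a file hash, because it points to where the two diverge. This is a sketch under the assumption that both files are UTF-8 encoded; the function names are illustrative.

```python
import csv


def read_rows(path):
    """Read a CSV file into a list of rows (each row a list of strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


def first_difference(original_path, rebuilt_path):
    """Return the 0-based index of the first differing row, or None.

    If one file is a strict prefix of the other, the index of the
    first missing row is returned.
    """
    original = read_rows(original_path)
    rebuilt = read_rows(rebuilt_path)
    for index, (row_a, row_b) in enumerate(zip(original, rebuilt)):
        if row_a != row_b:
            return index
    if len(original) != len(rebuilt):
        return min(len(original), len(rebuilt))
    return None
```

A result of None means every row, including the header, is identical in both files.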