Official Haryana source, mirrored and documented
Research delivery page for the Sonipat polling-station dataset.
Built from the official district PDF, preserved in multiple formats, and wrapped with clear provenance, previews, and validation notes.
- Raw booth-wise CSV
- Page-level and line-level PDF exports
- Assembly stats and descriptive tests
- Exact scripts used to build the bundle
A shareable research microsite where the PDF, the extracted files, the method, and the verification path all live in one place.
Project framing
What this website is doing
1. Mirror the research folder
The site ships the actual project files inside a mirrored folder tree so the website is not just a summary. It is the bundle itself.
2. Explain how the data was built
The extraction and analysis scripts are downloadable, and the steps from official PDF to CSV are shown in plain language.
3. Make verification easy
The final section is devoted to source checks, hashes, page references, and the safest way to audit the outputs.
Source route
Where the PDF came from
This section shows both the original Haryana CEO source and the mirrored local copy used in the project.
Haryana CEO polling-stations page
The main source website page where the district polling-station lists are hosted.
Open source website
Exact PDF extracted in this project
The direct official Sonipat PDF URL used for extraction.
Open direct PDF URL
Manual retrieval steps
Download mirrored PDF copy
Research bundle
The mirrored folder structure
Everything below is deployable on Cloudflare Pages as static files, including the downloadable data and scripts.
Hosting strategy
The data is hosted directly with the website. On Cloudflare Pages that means the CSVs, PDF, and scripts are just static assets under the same domain, so links stay simple and nothing depends on a separate storage backend.
For this project that is enough, because the largest user-facing file is the line export and it is still small enough to serve as a normal static download.
Method
How the data was obtained and transformed
Download the official district PDF
The starting point is the official Haryana CEO polling-station PDF for Sonipat district, preserved here without modification.
Extract source-faithful text
Two preservation layers were created: one row per PDF page and one row per text line, so the original wording remains inspectable.
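The two preservation layers can be sketched as follows. This is a minimal illustration, not the project's actual script: it assumes the text of each PDF page has already been extracted into a list of strings (`page_texts`), and the column names `page`, `line`, and `text` are hypothetical.

```python
import csv
import io


def build_page_rows(page_texts):
    """One row per PDF page: 1-based page number plus the unmodified page text."""
    return [{"page": i, "text": text} for i, text in enumerate(page_texts, start=1)]


def build_line_rows(page_texts):
    """One row per text line, keyed by page number and line number within the page."""
    rows = []
    for page_no, text in enumerate(page_texts, start=1):
        for line_no, line in enumerate(text.splitlines(), start=1):
            rows.append({"page": page_no, "line": line_no, "text": line})
    return rows


def to_csv(rows, fieldnames):
    """Serialize rows to CSV text; the csv module quotes embedded newlines and commas."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Because neither function trims or rewrites the text, the page-level rows can be used to check the line-level rows and vice versa.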
Build a booth-wise raw CSV
The PDF was then turned into one row per polling station with constituency and page references, while keeping the extracted booth text raw rather than over-cleaned.
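One way to picture this step is grouping line-level rows into station records while tracking the pages each record spans. The sketch below is only an illustration under stated assumptions: the real PDF layout is not described on this page, so the rule that a station entry starts with a serial number at the beginning of a line (`STATION_START`) is hypothetical, as are the field names.

```python
import re

# Hypothetical layout rule: a station entry begins with a serial number
# followed by whitespace and the first part of the booth text.
STATION_START = re.compile(r"^\s*(\d+)\s+(.*)$")


def group_stations(line_rows, assembly):
    """Group line rows into one record per station, keeping the text raw."""
    stations = []
    current = None
    for row in line_rows:
        match = STATION_START.match(row["text"])
        if match:
            # A new station starts here; remember the page it starts on.
            current = {
                "assembly": assembly,
                "station_no": int(match.group(1)),
                "page_start": row["page"],
                "page_end": row["page"],
                "booth_text": match.group(2),
            }
            stations.append(current)
        elif current is not None and row["text"].strip():
            # Continuation line: append verbatim and extend the page range.
            current["booth_text"] += "\n" + row["text"]
            current["page_end"] = row["page"]
    return stations
```

A rule this simple would misread any continuation line that happens to begin with a number, which is one reason to keep the page references and the raw text side by side for manual checks.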
Run descriptive tests
A stats-ready feature layer was created from the raw text to support structural comparisons across assembly constituencies.
Data room
Downloads and live previews
The previews below are curated windows into the files. The downloads remain the authoritative deliverables.
Booth-wise raw CSV preview
First rows from the station-level dataset with page references.
Full PDF text by page
Page-level preservation export for near-verbatim review.
Full PDF text by line
Line-level export for manual auditing and reconstruction.
Findings
What the descriptive tests surfaced
These are structure-level findings from the extracted polling-station text. They are useful for orientation and comparison, not as causal claims.
Station counts by assembly
How the raw polling-station rows distribute across the six assembly segments.
Urban-coded booth text share
Proxy based on mentions like ward, sector, colony, apartment, tower, or city.
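A keyword proxy of this kind can be sketched in a few lines. The term list comes from the description above; the matching rule (whole words, case-insensitive) and the field names `assembly` and `booth_text` are assumptions, not the project's confirmed implementation.

```python
import re
from collections import defaultdict

URBAN_TERMS = ("ward", "sector", "colony", "apartment", "tower", "city")
# Whole-word, case-insensitive match, so "towards" does not count as "ward".
URBAN_PATTERN = re.compile(r"\b(?:" + "|".join(URBAN_TERMS) + r")\b", re.IGNORECASE)


def urban_share_by_assembly(rows):
    """Return, per assembly, the share of stations whose text mentions an urban term."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        assembly = row["assembly"]
        totals[assembly] += 1
        if URBAN_PATTERN.search(row["booth_text"]):
            hits[assembly] += 1
    return {assembly: hits[assembly] / totals[assembly] for assembly in totals}
```

Whole-word matching also means plurals such as "wards" are not counted, so the share is a rough orientation figure rather than a classification of urban and rural booths.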
Scheduled Caste or chaupal references
Text features that frequently mark localized or caste-specific site references.
Government-building mention share
How often booth text explicitly references government institutions.
Verification
How to confirm the outputs yourself
Check the source first
- Download the official PDF from the mirrored source folder.
- Compare its SHA-256 hash with the value shown on this page.
- Confirm the PDF page count and the constituency page starts.
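The hash comparison can be done with any SHA-256 tool. As a minimal sketch using Python's standard `hashlib`, where the file path and the published hash string are whatever you downloaded and copied from this page:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so a large PDF is never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """Compare against the published value, ignoring case and surrounding whitespace."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

A mismatch means the downloaded file differs from the one documented here by at least one byte, so stop and re-download before auditing anything else.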
Audit the CSVs
- Open the raw station CSV and inspect any polling station row you care about.
- Use the page-range columns to jump back into the source PDF.
- Cross-check the same content against the page-level or line-level exports.
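The cross-check can also be scripted. The sketch below assumes hypothetical column names (`page_start`, `page_end`, `booth_text`) and a dictionary mapping page numbers to page text loaded from the page-level export; adjust both to the actual headers in the downloaded files.

```python
import re


def normalize(text):
    """Collapse whitespace and lowercase so line breaks do not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def station_text_found(station_row, pages_by_number):
    """Check that a station's text appears within its stated page range."""
    start = int(station_row["page_start"])
    end = int(station_row["page_end"])
    window = " ".join(pages_by_number.get(page, "") for page in range(start, end + 1))
    needle = normalize(station_row["booth_text"])
    # An empty booth text is treated as a failure rather than a trivial match.
    return bool(needle) and needle in normalize(window)
```

Rows that fail this check are not necessarily wrong, but they are the ones worth opening in the source PDF by hand.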
Rebuild it
- Download the scripts from the mirrored research folder.
- Run the extraction and analysis scripts locally on the same PDF.
- Confirm that the CSVs and test outputs reproduce the same numbers.
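To confirm a rebuild, a row-by-row comparison is more informative than a file hash, because it shows where two CSVs diverge. A minimal sketch, assuming both files are UTF-8 encoded and that you supply the two paths yourself:

```python
import csv


def read_rows(path):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


def diff_csv(published_path, rebuilt_path, limit=10):
    """Return up to `limit` differing rows as (row_number, published, rebuilt)."""
    published = read_rows(published_path)
    rebuilt = read_rows(rebuilt_path)
    differences = []
    for index in range(max(len(published), len(rebuilt))):
        left = published[index] if index < len(published) else None
        right = rebuilt[index] if index < len(rebuilt) else None
        if left != right:
            # Row numbers are 1-based and count the header as row 1.
            differences.append((index + 1, left, right))
            if len(differences) >= limit:
                break
    return differences
```

An empty result means the rebuilt file matches the published one row for row; a `None` on either side marks a row that exists in only one of the files.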