# Grants ETL Pipeline: Rust + Transformer-Based Classification
## Overview
I built an end-to-end ETL pipeline to ingest, classify, and analyze Canadian government grant data. The project combines:
- High-performance data extraction using Rust
- Semantic classification using BERT (zero-shot)
- Structured output ready for downstream analytics and dashboarding
This project demonstrates systems design, data engineering, and applied NLP in a production-style pipeline.
## Extraction Layer (Rust)
### The Problem
The Grants Canada portal has no accessible API; it exposes only an HTML-rendered search interface. I needed a way to extract structured data at scale.
### The Solution
I built a custom scraper targeting the paginated search endpoint:
```
https://search.open.canada.ca/grants/?page={}&sort=agreement_start_date+desc
```
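Since only the page index varies, enumerating target URLs is straightforward. A minimal Python sketch (the page count below is a placeholder, not the portal's actual depth):

```python
# Paginated search endpoint; {} is the slot for the page number
BASE_URL = "https://search.open.canada.ca/grants/?page={}&sort=agreement_start_date+desc"

def page_urls(num_pages: int) -> list[str]:
    """Build the list of search-result URLs to fetch (pages are 1-indexed)."""
    return [BASE_URL.format(page) for page in range(1, num_pages + 1)]

urls = page_urls(3)
print(urls[0])
```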
### Key Decisions
I started with Python but switched to Rust for performance at scale. The Rust scraper uses:
- `scraper` for HTML parsing
- `csv` for structured output
The scraper is designed to handle large-scale ingestion efficiently without excessive memory usage or runtime.
### Outcome
- Successfully extracted structured grant data into CSV
- Significantly faster ingestion than the prior Python-based workflow
### Sample Record
- **Agreement:** European Space Agency (ESA)'s Space Weather Training Course
- **Agreement Number:** 25COBLLAMY
- **Date Range:** Mar 11, 2026 – Mar 27, 2026
- **Description:** Supports Canadian students attending international space training events
- **Recipient:** Canadian Space Agency
- **Amount:** $1,000.00
- **Location:** La Prairie, Quebec, CA
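Fields like the amount arrive as formatted currency strings, so the transformation step needs a small normalization pass before numeric analysis. A sketch (`parse_amount` is a hypothetical helper, not a function from the project):

```python
def parse_amount(raw: str) -> float:
    """Convert a currency string like '$1,000.00' to a float.
    (Hypothetical helper; the project's actual field cleaning may differ.)"""
    return float(raw.replace("$", "").replace(",", ""))

print(parse_amount("$1,000.00"))
```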
## Transformation + Classification
### Objective
Categorize grants into meaningful sectors for analytics and discovery, making the data explorable beyond raw fields.
### Categories
```python
CATEGORIES = [
    "Housing & Shelter",
    "Education & Training",
    "Employment & Entrepreneurship",
    "Business & Innovation",
    "Health & Wellness",
    "Environment & Energy",
    "Community & Nonprofits",
    "Research & Academia",
    "Indigenous Programs",
    "Public Safety & Emergency Services",
    "Agriculture & Rural Development",
    "Arts, Culture & Heritage",
    "Civic & Democratic Engagement",
]
```
### Model Choice
I evaluated two approaches:
| Approach | Verdict |
|---|---|
| Traditional ML (clustering) | Requires labeled data, less semantic |
| BERT via Hugging Face (zero-shot) | Selected |
**Why zero-shot BERT?**
- No labeled dataset required
- Strong semantic understanding out-of-the-box
- Fast to implement and iterate
### Inference Pipeline
```python
from transformers import pipeline

# Zero-shot classification pipeline; Hugging Face picks a default NLI model
classifier = pipeline("zero-shot-classification")

print("Running classification...")
predictions = []
for text in df['text']:
    result = classifier(text, candidate_labels=CATEGORIES)
    predictions.append({
        'predicted_category': result['labels'][0],  # top-ranked label
        'confidence_score': result['scores'][0]     # score for that label
    })
```
Each grant description gets mapped to its most semantically relevant category, with a confidence score attached.
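With labels and scores attached, first-pass analytics can be simple tallies. A sketch assuming the `predictions` shape built above; the threshold is an arbitrary illustrative value, not one tuned by the project:

```python
from collections import Counter

# Example predictions in the shape produced by the classification loop
predictions = [
    {"predicted_category": "Research & Academia", "confidence_score": 0.91},
    {"predicted_category": "Education & Training", "confidence_score": 0.44},
    {"predicted_category": "Research & Academia", "confidence_score": 0.78},
]

THRESHOLD = 0.5  # arbitrary cutoff; tune against spot-checked labels

# Count only confident assignments; low-confidence rows can be flagged for review
confident = Counter(
    p["predicted_category"]
    for p in predictions
    if p["confidence_score"] >= THRESHOLD
)
print(confident.most_common(1))  # -> [('Research & Academia', 2)]
```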
## Data Quality
The source data was highly structured and clean, which meant:
- Minimal preprocessing required
- Faster iteration on modeling and pipeline integration
- No time lost on data wrangling before getting to the interesting parts
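Even "minimal preprocessing" usually amounts to a light normalization pass. A sketch of the kind of cleanup that suffices for clean source data (the character cap is an arbitrary guard for model input limits, not a value from the project):

```python
import re

def clean_text(text: str, max_chars: int = 1000) -> str:
    """Collapse whitespace and cap length before sending text to the model.
    (Illustrative only; this dataset needed little beyond this.)"""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed[:max_chars]

print(clean_text("  Supports   Canadian\nstudents  "))
```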
## Next Steps
The pipeline is actively being extended:
- **Load Layer**: Persist classified data in a database
- **Analytics Dashboard**: Visualize funding trends by category, region, and time
- **Pipeline Orchestration**: Automate ingestion + inference end-to-end
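As a preview of the load layer, a minimal sketch persisting classified rows to SQLite (the schema, table name, and classification values here are placeholders, not the project's actual design):

```python
import sqlite3

# Placeholder schema; the real load layer may target a different database
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE grants (
        agreement_number   TEXT PRIMARY KEY,
        recipient          TEXT,
        amount             REAL,
        predicted_category TEXT,
        confidence_score   REAL
    )
""")

# Sample record from the extraction layer + an illustrative classification
row = ("25COBLLAMY", "Canadian Space Agency", 1000.0,
       "Research & Academia", 0.91)
conn.execute("INSERT INTO grants VALUES (?, ?, ?, ?, ?)", row)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM grants").fetchone()[0]
print(count)  # -> 1
```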
## Key Takeaways
- Rust is a legit choice for ETL scraping, not just systems programming. The performance gains over Python are real and measurable.
- Zero-shot BERT punches above its weight for classification tasks without labeled data. It's a great first-pass model.
- Modular pipeline design pays off early: separating extraction, transformation, and load made iteration much faster.
- Don't over-engineer: the right tool for each layer matters more than using a single stack.
## Links
- GitHub: [github.com/Sher213/GrantsInvestments](https://github.com/Sher213/GrantsInvestments)
Open to opportunities in Data Science, ML Engineering, and Data Engineering. Feel free to reach out at alisher213@outlook.com.