Over the past few weeks, several large open-source programmes have experienced data exposure incidents — contributor email addresses, phone numbers, and personal metadata leaked through insecure database configurations and overly permissive API endpoints. The developer community has rightly asked difficult questions about how programmes handling tens of thousands of student developers should be architecting their systems.

Contributors to Social Summer of Code (SSoC) 2026 have asked me the same question:
"SSoC collects our email addresses and phone numbers for onboarding and certification. How do you prevent the kind of database leaks we've recently seen elsewhere?"
This article is my answer. Not a marketing piece. Not a "we're unhackable" press release. A genuine engineering deep-dive into the architectural decisions that shape how SSoC handles contributor data — and the philosophy behind those decisions.
The central thesis is simple:
Security is not primarily about encryption, firewalls, or databases. Security starts with architecture. The best defence is often ensuring sensitive data never reaches systems that don't need it.
I call this philosophy Secure Isolation by Design.
The Threat Model
Before discussing architecture, let's establish what we're defending against. SSoC collects the following PII (Personally Identifiable Information) during registration:
| Data Type | Purpose | Sensitivity |
|---|---|---|
| Full Name | Display, certificates | Low |
| Email Address | Onboarding, notifications, certificates | High |
| Phone Number | Emergency contact, WhatsApp groups | High |
| GitHub Username | Scoring, attribution | Low (public) |
| LinkedIn URL | Networking, verification | Low (public) |
| Discord Username | Community communication | Medium |
The threat model is straightforward:
- External attackers probing public infrastructure for exposed databases, APIs, or admin panels
- Accidental exposure through misconfigured deployments, leaked environment variables, or overly broad API responses
- Supply chain risks from dependencies with access to runtime data
- Insider risk from overly broad access to administrative systems
The typical response to these threats is defence-in-depth: encrypt the database, add authentication, set up WAF rules, rotate keys, monitor access logs. All of those are valid. But they all assume the sensitive data is in the system in the first place.
What if it isn't?
Architecture Overview
SSoC 2026 serves hundreds of open-source projects, batch-generated leaderboards, contributor profiles, certificate verification, and administrative tooling for a programme with 50,000+ contributors. Here's how the system is structured:
graph TB
subgraph "Public Internet"
USER[Contributors / Public]
end
subgraph "Public Platform (Static Site)"
SITE[Static Website<br/>React + Vite]
LB_DATA[user-scores.js<br/>Leaderboard Data]
PR_DATA[pr-list.js<br/>PR Details]
PROJ_DATA[projects.json<br/>Project Metadata]
CERT_DATA[Certificate Data]
end
subgraph "Build Pipeline (Offline)"
SCORING[Node.js Scoring Engine<br/>single.js]
GH_API[GitHub GraphQL API<br/>Public PR Data]
BONUS[Bonus Points Repo<br/>GitHub]
end
subgraph "Administrative Layer (Local Only)"
GC[Ground Control<br/>Admin Interface]
MASTER[Master TSV<br/>Registration Data]
LOCAL_SCRIPTS[Local Processing<br/>Validation Scripts]
end
USER -->|HTTPS| SITE
SITE --- LB_DATA
SITE --- PR_DATA
SITE --- PROJ_DATA
SITE --- CERT_DATA
GH_API -->|Public metadata| SCORING
BONUS -->|Bonus points| SCORING
SCORING -->|Generated static files| LB_DATA
SCORING -->|Generated static files| PR_DATA
MASTER -->|Local fetch only| GC
MASTER -->|Local input| LOCAL_SCRIPTS
style MASTER fill:#ff6b6b,stroke:#c0392b,color:#fff
style GC fill:#ff6b6b,stroke:#c0392b,color:#fff
style LOCAL_SCRIPTS fill:#ff6b6b,stroke:#c0392b,color:#fff
style SITE fill:#2ecc71,stroke:#27ae60,color:#fff
style LB_DATA fill:#2ecc71,stroke:#27ae60,color:#fff
style PR_DATA fill:#2ecc71,stroke:#27ae60,color:#fff
style PROJ_DATA fill:#2ecc71,stroke:#27ae60,color:#fff
The red components never touch the public internet. The green components contain zero PII. The two layers share no runtime connection.
1. The Public Platform: A Database-Free Architecture
The SSoC public website — the portal that 50,000+ contributors interact with — is a static React application built with Vite. It has:
- No backend server — no Express, no Fastify, no serverless functions
- No database — no MongoDB, no PostgreSQL, no Firebase Realtime Database
- No authentication layer — no JWT tokens, no session cookies, no OAuth flows
- No API endpoints — no REST routes, no GraphQL resolvers, no WebSocket connections
Everything the public sees is pre-generated.
How Leaderboards Work Without a Database
The scoring engine is a Node.js script (single.js) that runs offline. It:
- Queries the GitHub GraphQL API for public PR metadata across all programme repositories
- Evaluates each PR against scoring rules (difficulty labels, blacklists, registered users)
- Fetches bonus points from a separate GitHub repository
- Generates static JavaScript files containing the computed results
// Output: user-scores.js (loaded via <script> tag)
window.userScores = {
  users: {
    "contributor-a": {
      totalScore: 450,
      prCount: 12,
      prsByLevel: { Easy: 3, Medium: 5, Hard: 2, Advanced: 2 },
      bonusScore: 50,
      // No email. No phone. No PII.
    },
    // ... 50,000 more entries
  },
  summary: { totalPRs: 87432, totalPoints: 2156000 }
};
The output files are committed and deployed as static assets. The website loads them via <script> tags — no fetch calls, no API requests, no database queries.
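The generation step can be sketched roughly as follows. This is a minimal illustration and not the actual single.js: the function name `buildUserScoresFile`, the input shapes, and the per-level point values are all assumptions made for the example.

```javascript
// Hypothetical point values per difficulty label (assumed, not SSoC's real rules).
const POINTS = { Easy: 10, Medium: 25, Hard: 50, Advanced: 75 };

// prs:        [{ author, level }] taken from public PR metadata (assumed shape)
// registered: array of registered GitHub usernames
// bonus:      { username: points } (assumed shape)
function buildUserScoresFile(prs, registered, bonus = {}) {
  const allowed = new Set(registered.map((u) => u.toLowerCase()));
  // Null-prototype object so usernames like "constructor" cannot collide
  // with inherited properties.
  const users = Object.create(null);
  let totalPRs = 0;
  let totalPoints = 0;

  for (const pr of prs) {
    const login = pr.author.toLowerCase();
    // Skip unregistered authors and unknown difficulty labels.
    if (!allowed.has(login) || !Object.hasOwn(POINTS, pr.level)) continue;
    const entry = (users[login] ??= {
      totalScore: 0,
      prCount: 0,
      prsByLevel: {},
      bonusScore: 0,
    });
    entry.totalScore += POINTS[pr.level];
    entry.prCount += 1;
    entry.prsByLevel[pr.level] = (entry.prsByLevel[pr.level] ?? 0) + 1;
    totalPRs += 1;
    totalPoints += POINTS[pr.level];
  }

  // In this sketch, bonus points only apply to users who already have an entry.
  for (const [name, points] of Object.entries(bonus)) {
    const entry = users[name.toLowerCase()];
    if (!entry) continue;
    entry.bonusScore += points;
    entry.totalScore += points;
    totalPoints += points;
  }

  const data = { users, summary: { totalPRs, totalPoints } };
  // The result is plain text that can be written to disk and loaded via <script>.
  return `window.userScores = ${JSON.stringify(data, null, 2)};\n`;
}
```

Note that the function only ever receives usernames, difficulty labels, and point values, so there is no code path by which an email address or phone number could end up in the generated file.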
"The safest database query is the one you never have to make."
This isn't a limitation. It's the design. A static architecture means:
| Attack Vector | Typical Web App | SSoC Public Platform |
|---|---|---|
| SQL/NoSQL Injection | Possible | Not applicable — no database |
| API enumeration | Possible | Not applicable — no API |
| Authentication bypass | Possible | Not applicable — no auth layer |
| Session hijacking | Possible | Not applicable — no sessions |
| Server-side request forgery | Possible | Not applicable — no server |
| Database credential leak | Possible | Not applicable — no credentials |
| Exposed admin API | Possible | Not applicable — no API |
The attack surface isn't "well-defended." It largely doesn't exist.
What About Dynamic Features?
The platform has interactive features — search, filtering, score breakdowns, certificate validation, PR lookup. All of these operate entirely client-side against the pre-generated data. The search bar on the Projects page doesn't query a database; it filters a JSON array already loaded in the browser.
// Client-side search: no server round-trip
const filtered = projects.filter(p =>
  p.name.toLowerCase().includes(query.toLowerCase()) ||
  p.owner.toLowerCase().includes(query.toLowerCase()) ||
  p.techStack.some(t => t.toLowerCase().includes(query.toLowerCase()))
);
2. PII Separation: The Data That Never Deploys
SSoC collects contributor PII during registration through Google Forms. This data flows into a Google Sheet, which is exported as a TSV (Tab-Separated Values) file for administrative processing.
Here's the critical architectural decision: this file never enters the deployment pipeline.
graph LR
subgraph "Registration Flow"
FORM[Google Form] -->|Responses| SHEET[Google Sheet]
SHEET -->|Manual TSV export| LOCAL[Local Machine<br/>MasterSheetsData.tsv]
end
subgraph "Public Deployment"
BUILD[Vite Build] -->|Static assets| CDN[GitHub Pages / CDN]
end
LOCAL -.-x|NEVER| BUILD
style LOCAL fill:#ff6b6b,stroke:#c0392b,color:#fff
style CDN fill:#2ecc71,stroke:#27ae60,color:#fff
The TSV contains email addresses, phone numbers, LinkedIn URLs, and role information. It exists on my local machine. It is listed in .gitignore. It is never committed to the repository. It is never included in the build output. It is never uploaded to any server.
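A claim like "never included in the build output" can also be checked mechanically before each deploy. The sketch below is one hedged way to do that and is not part of the tooling described in this article: `findPiiLikeStrings` and both regular expressions are assumptions, and patterns this loose will produce false positives (an author email in a licence comment, for example) and can miss unusual formats.

```javascript
// Deliberately loose patterns; both are assumptions to tune for your own data.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_RE = /\+\d[\d\s-]{8,14}\d/g; // international format with a leading +

// files: { [path]: fileContentsAsString }
// Returns one finding per email-like or phone-like match.
function findPiiLikeStrings(files) {
  const findings = [];
  for (const [path, text] of Object.entries(files)) {
    for (const match of text.match(EMAIL_RE) ?? []) {
      findings.push({ path, kind: "email", match });
    }
    for (const match of text.match(PHONE_RE) ?? []) {
      findings.push({ path, kind: "phone", match });
    }
  }
  return findings;
}
```

In practice you would read every file under the build directory (dist/ is Vite's default output directory) into that map and refuse to deploy if the returned list is non-empty.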
Data Minimisation in Practice
The scoring engine needs to know which GitHub usernames are registered participants — but it doesn't need their email addresses or phone numbers to calculate scores. So the input to the scoring engine is a minimal users.json file:
["contributor-a", "contributor-b", "contributor-c"]
Just usernames. Generated from the Master TSV by extracting a single column. The scoring engine never sees the full registration data.
This is GDPR's data minimisation principle in practice: each system component receives only the minimum data required for its specific function.
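The single-column extraction can be sketched as below. This is an illustration under stated assumptions rather than the actual script: `extractUsernames` is a hypothetical name, the "GitHub Username" header is a guess at what the export calls that column, and the naive tab split does not handle quoted fields that contain tabs or newlines.

```javascript
// Extract one column from TSV text and return a de-duplicated, sorted list.
// The default column name is an assumption; pass the real header if it differs.
function extractUsernames(tsvText, columnName = "GitHub Username") {
  const lines = tsvText.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return [];

  const header = lines[0].split("\t").map((h) => h.trim());
  const index = header.indexOf(columnName);
  if (index === -1) throw new Error(`Column not found: ${columnName}`);

  const seen = new Set();
  for (const line of lines.slice(1)) {
    // Only the one cell is read; every other column is ignored.
    const value = (line.split("\t")[index] ?? "")
      .trim()
      .replace(/^@/, "") // tolerate "@username" entries
      .toLowerCase(); // GitHub usernames are case-insensitive
    if (value) seen.add(value);
  }
  return [...seen].sort();
}
```

Serialising the result with `JSON.stringify` gives the flat array shown above, and nothing else from the registration export travels with it.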
3. Ground Control: The Admin Interface That Can't Be Replicated
Ground Control is the internal administrative interface for SSoC. It provides:
- Contributor validation (name formatting, email checks, phone number cleanup)
- Duplicate detection across registration entries
- PR cross-referencing with scoring data
- Diagnostic panels for data quality
- Discord ID and GitHub username lookup
The route exists in the React application at /ground-control. It ships in the production bundle. You can navigate to it right now. But here's what happens when you do:
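// Ground Control's data loading
useEffect(() => {
  fetch("/LocalData/MasterSheetsData.tsv")
    .then(res => {
      // fetch() resolves on a 404, so reject explicitly to reach the catch below
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return res.text();
    })
    .then(text => parseTSV(text))
    .catch(() => setError("Data not available"));
}, []);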
It tries to fetch MasterSheetsData.tsv from a local path. On the production deployment, that file doesn't exist. The fetch returns a 404. Ground Control renders an empty state. There's nothing to see.
This is not security through obscurity. The route isn't hidden (it's linked from the Tools page). The code isn't obfuscated. The approach is simpler than that: the administrative interface depends on data that is architecturally absent from the public deployment. Even if you find the route, read the source code, and understand exactly how it works, you cannot reproduce the administrative workflow because the underlying dataset isn't there.
graph TB
subgraph "Production Deployment"
ROUTE["/ground-control Route"]
FETCH[fetch /LocalData/MasterSheetsData.tsv]
FOUR04[404 Not Found]
EMPTY[Empty State UI]
end
subgraph "Local Development"
ROUTE_L["/ground-control Route"]
FETCH_L[fetch /LocalData/MasterSheetsData.tsv]
DATA[MasterSheetsData.tsv<br/>Email, Phone, PII]
FULL[Full Admin Interface]
end
ROUTE --> FETCH --> FOUR04 --> EMPTY
ROUTE_L --> FETCH_L --> DATA --> FULL
style FOUR04 fill:#e74c3c,stroke:#c0392b,color:#fff
style DATA fill:#ff6b6b,stroke:#c0392b,color:#fff
style EMPTY fill:#95a5a6,stroke:#7f8c8d,color:#fff
style FULL fill:#2ecc71,stroke:#27ae60,color:#fff
Why Not Remove the Route Entirely?
A reasonable question. The answer is developer workflow. Ground Control is used during local development and programme operations. Maintaining a separate build configuration to strip it from production adds complexity and creates a divergence between development and production builds — which introduces its own class of bugs. The simpler, more reliable approach: let the route exist, ensure it has no data to display.
4. Local-First Administration
Administrative processing — the work that actually touches PII — happens on my local machine. Not on a server. Not in a cloud function. Not behind an authenticated API. Locally.
This includes:
- Email validation — regex checks, domain verification, duplicate detection
- Phone number formatting — international format normalisation, country code validation
- LinkedIn URL cleanup — extracting usernames from various URL formats
- Name formatting — Title Case normalisation, Unicode handling
- CSV/TSV processing — cross-referencing registration data with raid completions, scoring data
- Certificate generation — batch processing with PII for personalisation
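A few of these clean-up steps can be sketched as small pure functions. All three helpers below are illustrative assumptions rather than the actual SSoC scripts, and each is deliberately simple: the email check is syntactic only, and the Title Case helper will mangle names such as "McDonald".

```javascript
// Cheap syntactic check only; it cannot tell whether the mailbox exists.
function isPlausibleEmail(value) {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value.trim());
}

// Pull the profile handle out of common LinkedIn URL shapes.
// Returns null when the value does not look like a /in/ profile link.
function extractLinkedInHandle(value) {
  const match = value.trim().match(/linkedin\.com\/in\/([^/?#\s]+)/i);
  return match ? match[1] : null;
}

// Collapse whitespace and capitalise the first letter of each word.
function toTitleCase(name) {
  return name
    .normalize("NFC") // compose accented characters consistently
    .trim()
    .split(/\s+/)
    .map((word) => {
      // Iterating the string keeps surrogate pairs intact.
      const [first = "", ...rest] = word;
      return first.toUpperCase() + rest.join("").toLowerCase();
    })
    .join(" ");
}
```

Because these functions take strings and return strings, they run equally well in a local Node.js script or in the browser, and neither setting requires a network call.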
graph LR
subgraph "Local Machine (Air-Gapped from Public)"
TSV[Master TSV<br/>Full PII]
SCRIPTS[Processing Scripts]
VALIDATION[Validation Output]
CERTS[Certificate Data]
end
subgraph "Outputs (PII-Free)"
USERS[users.json<br/>Usernames only]
PROJECTS[projects.json<br/>Public metadata]
SCORES[user-scores.js<br/>Scores only]
end
TSV --> SCRIPTS
SCRIPTS --> VALIDATION
SCRIPTS --> USERS
SCRIPTS --> PROJECTS
SCRIPTS --> SCORES
style TSV fill:#ff6b6b,stroke:#c0392b,color:#fff
style SCRIPTS fill:#f39c12,stroke:#e67e22,color:#fff
style USERS fill:#2ecc71,stroke:#27ae60,color:#fff
style PROJECTS fill:#2ecc71,stroke:#27ae60,color:#fff
style SCORES fill:#2ecc71,stroke:#27ae60,color:#fff
Why Local Beats Cloud for Administrative PII Processing
The conventional approach would be to build an authenticated admin dashboard backed by a database:
Browser → HTTPS → Load Balancer → API Server → Database (with PII)
Every component in that chain is an attack surface. The API server needs authentication — which can be bypassed. The database needs credentials — which can leak. The load balancer needs configuration — which can be misconfigured. The HTTPS termination needs certificates — which can expire. The API responses need to be scoped — which can be over-permissive.
The local approach:
Local filesystem → Local script → Local output
One machine. No network. No authentication to bypass. No credentials to leak. No API to probe. No database to dump.
"Every exposed API becomes part of your attack surface."
When you process PII locally, you have zero network attack surface for that processing. A remote attacker cannot intercept, probe, or exfiltrate data from a process that never touches the network.
5. Minimal Attack Surface: Complexity as the Enemy
The most underappreciated security principle is this: every component you add to a system is a component that can fail, be misconfigured, or be exploited.
The Attack Surface Comparison
| Component | Typical CRUD App | SSoC Architecture |
|---|---|---|
| Web Server | Express/Nginx with routes | Static file server (CDN) |
| Database | MongoDB/PostgreSQL | None |
| Authentication | JWT/Sessions/OAuth | None (public data) |
| API Layer | REST/GraphQL endpoints | None |
| Admin Panel | Authenticated web UI | Local-only (data-dependent) |
| Environment Variables | DB_URL, API_KEY, JWT_SECRET | None in production |
| Background Jobs | Queue workers, cron jobs | Offline scripts (manual) |
| File Upload Processing | Multipart handlers | None |
| Email Service | SMTP/SendGrid integration | Separate, not linked to platform |
| PII Storage | In production database | Local filesystem only |
Each row where SSoC has "None" is an entire category of vulnerabilities that doesn't apply. Not because we've defended against them, but because the architecture doesn't create the conditions for them to exist.
The Dependency Argument
A React application built with Vite still has node_modules with hundreds of packages. Isn't that a supply chain risk?
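Yes, at build time. At runtime, the deployed output is static HTML, CSS, and JavaScript. There's no node_modules on the server. No require() calls that could be hijacked. No dynamic imports from npm. The server-side supply chain risk disappears. What remains is confined to the build step, which runs locally: a dependency compromised there could still inject code into the bundle that visitors' browsers execute, which is why build-time supply chain appears among the residual risks in section 8.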
6. Why I Didn't Build a Typical CRUD App
Most web development tutorials teach this architecture:
// The tutorial approach
const express = require('express');
const mongoose = require('mongoose');
const app = express();
mongoose.connect(process.env.MONGODB_URI); // PII in the database
const UserSchema = new mongoose.Schema({
  name: String,
  email: String, // PII
  phone: String, // PII
  github: String,
  score: Number,
});
const User = mongoose.model('User', UserSchema);
app.get('/api/users', async (req, res) => {
  const users = await User.find({}); // All PII exposed via API
  res.json(users);
});
app.get('/api/users/:id', async (req, res) => {
  const user = await User.findById(req.params.id); // Individual PII exposed
  res.json(user);
});
This is the default architecture that most developers reach for. It works. It's well-documented. It's what bootcamps teach. And for many applications, it's appropriate.
But for a programme handling 50,000 contributor records, this architecture means:
- Every contributor's PII is one misconfigured query away from exposure: forget to add .select('-email -phone') to one route and you've leaked everything
- The database connection string is a single point of compromise: one leaked environment variable and the entire dataset is accessible
- Every API endpoint is a probe target: attackers can enumerate /api/users/1, /api/users/2, etc.
- The admin panel shares infrastructure with the public site: a vulnerability in one can compromise the other
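For teams that do keep PII in a database, one way to make the first failure mode harder to hit is to build responses from an explicit allow-list instead of excluding fields route by route. The sketch below works on plain objects (for example, what a Mongoose query returns after `.lean()`); `PUBLIC_FIELDS` and `toPublicProfile` are hypothetical names, and this is not something SSoC needs, since it has no such API.

```javascript
// Only these fields may ever leave the server. Anything added to the schema
// later stays private until someone deliberately lists it here.
const PUBLIC_FIELDS = ["name", "github", "score"];

function toPublicProfile(user) {
  const out = {};
  for (const field of PUBLIC_FIELDS) {
    if (Object.hasOwn(user, field)) out[field] = user[field];
  }
  return out;
}
```

The difference from an exclusion list is the failure direction: forgetting to update an allow-list hides a field, while forgetting to update an exclusion list leaks one.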
I'm not criticising beginners who build this way. I'm saying that when you're responsible for 50,000 people's personal data, you should ask: does this data need to be in a production database at all?
For SSoC, the answer was no.
Leaderboard scores can be pre-computed. Project metadata is public. Certificate data can be generated offline. The only operations that genuinely need PII — registration processing, onboarding communications, certificate personalisation — don't need a production database. They need a local spreadsheet and some scripts.
"The best way to protect sensitive data is to keep it out of places it never needed to be."
7. FinTech Lessons: A Decade of Building for Regulation
My software engineering philosophy has been shaped by more than a decade building FinTech systems in London. In financial services, security and privacy aren't features you add — they're constraints you design within. GDPR isn't a checklist; it's an engineering mindset.
Several principles from that experience directly influenced SSoC's architecture:
Privacy by Design
GDPR Article 25 requires that data protection is integrated into processing activities and business practices, from the design stage. For SSoC, this meant deciding before writing any code that PII would not enter the public deployment pipeline.
Least Privilege
In FinTech, every system component receives the minimum access required for its function. The scoring engine doesn't need email addresses to calculate PR scores, so it doesn't receive them. The public website doesn't need phone numbers to display leaderboards, so it doesn't have them.
Data Minimisation
GDPR Article 5(1)(c): personal data shall be adequate, relevant, and limited to what is necessary. The scoring engine's input is a list of GitHub usernames. Not a full registration export. Not a database view. A flat array of strings.
Separation of Duties
In financial systems, the person who initiates a transaction shouldn't be the same person who approves it. In SSoC, the system that serves public content is architecturally separate from the system that processes PII. They don't share databases, APIs, servers, or deployment pipelines.
Defence in Depth
No single control is sufficient. SSoC's security doesn't rely on "the database is password-protected" or "the admin panel requires authentication." It relies on multiple layers:
- PII is absent from the public deployment (architectural isolation)
- Administrative data files are gitignored (source control isolation)
- Admin interfaces depend on local data (functional isolation)
- Processing happens locally (network isolation)
- Static architecture eliminates entire vulnerability classes (attack surface reduction)
Secure Defaults
The default state of the SSoC public platform — freshly deployed, no configuration — contains zero PII. You don't have to remember to enable encryption, set up access controls, or configure firewall rules. The default is secure because there's nothing sensitive to protect.
8. Threat Modelling: What Could Still Go Wrong
I promised honesty. Here it is.
What This Architecture Protects Against
- Remote database dumps — there's no database to dump
- API enumeration of PII — there's no API serving PII
- Admin panel compromise — the admin panel has no data in production
- Credential leaks exposing PII — there are no database credentials in production
- Supply chain attacks at runtime — there's no server-side code execution in production
What This Architecture Does NOT Protect Against
No architecture is "hack-proof." Anyone who claims otherwise is selling something. Here are the real residual risks — and what we do about each one.
1. GitHub Account Compromise
If an attacker gains access to the GitHub account that owns the repository, they can push a modified user-scores.js containing malicious JavaScript. Every visitor's browser executes it. No database needed — the static site itself becomes the attack vector.
Mitigation: 2FA on all GitHub accounts, branch protection rules, signed commits, code review for all changes to data files.
2. npm Supply Chain (Build-Time)
The build process runs npm install and vite build. If a dependency is compromised — and this has happened before (event-stream, ua-parser-js, colors) — the built output could contain injected code. It ships as static files, but those static files execute in 50,000 browsers.
Mitigation: Lock file (package-lock.json) pins exact versions, npm audit before builds, minimal dependency footprint, local builds (not CI/CD that could be tampered with).
3. Google Account Compromise
The Master TSV originates from Google Sheets. 2FA bypass, session hijacking, OAuth token theft — if someone gets into that Google account, they have every email address and phone number ever submitted. The platform architecture protects the deployment, not the data source.
Mitigation: Google Advanced Protection, hardware security keys, limited sharing (single owner), regular access review.
4. Local Machine Compromise
Disk encryption helps, but if the local machine is compromised — malware, physical access, remote exploit — the Master TSV is right there on the filesystem. No amount of network separation helps when the attacker is already on the machine where the data lives.
Mitigation: Full-disk encryption, OS-level security updates, endpoint protection, physical security, minimal data retention (delete old exports).
5. Bonus Points Repository Tampering
The scoring engine fetches bonus points from a separate GitHub repository. If that repository is compromised, an attacker could award themselves thousands of points or penalise other contributors. The scoring engine trusts this input completely.
Mitigation: Repository access limited to programme administrators, branch protection, commit history auditing, bonus point totals reviewed during each scoring run.
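Part of that review can be automated so the scoring engine trusts the bonus input a little less. The sketch below is an assumption-laden illustration: `checkBonusPoints` is a hypothetical name, the `{ username: points }` shape is a guess at the bonus file format, and the cap of 500 is an invented policy value.

```javascript
// bonus:      { username: points } (assumed shape of the bonus data)
// registered: array of registered GitHub usernames
// maxPerUser: assumed policy cap on the size of any single adjustment
function checkBonusPoints(bonus, registered, maxPerUser = 500) {
  const allowed = new Set(registered.map((u) => u.toLowerCase()));
  const problems = [];
  for (const [name, points] of Object.entries(bonus)) {
    if (!allowed.has(name.toLowerCase())) {
      problems.push(`${name}: not a registered username`);
    } else if (!Number.isInteger(points)) {
      problems.push(`${name}: points must be an integer`);
    } else if (Math.abs(points) > maxPerUser) {
      // Math.abs also catches large penalties, not just large awards.
      problems.push(`${name}: ${points} exceeds the cap of ${maxPerUser}`);
    }
  }
  return problems;
}
```

A scoring run could abort when the returned list is non-empty, turning a silent tampering attempt into a visible failure that a human has to look at.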
6. DNS/CDN Hijacking
If an attacker compromises DNS records or the GitHub Pages serving infrastructure, they can serve a modified version of the site to all visitors. The static files in the repository are genuine, but what users' browsers actually receive could be different.
Mitigation: DNSSEC where supported, GitHub Pages' own infrastructure security, HTTPS enforcement, Subresource Integrity (SRI) for critical scripts.
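For the SRI part, the integrity value is just a base64-encoded digest of the file with the algorithm name in front. The helper below is a minimal sketch using Node's built-in crypto module; `sriHash` is a hypothetical name, and where it would be wired into the build is left open.

```javascript
const { createHash } = require("node:crypto");

// Returns a value suitable for a <script> tag's integrity attribute,
// e.g. "sha384-<base64 digest>".
function sriHash(content, algorithm = "sha384") {
  const digest = createHash(algorithm).update(content).digest("base64");
  return `${algorithm}-${digest}`;
}
```

One caveat worth stating plainly: SRI only helps when the HTML that carries the integrity attribute is itself trustworthy, so it is most useful for scripts loaded from a different origin than the page (together with the crossorigin attribute). If an attacker controls DNS and serves their own HTML, they can simply omit the attribute.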
7. Social Engineering and Insider Threat
Someone could impersonate a legitimate need for registration data, or a programme administrator with legitimate access could misuse it.
Mitigation: Strict need-to-know policy, minimal number of people with access to raw registration data, audit trails for data exports.
graph TB
subgraph "Eliminated by Architecture (Green)"
A1[Database Dump]
A2[API Enumeration]
A3[Admin Panel Exploit]
A4[Credential Leak]
A5[Runtime Supply Chain]
end
subgraph "Mitigated by Operational Security (Amber)"
B1[GitHub Account Compromise]
B2[npm Supply Chain - Build Time]
B3[Google Account Compromise]
B4[Local Machine Compromise]
B5[DNS/CDN Hijacking]
B6[Bonus Repo Tampering]
end
subgraph "Residual Risk (Red)"
C1[Social Engineering]
C2[Insider Threat]
C3[Physical Access]
end
style A1 fill:#2ecc71,stroke:#27ae60,color:#fff
style A2 fill:#2ecc71,stroke:#27ae60,color:#fff
style A3 fill:#2ecc71,stroke:#27ae60,color:#fff
style A4 fill:#2ecc71,stroke:#27ae60,color:#fff
style A5 fill:#2ecc71,stroke:#27ae60,color:#fff
style B1 fill:#f39c12,stroke:#e67e22,color:#fff
style B2 fill:#f39c12,stroke:#e67e22,color:#fff
style B3 fill:#f39c12,stroke:#e67e22,color:#fff
style B4 fill:#f39c12,stroke:#e67e22,color:#fff
style B5 fill:#f39c12,stroke:#e67e22,color:#fff
style B6 fill:#f39c12,stroke:#e67e22,color:#fff
style C1 fill:#e74c3c,stroke:#c0392b,color:#fff
style C2 fill:#e74c3c,stroke:#c0392b,color:#fff
style C3 fill:#e74c3c,stroke:#c0392b,color:#fff
"Architecture is your first security control."
But it's not your only one.
9. The Data Flow: End to End
Let's trace the complete lifecycle of a contributor's data through SSoC:
Registration
Contributor fills Google Form
→ Response stored in Google Sheet (PII: name, email, phone, GitHub, LinkedIn)
→ Sheet exported as TSV to local machine
→ TSV is gitignored, never committed
Scoring
Local script extracts GitHub usernames from TSV
→ Generates users.json (usernames only, no PII)
→ users.json committed to scoring engine repo
→ Scoring engine queries GitHub API for public PR data
→ Scores computed, static JS files generated
→ Static files deployed to public website
Administration
Ground Control loads TSV from local filesystem
→ Validates names, emails, phone numbers
→ Cross-references with scoring data
→ All processing happens in-browser, locally
→ No data sent to any server
Certification
Certificate generation script reads TSV locally
→ Generates certificate data with names and IDs
→ Certificate verification uses name + ID (minimal PII)
→ Deployed to public site for verification
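At no point in this flow does the full registration dataset, with email addresses and phone numbers, enter the public deployment or any system SSoC exposes to the public internet. It exists only in the Google account behind the form and on the local machine.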
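The certification step publishes only a name and an ID, so verification can run entirely client-side against that data. A minimal sketch follows, assuming the published data is a map from certificate ID to name. That shape, the function name `verifyCertificate`, and the ID format used in the usage note are all hypothetical.

```javascript
// certificates: { [certificateId]: { name } } (assumed shape of the published data)
function verifyCertificate(certificates, id, name) {
  // Object.hasOwn avoids matching inherited keys such as "constructor".
  const record = Object.hasOwn(certificates, id) ? certificates[id] : undefined;
  if (!record) return false;

  // Compare names case-insensitively, ignoring extra whitespace.
  const normalise = (s) =>
    s.normalize("NFC").trim().replace(/\s+/g, " ").toLowerCase();
  return normalise(record.name) === normalise(name);
}
```

For example, with `{ "SSOC26-0001": { name: "Jane Doe" } }` as the data, checking ID "SSOC26-0001" against "jane doe" succeeds, while an unknown ID or a different name fails. No email address or phone number is needed for the check.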
10. Engineering Trade-Offs
This architecture is not free. There are real costs:
What We Give Up
- Real-time leaderboards — scores update when the scoring engine runs, not continuously. There's a delay between a PR being merged and scores updating.
- Self-service profile editing — contributors can't update their own information through the platform because there's no database to update.
- Automated notifications — the platform can't send emails or push notifications because it has no backend.
- Multi-admin access — Ground Control works on one machine with one dataset. There's no shared admin dashboard.
- Scalable operations — everything that touches PII requires manual steps. At 50,000 contributors, this means careful batch processing.
What We Gain
- Zero PII exposure through the public platform — the most critical win
- Minimal operational security burden — no database credentials to rotate, no API keys to manage, no sessions to expire
- Audit simplicity — the entire public deployment can be inspected as static files
- Deployment simplicity — push static files to GitHub Pages, done
- Resilience — no database to crash, no API to overload, no server to go down
The trade-offs are real, but for a seasonal programme with batch-oriented workflows, they're acceptable. A real-time trading platform couldn't work this way. A seasonal open-source programme can.
11. Lessons for Developers
If you're building a platform that handles user data, here's what I'd encourage you to think about:
Before You npm install express mongoose
Ask: does this data need to be in a production database?
- If you need real-time reads and writes: yes, use a database
- If you need user authentication: yes, you need a backend
- If you can pre-compute and serve static results: maybe you don't
The Pre-Computation Question
Many applications that look dynamic are actually batch-processable:
- Leaderboards that update hourly → pre-compute and deploy
- Dashboards with daily metrics → generate static JSON overnight
- Profile pages with relatively stable data → generate at build time
- Documentation sites → static generation (this is already mainstream)
The PII Proximity Principle
The further PII is from your public infrastructure, the smaller your blast radius if something goes wrong.
PII in production database → one query away from exposure
PII in separate internal service → one network hop away
PII on local machine only → air-gapped from public internet
PII never collected → zero risk (but rarely practical)
Move down this list wherever you can.
Complexity Budget
Every component has a security cost. Budget for it:
Static file server: Low complexity, low risk
API with public endpoints: Medium complexity, medium risk
Database with PII: High complexity, high risk
Admin panel on same infra: Multiplied risk (shared blast radius)
If you can achieve your goals with fewer components, you should.
12. The /security Page: Making Transparency a Feature
Writing this blog post made something clear: the security architecture shouldn't just be documented externally — it should be visible to every contributor directly on the platform.
So we built a dedicated security page. It lives at /security and is linked from the Tools page.
What It Covers
The page is split into two halves:
"How We Protect Your Data" — the platform's side:
- Three overview cards: No Production Database, No Backend Server, PII Never Deployed
- A data collection table showing exactly what we collect, why, and whether it's ever public — with green "Never public" badges for email, phone, and Discord
- Expandable accordion sections explaining the static architecture, PII separation, local-first administration, and exactly what the public website can and cannot see
- An attack vector comparison table showing 7 common web app vulnerabilities and why they don't apply to our architecture
"Protect Yourself" — the participant's side:
- GitHub account security — enabling 2FA, reviewing OAuth apps, hiding your email from commits
- Google account security — 2FA, Security Checkup, revoking unused app permissions
- Discord security — 2FA, Nitro scam awareness, QR code warnings, DM caution
- Passwords & authentication — password managers, haveibeenpwned.com, authenticator apps over SMS
- Recognising phishing — a clear statement that SSoC will never ask for your password, and how to verify suspicious messages
- Protecting your code: never committing secrets, using git diff before pushing, rotating leaked credentials immediately
- Device security: full-disk encryption, OS updates, public Wi-Fi caution
Why a Security Page Matters
Most open-source programmes don't have a security page. Most probably don't need one. But when you're handling 50,000 contributors' personal data and the community is asking questions about data safety, transparency isn't optional — it's infrastructure.
The page ends with an honest footer: "No system is 100% secure — but architecture determines how much risk exists."
That's the same message as this blog post, distilled into a single sentence on a page that every contributor can find.
Key Engineering Takeaways
Architecture is your first security control. Encryption, authentication, and firewalls matter — but they defend data that's already in the danger zone. Architecture determines whether it gets there at all.
Static-first isn't just a performance strategy. It eliminates entire classes of vulnerabilities: injection, authentication bypass, session hijacking, API enumeration. The attack surface you don't create is the one that can never be exploited.
PII separation is non-negotiable at scale. When you're responsible for 50,000 people's personal data, the question isn't "how do we secure the database?" It's "does this data need to be in a database at all?"
Local processing is underrated. Not everything needs to be a cloud service. Administrative tasks that touch PII can often be performed locally, eliminating network attack surface entirely.
Honest threat modelling beats false confidence. No system is "hack-proof." The goal is to reduce attack surface, minimise blast radius, and be transparent about residual risks.
Trade-offs are real and worth making. We give up real-time updates and self-service features in exchange for a dramatically reduced security burden. For a seasonal programme, that trade-off is correct.
Conclusion
The question contributors ask — "How do you protect our data?" — has a simple answer: we keep it away from the public internet.
Not behind a firewall. Not encrypted in a database. Not protected by authentication. Away. On a local machine, processed by local scripts, never deployed to any public server.
This isn't the right architecture for every application. Real-time collaborative tools need databases. E-commerce platforms need payment processing. Social networks need user-generated content storage. But for a seasonal open-source programme that computes leaderboards, generates certificates, and displays project metadata — a static-first, PII-separated, local-admin architecture eliminates more risk than any amount of runtime security controls could provide.
"The strongest security often comes from reducing complexity and limiting what can be reached in the first place."
The next time you're designing a system that handles user data, before you spin up a database and wire up API endpoints, ask yourself: does this data actually need to be here?
You might find that the safest architecture is the one that keeps sensitive data out of the blast zone entirely.
Praveen Kumar Purushothaman is VP & Director of Engineering at Social Summer of Code (SSoC). He has spent over a decade building FinTech systems in London, where GDPR, privacy-by-design, and minimising attack surfaces are fundamental engineering principles. SSoC 2026 serves 50,000+ contributors across hundreds of open-source projects.


