Reference implementation built on a non-confidential 19-account practice ledger so the full workflow can stay public, runnable, and auditable. Full source on GitHub.
Architecture overview
The system is designed for transparency over scale. Every component is legible and traceable, so a reviewer can follow a single journal entry from ingestion to exception report without hitting a black box; a short trace sketch follows the component list.
- Data layer. SQLite with a normalised enterprise schema, chosen over a warehouse so the database can be opened, inspected, and queried without infrastructure.
- ML engine. scikit-learn with Z-score anomaly detection. I chose a statistical model instead of deep learning because the threshold and scoring logic are readable in the repo.
- Automation layer. Python task scheduler with declared dependency management. 17 tasks, explicit sequencing.
- Interface. Jupyter notebooks for the interactive walkthrough. No separate frontend in the core prototype.
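To make the traceability claim concrete, here is a minimal sketch of following one flagged entry through the data layer with nothing but the standard library. The database file name matches the configuration shown later; the transaction id is a placeholder.

import sqlite3

# Trace a single journal entry from the transactions table to its
# exception record; table and column names follow the schema below.
conn = sqlite3.connect("financial_data.db")
rows = conn.execute(
    """
    SELECT t.id, t.account_id, t.amount, a.anomaly_score, a.reviewed
    FROM transactions t
    JOIN anomalies a ON a.transaction_id = t.id
    WHERE t.id = ?
    """,
    (42,),  # placeholder transaction id
).fetchall()
print(rows)
conn.close()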
Key components
1. Data processing engine
- Batch transaction processing against the practice ledger.
- Automated data validation on ingestion (sketched after this list).
- Exception logging with account-level granularity.
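A minimal sketch of the validation step, assuming each batch arrives as a pandas DataFrame with the column names used in the schema below. The three checks and the hard-coded account list are illustrative, not the full rule set:

import pandas as pd

KNOWN_ACCOUNTS = {"1000", "1100", "2000"}  # illustrative; really loaded from the ledger

def validate_batch(batch: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that fail basic ingestion checks, tagged with a reason."""
    checks = {
        "missing_amount": batch["amount"].isna(),
        "unknown_account": ~batch["account_id"].isin(KNOWN_ACCOUNTS),
        "bad_date": pd.to_datetime(batch["transaction_date"], errors="coerce").isna(),
    }
    failures = []
    for reason, mask in checks.items():
        failed = batch[mask].copy()
        failed["reason"] = reason  # account-level detail survives into the exception log
        failures.append(failed)
    return pd.concat(failures, ignore_index=True)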
2. Machine learning module
Z-score anomaly detection at a 2.5σ threshold. The threshold is conservative on a clean practice ledger because it flags roughly the top one percent of movements without generating noise. On a real ledger, the threshold would need calibration per account type.
import numpy as np
from sklearn.preprocessing import StandardScaler

class ZScoreAnomalyDetector:
    """Flag values whose standardised score exceeds the sigma threshold."""

    def __init__(self, threshold=2.5):
        self.threshold = threshold
        self.scaler = StandardScaler()
        self.is_fitted = False

    def fit(self, X):
        # X: array of shape (n_transactions, n_features)
        self.scaler.fit(X)
        self.is_fitted = True

    def predict(self, X):
        if not self.is_fitted:
            raise ValueError("Model must be fitted before prediction")
        # StandardScaler's transform is exactly the per-feature z-score
        z_scores = self.scaler.transform(X)
        is_anomaly = np.any(np.abs(z_scores) > self.threshold, axis=1)
        return is_anomaly, z_scores
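Usage reduces to fit on historical amounts and predict on the new batch. The figures here are made up for illustration, not ledger data:

import numpy as np

detector = ZScoreAnomalyDetector(threshold=2.5)
history = np.array([[120.0], [135.5], [118.2], [127.9], [131.4]])
detector.fit(history)

flags, z_scores = detector.predict(np.array([[125.0], [980.0]]))
print(flags)  # [False  True]: the second amount sits far outside 2.5 sigma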
3. Workflow automation
17 close tasks modelled with declared dependencies. A delay in any task surfaces its downstream impact immediately rather than becoming visible only at the end of the cycle.
import json
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"

class Task:
    def __init__(self, task_id, name, dependencies):
        self.id = task_id
        self.name = name
        self.dependencies = dependencies
        self.status = TaskStatus.PENDING

class WorkflowManager:
    def __init__(self, config_path):
        with open(config_path, "r") as f:
            self.config = json.load(f)
        self.tasks = {}
        self.initialize_tasks()

    def initialize_tasks(self):
        # One Task per entry in the workflow definition below
        for spec in self.config["tasks"]:
            self.tasks[spec["id"]] = Task(
                spec["id"], spec["name"], spec["dependencies"]
            )

    def get_ready_tasks(self):
        # Ready = pending with every declared dependency completed
        return [
            t for t in self.tasks.values()
            if t.status == TaskStatus.PENDING
            and all(
                self.tasks[dep].status == TaskStatus.COMPLETED
                for dep in t.dependencies
            )
        ]
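A minimal driver loop on top of that, assuming a config file in the format shown under Workflow definition (the file name is illustrative). Task execution itself is elided:

manager = WorkflowManager("close_config.json")

# Run whatever is unblocked until every task has finished.
while any(t.status != TaskStatus.COMPLETED for t in manager.tasks.values()):
    ready = manager.get_ready_tasks()
    if not ready:
        break  # nothing runnable: a task failed, or the dependency graph has a cycle
    for task in ready:
        task.status = TaskStatus.IN_PROGRESS
        # ... execute the close task here ...
        task.status = TaskStatus.COMPLETED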
Database schema
CREATE TABLE transactions (
id INTEGER PRIMARY KEY,
account_id TEXT NOT NULL,
amount DECIMAL(10,2),
transaction_date DATE,
category TEXT,
risk_score DECIMAL(3,2)
);
CREATE TABLE close_tasks (
id INTEGER PRIMARY KEY,
task_name TEXT NOT NULL,
status TEXT DEFAULT 'pending',
dependencies TEXT, -- JSON array of task IDs
started_at DATETIME,
completed_at DATETIME
);
CREATE TABLE anomalies (
id INTEGER PRIMARY KEY,
transaction_id INTEGER REFERENCES transactions(id),
anomaly_score DECIMAL(5,4),
flagged_at DATETIME,
reviewed BOOLEAN DEFAULT FALSE
);
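The write path from the detector into this schema is a single insert; record_anomaly here is a hypothetical helper for illustration, not code from the repo:

import sqlite3
from datetime import datetime

def record_anomaly(conn: sqlite3.Connection, transaction_id: int, z_score: float) -> None:
    # Store the absolute z-score and leave the row unreviewed;
    # a human closes it out during exception review.
    conn.execute(
        "INSERT INTO anomalies (transaction_id, anomaly_score, flagged_at) "
        "VALUES (?, ?, ?)",
        (transaction_id, abs(z_score), datetime.now().isoformat()),
    )
    conn.commit()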
Performance characteristics
| Metric | Value |
|---|---|
| Monthly transactions | 1,000+ |
| Anomaly classification time | Under 2 seconds |
| Z-score threshold | 2.5σ |
| Model confidence signal | 87% (a confidence-style score, not a calibrated probability) |
These are prototype figures on a clean practice ledger. Real-world performance depends on data quality, volume, and the variance characteristics of the specific ledger.
Deployment requirements
Prerequisites
- Python 3.8 or later
- 4 GB RAM minimum
- 50 GB storage
Setup
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python init_database.py
Dependencies
scikit-learn==1.3.0
pandas==2.0.3
numpy==1.24.3
python-dotenv==1.0.0
Configuration
DATABASE_URL=sqlite:///financial_data.db
ML_MODEL_PATH=./models/anomaly_detector.pkl
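python-dotenv is already in the dependency list, so loading these values takes a few lines; a minimal sketch:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory
database_url = os.getenv("DATABASE_URL", "sqlite:///financial_data.db")
model_path = os.getenv("ML_MODEL_PATH", "./models/anomaly_detector.pkl")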
Workflow definition
{
"tasks": [
{
"id": "data_collection",
"name": "Data Collection",
"dependencies": [],
"automation": true,
"duration_estimate": 30
},
{
"id": "validation",
"name": "Data Validation",
"dependencies": ["data_collection"],
"automation": true,
"duration_estimate": 45
}
]
}
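Declared dependencies only surface delays correctly if the graph is acyclic; a cycle would leave get_ready_tasks returning an empty list forever. A small check worth running when the config loads (a sketch of Kahn's algorithm; assert_acyclic is a hypothetical helper):

def assert_acyclic(tasks):
    # tasks: the "tasks" list from the workflow definition above
    deps = {t["id"]: set(t["dependencies"]) for t in tasks}
    resolved = set()
    while deps:
        ready = [tid for tid, d in deps.items() if d <= resolved]
        if not ready:
            raise ValueError(f"dependency cycle among: {sorted(deps)}")
        for tid in ready:
            resolved.add(tid)
            del deps[tid]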
Security considerations
These target a production deployment; the public prototype runs entirely on the non-confidential practice ledger.
- Role-based access control on the API layer.
- Data encryption at rest and in transit.
- Audit logging for all transactions and anomaly reviews.
Testing
# Unit tests
python -m pytest tests/ -v
# Integration tests
python -m pytest tests/integration/ -v
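A representative unit test for the detector, assuming the class shown earlier lives in a module named anomaly.py (the module path is illustrative):

import numpy as np
from anomaly import ZScoreAnomalyDetector

def test_flags_large_deviation():
    detector = ZScoreAnomalyDetector(threshold=2.5)
    detector.fit(np.array([[100.0], [102.0], [98.0], [101.0], [99.0]]))
    flags, _ = detector.predict(np.array([[100.5], [500.0]]))
    assert flags.tolist() == [False, True]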
Full source code and additional documentation are available on GitHub.