{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4148d95d",
   "metadata": {
    "id": "4148d95d",
    "language": "markdown"
   },
   "source": [
    "# Pattern Portal Real-Data Lab\n",
    "\n",
    "A compact notebook companion for the Real-Data Cases page. This version is JupyterLite-compatible: it avoids packages that require native wheels and uses the small CSV files shipped with the site instead of remote dataset downloads."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5dcb580c",
   "metadata": {
    "id": "5dcb580c",
    "language": "markdown"
   },
   "source": [
    "## Install Dependencies\n",
    "Run this cell first in JupyterLite. `yfinance` is intentionally excluded because its current dependency chain includes native/browser-incompatible wheels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3bc287ac",
   "metadata": {
    "id": "3bc287ac",
    "language": "python"
   },
   "outputs": [],
   "source": [
    "%pip install pandas numpy scikit-learn matplotlib -q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa60acac",
   "metadata": {
    "id": "fa60acac",
    "language": "python"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pathlib import Path\n",
    "from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestClassifier, RandomForestRegressor\n",
    "from sklearn.metrics import (\n",
    "    average_precision_score,\n",
    "    classification_report,\n",
    "    confusion_matrix,\n",
    "    mean_absolute_error,\n",
    "    root_mean_squared_error,\n",
    "    r2_score,\n",
    ")\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "RANDOM_STATE = 42\n",
    "\n",
    "def read_portal_csv(path):\n",
    "    \"\"\"Read a Pattern Portal CSV in JupyterLite or local Jupyter.\"\"\"\n",
    "    site_path = path if path.startswith('/') else f'/{path}'\n",
    "    try:\n",
    "        open_url = __import__('pyodide.http', fromlist=['open_url']).open_url\n",
    "        return pd.read_csv(open_url(site_path))\n",
    "    except Exception:\n",
    "        relative = site_path.lstrip('/')\n",
    "        candidates = [\n",
    "            Path(relative),\n",
    "            Path('..') / relative,\n",
    "            Path.cwd() / relative,\n",
    "            Path.cwd().parent / relative,\n",
    "        ]\n",
    "        for candidate in candidates:\n",
    "            if candidate.exists():\n",
    "                return pd.read_csv(candidate)\n",
    "        raise FileNotFoundError(f'Could not find {site_path} in JupyterLite or local paths.')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48af1cc8",
   "metadata": {
    "id": "48af1cc8",
    "language": "markdown"
   },
   "source": [
    "## Case 1: Housing Regression\n",
    "Pipeline: load the local housing sample, split, train a baseline, report MAE/RMSE/R2, and inspect the largest errors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a137717a",
   "metadata": {
    "id": "a137717a",
    "language": "python"
   },
   "outputs": [],
   "source": [
    "housing = read_portal_csv('/cases/datasets/housing_sample.csv')\n",
    "X = housing.drop(columns=['median_house_value'])\n",
    "y = housing['median_house_value']\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.3, random_state=RANDOM_STATE\n",
    ")\n",
    "\n",
    "model = HistGradientBoostingRegressor(random_state=RANDOM_STATE, max_iter=80)\n",
    "model.fit(X_train, y_train)\n",
    "pred = model.predict(X_test)\n",
    "\n",
    "print('MAE:', round(mean_absolute_error(y_test, pred), 4))\n",
    "print('RMSE:', round(root_mean_squared_error(y_test, pred), 4))\n",
    "print('R2:', round(r2_score(y_test, pred), 4))\n",
    "pd.DataFrame({'actual': y_test, 'pred': pred, 'abs_error': np.abs(y_test - pred)}).sort_values('abs_error', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1962e7d9",
   "metadata": {
    "id": "1962e7d9",
    "language": "markdown"
   },
   "source": [
    "## Case 2: Fraud Classification\n",
    "Use the local mini fraud sample to practice the workflow: split with class balance, fit a baseline, and inspect precision/recall behavior."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "966127d4",
   "metadata": {
    "id": "966127d4",
    "language": "python"
   },
   "outputs": [],
   "source": [
    "fraud = read_portal_csv('/cases/datasets/fraud_sample.csv')\n",
    "X = fraud.drop(columns=['is_fraud'])\n",
    "y = fraud['is_fraud']\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.4, stratify=y, random_state=RANDOM_STATE\n",
    ")\n",
    "\n",
    "model = RandomForestClassifier(\n",
    "    n_estimators=80, class_weight='balanced', random_state=RANDOM_STATE\n",
    ")\n",
    "model.fit(X_train, y_train)\n",
    "proba = model.predict_proba(X_test)[:, 1]\n",
    "pred = (proba >= 0.35).astype(int)\n",
    "\n",
    "print(confusion_matrix(y_test, pred))\n",
    "print(classification_report(y_test, pred, digits=3, zero_division=0))\n",
    "print('PR-AUC:', round(average_precision_score(y_test, proba), 4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7d309a5",
   "metadata": {
    "id": "f7d309a5",
    "language": "markdown"
   },
   "source": [
    "## Case 3: Time-Series Forecast\n",
    "Use local energy-demand data: sort by time, create lag features from the past only, split by time, and compare against a naive baseline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c285a30a",
   "metadata": {
    "id": "c285a30a",
    "language": "python"
   },
   "outputs": [],
   "source": [
    "daily = read_portal_csv('/cases/datasets/energy_demand_sample.csv')\n",
    "daily['date'] = pd.to_datetime(daily['date'])\n",
    "daily = daily.sort_values('date').set_index('date')\n",
    "\n",
    "for lag in [1, 2, 3]:\n",
    "    daily[f'lag_{lag}'] = daily['demand_kwh'].shift(lag)\n",
    "daily['rolling_3'] = daily['demand_kwh'].shift(1).rolling(3).mean()\n",
    "daily = daily.dropna()\n",
    "\n",
    "split = int(len(daily) * 0.7)\n",
    "train, test = daily.iloc[:split], daily.iloc[split:]\n",
    "features = ['temp_c', 'is_weekend', 'lag_1', 'lag_2', 'lag_3', 'rolling_3']\n",
    "\n",
    "model = RandomForestRegressor(n_estimators=80, random_state=RANDOM_STATE)\n",
    "model.fit(train[features], train['demand_kwh'])\n",
    "pred = model.predict(test[features])\n",
    "naive = test['lag_1']\n",
    "\n",
    "print('Model MAE:', round(mean_absolute_error(test['demand_kwh'], pred), 4))\n",
    "print('Naive MAE:', round(mean_absolute_error(test['demand_kwh'], naive), 4))\n",
    "pd.DataFrame({'actual': test['demand_kwh'], 'model': pred, 'naive': naive})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cad79b46",
   "metadata": {
    "id": "cad79b46",
    "language": "markdown"
   },
   "source": [
    "## Case 4: Market Backtest\n",
    "JupyterLite cannot use `yfinance` reliably because live market clients often depend on native networking wheels. This cell uses the local OHLCV sample instead. Market indicators are heuristic until validated with point-in-time data, costs, slippage, and walk-forward testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed0bd914",
   "metadata": {
    "id": "ed0bd914",
    "language": "python"
   },
   "outputs": [],
   "source": [
    "px = read_portal_csv('/cases/datasets/market_ohlcv_sample.csv')\n",
    "px['date'] = pd.to_datetime(px['date'])\n",
    "px = px.sort_values('date').set_index('date')\n",
    "close = px['close']\n",
    "returns = close.pct_change().fillna(0)\n",
    "\n",
    "fast = close.rolling(3).mean()\n",
    "slow = close.rolling(5).mean()\n",
    "signal = (fast > slow).astype(int).shift(1).fillna(0)\n",
    "\n",
    "turnover = signal.diff().abs().fillna(signal.abs())\n",
    "cost = turnover * 0.0005\n",
    "strategy = signal * returns - cost\n",
    "equity = (1 + strategy).cumprod()\n",
    "drawdown = equity / equity.cummax() - 1\n",
    "sharpe = np.nan if strategy.std() == 0 else strategy.mean() / strategy.std() * np.sqrt(252)\n",
    "\n",
    "print('Sharpe:', round(float(sharpe), 4) if not np.isnan(sharpe) else 'n/a')\n",
    "print('Max drawdown:', round(float(drawdown.min()), 4))\n",
    "print('Annual turnover:', round(float(turnover.mean() * 252), 4))\n",
    "pd.DataFrame({'close': close, 'signal': signal, 'strategy': strategy, 'equity': equity})"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
