Data & Experimentation

Use this page to describe the datasets, their provenance, data quality considerations, and how exploratory analysis is conducted.

Data Inventory

Dataset             | Source                      | Location                         | Notes
Example Raw Dataset | Internal CRM export         | data/raw/example.csv             | Daily extract; contains PII (mask or anonymize)
Processed Features  | Derived via clean_dataframe | data/processed/features.parquet  | Updated after each ETL run
External Reference  | Public API                  | data/external/reference.json     | Refresh monthly

Keep this table current as new datasets are added or existing ones change.

Generation & Extraction

  • Acquisition scripts: document where code lives (e.g., scripts/, external ETL).
  • Refresh cadence: daily/weekly/monthly? Automate via cron, Airflow, etc.
  • Access requirements: credentials, VPN, API tokens. Store secrets securely (never commit).

Placeholder text

Describe how raw data is pulled. Include sample queries, API endpoints,
and expected schemas. Note any governance requirements or limits.
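As a starting point, extraction code can separate request construction from schema validation so failures are easy to diagnose. The sketch below is illustrative only: the endpoint, token handling, and `EXPECTED_FIELDS` are placeholders to replace with your actual source's schema.

```python
import json
from urllib.request import Request

# Hypothetical schema for the raw extract; replace with the real field list.
EXPECTED_FIELDS = {"id", "name", "updated_at"}

def build_extract_request(base_url: str, token: str) -> Request:
    """Build an authenticated GET request for the raw extract."""
    return Request(base_url, headers={"Authorization": f"Bearer {token}"})

def validate_schema(records: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the extract matches."""
    problems = []
    for i, rec in enumerate(records):
        missing = EXPECTED_FIELDS - rec.keys()
        if missing:
            problems.append(f"record {i} missing fields: {sorted(missing)}")
    return problems
```

Validating immediately after extraction means a schema change upstream is caught before it propagates into processed datasets.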

Data Quality & Profiling

  • Outline checks for missing values, duplicates, and schema drift.
  • Record profiling summaries (e.g., pandas-profiling, Great Expectations).
  • Log anomalies and remediation steps.

Exploratory Analysis

  • Notebook convention (e.g., 01.0-abc-eda.ipynb).
  • Where to store outputs (plots, tables) during exploration.
  • How insights feed back into feature engineering or modelling.
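To keep exploratory outputs organized, a shared save helper ensures every notebook writes plots to the same place with consistent names. The `reports/figures` directory below is an assumed convention; adjust to wherever this project stores exploration artifacts.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend: save figures without a display
import matplotlib.pyplot as plt

# Assumed output location for exploratory plots.
REPORTS_DIR = Path("reports/figures")

def save_figure(fig, name: str) -> Path:
    """Save an exploratory plot under the shared reports directory."""
    REPORTS_DIR.mkdir(parents=True, exist_ok=True)
    out = REPORTS_DIR / f"{name}.png"
    fig.savefig(out, dpi=150, bbox_inches="tight")
    return out
```

Naming saved figures after the notebook that produced them (e.g. `01.0-abc-eda-histogram.png`) makes it easy to trace an insight back to its analysis.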

Transformation Pipeline

  • Reference helper functions/modules (load_csv, clean_dataframe, select_columns).
  • Describe feature engineering steps, transformations, encodings, scaling.
  • Maintain versioned transformation logic (config files, code branches).
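A pipeline along these lines could chain the referenced helpers into one versionable step. The bodies below are simplified sketches of what `clean_dataframe` and `select_columns` might do; the project's actual implementations may differ.

```python
import pandas as pd

def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning: normalize column names, drop fully-empty rows."""
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    return df.dropna(how="all")

def select_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Keep only the requested feature columns."""
    return df[columns]

def build_features(raw: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Chain the helpers into a single transformation entry point."""
    return select_columns(clean_dataframe(raw), columns)
```

Keeping the column list in a config file (rather than hard-coded) lets the transformation logic be versioned alongside the code that runs it.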

Data Governance

  • PII handling (masking, hashing, removal).
  • Compliance requirements (GDPR, CCPA, internal policies).
  • Approval process for sharing data externally.
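The masking and hashing bullets can be made concrete with small helpers like these. This is one possible scheme (salted SHA-256 for pseudonymization, partial masking for emails), not a prescribed standard; check your compliance requirements before adopting it.

```python
import hashlib

def hash_pii(value: str, salt: str) -> str:
    """One-way pseudonymization: salted SHA-256 hex digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def mask_email(email: str) -> str:
    """Mask the local part of an email, keeping the domain for analysis."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"
```

The salt should be stored as a secret (never committed) so hashed identifiers cannot be reversed by brute force against known values.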


© 2025 UC San Diego - Data Science Capstones