50 AI Prompts for Python Data Science and Analysis
50 practical AI prompts for Python data science. Covers pandas, numpy, matplotlib, seaborn, scikit-learn, data cleaning, feature engineering, machine learning pipelines, model evaluation, and exploratory data analysis.
How to Use These Data Science Prompts
Python is the primary language of data science. pandas, numpy, matplotlib, and scikit-learn form the core stack used in most data analysis and machine learning projects. These 50 prompts accelerate every phase of the data science workflow: data loading and cleaning, exploratory analysis, visualisation, feature engineering, model training, and evaluation.
For best results, always include a sample of your DataFrame when prompting for pandas transformations. Paste the output of df.head() and df.dtypes and describe the goal you are trying to achieve. Data shape and column types change almost everything about how a transformation should be written.
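One way to gather that context is sketched below. The helper name `describe_for_prompt` and the sample `orders` data are hypothetical, chosen only for illustration; the idea is simply to bundle the goal, the shape, `df.head()`, and `df.dtypes` into one block of text you can paste ahead of a prompt.

```python
import pandas as pd


def describe_for_prompt(df: pd.DataFrame, goal: str, n_rows: int = 5) -> str:
    """Build a plain-text context block to paste ahead of a prompt."""
    parts = [
        f"Goal: {goal}",
        f"Shape: {df.shape[0]} rows x {df.shape[1]} columns",
        "df.head():",
        df.head(n_rows).to_string(),
        "df.dtypes:",
        df.dtypes.to_string(),
    ]
    return "\n".join(parts)


# Hypothetical sample data, for illustration only.
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06"],
    "amount": [19.99, 5.50, 42.00],
})

context = describe_for_prompt(orders, "Compute daily revenue")
print(context)
```

If the data is sensitive, the same helper can be run on a small anonymised sample instead of the real DataFrame.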
Pandas, NumPy, and Data Cleaning (Prompts 1-18)
- Prompt 1: Write a pandas data cleaning pipeline for a CSV of customer orders. Handle missing values: fill numeric columns with the median, fill categorical columns with the mode, and drop rows where the order_id is null. Convert the order_date column from string to datetime. Remove duplicate rows based on order_id. Log the number of rows dropped at each step. Return the cleaned DataFrame.
- Prompt 2: Create a pandas function that standardises inconsistent string values in a product_category column. Build a mapping dictionary of known variations to canonical values (e.g., electronics, Electronics, ELECTRONICS all map to Electronics). Apply the mapping with str.strip() and str.title() preprocessing. Flag unmapped values in a separate column for manual review.
- Prompt 3: Write a pandas merge and aggregation pipeline. Merge an orders DataFrame with a products DataFrame on product_id. Compute monthly revenue per product category. Pivot the result so each category is a column and each month is a row. Fill missing month/category combinations with 0. Show the complete pipeline.
- Prompt 4: Create a pandas function that detects and reports data quality issues. Report: percentage of null values per column, columns with more than 50% nulls, numeric columns with negative values where only positive is expected, date columns with out-of-range values, and categorical columns with cardinality above a threshold.
- Prompt 5: Write a pandas time series resampling function. Given a DataFrame of daily sales with a date index, resample to weekly totals, compute the 4-week rolling average, compute the year-over-year percentage change for each week, and identify weeks where sales dropped more than 20% from the rolling average.
- Prompt 6: Create a numpy function that normalises a 2D array of features. Implement min-max normalisation so each column scales to 0-1. Implement z-score standardisation so each column has mean 0 and standard deviation 1. Return both versions and the scaler parameters needed to inverse-transform predictions.
- Prompt 7: Write a pandas function that encodes categorical variables for machine learning. Apply one-hot encoding to low-cardinality columns (fewer than 10 unique values). Apply target encoding to high-cardinality columns using the mean of the target variable per category (computed on the training set only). Return the transformed DataFrame and the encoders.
- Prompt 8: Create a pandas function that handles outliers. Detect outliers using the IQR method for numeric columns. For each column with outliers, offer three strategies: cap (clip to the 1st and 99th percentile), drop (remove the row), or transform (apply log transformation). Show how to apply each strategy.
- Prompt 9: Write a pandas groupby aggregation to compute customer lifetime metrics. For each customer in an orders DataFrame, compute: total spend, number of orders, average order value, days since first order, days since last order, and purchase frequency (orders per month). Use groupby and agg.
- Prompt 10: Create a numpy matrix operation for a recommendation system. Given a user-item rating matrix with NaN for unrated items, compute cosine similarity between all users, find the top 5 most similar users for a target user, and compute weighted average ratings from similar users to predict missing ratings.
- Prompt 11: Write a pandas function to split a dataset into train, validation, and test sets while preserving the class distribution. Accept ratios for each split, stratify on the target column, reset the index on each split, and return three DataFrames with a summary of class distribution in each.
- Prompt 12: Create a pandas function for feature selection using correlation. Compute the correlation matrix, identify pairs of features with correlation above 0.9, build a set of features to drop (keeping one from each highly correlated pair), and return the reduced DataFrame with a report of dropped features and their correlates.
- Prompt 13: Write a pandas pipeline using the pipe method. Chain five transformation steps: drop null rows, encode categoricals, normalise numerics, add interaction features, and select the final feature set. Show how pipe makes the pipeline readable and testable by making each step a named function.
- Prompt 14: Create a pandas function that computes cohort retention analysis. Given a transactions DataFrame with user_id and transaction_date, compute the first purchase month for each user, assign cohort labels, compute the percentage of each cohort still purchasing in subsequent months, and return a pivot table suitable for a heatmap.
- Prompt 15: Write a numpy function that implements gradient descent for linear regression from scratch. Initialise random weights, compute the mean squared error loss, compute gradients with respect to weights and bias, update weights using the learning rate, and iterate for a specified number of epochs. Return the learned weights and the loss history.
- Prompt 16: Create a pandas function that detects and handles imbalanced classes in a classification dataset. Report the class distribution. Apply SMOTE oversampling to the minority class using imbalanced-learn. Apply random undersampling as an alternative. Return both balanced datasets and a comparison of class distributions.
- Prompt 17: Write a pandas rolling feature engineering function for a time series. Compute rolling mean and standard deviation for windows of 7, 14, and 28 days. Compute lag features for 1, 7, and 14 day lags. Compute the exponentially weighted moving average with a span of 7. Drop rows with NaN from insufficient window history.
- Prompt 18: Create a pandas function that validates a DataFrame against a schema. Define expected column names, data types, value ranges for numeric columns, allowed values for categorical columns, and not-null constraints. Return a validation report listing all violations and a boolean pass/fail result.
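To give a sense of what a good response to one of these prompts might look like, here is a minimal sketch of the cleaning pipeline described in Prompt 1. The column names `amount` and `channel` and the sample rows are assumptions made for illustration. The sketch also drops null and duplicate order IDs before filling, so that discarded rows do not influence the medians and modes; that ordering is a design choice, not something the prompt requires.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("clean_orders")


def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Clean an orders DataFrame following the steps listed in Prompt 1."""
    out = df.copy()

    # Drop rows where order_id is null and log how many were removed.
    before = len(out)
    out = out.dropna(subset=["order_id"])
    log.info("Dropped %d rows with null order_id", before - len(out))

    # Remove duplicate order_ids, keeping the first occurrence.
    before = len(out)
    out = out.drop_duplicates(subset="order_id", keep="first")
    log.info("Dropped %d duplicate rows", before - len(out))

    # Convert order_date to datetime; unparseable values become NaT.
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")

    # Fill numeric columns with the median and other columns with the mode.
    for col in out.columns:
        if col in ("order_id", "order_date"):
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])

    return out.reset_index(drop=True)


# Hypothetical sample data, for illustration only.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, None, 5],
    "order_date": ["2024-01-01", "2024-01-02", "2024-01-02",
                   "2024-01-03", "2024-01-04"],
    "amount": [10.0, None, 30.0, 40.0, 50.0],
    "channel": ["web", "web", None, "store", None],
})

clean = clean_orders(raw)
print(clean)
```

When reviewing AI-generated pipelines like this one, check the order of the steps: filling before dropping, or the reverse, can change the fill values.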
Visualisation, Machine Learning, and Model Evaluation (Prompts 19-50)
- Prompt 19: Write a matplotlib and seaborn exploratory data analysis (EDA) dashboard for a DataFrame. Create a figure with subplots showing: distribution of each numeric column as a histogram with a KDE curve, a correlation heatmap, box plots of numeric columns grouped by the target variable, and a count plot of the target class distribution.
- Prompt 20: Create a seaborn visualisation for time series data. Plot daily revenue as a line chart, overlay a 30-day rolling average, shade the area between the minimum and maximum for each month, add vertical lines for known event dates, and add a confidence band using the standard deviation.
- Prompt 21: Write a scikit-learn machine learning pipeline for binary classification. Preprocess the data with a ColumnTransformer (StandardScaler for numeric, OneHotEncoder for categorical), train a RandomForestClassifier, evaluate with 5-fold stratified cross-validation, and report accuracy, precision, recall, F1, and AUC-ROC for each fold and the mean.
- Prompt 22: Create a scikit-learn hyperparameter tuning setup using GridSearchCV. Define a parameter grid for a GradientBoostingClassifier, run grid search with 5-fold CV, print the best parameters and best cross-validation score, refit on the full training set with best parameters, and evaluate on the held-out test set.
- Prompt 23: Write a model comparison script that trains Logistic Regression, Random Forest, and XGBoost on the same dataset and compares them by AUC-ROC, precision-recall AUC, training time, and inference time. Plot ROC curves for all three models on the same axes.
- Prompt 24: Create a SHAP value analysis for a trained XGBoost model. Compute SHAP values for the test set, plot the global feature importance bar chart, plot a SHAP beeswarm plot, and show a force plot for a single prediction explaining why the model predicted that class.
- Prompt 25: Write a pandas and matplotlib script that produces a multi-page PDF report. Use matplotlib's PdfPages to combine an EDA section, a model performance section with a confusion matrix and classification report, and a feature importance section. Add a title page with the report date.
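As a reference point for judging responses to Prompt 21, the following is a minimal sketch of such a pipeline. It runs on synthetic data; the column names `age`, `income`, and `plan` and the way the target is generated are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with hypothetical columns, for illustration only.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "plan": rng.choice(["basic", "plus", "pro"], n),
})
y = (X["income"] + rng.normal(0, 10_000, n) > 50_000).astype(int)

# Scale numeric columns and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# 5-fold stratified cross-validation with the metrics named in Prompt 21.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
scores = cross_validate(model, X, y, cv=cv, scoring=scoring)

for metric in scoring:
    fold_scores = scores[f"test_{metric}"]
    print(metric, np.round(fold_scores, 3), "mean:", round(fold_scores.mean(), 3))
```

Keeping the preprocessing inside the Pipeline means the scaler and encoder are fitted on each training fold only, which avoids leaking information from the validation fold.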
FAQ
Should I use pandas or polars for large datasets?
Use Polars for large datasets (millions of rows). Polars can be 5-30x faster than pandas for many operations because it is written in Rust, supports lazy evaluation, and automatically parallelises across CPU cores; the actual speedup depends on the workload. Use pandas for smaller datasets, for compatibility with existing code and libraries, and for its richer ecosystem of integrations. Polars has a DataFrame API that is easy to learn if you already know pandas. For new projects working with large data, Polars is the better starting point.
What is the best Python library for machine learning in 2026?
Use scikit-learn for classical machine learning (classification, regression, clustering, feature engineering, model evaluation). Use XGBoost and LightGBM for gradient boosting on tabular data; they often outperform random forests on tabular benchmarks. Use PyTorch for deep learning and Hugging Face Transformers for NLP and large language model fine-tuning. For most business analytics and predictive modelling tasks, XGBoost with proper feature engineering remains highly competitive with more complex approaches.
How do I handle very large datasets that do not fit in memory?
Process the file in chunks using pandas read_csv with the chunksize parameter, then aggregate the results. Use Dask for parallel, out-of-core operations through a largely pandas-compatible API. Use Polars lazy mode, which can stream data through the computation without loading it all into memory. Use DuckDB for SQL queries on Parquet or CSV files stored on disk; it is very fast for analytical queries, including on data that exceeds RAM. For truly massive data, use Spark or BigQuery.
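The chunked approach can be sketched as follows. The in-memory `csv_data` stands in for a large file on disk, and the column names `category` and `amount` are hypothetical; in practice you would pass a file path to `read_csv` and use a much larger `chunksize`.

```python
import io

import pandas as pd

# Stand-in for a large CSV on disk; hypothetical columns, for illustration only.
csv_data = io.StringIO(
    "category,amount\n"
    "books,10\n"
    "games,20\n"
    "books,5\n"
    "toys,7\n"
    "games,3\n"
)

# Aggregate each chunk separately so only one chunk is in memory at a time.
partials = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    partials.append(chunk.groupby("category")["amount"].sum())

# Combine the per-chunk sums into final totals per category.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This pattern works for aggregations that can be combined from partial results, such as sums and counts. A median, for example, cannot be computed this way without extra work.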
What is feature engineering and why does it matter?
Feature engineering is creating new input variables from existing data to improve model performance. Examples: extracting day of week and hour from a timestamp column, computing the ratio of two numeric columns, encoding a postcode as distance from a city centre, and creating interaction terms between related features. Feature engineering often improves model performance more than choosing a more complex model. A well-engineered feature set with a simple model frequently outperforms a raw feature set with a complex model.
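A few of these examples can be sketched in pandas. The `events` DataFrame and its columns are hypothetical, and the `is_weekend` flag is an extra illustration not listed above.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-03-04 09:15", "2024-03-09 22:40"]),
    "revenue": [120.0, 80.0],
    "visits": [40, 10],
})

features = events.assign(
    # Calendar features extracted from the timestamp (Monday is 0).
    day_of_week=events["timestamp"].dt.dayofweek,
    hour=events["timestamp"].dt.hour,
    is_weekend=events["timestamp"].dt.dayofweek >= 5,
    # Ratio of two numeric columns.
    revenue_per_visit=events["revenue"] / events["visits"],
)
print(features)
```

Each derived column encodes domain knowledge that a model would otherwise have to learn from the raw timestamp and counts.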