Python Data Processing: Pandas for Business Automation

Master Pandas for Business Automation: Process Data at Scale

Pandas is Python's powerhouse for data manipulation. If you're spending hours in Excel cleaning, transforming, and analyzing data, Pandas can do it in seconds—and automate the entire process. Here's how to leverage Pandas for business automation.

Why Pandas for Business?

  • Speed: Process millions of rows in seconds
  • Reproducibility: Same process every time
  • Integration: Connect to databases, APIs, files
  • Complexity: Handle transformations Excel can't
  • Automation: Schedule and run unattended

Getting Started with Pandas

import pandas as pd

# Read data from various sources
df_csv = pd.read_csv('sales_data.csv')
df_excel = pd.read_excel('inventory.xlsx', sheet_name='Stock')
df_sql = pd.read_sql('SELECT * FROM orders', connection)
df_json = pd.read_json('api_response.json')

Common Business Transformations

Data Cleaning

# Remove duplicates
df = df.drop_duplicates(subset=['email'])

# Handle missing values
df['phone'] = df['phone'].fillna('Unknown')
df = df.dropna(subset=['email'])  # Remove rows without email

# Standardize text
df['name'] = df['name'].str.strip().str.title()
df['email'] = df['email'].str.lower()

# Fix data types
df['date'] = pd.to_datetime(df['date'])
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
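Strung together, the cleaning steps above might look like the following self-contained sketch; the column names and sample values are illustrative, not from a real dataset:

```python
import pandas as pd

# Illustrative raw data with the kinds of problems described above
df = pd.DataFrame({
    'name': ['  alice smith ', 'BOB JONES', '  alice smith '],
    'email': ['Alice@Example.com', 'bob@example.com', 'Alice@Example.com'],
    'phone': ['555-0100', None, '555-0100'],
    'amount': ['100.50', 'N/A', '100.50'],
})

df['email'] = df['email'].str.lower()              # normalize before deduping
df = df.drop_duplicates(subset=['email'])
df['phone'] = df['phone'].fillna('Unknown')
df['name'] = df['name'].str.strip().str.title()
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')  # 'N/A' becomes NaN

print(df)
```

Note that lowercasing emails before deduplicating matters: otherwise `Alice@Example.com` and `alice@example.com` would survive as two "different" customers.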

Filtering and Querying

# Filter rows
active_customers = df[df['status'] == 'active']
high_value = df[df['revenue'] > 10000]
recent = df[df['date'] >= '2024-01-01']

# Complex conditions
vip_customers = df[
    (df['revenue'] > 50000) &
    (df['orders'] >= 10) &
    (df['status'] == 'active')
]
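The same multi-condition filter can also be written with `DataFrame.query`, which some teams find more readable than chained boolean masks; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'revenue': [60000, 40000, 75000],
    'orders': [12, 3, 15],
    'status': ['active', 'active', 'churned'],
})

# Equivalent to chaining (mask1) & (mask2) & (mask3)
vip = df.query("revenue > 50000 and orders >= 10 and status == 'active'")
print(vip)
```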

Aggregation and Grouping

# Group by and aggregate
sales_by_region = df.groupby('region').agg({
    'revenue': 'sum',
    'orders': 'count',
    'customer_id': 'nunique'
}).reset_index()
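The same aggregation can be expressed with named aggregation, which lets you choose output column names directly instead of keeping the input names; illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'East', 'West'],
    'revenue': [100, 200, 300],
    'customer_id': [1, 1, 2],
})

# Each output column is (input_column, aggregation_function)
sales_by_region = df.groupby('region').agg(
    revenue=('revenue', 'sum'),
    orders=('revenue', 'count'),
    customers=('customer_id', 'nunique'),
).reset_index()
print(sales_by_region)
```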

# Pivot tables
monthly_sales = df.pivot_table(
    values='revenue',
    index='product',
    columns='month',
    aggfunc='sum'
)
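One practical detail with pivot tables: product/month combinations with no sales come back as NaN, which `fill_value=0` cleans up. A minimal sketch with invented numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'A', 'B'],
    'month': ['Jan', 'Feb', 'Jan'],
    'revenue': [100, 150, 200],
})

# fill_value=0 replaces NaN for product/month pairs with no sales
monthly = df.pivot_table(values='revenue', index='product',
                         columns='month', aggfunc='sum', fill_value=0)
print(monthly)
```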

Merging Data Sources

# Combine datasets
orders_with_customers = pd.merge(
    orders_df,
    customers_df,
    on='customer_id',
    how='left'
)

# Append datasets
all_sales = pd.concat([q1_sales, q2_sales, q3_sales, q4_sales])
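When merging with `how='left'`, it's worth checking for orders that found no matching customer; the `indicator` flag adds a `_merge` column that makes this easy. A small sketch with hypothetical data:

```python
import pandas as pd

orders_df = pd.DataFrame({'order_id': [1, 2], 'customer_id': [10, 99]})
customers_df = pd.DataFrame({'customer_id': [10], 'name': ['Acme']})

merged = pd.merge(orders_df, customers_df, on='customer_id',
                  how='left', indicator=True)

# Rows with _merge == 'left_only' had no matching customer record
orphans = merged[merged['_merge'] == 'left_only']
print(orphans)
```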

Business Automation Examples

1. Daily Sales Report

def generate_daily_report():
    # Load today's sales
    df = pd.read_sql('''
        SELECT * FROM orders
        WHERE DATE(created_at) = CURRENT_DATE
    ''', connection)

    # Calculate metrics
    report = {
        'total_revenue': df['amount'].sum(),
        'order_count': len(df),
        'avg_order_value': df['amount'].mean(),
        'top_products': df.groupby('product')['amount'].sum().nlargest(5)
    }

    # Export raw data for the day (date string for the filename)
    today = pd.Timestamp.today().strftime('%Y-%m-%d')
    df.to_excel(f'daily_report_{today}.xlsx', index=False)

    return report

2. Customer Segmentation

def segment_customers(df):
    # Calculate RFM metrics (recency measured relative to today's date)
    today = pd.Timestamp.today().normalize()
    rfm = df.groupby('customer_id').agg({
        'order_date': lambda x: (today - x.max()).days,  # Recency
        'order_id': 'count',  # Frequency
        'amount': 'sum'  # Monetary
    }).rename(columns={
        'order_date': 'recency',
        'order_id': 'frequency',
        'amount': 'monetary'
    })

    # Score each metric
    rfm['r_score'] = pd.qcut(rfm['recency'], 4, labels=[4,3,2,1])
    rfm['f_score'] = pd.qcut(rfm['frequency'], 4, labels=[1,2,3,4])
    rfm['m_score'] = pd.qcut(rfm['monetary'], 4, labels=[1,2,3,4])

    # Create segment labels
    rfm['segment'] = rfm.apply(assign_segment, axis=1)

    return rfm
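The `assign_segment` helper above is left for you to define. One possible sketch; the thresholds and labels here are arbitrary choices, not a standard:

```python
def assign_segment(row):
    # qcut labels are categorical; convert them back to integers to sum
    score = int(row['r_score']) + int(row['f_score']) + int(row['m_score'])
    if score >= 10:
        return 'champion'
    elif score >= 7:
        return 'loyal'
    elif score >= 5:
        return 'at_risk'
    return 'hibernating'
```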

3. Inventory Reconciliation

def reconcile_inventory():
    system_inventory = pd.read_sql("SELECT * FROM inventory", db)
    physical_count = pd.read_excel("physical_count.xlsx")

    # Merge and compare
    comparison = pd.merge(
        system_inventory,
        physical_count,
        on='sku',
        suffixes=('_system', '_physical')
    )

    comparison['discrepancy'] = (
        comparison['quantity_physical'] - comparison['quantity_system']
    )

    # Flag issues
    issues = comparison[comparison['discrepancy'] != 0]

    return issues

Scheduling Pandas Scripts

Using Cron (Linux/Mac)

# Run daily at 6 AM
0 6 * * * /usr/bin/python3 /path/to/daily_report.py

Using Task Scheduler (Windows)

Create a scheduled task that runs your Python script.

Using n8n/Make

  • Schedule trigger
  • Call Python script via HTTP endpoint
  • Distribute results via email/Slack

Performance Tips

  • Use appropriate dtypes: Reduce memory usage
  • Process in chunks: For very large files
  • Use vectorized operations: Avoid loops
  • Filter early: Reduce data before processing

# Process large file in chunks
chunks = pd.read_csv('huge_file.csv', chunksize=100000)
results = []
for chunk in chunks:
    processed = process_chunk(chunk)
    results.append(processed)
final = pd.concat(results)
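The dtype tip is easy to demonstrate: a repeated string column stored as `object` shrinks dramatically when cast to `category`. A quick sketch with synthetic data:

```python
import pandas as pd

# 100,000 rows but only two distinct values
df = pd.DataFrame({'region': ['North', 'South'] * 50000})

before = df['region'].memory_usage(deep=True)
df['region'] = df['region'].astype('category')
after = df['region'].memory_usage(deep=True)

print(f'{before:,} bytes -> {after:,} bytes')
```

The category dtype stores each distinct value once and keeps only small integer codes per row, which is why the savings grow with row count.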

Need Data Automation Help?

Our Python team builds custom data processing pipelines that automate your reporting and analysis workflows.

Contact us to discuss your data automation needs.
