Master Pandas for Business Automation: Process Data at Scale
Pandas is Python's powerhouse for data manipulation. If you're spending hours in Excel cleaning, transforming, and analyzing data, Pandas can do it in seconds—and automate the entire process. Here's how to leverage Pandas for business automation.
Why Pandas for Business?
- Speed: Process millions of rows in seconds
- Reproducibility: Same process every time
- Integration: Connect to databases, APIs, files
- Complexity: Handle transformations Excel can't
- Automation: Schedule and run unattended
Getting Started with Pandas
import pandas as pd
# Read data from various sources
df_csv = pd.read_csv('sales_data.csv')
df_excel = pd.read_excel('inventory.xlsx', sheet_name='Stock')
df_sql = pd.read_sql('SELECT * FROM orders', connection)  # 'connection' is an open SQLAlchemy engine or DB-API connection
df_json = pd.read_json('api_response.json')
Common Business Transformations
Data Cleaning
# Remove duplicates
df = df.drop_duplicates(subset=['email'])
# Handle missing values
df['phone'] = df['phone'].fillna('Unknown')
df = df.dropna(subset=['email']) # Remove rows without email
# Standardize text
df['name'] = df['name'].str.strip().str.title()
df['email'] = df['email'].str.lower()
# Fix data types
df['date'] = pd.to_datetime(df['date'])
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
Filtering and Querying
# Filter rows
active_customers = df[df['status'] == 'active']
high_value = df[df['revenue'] > 10000]
recent = df[df['date'] >= '2024-01-01']
# Complex conditions
vip_customers = df[
(df['revenue'] > 50000) &
(df['orders'] >= 10) &
(df['status'] == 'active')
]
Aggregation and Grouping
# Group by and aggregate
sales_by_region = df.groupby('region').agg({
'revenue': 'sum',
'orders': 'count',
'customer_id': 'nunique'
}).reset_index()
# Pivot tables
monthly_sales = df.pivot_table(
values='revenue',
index='product',
columns='month',
aggfunc='sum'
)
Merging Data Sources
# Combine datasets
orders_with_customers = pd.merge(
orders_df,
customers_df,
on='customer_id',
how='left'
)
# Append datasets (ignore_index renumbers rows instead of keeping each quarter's own index)
all_sales = pd.concat([q1_sales, q2_sales, q3_sales, q4_sales], ignore_index=True)
Business Automation Examples
1. Daily Sales Report
from datetime import date

def generate_daily_report():
    # Load today's sales ('connection' is an open database connection)
    df = pd.read_sql('''
        SELECT * FROM orders
        WHERE DATE(created_at) = CURRENT_DATE
    ''', connection)
    # Calculate metrics
    report = {
        'total_revenue': df['amount'].sum(),
        'order_count': len(df),
        'avg_order_value': df['amount'].mean(),
        'top_products': df.groupby('product')['amount'].sum().nlargest(5)
    }
    # Export the raw rows alongside the summary
    today = date.today().isoformat()
    df.to_excel(f'daily_report_{today}.xlsx', index=False)
    return report
2. Customer Segmentation
def segment_customers(df):
    # Use the most recent order date in the data as the reference "today"
    snapshot = df['order_date'].max()
    # Calculate RFM metrics
    rfm = df.groupby('customer_id').agg({
        'order_date': lambda x: (snapshot - x.max()).days,  # Recency
        'order_id': 'count',                                # Frequency
        'amount': 'sum'                                     # Monetary
    }).rename(columns={
        'order_date': 'recency',
        'order_id': 'frequency',
        'amount': 'monetary'
    })
    # Score each metric into quartiles (low recency is good, so its labels are reversed;
    # ranking frequency first avoids qcut errors when many customers share a value)
    rfm['r_score'] = pd.qcut(rfm['recency'], 4, labels=[4, 3, 2, 1])
    rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 4, labels=[1, 2, 3, 4])
    rfm['m_score'] = pd.qcut(rfm['monetary'], 4, labels=[1, 2, 3, 4])
    # Create segment labels (assign_segment is sketched below)
    rfm['segment'] = rfm.apply(assign_segment, axis=1)
    return rfm
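The snippet above calls assign_segment without defining it. Here is a minimal sketch of one way to write it, assuming a simple combined-score scheme (the thresholds and labels are illustrative, not prescriptive):

def assign_segment(row):
    # Hypothetical thresholds, tune these to your own customer base
    score = int(row['r_score']) + int(row['f_score']) + int(row['m_score'])
    if score >= 10:
        return 'VIP'
    elif score >= 7:
        return 'Loyal'
    elif int(row['r_score']) <= 2:
        return 'At Risk'
    return 'Developing'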
3. Inventory Reconciliation
def reconcile_inventory():
    # 'db' is an open database connection
    system_inventory = pd.read_sql("SELECT * FROM inventory", db)
    physical_count = pd.read_excel("physical_count.xlsx")
    # Merge and compare the two quantity columns
    comparison = pd.merge(
        system_inventory,
        physical_count,
        on='sku',
        suffixes=('_system', '_physical')
    )
    comparison['discrepancy'] = (
        comparison['quantity_physical'] - comparison['quantity_system']
    )
    # Flag SKUs where the physical count disagrees with the system
    issues = comparison[comparison['discrepancy'] != 0]
    return issues
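In practice you would export or distribute the flagged rows; a minimal follow-up (the filename is just an example):

issues = reconcile_inventory()
issues.to_excel('inventory_discrepancies.xlsx', index=False)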
Scheduling Pandas Scripts
Using Cron (Linux/Mac)
# Run daily at 6 AM
0 6 * * * /usr/bin/python3 /path/to/daily_report.py
Using Task Scheduler (Windows)
Create a scheduled task that runs your Python script.
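If you prefer the command line, schtasks can create the same task (the task name and paths below are placeholders):

schtasks /Create /SC DAILY /ST 06:00 /TN "DailyReport" /TR "C:\Python312\python.exe C:\path\to\daily_report.py"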
Using n8n/Make
- Schedule trigger
- Call your Python script via an HTTP endpoint (see the sketch below)
- Distribute results via email/Slack
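A minimal sketch of such an endpoint using Flask, wrapping the generate_daily_report() function from earlier (the route and port are illustrative assumptions):

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/run-daily-report', methods=['POST'])
def run_daily_report():
    report = generate_daily_report()  # the function from the daily report example
    # Convert pandas/NumPy values to plain Python types so the response serializes cleanly
    return jsonify({
        'total_revenue': float(report['total_revenue']),
        'order_count': int(report['order_count']),
        'avg_order_value': float(report['avg_order_value']),
        'top_products': {k: float(v) for k, v in report['top_products'].items()},
    })

if __name__ == '__main__':
    app.run(port=8000)

n8n's Schedule Trigger (or Make's scheduler) can then fire an HTTP request at this URL and route the JSON response to email or Slack.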
Performance Tips
- Use appropriate dtypes: Categorical and smaller numeric types cut memory use (see the dtype sketch below)
- Process in chunks: Stream very large files instead of loading them whole
- Use vectorized operations: Column-wise operations beat row-by-row Python loops
- Filter early: Drop unneeded rows and columns before heavy processing
# Process a large file in chunks ('process_chunk' stands in for your per-chunk logic)
chunks = pd.read_csv('huge_file.csv', chunksize=100000)
results = []
for chunk in chunks:
    processed = process_chunk(chunk)
    results.append(processed)
final = pd.concat(results, ignore_index=True)  # one clean index across all chunks
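And a quick sketch of the dtype tip (the column names here are placeholders): declaring types at read time keeps memory down from the first row.

df = pd.read_csv(
    'huge_file.csv',
    dtype={'region': 'category', 'status': 'category', 'quantity': 'int32'},
    parse_dates=['created_at']
)
print(df.memory_usage(deep=True))  # compare against the same read with default dtypes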
Need Data Automation Help?
Our Python team builds custom data processing pipelines that automate your reporting and analysis workflows.
Contact us to discuss your data automation needs.