In the previous post we explored the foundational concepts behind DataFrames. Now it’s time to get hands-on. This practical guide walks you through real-world DataFrame operations, helping you transition from conceptual understanding to effective data manipulation and analysis.
We’ll work through a complete example using employee and sales data to demonstrate each concept in context, building from basic operations to advanced analytical techniques.
Prerequisites
Before diving in, ensure you have:
- Python 3.x installed
- Basic Python knowledge (lists, dictionaries, functions)
- Pandas library installed (pip install pandas)
- NumPy for numerical operations (pip install numpy)
- Familiarity with the concepts from Part 1: Understanding DataFrames
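If you want to confirm your environment is ready before continuing, a quick version check is enough (a minimal sketch; any reasonably recent pandas and NumPy versions will work for this guide):
import pandas as pd
import numpy as np
print("pandas:", pd.__version__)
print("numpy:", np.__version__)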
Our Working Example: Company Data Analysis
Throughout this guide, we’ll analyze a company’s employee and sales data to demonstrate DataFrame concepts. This multi-table scenario will help you understand when and how to apply different DataFrame operations in realistic business contexts.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Employee data
employee_data = {
'employee_id': [101, 102, 103, 104, 105, 106, 107, 108],
'name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince',
'Eve Wilson', 'Frank Miller', 'Grace Lee', 'Henry Davis'],
'department': ['Engineering', 'Sales', 'Engineering', 'Marketing',
'Sales', 'Engineering', 'Marketing', 'Sales'],
'salary': [95000, 65000, 88000, 72000, 68000, 92000, 75000, 71000],
'hire_date': ['2020-01-15', '2019-03-20', '2018-07-10', '2021-02-01',
'2019-11-15', '2017-05-30', '2020-08-12', '2021-06-25'],
'performance_score': [8.5, 7.2, 9.1, 8.0, 7.8, 8.9, 8.3, 7.5]
}
# Sales data
sales_data = {
'sale_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
'employee_id': [102, 105, 108, 102, 105, 108, 102, 105, 108, 102],
'sale_date': ['2024-01-15', '2024-01-18', '2024-01-20', '2024-02-05',
'2024-02-10', '2024-02-15', '2024-03-01', '2024-03-08',
'2024-03-12', '2024-03-20'],
'amount': [15000, 22000, 18000, 25000, 19000, 21000, 17000, 24000, 20000, 16000],
'product_category': ['Software', 'Hardware', 'Software', 'Hardware',
'Software', 'Hardware', 'Software', 'Hardware',
'Software', 'Hardware']
}
Creating DataFrames: From Data to Structure
Basic DataFrame Creation
Let’s start by creating DataFrames from our datasets:
# Create DataFrames from dictionaries
employees = pd.DataFrame(employee_data)
sales = pd.DataFrame(sales_data)
print("Employee DataFrame:")
print(employees)
print(f"\nShape: {employees.shape}")
print(f"Columns: {list(employees.columns)}")
print("\nSales DataFrame:")
print(sales.head())
print(f"\nShape: {sales.shape}")
Data Type Optimization
Proper data types are crucial for performance and functionality:
# Convert date strings to datetime objects
employees['hire_date'] = pd.to_datetime(employees['hire_date'])
sales['sale_date'] = pd.to_datetime(sales['sale_date'])
# Optimize categorical data
employees['department'] = employees['department'].astype('category')
sales['product_category'] = sales['product_category'].astype('category')
# Set meaningful indices
employees.set_index('employee_id', inplace=True)
sales.set_index('sale_id', inplace=True)
print("Optimized data types:")
print(employees.dtypes)
print(f"\nEmployee index: {employees.index.name}")
Creating DataFrames from Multiple Sources
# From CSV (simulated)
# employees = pd.read_csv('employees.csv', index_col='employee_id', parse_dates=['hire_date'])
# From lists of dictionaries
additional_employees = [
{'employee_id': 109, 'name': 'Ivy Chen', 'department': 'Engineering',
'salary': 89000, 'hire_date': '2022-01-10', 'performance_score': 8.7},
{'employee_id': 110, 'name': 'Jack Wilson', 'department': 'Marketing',
'salary': 73000, 'hire_date': '2022-03-15', 'performance_score': 8.1}
]
new_employees = pd.DataFrame(additional_employees)
new_employees['hire_date'] = pd.to_datetime(new_employees['hire_date'])
new_employees['department'] = new_employees['department'].astype('category')
new_employees.set_index('employee_id', inplace=True)
print("New employees DataFrame:")
print(new_employees)
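# To fold these new hires into the main table, pd.concat stacks DataFrames that
# share the same columns. This is just a quick sketch -- the rest of the guide
# keeps working with the original `employees` frame, and `all_employees` is an
# illustrative name.
all_employees = pd.concat([employees, new_employees])
print(f"Combined employee count: {len(all_employees)}")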
Essential DataFrame Operations: Data Exploration and Manipulation
Data Inspection and Summary
Always start by understanding your data:
# Basic information
print("=== EMPLOYEE DATA OVERVIEW ===")
print(f"Shape: {employees.shape}")
print(f"Memory usage: {employees.memory_usage(deep=True).sum()} bytes")
# Statistical summary
print("\nNumerical columns summary:")
print(employees.describe())
# Categorical data summary
print(f"\nDepartment distribution:")
print(employees['department'].value_counts())
# Missing data check
print(f"\nMissing values:")
print(employees.isnull().sum())
# Data types and info
print(f"\nData types:")
print(employees.dtypes)
Filtering and Selection
Find specific subsets of your data:
# Single condition filtering
high_performers = employees[employees['performance_score'] > 8.0]
print("High performers (score > 8.0):")
print(high_performers[['name', 'department', 'performance_score']])
# Multiple conditions
senior_engineers = employees[
(employees['department'] == 'Engineering') &
(employees['salary'] > 90000)
]
print(f"\nSenior Engineers (Engineering + salary > 90k):")
print(senior_engineers[['name', 'salary']])
# Using isin() for multiple values
sales_marketing = employees[employees['department'].isin(['Sales', 'Marketing'])]
print(f"\nSales and Marketing employees:")
print(sales_marketing['department'].value_counts())
# Date-based filtering
recent_hires = employees[employees['hire_date'] > '2020-01-01']
print(f"\nRecent hires (after 2020):")
print(recent_hires[['name', 'hire_date']].sort_values('hire_date'))
Column Operations and Transformations
Create new insights from existing data:
# Calculate years of experience
current_date = pd.Timestamp('2024-01-01')
employees['years_experience'] = (current_date - employees['hire_date']).dt.days / 365.25
# Create salary bands
def categorize_salary(salary):
    if salary < 70000:
        return 'Entry Level'
    elif salary < 85000:
        return 'Mid Level'
    else:
        return 'Senior Level'
employees['salary_band'] = employees['salary'].apply(categorize_salary)
# Performance categories using cut()
employees['performance_category'] = pd.cut(
employees['performance_score'],
bins=[0, 7.5, 8.5, 10],
labels=['Needs Improvement', 'Good', 'Excellent']
)
print("Enhanced employee data:")
print(employees[['name', 'years_experience', 'salary_band', 'performance_category']].head())
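# As an aside: the same salary bands can be produced without a row-wise apply()
# by binning directly with pd.cut, which is vectorized. This sketch writes the
# result to an illustrative comparison column, `salary_band_cut`.
employees['salary_band_cut'] = pd.cut(
    employees['salary'],
    bins=[0, 70000, 85000, float('inf')],
    labels=['Entry Level', 'Mid Level', 'Senior Level'],
    right=False  # left-inclusive bins mirror the < comparisons in categorize_salary
)
print(employees[['salary', 'salary_band', 'salary_band_cut']].head())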
Sorting and Ranking
Organize data for analysis:
# Sort by multiple columns
sorted_employees = employees.sort_values(['department', 'salary'], ascending=[True, False])
print("Employees sorted by department, then salary (desc):")
print(sorted_employees[['name', 'department', 'salary']])
# Ranking within groups
employees['salary_rank_in_dept'] = employees.groupby('department')['salary'].rank(
method='dense', ascending=False
)
print(f"\nSalary rankings within departments:")
print(employees[['name', 'department', 'salary', 'salary_rank_in_dept']].sort_values(['department', 'salary_rank_in_dept']))
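# For simple top-N questions, nlargest() and nsmallest() are concise shortcuts
# that avoid sorting the whole frame -- a small illustration:
print("\nThree highest-paid employees:")
print(employees.nlargest(3, 'salary')[['name', 'department', 'salary']])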
Advanced DataFrame Techniques: Real-World Applications
Grouping and Aggregation
Analyze data across different dimensions:
# Department-wise analysis
dept_analysis = employees.groupby('department').agg({
'salary': ['mean', 'min', 'max', 'std'],
'performance_score': ['mean', 'count'],
'years_experience': 'mean'
}).round(2)
print("Department Analysis:")
print(dept_analysis)
# Multiple grouping levels
performance_salary_analysis = employees.groupby(['performance_category', 'department']).agg({
'salary': 'mean',
'name': 'count'
}).round(2)
print(f"\nPerformance vs Department Analysis:")
print(performance_salary_analysis)
# Custom aggregation functions
def salary_range(series):
return series.max() - series.min()
custom_analysis = employees.groupby('department').agg({
'salary': [salary_range, 'mean'],
'performance_score': lambda x: x.quantile(0.75)
})
print(f"\nCustom aggregation:")
print(custom_analysis)
Merging and Joining DataFrames
Combine data from multiple sources:
# Add sales performance to employee data
sales_summary = sales.groupby('employee_id').agg({
'amount': ['sum', 'count', 'mean'],
'sale_date': ['min', 'max']
}).round(2)
# Flatten column names
sales_summary.columns = ['total_sales', 'num_sales', 'avg_sale', 'first_sale', 'last_sale']
print("Sales summary by employee:")
print(sales_summary)
# Merge with employee data
employee_sales = employees.merge(sales_summary, left_index=True, right_index=True, how='left')
print(f"\nEmployees with sales data:")
print(employee_sales[['name', 'department', 'total_sales', 'num_sales']].fillna(0))
# Different join types
inner_join = employees.merge(sales_summary, left_index=True, right_index=True, how='inner')
print(f"\nOnly employees with sales (inner join): {len(inner_join)} employees")
outer_join = employees.merge(sales_summary, left_index=True, right_index=True, how='outer')
print(f"All employees and sales data (outer join): {len(outer_join)} records")
Pivot Tables and Cross-Tabulation
Reshape data for analysis:
# Sales pivot table
sales_pivot = sales.pivot_table(
values='amount',
index=sales['sale_date'].dt.month,
columns='product_category',
aggfunc=['sum', 'count'],
fill_value=0
)
print("Monthly sales by product category:")
print(sales_pivot)
# Cross-tabulation
dept_performance_crosstab = pd.crosstab(
employees['department'],
employees['performance_category'],
margins=True
)
print(f"\nDepartment vs Performance cross-tabulation:")
print(dept_performance_crosstab)
# Percentage cross-tabulation
dept_performance_pct = pd.crosstab(
employees['department'],
employees['performance_category'],
normalize='index'
).round(3)
print(f"\nPercentage distribution:")
print(dept_performance_pct)
Time Series Operations
Work with temporal data:
# Set sales date as index for time series analysis
sales_ts = sales.set_index('sale_date').sort_index()
# Resample to monthly data
monthly_sales = sales_ts.resample('M').agg({
'amount': 'sum',
'employee_id': 'count'
})
monthly_sales.columns = ['total_amount', 'num_sales']
print("Monthly sales summary:")
print(monthly_sales)
# Rolling averages
sales_ts['rolling_avg'] = sales_ts['amount'].rolling(window=3).mean()
print(f"\nSales with 3-period rolling average:")
print(sales_ts[['amount', 'rolling_avg']].head(6))
# Date-based filtering and analysis
q1_sales = sales_ts.loc['2024-01':'2024-03']
print(f"\nQ1 2024 sales: ${q1_sales['amount'].sum():,.2f}")
Data Quality and Maintenance
Handling Missing Data
# Create some missing data for demonstration
employees_with_missing = employees.copy()
employees_with_missing.loc[employees_with_missing.index[0], 'performance_score'] = np.nan
employees_with_missing.loc[employees_with_missing.index[2], 'salary'] = np.nan
print("Missing data analysis:")
print(employees_with_missing.isnull().sum())
# Different strategies for handling missing data
print(f"\nFill missing performance scores with department average:")
dept_avg_performance = employees_with_missing.groupby('department')['performance_score'].transform('mean')
employees_with_missing['performance_score'].fillna(dept_avg_performance, inplace=True)
print(f"Fill missing salary with median:")
median_salary = employees_with_missing['salary'].median()
employees_with_missing['salary'].fillna(median_salary, inplace=True)
print("After filling missing values:")
print(employees_with_missing.isnull().sum())
Data Validation and Cleaning
# Detect outliers using IQR method
def detect_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] < lower_bound) | (df[column] > upper_bound)]
salary_outliers = detect_outliers(employees, 'salary')
print(f"Salary outliers:")
print(salary_outliers[['name', 'salary']])
# Data validation
def validate_employee_data(df):
    issues = []
    # Check for negative salaries
    if (df['salary'] < 0).any():
        issues.append("Negative salaries found")
    # Check for future hire dates
    if (df['hire_date'] > pd.Timestamp.now()).any():
        issues.append("Future hire dates found")
    # Check performance score range
    if ((df['performance_score'] < 0) | (df['performance_score'] > 10)).any():
        issues.append("Performance scores outside valid range (0-10)")
    return issues
validation_results = validate_employee_data(employees)
print(f"\nData validation results: {validation_results if validation_results else 'All checks passed'}")
Performance Optimization: Working with Large DataFrames
Memory Optimization
# Memory usage analysis
def analyze_memory_usage(df):
    memory_usage = df.memory_usage(deep=True)
    total_memory = memory_usage.sum()
    print(f"Total memory usage: {total_memory:,.0f} bytes ({total_memory/1024/1024:.2f} MB)")
    print("\nMemory by column:")
    for col in df.columns:
        print(f"{col}: {memory_usage[col]:,.0f} bytes")
    return memory_usage
print("Current memory usage:")
analyze_memory_usage(employees)
# Optimize data types
def optimize_dataframe(df):
    optimized = df.copy()
    # Optimize integer columns
    for col in optimized.select_dtypes(include=['int64']).columns:
        col_min = optimized[col].min()
        col_max = optimized[col].max()
        if col_min >= 0:
            if col_max < 255:
                optimized[col] = optimized[col].astype('uint8')
            elif col_max < 65535:
                optimized[col] = optimized[col].astype('uint16')
            elif col_max < 4294967295:
                optimized[col] = optimized[col].astype('uint32')
        else:
            if col_min > -128 and col_max < 127:
                optimized[col] = optimized[col].astype('int8')
            elif col_min > -32768 and col_max < 32767:
                optimized[col] = optimized[col].astype('int16')
    # Optimize float columns
    for col in optimized.select_dtypes(include=['float64']).columns:
        optimized[col] = optimized[col].astype('float32')
    return optimized
optimized_employees = optimize_dataframe(employees)
print(f"\nAfter optimization:")
analyze_memory_usage(optimized_employees)
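# A simpler route to the same integer downcasting is pd.to_numeric with the
# downcast argument, which picks the smallest type that fits. A read-only
# sketch on the salary column:
downcast_salary = pd.to_numeric(employees['salary'], downcast='unsigned')
print(f"\nDowncast salary dtype: {downcast_salary.dtype}")  # salaries up to 95,000 exceed uint16, so uint32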
Efficient Operations
# Vectorized operations vs loops
import time
# Create larger dataset for performance testing
large_employees = pd.concat([employees] * 1000, ignore_index=True)
# Slow: iterative approach
start_time = time.time()
bonus_slow = []
for _, row in large_employees.iterrows():
    if row['performance_score'] > 8.0:
        bonus_slow.append(row['salary'] * 0.1)
    else:
        bonus_slow.append(row['salary'] * 0.05)
slow_time = time.time() - start_time
# Fast: vectorized approach
start_time = time.time()
bonus_fast = np.where(
large_employees['performance_score'] > 8.0,
large_employees['salary'] * 0.1,
large_employees['salary'] * 0.05
)
fast_time = time.time() - start_time
print(f"Iterative approach: {slow_time:.4f} seconds")
print(f"Vectorized approach: {fast_time:.4f} seconds")
print(f"Speedup: {slow_time/fast_time:.1f}x faster")
Quick Reference: Essential DataFrame Operations
Data Exploration Methods
# Essential exploration methods
df.head(n) # First n rows
df.tail(n) # Last n rows
df.info() # Data types and memory usage
df.describe() # Statistical summary
df.shape # Dimensions (rows, columns)
df.columns # Column names
df.index # Index information
df.dtypes # Data types
df.memory_usage(deep=True) # Memory usage by column
Selection and Filtering
# Selection methods
df['column'] # Single column
df[['col1', 'col2']] # Multiple columns
df.loc[row_indexer, col_indexer] # Label-based selection
df.iloc[row_indexer, col_indexer] # Position-based selection
df.query('column > value') # SQL-like filtering
df[df['column'] > value] # Boolean indexing
df.isin(values) # Check membership
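# query() did not appear in the walkthrough, so a quick illustration: it expresses
# the earlier senior-engineers filter as a readable string (a sketch using the
# employees frame built above).
senior_engineers = employees.query("department == 'Engineering' and salary > 90000")
print(senior_engineers[['name', 'salary']])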
Aggregation and Grouping
# Aggregation methods
df.groupby('column').agg(func) # Group and aggregate
df.pivot_table(values, index, columns) # Pivot table
df.resample('M').sum() # Time-based resampling
pd.crosstab(df['col1'], df['col2']) # Cross-tabulation
df.value_counts() # Count unique values
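# Named aggregation -- agg() with keyword arguments -- yields flat, readable
# column names and avoids the manual column-flattening step used in the merge
# section above. A brief sketch; `dept_summary` is just an illustrative name.
dept_summary = employees.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    headcount=('salary', 'count'),
    avg_score=('performance_score', 'mean')
)
print(dept_summary)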
Data Manipulation
# Manipulation methods
df.merge(other, on='key') # Join DataFrames
pd.concat([df1, df2]) # Concatenate DataFrames
df.sort_values('column') # Sort by values
df.drop_duplicates() # Remove duplicates
df.fillna(value) # Fill missing values
df.dropna() # Drop missing values
df.apply(func) # Apply function
df.assign(new_col=values) # Add new column
Putting It All Together: Complete Analysis
Here’s a comprehensive analysis combining all the techniques we’ve learned:
def comprehensive_company_analysis(employees_df, sales_df):
    """
    Perform a comprehensive analysis of company data.
    """
    print("=== COMPREHENSIVE COMPANY ANALYSIS ===")
    # 1. Data Overview
    print(f"Total Employees: {len(employees_df)}")
    print(f"Total Sales Records: {len(sales_df)}")
    print(f"Analysis Period: {sales_df['sale_date'].min()} to {sales_df['sale_date'].max()}")
    # 2. Department Analysis
    dept_stats = employees_df.groupby('department').agg({
        'salary': ['count', 'mean', 'std'],
        'performance_score': 'mean',
        'years_experience': 'mean'
    }).round(2)
    print("\n=== DEPARTMENT ANALYSIS ===")
    print(dept_stats)
    # 3. Sales Performance
    sales_by_employee = sales_df.groupby('employee_id').agg({
        'amount': ['sum', 'count', 'mean']
    }).round(2)
    # Flatten the MultiIndex columns before merging (merge requires matching column levels)
    sales_by_employee.columns = ['total_sales', 'num_sales', 'avg_sale']
    # 4. Top Performers
    combined_data = employees_df.merge(sales_by_employee, left_index=True, right_index=True, how='left')
    top_sellers = combined_data.nlargest(3, 'total_sales')
    print("\n=== TOP SALES PERFORMERS ===")
    print(top_sellers[['name', 'department', 'total_sales', 'num_sales']])
    # 5. Correlation Analysis
    numeric_cols = ['salary', 'performance_score', 'years_experience', 'total_sales']
    correlation_matrix = combined_data[numeric_cols].corr().round(3)
    print("\n=== CORRELATION ANALYSIS ===")
    print(correlation_matrix)
    return combined_data
# Run comprehensive analysis
final_analysis = comprehensive_company_analysis(employees, sales)
Next Steps in Your DataFrame Journey
- Advanced Analytics: Learn statistical analysis and machine learning with pandas
- Data Visualization: Combine with matplotlib/seaborn for compelling charts
- Database Integration: Connect to SQL databases and APIs
- Performance Scaling: Explore Dask for larger datasets
- Time Series Analysis: Deep dive into temporal data patterns
Key Takeaways
- Start with Data Understanding: Always explore your data before analysis
- Use Vectorized Operations: Avoid loops for better performance
- Leverage Grouping: GroupBy is your friend for analytical insights
- Handle Missing Data Thoughtfully: Choose appropriate strategies for your use case
- Optimize Memory Usage: Proper data types save memory and improve performance
- Practice with Real Data: Apply these concepts to your own datasets
Additional Resources
Official Documentation
- Pandas DataFrame Documentation - Complete API reference
- Pandas User Guide - DataFrames - Official introduction to DataFrames
- Pandas Cookbook - Common DataFrame operations and patterns
Foundational Concepts
- Understanding Pandas DataFrames: The Foundation of Modern Data Analysis - Our foundational guide to DataFrame concepts
- Tidy Data Paper - Hadley Wickham’s seminal paper on data structure principles
- The Split-Apply-Combine Strategy - Understanding groupby operations conceptually
In-Depth Guides
- Python Data Science Handbook - Chapter 3 - Jake VanderPlas’s foundational pandas guide
- Modern Pandas - Tom Augspurger’s comprehensive modern pandas techniques
- Effective Pandas - Best practices and patterns for DataFrame usage
- Pandas Illustrated - Visual guide to DataFrame operations
Performance and Advanced Topics
- Pandas Performance Tips - Official performance optimization guide
- Why Pandas Uses So Much Memory - Understanding memory usage patterns
- Apache Arrow and Pandas - Next-generation columnar data formats

