Pandas is a powerful Python library for data analysis and data manipulation.
It provides data structures and tools that allow you to quickly explore, clean, and analyze data.
The two primary data structures in Pandas are Series (1D) and DataFrame (2D), both of which are highly versatile for handling and manipulating data.
In this tutorial, we will cover:
- Installing Pandas
- Basic Pandas Objects (Series and DataFrame)
- Creating DataFrames and Series
- Basic Data Exploration
- Data Selection and Filtering
- Modifying Data in DataFrames
- Descriptive Statistics
Let's dive in and explore Pandas step-by-step!
1. Installing Pandas
To install Pandas, use the following command:
pip install pandas
Once installed, you can import it in your script or Jupyter Notebook as follows:
import pandas as pd
2. Basic Pandas Objects: Series and DataFrame
Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, etc.). It is similar to a single column in a table or Excel sheet.
import pandas as pd

# Creating a Series
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)
Output:
0    10
1    20
2    30
3    40
dtype: int64
- Each value has a default index starting from 0.
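The default index can be replaced with labels of your own. The following is a minimal sketch, assuming hypothetical labels "a" through "d" that are not part of the original example:

```python
import pandas as pd

# Hypothetical labels chosen only for illustration
scores = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = scores["b"]         # look up by index label
by_position = scores.iloc[1]   # look up the same element by position

print(by_label, by_position)
```

Both lookups return the same element here, because label "b" sits at position 1.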
DataFrame
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous data structure with labeled axes (rows and columns). It is similar to a spreadsheet or SQL table.
# Creating a DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3. Creating DataFrames and Series
You can create Pandas DataFrames and Series from lists, dictionaries, and even external data sources like CSV files.
Creating a Series from a Dictionary
# Creating a Series from a dictionary
data = {"A": 1, "B": 2, "C": 3}
series = pd.Series(data)
print(series)
Output:
A    1
B    2
C    3
dtype: int64
Creating a DataFrame from a List of Dictionaries
# Creating a DataFrame from a list of dictionaries
data = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
    {"Name": "Charlie", "Age": 35}
]
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
Loading Data from a CSV File
You can also load data directly from a CSV file:
df = pd.read_csv("sample.csv")
print(df.head())  # Display the first few rows
- Explanation: read_csv() reads the CSV file into a DataFrame, and head() displays the first few rows.
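Since sample.csv is not included with this tutorial, the sketch below reads the same kind of data from an in-memory string instead. The CSV content is a hypothetical stand-in, and io.StringIO is used only so the example runs without a file on disk:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for sample.csv
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

# read_csv accepts a file path or any file-like object
people = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; without an argument it returns 5
print(people.head(2))
```

The first line of the CSV becomes the column names, and Pandas infers a data type for each column.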
4. Basic Data Exploration
Once you load data into a DataFrame, it’s useful to explore it before diving into analysis.
Example 1: Displaying the Top Rows
# Display the first few rows of the DataFrame
print(df.head())
Example 2: Viewing Data Types
# Get the data types of each column
print(df.dtypes)
Example 3: Displaying Basic Information
# Display basic info about the DataFrame
# info() prints its report directly and returns None, so no print() is needed
df.info()
Example 4: Getting Summary Statistics
# Display summary statistics for numerical columns
print(df.describe())
- Explanation: describe() reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum of each numeric column. The 50% row is the median.
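To see how the rows of describe() relate to the individual statistics, here is a small sketch on a hypothetical one-column DataFrame (the data is made up for illustration):

```python
import pandas as pd

# Hypothetical data for illustration
ages = pd.DataFrame({"Age": [25, 30, 35]})

summary = ages.describe()
print(summary)

# The "50%" row of describe() matches median()
median_from_describe = summary.loc["50%", "Age"]
print(median_from_describe == ages["Age"].median())
```

Because describe() returns a DataFrame, individual statistics can be read out of it with loc, as the last lines do.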
5. Data Selection and Filtering
Pandas offers various ways to select and filter data in DataFrames.
Selecting Columns
You can select a single column by specifying its name in square brackets:
# Select a single column
print(df["Name"])
To select multiple columns, provide a list of column names:
# Select multiple columns
print(df[["Name", "Age"]])
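The two bracket forms return different types: a single name gives a Series, while a list of names gives a DataFrame, even if the list has only one entry. A short sketch, assuming a hypothetical two-row table:

```python
import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

name_series = people["Name"]    # single brackets: a Series
name_frame = people[["Name"]]   # double brackets: a one-column DataFrame

print(type(name_series).__name__)
print(type(name_frame).__name__)
```

This distinction matters when later code expects DataFrame methods or a two-dimensional shape.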
Selecting Rows by Index with iloc
Use iloc for position-based selection. It takes integer positions starting at 0, regardless of the index labels.
# Select the first row
print(df.iloc[0])
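iloc also accepts slices, negative positions, and a row and column position together. The sketch below uses a hypothetical three-row table to show each form:

```python
import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

first_two = people.iloc[0:2]    # rows at positions 0 and 1 (end is exclusive)
last_row = people.iloc[-1]      # last row, returned as a Series
bob_age = people.iloc[1, 1]     # row position 1, column position 1

print(first_two)
print(last_row["Name"], bob_age)
```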
Selecting Rows by Condition
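You can filter rows by placing a Boolean condition inside the square brackets. The output below assumes df is the three-row DataFrame with Name, Age, and City columns from Section 2:
# Filter rows where Age is greater than 30
filtered_df = df[df["Age"] > 30]
print(filtered_df)
Output:
      Name  Age     City
2  Charlie   35  Chicago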
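Conditions can be combined with & (and) and | (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A minimal sketch, rebuilding the three-row table so that it runs on its own:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Keep rows where Age is over 25 AND City is not Chicago
over_25_not_chicago = people[(people["Age"] > 25) & (people["City"] != "Chicago")]
print(over_25_not_chicago)
```

Python's plain `and` and `or` keywords do not work here; they raise an error when applied to a whole Series.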
6. Modifying Data in DataFrames
DataFrames are mutable, which means you can modify values within them. This includes updating, adding, and removing columns and rows.
Adding a New Column
# Add a new column
df["Salary"] = [50000, 60000, 70000]
print(df)
Output:
      Name  Age         City  Salary
0    Alice   25     New York   50000
1      Bob   30  Los Angeles   60000
2  Charlie   35      Chicago   70000
Updating Values in a Column
# Update values in the "Salary" column
df["Salary"] = df["Salary"] + 5000
print(df)
Output:
      Name  Age         City  Salary
0    Alice   25     New York   55000
1      Bob   30  Los Angeles   65000
2  Charlie   35      Chicago   75000
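An update can also be limited to the rows that match a condition by combining a Boolean condition with loc. The sketch below assumes a hypothetical rule (a raise only for people aged 30 or over) that is not part of the original example:

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
})

# Hypothetical rule: add 5000 to Salary only where Age is at least 30
staff.loc[staff["Age"] >= 30, "Salary"] += 5000
print(staff)
```

Using loc with both a row condition and a column name updates the DataFrame in place and leaves the non-matching rows untouched.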
Deleting a Column
# Delete a column
df.drop(columns=["Salary"], inplace=True)
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
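Without inplace=True, drop() leaves the original DataFrame unchanged and returns a new one. A short sketch with a hypothetical two-column table:

```python
import pandas as pd

# Hypothetical data for illustration
staff = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Salary": [50000, 60000],
})

# drop() returns a new DataFrame; staff itself is not modified
without_salary = staff.drop(columns=["Salary"])

print(list(without_salary.columns))
print(list(staff.columns))
```

Assigning the result to a new name keeps the original available, which can be convenient while exploring data.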
Adding a New Row
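DataFrame.append() was removed in pandas 2.0. To add a new row, wrap it in a one-row DataFrame and combine the two with pd.concat().
# Add a new row
new_row = {"Name": "David", "Age": 28, "City": "San Francisco"}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print(df)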
Output:
      Name  Age           City
0    Alice   25       New York
1      Bob   30    Los Angeles
2  Charlie   35        Chicago
3    David   28  San Francisco
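For a single row, label-based assignment with loc is another option. The sketch below assumes the DataFrame still has its default integer index (0, 1, 2, ...), so that len() gives the next unused label:

```python
import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Assumes the default 0..n-1 index, so len(people) is the next free label
people.loc[len(people)] = ["Charlie", 35]
print(people)
```

The values must be listed in column order. If the index is not the default one, this pattern may overwrite an existing row instead of adding a new one.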
7. Descriptive Statistics
Pandas provides many functions for calculating descriptive statistics on DataFrames.
Example 1: Mean, Median, and Sum
# Calculate the mean, median, and sum of the "Age" column
mean_age = df["Age"].mean()
median_age = df["Age"].median()
sum_age = df["Age"].sum()
print("Mean Age:", mean_age)
print("Median Age:", median_age)
print("Sum of Ages:", sum_age)
Example 2: Grouping Data with groupby()
You can use groupby() to group data based on a specific column and perform calculations.
# Calculate the mean age by City
grouped_df = df.groupby("City")["Age"].mean()
print(grouped_df)
Output:
City
Chicago          35.0
Los Angeles      30.0
New York         25.0
San Francisco    28.0
Name: Age, dtype: float64
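Several statistics can be computed per group in one call with agg(). The sketch below uses hypothetical data in which one city appears twice, so the grouping has a visible effect:

```python
import pandas as pd

# Hypothetical data: Chicago appears twice
people = pd.DataFrame({
    "City": ["New York", "Chicago", "Chicago"],
    "Age": [25, 35, 45],
})

# One row per city, one column per statistic
city_stats = people.groupby("City")["Age"].agg(["mean", "min", "max", "count"])
print(city_stats)
```

The result is a DataFrame indexed by City, so a single value can be read with, for example, city_stats.loc["Chicago", "mean"].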
Example 3: Value Counts
To count occurrences of unique values in a column, use value_counts().
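# Count occurrences of each city
city_counts = df["City"].value_counts()
print(city_counts)
Output:
New York         1
Los Angeles      1
Chicago          1
San Francisco    1
Name: City, dtype: int64
In pandas 2.0 and later, the same call prints a City header line above the values and ends with Name: count instead of Name: City.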
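value_counts() can also return proportions instead of raw counts by passing normalize=True. A minimal sketch on a hypothetical column with repeated values:

```python
import pandas as pd

# Hypothetical data with a repeated city
cities = pd.Series(["Chicago", "Chicago", "New York", "Chicago"], name="City")

counts = cities.value_counts()                # raw counts, most frequent first
shares = cities.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```

Here Chicago accounts for three of the four entries, so its share is 0.75.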
Summary of Key Concepts in Pandas
Concept | Description |
---|---|
Series | A 1D labeled array, similar to a single column in Excel. |
DataFrame | A 2D labeled data structure, like a table or spreadsheet. |
Basic Exploration | Use head(), info(), describe() to understand data at a glance. |
Selection and Filtering | Select data with column names, iloc, and filtering conditions. |
Modifying Data | Add, update, or delete rows and columns within the DataFrame. |
Descriptive Statistics | Use functions like mean(), sum(), and groupby() for statistical analysis. |
Conclusion
In this tutorial, we explored the basics of Pandas in Python, covering:
- Installing Pandas and creating basic data structures (Series and DataFrames).
- Loading and exploring data, modifying values, and filtering data.
- Performing statistical operations and grouping data for analysis.