Getting Started with Pandas in Python: A Beginner’s Tutorial

Pandas is a powerful Python library for data analysis and data manipulation.

It provides data structures and tools that allow you to quickly explore, clean, and analyze data.

The two primary data structures in Pandas are the Series (one-dimensional, like a single labeled column) and the DataFrame (two-dimensional, like a table with labeled rows and columns).

In this tutorial, we will cover:

  1. Installing Pandas
  2. Basic Pandas Objects (Series and DataFrame)
  3. Creating DataFrames and Series
  4. Basic Data Exploration
  5. Data Selection and Filtering
  6. Modifying Data in DataFrames
  7. Descriptive Statistics

Let's dive in and explore Pandas step-by-step!

1. Installing Pandas

To install Pandas, use the following command:

pip install pandas

Once installed, you can import it in your script or Jupyter Notebook as follows:

import pandas as pd
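To confirm the installation worked, it can help to print the installed version. The snippet below is a minimal sketch; the variable name major_version is only for illustration, and the version printed depends on your environment.

```python
import pandas as pd

# pd.__version__ is a string such as "2.2.1"; the exact value depends on your install
print(pd.__version__)

# The major version is worth knowing: some older methods were removed in pandas 2.0
major_version = int(pd.__version__.split(".")[0])
print("pandas major version:", major_version)
```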

2. Basic Pandas Objects: Series and DataFrame

Series

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, etc.). It is similar to a single column in a table or Excel sheet.

import pandas as pd

# Creating a Series
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)

Output:

0    10
1    20
2    30
3    40
dtype: int64
  • Explanation: Because no index was specified, Pandas assigns a default integer index starting from 0.
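The index does not have to be the default integers. As a small sketch (the labels "a" to "d" and the name labeled are chosen here for illustration), you can pass your own labels and then access values either by label or by position:

```python
import pandas as pd

# Same values as before, but with explicit string labels instead of 0..3
labeled = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

print(labeled["b"])         # access by label -> 20
print(labeled.iloc[0])      # access by position -> 10
print(list(labeled.index))  # ['a', 'b', 'c', 'd']
```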

DataFrame

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous data structure with labeled axes (rows and columns). It is similar to a spreadsheet or SQL table.

# Creating a DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
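A few attributes make the "two-dimensional, labeled" description concrete. The sketch below rebuilds the same DataFrame so it can run on its own and inspects its shape, column labels, and the type of a single column:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

print(df.shape)          # (rows, columns) -> (3, 3)
print(list(df.columns))  # ['Name', 'Age', 'City']
print(type(df["Age"]))   # each column is a Series
```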

3. Creating DataFrames and Series

You can create Pandas DataFrames and Series from lists, dictionaries, and even external data sources like CSV files.

Creating a Series from a Dictionary

# Creating a Series from a dictionary
data = {"A": 1, "B": 2, "C": 3}
series = pd.Series(data)
print(series)

Output:

A    1
B    2
C    3
dtype: int64

Creating a DataFrame from a List of Dictionaries

# Creating a DataFrame from a list of dictionaries
data = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
    {"Name": "Charlie", "Age": 35}
]
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
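The dictionaries in the list do not all need the same keys. In the sketch below (the names records and df_records are illustrative), the second record has no "Age" key, so Pandas fills that cell with a missing value (NaN):

```python
import pandas as pd

records = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob"},  # no "Age" key in this record
]
df_records = pd.DataFrame(records)

print(df_records)
print(df_records["Age"].isna().tolist())  # [False, True]
```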

Loading Data from a CSV File

You can also load data directly from a CSV file:

df = pd.read_csv("sample.csv")
print(df.head())  # Display the first five rows
  • Explanation: read_csv() reads the CSV file into a DataFrame, and head() displays the first five rows by default (pass a number, such as head(10), to change this).
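If you do not have a CSV file at hand, you can try read_csv() on in-memory text. This is only a sketch: the CSV content below is made up to stand in for a file such as sample.csv, and the names csv_text and df_csv are illustrative.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.head(2))  # first two rows only
print(df_csv.shape)    # (3, 3)
```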

4. Basic Data Exploration

Once you load data into a DataFrame, it’s useful to explore it before diving into analysis.

Example 1: Displaying the Top Rows

# Display the first few rows of the DataFrame
print(df.head())

Example 2: Viewing Data Types

# Get the data types of each column
print(df.dtypes)

Example 3: Displaying Basic Information

# Display basic info about the DataFrame
# info() prints its report directly and returns None, so no print() is needed
df.info()

Example 4: Getting Summary Statistics

# Display summary statistics for numerical columns
print(df.describe())
  • Explanation: describe() reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum for each numeric column. The 50% row is the median.
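The result of describe() is itself a DataFrame, so individual statistics can be looked up by row label. A minimal sketch, using a small DataFrame built here so the example runs on its own:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

stats = df.describe()            # only the numeric "Age" column is summarized
print(stats.loc["mean", "Age"])  # 30.0
print(stats.loc["50%", "Age"])   # 30.0, the median
print(list(stats.index))         # the row labels of the summary
```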

5. Data Selection and Filtering

Pandas offers several ways to select and filter data in DataFrames. The examples in this section assume df is the Name, Age, and City DataFrame created in Section 2.

Selecting Columns

You can select a single column by specifying its name in square brackets:

# Select a single column
print(df["Name"])

To select multiple columns, provide a list of column names:

# Select multiple columns
print(df[["Name", "Age"]])
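The two forms return different types, which matters when you chain further operations. As a short sketch (the names single and as_frame are illustrative), single brackets give a Series while a list of names gives a DataFrame, even for one column:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

single = df["Name"]      # single brackets -> Series
as_frame = df[["Name"]]  # list of names -> one-column DataFrame

print(type(single).__name__)    # Series
print(type(as_frame).__name__)  # DataFrame
print(as_frame.shape)           # (3, 1)
```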

Selecting Rows by Position with iloc

Use iloc for position-based selection: iloc[0] is the first row, regardless of how the rows are labeled.

# Select the first row
print(df.iloc[0])
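For comparison, loc selects by label rather than by position. The sketch below assumes a DataFrame with string row labels ("x", "y", "z", chosen purely for illustration) so the difference is visible:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=["x", "y", "z"],  # hypothetical string labels for illustration
)

print(df.iloc[0]["Name"])  # first row by position -> Alice
print(df.loc["y", "Age"])  # row labeled "y" -> 30
print(df.iloc[0:2])        # positions 0 and 1 (end is exclusive)
print(df.loc["x":"y"])     # labels "x" through "y" (end is inclusive)
```

Note the slicing difference: iloc follows normal Python slicing, while a label slice with loc includes its end label.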

Selecting Rows by Condition

You can filter rows based on conditions:

# Filter rows where Age is greater than 30
filtered_df = df[df["Age"] > 30]
print(filtered_df)

Output:

      Name  Age         City
2  Charlie   35      Chicago
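Conditions can also be combined. A minimal sketch, using the same three-row data (the variable names are illustrative): wrap each condition in parentheses and join them with & (and) or | (or), or use isin() to match against a list of values.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Each condition needs its own parentheses; & means "and", | means "or"
older_not_chicago = df[(df["Age"] >= 30) & (df["City"] != "Chicago")]
print(older_not_chicago["Name"].tolist())  # ['Bob']

# isin() keeps rows whose value appears in the given list
in_cities = df[df["City"].isin(["New York", "Chicago"])]
print(in_cities["Name"].tolist())  # ['Alice', 'Charlie']
```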

6. Modifying Data in DataFrames

DataFrames are mutable, which means you can modify values within them. This includes updating, adding, and removing columns and rows.

Adding a New Column

# Add a new column
df["Salary"] = [50000, 60000, 70000]
print(df)

Output:

      Name  Age         City  Salary
0    Alice   25     New York   50000
1      Bob   30  Los Angeles   60000
2  Charlie   35      Chicago   70000

Updating Values in a Column

# Update values in the "Salary" column
df["Salary"] = df["Salary"] + 5000
print(df)

Output:

      Name  Age         City  Salary
0    Alice   25     New York   55000
1      Bob   30  Los Angeles   65000
2  Charlie   35      Chicago   75000
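You can also update only some rows by combining a condition with loc. The sketch below is an illustration rather than the only way to do it; the 1000 raise and the name mask are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [55000, 65000, 75000],
})

# Boolean mask: True for rows where Age is at least 30
mask = df["Age"] >= 30

# Add 1000 to Salary only in the rows selected by the mask
df.loc[mask, "Salary"] = df.loc[mask, "Salary"] + 1000

print(df["Salary"].tolist())  # [55000, 66000, 76000]
```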

Deleting a Column

# Delete a column
df.drop(columns=["Salary"], inplace=True)
print(df)

Output:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
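Without inplace=True, drop() leaves the original DataFrame untouched and returns a new one. The same method removes rows when given index labels. A short sketch with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [55000, 65000, 75000],
})

without_salary = df.drop(columns=["Salary"])  # df itself is unchanged
without_first = df.drop(index=[0])            # drop the row labeled 0

print(list(without_salary.columns))  # ['Name', 'Age']
print(list(df.columns))              # ['Name', 'Age', 'Salary']
print(without_first.index.tolist())  # [1, 2]
```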

Adding a New Row

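DataFrame.append() was deprecated in Pandas 1.4 and removed in Pandas 2.0, so use pd.concat() to add a new row. Wrap the new row in a one-row DataFrame and concatenate it with the existing one.

# Add a new row
new_row = {"Name": "David", "Age": 28, "City": "San Francisco"}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)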
print(df)

Output:

      Name  Age           City
0    Alice   25       New York
1      Bob   30    Los Angeles
2  Charlie   35        Chicago
3    David   28  San Francisco
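For a single row, another option is to assign to a new label with loc. This sketch assumes the DataFrame still has its default 0, 1, 2, ... index, so that len(df) is the next unused label; with a custom index you would pick the label yourself.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# With the default integer index, len(df) is the next unused row label
df.loc[len(df)] = ["David", 28, "San Francisco"]

print(df)
```

Values in the list must be in the same order as the columns.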

7. Descriptive Statistics

Pandas provides many functions for calculating descriptive statistics on DataFrames.

Example 1: Mean, Median, and Sum

# Calculate the mean, median, and sum of the "Age" column
mean_age = df["Age"].mean()
median_age = df["Age"].median()
sum_age = df["Age"].sum()

print("Mean Age:", mean_age)
print("Median Age:", median_age)
print("Sum of Ages:", sum_age)
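The same three statistics can be computed in one call with agg(). The sketch below uses a standalone Series holding the four ages from the example (the names ages and summary are illustrative):

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 28], name="Age")

# agg() accepts a list of function names and returns one value per name
summary = ages.agg(["mean", "median", "sum"])
print(summary)

print(summary["mean"])    # 29.5
print(summary["median"])  # 29.0
```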

Example 2: Grouping Data with groupby()

You can use groupby() to group data based on a specific column and perform calculations.

# Calculate the mean age by City
grouped_df = df.groupby("City")["Age"].mean()
print(grouped_df)

Output:

City
Chicago          35.0
Los Angeles      30.0
New York         25.0
San Francisco    28.0
Name: Age, dtype: float64
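Grouping is more informative when a group contains several rows. The following sketch uses made-up data in which cities repeat, and computes two statistics per group with agg(); the rows and the name by_city are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with repeated cities
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 30, 35, 28, 40],
    "City": ["New York", "Chicago", "Chicago", "New York", "Chicago"],
})

# One row per city, one column per statistic
by_city = df.groupby("City")["Age"].agg(["mean", "count"])
print(by_city)

print(by_city.loc["Chicago", "mean"])   # 35.0
print(by_city.loc["New York", "mean"])  # 26.5
```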

Example 3: Value Counts

To count occurrences of unique values in a column, use value_counts().

# Count occurrences of each city
city_counts = df["City"].value_counts()
print(city_counts)

Output:

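City
New York         1
Los Angeles      1
Chicago          1
San Francisco    1
Name: count, dtype: int64
  • Note: This is the output format in Pandas 2.0 and later. Older versions omit the "City" header line and end with "Name: City, dtype: int64".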
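value_counts() can also return proportions instead of raw counts by passing normalize=True. A minimal sketch with made-up data in which one city repeats:

```python
import pandas as pd

# Hypothetical data: "Chicago" appears three times out of four
cities = pd.Series(["Chicago", "New York", "Chicago", "Chicago"], name="City")

counts = cities.value_counts()
shares = cities.value_counts(normalize=True)  # fractions that sum to 1

print(counts["Chicago"])   # 3
print(shares["Chicago"])   # 0.75
print(shares["New York"])  # 0.25
```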

Summary of Key Concepts in Pandas

Concept                  Description
Series                   A 1D labeled array, similar to a single column in Excel.
DataFrame                A 2D labeled data structure, like a table or spreadsheet.
Basic Exploration        Use head(), info(), and describe() to understand data at a glance.
Selection and Filtering  Select data with column names, iloc, and filtering conditions.
Modifying Data           Add, update, or delete rows and columns within the DataFrame.
Descriptive Statistics   Use functions like mean(), sum(), and groupby() for statistical analysis.

Conclusion

In this tutorial, we explored the basics of Pandas in Python, covering:

  • Installing Pandas and creating basic data structures (Series and DataFrames).
  • Loading and exploring data, modifying values, and filtering data.
  • Performing statistical operations and grouping data for analysis.
