Home » Python Pandas: Reading CSV Files with read_csv()

Python Pandas: Reading CSV Files with read_csv()

Java SE 11 Programmer II [1Z0-816] Practice Tests
Java SE 11 Developer (Upgrade) [1Z0-817]
Oracle Java Certification
Java SE 11 Programmer I [1Z0-815] Practice Tests
Spring Framework Basics Video Course
1 Year Subscription

The read_csv() function in Pandas is one of the most powerful and flexible functions for reading data from CSV files.

CSV (Comma-Separated Values) files are one of the most commonly used formats for storing and exchanging data due to their simplicity and compatibility with many software systems.

In this tutorial, we will cover:

  1. Basic CSV Reading
  2. Specifying Delimiters
  3. Setting Headers and Column Names
  4. Selecting Columns to Load
  5. Handling Missing Data
  6. Parsing Dates
  7. Loading Large CSV Files

Let’s explore each with code examples.

1. Basic CSV Reading

The simplest way to read a CSV file is to provide the file path to read_csv(). Pandas will automatically read the CSV and load it into a DataFrame.

Example 1: Basic CSV Reading

import pandas as pd

# Load a CSV file into a DataFrame
df = pd.read_csv("sample.csv")
print(df.head())  # Display the first few rows
  • Explanation: This loads the file sample.csv into a DataFrame df. The head() function displays the first few rows.

2. Specifying Delimiters

Not all CSV files use a comma as the delimiter. Some use tabs, semicolons, or other characters. You can specify the delimiter using the sep parameter.

Example 2: Reading a File with a Different Delimiter

# Load a CSV file with a semicolon delimiter
df = pd.read_csv("sample_semicolon.csv", sep=";")
print(df.head())
  • Explanation: Here, sep=”;” specifies that semicolons separate the columns.

3. Setting Headers and Column Names

By default, Pandas uses the first row of the CSV as the header. You can change this behavior by setting header=None to remove headers or by specifying column names manually.

Example 3: CSV without Headers

# Load a CSV file without headers
df = pd.read_csv("sample_no_header.csv", header=None)
print(df.head())
  • Explanation: Setting header=None tells Pandas that the CSV file doesn’t have a header row, so it automatically assigns numeric column names.

Example 4: Specifying Column Names

# Load a CSV and assign custom column names
df = pd.read_csv("sample_no_header.csv", header=None, names=["Column1", "Column2", "Column3"])
print(df.head())
  • Explanation: By passing a list to names, you define custom column names.

4. Selecting Columns to Load

If you only need specific columns from the CSV, you can specify them using the usecols parameter to save memory and speed up loading.

Example 5: Selecting Specific Columns

# Load only specific columns
df = pd.read_csv("sample.csv", usecols=["Column1", "Column2"])
print(df.head())
  • Explanation: Only the columns “Column1” and “Column2” are loaded into the DataFrame.

5. Handling Missing Data

You may encounter missing data in a CSV file, represented by empty cells or specific placeholders like “N/A”, “NULL”, or “-“. Pandas allows you to handle missing data directly while reading.

Example 6: Specifying Missing Data Values

# Treat specific values as NaN (missing data)
df = pd.read_csv("sample_with_missing.csv", na_values=["N/A", "NULL", "-"])
print(df.head())
  • Explanation: The na_values parameter replaces specified values with NaN (Not a Number), indicating missing data.

Example 7: Filling Missing Data with Default Values

# Fill missing data with a default value while reading
df = pd.read_csv("sample_with_missing.csv").fillna(0)
print(df.head())
  • Explanation: Using fillna(0) fills all missing values with 0 immediately after loading the CSV.

6. Parsing Dates

If your CSV file contains dates, you can instruct Pandas to parse these columns as datetime objects. This enables easy manipulation of dates for analysis and filtering.

Example 8: Parsing Dates in CSV

# Load a CSV and parse the "Date" column as datetime
df = pd.read_csv("sample_with_dates.csv", parse_dates=["Date"])
print(df.info())
  • Explanation: parse_dates=[“Date”] automatically converts the “Date” column to a datetime object.

Example 9: Combining Multiple Columns into a Single Date Column

# Combine "Year", "Month", and "Day" columns into a single date column
df = pd.read_csv("sample_with_year_month_day.csv", parse_dates=[["Year", "Month", "Day"]])
print(df.head())
  • Explanation: parse_dates=[[“Year”, “Month”, “Day”]] combines columns “Year”, “Month”, and “Day” into a single datetime column.

7. Loading Large CSV Files

For very large files, loading the entire CSV into memory might not be feasible. You can handle large files more efficiently by reading data in chunks, limiting the number of rows, or by reducing the memory usage of each data type.

Example 10: Reading Data in Chunks

# Read data in chunks of 1000 rows
chunk_iter = pd.read_csv("large_file.csv", chunksize=1000)

# Process each chunk (example: count rows)
total_rows = 0
for chunk in chunk_iter:
    total_rows += len(chunk)
print("Total rows:", total_rows)
  • Explanation: chunksize=1000 reads the file in chunks of 1000 rows, allowing you to process each chunk separately. Here, we count the total rows without loading the entire file into memory.

Example 11: Limiting Rows and Specifying Data Types

# Limit rows and set data types for efficiency
df = pd.read_csv("large_file.csv", nrows=5000, dtype={"Column1": "int32", "Column2": "float32"})
print(df.info())
  • Explanation: nrows=5000 limits the load to the first 5000 rows, while dtype specifies more memory-efficient data types for columns.

Summary of Key read_csv() Parameters

Parameter Description
filepath_or_buffer Path or URL to the CSV file.
sep Specifies the column delimiter (default is ,).
header Specifies the row to use as column names, or None if the file has no header row.
names Defines custom column names.
usecols Specifies a subset of columns to load.
na_values Defines custom values to treat as missing (NaN).
parse_dates Parses columns as datetime objects.
chunksize Reads the file in chunks of specified row count, useful for large files.
nrows Limits the number of rows to read.
dtype Specifies data types for columns, which can reduce memory usage for large datasets.

Example Summary: Full Workflow with read_csv()

To demonstrate multiple features in one workflow, here’s a comprehensive example:

# Comprehensive CSV reading example
df = pd.read_csv(
    "sample.csv",
    sep=",",
    header=0,
    names=["ID", "Name", "Age", "City", "Salary", "Date"],
    usecols=["ID", "Name", "Salary", "Date"],
    na_values=["N/A", "NULL"],
    parse_dates=["Date"],
    dtype={"ID": "int32", "Salary": "float32"}
)

# Display the first few rows and column data types
print(df.head())
print(df.dtypes)

Explanation:

  • This example reads sample.csv with specific columns, custom missing values, date parsing, and optimized data types.

Conclusion

In this tutorial, we covered the read_csv() function in Pandas, with examples for:

  • Basic CSV reading.
  • Specifying delimiters, headers, and custom column names.
  • Selecting specific columns and handling missing data.
  • Parsing dates and reading large files efficiently.

You may also like

Leave a Comment

This website uses cookies to improve your experience. We'll assume you're ok with this, but you can opt-out if you wish. Accept Read More