The read_csv() function in Pandas is one of the most powerful and flexible functions for reading data from CSV files.
CSV (Comma-Separated Values) files are one of the most commonly used formats for storing and exchanging data due to their simplicity and compatibility with many software systems.
In this tutorial, we will cover:
- Basic CSV Reading
- Specifying Delimiters
- Setting Headers and Column Names
- Selecting Columns to Load
- Handling Missing Data
- Parsing Dates
- Loading Large CSV Files
Let’s explore each with code examples.
1. Basic CSV Reading
The simplest way to read a CSV file is to provide the file path to read_csv(). Pandas will automatically read the CSV and load it into a DataFrame.
Example 1: Basic CSV Reading
import pandas as pd # Load a CSV file into a DataFrame df = pd.read_csv("sample.csv") print(df.head()) # Display the first few rows
- Explanation: This loads the file sample.csv into a DataFrame df. The head() function displays the first few rows.
2. Specifying Delimiters
Not all CSV files use a comma as the delimiter. Some use tabs, semicolons, or other characters. You can specify the delimiter using the sep parameter.
Example 2: Reading a File with a Different Delimiter
# Load a CSV file with a semicolon delimiter df = pd.read_csv("sample_semicolon.csv", sep=";") print(df.head())
- Explanation: Here, sep=”;” specifies that semicolons separate the columns.
3. Setting Headers and Column Names
By default, Pandas uses the first row of the CSV as the header. You can change this behavior by setting header=None to remove headers or by specifying column names manually.
Example 3: CSV without Headers
# Load a CSV file without headers df = pd.read_csv("sample_no_header.csv", header=None) print(df.head())
- Explanation: Setting header=None tells Pandas that the CSV file doesn’t have a header row, so it automatically assigns numeric column names.
Example 4: Specifying Column Names
# Load a CSV and assign custom column names df = pd.read_csv("sample_no_header.csv", header=None, names=["Column1", "Column2", "Column3"]) print(df.head())
- Explanation: By passing a list to names, you define custom column names.
4. Selecting Columns to Load
If you only need specific columns from the CSV, you can specify them using the usecols parameter to save memory and speed up loading.
Example 5: Selecting Specific Columns
# Load only specific columns df = pd.read_csv("sample.csv", usecols=["Column1", "Column2"]) print(df.head())
- Explanation: Only the columns “Column1” and “Column2” are loaded into the DataFrame.
5. Handling Missing Data
You may encounter missing data in a CSV file, represented by empty cells or specific placeholders like “N/A”, “NULL”, or “-“. Pandas allows you to handle missing data directly while reading.
Example 6: Specifying Missing Data Values
# Treat specific values as NaN (missing data) df = pd.read_csv("sample_with_missing.csv", na_values=["N/A", "NULL", "-"]) print(df.head())
- Explanation: The na_values parameter replaces specified values with NaN (Not a Number), indicating missing data.
Example 7: Filling Missing Data with Default Values
# Fill missing data with a default value while reading df = pd.read_csv("sample_with_missing.csv").fillna(0) print(df.head())
- Explanation: Using fillna(0) fills all missing values with 0 immediately after loading the CSV.
6. Parsing Dates
If your CSV file contains dates, you can instruct Pandas to parse these columns as datetime objects. This enables easy manipulation of dates for analysis and filtering.
Example 8: Parsing Dates in CSV
# Load a CSV and parse the "Date" column as datetime df = pd.read_csv("sample_with_dates.csv", parse_dates=["Date"]) print(df.info())
- Explanation: parse_dates=[“Date”] automatically converts the “Date” column to a datetime object.
Example 9: Combining Multiple Columns into a Single Date Column
# Combine "Year", "Month", and "Day" columns into a single date column df = pd.read_csv("sample_with_year_month_day.csv", parse_dates=[["Year", "Month", "Day"]]) print(df.head())
- Explanation: parse_dates=[[“Year”, “Month”, “Day”]] combines columns “Year”, “Month”, and “Day” into a single datetime column.
7. Loading Large CSV Files
For very large files, loading the entire CSV into memory might not be feasible. You can handle large files more efficiently by reading data in chunks, limiting the number of rows, or by reducing the memory usage of each data type.
Example 10: Reading Data in Chunks
# Read data in chunks of 1000 rows chunk_iter = pd.read_csv("large_file.csv", chunksize=1000) # Process each chunk (example: count rows) total_rows = 0 for chunk in chunk_iter: total_rows += len(chunk) print("Total rows:", total_rows)
- Explanation: chunksize=1000 reads the file in chunks of 1000 rows, allowing you to process each chunk separately. Here, we count the total rows without loading the entire file into memory.
Example 11: Limiting Rows and Specifying Data Types
# Limit rows and set data types for efficiency df = pd.read_csv("large_file.csv", nrows=5000, dtype={"Column1": "int32", "Column2": "float32"}) print(df.info())
- Explanation: nrows=5000 limits the load to the first 5000 rows, while dtype specifies more memory-efficient data types for columns.
Summary of Key read_csv() Parameters
Parameter | Description |
---|---|
filepath_or_buffer | Path or URL to the CSV file. |
sep | Specifies the column delimiter (default is ,). |
header | Specifies the row to use as column names, or None if the file has no header row. |
names | Defines custom column names. |
usecols | Specifies a subset of columns to load. |
na_values | Defines custom values to treat as missing (NaN). |
parse_dates | Parses columns as datetime objects. |
chunksize | Reads the file in chunks of specified row count, useful for large files. |
nrows | Limits the number of rows to read. |
dtype | Specifies data types for columns, which can reduce memory usage for large datasets. |
Example Summary: Full Workflow with read_csv()
To demonstrate multiple features in one workflow, here’s a comprehensive example:
# Comprehensive CSV reading example df = pd.read_csv( "sample.csv", sep=",", header=0, names=["ID", "Name", "Age", "City", "Salary", "Date"], usecols=["ID", "Name", "Salary", "Date"], na_values=["N/A", "NULL"], parse_dates=["Date"], dtype={"ID": "int32", "Salary": "float32"} ) # Display the first few rows and column data types print(df.head()) print(df.dtypes)
Explanation:
- This example reads sample.csv with specific columns, custom missing values, date parsing, and optimized data types.
Conclusion
In this tutorial, we covered the read_csv() function in Pandas, with examples for:
- Basic CSV reading.
- Specifying delimiters, headers, and custom column names.
- Selecting specific columns and handling missing data.
- Parsing dates and reading large files efficiently.