In this article we will look at using the Pandas read_csv() function to read a CSV file into a DataFrame.
Syntax
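The following is the syntax of the read_csv() function. This signature comes from a pandas 1.x release; pandas 2.0 removed several of these parameters, including squeeze, prefix, error_bad_lines and warn_bad_lines.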
pandas.read_csv(filepath_or_buffer, sep=NoDefault.no_default, delimiter=None, header='infer', names=NoDefault.no_default, index_col=None, usecols=None, squeeze=None, prefix=NoDefault.no_default, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, error_bad_lines=None, warn_bad_lines=None, on_bad_lines=None, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None)
The list of parameters can be bewildering, and many of them you will never use.
Read a CSV file into DataFrame
In this example we will load a population CSV file using the read_csv() function.
# Import pandas
import pandas as pd

# Read CSV file into DataFrame
df = pd.read_csv('population.csv')
print(df)
When you run this you will see the following output (I use Visual Studio Code):
Rank CCA3 Country/Territory Capital Continent 2022 Population Area (km²) Density (per km²) Growth Rate World Population Percentage
0 36 AFG Afghanistan Kabul Asia 41128771 652230 63.0587 1.0257 0.52
1 138 ALB Albania Tirana Europe 2842321 28748 98.8702 0.9957 0.04
2 34 DZA Algeria Algiers Africa 44903225 2381741 18.8531 1.0164 0.56
3 213 ASM American Samoa Pago Pago Oceania 44273 199 222.4774 0.9831 0.00
4 203 AND Andorra Andorra la Vella Europe 79824 468 170.5641 1.0100 0.00
.. ... ... ... ... ... ... ... ... ... ...
229 226 WLF Wallis and Futuna Mata-Utu Oceania 11572 142 81.4930 0.9953 0.00
230 172 ESH Western Sahara El Aaiún Africa 575986 266000 2.1654 1.0184 0.01
231 46 YEM Yemen Sanaa Asia 33696614 527968 63.8232 1.0217 0.42
232 63 ZMB Zambia Lusaka Africa 20017675 752612 26.5976 1.0280 0.25
233 74 ZWE Zimbabwe Harare Africa 16320537 390757 41.7665 1.0204 0.20
By default, read_csv() reads the first row of the CSV file as the column names and creates an incremental integer index that starts from zero.
You can use either the sep or the delimiter parameter to specify the column separator. The default is a comma, which is what the sample file uses.
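As a minimal sketch of the sep parameter, the snippet below reads semicolon-separated text. The inline data is an assumption made for illustration: it is a three-row excerpt of the sample output above, rewritten with semicolons and loaded through io.StringIO so that no extra file is needed.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated excerpt (not the real population.csv)
raw = (
    "Rank;CCA3;Capital\n"
    "36;AFG;Kabul\n"
    "138;ALB;Tirana\n"
    "34;DZA;Algiers\n"
)

# sep tells read_csv which character separates the columns
df_semicolon = pd.read_csv(io.StringIO(raw), sep=";")
print(df_semicolon)

# With the default comma separator, each line ends up in a single column
df_default = pd.read_csv(io.StringIO(raw))
print(df_default.shape)  # (3, 1)
```

If a file loads as one wide column, a wrong separator is a likely cause.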
Set Column as Index
You can set a specific column as the index using the index_col parameter.
This parameter accepts an int, a str, a sequence of int / str, or False; it is optional and defaults to None.
I like the look of Rank, so let's use that.
import pandas as pd

# Read CSV file into DataFrame
df = pd.read_csv('population.csv', index_col='Rank')
print(df)
Run this and you will see the following output:
CCA3 Country/Territory Capital Continent 2022 Population Area (km²) Density (per km²) Growth Rate World Population Percentage
Rank
36 AFG Afghanistan Kabul Asia 41128771 652230 63.0587 1.0257 0.52
138 ALB Albania Tirana Europe 2842321 28748 98.8702 0.9957 0.04
34 DZA Algeria Algiers Africa 44903225 2381741 18.8531 1.0164 0.56
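index_col also accepts a column position or a list of columns. The sketch below illustrates both forms. It assumes a small inline excerpt of the sample data (four columns only, loaded through io.StringIO) rather than the full population.csv.

```python
import io

import pandas as pd

# Hypothetical four-column excerpt of the sample data
raw = (
    "Rank,CCA3,Continent,Capital\n"
    "36,AFG,Asia,Kabul\n"
    "138,ALB,Europe,Tirana\n"
    "34,DZA,Africa,Algiers\n"
)

# An integer selects the index column by position (0 is the first column, Rank)
by_position = pd.read_csv(io.StringIO(raw), index_col=0)
print(by_position.index.name)  # Rank

# A list of column names builds a MultiIndex from those columns
multi = pd.read_csv(io.StringIO(raw), index_col=["Continent", "CCA3"])
print(multi.loc[("Asia", "AFG"), "Capital"])  # Kabul
```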
Skip Rows
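Sometimes you may need to skip rows at the start or the end of the file. You can use the skiprows and skipfooter parameters for this. The example below passes skiprows=5 together with header=None, so the first five lines of the file (the header line and the first four records) are skipped and the columns are numbered.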
import pandas as pd

# Skip the first 5 lines of the file (the header line and four records);
# header=None numbers the columns instead of reading names from the file
df = pd.read_csv('population.csv', header=None, skiprows=5)
print(df)
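The skipfooter parameter is mentioned above but not shown, so here is a minimal sketch of it. The inline data and its trailing "Total rows" line are assumptions made for illustration. skipfooter is not supported by the default C parser, so the sketch asks for the Python engine explicitly.

```python
import io

import pandas as pd

# Hypothetical excerpt with a trailing summary line that is not a record
raw = (
    "Rank,CCA3,Capital\n"
    "36,AFG,Kabul\n"
    "138,ALB,Tirana\n"
    "34,DZA,Algiers\n"
    "Total rows: 3\n"
)

# skipfooter drops lines from the end of the file (Python engine only)
no_footer = pd.read_csv(io.StringIO(raw), skipfooter=1, engine="python")
print(no_footer)

# A list passed to skiprows skips those zero-based line numbers;
# line 0 (the header) is kept, lines 1 and 2 are dropped
skip_some = pd.read_csv(
    io.StringIO(raw), skiprows=[1, 2], skipfooter=1, engine="python"
)
print(skip_some)
```

Passing a list to skiprows is handy when the unwanted lines are not all at the top of the file.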
Ignore Column Names
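By default, the first row is used as the header and assigned as the DataFrame column names.
If you do not want the first row to be used as the header, use the header=None parameter, and use the names parameter to specify your own column names. With header=None the first row is read as a data record, so if the file does contain a header row (as the sample file does) also pass skiprows=1 to leave it out.
If you do not specify names, the columns are labelled with integers starting from 0.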
import pandas as pd

# One name per column in the file (10 columns)
columns = ['rank', 'code', 'country', 'capital', 'continent',
           'population', 'area', 'density', 'growth', 'percentage']

# Skip the original header row and apply our own column names
df = pd.read_csv('population.csv', header=None, names=columns, skiprows=1)
print(df)
Running this will produce the following output:
rank code country capital continent population area density growth percentage
0 36 AFG Afghanistan Kabul Asia 41128771 652230 63.0587 1.0257 0.52
1 138 ALB Albania Tirana Europe 2842321 28748 98.8702 0.9957 0.04
2 34 DZA Algeria Algiers Africa 44903225 2381741 18.8531 1.0164 0.56
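To see what happens when names is left out, and an alternative way to rename columns, here is a small sketch. It assumes a two-row inline excerpt of the sample data rather than the full file. The second call uses header=0 together with names, which replaces the existing header row instead of skipping it by hand.

```python
import io

import pandas as pd

# Hypothetical three-column excerpt of the sample data
raw = (
    "Rank,CCA3,Capital\n"
    "36,AFG,Kabul\n"
    "138,ALB,Tirana\n"
)

# header=None without names: the header line is read as data
# and the columns are labelled 0, 1, 2
numbered = pd.read_csv(io.StringIO(raw), header=None)
print(list(numbered.columns))  # [0, 1, 2]
print(len(numbered))  # 3

# header=0 with names: the first line is treated as the header
# and its labels are replaced by the supplied names
renamed = pd.read_csv(
    io.StringIO(raw), header=0, names=["rank", "code", "capital"]
)
print(renamed)
```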
Load only Certain Columns
You can use the usecols parameter to select which columns to load from the CSV file.
It takes a list of column names (strings) or a list of column positions (integers).
import pandas as pd

# Read CSV file into DataFrame
df = pd.read_csv('population.csv', usecols=['Country/Territory', 'Capital', 'Continent'])
print(df)
When you run this you will see the following output:
Country/Territory Capital Continent
0 Afghanistan Kabul Asia
1 Albania Tirana Europe
2 Algeria Algiers Africa
3 American Samoa Pago Pago Oceania
4 Andorra Andorra la Vella Europe
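The integer form of usecols can be sketched as follows. It assumes a two-row, five-column inline excerpt of the sample data. The second call illustrates that the order of the items in usecols is ignored: the resulting columns follow the order in the file.

```python
import io

import pandas as pd

# Hypothetical five-column excerpt of the sample data
raw = (
    "Rank,CCA3,Country/Territory,Capital,Continent\n"
    "36,AFG,Afghanistan,Kabul,Asia\n"
    "138,ALB,Albania,Tirana,Europe\n"
)

# Integers select columns by zero-based position
by_position = pd.read_csv(io.StringIO(raw), usecols=[2, 3, 4])
print(list(by_position.columns))

# The order given in usecols does not change the column order of the result
reordered = pd.read_csv(io.StringIO(raw), usecols=["Continent", "Capital"])
print(list(reordered.columns))  # ['Capital', 'Continent']
```

To get the columns in a different order, reindex afterwards, for example with df[['Continent', 'Capital']].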
References
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
https://github.com/programmershelp/maxpython/tree/main/pandas/readcsv