Before you use pandas, install it on your PC using the method that matches your environment. If you are using Anaconda or Miniconda, run conda install pandas in the respective environment. If you use a regular Python installation, download and install pandas with pip using the command pip install pandas. If you have any difficulty installing pandas, check this link: installing pandas.
After installing it, import the pandas package into your program or interpreter.
>>> import pandas as pd
pd is just an alias for pandas, so you don't have to type the whole name each time; using it is standard practice.
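A quick way to confirm that the installation worked is to import pandas and print its version. This is only a minimal check; the version string you see depends on what is installed on your machine.

```python
import pandas as pd

# If the import succeeds, pandas is installed in this environment.
# __version__ holds the installed version as a string.
print(pd.__version__)
```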
pandas is a fast, powerful, flexible, and easy to use open-source data analysis and manipulation tool, built on top of the Python programming language.
We now know that pandas is a data manipulation tool, but the question is:
What kind of data does pandas handle?
When working with tabular data, such as data stored in spreadsheets or databases, Pandas is the right tool for you. Pandas will help you to explore, clean, and process your data. In Pandas, a data table is called a DataFrame.
Pandas DataFrame Representation
I want to store data about some movies. For each movie I know the name, the release year, and the director.
>>> import pandas as pd
>>> df = pd.DataFrame({"Name": ['The Godfather', 'The Dark Knight', 'Fight Club', 'The matrix'],
...                    "Year": [1972, 2008, 1999, 1999],
...                    "Director": ['Ford Coppola', 'Christopher Nolan', 'David Fincher', 'Lana and lilly']})
>>> df
              Name  Year           Director
0    The Godfather  1972       Ford Coppola
1  The Dark Knight  2008  Christopher Nolan
2       Fight Club  1999      David Fincher
3       The matrix  1999     Lana and lilly
>>> df['Year']
0    1972
1    2008
2    1999
3    1999
Name: Year, dtype: int64
>>> df["Name"]
0      The Godfather
1    The Dark Knight
2         Fight Club
3         The matrix
Name: Name, dtype: object
When you select a single column from a pandas DataFrame, the result is a Series. The bracket syntax for selecting a column is the same as the syntax for looking up a key in a Python dictionary.
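To make the difference between a Series and a DataFrame concrete, here is a small sketch. It rebuilds part of the movie data so that it runs on its own; the variable names years, years_df, and movies are only for illustration.

```python
import pandas as pd

# Part of the movie data, rebuilt so this snippet is self-contained.
df = pd.DataFrame({
    "Name": ["The Godfather", "The Dark Knight", "Fight Club", "The matrix"],
    "Year": [1972, 2008, 1999, 1999],
})

years = df["Year"]       # single brackets: one column, returned as a Series
years_df = df[["Year"]]  # double brackets: a one-column DataFrame

print(type(years).__name__)     # Series
print(type(years_df).__name__)  # DataFrame

# A plain dictionary uses the same bracket syntax for key lookup.
movies = {"Year": [1972, 2008, 1999, 1999]}
print(movies["Year"] == years.tolist())  # True
```

Passing a list of column names inside the brackets always gives back a DataFrame, even when the list has only one name.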
You can also create a single column, or Series, from scratch.
>>> ages = pd.Series([19, 16, 32, 29], name="age")
>>> ages
0    19
1    16
2    32
3    29
Name: age, dtype: int64
We stored the movies' information in df. To find the release year of the latest movie in the collection, we need to apply an operation to the DataFrame.
>>> df["Year"].max()
2008
The latest movie was released in 2008. Besides max(), pandas provides many other functions of this kind.
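As a rough sketch of a few of those other functions, the snippet below uses min(), mean(), and idxmax() on the same movie data. Using idxmax() together with .loc to fetch the whole row of the latest movie is one possible approach, not the only one.

```python
import pandas as pd

# The movie data again, rebuilt so this snippet is self-contained.
df = pd.DataFrame({
    "Name": ["The Godfather", "The Dark Knight", "Fight Club", "The matrix"],
    "Year": [1972, 2008, 1999, 1999],
    "Director": ["Ford Coppola", "Christopher Nolan", "David Fincher", "Lana and lilly"],
})

print(df["Year"].min())   # 1972, the earliest release year
print(df["Year"].mean())  # 1994.5, the average release year

# idxmax() returns the index label of the row with the largest Year;
# .loc[] then fetches that whole row as a Series.
latest = df.loc[df["Year"].idxmax()]
print(latest["Name"])     # The Dark Knight
```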
If you need basic statistics for the numerical columns of a DataFrame, use the describe() method: df.describe()
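Here is a minimal sketch of describe() on the movie data. Only the Year column is numerical, so by default it is the only column that appears in the summary.

```python
import pandas as pd

# The movie data again, rebuilt so this snippet is self-contained.
df = pd.DataFrame({
    "Name": ["The Godfather", "The Dark Knight", "Fight Club", "The matrix"],
    "Year": [1972, 2008, 1999, 1999],
})

# describe() returns a new DataFrame with count, mean, std, min,
# the quartiles, and max for each numerical column.
summary = df.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
print(summary.loc["max", "Year"])
```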
How to read and write data
I had a CSV file with data that I needed to read into a pandas DataFrame. pandas provides the read_csv() function for this. pandas also supports other formats such as SQL, Excel, and JSON; each has its own reader function with the prefix read_*.
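If you don't have a CSV file at hand, you can still try read_csv() on text held in memory. The sketch below assumes a tiny made-up CSV (the csv_text variable is hypothetical) and wraps it in io.StringIO so that pandas can read it like a file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """Name,Year,Director
The Godfather,1972,Ford Coppola
Fight Club,1999,David Fincher
"""

# read_csv() accepts a file path or any file-like object.
movies = pd.read_csv(io.StringIO(csv_text))

print(movies.shape)          # (2, 3): two rows, three columns
print(list(movies.columns))  # ['Name', 'Year', 'Director']
```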
>>> import pandas as pd
>>> titanic = pd.read_csv("titanic.csv")
>>> titanic
     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0              1         0       3  ...   7.2500   NaN         S
1              2         1       1  ...  71.2833   C85         C
2              3         1       3  ...   7.9250   NaN         S
3              4         1       1  ...  53.1000  C123         S
4              5         0       3  ...   8.0500   NaN         S
..           ...       ...     ...  ...      ...   ...       ...
886          887         0       2  ...  13.0000   NaN         S
887          888         1       1  ...  30.0000   B42         S
888          889         0       3  ...  23.4500   NaN         S
889          890         1       1  ...  30.0000  C148         C
890          891         0       3  ...   7.7500   NaN         Q

[891 rows x 12 columns]
>>> titanic.head(5)
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]
>>> titanic.tail(10)
     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
881          882         0       3  ...   7.8958   NaN         S
882          883         0       3  ...  10.5167   NaN         S
883          884         0       2  ...  10.5000   NaN         S
884          885         0       3  ...   7.0500   NaN         S
885          886         0       3  ...  29.1250   NaN         Q
886          887         0       2  ...  13.0000   NaN         S
887          888         1       1  ...  30.0000   B42         S
888          889         0       3  ...  23.4500   NaN         S
889          890         1       1  ...  30.0000  C148         C
890          891         0       3  ...   7.7500   NaN         Q

[10 rows x 12 columns]
>>> titanic.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
You can get the data type of every column (Series) in the DataFrame through the dtypes attribute.
You have read the data from a CSV file into a pandas DataFrame; now you may want to export it in a format that is useful to you. While the read_*() functions read data into pandas, the to_*() methods write data out of pandas. To know more about the reading and writing methods check this link: pandas_methods.
>>> titanic.to_excel('titanic.xlsx', sheet_name="passengers")
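Writing to Excel needs an extra Excel engine such as openpyxl to be installed, so the sketch below uses to_csv() instead to show the same write-then-read round trip. The file name movies.csv and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "Name": ["Fight Club", "The matrix"],
    "Year": [1999, 1999],
})

# Write to a temporary directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movies.csv")  # hypothetical file name
    df.to_csv(path, index=False)            # index=False leaves out the row labels
    restored = pd.read_csv(path)            # read the same file back

print(restored)
```

Without index=False, the row labels would be written as an extra unnamed column and would come back as data when the file is read again.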
>>> titanic.info()
If you want to know more than the data types, use the info() method on the DataFrame. It reports the index range, the number of non-null values in each column (which tells you how many values are missing), and the memory used to hold the DataFrame. In short, info() gives you the technical information about the data.
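The following sketch shows info() on a small made-up table with a few missing values; the column names and values are hypothetical and only loosely modelled on the Titanic data. It also counts the missing values directly with isna().sum(), which is a common companion to info().

```python
import pandas as pd

# A tiny made-up table with some missing values (None becomes NaN).
passengers = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [22.0, None, 35.0],
    "Cabin": [None, "C85", None],
})

# info() prints the index range, the non-null count and dtype of each
# column, and the memory usage. memory_usage="deep" asks pandas to
# measure the real size of object (string) columns as well.
passengers.info(memory_usage="deep")

# isna() marks missing cells; sum() counts them per column.
missing = passengers.isna().sum()
print(missing)
```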