Pandas is a Python module for working with tabular data (i.e., data in a table with rows and columns). It offers much of the same functionality as SQL or Excel, but adds the power of Python.
- A DataFrame is an object that stores data as rows and columns. You can think of a DataFrame as a spreadsheet or as a SQL table.
- You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.
- DataFrames have rows and columns. Each column has a name, which is a string.
- Each row has an index, which is an integer.
- DataFrames can contain many different data types: strings, ints, floats, tuples, etc.
To install pandas for Python 3.6, refer to the pandas installation guide.
Pandas has the following three data structures:
- Series
- DataFrames
- Panels
We will mostly work with DataFrames; a Series can be thought of as a single column of a DataFrame. (Panels are deprecated and have been removed from recent pandas versions.)
Creating Series and DataFrames
import pandas as pd
import numpy as np

my_list = [1, 3, 5, 6, 8, 11, 18]
s1 = pd.Series(my_list)
s2 = pd.Series([1, 3, 5, 6, 8, 11, 18])
s3 = pd.Series(np.arange(100))
s4 = pd.DataFrame(my_list)
df2 = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 4)),
                   columns=['a', 'b', 'c', 'd'])

>>> type(my_list)
<class 'list'>
>>> type(s1)
<class 'pandas.core.series.Series'>
>>> type(s4)
<class 'pandas.core.frame.DataFrame'>

# Creating a DataFrame from a list of dicts (one dict per row)
sales = [{'account': 'Jones LLC', 'Jun': 150, 'Jul': 200, 'Aug': 140},
         {'account': 'Alpha Co',  'Jun': 200, 'Jul': 210, 'Aug': 215},
         {'account': 'Blue Inc',  'Jun': 50,  'Jul': 90,  'Aug': 95}]
df = pd.DataFrame(sales)

# ...or from a dict of lists (one list per column)
sales = {'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'],
         'Jun': [111, 222, 55],
         'Jul': [222, 210, 90],
         'Aug': [140, 215, 95]}
df = pd.DataFrame.from_dict(sales)

>>> df.dtypes
account    object
Jun         int64
Jul         int64
Aug         int64
dtype: object
# From a list of row tuples, with explicit column names
# ('data' was undefined in the original; example values shown here)
data = [(1, 'Pune', 12), (2, 'Mumbai', 25)]
df2 = pd.DataFrame(data, columns=['Store ID', 'Location', 'Number of Employees'])
print(df2)
Reading from and writing into CSV files
df = pd.read_csv('my-csv-file.csv')
df.to_csv('new-csv-file.csv')

# displaying top 5 rows
df.head()
# displaying top 10 rows
df.head(10)
# summary of df: column dtypes, non-null counts, memory usage
df.info()
# statistical summary of the numeric columns
df.describe()
A DataFrame's columns can be accessed in two ways:
df['age']
df.age
type of this data:
print(type(df['age']))
print(type(df.age))
# both print: <class 'pandas.core.series.Series'>
Accessing rows
print(df.loc[2]) #Python is zero indexed
df.loc[3:7]
df.loc[:4]
df.loc[2:]
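One detail worth knowing: unlike standard Python slicing, .loc slices by index label and includes the end label, while .iloc slices by position and excludes it. A minimal sketch (the column name is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 25, 31, 27, 29]})

# .loc slices by label and INCLUDES the end label
by_label = df.loc[1:3]       # rows with index labels 1, 2 and 3

# .iloc slices by position and EXCLUDES the end position
by_position = df.iloc[1:3]   # rows at positions 1 and 2

print(len(by_label))     # 3
print(len(by_position))  # 2
```

With the default integer index the two look interchangeable, which is why the off-by-one only bites once the index has been filtered or reordered.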
Creating a sub-DataFrame
# create using specific columns
df2 =df[['clinic_north','clinic_south']]
# create using a certain condition
df2=df[df.month == 'january']
df2=df[df.age > 27]
df2=df[df.city != 'Mumbai']
# using multiple conditions; in pandas, | means "or" and & means "and"
df2 = df[(df.age < 30) | (df.name == 'Martha Jones')] # parentheses around each condition are required
df2 = df[df.month.isin(['January','February','March'])]
# Reset the index to get consecutive row numbers again
# (reset_index returns a new DataFrame, so reassign or pass inplace=True)
df2 = df2.reset_index(drop=True)
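The effect of reset_index can be seen on a small made-up frame: filtering keeps the original index labels, and reset_index(drop=True) renumbers them from zero.

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 35, 24, 41]})

df2 = df[df.age > 27]
print(df2.index.tolist())   # [1, 3] — the original labels survive filtering

df2 = df2.reset_index(drop=True)
print(df2.index.tolist())   # [0, 1]
```

Without drop=True, the old index would be kept as a new column instead of being discarded.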
Modifying DataFrames
#adding a column
df['height']=[1,2,3,4,5,5]
#adding a column with the same value in every row
df['In Stock?'] = True
df['In Stock?'] = 'Yes'
#New column derived from other columns
df['Revenue'] = df.Price - df['Cost to Manufacture']
#changing capitalization (the standard library has no `strings` module; use str.lower)
df['Lowercase Name'] = df['Name'].apply(str.lower)
# or the vectorized string accessor
df['Lowercase Name'] = df['Name'].str.lower()
Using a lambda function to modify values
get_last_name = lambda x: x.split(' ')[-1]
df['last_name'] = df.name.apply(get_last_name)
Columns can be renamed using
df.columns = ['ID', 'Title','Category','Year Released','Rating']
#selective renaming
df.rename(columns={'name': 'First Name','age': 'Age'},inplace=True)
Calling rename with only the columns keyword returns a new DataFrame and leaves the original unchanged. That is why we also passed the keyword argument inplace=True, which edits the original DataFrame in place.
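This behaviour is easy to demonstrate on a toy frame (column names are made up). Note that reassignment is a common alternative to inplace=True:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bo'], 'age': [30, 25]})

# Without inplace=True the original is untouched...
renamed = df.rename(columns={'name': 'First Name', 'age': 'Age'})
print(df.columns.tolist())       # ['name', 'age']
print(renamed.columns.tolist())  # ['First Name', 'Age']

# ...so instead of inplace=True you can simply reassign:
df = df.rename(columns={'name': 'First Name', 'age': 'Age'})
```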
Working with rows and columns
import pandas as pd
orders = pd.read_csv('shoefly.csv')
print(orders.head(5))
orders['shoe_source'] = orders.shoe_material.apply(lambda x:
    'animal' if x == 'leather' else 'vegan')
orders['salutation'] = orders.apply(lambda row:
'Dear Mr. ' + row['last_name']
if row['gender'] == 'male'
else 'Dear Ms. ' + row['last_name'],
axis=1)
Aggregations
- mean Average of all values in column
- std Standard deviation
- median Median
- max Maximum value in column
- min Minimum value in column
- count Number of values in column
- nunique Number of unique values in column
- unique List of unique values in column
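The aggregations above can be sketched on a toy price column (the values are made up):

```python
import pandas as pd

orders = pd.DataFrame({'price': [10, 20, 20, 40]})

print(orders.price.mean())     # 22.5
print(orders.price.median())   # 20.0
print(orders.price.max())      # 40
print(orders.price.min())      # 10
print(orders.price.count())    # 4
print(orders.price.nunique())  # 3
print(orders.price.unique())   # [10 20 40]
```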
max_price = orders.price.max()
num_colors = orders.shoe_color.nunique()
#groupby
df.groupby('column1').column2.measurement()
#example
grades = df.groupby('student').grade.mean()
pricey_shoes =orders.groupby('shoe_type').price.max()
#groupby with multiple columns
shoe_counts = orders.groupby(['shoe_type','shoe_color']).id.count().reset_index()
Sometimes, the operation that you want to perform is more complicated than mean or count. In those cases, you can use the apply method and lambda functions, just like we did for individual column operations. Note that the input to our lambda function will always be a Series holding one group's values.
import numpy as np
cheap_shoes = orders.groupby('shoe_color').price.apply(lambda x: np.percentile(x, 25)).reset_index()
print(cheap_shoes)
Pivots using dataframe
import pandas as pd
user_visits = pd.read_csv('page_visits.csv')
print(user_visits.head())
click_source=user_visits.groupby('utm_source').id.count().reset_index()
print(click_source)
click_source_by_month=user_visits.groupby(['utm_source','month']).id.count().reset_index()
print(click_source_by_month)
click_source_by_month_pivot = click_source_by_month.pivot(index='utm_source',columns='month',values='id').reset_index()
print(click_source_by_month_pivot)
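The same long-to-wide reshape can be sketched self-contained, with made-up sources and counts:

```python
import pandas as pd

# long format: one row per (source, month) pair
long_df = pd.DataFrame({
    'utm_source': ['email', 'email', 'ad', 'ad'],
    'month': ['jan', 'feb', 'jan', 'feb'],
    'id': [10, 12, 7, 9],
})

# wide format: one row per source, one column per month
wide = long_df.pivot(index='utm_source',
                     columns='month',
                     values='id').reset_index()
print(wide)
```

Each distinct value in the columns= field becomes its own column, so the two-source, two-month frame pivots into 2 rows and 3 columns (utm_source plus one column per month).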
Merge two dataframes
#merge automatically joins on the column(s) the two DataFrames have in common
sales_vs_targets = pd.merge(sales,targets)
#another way to merge
new_df = orders.merge(customers)
#chaining merges across multiple DataFrames
new_df = orders.merge(customers).merge(products)
#merge when column names do not match: rename one side first
orders_products = pd.merge(orders,products.rename(columns={'id': 'product_id'}))
#another way to merge when column names do not match
orders_products = pd.merge(orders,products, left_on='product_id', right_on='id', suffixes=['_orders','_products'])
#outer merge
pd.merge(company_a, company_b, how='outer')
#left
store_a_b_left = pd.merge(store_a, store_b, how='left')
#right
store_a_b_right = pd.merge(store_a, store_b, how='right')
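The difference between the join types shows up with two toy frames (names and prices are made up): an inner merge keeps only matching rows, an outer merge keeps everything and fills the gaps with NaN, and a left merge keeps all rows of the left frame.

```python
import pandas as pd

store_a = pd.DataFrame({'item': ['pen', 'ink'], 'price_a': [1.0, 2.0]})
store_b = pd.DataFrame({'item': ['pen', 'pad'], 'price_b': [1.5, 3.0]})

inner = pd.merge(store_a, store_b)               # only 'pen' matches
outer = pd.merge(store_a, store_b, how='outer')  # pen, ink, pad; gaps become NaN
left  = pd.merge(store_a, store_b, how='left')   # all of store_a's items

print(len(inner), len(outer), len(left))  # 1 3 2
```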
#concatenation
pd.concat([df1, df2])
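pd.concat stacks DataFrames vertically. A small sketch with made-up frames, showing that the original index labels are kept (and can therefore repeat) unless ignore_index=True is passed:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2]})
df2 = pd.DataFrame({'x': [3, 4]})

stacked = pd.concat([df1, df2])
print(stacked.index.tolist())      # [0, 1, 0, 1] — duplicate labels

renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```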
Query DataFrame
#simple condition
crushing_it = sales_vs_targets[sales_vs_targets.revenue > sales_vs_targets.target]
#multiple conditions
results = all_data[(all_data.revenue > all_data.target) & (all_data.women > all_data.men)]
print(results)