Learning Pandas by Experimenting with Data
A learning-focused walkthrough of Pandas, NumPy, SciPy, and Matplotlib, exploring how they work together for data analysis through hands-on experimentation with Series, DataFrames, and real datasets.
Before diving into this topic, I want to re-summarise the statistics module I covered in the last post. This post and the last one could be merged into one whole topic, as both are about dealing with data. After taking my own thoughts on Pandas, SciPy, NumPy, and Matplotlib to ChatGPT for confirmation, I received the following summary.
All these modules are used by statisticians and for data analysis, even with non-numerical data. The point of using NumPy and SciPy is to aggregate data into chunks and then transform or translate that data into information that is understandable. We then use Matplotlib and/or Pandas to gain visualised insights that display the information we want in a better format.
To reiterate, SciPy and NumPy apply advanced mathematical methods to perform fast maths without deterioration in compute time, enabling us to gain the information we need quickly and accurately. We would then use Pandas and Matplotlib to present that information visually, as graphs and tables.
Hopefully, that introduction is enough to understand the topic, as it was for me. From a top-level view, Pandas is all about making tables using DataFrames and Series.
On my initial inspection of Series, I came to think of it as something very similar to Python’s enumerate() function, which brings back the index for each element within a dataset. DataFrames are more about taking a series of information over to a table, which is kept organised.
I should mention that tables are viewed as data structures and not visualisations, so this whole topic leans more towards being like NumPy.
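To make the enumerate() comparison concrete, here is a minimal sketch (the colour values are my own placeholders, not course data):

```python
import pandas as pd

colors = ["red", "green", "blue"]

# enumerate() pairs each element with its position number
for i, color in enumerate(colors):
    print(i, color)

# A Series stores the same pairing as a data structure:
# the positions become an index that travels with the values
s = pd.Series(colors)
print(list(s.index))  # [0, 1, 2]
print(s.iloc[1])      # green
```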
Here is a mental model to grasp this information a little more clearly.
Raw Data (CSV, Excel, JSON, DB)
↓
Pandas (tables, cleaning, filtering, transforming)
↓
NumPy / SciPy (math, statistics, algorithms)
↓
Pandas
↓
Matplotlib / Seaborn (visualization)
For most of statistics and Pandas, I want to focus as much as I can on using the information rather than just writing about it. I’m not typically the best mathematician, but I do like dealing with tables and graphs, and I can experiment with these types of data.
For me, it’s about understanding the tools I can use to work with real data. It’s not required to memorise every tool within Python; it’s more about possessing the ability to work with such tools.
Again, these tools are mostly used by data scientists, and we’ll come across them in areas like machine learning or when building real algorithms.
Using Series
First, we’ll need to import both Pandas and NumPy:
import numpy as np
import pandas as pd
Now that we have them in our Python program, we can begin to explore the functions available:
for item in dir(pd):
    print(item)
# ArrowDtype
# BooleanDtype
# Categorical
# CategoricalDtype
# CategoricalIndex
# DataFrame
# DateOffset
# DatetimeIndex
# DatetimeTZDtype
# Series
# ...
This should bring back a list of modules or functions all available within Pandas, and I would just go and explore them a little. The most used functions throughout the course contents are Series and DataFrame.
Let’s get a list and have its content listed alongside its index numbers:
nums = [1,2,3,4,5]
s = pd.Series(nums)
print(s)
# 0 1
# 1 2
# 2 3
# 3 4
# 4 5
# dtype: int64
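Once the Series exists, its values can be read back by index label or by position; a quick sketch using the same nums list:

```python
import pandas as pd

nums = [1, 2, 3, 4, 5]
s = pd.Series(nums)

print(s.loc[0])     # look up by index label -> 1
print(s.iloc[-1])   # look up by position -> 5
print(s.to_list())  # back to a plain list: [1, 2, 3, 4, 5]
```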
We can instead use a custom index:
nums = [1,2,3,4,5]
s = pd.Series(nums, index=[1,2,3,4,5])  # an explicit list of labels
s = pd.Series(nums, index=range(1, 6))  # or the range function
s = pd.Series(nums, index=nums)         # or reuse an existing list
print(s)
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# dtype: int64
We can do the same using a dictionary:
dct = {"name": "Sheikh Hussain", "city": "Manchester", "country": "United Kingdom"}
s = pd.Series(dct)
print(s)
# name Sheikh Hussain
# city Manchester
# country United Kingdom
# dtype: object
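With a dictionary-built Series, the keys become the index, so values can be looked up by those labels; a small sketch:

```python
import pandas as pd

dct = {"name": "Sheikh Hussain", "city": "Manchester", "country": "United Kingdom"}
s = pd.Series(dct)

# The dictionary keys became the index labels
print(s["city"])             # Manchester
print("country" in s.index)  # True
```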
We can also populate a constant value for each series:
s = pd.Series(10, index=[1,2,3])
print(s)
# 1 10
# 2 10
# 3 10
# dtype: int64
We can use linear space, or linspace, to create a series of equally spaced items:
s = pd.Series(np.linspace(5,20,10)) # start, stop, number of items
print(s)
# 0 5.000000
# 1 6.666667
# 2 8.333333
# 3 10.000000
# 4 11.666667
# 5 13.333333
# 6 15.000000
# 7 16.666667
# 8 18.333333
# 9 20.000000
# dtype: float64
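One advantage of a Series over a plain list is that maths applies element-wise, exactly as with a NumPy array; a minimal sketch using the same linspace values:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.linspace(5, 20, 10))

# Arithmetic applies to every element at once
doubled = s * 2
print(doubled.iloc[0])   # 10.0
print(doubled.iloc[-1])  # 40.0

# Aggregations work directly on the Series
print(s.mean())          # 12.5
```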
Using DataFrames
Now we’ll begin to use DataFrames to generate tables from the data that we provide.
data = [
    ['Sheikh', 'UK', 'Manchester'],
    ['James', 'UK', 'London'],
    ['John', 'DoeCountry', 'Doechester']
]
df = pd.DataFrame(data, columns=['Names','Country','City'])
print(df)
#     Names     Country        City
# 0  Sheikh          UK  Manchester
# 1   James          UK      London
# 2    John  DoeCountry  Doechester
We can again do the same using a dictionary:
data = {'Name': ['Sheikh', 'David', 'John'],
        'Country': ['UK', 'UK', 'Sweden'],
        'City': ['Manchester', 'London', 'Stockholm']}
df = pd.DataFrame(data)
print(df)
# Name Country City
# 0 Sheikh UK Manchester
# 1 David UK London
# 2 John Sweden Stockholm
We can also do the same with a list of dictionaries:
data = [
    {'Name': 'Sheikh', 'Country': 'UK', 'City': 'Manchester'},
    {'Name': 'David', 'Country': 'UK', 'City': 'London'},
    {'Name': 'John', 'Country': 'Sweden', 'City': 'Stockholm'}
]
df = pd.DataFrame(data)
print(df)
# Name Country City
# 0 Sheikh UK Manchester
# 1 David UK London
# 2 John Sweden Stockholm
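However the DataFrame is built, each column it holds is itself a Series; a quick sketch with the same data:

```python
import pandas as pd

data = [
    {'Name': 'Sheikh', 'Country': 'UK', 'City': 'Manchester'},
    {'Name': 'David', 'Country': 'UK', 'City': 'London'},
    {'Name': 'John', 'Country': 'Sweden', 'City': 'Stockholm'}
]
df = pd.DataFrame(data)

# A single column comes back as a Series
print(type(df['Name']))     # <class 'pandas.core.series.Series'>
print(df['Name'].tolist())  # ['Sheikh', 'David', 'John']
```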
Using External Files
Most of our work will come from external files with data collected there, so Pandas has built-in functions like read_csv or read_excel (and more) to extract data from those files and load it as a dataset to work with.
data = pd.read_csv('weight-height.csv')
# This is a file provided by the course to work on the contents and tasks
print(data.head(5)) # This prints the first 5 rows of the dataset
# Gender Height Weight
# 0 Male 73.847017 241.893563
# 1 Male 68.781904 162.310473
# 2 Male 74.110105 212.740856
# 3 Male 71.730978 220.042470
# 4 Male 69.881796 206.349801
print(data.tail(5)) # Prints the last 5 rows
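The weight-height.csv file is provided by the course, so anyone without it can reproduce the same calls by feeding read_csv an in-memory file instead. A minimal sketch, where the rows below are invented values rather than the course data:

```python
import io
import pandas as pd

# A tiny stand-in for the course file; these rows are invented values
csv_text = """Gender,Height,Weight
Male,73.8,241.9
Male,68.8,162.3
Female,65.1,140.2
"""

data = pd.read_csv(io.StringIO(csv_text))
print(data.head(2))  # first 2 rows
print(data.tail(1))  # last row
```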
If we need to get the shape of a dataset (number of rows and columns), we would use the shape attribute here too:
data = pd.read_csv('weight-height.csv')
print(data.shape)
# (10000, 3) # 10000 rows and 3 columns
We can use the indexing method to get specific columns or rows.
data = pd.read_csv('weight-height.csv')
print(data.columns)
# Index(['Gender', 'Height', 'Weight'], dtype='object')
print(data["Height"].head(5)) # Get the column that matches the string value
# 0 73.847017
# 1 68.781904
# 2 74.110105
# 3 71.730978
# 4 69.881796
# Name: Height, dtype: float64
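Columns are selected by name as above; rows need loc or iloc instead. A small sketch using an invented stand-in frame rather than the course file:

```python
import pandas as pd

# Invented stand-in values for the course dataset
data = pd.DataFrame({'Gender': ['Male', 'Male', 'Female'],
                     'Height': [73.8, 68.8, 65.1]})

print(data.iloc[0])                      # first row, by position
print(data.loc[2, 'Height'])             # row label 2, Height column -> 65.1
print(data[['Gender', 'Height']].shape)  # a list of names selects columns -> (3, 2)
```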
The new function I have come across is using describe(), which is what we can use to gain statistical information on all the data or items inside the dataset. It will produce values like standard deviation, min, max, mean, etc.
print(data.describe())
# Height Weight
# count 10000.000000 10000.000000
# mean 66.367560 161.440357
# std 3.847528 32.108439
# min 54.263133 64.700127
# 25% 63.505620 135.818051
# 50% 66.318070 161.212928
# 75% 69.174262 187.169525
# max 78.998742 269.989699
print(data['Height'].describe())
# count 10000.000000
# mean 66.367560
# std 3.847528
# min 54.263133
# 25% 63.505620
# 50% 66.318070
# 75% 69.174262
# max 78.998742
# Name: Height, dtype: float64
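describe() bundles several statistics, but each is also available as its own method, which is handy when only one number is needed; a sketch with invented sample values:

```python
import pandas as pd

# Invented sample heights
heights = pd.Series([73.8, 68.8, 65.1, 71.7])

print(heights.mean())         # the mean row of describe()
print(heights.std())          # sample standard deviation
print(heights.min(), heights.max())
print(heights.quantile(0.5))  # the 50% (median) row -> 70.25
```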
So far, rather than explaining each function in depth, I have been learning how to use these tools by experimenting with them on my own.
We can use info() to get a technical summary of the dataset too.
print(data['Height'].info())
# <class 'pandas.core.series.Series'>
# RangeIndex: 10000 entries, 0 to 9999
# Series name: Height
# Non-Null Count Dtype
# -------------- -----
# 10000 non-null float64
# dtypes: float64(1)
# memory usage: 78.3 KB
# None
Modifying a DataFrame
Hitherto, I have spent some time learning how to bring back data in different ways from different sources. What about adding to or modifying existing datasets? For that, Pandas works much like assigning a new key in a dictionary.
data = pd.read_csv('weight-height.csv')
new_data = [x for x in range(0, 10000)]
data['new_data'] = new_data
print(data.head(5))
# Gender Height Weight new_data
# 0 Male 73.847017 241.893563 0
# 1 Male 68.781904 162.310473 1
# 2 Male 74.110105 212.740856 2
# 3 Male 71.730978 220.042470 3
# 4 Male 69.881796 206.349801 4
The only error I came across was that I had to have the same number of items to append to my dataset as there are rows already. That’s why I used the range() function, so that it can auto-populate.
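A new column can also be computed from existing columns rather than appended from a separate list, which sidesteps the length problem entirely, since the result always has exactly one value per row. A sketch with invented values (the 703 factor assumes Height in inches and Weight in pounds):

```python
import pandas as pd

# Invented stand-in values for the course dataset
data = pd.DataFrame({'Height': [73.8, 68.8, 65.1],
                     'Weight': [241.9, 162.3, 140.2]})

# Element-wise arithmetic across whole columns builds the new one
data['BMI'] = 703 * data['Weight'] / data['Height'] ** 2

print(data.columns.tolist())  # ['Height', 'Weight', 'BMI']
print(data['BMI'].round(1).tolist())
```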
Filtering information in a DataFrame
Most of these functions are very similar to what we learned before; the only real difference is learning how to filter results. For this, I used AI, as the course contents at the time of writing didn’t include it, but that’s probably to promote self-study or research.
# data here is a Hacker News posts dataset with a 'title' column
js_titles = data[data['title'].str.contains('javascript', case=False, na=False)]
print(js_titles.head())
# id title url num_points num_comments author created_at
# 267 12352636 Show HN: Hire JavaScript - Top JavaScript Talent https://www.hirejs.com/ 1 1 eibrahim 8/24/2016 15:16
# 580 10871330 Python integration for the Duktape Javascript ... https://pypi.python.org/pypi/pyduktape 3 1 stefano 1/9/2016 14:26
# 811 10741251 Ask HN: Are there any projects or compilers wh... NaN 1 2 ggonweb 12/15/2015 23:26
# 1046 11343334 If you write JavaScript tools or libraries, bu... https://medium.com/@Rich_Harris/how-to-not-bre... 48 19 callumlocke 3/23/2016 10:54
# 1093 10422726 Rollup.js: A next-generation JavaScript module... http://rollupjs.org 57 17 dmmalam 10/21/2015 0:02
So this is how we would filter through data. The code searches the column you specify and brings back the rows whose value contains the given string. By default the pattern is treated as a regular expression; with case=False the match ignores case, and with na=False any null values are excluded from the results.
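str.contains is just one way to build a boolean mask; plain comparison operators work the same way for numeric columns. A sketch with invented stand-in rows shaped like the dataset above:

```python
import pandas as pd

# Invented stand-in rows shaped like the Hacker News dataset
data = pd.DataFrame({'title': ['JavaScript rocks', 'Python tips', None],
                     'num_points': [48, 57, 1]})

# String mask: case-insensitive, nulls treated as non-matches
js = data[data['title'].str.contains('javascript', case=False, na=False)]
print(len(js))  # 1

# Numeric mask: a comparison produces True/False per row
popular = data[data['num_points'] > 40]
print(len(popular))  # 2
```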
Summary
This post, by my own standards, is very cheap. The reason I make this statement is that I haven’t really explained all the functions or explored them to the depth that would make me a professional Python developer.
The thing is, this is my second pass over this course, and I need to know about these functions and their abilities before I can go ahead and use them. Which ones I end up using will also be selective, depending on the projects I work on.
As I go along, because of how useful the course has been, I’ll be looking to go over the course contents again. Each time, I’ll gain a slightly deeper insight that will help improve my overall comprehension and skill in using the tools.
