Learning Pandas by Experimenting with Data

A learning-focused walkthrough of Pandas, NumPy, SciPy, and Matplotlib, exploring how they work together for data analysis through hands-on experimentation with Series, DataFrames, and real datasets.

Before I dive into this topic, I want to recap the statistics module I covered in the last post. This post and the last one could be merged into a single topic, since both are about working with data. After checking my own understanding of Pandas, SciPy, NumPy, and Matplotlib with ChatGPT, I received the following summary.

All of these modules are used by statisticians and data analysts, even when the data is non-numerical. NumPy and SciPy take data held in arrays and transform it into information we can understand, such as totals, averages, and other statistics. We then use Matplotlib and/or Pandas to present that information in a more readable format.

To reiterate, NumPy and SciPy apply mathematical methods to whole arrays at once, which keeps the maths fast as the data grows, so we get the information we need quickly and accurately. We then use Pandas and Matplotlib to present that information as tables and graphs.

Hopefully, that introduction will be enough for you, as it was for me. From a top-level view, Pandas is all about building tables using DataFrames and Series.

On my initial inspection of Series, I came to think of it as something very similar to Python’s enumerate() function, which pairs each element of a sequence with its index. A DataFrame is more about arranging that kind of data into a table, which keeps it organised.

I should mention that tables are data structures and not visualisations, so this whole topic leans more towards NumPy than Matplotlib.
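To test the enumerate() comparison for myself, here is a minimal sketch. The list and the labels x, y, z are made up for illustration. It suggests the analogy holds for the default index, but a Series keeps its index as part of the data structure, and the labels do not have to be positions.

```python
import pandas as pd

letters = ["a", "b", "c"]

# enumerate() only yields (position, value) pairs while we loop over it
pairs = list(enumerate(letters))
print(pairs)  # [(0, 'a'), (1, 'b'), (2, 'c')]

# A Series stores the values together with an index of labels
s = pd.Series(letters)
print(list(s.index))  # [0, 1, 2]

# Unlike enumerate(), the labels can be replaced with anything we like
s.index = ["x", "y", "z"]
print(s["y"])  # b
```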

Here is a mental model to grasp this information a little more clearly.

Raw Data (CSV, Excel, JSON, DB)
↓
Pandas (tables, cleaning, filtering, transforming)
↓
NumPy / SciPy (math, statistics, algorithms)
↓
Pandas
↓
Matplotlib / Seaborn (visualization)
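
The flow above can be sketched in a few lines. This is only an illustration under my own assumptions: the CSV text, the column names, and the score threshold are all hypothetical, and the plotting step is left as a comment so the example runs without Matplotlib.

```python
import io

import numpy as np
import pandas as pd

# Raw data: a hypothetical CSV held in memory instead of a file on disk
raw = io.StringIO("name,score\nA,10\nB,20\nC,30\n")

# Pandas: load the table and filter it
df = pd.read_csv(raw)
df = df[df["score"] >= 20]

# NumPy: do the maths on the underlying array
scores = df["score"].to_numpy()
zscores = (scores - scores.mean()) / scores.std()

# Pandas again: put the result back into the table
df = df.assign(zscore=zscores)
print(df)

# Matplotlib would come last, for example df.plot(x="name", y="zscore")
```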

For most of the Statistics and Pandas material, I want to focus as much as I can on using the tools rather than writing about them. I’m not the best mathematician, but I do like working with tables and graphs, and I can experiment with these types of data.

For me, it’s more about understanding the tools I can use to work with real data. There is no requirement to memorise every tool within Python; what matters is knowing that the tools exist and being able to work with them.

Again, these tools are mostly used by data scientists, and we’ll come across them in areas like Machine Learning or when building real algorithms.

Using Series

First, we’ll need to import both Pandas and NumPy:

import numpy as np
import pandas as pd

Now that we have them in our Python program, we can begin to explore the functions available:

for item in dir(pd):
    print(item)

# ArrowDtype
# BooleanDtype
# Categorical
# CategoricalDtype
# CategoricalIndex
# DataFrame
# DateOffset
# DatetimeIndex
# DatetimeTZDtype
# Series
# ...

This prints the names of everything available in the Pandas namespace (classes, functions, and submodules), and it is worth exploring them a little. The two used most throughout the course are Series and DataFrame, which are classes rather than functions.
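
Since dir() returns a plain list of strings, it can be narrowed down with a list comprehension. The sketch below is one way I might do that; the variable names are my own, and the exact names printed depend on the installed Pandas version.

```python
import pandas as pd

# Skip private names, which start with an underscore
public_names = [name for name in dir(pd) if not name.startswith("_")]

# Narrow the list further, here to the file-reading functions
readers = [name for name in public_names if name.startswith("read_")]

print(len(public_names))
print(readers)
```

Calling help(pd.Series) in an interactive session is a handy next step once a name looks interesting.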

Let’s take a list and display its contents alongside their index numbers:

nums = [1,2,3,4,5]
s = pd.Series(nums)

print(s)
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# dtype: int64

We can instead use a custom index:

nums = [1,2,3,4,5]

# Each of these three lines builds the same Series; only the way the index is supplied differs
s = pd.Series(nums, index=[1, 2, 3, 4, 5])  # an explicit list of labels
s = pd.Series(nums, index=range(1, 6))      # the range function gives the same labels
s = pd.Series(nums, index=nums)             # an existing list works too
print(s)
# 1    1
# 2    2
# 3    3
# 4    4
# 5    5
# dtype: int64
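
Once the index no longer starts at 0, labels and positions stop lining up. A small sketch with made-up values shows the difference between looking up by label with .loc and by position with .iloc.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[1, 2, 3])

# .loc looks up by index label
print(s.loc[1])   # 10

# .iloc looks up by position, counting from 0
print(s.iloc[1])  # 20
```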

We can do the same using a dictionary:

dct = {"name": "Sheikh Hussain", "city": "Manchester", "country": "United Kingdom"}  # keys must be quoted strings

s = pd.Series(dct)

print(s)
# name       Sheikh Hussain
# city           Manchester
# country    United Kingdom
# dtype: object

We can also fill a Series with a constant value, repeated for every index label:

s = pd.Series(10, index=[1,2,3])

print(s)

# 1    10
# 2    10
# 3    10
# dtype: int64

We can use NumPy’s linspace ("linear space") function to create a Series of equally spaced values:

s = pd.Series(np.linspace(5,20,10)) # start, stop, number of items
print(s)
# 0     5.000000
# 1     6.666667
# 2     8.333333
# 3    10.000000
# 4    11.666667
# 5    13.333333
# 6    15.000000
# 7    16.666667
# 8    18.333333
# 9    20.000000
# dtype: float64
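
The spacing in that output can be worked out by hand: linspace includes both endpoints by default, so 10 values leave 9 gaps and the step is (20 - 5) / 9. The sketch below checks this with np.diff; the variable names are my own.

```python
import numpy as np

start, stop, num = 5, 20, 10
values = np.linspace(start, stop, num)

# Both endpoints are included, so there are num - 1 gaps between the values
step = (stop - start) / (num - 1)
print(step)

# Every gap between neighbouring values should match that step
print(np.allclose(np.diff(values), step))
```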

Using DataFrames

Now we’ll begin to use DataFrames to generate tables from the data that we provide.

data = [
    ['Sheikh', 'UK', 'Manchester'],
    ['James', 'UK', 'London'],
    ['John', 'DoeCountry', 'Doechester']
]

df = pd.DataFrame(data, columns=['Names','Country','City'])
print(df)

#     Names     Country        City
# 0  Sheikh          UK  Manchester
# 1   James          UK      London
# 2    John  DoeCountry  Doechester

We can again do the same using a dictionary:

data = {
    'Name': ['Sheikh', 'David', 'John'],
    'Country': ['UK', 'UK', 'Sweden'],
    'City': ['Manchester', 'London', 'Stockholm'],
}

df = pd.DataFrame(data)
print(df)

#      Name Country        City
# 0  Sheikh      UK  Manchester
# 1   David      UK      London
# 2    John  Sweden   Stockholm

We can also do the same with a list of dictionaries:

data = [
    {'Name': 'Sheikh', 'Country': 'UK', 'City': 'Manchester'},
    {'Name': 'David', 'Country': 'UK', 'City': 'London'},
    {'Name': 'John', 'Country': 'Sweden', 'City': 'Stockholm'}
    ]

df = pd.DataFrame(data)

print(df)

#      Name Country        City
# 0  Sheikh      UK  Manchester
# 1   David      UK      London
# 2    John  Sweden   Stockholm
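
One thing I wanted to check with the list-of-dictionaries form is what happens when a record is missing a key. In this sketch (the records are made up), the second dictionary has no 'City', and Pandas fills the gap with a missing value instead of raising an error.

```python
import pandas as pd

records = [
    {"Name": "Sheikh", "Country": "UK", "City": "Manchester"},
    {"Name": "David", "Country": "UK"},  # no 'City' key in this record
]

df = pd.DataFrame(records)
print(df)

# isna() marks the cells where a value is missing
print(df["City"].isna().tolist())  # [False, True]
```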

Using External Files

Most of our work will use data stored in external files, so Pandas has built-in functions such as read_csv and read_excel that load those files into a DataFrame we can work with.

data = pd.read_csv('weight-height.csv')
# This is a file provided by the course to work on the contents and tasks

print(data.head(5)) # This prints the first 5 rows of the DataFrame
#   Gender     Height      Weight
# 0   Male  73.847017  241.893563
# 1   Male  68.781904  162.310473
# 2   Male  74.110105  212.740856
# 3   Male  71.730978  220.042470
# 4   Male  69.881796  206.349801

print(data.tail(5)) # Prints the last 5 rows
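
To experiment with read_csv without the course file, the same call can be pointed at text held in memory. This is a sketch only: the three rows below are made up and are not taken from weight-height.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk
csv_text = """Gender,Height,Weight
Male,70.0,180.0
Female,64.0,130.0
Male,72.0,200.0
"""

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))

print(data.head(2))  # first 2 rows
print(data.tail(1))  # last row
print(data.dtypes)   # the column types Pandas inferred
```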

If we need the shape of a dataset (number of rows and columns), we can use the shape attribute. It is an attribute rather than a method, so it is written without parentheses:

data = pd.read_csv('weight-height.csv')

print(data.shape)
# (10000, 3) # 10000 rows and 3 columns
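
Because shape is a plain tuple, it can be unpacked directly. A small sketch with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Height": [70.0, 64.0, 72.0], "Weight": [180.0, 130.0, 200.0]})

# shape is a (rows, columns) tuple, so it unpacks into two variables
rows, cols = df.shape
print(rows, cols)  # 3 2

# len() on a DataFrame gives the number of rows
print(len(df))  # 3
```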

We can use square-bracket indexing to select specific columns.

data = pd.read_csv('weight-height.csv')
print(data.columns)
# Index(['Gender', 'Height', 'Weight'], dtype='object')

print(data["Height"].head(5)) # Get the column that matches the string value
# 0    73.847017
# 1    68.781904
# 2    74.110105
# 3    71.730978
# 4    69.881796
# Name: Height, dtype: float64
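
The example above only selects a single column. As a rough sketch of the other common selections, using a small made-up table rather than the course file: a list of names selects several columns, .iloc selects a row by position, and .loc selects by row label and column name.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Height": [70.0, 64.0, 72.0],
    "Weight": [180.0, 130.0, 200.0],
})

one_col = df["Height"]               # a single name returns a Series
two_cols = df[["Height", "Weight"]]  # a list of names returns a DataFrame
first_row = df.iloc[0]               # a row by position, returned as a Series
cell = df.loc[1, "Weight"]           # a single value by row label and column name

print(type(one_col).__name__, type(two_cols).__name__)  # Series DataFrame
print(first_row["Gender"])  # Male
print(cell)  # 130.0
```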

A new method I have come across is describe(), which returns summary statistics for the numeric columns in the dataset: count, mean, standard deviation, min, max, and the quartiles.

print(data.describe())

#              Height        Weight
# count  10000.000000  10000.000000
# mean      66.367560    161.440357
# std        3.847528     32.108439
# min       54.263133     64.700127
# 25%       63.505620    135.818051
# 50%       66.318070    161.212928
# 75%       69.174262    187.169525
# max       78.998742    269.989699

print(data['Height'].describe())  # the column is named 'Height', as data.columns showed
# count    10000.000000
# mean        66.367560
# std          3.847528
# min         54.263133
# 25%         63.505620
# 50%         66.318070
# 75%         69.174262
# max         78.998742
# Name: Height, dtype: float64
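
Each row of that summary can also be computed on its own, which helped me see what describe() is doing. The sketch below uses five made-up heights. One detail worth knowing is that the Pandas std() is the sample standard deviation (it divides by n - 1), whereas NumPy's std() divides by n by default.

```python
import pandas as pd

heights = pd.Series([60.0, 62.0, 64.0, 66.0, 68.0])
summary = heights.describe()

# The result is itself a Series, indexed by the statistic's name
print(summary["mean"], heights.mean())   # 64.0 64.0
print(summary["50%"], heights.median())  # 64.0 64.0

# Sample standard deviation: sqrt(40 / 4), roughly 3.162
print(summary["std"], heights.std())
```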

So far I haven’t explained each function in detail; instead, I have been learning how to use these tools by experimenting with them on my own.

We can use the info() method to get information about the dataset too, such as the number of non-null entries, the dtype, and the memory usage.

print(data['Height'].info())  # info() prints its report and returns None, so print() also shows "None"

# <class 'pandas.core.series.Series'>
# RangeIndex: 10000 entries, 0 to 9999
# Series name: Height
# Non-Null Count  Dtype
# --------------  -----
# 10000 non-null  float64
# dtypes: float64(1)
# memory usage: 78.3 KB
# None

Modifying a DataFrame

So far, I have spent time learning how to load data in different ways from different sources. What about adding to or modifying an existing dataset? Adding a column works much like assigning a value to a key in a dictionary.

data = pd.read_csv('weight-height.csv')
new_data = list(range(len(data)))  # one value per existing row (10000 here)

data['new_data'] = new_data

print(data.head(5))

#   Gender     Height      Weight  new_data
# 0   Male  73.847017  241.893563         0
# 1   Male  68.781904  162.310473         1
# 2   Male  74.110105  212.740856         2
# 3   Male  71.730978  220.042470         3
# 4   Male  69.881796  206.349801         4

The only error I came across was that the new column needs exactly as many items as the dataset has rows. That’s why I used the range() function, so that the values are generated automatically.
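
Here is a small sketch that reproduces the length error on a made-up three-row table and then shows two ways around it. The column names row_id and source are my own and are not part of the course dataset.

```python
import pandas as pd

df = pd.DataFrame({"Height": [70.0, 64.0, 72.0]})

error = None
try:
    df["row_id"] = [0, 1]  # only 2 values for 3 rows
except ValueError as exc:
    error = str(exc)
    print("ValueError:", error)

# Sizing the values from the DataFrame itself avoids hard-coding the row count
df["row_id"] = list(range(len(df)))

# A single value is repeated for every row
df["source"] = "sample"

print(df)
```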

Filtering information in a DataFrame

Most of these functions are very similar to what we have learned before; the only real difference is learning how to filter results. For this, I used AI, as the course contents at the time of writing didn’t cover it, probably to promote self-study or research.

# Here, data is a different dataset: one with a 'title' column, as the output below shows
python_titles = data[data['title'].str.contains('python', case=False, na=False)]
js_titles = data[data['title'].str.contains('javascript', case=False, na=False)]

print(js_titles.head())

#             id                                              title                                                url  num_points  num_comments       author        created_at
# 267   12352636   Show HN: Hire JavaScript - Top JavaScript Talent                            https://www.hirejs.com/           1             1     eibrahim   8/24/2016 15:16
# 580   10871330  Python integration for the Duktape Javascript ...             https://pypi.python.org/pypi/pyduktape           3             1      stefano    1/9/2016 14:26
# 811   10741251  Ask HN: Are there any projects or compilers wh...                                                NaN           1             2      ggonweb  12/15/2015 23:26
# 1046  11343334  If you write JavaScript tools or libraries, bu...  https://medium.com/@Rich_Harris/how-to-not-bre...          48            19  callumlocke   3/23/2016 10:54
# 1093  10422726  Rollup.js: A next-generation JavaScript module...                                http://rollupjs.org          57            17      dmmalam   10/21/2015 0:02

So this is how we would filter data. The expression inside the brackets checks every row of the column you specify and produces True where the title contains the given text; the DataFrame then returns only those rows. By default, str.contains() treats the text as a regular expression. Passing case=False makes the match case-insensitive, and na=False treats missing titles as non-matches, so they are excluded.
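
To see the filter working step by step, here is a sketch on a tiny made-up table (the titles and points are invented, not taken from the dataset above). It prints the True/False mask on its own and then combines it with a second, numeric condition using &.

```python
import pandas as pd

posts = pd.DataFrame({
    "title": ["Learning Python", "JavaScript tips", None, "python vs javascript"],
    "num_points": [10, 5, 1, 7],
})

# One True/False value per row; the missing title becomes False because na=False
mask = posts["title"].str.contains("python", case=False, na=False)
print(mask.tolist())  # [True, False, False, True]

# Passing the mask in square brackets keeps only the True rows
python_posts = posts[mask]
print(python_posts)

# Conditions combine with & (and) or | (or); each one needs its own parentheses
popular_python = posts[mask & (posts["num_points"] > 8)]
print(popular_python)
```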

Summary

By my own standards, this post is fairly thin. I say that because I haven’t explained every function, or explored them in the depth expected of a professional Python developer.

This is my second pass over the course, and I first need to know that these functions exist and what they can do before I can go ahead and use them. Which ones I actually use is also selective, as it depends on the projects I work on.

Because the course has been so useful, I’ll be going over its contents again as I go along. Each time, I’ll gain a slightly deeper insight that will help improve my overall comprehension and skill in using the tools.

This post is licensed under CC BY 4.0 by the author.