STEM with Python – Part 7: Pandas and Pandasql

The SQL of Python, so to speak.

What if I told you that you can convert your spreadsheet into a SQL table against which you can execute advanced SQL questions to conduct professional data analysis? Well, it’s possible with Python’s pandas and pandasql libraries. This is a very interesting programming journey. Here, we go!

We’ve seen a lot of Python’s potential throughout this series, covering SQL compatibility, and algebraic libraries. And this time, let’s take advantage of the language’s other very powerful and famous libraries: pandas and pandasql.

Since we covered Python and MySQL a few days back, this is the perfect timing to cover its SQL-related library that converts your spreadsheet into a SQL-like database against which you can execute SQL queries. Can you believe it? I repeat you can convert your spreadsheet into a SQL-like database! It’s simply wonderful and this is something I’ve never done it before in other languages, and I’m not even sure I can find anything as powerful as pandas/pandasql in other languages!

Anyways, without further adieu, let’s get into it!

First things first, you have to install pandas.

$pip install pandas

The installed result:

image 01

And then, install pandasql. This is the library that creates the magic!

$pip install pandasql

The installed result:

image 02

And here’s the CSV file we’re toying around.

image 03

Okay, now we’re ready to go! This is our first panda-powered code. And don’t forget to import both pandas and pandasql.

import pandas as pd
import pandasql as pdql

data = pd.read_csv("/path/dataJan112023.csv")
data = data.set_index("id")

print(data)

The executed result:

johnito@skynewubuntuserver:pandas$ python3 pandasSample.py
      name  age  height gender
id
1     John   48     180      m
2     Anna   30     167      f
3     Mike   40     187      m
4   Ashley   21     172      f
5    Frank   80     180      m

And here’s another query with a condition.

import pandas as pd
import pandasql as pdql

data = pd.read_csv("/path/dataJan112023.csv")
data = data.set_index("id")

print(data.query("age >= 30"))

The executed result:

johnito@skynewubuntuserver:pandas$ python3 pandasSample2.py
     name  age  height gender
id
1    John   48     180      m
2    Anna   30     167      f
3    Mike   40     187      m
5   Frank   80     180      m

The SQL of Python, so to speak:

Okay, this is the fun part. By taking advantage of pandasql’s powerful feature, you can execute SQL queries against the spreadsheet. Here we go!

import pandas as pd
import pandasql as pdql

data = pd.read_csv("/path/dataJan112023.csv")
data = data.set_index("id")

sql_query = "select * from data where age >= 30"

print(pdql.sqldf(sql_query, locals()))

The executed result (the difference is that we have the index column on the left side):

johnito@skynewubuntuserver:pandas$ python3 pandasSample3.py
   id   name  age  height gender
0   1   John   48     180      m
1   2   Anna   30     167      f
2   3   Mike   40     187      m
3   5  Frank   80     180      m

Here’s an SQL query with a condition.

import pandas as pd
import pandasql as pdql

data = pd.read_csv("/path/dataJan112023.csv")
data = data.set_index("id")

sql_query = "select * from data where gender == 'f'"

print(pdql.sqldf(sql_query, locals()))

The executed result:

johnito@skynewubuntuserver:pandas$ python3 pandasSample4.py
   id    name  age  height gender
0   2    Anna   30     167      f
1   4  Ashley   21     172      f

Another sample:

import pandas as pd
import pandasql as pdql

data = pd.read_csv("/path/dataJan112023.csv")
data = data.set_index("id")

sql_query = "select * from data where gender == 'f' and age >= 30"

print(pdql.sqldf(sql_query, locals()))

The executed result:

johnito@skynewubuntuserver:pandas$ python3 pandasSample4.py
   id  name  age  height gender
0   2  Anna   30     167      f

Finally, here’s an advanced query with left-join. What is interesting is that you can actually create a temporary table with pandas’ DataFrame as you can see below. And what we do here is to left-join data2 to the spreadsheet’s data. Interesting, isn’t it?

import pandas as pd
import pandasql as pdql

data = pd.read_csv("/path/dataJan112023.csv")
data = data.set_index("id")

sql_query = "select * from data where gender == 'f' and age >= 30"

data2 = pd.DataFrame({"id":[2], "job":["Programmer"]})
print(data2)

sql_query = "select * from data d left join data2 d2 on d.id = d2.id"

print(pdql.sqldf(sql_query, locals()))

The executed result:

johnito@skynewubuntuserver:pandas$ python3 pandasSample5.py
   id         job
0   2  Programmer
   id    name  age  height gender   id         job
0   1    John   48     180      m  NaN        None
1   2    Anna   30     167      f  2.0  Programmer
2   3    Mike   40     187      m  NaN        None
3   4  Ashley   21     172      f  NaN        None
4   5   Frank   80     180      m  NaN        None

Afterthoughts:

So, this is superb! I mean, simply superb! How much can we enjoy coding with this language with such powerful and like-no-other libraries?! This time, I used a small-sized spreadsheet, but if you want, you can try using a large amount of financial data which can be downloaded from the Internet next time. You can probably find a very interesting analyzed dataset from your queries.

See you soon!

Leave a Reply