The SQL of Python, so to speak.
What if I told you that you can convert your spreadsheet into a SQL table against which you can execute advanced SQL questions to conduct professional data analysis? Well, it’s possible with Python’s pandas and pandasql libraries. This is a very interesting programming journey. Here, we go!
We’ve seen a lot of Python’s potential throughout this series, covering SQL compatibility, and algebraic libraries. And this time, let’s take advantage of the language’s other very powerful and famous libraries: pandas and pandasql.
Since we covered Python and MySQL a few days back, this is the perfect timing to cover its SQL-related library that converts your spreadsheet into a SQL-like database against which you can execute SQL queries. Can you believe it? I repeat you can convert your spreadsheet into a SQL-like database! It’s simply wonderful and this is something I’ve never done it before in other languages, and I’m not even sure I can find anything as powerful as pandas/pandasql in other languages!
Anyways, without further adieu, let’s get into it!
First things first, you have to install pandas.
$pip install pandas
The installed result:
And then, install pandasql. This is the library that creates the magic!
$pip install pandasql
The installed result:
And here’s the CSV file we’re toying around.
Okay, now we’re ready to go! This is our first panda-powered code. And don’t forget to import both pandas and pandasql.
import pandas as pd
import pandasql as pdql
data = pd.read_csv("/path/dataJan112023.csv")
data = data.set_index("id")
print(data)
The executed result:
johnito@skynewubuntuserver:pandas$ python3 pandasSample.py
name age height gender
id
1 John 48 180 m
2 Anna 30 167 f
3 Mike 40 187 m
4 Ashley 21 172 f
5 Frank 80 180 m
And here’s another query with a condition.
import pandas as pd
import pandasql as pdql
data = pd.read_csv("/path/dataJan112023.csv")
data = data.set_index("id")
print(data.query("age >= 30"))
The executed result:
johnito@skynewubuntuserver:pandas$ python3 pandasSample2.py
name age height gender
id
1 John 48 180 m
2 Anna 30 167 f
3 Mike 40 187 m
5 Frank 80 180 m
The SQL of Python, so to speak:
Okay, this is the fun part. By taking advantage of pandasql’s powerful feature, you can execute SQL queries against the spreadsheet. Here we go!
import pandas as pd
import pandasql as pdql
data = pd.read_csv("/path/dataJan112023.csv")
data = data.set_index("id")
sql_query = "select * from data where age >= 30"
print(pdql.sqldf(sql_query, locals()))
The executed result (the difference is that we have the index column on the left side):
johnito@skynewubuntuserver:pandas$ python3 pandasSample3.py
id name age height gender
0 1 John 48 180 m
1 2 Anna 30 167 f
2 3 Mike 40 187 m
3 5 Frank 80 180 m
Here’s an SQL query with a condition.
import pandas as pd
import pandasql as pdql
data = pd.read_csv("/path/dataJan112023.csv")
data = data.set_index("id")
sql_query = "select * from data where gender == 'f'"
print(pdql.sqldf(sql_query, locals()))
The executed result:
johnito@skynewubuntuserver:pandas$ python3 pandasSample4.py
id name age height gender
0 2 Anna 30 167 f
1 4 Ashley 21 172 f
Another sample:
import pandas as pd
import pandasql as pdql
data = pd.read_csv("/path/dataJan112023.csv")
data = data.set_index("id")
sql_query = "select * from data where gender == 'f' and age >= 30"
print(pdql.sqldf(sql_query, locals()))
The executed result:
johnito@skynewubuntuserver:pandas$ python3 pandasSample4.py
id name age height gender
0 2 Anna 30 167 f
Finally, here’s an advanced query with left-join. What is interesting is that you can actually create a temporary table with pandas’ DataFrame as you can see below. And what we do here is to left-join data2 to the spreadsheet’s data. Interesting, isn’t it?
import pandas as pd
import pandasql as pdql
data = pd.read_csv("/path/dataJan112023.csv")
data = data.set_index("id")
sql_query = "select * from data where gender == 'f' and age >= 30"
data2 = pd.DataFrame({"id":[2], "job":["Programmer"]})
print(data2)
sql_query = "select * from data d left join data2 d2 on d.id = d2.id"
print(pdql.sqldf(sql_query, locals()))
The executed result:
johnito@skynewubuntuserver:pandas$ python3 pandasSample5.py
id job
0 2 Programmer
id name age height gender id job
0 1 John 48 180 m NaN None
1 2 Anna 30 167 f 2.0 Programmer
2 3 Mike 40 187 m NaN None
3 4 Ashley 21 172 f NaN None
4 5 Frank 80 180 m NaN None
Afterthoughts:
So, this is superb! I mean, simply superb! How much can we enjoy coding with this language with such powerful and like-no-other libraries?! This time, I used a small-sized spreadsheet, but if you want, you can try using a large amount of financial data which can be downloaded from the Internet next time. You can probably find a very interesting analyzed dataset from your queries.
See you soon!