Author page: bentleyv

Python Docker Image with Pyodbc for SQL Server Connectivity

Photo by Guillaume Bolduc on Unsplash

The Ask

A Docker image that can be used as a standardized starting point for Python development work. Additionally,

  • Use pyodbc to connect to SQL Server
  • The smaller the footprint, the better

Seems like a relatively straightforward ask. However, if you’re not careful, you’re going to hit 3 different roadblocks along your journey.

Block #1: Docker Image Size

Maybe it was my search criteria, but many tutorials have you start building a Python Docker container from the official Python image. If you’re not careful, though, by the time you’re ready to deploy beyond your workstation you’ll realize that the underlying image is over 900 MB in size. This is usually followed by the immediate assertion that you can use the Alpine flavor instead to reduce the image size. Yes!! You’ve solved the image size issue.

Block #2: The pyodbc Module

Installing pyodbc on the official Python image is a fairly straightforward command

pip install pyodbc

However, plugging this command into a Dockerfile that uses a Python Alpine image is certainly going to give you issues. As smarter people on the internet will tell you, this has to do with the lack of gcc and/or g++ in the underlying Alpine base (you’re starting to see why the image is so small). But no worries – with the right set of dependencies installed, this block can be taken care of too.

Block #3: SQL Server as a Data Source

Top search results will tell you that the best way to connect to SQL Server from Docker is to make sure you have Microsoft’s ODBC drivers for SQL Server installed. Great! There’s even a dedicated page from Microsoft on how to do it on Linux. But when you’ve gone down the rabbit hole a bit longer, you realize that Microsoft is probably not going to directly support its drivers on Alpine. So what do we do now? Pivot to using the FreeTDS driver, and all is right with the world.

Also, when you’re working with SQL Server, you want to make sure you have your settings done right. Here’s a great article that tells you all about it.

Block #4: FreeTDS Driver Location Within the Container

Now hold on a hot minute, why is there a 4th one listed here? Because once you’ve hopped past all the previous blocks, you’ll probably find that pyodbc is now having trouble finding the driver you just installed. Again, smarter folks of the tech world to the rescue. Essentially, we have to tell pyodbc where to find the installed driver, since it can’t seem to pick the right one up right off the bat.

Putting this all together

TL;DR – Here are the steps we need to follow to get everything working

  1. Use Python (Alpine flavor) as your image base
  2. Install the required dependencies
  3. Pip install pyodbc
  4. Set the connection string
  5. Explicitly tell pyodbc, within your script, where to find your driver (see the sketch below)
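
To make steps 4 and 5 concrete, here is a minimal sketch of the connection piece of the script. The driver path is where the FreeTDS library typically ends up on Alpine, and the server, database and credentials are placeholders – adjust all of them to match your own setup.

# Minimal sketch: point pyodbc straight at the FreeTDS driver library instead of
# relying on a registered driver name. Path and connection details are assumptions.
import pyodbc

conn_str = (
    "DRIVER={/usr/lib/libtdsodbc.so};"   # explicit path to the FreeTDS driver (typical Alpine location)
    "SERVER=my-sql-server.example.com;"  # placeholder server
    "PORT=1433;"
    "DATABASE=my_database;"              # placeholder database
    "UID=my_user;"
    "PWD=my_password;"
    "TDS_Version=7.4;"                   # TDS protocol version for modern SQL Server
)

conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
cursor.execute("SELECT @@VERSION")       # quick sanity check that the connection works
print(cursor.fetchone()[0])
conn.close()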

Here’s a GitHub repo with what I managed to put together – with help from /u/jpens from the Docker subreddit and all the other articles listed.

Conclusion

Now we have a Docker image with a relatively small footprint that uses pyodbc and FreeTDS to connect to SQL Server.

Zombieland: Double Tap Movie Review

Zombieland Double Tap Poster Credit

As most fans of Zombieland will tell you, it’s been a long, 10-year roller coaster of a journey to the sequel. After the first movie was well received, there was so much news that it was hard to determine what was real and what wasn’t. For every piece of seemingly positive news, there was an immediate, inevitable setback. And eventually, after years of back and forth, you give up on the notion of another movie (and yet hold out hope for that 1% chance – just like any optimist might). Half-Life fans know exactly what I’m talking about. In any case, the wait was so worth it.

The Good

There is no sequel curse in this franchise. After waiting a decade for the sequel, one would think the mojo might have disappeared – things have certainly changed over a decade, after all. However, everything seems to pick up right where we left off. Even the cast didn’t seem to age one bit (in the trailers!). There are more than enough odes and nods to the first movie to satisfy even the most hardcore of original fans. And while the original cast still brings it, the whole exercise in reminiscing is made better by the introduction of new cast members. A special shout-out must go to Zoey Deutch as Madison (valley girl tropes usually devolve into annoyance fairly quickly, but not this time!).

The Bad

One of the things that made the original movie great was #TheRules. However, I’m not sure who okayed the editing in the sequel, as there was just not enough time to read the rules displayed on the screen – especially in the beginning of the movie… unless it was supposed to be an attempt at subliminal messaging or something of that sort.

The Ugly

Within the Zombieland universe, there is no such thing.

TL;DR

If you’re a fan of the first movie, or just someone looking for a good zombie comedy, this movie is a must-see! It is a fitting sequel to what should have been at least a trilogy had it been done right years ago.

BAE Texts Analysis

Photo by Jamie Street on Unsplash

A while ago, I came across an interesting project where someone analyzed their iMessage history. While I couldn’t track down that specific project, here’s another example of someone from Reddit analyzing 4 years of iMessage history with his long-distance girlfriend. Having been inspired by this and a few similar projects, I decided to take a crack at it.

BAETA

The BAE Texts Analysis (BAETA – pronounced bay-ta) project was designed to analyze text messages between bae and me. There were also a few constraints to keep in mind:

  1. Ability to analyze messages from both Google Voice and iOS
  2. Ability to visualize findings
  3. Ability to replicate the process for the future

With these constraints in mind, I figured the best way to approach this was with the following tools

  1. Data scripting – Python with the following modules/packages
    • bs4 a.k.a BeautifulSoup (you’ll see why below)
    • nltk – for NLP (Stemming / Lemmatization)
  2. Data Storage – SQL Server
  3. Data Visualization – Tableau / Power BI

Collecting Data

Google Voice

With some of my messages on Google Voice, I had to figure out a way to collect this data. A quick search led me to Google Takeout. Sign in to my account, select all the data I want Google to zip up (depending on your data history, this could take a while), and once it’s all done, I get an email saying my data is ready for download. I followed the steps and, voila, I had all my history from Google Voice downloaded to my computer with the following structure

As you can see, all texts are housed within files with the name Text. Here’s a snippet from one of the files.

Now that I know what the structure of a text / conversation looks like, I use the BeautifulSoup module from Python to navigate the document structure and extract the information I need.
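
Here’s a rough sketch of what that extraction could look like. The tag and class names (div.message, cite.sender, abbr.dt and the q tag) are assumptions about the Takeout markup – inspect your own export and adjust the selectors accordingly.

# Rough sketch: pull author, timestamp and text out of a Google Voice Takeout "Text" HTML file.
from bs4 import BeautifulSoup

def parse_voice_file(path):
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    rows = []
    for msg in soup.find_all("div", class_="message"):
        author = msg.find("cite", class_="sender")   # who sent the message
        when = msg.find("abbr", class_="dt")         # timestamp lives in the title attribute
        text = msg.find("q")                         # the message body
        rows.append((
            author.get_text(strip=True) if author else None,
            when["title"] if when and when.has_attr("title") else None,
            text.get_text() if text else "",
        ))
    return rows

# Example usage with a made-up file name from the export:
# parse_voice_file("Bae - Text - 2019-01-01T00_00_00Z.html")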

iMessage

In order to get access to iMessages I needed three things

  1. A MacBook – there are articles out there that talk about how creating an iTunes backup could give me access to my texts, and after failing a couple of attempts (perhaps I could’ve resolved this with more troubleshooting) on a Windows machine, I decided to switch tactics and use a MacBook instead
  2. Messages App on macOS – Once on the Mac, I had to set it up to sync my iMessages. Here’s an article that gave me more information on where I needed to look for the required file. What I needed to find here was a chat.db file that was generated as a consequence of the iMessage sync. This SQLite database houses all the chats, and I can home in on what I’m looking for from there.
  3. SQLite DB Browser – Once I had the chat.db file from step #2, I moved it to my Windows machine and got DB Browser for SQLite, which helps browse the data within the chat.db file. Here’s what the database structure looked like.

Now that I know what’s in the database, I try to figure out which tables are likely to contain the data I need – message, chat and chat_message_join. The message table has the actual message, but I’m also going to need some other info

  • ROWID to join with message_id from the chat_message_join table
  • chat_id from the chat_message_join table to join against the ROWID from the chat table
  • epoch time manipulation – the article I linked to in the section above gives excellent guidance on what needs to be done to convert the epoch time to human-readable time
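
Putting those pieces together, here is a sketch of what that query can look like. The table and column names follow the structure described above; the date conversion assumes the Apple epoch of 2001-01-01 and a nanosecond-resolution date column (older macOS versions store seconds, so drop the division if the timestamps look wrong), and the chat identifier is a made-up placeholder.

# Sketch: pull sender, timestamp and text out of a copied chat.db with sqlite3.
import sqlite3

APPLE_EPOCH_OFFSET = 978307200  # seconds between 1970-01-01 and 2001-01-01

query = """
SELECT
    m.is_from_me,
    datetime(m.date / 1000000000 + ?, 'unixepoch', 'localtime') AS sent_at,
    m.text
FROM message AS m
JOIN chat_message_join AS cmj ON cmj.message_id = m.ROWID
JOIN chat AS c ON c.ROWID = cmj.chat_id
WHERE c.chat_identifier = ?   -- placeholder phone number / Apple ID for bae
ORDER BY m.date
"""

with sqlite3.connect("chat.db") as conn:
    rows = conn.execute(query, (APPLE_EPOCH_OFFSET, "+15551234567")).fetchall()

for is_from_me, sent_at, text in rows[:5]:
    print("me" if is_from_me else "bae", sent_at, text)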

Python Scripting

Once I got a handle on what information was where, I knew I wanted to extract and retain the following key pieces of information from both data sets

  • Message Author – who sent a message
  • Message Time – what time was the message sent
  • Message – the actual message itself

And now that I know what I want from my source data, I wrote a Python program to do four things

  1. Extract key information from the Google Voice Takeout data
  2. Extract key information from the iMessage data
  3. Analyze the message content from the extracted data – as part of the exercise I also wanted to analyze which words / emojis were used the most (see the sketch after this list)
  4. Store the data for later use – write it to a database within my local SQL Server instance
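
For the word / emoji piece (item 3), here is a rough sketch of the counting logic. The emoji detection is a coarse Unicode-range heuristic and the stopword list is a tiny stand-in for what nltk provides – both are simplifications rather than the exact approach used.

# Rough sketch: count the most common words and emojis across a list of messages.
import re
from collections import Counter

EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)
STOPWORDS = {"the", "a", "to", "and", "i", "you", "it", "of", "is", "in"}  # stand-in for nltk stopwords

def top_words_and_emojis(messages, n=10):
    words, emojis = Counter(), Counter()
    for text in messages:
        emojis.update(EMOJI_RE.findall(text))                 # rough emoji match
        for token in re.findall(r"[a-zA-Z']+", text.lower()):  # simple word tokens
            if token not in STOPWORDS:
                words[token] += 1
    return words.most_common(n), emojis.most_common(n)

# Example usage with a couple of made-up messages:
print(top_words_and_emojis(["omg that was hilarious 😂😂", "see you at 7? 🙃"]))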

Notes

While creating the script and figuring out how to store the data, there were a couple of things I had to keep in the back of my mind.

  1. utf-8 encoding – the data from Google Voice needed to be utf-8 encoded
  2. utf-8 encoding on the SQL Server table – in order to retain the encoding when the data was written to the table, I needed to use nvarchar as the column data type as opposed to the standard varchar (see the sketch below)
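
As a sketch of what note 2 means in practice, here is roughly how the extracted rows can be written to SQL Server with pyodbc. The table and column names are hypothetical, and the driver assumes a local Windows instance with Microsoft’s ODBC driver installed; the key detail is using nvarchar so the emojis survive the round trip.

# Sketch: write (author, time, message) rows to SQL Server, keeping Unicode intact.
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"               # assumed driver on the local machine
    "SERVER=localhost;DATABASE=baeta;UID=my_user;PWD=my_password;"  # placeholder connection details
)
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()

cursor.execute("""
    IF OBJECT_ID('dbo.baeta_messages') IS NULL
    CREATE TABLE dbo.baeta_messages (
        message_author NVARCHAR(50),
        message_time   DATETIME2,
        message_text   NVARCHAR(MAX)   -- nvarchar, not varchar, so Unicode / emojis are retained
    )
""")

rows = [("me", "2019-06-01 09:15:00", "good morning 🙂")]  # made-up sample row
cursor.executemany(
    "INSERT INTO dbo.baeta_messages (message_author, message_time, message_text) VALUES (?, ?, ?)",
    rows,
)
conn.commit()
conn.close()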

Data Visualization

Now that I had extracted the information and put it into tables within my database, it was time to start visualizing. I did a version of all the visualizations in Tableau and kept hitting roadblocks with Unicode displays. Power BI, on the other hand, had no such issues. In any case, I wanted to answer the following questions.

Who Texts More?

Do Messaging Habits (message count) Change With Platform?

Messaging Over Time

Who Initiates Conversations?

Who Talks More?

Word & Emoji Analysis

Future Enhancements

Having answered the questions I set out to answer, there are a couple of things I would like to do in the next iteration of this project

  • Sentiment Analysis – scour the texts and assign a sentiment score either to the texts themselves or to some cluster / aggregation of them
  • Word Use By Author – what word-use trends can we find by looking at messages from each author
  • Visualization – use the workarounds for the Unicode character roadblocks in Tableau (although it’s a shame that a tool as powerful as Tableau has this problem to begin with) and publish to Tableau

Conclusion

This was such a fun project to undertake. Besides sharpening some old skills and learning new ones, the most important part was sharing this with Bae – and while she doesn’t share the same nerd-level enthusiasm as me, it was certainly quite exciting. And with the project base all set, it’s just a matter of time before I return and work on some of the future enhancements.

Breaking Windows 10 with BitLocker

Earlier this year, after being with Apple for over 10 years, I decided to switch back over to Windows, primarily because laptops this side of the fence seemed to offer better configurations for comparable pricing. And having used Windows 10 for over a couple of years, it seemed like a decent enough bet. The one thing I didn’t anticipate, however, was the Windows Update system.

Back in the good ol’ days, if you had a Windows machine, you considered yourself lucky if you got a notification warning you about an impending restart – usually the computer just restarted itself and applied all the updates… because why not. Since those days, Microsoft has instituted the “Active Hours” methodology – allowing you to schedule hours during which it is a complete no-no to install any updates. Hmm, not bad!

So the other day there were a few updates that needed to be installed, and since I wasn’t working on anything important, I decided to give the go-ahead and install whatever updates were recommended. Assuming that the system would apply the updates and land back at the login screen as it had done many times before, I pressed “Restart” and didn’t come back until 24 hours had passed. Bad mistake.

BitLocker

The laptop in question being my personal machine, I had decided from day one that BitLocker was probably overkill. Worst case, I’d just reformat the drive and we’d be good to go. So far so good. Imagine my surprise when no authentication method would work. Ok, what do we do now? Everyone knows the IT adage…

“Have you tried turning it off and on again?” – Any IT Person

Alright, restart it is. Here’s where things get interesting – right after the laptop manufacturer logo, I was greeted by this.

BitLocker Greeting Screen
A BitLocker Screen for reference

Ok, what in God’s good name is BitLocker asking me? I’ve never activated it. And because that simple precursor of a step never happened, there is no access code, no recovery key, nothing – again… because nothing was ever set up. Why in the world, then, does Windows 10 think I have an encrypted drive? And what recovery keys are you talking about?

Now, I’m not sure if anyone else has run into this, but can you imagine waking up one day to find your laptop had decided, on its own, that it was in your best interest to turn on an encryption service without warning? I have read reports of Windows updates deleting files and breaking other things, but this is not something I had heard of or come across until now. Frantic Googling was of little help. Some suggested resolutions included buying a whole new hard drive! Why, thank you!

So, a whole fresh Windows 10 installation and a manual data backup plan later… thanks, Windows, I think I can take it from here.

Visualizing World Happiness Rankings

The Data Is Beautiful subreddit has a monthly visualization challenge where you are given a dataset and challenged to visualize it – the June 2019 challenge was the World Happiness Rankings.

Approach

I used 2 tools to approach this challenge

  • Python – a simple way to combine the 3 sheets that came as part of the dataset (see the sketch after this list)
  • Tableau – to visualize the newly combined data
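
For reference, here is a minimal sketch of that combining step. The workbook name, sheet names and the assumption of a shared set of columns are guesses about the challenge dataset – adjust them to match the file you actually downloaded.

# Sketch: stack the per-year sheets into one table and hand it off to Tableau.
import pandas as pd

frames = []
for year in (2015, 2016, 2017):
    df = pd.read_excel("world_happiness.xlsx", sheet_name=str(year))  # hypothetical file / sheet names
    df["Year"] = year                                                 # tag each row with its source year
    frames.append(df)

combined = pd.concat(frames, ignore_index=True, sort=False)
combined.to_csv("world_happiness_combined.csv", index=False)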

Analysis

The best way to interact with the visualization is by heading over to Tableau Public and viewing the visualization story in Full Screen Mode.

2017 World Happiness Rankings

A filled world map with the world happiness rankings of 2017. As an added feature, I used the data from 2015 and 2016 to find the year in which each country was the happiest.

2017 World Happiness Rankings

Most & Least Happy Countries of 2017

Let’s take a look at the most and the least happy countries of 2017. In addition to the rankings, I also looked at the countries that improved the most and regressed the most as another indicator of happiness. To visualize this, I used a combination of maps (to show the countries on a map) and bar charts (to show the rank progression – plotting the change in rank from 2016 to 2017).

Most & Least Happy Countries 2017

Happiness Rank Movement

To get a better picture of the happiness ranks, it was important to chart the happiness rank journey of the top 10 countries from the 2017 list. The color popularity for new cars visualization from sirvizalot is an excellent starting point for anyone who wants to learn how to leverage this type of visualization.

Happiness Rank Journey

Happiness Scores

Now that the 2017 data has been visualized, it was important to see how the happiness values shifted over the course of 2015 – 2017. The best way to do that was with a box-and-whisker chart.

Comparison of Happiness Values 2015 – 2017

Conclusion

I really like the monthly challenges that the subreddit runs as it gives me a chance to work with a variety of tools and flex those data muscles outside the confines of the 9-5.

Hello World Again

Photo by Hayden Walker on Unsplash

What I have learned over many attempts at writing and maintaining a personal blog, especially after having gone from shared hosting to self-managed cloud hosting, is that maintaining the underlying infrastructure is hard work. After paying a pretty penny on Bluehost, I decided to switch to AWS, with disastrous results – I found myself addressing database leaks that would take down the entire EC2 instance.

In any case, it’s time to get back to writing again, and this time I decided to go with DigitalOcean. Now to keep writing and see if my experiment with DigitalOcean yields the results I’m looking for.