
Is Family Group That Bad? Results Will Shock You

Analyzing WhatsApp Group Chats & Building the Web App

Text Processing, Plotly Graphs & Heroku Deployment

Who doesn’t know about WhatsApp? It is one of the most widely used mobile applications, regardless of the device’s operating system. We all use it to get our work done quickly and on the go. I don’t know about other countries, but in India family groups are criticized a lot for the spammy or false messages shared in them. This also means that a lot of data is generated, and WhatsApp gives the option to export this data! In this article, I will show you how to dig into this data to discover hidden facts and finally build a deployable web app.

Photo by AARN GIRI on Unsplash

Before moving ahead, let us see how to export this data from the groups:

Image by Author (Created in Google Drawings)

Data Pre-Processing

The exported text file contains the timestamp of each message, its author, and the message text. Data processing and manipulation are much easier once the data is in a pandas data frame, so we will convert this textual information into one. A single entry in the file looks like this:

21/04/20, 5:47 pm - Author Name: Message sent

For every line, a simple regular expression can extract the date and time from the entry, and splitting the remainder at the first colon gives the author and the message. There are a few problems here, though:

  1. A multiline message doesn’t get a new timestamp for each line, so the continued lines need to be appended to the message they belong to.
  2. There are some WhatsApp default messages, such as "Messages are end-to-end encrypted" or "XYZ added you to the group", which may or may not have a timestamp and are not multiline. These can break our logic. They are also irrelevant to our analysis and can be skipped.

Considering these cases, the code for data frame extraction can be quoted as:
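A minimal sketch of this parsing logic is given below. It is not the repository's implementation: the regular expression, the `parse_chat` name, the column names `Date` and `Author`, and the sample lines are assumptions based only on the entry format shown above (day/month/two-digit year, 12-hour clock). A system notice that has no timestamp at all would still be treated as a continuation line here.

```python
import re
from datetime import datetime

# Assumed entry format: "21/04/20, 5:47 pm - Author Name: Message sent"
LINE_START = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{2}), (\d{1,2}:\d{2})\s?([AaPp][Mm]) - (.*)$"
)


def parse_chat(lines):
    """Turn raw export lines into a list of {Date, Author, Message} dicts."""
    records = []
    current = None  # the message that continuation lines belong to
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_START.match(line)
        if match is None:
            # No timestamp: treat it as a continuation of the previous message.
            if current is not None:
                current["Message"] += "\n" + line
            continue
        date_part, time_part, meridiem, rest = match.groups()
        author, sep, message = rest.partition(": ")
        if not sep:
            # Timestamp but no "Author: " prefix: a WhatsApp system notice, skipped.
            current = None
            continue
        stamp = datetime.strptime(
            f"{date_part} {time_part} {meridiem.upper()}", "%d/%m/%y %I:%M %p"
        )
        current = {"Date": stamp, "Author": author, "Message": message}
        records.append(current)
    return records


# Made-up sample lines covering both problem cases.
sample = [
    "21/04/20, 5:40 pm - Messages to this group are now secured with end-to-end encryption.",
    "21/04/20, 5:47 pm - Author Name: Message sent",
    "second line of the same message",
    "21/04/20, 5:50 pm - XYZ added you",
    "22/04/20, 8:05 am - Other Person: <Media omitted>",
]
records = parse_chat(sample)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(records)` to get the data frame used in the rest of the analysis.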

I have not included the whole implementation here, but if you are curious, you can head over to my GitHub repository:

kaustubhgupta/WhatsApp-Groups-Analyser

Apart from converting the chat into a data frame, I am also interested in the emojis shared in the group. This can be done with the emoji library. For every message, we check whether it contains any emojis and create a separate column for their count as well. For demonstration purposes, I will analyze my family group. Let’s see what the new dataset looks like:

df.tail(10)
Image by Author

This is the tail of the data frame, and you can see that there are around 9k rows. The date column has a date-time data type for easier manipulation. Because the data was exported without media, WhatsApp has replaced every message that contained media with the tag "<Media omitted>". Now that our dataset is ready, it’s time to answer some questions.
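The emoji columns described above could be built roughly as follows. The article relies on the emoji library for detection; as a dependency-free stand-in, this sketch treats every character in the Unicode category "So" (Symbol, other) as an emoji, which is cruder and will miscount some symbols and multi-character emojis. The helper `extract_emojis`, the column names `Emojis` and `EmojiCount`, and the sample messages are hypothetical.

```python
import unicodedata

import pandas as pd


def extract_emojis(text):
    """Return the characters of `text` whose Unicode category is So."""
    return [ch for ch in text if unicodedata.category(ch) == "So"]


# Made-up messages standing in for the parsed chat.
df = pd.DataFrame({"Message": ["Good morning 🙏🙏", "ok", "😂 so true"]})

# One column with the list of emojis found, one with how many there are.
df["Emojis"] = df["Message"].apply(extract_emojis)
df["EmojiCount"] = df["Emojis"].str.len()
```

Keeping the list of emojis next to the count makes it possible to rank individual emojis later without scanning the messages again.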

Do family groups only have media chats?

All messages containing media are replaced with the same "omitted" placeholder, which makes it easy to select these messages, count them, and divide by the total number of messages to get the percentage of messages that are media. The code implementation is:

((df[df['Message'] == ' <Media omitted> '].Message.count()) / (df.Message.count()))*100

When I ran this command on my family group dataset, it returned 53.6%, which means that roughly 53 out of every 100 messages are a photo, video, or GIF. Surprisingly, when I ran the same command on my college group dataset, it returned only 3%! So, to some extent, we can say that family groups carry more media, but this statement is highly arguable: my college group has little media, while other college groups may have more in the form of assignments.

Do family groups only use one type of emoji?

Emojis are everyone’s favorite thing. Sometimes people exchange only emojis to convey their emotions and talk less. It will be fun to explore how emojis are distributed in the group. Before moving ahead, there is one thing to consider. My native language is Hindi, and when I extracted the emojis from the messages, some Hindi characters were categorized as emojis. Let’s look at the 10 most frequent emojis in the group:

(Family Group Distribution) Image by Author

This is a donut plot, and by looking at it I can say that the group has a lot of variety! The namaste emoji (🙏) accounts for 32% of the total emojis used. In total, 20055 emojis were used, of which 498 were unique. If we look at the college group distribution, the 😂 emoji leads the chart:

(College Group Distribution) Image by Author
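The numbers behind such a donut plot (total emojis, unique emojis, and the share of each of the top 10) can be computed with a simple counter. This is only a sketch on made-up data: `emojis_per_message` is a hypothetical list holding the emojis found in each message.

```python
from collections import Counter

# Hypothetical input: one list of emojis per message.
emojis_per_message = [["🙏", "🙏"], [], ["😂"], ["🙏", "🌹"]]

counts = Counter(e for message in emojis_per_message for e in message)
total = sum(counts.values())    # all emojis used
unique = len(counts)            # distinct emojis
top10 = counts.most_common(10)  # [(emoji, count), ...], most frequent first

# Percentage share of each top emoji, relative to all emojis used.
shares = {emoji: round(100 * n / total, 1) for emoji, n in top10}
```

The `shares` dictionary holds the values that would go on the donut slices.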

Let’s explore some more stats to understand the nature of the groups.

Active and Lazy Members

We have data for every single day since the group was formed, so we can analyze the members’ activity over any period. Here we will consider overall activity since the first day. To do this, we group the data by author, apply count as the aggregation, and draw a bar plot. Another option is the value_counts function, which directly gives the count for each author. Whichever method you choose, you can plot the result; here is an example:

Image by Author

If you look closely, you can see that the most active member sent around 50% of the total group messages! You can plot the same for lazy members by taking the lowest counts as the deciding factor:

Image by Author

When I ran the same commands on my college group, the results were almost the same, although I was expecting a bump here. (That is because I lost a lot of data while cleaning my phone, so I now have only limited data for that group.)
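Both counting methods mentioned in this section can be sketched as follows. The author names and messages are made up, and the column names `Author` and `Message` are assumed to match the data frame built earlier.

```python
import pandas as pd

# Made-up chat data for illustration.
df = pd.DataFrame({
    "Author": ["Asha", "Ravi", "Asha", "Meena", "Asha", "Ravi"],
    "Message": ["hi", "hello", "<Media omitted>", "ok", "good morning", "bye"],
})

# Method 1: group by author and count the messages.
by_group = df.groupby("Author")["Message"].count().sort_values(ascending=False)

# Method 2: value_counts gives the same numbers, sorted in descending order.
by_value = df["Author"].value_counts()

most_active = by_value.head(10)   # candidates for the "active" bar plot
least_active = by_value.tail(10)  # candidates for the "lazy" bar plot

# Share of all messages sent by the single most active member.
share_of_top = 100 * by_value.iloc[0] / len(df)
```

Either series can be handed to a bar plot directly, with the author names on one axis and the counts on the other.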

Night Owls or Early Birds?

This question is pretty straightforward: which members are more active in the morning, and which at night? A family group usually includes all the relatives and senior citizens, and since they tend to wake up early, I expected more morning activity in the family group and more night activity in the college group. In my analysis, I found that activity peaked between 8 and 9 am in the family group and between 11 am and 1 pm in the college group. The college group peaks then because this is the prime time when we discuss whether we should attend our lab or not!

Family Group Activity (Image by Author)

Now look at the other group’s activity:

College Group Activity (Image by Author)
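A rough sketch of how such an hourly activity profile can be computed, assuming a `Date` column with a date-time data type as described earlier. The timestamps below are made up.

```python
import pandas as pd

# Made-up timestamps standing in for the parsed chat.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-04-21 08:15",
        "2020-04-21 08:40",
        "2020-04-21 17:47",
        "2020-04-22 08:05",
    ]),
    "Message": ["good morning", "<Media omitted>", "hello", "namaste"],
})

# Count messages per hour of the day; hours without messages are kept as 0
# so that the plot always shows all 24 hours.
per_hour = df["Date"].dt.hour.value_counts().reindex(range(24), fill_value=0)

# The hour with the most messages.
peak_hour = int(per_hour.idxmax())
```

Plotting `per_hour` for each group side by side makes the difference between the morning and midday peaks easy to see.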

Group Status on Holidays

I took the analysis even further to check group activity on prominent holidays. Here I consider national holidays in India. There is a strong opinion that on holidays family groups are flooded with messages from unidentified sources, and to some extent it’s true. While the college group shows little or no activity on these days, the family group showed noticeably higher activity. The dates chosen were Republic Day, Independence Day, Christmas, and a few other holidays. I was not able to capture patterns on more popular festivals such as Diwali or Holi because they don’t fall on a fixed date; their dates change every year. Here are some of the graphs generated:

Image by Author
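Because these holidays fall on the same date every year, the messages can be matched on month and day alone. The sketch below assumes a `Date` column with a date-time data type. The `FIXED_HOLIDAYS` mapping and the `Holiday` column are hypothetical names, and the timestamps are made up.

```python
import pandas as pd

# Fixed-date holidays keyed by (month, day).
FIXED_HOLIDAYS = {
    (1, 26): "Republic Day",
    (8, 15): "Independence Day",
    (12, 25): "Christmas",
}

# Made-up timestamps standing in for the parsed chat.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-01-26 07:30",
        "2020-01-26 09:10",
        "2020-08-15 08:00",
        "2020-04-21 17:47",
        "2019-12-25 10:00",
    ]),
})

# Label each message with its holiday, or None on ordinary days.
keys = zip(df["Date"].dt.month.tolist(), df["Date"].dt.day.tolist())
df["Holiday"] = [FIXED_HOLIDAYS.get(key) for key in keys]

# Number of messages sent on each holiday, summed over all years.
per_holiday = df.dropna(subset=["Holiday"]).groupby("Holiday").size()
```

Festivals such as Diwali or Holi would need a per-year lookup of dates instead of a fixed (month, day) key.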

Putting It All Together

The analysis part is over, and now it’s time to put everything together in a structured format. To make this analysis accessible to every user, so that they can get reports for their own uploaded data, I built a web application. I structured the analysis code into three modules. One module processes and cleans the data and creates the data frame. A second module generates all the stats I presented here (only the raw data), and a third generates the interactive graphs to be rendered on the web. The application is built with the Flask framework, Jinja templating connects the backend to the web frontend, and I deployed it on Heroku. The web app is named Whatsapp-Analyzing.

Demo showing how to use this web app
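The upload-and-report flow can be sketched with a very small Flask app. This is not the code of the deployed application: the route, the form field name `chat`, the inline template `PAGE`, and the placeholder `build_stats` function (which only counts non-empty lines) are all assumptions made for illustration.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Inline Jinja template: an upload form plus a result line once stats exist.
PAGE = """
<form method="post" enctype="multipart/form-data">
  <input type="file" name="chat"><button type="submit">Analyse</button>
</form>
{% if stats %}<p>{{ stats.messages }} lines uploaded</p>{% endif %}
"""


def build_stats(text):
    """Placeholder for the processing and stats modules: counts non-empty lines."""
    return {"messages": sum(1 for line in text.splitlines() if line.strip())}


@app.route("/", methods=["GET", "POST"])
def index():
    stats = None
    uploaded = request.files.get("chat")
    if request.method == "POST" and uploaded is not None:
        # Decode the uploaded export and hand it to the analysis code.
        stats = build_stats(uploaded.read().decode("utf-8", errors="replace"))
    return render_template_string(PAGE, stats=stats)
```

In a real app, `build_stats` would call the data-frame, stats, and graph modules in turn, and the template would live in its own file.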

Conclusion

In this article, we covered how to access WhatsApp data, process the information, answer some popular questions, and finally deploy the application on a cloud platform. A lot more can be discovered, depending on how well you understand the data. I have not included the code for the Plotly graphs shown in the images, as this would make the article messy. All the code can be found in my GitHub repository.

The answer to the question raised in this article, "Is Family Group That Bad?", is that the claim is baseless. Every group shows almost the same trends. Yes, there are a few factors where one type of group prevails over the other, but they cannot be generalized. Let me know your thoughts in the comments, and follow me on Medium to get notified about more insightful stories. With that said, Sayonara!

You can connect with me on:

Kaustubh Gupta – Machine Learning Writer – upGrad | LinkedIn

kaustubhgupta – Overview

