COVID-19 Data Analysis in Brazil and the World

João Gustavo
Published in Analytics Vidhya
12 min read · Jun 21, 2021


To read this article in Portuguese, click here!

COVID-19 is an infectious disease caused by a newly discovered coronavirus.

It is transmitted mainly through droplets from the coughs or sneezes of infected people, and the severity of symptoms varies greatly from person to person.

The fact is, not much is known about COVID-19. Studies are being conducted around the world, but the results are not yet conclusive and definitive.

So far, it has been observed that about 80% of confirmed cases are mild or asymptomatic and resolve quickly. Most people in this group recover without any lasting effects.

However, about 15% of people will have severe infections and will need oxygen. The remaining 5% will have very severe infections and will need assisted ventilation, using mechanical ventilators in a hospital environment.

In order to raise situational awareness about COVID-19 in Brazil, I will conduct an analysis of the public data on the disease.

Data Collection

The data was obtained from the Covid-19 repository that OWID (Our World in Data) maintains on GitHub. The dataset analyzed here is a single .csv file, owid-covid-data.csv, from "Data on COVID-19 (coronavirus) by Our World in Data".
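The original notebook code is not reproduced in this article, so the following is only a minimal sketch of how the file could be loaded with pandas. The helper name `load_owid` and the choice to parse the `date` column are my assumptions, and a tiny in-memory CSV (built from figures quoted later in the article) stands in for the real file so the example runs on its own.

```python
import io

import pandas as pd


def load_owid(source):
    """Read an OWID-style COVID-19 CSV from a path, URL, or file-like object.

    Hypothetical helper: the original notebook may load the data differently.
    """
    # parse_dates converts the `date` column to datetime values,
    # which simplifies filtering by day and computing time differences.
    return pd.read_csv(source, parse_dates=["date"])


# In-memory stand-in for owid-covid-data.csv (two rows, four columns).
sample_csv = io.StringIO(
    "iso_code,location,date,total_cases\n"
    "BRA,Brazil,2021-04-27,14441563\n"
    "USA,United States,2021-04-27,32175725\n"
)

df = load_owid(sample_csv)
print(df.head())
```

With the real data, the same helper would be called as `load_owid("owid-covid-data.csv")`, assuming the file has been downloaded to the working directory.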

Exploratory Data Analysis

This section builds the background needed to understand the insights that come out of the analysis.

The best way to start is to check the size of our DataFrame, so let’s find out how many rows and columns it has.

With just one line of code, we can see that our DataFrame has:

  • 84,530 entries (rows)
  • 59 columns
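The "one line of code" is most likely a call to `DataFrame.shape`; that is my assumption, since the code itself is not shown. A small sketch on a three-row stand-in frame:

```python
import pandas as pd

# Three-row stand-in for the full dataset, using case counts quoted in this article.
df = pd.DataFrame({
    "location": ["Brazil", "India", "United States"],
    "total_cases": [14441563, 17997113, 32175725],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(f"Entries: {n_rows}")
print(f"Columns: {n_cols}")
```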

The DataFrame has many variables; each one is explained in the dictionary below.

Variables Dictionary

  • iso_code - ISO 3166–1 alpha-3 — three-letter country codes
  • continent - Continent of the geographical location
  • location - Geographical location
  • date - Date of observation
  • total_cases - Total confirmed cases of COVID-19
  • new_cases - New confirmed cases of COVID-19
  • new_cases_smoothed - New confirmed cases of COVID-19 (7 days smoothed)
  • total_deaths - Total deaths attributed to COVID-19
  • new_deaths - New deaths attributed to COVID-19
  • new_deaths_smoothed - New deaths attributed to COVID-19 (7-day smoothed)
  • total_cases_per_million - Total confirmed cases of COVID-19 per 1,000,000 people
  • new_cases_per_million - New confirmed cases of COVID-19 per 1,000,000 people
  • new_cases_smoothed_per_million - New confirmed cases of COVID-19 (7-day smoothed) per 1,000,000 people
  • total_deaths_per_million - Total deaths attributed to COVID-19 per 1,000,000 people
  • new_deaths_per_million - New deaths attributed to COVID-19 per 1,000,000 people
  • new_deaths_smoothed_per_million - New deaths attributed to COVID-19 (7-day smoothed) per 1,000,000 people
  • reproduction_rate - Real-time estimate of the effective reproduction rate (R) of COVID-19.
  • icu_patients - Number of COVID-19 patients in intensive care units (ICUs) on a given day
  • icu_patients_per_million - Number of COVID-19 patients in intensive care units (ICUs) on a given day per 1,000,000 people
  • hosp_patients - Number of COVID-19 patients in hospital on a given day
  • hosp_patients_per_million - Number of COVID-19 patients in hospital on a given day per 1,000,000 people
  • weekly_icu_admissions - Number of COVID-19 patients newly admitted to intensive care units (ICUs) in a given week
  • weekly_icu_admissions_per_million - Number of COVID-19 patients newly admitted to intensive care units (ICUs) in a given week per 1,000,000 people
  • weekly_hosp_admissions - Number of COVID-19 patients newly admitted to hospitals in a given week
  • weekly_hosp_admissions_per_million - Number of COVID-19 patients newly admitted to hospitals in a given week per 1,000,000 people
  • total_tests - Total tests for COVID-19
  • new_tests - New tests for COVID-19 (only calculated for consecutive days)
  • total_tests_per_thousand - Total tests for COVID-19 per 1,000 people
  • new_tests_per_thousand - New tests for COVID-19 per 1,000 people
  • new_tests_smoothed - New tests for COVID-19 (7-day smoothed). For countries that don’t report testing data on a daily basis, we assume that testing changed equally on a daily basis over any periods in which no data was reported. This produces a complete series of daily figures, which is then averaged over a rolling 7-day window
  • new_tests_smoothed_per_thousand - New tests for COVID-19 (7-day smoothed) per 1,000 people
  • positive_rate - The share of COVID-19 tests that are positive, given as a rolling 7-day average (this is the inverse of tests_per_case)
  • tests_per_case - Tests conducted per new confirmed case of COVID-19, given as a rolling 7-day average (this is the inverse of positive_rate)
  • tests_units - Units used by the location to report its testing data
  • total_vaccinations - Total number of COVID-19 vaccination doses administered
  • people_vaccinated - Total number of people who received at least one vaccine dose
  • people_fully_vaccinated - Total number of people who received all doses prescribed by the vaccination protocol
  • new_vaccinations - New COVID-19 vaccination doses administered (only calculated for consecutive days)
  • new_vaccinations_smoothed - New COVID-19 vaccination doses administered (7-day smoothed). For countries that don’t report vaccination data on a daily basis, we assume that vaccination changed equally on a daily basis over any periods in which no data was reported. This produces a complete series of daily figures, which is then averaged over a rolling 7-day window
  • total_vaccinations_per_hundred - Total number of COVID-19 vaccination doses administered per 100 people in the total population
  • people_vaccinated_per_hundred - Total number of people who received at least one vaccine dose per 100 people in the total population
  • people_fully_vaccinated_per_hundred - Total number of people who received all doses prescribed by the vaccination protocol per 100 people in the total population
  • new_vaccinations_smoothed_per_million - New COVID-19 vaccination doses administered (7-day smoothed) per 1,000,000 people in the total population
  • stringency_index - Government Response Stringency Index: composite measure based on 9 response indicators including school closures, workplace closures, and travel bans, rescaled to a value from 0 to 100 (100 = strictest response)
  • population - Population in 2020
  • population_density - Number of people divided by land area, measured in square kilometers, most recent year available
  • median_age - Median age of the population, UN projection for 2020
  • aged_65_older - Share of the population that is 65 years and older, most recent year available
  • aged_70_older - Share of the population that is 70 years and older in 2015
  • gdp_per_capita - Gross domestic product at purchasing power parity (constant 2011 international dollars), most recent year available
  • extreme_poverty - Share of the population living in extreme poverty, most recent year available since 2010
  • cardiovasc_death_rate - Death rate from cardiovascular disease in 2017 (annual number of deaths per 100,000 people)
  • diabetes_prevalence - Diabetes prevalence (% of population aged 20 to 79) in 2017
  • female_smokers - Share of women who smoke, most recent year available
  • male_smokers - Share of men who smoke, most recent year available
  • handwashing_facilities - Share of the population with basic handwashing facilities on premises, most recent year available
  • hospital_beds_per_thousand - Hospital beds per 1,000 people, most recent year available since 2010
  • life_expectancy - Life expectancy at birth in 2019
  • human_development_index - A composite index measuring average achievement in three basic dimensions of human development — a long and healthy life, knowledge and a decent standard of living

Data Types

Most of the columns in our DataFrame are of type float (numbers with decimal places). The exceptions are a few columns of type object (non-numeric data, such as text).
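A short sketch of how the column types can be inspected in pandas. The small frame below is a stand-in built from figures quoted in this article, and the variable names `float_cols` and `text_cols` are mine.

```python
import pandas as pd

# Stand-in frame: two text columns and two float columns.
df = pd.DataFrame({
    "location": ["Brazil", "India"],
    "date": ["2021-04-27", "2021-04-27"],
    "total_cases": [14441563.0, 17997113.0],
    "total_deaths": [395022.0, 201187.0],
})

# One dtype per column.
print(df.dtypes)

# Split the column names into float columns and non-numeric columns.
float_cols = df.select_dtypes(include="float64").columns.tolist()
text_cols = df.select_dtypes(exclude="number").columns.tolist()
print("float columns:", float_cols)
print("non-numeric columns:", text_cols)
```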

First Entries

Next, let’s meet our DataFrame by looking at the first 5 entries and seeing what they tell us.

First Five Entries

After checking the first few entries, we can see that some values are missing. However, there is little point in cleaning data about a virus extensively (for example, by filling gaps with estimates), because the recorded values already reflect what was actually reported.

Data Cleaning

Instead, we will look at which variables have the most missing values and then decide which treatment is the most appropriate.

The analysis of the missing data shows that:

  • weekly_icu_admissions, weekly_icu_admissions_per_million, weekly_hosp_admissions and weekly_hosp_admissions_per_million are irrelevant for this analysis, since about 98% of their values are missing, so they will be excluded. The variable tests_units will also be excluded, as it has no relevance to the analysis.
  • icu_patients, icu_patients_per_million, hosp_patients and hosp_patients_per_million record counts for a single day and have more than 87% missing values. These will also be excluded, to keep the analysis as close as possible to reality.
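The two steps above (measuring the share of missing values, then dropping the chosen columns) can be sketched as follows. The toy frame and its missing-value pattern are purely illustrative and do not reproduce the real 98% and 87% figures; only the column names in `to_drop` come from the text.

```python
import numpy as np
import pandas as pd

# Toy frame with an illustrative missing-value pattern (not real data).
df = pd.DataFrame({
    "location": ["Brazil", "Brazil", "Brazil", "Brazil"],
    "total_cases": [1.0, 321.0, np.nan, 14441563.0],
    "icu_patients": [np.nan, np.nan, np.nan, np.nan],
    "tests_units": [np.nan, "tests performed", np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Columns the article decides to exclude.
to_drop = [
    "weekly_icu_admissions", "weekly_icu_admissions_per_million",
    "weekly_hosp_admissions", "weekly_hosp_admissions_per_million",
    "tests_units",
    "icu_patients", "icu_patients_per_million",
    "hosp_patients", "hosp_patients_per_million",
]

# errors="ignore" skips names that are absent from this toy frame.
df_clean = df.drop(columns=to_drop, errors="ignore")
print(df_clean.columns.tolist())
```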

Viewing Data

This part visualizes the data, either through graphs or through views of the DataFrame.

Countries with Most Cases

Next, let’s find which countries have the most cases as of 2021-04-27. In order, they are:

  • World: 148,716,872 cases registered around the world.
  • United States: 32,175,725 cases. According to CNN Brazil, one possible reason for the large number of cases was the failure of the American government to act quickly and decisively to prevent the spread of the virus.
  • India: 17,997,113 registered cases. India has a gigantic population of about 1.366 billion people. Although its territory is smaller than that of the United States or Brazil, its high population density could help explain the number of cases.
  • Brazil: 14,441,563 cases. Brazil is also a country of continental dimensions, but with a smaller population than India and the United States. The high number of cases is likely due to a lack of preparation and to the population's neglect of lockdown measures and mask use, which allowed the virus to spread widely.
  • France: 5,595,403 cases. One of the main reasons for the large number of registered cases was the delay in government action, which allowed the virus to spread even further.
  • Russia: 4,725,252 cases. Russia is a country of continental dimensions with a smaller population than other countries of similar size, but with a climate favorable to the spread of the virus. On a positive note, it is not among the top 5 countries with the most deaths.
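A ranking like the one above can be produced by filtering on the date and taking the largest values of a column. This is a sketch under my own assumptions: the helper `top_locations` is hypothetical, and the frame holds only the figures quoted in this article, plus one earlier Brazil row to show that the date filter matters.

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["World", "United States", "India", "Brazil",
                 "France", "Russia", "Brazil"],
    "date": pd.to_datetime(["2021-04-27"] * 6 + ["2020-03-17"]),
    "total_cases": [148716872, 32175725, 17997113, 14441563,
                    5595403, 4725252, 321],
})


def top_locations(frame, column, date, n=6):
    """Return the n locations with the largest `column` value on `date`."""
    on_date = frame.loc[frame["date"] == pd.Timestamp(date)]
    return on_date.nlargest(n, column)[["location", column]]


top_cases = top_locations(df, "total_cases", "2021-04-27")
print(top_cases.to_string(index=False))
```

The same helper works for deaths by passing `"total_deaths"` as the column, provided that column is present in the frame.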

Countries with Most Deaths

Next, let’s find which countries have the most deaths as of 2021-04-27.

The countries with the most deaths, in order, are:

  • World: 3,134,956 deaths recorded around the world.
  • United States: 573,381 registered deaths. As expected, the country with the most cases is also the country with the most deaths.
  • Brazil: 395,022 registered deaths. It is not the country with the second most cases, but it is the country with the second most deaths, possibly due to the lack of resources and the country’s unpreparedness to deal with the virus.
  • Mexico: 215,547 registered deaths. This is an unexpected case: Mexico is not among the 5 countries with the most cases, but it is among the 5 with the most deaths.
  • India: 201,187 registered deaths. This number is low relative to its number of cases, which may suggest that India dealt with the virus efficiently.
  • United Kingdom: 127,705 registered deaths. It does not appear among the 5 countries with the most cases, but it is one of the 5 with the most deaths, which is definitely a curious case.

Charts of Total Deaths and Cases for the Top 5 Countries

A graphical visualization is often a more efficient way to take in information than text, so I will plot graphs for the 5 countries with the most deaths and cases.

For comparison, I will also include graphs in which the world is present, so we can relate each country's numbers to the global totals.

In the graphs, the higher the bar, the darker its color and the higher the number it represents.
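The article's plotting code is not shown, so here is one possible sketch of a bar chart in which taller bars are drawn darker. I am assuming matplotlib and the `Reds` colormap; the original charts may well use another library or palette. The values are the case counts quoted in this article for 2021-04-27.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

top5 = pd.DataFrame({
    "location": ["United States", "India", "Brazil", "France", "Russia"],
    "total_cases": [32175725, 17997113, 14441563, 5595403, 4725252],
})

# Map each value into the range 0.3..1.0 so the smallest bar stays visible;
# in the Reds colormap, values closer to 1.0 are darker.
shade = 0.3 + 0.7 * top5["total_cases"] / top5["total_cases"].max()
colors = plt.cm.Reds(shade.to_numpy())

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(top5["location"], top5["total_cases"], color=colors)
ax.set_title("Total COVID-19 cases on 2021-04-27")
ax.set_ylabel("Total cases")
fig.tight_layout()
```

To view the result, call `fig.savefig("top5_cases.png")` or, in an interactive session without the `Agg` line, `plt.show()`.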

  • When we plot the world alongside the countries, we can see that although the number of cases in a single country is large, the 148,716,872 confirmed cases in the world amount to approximately 1.9% of the world’s population. In other words, about 98.1% of the world’s population has not had a confirmed Covid-19 infection. We have to keep in mind that 1.9% of 7.866 billion people is still a very large number.
  • When we look at deaths, there are 3,134,956 deaths worldwide against 148,716,872 cases: about 2.1% of the people with a confirmed infection died, while the remaining 97.9% either recovered or were still ill on that date.
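The two percentages can be checked with plain arithmetic on the figures quoted above. Computed directly, confirmed cases come to just under 2% of the world's population, and deaths come to about 2.1% of confirmed cases. The variable names are mine.

```python
# Figures quoted in the article for 2021-04-27.
world_cases = 148_716_872
world_deaths = 3_134_956
world_population = 7_866_000_000  # 7.866 billion, as stated in the text

# Share of the population with a confirmed infection.
share_infected = world_cases / world_population

# Share of confirmed cases that ended in death.
death_share = world_deaths / world_cases

print(f"Confirmed cases as a share of population: {share_infected:.1%}")
print(f"Deaths as a share of confirmed cases: {death_share:.1%}")
```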

Graphs of Deaths and Cases over Time

Next, I will plot graphs so that we can see the evolution of Covid-19 cases and deaths over time. The graphs cover the countries above and the world.

Some of the insights we can take from the graphs are:

  • The United States has always ranked first in the number of cases.
  • Although India is ahead of Brazil in the number of cases, Brazil has far more deaths.
  • The country with the most recorded deaths is the United States.
  • When we put the world on the graph, its growth looks much steeper, since it is the sum of cases and deaths from all the countries on the globe.

Exploratory Analysis for Brazil

Now, for an analysis specific to Brazil, I will make a copy of the DataFrame that contains only the entries with Brazil as the location.

Let’s get to know this dataset by looking at its first 5 entries.

The most recent date present in the DataFrame is 2021-04-27.
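A minimal sketch of this step, assuming the `date` column has been parsed as datetimes. The stand-in frame uses figures quoted in this article, and `df_brazil` is my own name for the copy.

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["Brazil", "Brazil", "India"],
    "date": pd.to_datetime(["2020-03-17", "2021-04-27", "2021-04-27"]),
    "total_cases": [321, 14441563, 17997113],
})

# .copy() gives an independent frame, so later changes do not touch df.
df_brazil = df.loc[df["location"] == "Brazil"].copy()

# Most recent date available for Brazil.
latest = df_brazil["date"].max()
print(df_brazil.head())
print("Most recent date:", latest.date())
```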

First Death in Brazil

The first death in Brazil was registered on:

  • March 17, 2020: by that day there were already 321 cases, and the first Covid-19 death in Brazil occurred.

First Case in Brazil

The first case of Covid-19 in Brazil was recorded on:

  • February 26, 2020 — Date when the first case of Covid-19 was reported in Brazil.

First Case and First Death

We know that the first case and the first death in Brazil occurred on the following dates:

  • February 26, 2020 — First case of Covid-19 recorded in Brazil.
  • March 17, 2020 — First recorded Covid-19 death in Brazil.

Time elapsed between the First Case and the First Death

The time elapsed between the first recorded case and the first death was:

  • 20 days between the first case and the first death.
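The dates and the 20-day gap can be derived from the cumulative columns by taking the earliest date on which each count is above zero. This is a sketch on a three-row stand-in for the Brazil data; the values of 1 on the first-case and first-death dates are my assumption, while the other figures are quoted in this article.

```python
import numpy as np
import pandas as pd

df_brazil = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-26", "2020-03-17", "2021-04-27"]),
    "total_cases": [1.0, 321.0, 14441563.0],   # 1.0 on the first date is assumed
    "total_deaths": [np.nan, 1.0, 395022.0],   # 1.0 on the first death date is assumed
})

# Earliest date with at least one case / one death (NaN > 0 is False).
first_case = df_brazil.loc[df_brazil["total_cases"] > 0, "date"].min()
first_death = df_brazil.loc[df_brazil["total_deaths"] > 0, "date"].min()

# Subtracting two timestamps gives a Timedelta.
elapsed = first_death - first_case
print("First case:", first_case.date())
print("First death:", first_death.date())
print("Days elapsed:", elapsed.days)
```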

Brazil was undoubtedly one of the countries that suffered the most from the coronavirus pandemic. For comparison, we will analyze Brazil alongside the 5 countries with the most cases and the most deaths in South America.

To do this, we will create a copy of the DataFrame that contains only the entries whose continent is South America. Let’s get to know the dataset by looking at the first 5 entries.

Countries with Most Cases

Next, we will find which countries in South America have the most cases on the most recent date.

The countries with the most cases, in order, are:

  • Brazil: 14,441,563 registered cases.
  • Argentina: 2,905,172 registered cases.
  • Colombia: 2,804,881 registered cases.
  • Peru: 1,768,186 registered cases.
  • Chile: 1,179,772 registered cases.
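The South American ranking combines a filter on the `continent` column with a sort on the most recent date. A sketch, using only figures quoted in this article; the two rows from other continents are included to show the filter at work, and `df_sa` and `top_sa` are my own names.

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["South America"] * 5 + ["North America", "Asia"],
    "location": ["Chile", "Peru", "Colombia", "Argentina", "Brazil",
                 "United States", "India"],
    "date": pd.to_datetime(["2021-04-27"] * 7),
    "total_cases": [1179772, 1768186, 2804881, 2905172, 14441563,
                    32175725, 17997113],
})

# Keep only South American entries.
df_sa = df.loc[df["continent"] == "South America"].copy()

# Rank the countries by total cases on the most recent date.
latest = df_sa["date"].max()
top_sa = (
    df_sa.loc[df_sa["date"] == latest]
    .sort_values("total_cases", ascending=False)
    .head(5)[["location", "total_cases"]]
)
print(top_sa.to_string(index=False))
```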

The difference between the number of cases in Brazil and in the other countries is remarkable. It becomes frightening when we discover that the cases in Brazil alone represent more than 50% of all cases in South America.

Countries with Most Deaths

Finally, we will find which countries in South America have the most deaths on the most recent date.

The countries with the most deaths, in order, are:

  • Brazil: 395,022 registered deaths.
  • Colombia: 72,235 registered deaths.
  • Argentina: 62,599 registered deaths.
  • Peru: 60,013 registered deaths.
  • Chile: 26,020 registered deaths.

Here we see once again that Brazil leads in the number of deaths, just as it does in cases. Argentina, however, has more cases than Colombia but fewer deaths.

Charts of Total Deaths and Cases for the Top 5 Countries (South America)

These graphs show the distribution of cases and deaths across the South American countries.

In the graphs we can see that Brazil remains in first place, both for cases and deaths.

Some other insights that we can draw from the graphs are:

  • Brazil accounts for the largest share of both cases and deaths in South America.
  • Brazil represents about 60% of South America’s deaths: 395,022 of the 658,283 deaths registered on the continent.
  • Although Argentina has more cases than Colombia, it has fewer deaths.
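Each country's share of the continent's deaths follows from dividing by the South American total. A short sketch with the death counts quoted in this article; the variable names are mine.

```python
import pandas as pd

# Registered deaths per country on 2021-04-27, as quoted in the article.
deaths = pd.Series(
    {"Brazil": 395022, "Colombia": 72235, "Argentina": 62599,
     "Peru": 60013, "Chile": 26020},
    name="total_deaths",
)
south_america_total = 658_283  # continent total quoted in the text

# Each country's share of the continent's registered deaths.
share = (deaths / south_america_total).sort_values(ascending=False)
print(share.map("{:.1%}".format))

# Share left over for the South American countries outside the top five.
rest_share = 1 - share.sum()
print(f"Other countries: {rest_share:.1%}")
```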

Graphs of Deaths and Cases over Time (South America)

These graphs show the evolution of Covid-19 in the South American countries.

Some of the insights I got from the graphs are:

  • Brazil strongly shapes the South America curve: the two behave similarly and are the closest in number of cases and deaths.
  • Chile remains last among these five countries in both cases and deaths.

Conclusion

This analysis gives us a range of information about COVID-19. For example, we now know that the United States is by far the country with the most cases and deaths, that Brazil is the country with the second most deaths, that India has more cases than Brazil but fewer deaths, and that Brazil is the country in South America that suffered the most from the pandemic. Some other curious facts are:

  • Having more cases does not necessarily mean having more deaths, and vice versa. India and Mexico are examples of this.
  • Mexico is not among the 5 countries with the most cases, but it is among the 5 with the most deaths.

To access the project, click here! Follow me on LinkedIn and keep an eye on my GitHub, where you can find more projects in the future.
