Malaria Detection Using Machine Learning

Dexter Barahona
5 min read · Dec 21, 2019


About 20,000 Indians die from malaria every year, and about 15 million cases are reported.

Machine Learning can be used for various applications, but one of the most interesting is the way it can be applied to helping the medical field.
Computers + Biology = HUGE advances in medicine!

To put it simply, my machine learning program looks for irregularities in cells infected with malaria and compares them to cells that aren't. This makes data gathering far more efficient for third-world countries: once new cell images are fed into the program, they can be classified automatically.

I will explain the steps needed to replicate this program if you would like to try it out yourself! Don't get discouraged if you don't understand something; a quick Google search will definitely help you out, but I will try to explain everything to the best of my ability.

Discovery of the malaria parasite (1880)

The first thing to do when replicating a project is to understand why you're doing it. Malaria has been one of the most prevalent health issues of the past ~150 years. Emerging technologies are quickly closing in on this problem, and if you want to see their real-world applications, I definitely recommend doing what I did:

Today we are going to learn how to create and use a machine-learning model to detect malaria. First you're going to need the correct files:

https://drive.google.com/file/d/1lxVO...

Using this link, you can download the pictures that we are going to be classifying.

After you gather the photos, we can start on the program: open your trusty coding place and let's get to work.

Installing Libraries:

What you're going to want to do first is name the first file "gen_dataset.py", and next you want to import all the necessary libraries. These libraries include:

  • cv2, os
  • numpy as np
  • csv
  • glob

You should have something that looks like this at this point:
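Typed out, the import block for gen_dataset.py is simply the libraries listed above (numpy and csv are in the list even though the snippets below mostly use cv2 and glob directly):

import cv2
import os
import numpy as np
import csv
import glob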

Since we have all the libraries imported, we can continue to code our project.

This is where the code gets complicated, so I will type it out completely and explain each part. If you're still confused, there are many great resources online that explain exactly what each piece of code does.

Coding the Actual Algorithm:

label = "Uninfected" --> This sets the label, which is also the name of the folder of images that will be used to build the dataset.

dirList = glob.glob("cell_images/"+label+"/*.png") --> here the list of image paths is created, and the path in parentheses is the folder it is going to read from.

file = open("csv/dataset2.csv", "a") --> this part of the code opens the .csv file the features will be written to, and "dataset2" will be the name of the dataset.

for img_path in dirList: --> this loops over every image in the folder.

im = cv2.imread(img_path) --> this reads the image from disk.
im = cv2.GaussianBlur(im, (5,5), 2) --> a Gaussian blur is applied to make it easier to tell malaria-infected cells apart from those that aren't.

im_gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY) --> this converts the image to grayscale, because color is not needed for detecting malaria in cells.

ret, thresh = cv2.threshold(im_gray, 127, 255, 0) --> This applies a threshold to the image to make it easier to find contours.
contours,_ = cv2.findContours(thresh, 1, 2) --> Contours are outlines of shapes in the image; their areas are the features used to detect malaria.

file.write(label)
file.write(",") --> these two lines write the label as the first column of the row.

for i in range(5):
    try:
        area = cv2.contourArea(contours[i])
        file.write(str(area))
    except:
        file.write("0")
    file.write(",")

file.write("\n") --> this finishes writing the row for this image; each row contains the label followed by the areas of the first five contours, with 0 filled in when an image has fewer than five.

Gathering the Data:

This piece of code is going to be run twice: once with the label "Uninfected" and once with the label "Infected." For the uninfected run, make sure the output file is named dataset2.csv, and for the infected run use dataset1.csv, as shown in the sketch below.
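If it helps to see the whole first script in one place, here is a minimal sketch assembled from the snippets above. The folder and file names come straight from this walkthrough; the tidier area loop, the os.makedirs call, and the two-value findContours return (OpenCV 4) are my own assumptions, so adjust them to your setup:

import cv2
import glob
import os

label = "Uninfected"                   # change to "Infected" for the second run
out_path = "csv/dataset2.csv"          # use "csv/dataset1.csv" for the infected run

os.makedirs("csv", exist_ok=True)      # make sure the csv folder exists
dirList = glob.glob("cell_images/" + label + "/*.png")

file = open(out_path, "a")
for img_path in dirList:
    im = cv2.imread(img_path)                          # read the cell image
    im = cv2.GaussianBlur(im, (5, 5), 2)               # smooth out noise
    im_gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)     # drop the color information
    ret, thresh = cv2.threshold(im_gray, 127, 255, 0)  # binarize the image
    contours, _ = cv2.findContours(thresh, 1, 2)       # outlines found in the image

    # the areas of the first five contours are the features, padded with 0
    areas = []
    for i in range(5):
        try:
            areas.append(str(cv2.contourArea(contours[i])))
        except IndexError:
            areas.append("0")

    file.write(label + "," + ",".join(areas) + "\n")   # one row per image
file.close()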

You are now finished with the first part and should have two files, “dataset1.csv” and “dataset2.csv.”

Combining the Data:

Simply combine the two datasets into one dataset named "dataset.csv".
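There are many ways to do this; here is one simple sketch in Python. The column names other than "Label" are placeholder names I chose; "Label" is the column the second script expects:

out_file = open("csv/dataset.csv", "w")
out_file.write("Label,Area1,Area2,Area3,Area4,Area5\n")   # header row (assumed names)
for part in ("csv/dataset1.csv", "csv/dataset2.csv"):      # infected + uninfected rows
    with open(part) as f:
        out_file.write(f.read())
out_file.close()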

Now we move onto the second part of the code:

Importing Libraries for New Code:

You once again want to import the necessary libraries:

  • pandas as pd
  • from sklearn.model_selection import train_test_split
  • from sklearn.ensemble import RandomForestClassifier
  • from sklearn import metrics
  • joblib

It should look a little something like this:
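Typed out, the import block for the second script is just the list above:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import joblib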

Next, we have the actual algorithm itself:

dataframe = pd.read_csv("csv/dataset.csv") --> This is where the .csv file is read into a dataframe.

print(dataframe.head()) --> this part of the code displays the first few rows of the dataframe.

x = dataframe.drop(["Label"],axis=1)
y = dataframe["Label"]
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=42) --> this is where the data is split into a training set and a test set.

model = RandomForestClassifier(n_estimators=100,max_depth=5)
model.fit(x_train,y_train) --> this part trains the model on the training data.


joblib.dump(model,"rf_malaria_100_5") --> this saves the trained model to a file so it can be reused later.

predictions = model.predict(x_test) --> First it has to make predictions on the test set to measure how accurate it is going to be.

print(metrics.classification_report(predictions,y_test)) --> This is the actual test: it prints the precision, recall, and accuracy of those predictions.
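If it helps to see the whole second script in one place, here is a minimal sketch assembled from the snippets above. It assumes dataset.csv has a "Label" column plus the five area columns from the combining step:

import joblib
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

dataframe = pd.read_csv("csv/dataset.csv")
print(dataframe.head())                                # sanity-check the data

x = dataframe.drop(["Label"], axis=1)                  # the five contour areas
y = dataframe["Label"]                                 # "Infected" / "Uninfected"
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, max_depth=5)
model.fit(x_train, y_train)                            # train on 80% of the data

joblib.dump(model, "rf_malaria_100_5")                 # save the trained model

predictions = model.predict(x_test)                    # predict on the held-out 20%
print(metrics.classification_report(predictions, y_test))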

Your code and dataset are now complete.

You can see that the accuracy is around 90%, and it improves with the kind of classifier you use and the number of pictures. If you tried this yourself, you know there were about 27,000 pictures for the algorithm to go through. The more accurate you want the model to be, the more pictures you're going to need.

Thank you for reading and hopefully following this tutorial with me!

Social Media: 👤

If you enjoyed this article or have any questions or concerns, please contact me at dexteralxbarahona@gmail.com

Connect with me on LinkedIn at https://www.linkedin.com/in/dexter-barahona-723314194

Instagram: DexterBarahona
