Classification of Healthy and Cancerous Lung CT Scans using topological data analysis

Serge Assaad

Duke University


Abstract

The purpose of this study is to find a way to use Topological Data Analysis (TDA) to differentiate between CT scans of healthy patients and patients with lung cancer.

Methods: The data set consists of 10 CT scans of healthy patients and 10 CT scans of patients with lung cancer. Edge detection methods Sobel and Canny were used to obtain a point cloud of edges, and a Rips-Collapse algorithm was run to obtain 1-dimensional persistence diagrams. Two types of feature vectors were then extracted, the first having coordinates of the persistence of the 2nd most persistent 1-cycle and the most persistent 1-cycle. The 2nd type of feature vector had coordinates of the persistence of the 3rd most persistent 1-cycle and the 2nd most persistent 1-cycle. The parametric curves obtained were then fitted to a line, and the r2 and slope values were computed for each fit. The differences in the means of the r2 values and slope values between healthy and sick patients were then calculated.  Permutation tests were run on the r2 values and slope values, with the two groups being the healthy and sick patients.

Results: It was found that the method yielding the smallest percent error rate was the one using the Sobel edge detector for the 2nd type of feature vector, by comparing r2 values of the linear fits.

Conclusion: Based on the study, the method using the Sobel edge detector for the 2nd type of feature vector seems like the most promising to use in order to design a classifier for lung CT scans. In the future, in order to increase the significance of the results, more patients, more images per patient, and more time would be needed.


Introduction

Lung cancer is the leading cause of cancer death and second most common cancer among both men and women in the United States according to Center for Disease Control and Prevention (CDC) [1].  Because of these statistics, it is important to look for new ways to diagnose lung cancer in order to try to decrease the mortality due to the disease and try to decrease the prevalence of the disease across all populations in the United States. In biggest demand right now are specialists to treat the United States' aging population that is being diagnosed with cancer at rapid rates. Computers that will potentially be able to assess whether a patient has cancer could help serve as back-up in places where there may be a shortage of pulmonologists, particularly rural and poor American cities. There is a need to develop computational methods to help diagnosis, which would not only accelerate the work rate of current physicians, but also help improve the accuracy of diagnoses. In this study, methods to extract features from visual data will be used, particularly when it comes to the detection of “1-cycles”, or closed loops, in the images. These extracted features will then be used to differentiate between normal and cancerous lung scans. Before moving forward, it would be useful to know the following definitions:

1-cycle: Roughly, a closed loop on a 2D image

Persistence of a cycle: Roughly, the size of that cycle. If a cycle is very persistent, that means it appears as very large on the image.

Edge detection: An algorithm which detects rapid variations in intensity. When given an image, it will output what it thinks are the edges appearing in the image.

Rips-Collapse algorithm: Algorithm to detect 1-cycles in a graph

Feature vector: An object containing multiple pieces of information, or “features” (for our purposes, the vectors will mostly be 2-dimensional).

Type 1 feature vector: For simplicity, we will name “Type 1 feature vector” any feature vector containing 2 specific pieces of information, namely, the persistence (size) of the 2nd largest 1-cycle as the first entry, and the persistence of the largest 1-cycle as the 2nd entry.

Type 2 feature vector: For simplicity, we will name “Type 2 feature vector” any feature vector containing 2 specific pieces of information, namely, the persistence (size) of the 3rd largest 1-cycle as the first entry, and the persistence of the 2nd largest 1-cycle as the 2nd entry.


Methods

For each patient, 20 linearly spaced images from their CT scans were chosen with the hope of finding distinguishing homological features between healthy patients (Figure 1) and patients diagnosed with lung cancer (Figure 2).

                                              Figure 1: CT Scan of a healthy lung

                                              Figure 1: CT Scan of a healthy lung

                                          Figure 2: CT Scan of an unhealthy lung

                                          Figure 2: CT Scan of an unhealthy lung

Four different methods of feature extraction were used. This was achieved by manipulating 2 different variables: first, the type of edge detection algorithm used (Canny or Sobel detection). Second, which pieces of information were extracted from the images, and chosen as the x and y axes (Type 1 or Type 2 feature vectors).


Results

Here are the percent accuracy ratings found for each of these 4 methods (this was done using a permutation test).

Table 1: Percent error for each method

Table 1: Percent error for each method

It was found that the method yielding the smallest percent error rate was the one using the Sobel edge detector for the 2nd type of feature vector, by comparing r2 values of the linear fits. The persistence diagrams for healthy and sick patients for this particular method are shown in Figures 3 and 4.

  Figure 3: Persistence diagrams for healthy patients (Type 2 vectors, Sobel detection)

  Figure 3: Persistence diagrams for healthy patients (Type 2 vectors, Sobel detection)

Figure 4: Persistence diagrams for sick patients (Type 2 vectors, Sobel detection)

Figure 4: Persistence diagrams for sick patients (Type 2 vectors, Sobel detection)


Discussion

It was determined that the plots of the Type 1 feature vectors (second largest vs. largest cycle) were not as significant since every lung CT scan regardless of whether the patient is healthy or sick would likely have the same most persistent 1-cycle. This 1-cycle represents the mediastinum (space between the two lungs) and the heart. Therefore, the plots of the first type of feature vectors rely solely on differences in the second largest cycle of the sick and healthy patients, which makes it more difficult to differentiate between the two.

A possible explanation of why the second type of feature vectors (3rd most persistent 1-cycle vs. 2nd most persistent 1-cycle) plots are different between the healthy and sick patients is because the 2nd or 3rd most persistent 1- cycle in a sick patient would be representative of the tumor. This may introduce an irregularity in the supposedly linear trend of the data, thus reducing the r2 value of the linear fit.

One reason that the Sobel detector may be more accurate than the Canny detector is because the Sobel detector only gives a basic outline of the image whereas the Canny detector outputs the small intensity variations in the lungs in great detail. This may produce more persistent 1-cycles than needed and thus potentially skew the results.


Conclusion

In the end the results showed promise when using the Sobel edge detector on the Type 2 feature vectors to extract r2 values. Based on our study, this method seems to be the most promising if one were to design a classifier for lung CT scans, given that it has the lowest percent error rate.

This study serves as a proof-of-concept of the applications of Machine Learning and Data Analysis in healthcare. Of course, given its limitations, much further research needs to be done before accepting this as a valid method for diagnosing cancer, but there is hope that future work will be able to improve upon this idea.


References

[1] Ammanagi, A. S., Dombale, V. D., Miskin, A. T., Dandagi, G. L., Sangolli, S. S. (2012). Sputum cytology in suspected cases of carcinoma of lung (Sputum cytology a poor man's bronchoscopy!). Lung India: Official Organ of Indian Chest Society, 29(1), 19-23.

http://doi.org/10.4103/0970-2113.92356