Introduction / Background

This presentation is dedicated to all who are not with us today as a result of negligent technical system failures throughout the world…

· September 11th

· Therac-25

· Kursk tragedy

· Chernobyl

Introduction / Background

· The recent Institute of Medicine (IOM) report on the quality of care states that hospital errors cause between 44,000 and 98,000 deaths every year in American hospitals.¹ - Source: Dr. Brennan, New England Journal of Medicine

· Although the use of computer-aided technology is urged and has gained wide popularity throughout the medical field, its performance, according to some physicians, is far from adequate.²

- Source: Dr. Kassirer, New England Journal of Medicine

· Information technology has become so pervasive that some authors see this pervasiveness as a clear sign that we have moved from the industrial revolution to the information revolution.³ (p. 1)

- Source: Martin WainWright Information Management System

· Yet, as sophisticated as we are, there is a great lack of information on hospital adverse events, and even less data is available on computer-related adverse events.

· No organized, standardized reporting of errors is available for studying mistakes and what can be done about them. This type of analysis is available in the aircraft industry and has contributed greatly to improving the safety of flying.

· A plea has been made for the institution of a non-partisan data collection agency to which fatalities and anonymous error reports could be sent, and which would analyze and publish this data on a periodic basis. ¹⁰

- Source: Dr. Myhre, Transfusion Medical Journal

· Nonetheless, despite these limitations, there is a universal effort to improve the healthcare system as much as possible, not only within the United States but also abroad. ¹¹

- Source: de Dombal FT, New England Journal of Medicine

· According to Eta S. Berner, one of the authors of the article "Performance of Four Computer-Based Diagnostic Systems," computer-based diagnostic systems are available commercially, but there has been limited evaluation of their performance.

Along with her team, Berner assessed the diagnostic capabilities of four systems: Dxplain (PC version 4.5), Iliad (version 4.0), Meditel (version 2.0), and QMR (version 2.03). ⁹ They concluded that the programs should be used by physicians who can identify and use relevant information and ignore the irrelevant information that can be produced.

What were the Research Methods?

Overview

· 10 expert clinicians created a set of 105 diagnostically challenging clinical case summaries involving actual patients.

· These experts consisted of nationally recognized consultants in the fields of general internal medicine, 8 subspecialties of internal medicine, and neurology.

· Clinical data were entered into each program with the vocabulary provided by the program’s developer.

· The group of experts produced a ranked list of possible diagnoses for each patient.

· Then, each of the systems produced a ranked list of possible diagnoses.

· When the experts’ lists of possible diagnoses were compared with each computer system’s list, scores were calculated on 5 performance measures for each computer program.

Score Interpretation / 5 Measures Involved

· The first 2 scores were based on the entire list of diagnoses that the programs generated:

1) Correct Diagnosis Score - reflected the proportion of the diagnoses generated by the computer that were correct or closely related to the diagnosis that was considered to be correct. Ex: Is it correct?

2) Rank Score - reflected the average rank of the correct (or closely related) diagnosis as it appeared on the computer-generated list. Ex: How correct was it in terms of a score?

· 3 other scores were derived by reviewing the first 20 diagnoses listed by each program.

1) Comprehensiveness Score - reflected the average proportion of the appropriate diagnoses agreed on by the experts that were included on a computer-generated list. It reflected the extent to which the computer suggested all the diagnoses that the experts had not originally listed but, in retrospect, agreed were reasonable to consider. Ex: How well did the program understand the patient’s dilemma?

2) Relevance Score - reflected the average proportion of computer-generated diagnoses that the experts found reasonable to consider, given the clinical data. ⁹ Ex: Was the particular diagnosis relevant?

3) Additional Diagnosis Score - reflected the average number of additional diagnoses suggested by the computer that the experts considered appropriate after their final review of other cases. ⁹ Ex: Are additional diagnoses appropriate or not?
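As a rough illustration of how the 5 measures could be computed, here is a minimal Python sketch. It is not the authors' scoring procedure. The function names, the set-based inputs, and the toy diagnoses are all hypothetical. The Correct Diagnosis Score is read here as the share of cases whose list contains a correct or closely related diagnosis, and the Additional Diagnosis count as agreed-appropriate diagnoses that the experts had not originally listed; both are interpretations of the slide wording.

```python
TOP_N = 20  # three of the scores look only at the first 20 diagnoses


def score_case(program_list, correct, expert_original, expert_agreed, reasonable):
    """Score one case for one program (toy inputs; assumes non-empty lists and sets)."""
    top = set(program_list[:TOP_N])
    # 1-based rank of the first correct or closely related diagnosis, None if absent
    rank = next((i for i, dx in enumerate(program_list, start=1) if dx in correct), None)
    return {
        "has_correct": rank is not None,
        "rank": rank,
        # share of the expert-agreed diagnoses that the program also listed
        "comprehensiveness": len(top & expert_agreed) / len(expert_agreed),
        # share of the program's top diagnoses the experts found reasonable
        "relevance": len(top & reasonable) / len(top),
        # agreed-appropriate diagnoses that the experts had not originally listed
        "additional": len((top & expert_agreed) - expert_original),
    }


def average(values):
    values = list(values)
    return sum(values) / len(values) if values else None


def summarize(case_scores):
    """Average the per-case results into the 5 measures for one program."""
    return {
        "correct_diagnosis": average(int(s["has_correct"]) for s in case_scores),
        # averaged only over cases in which a correct diagnosis appeared
        "rank": average(s["rank"] for s in case_scores if s["rank"] is not None),
        "comprehensiveness": average(s["comprehensiveness"] for s in case_scores),
        "relevance": average(s["relevance"] for s in case_scores),
        "additional": average(s["additional"] for s in case_scores),
    }


# Two made-up cases with placeholder diagnosis labels
cases = [
    score_case(["A", "B", "C", "D"], {"C"}, {"C", "E"}, {"B", "C", "E", "F"}, {"A", "B", "C"}),
    score_case(["X", "Y"], {"Z"}, {"Z"}, {"Z", "X"}, {"X"}),
]
summary = summarize(cases)
print(summary)
```

In this sketch the rank is averaged only over the cases in which a program listed a correct diagnosis, so each program's rank can rest on a different number of cases.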

How did they select their Data?

· Each expert contributed 15 detailed clinical summaries describing patients who had been referred for diagnostic consultation.

· The summaries included data such as history, findings of physical examination, and results of lab tests that were available at the time of the initial consultation and that indicated both normal and abnormal conditions.

· Information from the definitive tests that confirmed the exact diagnosis was omitted for the purpose of this study.

· To ensure that data entry was optimal, program developers were asked to indicate how they would enter specific clinical data for their particular programs.

· The vocabulary selection might have been biased if the program developers had chosen the vocabulary in the context of a specific case. This was avoided by having them express, in the language of their program, a master list of discrete data items. The data had been collected previously from all the cases and listed alphabetically under the general categories of history, physical examination, and laboratory assessment. ⁹

How did they select their Cases?

· The cases covered the entire field of general medicine, including neurology. They were selected to present a spectrum of diagnostic difficulty, but all were considered to be cases in which a physician might be prompted to seek diagnostic help from a colleague.

· These cases included atypical presentations, rare diseases, multiple disorders presenting simultaneously, or elements sufficiently complex that the physician would be likely to request a diagnostic consultation.

· The group of experts decided which cases were appropriate to consider and which were not. They categorized each case by the organ system or systems involved, the cause of the disease, and the diagnostic difficulty.

· After this review, 121 of the 150 cases were selected for further consideration.

What was their Procedure?

· Using the developers’ terms for the clinical data on the master list, the data from each case were entered into each program.

· However, because of limitations in some programs, some data could only be approximated or could not be entered at all.

· Each program then analyzed the data and produced a list of possible diagnoses.

· The top 20 diagnoses on each list were combined in a master list. The group of experts reviewed the diagnoses for appropriateness and correctness without any prior knowledge of which program had suggested which diagnosis.
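The pooling step can be pictured with a short Python sketch. The slides do not say how the master list was ordered or deduplicated, so the alphabetical ordering, the function name `build_master_list`, and the toy diagnoses below are assumptions made only for illustration.

```python
def build_master_list(program_lists, top_n=20):
    """Pool the top diagnoses from every program into one list without attribution."""
    pooled = set()
    for ranked in program_lists.values():
        pooled.update(ranked[:top_n])
    # Alphabetical order hides both the original rank and the program of origin
    return sorted(pooled)


# Made-up ranked lists for one case (placeholder diagnoses)
lists_for_case = {
    "Dxplain": ["sarcoidosis", "lymphoma", "tuberculosis"],
    "Iliad": ["lymphoma", "histoplasmosis"],
    "Meditel": ["tuberculosis", "sarcoidosis"],
    "QMR": ["lymphoma", "sarcoidosis", "amyloidosis"],
}
master = build_master_list(lists_for_case)
print(master)
```

Reviewers who see only `master` cannot tell which program proposed a given diagnosis, which is the property the blinded review relies on.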

Results of Analyses

· When all cases were considered, scores for Correct Diagnosis showed that the mean scores for Dxplain and Meditel were significantly higher than the scores for Iliad and QMR.

· For 9 cases, none of the programs included the correct diagnosis. For the Rank Score, the significance of the differences could not be calculated because the samples varied in size.

· For the Comprehensiveness score, the mean scores for Dxplain and Meditel were significantly higher than for Iliad and QMR.

· The Additional Diagnoses score showed that, on average, each of the 4 programs generated approximately 2 appropriate diagnoses that the experts had not originally listed. There were no significant differences among the systems on this measure.

· All programs produced moderately long lists of potential diagnoses. The lists included many diagnoses that a knowledgeable physician would regard as not being particularly helpful in explaining the case or guiding further studies.

· On the other hand, each program suggested some diagnoses that the experts later agreed were worth including in the diagnosis.

· Although each program performed better or worse than the others on some of the performance measures, none performed consistently better or worse on all the measures.

· The programs also had additional functions that were not evaluated. Those functions included interactiveness, display of signs and symptoms associated with diseases, and suggestion of potentially relevant laboratory tests. ⁹

Conclusions

· The increasing popularity of computer-based diagnostic systems suggests that at least some physicians have found them helpful.

· However, such anecdotal data do not permit a systematic assessment of the clinical contexts in which these programs are most useful or how they actually perform.

· This study arouses concern that important diagnostic considerations may be so obscured by other diagnoses that the value of the program may be significantly decreased, or that it could lead to excessive or costly interventions in inexperienced hands.

· Although some physicians may use the programs as they were used in this article, most would probably enter selected key findings and use some of the other functions of the system to refine the list of diagnoses. ⁹

· Medically knowledgeable persons would probably not only decide what data to enter, but also distinguish the diagnoses that are worthy of consideration and dismiss many of the poorly integrated diagnoses.¹⁶

· The developers of these systems intend these programs to serve a prompting function, reminding physicians of diagnoses they may not have considered or triggering their thinking about related diagnostic possibilities. ⁹

Summary / Future References

· No single computer program scored better than the others on all performance measures. On average, fewer than half of the diagnoses on the experts’ original lists were suggested by the 4 programs.

· Yet, on average, each program suggested at least 2 additional diagnoses per case that the experts found relevant but had not originally considered.

· Clearly, as others indicated, the next step in the evaluation of these programs will have to include examining the performance of the physician and the computer together. ⁹

Discussion

· Some physicians in the medical community were not impressed with the findings of this article. Dr. Lehmann pointed out that there was no mention of whether the programs were in any way able to explain their reasoning.

- Source: Dr. Lehmann, The New England Journal of Medicine

He suggested further that the use of reasoning techniques should allow more rigorous validation of the programs’ workings and should assist physicians in interpreting their suggestions. This would help clinicians identify relevant advice and ignore irrelevant suggestions. ¹⁷

· Dr. Robert Yolton also inquired as to who would use such programs. He doubted that they would be useful for a family physician, since the diagnostic problems involved atypical presentations, rare diseases, multiple disorders presenting simultaneously, or elements sufficiently complex that the physician would be most likely to request a diagnostic computer consultation. ¹⁷

- Source: Dr. Yolton, The New England Journal of Medicine

· Dr. Berner and her colleagues responded that most of the programs permit queries and that they “explain” their diagnoses, in that they describe, with various means of quantification, how the individual case findings relate to a given diagnosis in the program’s knowledge base, the level of confidence that can be placed in a particular diagnosis, and what kind of additional data would support the diagnosis.

· She also stated that the design of their study did not permit them to evaluate the explanatory features or the construction of the knowledge base. However, Lehmann’s point was relevant: systematic evaluations of additional features of the programs are warranted.

· Berner also agreed with Dr. Yolton that these systems would be most unlikely to be used by subspecialists for most patients within their disciplines. However, they would be helpful to a physician confronted with a patient with a puzzling illness.

· Finally, she stressed that the ultimate aim of studying the 4 computer-based programs is to find out whether their use leads to improved diagnostic decision making and, ultimately, to improved quality of patient care.
