The issue of data has presented several challenges to all health insurance industry stakeholders. Until a couple of years ago, the primary challenge was that data requirements and the uses of data were not always fully understood by some critical stakeholders.
As the industry matured, awareness of the utility of data in building rating structures, product design and flexible pricing systems increased; however, the industry did not translate this knowledge into action. This was partly because the data was hard to use: it was frequently of poor quality and incomplete. In addition, each third-party administrator (TPA) captured different data in different formats, making data aggregation and analysis at the insurer's end a daunting task. As early as 2005, some TPAs tried to differentiate themselves by providing value-added services built around data analytics for specific insurers. Since health insurance was an insignificant part of many general insurers' portfolios, these value-added services were embraced by only a few insurers. Until then, none of the insurers had given data analysis much importance, and very few even had a dedicated person responsible for health insurance.
Health insurance premium growth in excess of 30% for FY '05-'06 significantly changed the complacent attitude many insurers had adopted towards their health portfolios. As health became close to 20% of the total business handled by a leading private insurer, the industry started to pay closer attention to health. The advent of detariffication in the general insurance industry has increased the importance of data collection and data analysis, especially in the health insurance industry. There is an additional need for more effective analysis in the new competitive regime of pricing without any cross-subsidy across classes. Proper data analysis is also necessitated by new IRDA regulations such as IBNR (Incurred But Not Reported) estimation and Product Filing Requirements. It is also important to create industry-wide benchmarks so that an insurer can compare its own performance and rates with industry standards.
Knowledge and awareness of data standards have existed in India for quite a while. In October 2002, the Ministry of Communication & Information Technology (MoCIT) constituted a working group, in coordination with Apollo Health Street Ltd, to prepare the ground for the Information Technology Infrastructure for Healthcare (ITIH) in India. In January 2004, the working group recommended standards to be followed for capturing and exchanging health information. The standards covered detailed formats for healthcare identifiers, data elements, messaging standards, clinical terminology, minimum data sets and billing formats. Another committee, formed by the Insurance Regulatory and Development Authority (IRDA) and coordinated by Bearing Point, made recommendations for data formats and data collection from TPAs. However, industry adoption of these standards was weak.
As the industry developed awareness about data usage, other critical events were occurring that would affect the level of awareness about health insurance data. IRDA appointed DVS Sastry to look at general insurance data issues. In addition, the process of strengthening and preparing the Tariff Advisory Committee (TAC) for functioning in a detariffied industry started, based on the vision that in the detariffied industry TAC would function as a data warehouse supporting the industry. Among the first steps that TAC took towards this goal was to collect data from TPAs and start some initial analysis. The data collection process required that all data be submitted in a standardised format. However, since the standards proposed earlier had not been adopted, this is a capacity that TPAs and insurers will need to develop as per the data reporting guidelines proposed by TAC.
Although the TPA systems do have many common fields, and almost all capture those that can be defined as critical, variance is still significant. In addition, as anyone who has had the opportunity to study a large volume of data from multiple TPAs would testify, the main shortcoming in the data has been poor adherence to quality parameters during the data capturing stage. From my own experience of large-volume analysis of an insurer's data from multiple TPAs, I have observed that although data quality is getting progressively better, many elementary errors are still common. The good news is that most of these errors can be eliminated easily by introducing validation checks and drop-down lists. In addition, training of the data entry staff can easily enhance data quality.
To standardize data collection, the TAC has published a Health Insurance Data Reporting Manual, which contains detailed instructions on acceptable data structures. Adoption of this manual will be the first step in generating quality data. This initiative can then be expanded upon by adding data fields to enhance data depth and the resultant analytics. The data fields in the current structure of the policy, members and claims datasets recommended by TAC have been found to be adequate for most types of analysis. Awareness of quality, however, is critical. Therefore, the first section of the document discusses how quality can be improved, and the second section discusses the benefits of and mechanisms for data sharing.
TPAs have been regarded as the traditional custodians of enrollment and claims data. They manage policyholders' details through readily available software tailor-made for the purpose. The data provided by different TPAs showed significant variation in terms of quality and consistency.
Since each TPA captures data differently, TAC has now prescribed formats for health insurance data collection. Data is currently segregated into the following structure:
- Policy Data (Table A): The policy level information is contained in this table. It has details such as the total number of people covered under a policy, policy premium, start date and end date of policy.
- Member data (Table B): This table contains information about the individual members covered under the policy. The details include the age or date of birth of member, sum insured, gender and relationship with the insured.
- Claims Data (Table C): Information on claims made by the covered members is contained in this table. The details include the date of admission and discharge, diagnosis description and code, name of the hospital, amount paid as well as claimed and the date of payment.
- Outstanding Claims Data: This table shows the outstanding claim amount at the beginning and close of each financial year as well as the total amount paid during the year in aggregate.
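As a rough illustration of how Tables A, B and C relate to one another, the sketch below models one record from each table. The class names, field names and types are assumptions made for illustration; they are not taken from the TAC reporting manual, and only the fields named in the descriptions above are included.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PolicyRecord:
    """One row of Table A (policy level). Field names are hypothetical."""
    policy_no: str
    lives_covered: int
    premium: float
    start_date: date
    end_date: date


@dataclass
class MemberRecord:
    """One row of Table B (member level), linked to Table A by policy_no."""
    policy_no: str
    member_id: str
    date_of_birth: Optional[date]  # either date of birth or age may be supplied
    age: Optional[int]
    sum_insured: float
    gender: str
    relationship: str              # relationship with the insured


@dataclass
class ClaimRecord:
    """One row of Table C (claim level), linked to Table B by member_id."""
    policy_no: str
    member_id: str
    admission_date: date
    discharge_date: date
    diagnosis_description: str
    diagnosis_code: Optional[str]
    hospital_name: str
    amount_claimed: float
    amount_paid: float
    payment_date: Optional[date]
```

The shared `policy_no` and `member_id` keys are what allow the three tables to be joined into a single exposure and claims dataset, which is why unique and consistently formatted identifiers matter so much.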
The chart below shows the list of important fields contained in Tables A, B and C.
The challenges in data quality can be subdivided into three distinct issues: data accuracy, data completeness and data standardization.
It is of utmost importance that the data be largely accurate. A high volume of inaccurate data undermines the reliability of any analysis. Common shortcomings in the current data with respect to accuracy are:
1) Reasonability Checks: Reasonability checks are the most important step in understanding and demonstrating data quality. They are very useful from a data analysis perspective, as they give an early indication of which analyses are viable. Frequently they also indicate which other elements of data pre-processing should be included in initial data enhancement efforts. This includes, but is not limited to, data cleaning, re-categorization, sorting, grouping, reformatting, and calculating new variables from existing ones. The following reasonability checks have to be undertaken on the data because of its current poor quality:
- Negative policy period; i.e. end date of a policy before its start date
- Length of stay less than zero; i.e. date of admission after the date of discharge
- Date of admission before the start date of its policy
- Date of admission after the end date of its policy
- Claim paid more than the corresponding amount claimed
- Inconsistency in age of Insured
- Policy premium less than zero
2) Duplicate records: Duplicate records are usually found in both Policy table A and Member table B of all TPAs. This duplication leads to multiple records when an attempt is made to create a master exposure dataset containing policy as well as member details. This limitation makes it difficult to evaluate the true risk profile of the lives covered under a policy. Further, there is a lack of a standard format to assign a unique policy number to each policy record. There were similar format issues with member records as well. Some TPAs used the same policy numbers for group as well as individual policies.
3) Missing policy data: Policy data was missing for certain claims paid by most TPAs. The percentage of missing records can be as high as 30% for some TPAs. Such a gap undermines the entire purpose of the loss ratio analysis by different parameters.
4) Segregation of expenses into different benefits: The claimed amount needs to be segregated into different categories, including Room & Nursing, Surgery, Investigation, Medicine and Miscellaneous charges, as per the prescribed format. However, the figures for Miscellaneous charges were unusually high for some TPAs. This may be because some TPAs do not classify the total claimed amount into the various sub-categories efficiently. Such classification, if provided, can facilitate analysis of the payment patterns for different benefits.
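Several of the accuracy checks above are simple enough to automate at the point of data capture or on receipt of a TPA file. The following is a minimal sketch, assuming each record is a dictionary with hypothetical field names such as `policy_no`, `start_date` and `amount_paid`; it covers the date, amount and premium checks, duplicate policy numbers and the share of claims with no matching policy. The age-consistency check and the benefit segregation are left out because their exact definitions are not given here.

```python
from collections import Counter
from datetime import date


def policy_flags(policy):
    """Names of the policy-level reasonability checks that this record fails."""
    flags = []
    if policy["end_date"] < policy["start_date"]:
        flags.append("negative_policy_period")
    if policy["premium"] < 0:
        flags.append("negative_premium")
    return flags


def claim_flags(claim, policy):
    """Names of the claim-level checks that fail; policy is None if not found."""
    flags = []
    if claim["discharge_date"] < claim["admission_date"]:
        flags.append("negative_length_of_stay")
    if claim["amount_paid"] > claim["amount_claimed"]:
        flags.append("paid_exceeds_claimed")
    if policy is None:
        # Without the policy record the date-window checks cannot be run.
        flags.append("missing_policy")
        return flags
    if claim["admission_date"] < policy["start_date"]:
        flags.append("admission_before_policy_start")
    if claim["admission_date"] > policy["end_date"]:
        flags.append("admission_after_policy_end")
    return flags


def duplicate_keys(records, key):
    """Key values that occur on more than one record, in sorted order."""
    counts = Counter(record[key] for record in records)
    return sorted(value for value, n in counts.items() if n > 1)


def missing_policy_share(claims, policies):
    """Fraction of claims whose policy number is absent from the policy table."""
    if not claims:
        return 0.0
    known = {policy["policy_no"] for policy in policies}
    orphans = sum(1 for claim in claims if claim["policy_no"] not in known)
    return orphans / len(claims)


# Illustrative records only; none of these values come from real TPA data.
policies = [
    {"policy_no": "P1", "start_date": date(2006, 4, 1),
     "end_date": date(2007, 3, 31), "premium": 5000.0},
    {"policy_no": "P2", "start_date": date(2007, 4, 1),
     "end_date": date(2006, 4, 1), "premium": -100.0},
    {"policy_no": "P1", "start_date": date(2006, 4, 1),
     "end_date": date(2007, 3, 31), "premium": 5000.0},
]
claims = [
    {"policy_no": "P1", "admission_date": date(2006, 3, 20),
     "discharge_date": date(2006, 3, 18),
     "amount_claimed": 1000.0, "amount_paid": 1200.0},
    {"policy_no": "P9", "admission_date": date(2006, 5, 1),
     "discharge_date": date(2006, 5, 3),
     "amount_claimed": 2000.0, "amount_paid": 1500.0},
]

print(policy_flags(policies[1]))
print(claim_flags(claims[0], policies[0]))
print(duplicate_keys(policies, "policy_no"))
print(missing_policy_share(claims, policies))
```

Returning a list of named flags per record, rather than rejecting the record outright, makes it possible to report error rates by TPA and by check, which is usually the first thing an analyst needs before deciding which analyses are viable.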
Completeness of data is a prerequisite for effective analysis. The variables provided in the data set should be sufficiently populated for accurate analysis. Any error or incompleteness in the data would lead to inaccurate results. Some examples of common fields that have completeness issues, and the impact thereof, are:
1) Age/Date of Birth: The likelihood of a claim, and therefore the required premium, is affected by the age of insured. The lack of accurate date of birth or age of insured undermines the ability to do any meaningful analysis based on age.
2) Occupation: The occupation information helps in better understanding of the risk profile of the insured. The field for “Occupation” is usually blank in most TPA records.
3) ICD Codes: ICD (International Classification of Diseases) is a standard disease-based coding system prescribed by the WHO (World Health Organization) and used across the world. A three-character ICD-10 code (a letter followed by two digits) is assigned to each of the diagnosis groupings of the claimant. The claims data is grouped by these codes for analysis purposes. This coding is unavailable or inaccurate in many cases. The lack of prescribed ICD-10 codes can be a major shortcoming in the analysis of the data. Some TPAs provided broad code ranges rather than a single three-character ICD-10 code.
4) Diagnosis and Procedure Descriptions: Diagnosis descriptions are useful for verifying the ICD codes assigned to the claims and for populating the ICD codes where they are unavailable. The procedure description is used to analyze whether the procedure is medically appropriate for the diagnosis. There was a high degree of variability in the diagnosis and procedure descriptions generated by different TPAs. Some of the inconsistencies observed were:
- Many TPA software systems restrict diagnosis descriptions to a text limit, which leads to incomplete diagnosis descriptions.
- Data in the field was difficult to comprehend due to syntax errors or broad descriptions such as “Conservative Surgery”.
- Some TPAs provide an additional column titled “OPINION”, which for some records contains combined descriptive and procedural information.
- Most TPAs did not provide procedure descriptions at all, while others recorded only drug names.
- Occasionally, diagnosis and procedure descriptions are provided interchangeably by some TPAs.
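Completeness of the fields discussed above can be measured before any analysis is attempted. The sketch below computes the share of records in which a field is populated and sorts a diagnosis code into a few coarse buckets. The field names and bucket labels are hypothetical, and the pattern assumes the WHO three-character category form of ICD-10 (a letter followed by two digits), which the text refers to as a three digit code; it checks only the shape of the code, not whether the code exists in the classification.

```python
import re

# Assumed shape of a WHO ICD-10 three-character category, e.g. "J18".
ICD10_CATEGORY = re.compile(r"[A-Z][0-9]{2}")
# A broad range of categories, e.g. "A00-A09", as supplied by some TPAs.
ICD10_RANGE = re.compile(r"[A-Z][0-9]{2}\s*-\s*[A-Z][0-9]{2}")


def is_blank(value):
    """True for None and for strings that are empty or whitespace only."""
    return value is None or str(value).strip() == ""


def fill_rate(records, field):
    """Fraction of records in which the given field is populated."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if not is_blank(record.get(field)))
    return filled / len(records)


def classify_icd(code):
    """Bucket a raw code as 'missing', 'category', 'range' or 'other'."""
    if is_blank(code):
        return "missing"
    text = str(code).strip().upper()
    if ICD10_CATEGORY.fullmatch(text):
        return "category"
    if ICD10_RANGE.fullmatch(text):
        return "range"
    return "other"


# Illustrative records only.
members = [
    {"occupation": ""},
    {"occupation": "Teacher"},
    {"occupation": None},
    {},
]
print(fill_rate(members, "occupation"))
print([classify_icd(c) for c in ["J18", "a09", "A00-A09", "", "123"]])
```

Tracking the fill rate per field and per TPA over time gives a simple way to see whether validation checks and staff training are actually improving the data.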
Inconsistency and variation among industry stakeholders in populating certain fields highlight the need for a standard terminology to facilitate data collection. The lack of a structured mechanism for populating common fields causes problems when creating a single dataset for all insurers or TPAs. Some fields that can easily be represented in a uniform manner by all stakeholders to enhance data usability are: gender, relationship of patient with primary policy holder, hospital and city names, unique identifier for hospitals, and type of cover.
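Uniform representation of such fields can be approximated after the fact by mapping each TPA's free-text values onto a small controlled vocabulary. The sketch below is one possible approach; the target codes ("M", "F", "SELF", "SPOUSE" and so on) and the lists of raw variants are assumptions for illustration, not a prescribed standard.

```python
# Hypothetical controlled vocabularies; a real standard would be agreed
# industry-wide rather than defined by the analyst.
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}

RELATIONSHIP_MAP = {
    "self": "SELF",
    "spouse": "SPOUSE", "wife": "SPOUSE", "husband": "SPOUSE",
    "son": "CHILD", "daughter": "CHILD", "child": "CHILD",
    "father": "PARENT", "mother": "PARENT",
}


def _lookup(raw, mapping):
    """Map a raw value to its standard code, or 'UNKNOWN' if unrecognised."""
    if raw is None:
        return "UNKNOWN"
    return mapping.get(str(raw).strip().lower(), "UNKNOWN")


def normalize_gender(raw):
    return _lookup(raw, GENDER_MAP)


def normalize_relationship(raw):
    return _lookup(raw, RELATIONSHIP_MAP)


def normalize_name(raw):
    """Collapse whitespace and upper-case a hospital or city name."""
    return " ".join(str(raw).split()).upper()


print(normalize_gender(" Male "))
print(normalize_relationship("Wife"))
print(normalize_name("  apollo   hospital "))
```

Mapping unrecognised values to an explicit "UNKNOWN" rather than guessing keeps the volume of non-standard entries visible. Name clean-up of this kind only reduces trivial variation, so a unique hospital identifier is still needed to match the same hospital across TPAs reliably.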