Auto Fraud Detection Without Building Models

Is it possible to detect fraud among millions of transactions in a few minutes, without labelling any data and without building a machine learning model?

The growing scale of businesses and organizations, their complex ecosystems, and the digitization of their operations have given fraudsters, both inside and outside these organizations, more opportunities to carry out fraudulent activities. Organizations around the world lose an estimated five percent of their annual revenues to fraud, according to a survey of Certified Fraud Examiners (CFEs) who investigated cases between January 2010 and December 2011. Applied to the estimated 2011 Gross World Product, this figure translates to a potential total fraud loss of more than $3.5 trillion. Please note that the $3.5 trillion figure is an estimate for 2011, and we are now in 2021. We believe the amount lost to fraud has grown many times over during the last 10 years.

This is a great threat to organizations and a huge challenge for their leaders. Moreover, given the pace at which businesses are growing and fraud scenarios are evolving, fraud prevention teams don’t have the luxury of time. So, is it possible to detect fraud among millions of transactions in a few minutes, without labelling any data and without building a machine learning model? If we say “Yes”, you may find it highly unlikely, but it is true. We have built a solution, “Discover”, that does exactly this.

Discover is the outcome of extensive research by best-in-class Machine Learning and Management Consulting experts from IIT Madras and BITS Pilani. It finds anomalies in data, which in turn helps detect and prevent fraudulent activities in organizations across industries. The solution can find anomalies in any kind of data on its own. It is based on unsupervised machine learning, and its applications include detecting fraud in salary payouts, finding dead inventory that should be disposed of and inventory that should be ordered immediately, finding the products that generate high revenue and the products that lose money, and identifying employees who generate high revenue for the company and employees who cause it to lose money. These are just a few of the numerous applications of Discover. The Use Cases section of this article lists many more potential applications.

We explain this novel solution with its core features, benefits, and use-cases:

Features:

· It is based on Unsupervised Machine Learning, so it need not be trained and can be used immediately. However, to get better results, feature engineering can be done based on the domain knowledge.

· The entire solution is written in C and PHP. Because C programs can be compiled for most operating systems, it can run on a server, in the cloud, or on a desktop or laptop. PHP also runs on most operating systems.

· It has a built-in facility for automatic feature selection if a target variable is provided.

Benefits:

· The accuracy of this solution is far better than other comparable solutions.

· It is at least 100 times faster than other products in this category.

· It can handle both numeric and categorical data.

· Our solution can handle any number of rows and columns and it can sort fraudulent transactions from any type of data. This makes our solution highly efficient and effective.

· For each row, the solution finds the key features (or columns) that make it a suspicious or unusual transaction, together with the weightage of each feature. It also reports whether the value is High or Low. This helps us understand why a particular transaction is shown as suspicious or unusual and steers the investigation in the right direction, which saves a lot of time for the investigating teams. As the teams can identify and prevent fraud within a short time, it saves millions of dollars for the companies. Apart from preventing fraud, it can also identify which product is doing very well, which customer is very valuable, which machinery requires maintenance and so on. Below, we give the use cases.

Use Cases:

As this solution can be applied to businesses and organizations across industries, here is a list of its potential applications in different sectors:

Common Use Cases (Applicable to Most of the Companies):

· Detect employees whose performance is exceptionally good or exceptionally poor, based on quantifiable performance data.

· Detect fraud in employee salaries.

· Detect the most valuable customers.

· Detect the most valuable products.

· Detect products whose sales are falling or rising steeply.

· Detect fraud in Accounts Payable and Accounts Receivable.

Banking Financial Sector:

· Early Detection and Prevention of Fraud & Money Laundering activities.

· Detect Unworthy Loan Applicants.

Insurance Sector:

· Detect abnormal claims.

· Detect ineligible applicants and policy holders.

IT & ITES Sector:

· Identify the offerings that are doing exceptionally well and the ones that are not doing well.

· Detect fraud in accounts payable to different vendors.

Manufacturing Sector:

· Detect manufacturing defects based on quantifiable data.

· Identify the inventory that is moving slowly out of the warehouse and the inventory that is moving fast.

· Detect employees whose performance is exceptionally good or exceptionally poor.

· Detect machinery that requires maintenance.

E-Commerce Sector:

· Identify the goods that are moving slowly and the goods that are moving fast along the supply chain.

· Detect troublesome customers and exceptionally good customers.

Telecommunication Sector:

· Detect unauthorized access in computer networks.

· Monitor the performance of computer networks (Detect network bottlenecks).

Health Care Sector:

· Patient Identification & Medical Crisis Prevention, Revenue Improvement, Cost Reduction and Efficiency Improvement for the organizations (Hospitals & Government).

Retail Sector:

· Identify the products that bring the company the most revenue and the products that are performing badly.

· Detect employees whose performance is exceptionally good or exceptionally poor.

Pharma:

· Data Driven Health Monitoring and Early Warning Diagnostic Tools for Patients.

· Detect unusual values in Key Performance Indicators.

Airline:

· Data Driven Health Monitoring and Early Warning Diagnostic Tools for Aircraft Engines.

· Detect sectors where performance is exceptionally good or exceptionally bad.

Governance, Risk and Compliance:

· Detect duplicates or fraud in salary payouts.

· Detect fraud in accounts payable to different vendors.

· Detect employees whose performance is exceptionally good or exceptionally poor.

These are just some applications which we could visualize at present. We can find many other innovative ways of using this solution for many other unforeseen problem areas.

How about a case study to understand it better?

For this case study, we have taken data of transactions made by credit cards in September 2013 by European Cardholders from the website Kaggle.com:

The dataset contains transactions made by credit cards in September 2013 by European cardholders. These transactions occurred over two days, and 492 of the 284,807 transactions are frauds. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

The data set comprises 31 columns. All columns are anonymized except ‘Time’, ‘Amount’ and ‘Class’. Apart from these 3 columns, there are 28 columns named ‘V1’, ‘V2’ to ‘V28’. Features V1, V2 … V28 are the principal components obtained with PCA.

The result is “Class” (the last column), which represents whether the transaction is fraudulent or not. A Class value of ‘1’ means ‘Fraud’ and ‘0’ means ‘No Fraud’.

What is PCA (Principal Component Analysis)?

Principal component analysis (PCA) is a technique that reduces the dimensionality of a dataset while minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance. The method requires computing eigenvectors and eigenvalues.
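As a minimal sketch of this idea, PCA can be computed with NumPy by taking the eigenvectors of the covariance matrix. This is not the preprocessing that was actually applied to the Kaggle data, which is not described here. The function name `pca` and the toy data are illustrative assumptions.

```python
import numpy as np


def pca(data: np.ndarray, n_components: int):
    """Project the rows of `data` onto its top principal components."""
    # Center each column so that variance is measured around the mean.
    centered = data - data.mean(axis=0)
    # Covariance matrix of the columns (features).
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Keep the directions with the largest variance first.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    return centered @ components, eigenvalues[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 1))
    # Toy data: the second column is almost a multiple of the first.
    toy = np.hstack([x, 2 * x + 0.1 * rng.normal(size=(200, 1))])
    scores, variances = pca(toy, n_components=2)
    print(variances)                     # variance captured by each component
    print(np.cov(scores, rowvar=False))  # off-diagonal entries are close to zero
```

The new columns in `scores` are uncorrelated, and almost all of the variance of the toy data ends up in the first one, which is why the later components can be dropped with little information loss.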

Result: Top 10,000 Transactions:

The fraudulent transactions are 0.172% of all transactions. Thus, if we take 10,000 random transactions, we should get approximately 17 fraudulent transactions. However, when we submitted the transactions to our software, “Detect Fraud and Money Laundering Transactions”, the top 10,000 transactions it selected contained 411 fraudulent transactions. So, the software found around 24 times as many as random selection would. Further, the top 1,000 transactions contained 182 fraudulent transactions. This is more than 100 times better than random selection, as 1,000 random transactions are likely to contain about 1.7 fraudulent transactions. This shows the effectiveness of the software, which is based on Unsupervised Machine Learning.

We selected the dataset in “Detect Fraud and Money Laundering Transactions” and asked it to find the top 10,000 likely fraudulent transactions. Please note that the time taken to compute the top 10,000 transactions was less than 2 minutes.

Below, we give the screenshot of “Detect Fraud and Money Laundering Transactions”:

Let us study the above screen carefully. It shows that there are 284,807 rows and 31 columns. We have ignored the column “Class”, as it records whether the transaction is fraudulent or not, so it must not be used as an input. We clicked on the “Show Key Column” button and selected 3 Key Columns. The weights are displayed as separate columns. The number of likely fraudulent transactions to be shown is 10,000.

When we clicked on the “Result” button, we saw the following screen:

In the above screenshot, we can see that there are four fraudulent transactions in the top 25 transactions. These are marked with green rectangles. For the first fraudulent transaction, the “V7” column has a weightage of 19.91%, “V8” has a weightage of 19.69% and “V21” has a weightage of 15.88%. This way, we can see which features (or columns) are responsible for flagging this transaction, along with their weightage. This helps in further analysing the fraudulent transactions. Of course, the columns from V1 to V28 are derived from PCA and thus anonymized, so we do not know the meaning of these columns. However, in real life the columns are well known and give vital clues as to why a particular transaction is likely to be a fraudulent one.
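This article does not describe how the software computes these weightages. Purely as an illustration of the kind of output shown above, and not as the product's algorithm, the sketch below uses a generic stand-in: it standardizes every column and reports each column's share of the row's total squared z-score, along with a High or Low direction. The function name, the scoring rule and the sample values are all assumptions.

```python
from statistics import mean, pstdev


def column_weightages(rows, columns, row_index, top_n=3):
    """Return the top_n columns that make one row unusual.

    Each result is (column, share_percent, "High" or "Low").
    Hypothetical scoring: share of the row's total squared z-score.
    """
    contributions = []
    for j, name in enumerate(columns):
        values = [r[j] for r in rows]
        mu, sigma = mean(values), pstdev(values)
        # A constant column cannot make any row stand out.
        z = 0.0 if sigma == 0 else (rows[row_index][j] - mu) / sigma
        contributions.append((name, z))
    total = sum(z * z for _, z in contributions)
    if total == 0:
        return []
    ranked = sorted(contributions, key=lambda c: c[1] ** 2, reverse=True)
    return [(name, round(100 * z * z / total, 2), "High" if z > 0 else "Low")
            for name, z in ranked[:top_n]]


if __name__ == "__main__":
    columns = ["V7", "V8", "V21"]  # column names borrowed from the article
    rows = [                       # made-up values, for illustration only
        [0.1, 0.2, 0.0],
        [0.0, 0.1, 0.1],
        [-0.1, 0.0, -0.1],
        [0.2, -0.1, 0.0],
        [5.0, -4.0, 0.1],          # deliberately unusual row
    ]
    for name, share, direction in column_weightages(rows, columns, row_index=4):
        print(f"{name}: {share}% ({direction})")
```

Whatever the scoring rule, the value of this kind of output is the same: an investigator sees which columns pushed a row to the top of the list and in which direction.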

Time Taken:

Please note that we have processed 284,807 rows and 30 columns (the last column is ignored). So, in practice we have processed about 8.5 million data values (284,807 × 30 = 8,544,210), and it took less than 2 minutes on modest hardware. This shows that the speed of the software is remarkably high.

In the screenshot given below, we can see that 411 items have Class shown as 1. Thus, the software has detected 411 fraudulent transactions among the top 10,000 transactions it selected.

We have computed the number of fraudulent transactions found in each successive block of 500 transactions selected by the software. Below, we give the result.

Below, we show the screenshot of the final analysis:

From the above, it is apparent that the top 500 transactions contained 125 fraudulent transactions. If we take 500 random transactions, we should expect about one fraudulent transaction (0.172 × 5 = 0.86). The next 500 transactions contained 57 fraudulent transactions. We see that the numbers decline fast. It is clear that the software automatically sorts the transactions from most likely fraudulent to least likely fraudulent.
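The per-block counts can be reproduced from any ranked list of labels. The sketch below is a minimal illustration: the helper names and the short toy list are assumptions, while the totals of 492 frauds in 284,807 rows come from the dataset description.

```python
TOTAL_ROWS = 284_807   # rows in the dataset, as stated in the article
TOTAL_FRAUDS = 492     # fraudulent rows, as stated in the article


def frauds_per_bucket(ranked_labels, bucket_size=500):
    """Count fraud labels (1) in each consecutive bucket of a ranked list."""
    return [sum(ranked_labels[start:start + bucket_size])
            for start in range(0, len(ranked_labels), bucket_size)]


def expected_random_frauds(bucket_size=500):
    """Frauds expected in a bucket of randomly chosen transactions."""
    return bucket_size * TOTAL_FRAUDS / TOTAL_ROWS


if __name__ == "__main__":
    # Toy ranking (most suspicious first); 1 marks a fraudulent transaction.
    toy_ranked = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
    print(frauds_per_bucket(toy_ranked, bucket_size=4))  # [3, 1, 0]
    print(round(expected_random_frauds(500), 2))         # 0.86
```

A good ranking shows the pattern seen in the toy output: counts that start high and fall quickly, against a flat random baseline of under one fraud per 500 rows.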

Accuracy:

Let us analyze the accuracy of the result. If we take 500 random transactions, we expect about one fraudulent transaction (0.172 × 500 / 100 = 0.86). However, our software found 125 fraudulent transactions in the top 500 transactions without any modelling and without any information about the data. This means the hit rate was about 145 times that of random selection (125 / 0.86). Obviously, it cannot be a coincidence.

If we take 1,000 random transactions, we expect about 1.7 fraudulent transactions; however, we detected 182 fraudulent transactions in the top 1,000 transactions selected by the software. So, the hit rate was about 105 times that of random selection.

Note: Although we have not shown the result for the top 200 transactions in the above screen, those 200 transactions contained 96 fraudulent transactions. Thus, we got 96.0/200 × 284807/492 = 278 times better than random selection of 200 transactions.
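The same comparison with random selection can be written as one small function. This is a sketch of the arithmetic only; the function name is an assumption, and the counts are the ones reported in this article.

```python
TOTAL_ROWS = 284_807   # rows in the dataset
TOTAL_FRAUDS = 492     # fraudulent rows in the dataset


def lift_over_random(frauds_found, rows_reviewed,
                     total_frauds=TOTAL_FRAUDS, total_rows=TOTAL_ROWS):
    """How many times more frauds were found than random selection would give."""
    expected = rows_reviewed * total_frauds / total_rows
    return frauds_found / expected


# (frauds found, rows reviewed) pairs reported in the article.
for found, reviewed in [(96, 200), (125, 500), (182, 1_000), (411, 10_000)]:
    print(f"top {reviewed:>6}: {found} frauds, "
          f"about {lift_over_random(found, reviewed):.1f} times random")
```

With the exact fraud rate of 492/284,807, this gives roughly 278, 145, 105 and 24 times random selection for the top 200, 500, 1,000 and 10,000 transactions respectively.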

Will it work on any type of data?

Yes, we can detect any type of fraud provided the required data are available. However, it is important to note that transaction data in isolation may not be sufficient to detect fraud or abnormality. We also require historical data along with the transaction data. For example, take two customers who have each made a transaction of $1,000. If the historical data show that the first customer has made many transactions of $10,000 and above, this transaction is unlikely to be fraudulent. However, if the second customer has never made a transaction above $100, the same transaction is suspicious.

Below, we give examples of the details required for Credit Card Transactions:

· Balance before Transaction

· Balance after Transaction

· Number of Transactions taken place in last 24 hours

· Total Amount of Transactions taken place in last 24 hours

· Maximum Amount of Transaction made in the last quarter

Providing more details will improve the accuracy. We also need to compute certain ratios, such as:

· Balance before Transaction / Transaction Amount

· Transaction Amount / Maximum Amount of Transaction made in the last quarter
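The two ratios above can be derived with a few lines of code. The sketch below is illustrative only: the class, field and key names are assumptions, and the sample values reuse the two-customer example, with hypothetical balances and quarterly maxima.

```python
from dataclasses import dataclass


@dataclass
class CardTransaction:
    amount: float                    # transaction amount
    balance_before: float            # balance before the transaction
    max_amount_last_quarter: float   # largest transaction in the last quarter


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return None if denominator == 0 else numerator / denominator


def derived_ratios(txn):
    """Compute the two ratio features listed above for one transaction."""
    return {
        "balance_to_amount": safe_ratio(txn.balance_before, txn.amount),
        "amount_to_quarter_max": safe_ratio(txn.amount,
                                            txn.max_amount_last_quarter),
    }


if __name__ == "__main__":
    # Both customers make a $1,000 transaction; only their history differs.
    first = CardTransaction(1_000, 5_000, 10_000)
    second = CardTransaction(1_000, 5_000, 100)
    print(derived_ratios(first)["amount_to_quarter_max"])   # 0.1
    print(derived_ratios(second)["amount_to_quarter_max"])  # 10.0
```

The same $1,000 amount yields a ratio of 0.1 for the first customer and 10.0 for the second, which is the kind of contrast an anomaly detector can pick up once history is encoded as a feature.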

Result Download Link:

The result can be downloaded from our website:

Research Scholars:

We provide “Discover” to Research Scholars free for one month. Those who are interested should send a request to research@patodiainfo.com with the necessary documents. We shall send a link and key after verifying the documents.

Note: “Discover” is provided to Research Scholars for research purposes only and may not be used for commercial purposes.

Conclusion:

The software has been developed through extensive research and many rounds of refinement, with the primary objective of helping businesses and organizations detect fraud effectively and efficiently, with the promise of huge savings in cost and effort.

In the next article in this series, we shall cover how to detect fraud in payroll data using “Discover”.
