A systematic review of data science and machine learning applications to the oil and gas industry

This study offers a detailed review of the roles of data science and machine learning (ML) in different petroleum engineering and geoscience segments such as petroleum exploration, reservoir characterization, oil well drilling, production, and well stimulation, emphasizing the newly emerging field of unconventional reservoirs. The future of data science and ML in the oil and gas industry, highlighting what is required from ML for better prediction, is also discussed. This study also provides a comprehensive comparison of different ML techniques used in the oil and gas industry. With the arrival of powerful computers, advanced ML algorithms, and extensive data generation from different industry tools, we see a bright future in developing solutions to the complex problems in the oil and gas industry that were previously beyond the reach of analytical solutions or numerical simulation. ML tools can incorporate every detail in the log data and all information connected to the target data. Despite their limitations, they are not constrained by the limiting assumptions of analytical solutions or by the particular data and/or processing power requirements of numerical simulators. This detailed and comprehensive study can serve as a reference for ML applications in the industry. Based on the review conducted, ML techniques offer great potential for solving problems in almost all areas of the oil and gas industry involving prediction, classification, and clustering. With the huge volume of data generated in everyday oil and gas industry activities, machine learning and big data handling techniques are becoming a necessity for a more efficient industry.

Introduction

Artificial intelligence (AI) is the field that integrates computational power with human intelligence to produce smart and reliable solutions to extremely nonlinear and highly complicated problems. It is the field of science that allows computers to think and decide on their own. Machine learning (ML) is a subset of AI that provides statistical tools to explore and analyze big data. ML comprises further subsets such as supervised, unsupervised, and reinforcement learning. Supervised learning is applied when past or labeled data is available, and it forecasts future values by function approximation. Unsupervised learning is applied when labeled data is unavailable and is usually used for clustering. When only part of the data is labeled, supervised and unsupervised techniques are combined, an approach usually called semi-supervised learning; reinforcement learning, by contrast, trains an agent through reward feedback from its interaction with an environment.

In the last two decades, engineering journals have reported numerous articles utilizing ML for regression, function approximation, and classification problems. With the development of intelligent oilfields and big data technology, the adoption of ML methods has gained new vitality for the study of problems in the oilfield development process. With the advent of computing techniques, several correlations utilizing ML have come to the fore, especially in reservoir characterization (Anifowose 2012; Anifowose et al. 2013a, b), reservoir engineering (Al-Marhoun and Osman 2002; Gharbi et al. 1999; Gharbi and Elsharkawy 1999), reservoir geomechanics (Tariq et al. 2017a, b), and many other areas of petroleum engineering.

The question that ML petroleum researchers face most often is how to generalize models that are usually limited to the dataset on which they were tested, so that they yield more general correlations. ML applications share limitations and challenges that hinder the generalization of the created models, such as overfitting, coincidence, excessive training, lack of interpretability of results, and bias. Besides, these models require a large amount of data that is not available in many cases.

Overfitting is considered the most common problem in ML applications. It is usually caused by the lack of an appropriate amount of training data. To lessen the effect of insufficient data, the ratio of the number of data points to the total number of connection weights (ρ) has been used. The coincidence effect is another issue that accompanies supervised learning models: because they try to match a specific dataset, there is a probability of getting a good match by coincidence. This can also happen in other regression analysis techniques, which calls for methods that minimize its occurrence (Livingstone et al. 1997). Overtraining can happen when there is no clear stopping criterion for the training. The error may keep decreasing as the model structure, including the weights, is updated. The real risk, in that case, is that the model becomes complex enough to fit one specific dataset and can no longer generalize. A training methodology named “early stopping” overcomes this by using a control set to monitor the training process: if the error on the control set begins to rise, training is ended. Other techniques are being used to save time and effort, such as reinforcement learning with in-stream supervision and generative adversarial networks, which monitor the learning of two competing networks to better understand the model concept (Hossain 2018).
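The early stopping idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any cited study: the callback train_one_epoch, the patience parameter, and the simulated error values are all assumptions introduced for the example.

```python
from typing import Callable, List, Tuple


def train_with_early_stopping(
    train_one_epoch: Callable[[int], float],
    max_epochs: int = 100,
    patience: int = 3,
) -> Tuple[int, float, List[float]]:
    """Train until the control-set error stops improving.

    train_one_epoch(epoch) is a hypothetical callback that runs one
    training pass and returns the error measured on the control set.
    """
    best_error = float("inf")
    best_epoch = -1
    history: List[float] = []
    for epoch in range(max_epochs):
        error = train_one_epoch(epoch)
        history.append(error)
        if error < best_error:
            # Control-set error improved: remember this epoch.
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break
    return best_epoch, best_error, history


# Illustrative control-set errors: they fall, then rise as the model overfits.
simulated_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.70, 0.90]
best_epoch, best_error, history = train_with_early_stopping(
    lambda epoch: simulated_errors[epoch], max_epochs=len(simulated_errors)
)
```

In practice the weights saved at best_epoch would be restored after the loop ends; that step is omitted here because it depends on the model library in use.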

The availability of large datasets is also a concern, as it affects the training accuracy and the quality of the model. If the gathered data is limited, a methodology like single-shot learning is implemented, in which the AI model is pre-trained on a similar dataset and then improves with experience.

Interpretability is the key to data analysis. AI models are not simple, and in some cases their results cannot be interpreted even when modeling small linear problems. No single connection in a model determines the results; the combined connections do. One method developed to help in that regard is local interpretable model-agnostic explanations (LIME), which tries to detect which parts of the raw data the model depends on most for its estimations. In the generalized additive models method, the separation between model features makes each feature easier to interpret.

The limited generalization ability of AI models is a major limitation that delays the widespread adoption of AI in the oil and gas industry. It is hard for many models to be used in circumstances different from those used in building the original model (Virginia 2018). Additional resources must be spent each time a model is trained on a new dataset, even if it is similar to previous cases (Ramamoorthy and Yampolskiy 2018). The reusability of ML models is also quite challenging. Usually, models trained on one geological field are less reliable when applied to other geological fields. It is highly recommended to apply a model only when the input parameters of the given dataset lie within the range of the input parameters on which the model was trained (Mohaghegh 2017).
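The range recommendation above lends itself to a simple pre-check before a trained model is reused on a new well or field. The sketch below is only one way to express it; the feature names, the numeric values, and both helper functions are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Dict[str, float]


def training_ranges(samples: List[Sample]) -> Dict[str, Tuple[float, float]]:
    """Record the minimum and maximum of every input seen during training."""
    features = samples[0].keys()
    return {
        name: (min(s[name] for s in samples), max(s[name] for s in samples))
        for name in features
    }


def out_of_range_features(
    sample: Sample, ranges: Dict[str, Tuple[float, float]]
) -> List[str]:
    """Return the inputs of a new sample that fall outside the training range."""
    return [
        name for name, (low, high) in ranges.items()
        if not low <= sample[name] <= high
    ]


# Hypothetical training inputs from one field (values are illustrative only).
field_a = [
    {"porosity": 0.12, "depth_ft": 8200.0},
    {"porosity": 0.18, "depth_ft": 8900.0},
    {"porosity": 0.25, "depth_ft": 9400.0},
]
ranges = training_ranges(field_a)

# A sample inside the training envelope, and one that extrapolates in porosity.
inside = out_of_range_features({"porosity": 0.20, "depth_ft": 9000.0}, ranges)
outside = out_of_range_features({"porosity": 0.31, "depth_ft": 9100.0}, ranges)
```

A non-empty result flags a prediction as an extrapolation; it does not say the prediction is wrong, only that the model was never shown comparable inputs.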

Lastly, the effect of bias cannot be ignored and is sometimes hard to detect and mitigate. Many researchers address AI bias by examining the model's objective and its associated results. Using model-independent perturbations, in which the inputs are substituted with random values drawn from a normal distribution, helps avoid biases (Samek et al. 2018). Table 1 provides a summary of all limitations of AI and ML models.
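One possible reading of the perturbation approach is sketched below: each input is replaced in turn by draws from a normal distribution fitted to that input, and the average change in the model output is recorded. This is an illustrative interpretation, not the procedure of Samek et al. (2018); the function name, the toy model, and the data are assumptions.

```python
import random
import statistics
from typing import Callable, List, Sequence


def perturbation_sensitivity(
    model: Callable[[Sequence[float]], float],
    rows: List[List[float]],
    n_features: int,
    seed: int = 0,
) -> List[float]:
    """Mean absolute output change when one input is replaced by normal noise."""
    rng = random.Random(seed)
    baseline = [model(row) for row in rows]
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        mu, sigma = statistics.mean(column), statistics.pstdev(column)
        changes = []
        for row, base in zip(rows, baseline):
            perturbed = list(row)
            # Substitute input j with a draw from N(mu, sigma).
            perturbed[j] = rng.gauss(mu, sigma)
            changes.append(abs(model(perturbed) - base))
        scores.append(statistics.mean(changes))
    return scores


# Toy model that depends only on the first input (an assumption for the demo).
def toy_model(x: Sequence[float]) -> float:
    return 2.0 * x[0]


rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
scores = perturbation_sensitivity(toy_model, rows, n_features=2)
```

An input whose perturbation leaves the output unchanged, as the second input does here, is one the model ignores; an unexpectedly dominant input can point to a bias worth investigating.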

figure 1

What is needed from AI in the oil and gas industry?

Many oil and gas industry giants are currently applying AI in their operations. Advances in AI have made it suitable for applications such as precision drilling and automation, saving oil and gas producers time and money. These advances will serve different aspects of the oil and gas industry, such as:

Precise drilling

Drilling activities are always accompanied by high risk and a high level of uncertainty. AI techniques, coupled with the big data recorded in real time by smart sensors mounted on drill strings, such as pressure, temperature, and seismic measurements, can be used to overcome these challenges. Precise drilling using AI can improve control of the rate of penetration and identify risks in advance.

Production optimization

Every oil and gas company focuses on production optimization and efficiency, which eventually increases profits. AI helps through automated pattern recognition and classification, which prepare production data for analytics. Estimation and prediction models can then be built on the refined data. AI can also isolate the effects of the reservoir from the responses to production controls such as gas lift rates, choke openings, network routing, and artificial lift methods.

Reservoir management

Multiple teams from several disciplines, such as seismic, geology, reservoir, and production engineering, must collaborate to achieve better reservoir management. AI models can be trained with historical data from seismic surveys, geological descriptions, and production methodologies and then applied to reservoir characterization or modeling and to field monitoring.

Inspections

Frequent inspections are scheduled to detect abnormal equipment performance and so prevent equipment failures and potential accidents. That is why companies are looking for automated and smart detection approaches. Robots driven by AI models can help investigate abnormal equipment behavior by identifying anomalies using techniques such as pattern recognition. Besides, drones can inspect pipelines and offshore facilities and detect cracks or leaks in real time. They can also help in an emergency, such as a gas leak. In certain situations, these robots can intervene and carry out the procedure that applies to the case, which improves the company's safety measures.

Chatbots

AI-powered chatbots can help engineers and scientists by searching a database or archive of historical data, suggesting possible solutions to problems, providing correct standards of job execution, or helping to teach junior staff using natural language processing (NLP). Jacobs (2019) discussed three newly released chatbots in the oil and gas industry: Sandy, Nesh, and Ralphie. They are designed to answer oil and gas professionals’ complex questions. Also called virtual assistants, they use AI-based NLP, a technology that quickly entered the market through the tech giants Amazon, Apple, and Google and has enabled millions of people to engage in dialogue with laptops, smartphones, and speakers.

Facilities monitoring

Intelligent cameras can reduce potential damage by detecting hazardous activities such as smoking in dangerous areas. They can be trained using photos and recordings of dangerous activities to alert the staff or take predefined actions. Moreover, they can detect whether employees are wearing their personal protective equipment (PPE). This approach helps enhance safety management.

Commonly used machine learning techniques in oil and gas industry

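Several ML techniques such as artificial neural networks (ANN), fuzzy logic (FL), support vector machines (SVM), decision trees (DT), random forests (RF), k-nearest neighbors (KNN), recurrent neural networks (RNN), convolutional neural networks (CNN), and fuzzy C-means clustering are widely used in different oil and gas applications. Table 4 summarizes some of the algorithms with their advantages and disadvantages.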

figure 2

Petrofacies classification and fractures identification

Reservoir rocks can be classified and grouped based on their reservoir quality. Such classification can be done based on petrophysical rock properties (e.g., porosity, permeability, and pore size) and geological features (e.g., textures, diagenetic overprints, and pore types). Petrofacies are usually defined by combining both petrophysical and geological attributes, which makes them an essential tool for reservoir characterization (Avseth and Mukerji 2002). Petrofacies classification is frequently done using both core samples and wireline log data. Cores are rarely available from all wells because of the time and cost involved, and thus several studies (Bhattacharya and Mishra 2018; Qi and Carr 2006; Sebtosheikh and Salehi 2015) have examined how machine learning algorithms can be trained on data obtained from cored wells and then used to perform petrofacies classification in other un-cored wells. Petrofacies labels, defined as a function of depth based on the integration of well-log and core data, are used to train the models (Sebtosheikh and Salehi 2015; Silva et al. 2015). The logs used for facies identification are usually gamma ray (GR), resistivity (Rt), neutron (NPHI), density (RHOB), and photoelectric factor (PEF), the last being a lithology indicator. In addition, other features can be extracted from these logs to improve the prediction, such as total organic carbon (TOC), matrix grain density (RHOMAA), and apparent volumetric cross-section (UMA).
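The train-on-cored-wells, predict-on-un-cored-wells workflow can be illustrated with a deliberately small example. The sketch below uses a k-nearest-neighbors classifier written in plain Python as a stand-in; it is not the algorithm of the cited studies, and every log reading and facies label in it is invented for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def fit_scaler(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-log mean and standard deviation computed on the cored well."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(columns, means)
    ]
    return means, stds


def scale(row: Sequence[float], means: List[float], stds: List[float]) -> List[float]:
    """Standardize one depth sample so that no single log dominates distances."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def knn_predict(train_rows, train_labels, query, k: int = 3) -> str:
    """Majority facies label among the k closest cored-well samples."""
    nearest = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Invented (GR, Rt, NPHI, RHOB, PEF) readings with core-derived labels.
cored_logs = [
    (30.0, 50.0, 0.18, 2.30, 1.9), (35.0, 40.0, 0.20, 2.28, 2.0),
    (28.0, 60.0, 0.17, 2.32, 1.8), (110.0, 5.0, 0.35, 2.55, 3.3),
    (120.0, 4.0, 0.38, 2.58, 3.4), (105.0, 6.0, 0.33, 2.52, 3.2),
]
cored_labels = ["sandstone"] * 3 + ["shale"] * 3

# Readings from a hypothetical un-cored well.
uncored_logs = [(32.0, 45.0, 0.19, 2.29, 1.9), (115.0, 5.0, 0.36, 2.56, 3.3)]

means, stds = fit_scaler(cored_logs)
train = [scale(row, means, stds) for row in cored_logs]
predictions = [
    knn_predict(train, cored_labels, scale(row, means, stds))
    for row in uncored_logs
]
```

The scaler is fitted on the cored well only and then reused on the un-cored well, which mirrors the constraint that no core-derived information exists for the target well.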

Earlier studies have used ANN, SVM, and RF to classify petrofacies from well logs in both sandstone and carbonate reservoirs (Silva et al. 2015; Al-Anazi and Gates 2010; Martinelli et al. 2013; Salehi and Honarvar 2014). Nevertheless, more recent studies have suggested that the Gradient Boosting (GB) algorithm outperforms ANN and SVM, especially when a limited number of features are available (Silva et al. 2015). Another algorithm that has shown success is the Random Forest (RF), which reduces the computational time for the training phase compared to GB (Bhattacharya and Mishra 2018). Based on the existing literature, there is no consensus regarding the most suitable machine learning technique for petrofacies classification. This could be due to several factors, including the wide variations in the features selected or the data available, as well as differences in geological complexity and reservoir heterogeneity. Indeed, as pointed out by Silva et al. (2015), the applicability of various algorithms has to be tested for each training/testing dataset to be used. One major challenge that remains for the success of machine learning in this application is selecting the right petrophysical and geological attributes to distinguish between facies. This task remains largely subjective and far from being automated.

Fracture and facies identification is usually made through personal judgment based on field log and laboratory core analysis data. Recently, AI has been used to identify fractures and facies in unconventional formations. Tian and Daigle (2019) identified micro-fractures and organic matter (OM) in siliceous and carbonate-rich shale samples and used AI to find the association between them. The aim was to automate the analysis of micro-fractures in shale samples, making it fast and free of personal evaluation. SEM and EDS images were used to find fractures and organic matter in intact and deformed samples. A single-shot detector (SSD) deep learning model was trained on the data obtained from the images. Around 97% of fractures in intact samples and 92% in deformed ones were identified using SSD. The detected organic matter images were also overlaid on the detected fractures to find the associations. It was found that the clear majority of micro-fractures penetrated the OM and clay minerals. According to the study, the combination of soft OM and clay with brittle minerals (quartz and calcite) appears to enhance fracability.

Well correlation

Correlating different reservoir units and formation tops across different wells is essential in reservoir characterization and modeling. Such a task may require significant time from experienced geologists, especially in large fields with hundreds of wells. The potential of machine learning for this task was recognized decades ago (Luthi and Bryant 1997). An interpreter first has to pick formation tops and perform well correlations in several wells, which then serve as a training dataset for interpreting tens to hundreds of other wells. An increasing body of studies (Maniar et al. 2018; Zheng et al. 2019) has demonstrated that a deep convolutional neural network (CNN) can provide an accurate and efficient approach for well-log correlations. The most common log data used for the correlation includes gamma ray and resistivity, although any other geophysical well-log data with sufficient log character can be used. One crucial observation documented by Zheng et al. (2019) was the drastic reduction in prediction accuracy as the size of the training dataset decreases. This might be explained by the complexity of geology, which requires wells covering different depositional environments and stratigraphic sequences throughout a field.

To produce a “universal” model for well correlation, Brazell et al. (2019) developed a deep CNN architecture trained on five million data points derived from thousands of well logs and experienced interpreter correlations. The data was obtained from various depositional environments and basins within the USA. The authors implemented a 3D search logic to determine the marker propagation pathway and the optimum correlation. The model does require some interpreted-top examples from the specific dataset to account for the particular complexity of the geology in a given area. Nevertheless, an extensive training dataset from the specific field is not needed because of the rich dataset used to build the model. The model achieved an accuracy of around 96% on the testing dataset. It is important to note that more interpreted examples might be needed for training if the model is to be applied to a dataset outside the USA with very different, complex regional geology. Another potential consideration is incorporating seismic sequence stratigraphy into the workflow, which currently relies only on well-log data. This can be important, especially in pinching-out strata and faulted reservoirs where the spatial continuation of a given unit might be heterogeneous.

Reservoir characterization

Machine learning has an increasing number of applications in the field of geosciences. Here we focus on applications directly related to reservoir characterization in the oil and gas industry. The areas discussed are the prediction of petrophysical properties from seismic, core, and well-log data, as well as the prediction of other properties such as water saturation, petroleum geochemical parameters, and reservoir geomechanics.

Petrophysical properties prediction

Reservoir characterization plays a critical role in the oil and gas industry, for example in developing optimal production and reservoir management strategies. Permeability, which determines the ability and direction of oil flow, is central to reservoir characterization. An accurate permeability determination is essential for material balance calculations, reservoir flow simulation, estimating oil production rate, stimulation strategies, and enhancing oil recovery. However, permeability is very difficult to determine because of its complexity and highly nonlinear nature. Therefore, machine learning techniques are widely used to predict petrophysical parameters such as porosity, permeability, capillary pressure, relative permeability, and bulk density. Table 5 summarizes the studies that predicted porosity and permeability.

figure 3

figure 4

Ahmed et al. (2019) presented a comparative study of predicting the rate of penetration (ROP) using several AI techniques. ROP was predicted for two wells using extreme learning machine, ANN, and support vector regression (SVR) techniques. They selected the input parameters for the ROP models based on the specific energy concept. The ROP was predicted for more than 8800 data points based on the revolutions per minute (RPM), weight on bit (WOB), torque, depth, mud weight, flow rate, nozzle sizes, and standpipe pressure (SPP). They reported that all ROP models showed acceptable prediction performance, with a correlation coefficient higher than 0.70 for the testing data. Among all tested techniques, SVR gave the best ROP estimation, with a correlation coefficient of 0.94.

Mehrad et al. (2020) used a machine learning approach to develop a rigorous ROP model for vertical wells. They used different parameters to determine the ROP, including logging, drilling, and geomechanical parameters. They found that the best ROP prediction can be obtained by using the uniaxial compressive strength (UCS), mud flow rate, weight on bit (WOB), depth, mud density (MD), and revolutions per minute (RPM) as input parameters. After that, they combined least-squares support vector machines (LSSVM) with different optimization algorithms to estimate the ROP profile. The examined optimization algorithms were the genetic algorithm (GA), particle swarm optimization (PSO), and the cuckoo optimization algorithm (COA). The LSSVM-GA, LSSVM-PSO, and LSSVM-COA hybrid algorithms were used to predict the ROP for two vertical wells, and more than 2000 data points were used to train and test the hybrid models. LSSVM-COA showed the best prediction performance for the training and testing wells among all tested algorithms, achieving an R-squared of around 0.802.

Artificial intelligence has proved effective for estimating drilling performance, and accurate ROP profiles can be predicted. However, these techniques are noticeably lacking in real-time operations, especially for gas wells. Also, most of the available ANN-based models were developed to predict the ROP for a certain section, usually the reservoir section. No attempt was reported to predict the full ROP profile using the ANN technique. Predicting the complete ROP profile in real time can significantly improve the drilling performance and reduce the operational time and cost.

Furthermore, coupling different drilling efficiency indicators can help improve drilling operations by considering more than one parameter. For example, the ROP models can be coupled with the mechanical specific energy (MSE) concept to determine the best drilling conditions in terms of drilling time (ROP) and required drilling energy (MSE). Hassan et al. (2018) coupled torque modeling with the MSE to optimize the drilling performance. First, artificial intelligence techniques were used to predict the torque and ROP profiles for around 18,000 ft. Then, the MSE was calculated for the whole drilling section using the surface drilling parameters. After that, the MSE was coupled with the torque and ROP profiles to identify the optimum drilling conditions that maximize the ROP and minimize the required drilling energy (MSE). They mentioned that the developed approach would enable drilling engineers to evaluate and optimize the drilling performance in real-time applications; hence, the surface drilling parameters can be controlled to keep the drilling operations within the optimum conditions.
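To make the ROP and MSE coupling concrete, the sketch below computes mechanical specific energy from surface parameters and ranks a few operating points. It assumes the widely used Teale expression in oilfield units, MSE = WOB/A + 120π·N·T/(A·ROP); the review does not state which expression Hassan et al. (2018) used. The candidate operating points and the ROP-to-MSE ranking criterion are also assumptions made for illustration.

```python
import math
from typing import Dict


def mechanical_specific_energy(
    wob_lbf: float,
    rpm: float,
    torque_ftlbf: float,
    rop_ft_hr: float,
    bit_diameter_in: float,
) -> float:
    """Teale-type MSE in psi (assumed form, not taken from the cited study)."""
    bit_area_in2 = math.pi * bit_diameter_in ** 2 / 4.0
    thrust_term = wob_lbf / bit_area_in2
    rotary_term = 120.0 * math.pi * rpm * torque_ftlbf / (bit_area_in2 * rop_ft_hr)
    return thrust_term + rotary_term


# Hypothetical operating points; in the workflow described above, torque and
# ROP would come from the trained AI models rather than being typed in.
candidates = [
    {"wob_lbf": 20000.0, "rpm": 100.0, "torque_ftlbf": 4000.0, "rop_ft_hr": 40.0},
    {"wob_lbf": 30000.0, "rpm": 120.0, "torque_ftlbf": 5000.0, "rop_ft_hr": 80.0},
    {"wob_lbf": 40000.0, "rpm": 140.0, "torque_ftlbf": 7000.0, "rop_ft_hr": 90.0},
]


def efficiency(point: Dict[str, float]) -> float:
    """ROP delivered per unit of drilling energy (illustrative criterion)."""
    mse = mechanical_specific_energy(bit_diameter_in=8.5, **point)
    return point["rop_ft_hr"] / mse


best_point = max(candidates, key=efficiency)
```

With these numbers the middle operating point wins: the third point drills slightly faster but needs disproportionately more torque, so its energy cost per foot is higher.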

Besides, AI techniques have been used to predict several drilling problems, such as loss of circulation, one of the most common drilling problems, which can increase the overall drilling cost by around 25–40%. Solomon et al. (2017) developed a new ANN model to identify lost circulation zones. The developed model can also recommend suitable sizes of lost circulation materials based on the characteristics of the depleted zones. They used 30 case studies to train and validate the developed ANN model. They reported acceptable prediction performance, with a coefficient of determination of 0.8. They also compared the reliability of the developed model with different fracture predictive models and concluded that the ANN model could reduce the estimation error from around 26% to less than 16%.

Manshad et al. (2017) used an SVM and a radial basis function to assess loss of circulation problems for 30 oil wells. They reported that the SVM showed high performance in predicting the amount of lost circulation material required to overcome the thief zones. A coefficient of determination of 0.8 was obtained between the predicted results and actual field data. In comparison, the radial basis function was able to estimate the mitigation of loss of circulation problems with an accuracy of 78.3%.

Al-Hameedi et al. (2018) estimated the volume of lost circulation materials for 500 wells using machine learning techniques. They predicted the volume of fluid losses based on the profiles of mud weight, bit nozzle sizes, ROP, equivalent circulation density (ECD), plastic viscosity (PV), and WOB. They reported that the machine learning models were able to predict the volume of fluid losses with acceptable error for different types of mud loss, including seepage, partial, severe, and total losses.

Alkinani et al. (2020) used an ANN technique to predict the volume of drilling fluid losses while drilling fractured zones. They developed and validated the ANN model using 1500 wells. The lost circulation volume was determined based on the profiles of mud flow rate, yield point (YP), PV, ECD, bit nozzle sizes, RPM, and WOB. They reported that the ANN model was able to predict the loss of circulation with a coefficient of determination higher than 0.92.

Abbas et al. (2019) applied SVM and ANN techniques to estimate the severity of loss of circulation while drilling. They used 1120 case studies from 385 wells to train and validate the new AI models for different types of mud losses: seepage, partial, severe, and total. They used the rock lithology, mud properties, and surface drilling parameters to predict the severity of loss of circulation. They reported that the developed ANN model was able to estimate the fluid loss with a correlation coefficient higher than 0.82. The SVM model showed better prediction performance than the ANN model, with a correlation coefficient higher than 0.91.

Overall, different AI techniques have been utilized to estimate loss of circulation problems, with ANN and SVM being the most common tools for this purpose. Good practical performance was reported for predicting lost circulation based on the mud properties, rock lithology, and drilling parameters. However, the application of these models in real-time operation might be restricted by the huge volume of drilling data, which can produce misleading results or delay the model prediction. Therefore, proper data cleaning may be required to improve the data quality and reduce the data size for real-time applications (Elkatatny et al. 2016).

Drilling fluids

Drilling is one of the most critical tasks, with challenges including lost circulation, clogged pipes, wellbore instability, and kicks occurring regularly. Drilling fluid, sometimes known as the "blood of the drill," is a direct or indirect remedy for these challenges during the drilling process. It helps to keep the wellbore clean and to retain the wellbore's integrity. For instance, high mud weight controls high wellbore pressures and prevents kicks. On the other hand, high mud weight tends to fracture the formation. Similarly, low mud weight prevents fractures but can cause a kick or blowout. Furthermore, drilling fluids prevent the pipe from sticking during drilling by building a thin filter cake on the wellbore wall and by removing drill cuttings from the wellbore. The drilling fluid works as an architect for the wellbore. The operation's success or failure is largely determined by the drilling fluid's performance and compatibility (Agwu et al. 2018). Many drilling issues can be avoided by using the proper drilling fluids. Drilling fluids are always chosen based on data analysis and expertise gained from previously drilled wells in the area. Each well design includes a drilling fluid program that specifies the drilling fluid, additives, rheology, density, filtration, and other drilling fluid parameters. Combating wellbore difficulties involves comprehensive analysis and decision-making to design a drilling fluid that satisfies the specific needs of distinct formation features.

The majority of drilling fluid design is done in the laboratory through trial and error. Hence, a system that can use existing data and provide a deeper knowledge of drilling fluid is required. Machine learning models are created using the parameters of drilling fluids and the downhole conditions. These models aid in forecasting changes in drilling fluid parameters and recommend the optimum course of action. Rheological models express a mathematical relationship between the shear rate and the shear stress to describe the fluid flow behavior. This relationship is complicated in the case of drilling fluids, and no single rheological model can accurately fit all drilling fluids' shear stress-shear rate data across all shear rate ranges. Instead, a plethora of mathematical models with varying degrees of relevance has been utilized. These mathematical models do not precisely capture the behavior of non-Newtonian fluids. For instance, the Bingham plastic model does not describe the drilling fluid flow behavior at low shear rates, and it overestimates the yield point of the drilling fluid. The power-law model does not account for the yield point of drilling fluids. In the case of the Herschel-Bulkley model, the many rheological parameters involved make hydraulic calculations challenging (Huang et al. 2020).
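The three rheological models named above have standard closed forms, shown here as plain Python functions: Bingham plastic (τ = τ_y + μ_p·γ), power law (τ = K·γ^n), and Herschel-Bulkley (τ = τ_y + K·γ^n). The parameter names and the numeric values are illustrative assumptions, chosen only to show how the models relate to one another.

```python
def bingham_plastic(shear_rate: float, yield_point: float, plastic_viscosity: float) -> float:
    """Shear stress of a Bingham plastic fluid: linear above a yield point."""
    return yield_point + plastic_viscosity * shear_rate


def power_law(shear_rate: float, consistency_index: float, flow_index: float) -> float:
    """Shear stress of a power-law fluid: no yield stress term."""
    return consistency_index * shear_rate ** flow_index


def herschel_bulkley(
    shear_rate: float,
    yield_stress: float,
    consistency_index: float,
    flow_index: float,
) -> float:
    """Shear stress of a Herschel-Bulkley fluid: yield stress plus power law."""
    return yield_stress + consistency_index * shear_rate ** flow_index


# Illustrative parameters in SI units (Pa, Pa*s^n, 1/s); not measured data.
shear_rates = [0.0, 5.0, 50.0, 500.0]
bingham_curve = [bingham_plastic(g, 5.0, 0.03) for g in shear_rates]
power_curve = [power_law(g, 0.5, 0.6) for g in shear_rates]
hb_curve = [herschel_bulkley(g, 3.0, 0.5, 0.6) for g in shear_rates]
```

At zero shear rate the power-law curve starts at zero stress, which is the missing yield point mentioned in the text. Herschel-Bulkley reduces to the power law when its yield stress is zero and to Bingham plastic when its flow index is one, at the cost of a third parameter to fit.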

Regression approaches such as ANN are used to predict rheological properties. For greater accuracy, the ANN model can be retrained continuously as more datasets become available. It gives a more comprehensive view of the drilling performance. For example, a reduction in pump pressure during the drilling operation can have several causes, including a thinning effect on the drilling fluid, quick transport of the cuttings to the surface, reservoir fluid influx into the wellbore, and lost circulation. Here AI links the different parameters, improves the decision-making process, and puts the engineers back on the right track within a short time.

Tables 8 and 9 outline several studies of artificial intelligence in drilling fluids. The tables summarize the drilling fluid properties investigated and the AI technique used. They also show the input and output parameters and the accuracy of each model, evaluated using the coefficient of determination (R²), mean square error (MSE), average absolute percent relative error (AAPE), etc.
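For reference, the accuracy measures listed above can be computed as follows. This is a minimal sketch using the common textbook definitions, which individual studies may vary slightly; the measured and predicted values are invented.

```python
from typing import Sequence


def mean_square_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def aape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute percent relative error (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Invented measured and predicted values of a fluid property.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

mse_value = mean_square_error(actual, predicted)
r2_value = r_squared(actual, predicted)
aape_value = aape(actual, predicted)
```

Note that mean square error shares the abbreviation MSE with mechanical specific energy used in the drilling optimization discussion; the two quantities are unrelated.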

figure 5

Fluid flow through a fracture network is challenging to simulate because of the structural complexity. Srinivasan et al. (2018) built a machine learning tool to predict solute flow through a fractured network. A discrete fracture network (DFN) methodology was used to simulate fluid flow in fracture networks. Solute flow through the fracture network usually takes the shortest path. Graph theory was used to reduce the number of fractures to those that contribute to flow. Then, SVM and RF were used to identify the backbone of the fracture network that contributes to flow. This significantly reduced the computational power needed when simulating flow using the DFN model. The trained model could capture the early solute breakthrough precisely; nevertheless, it was not as useful in predicting late-time flow.
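The graph-theory step can be pictured with a toy network in which fractures are nodes and intersections are edges. The sketch below finds the shortest inlet-to-outlet path with a breadth-first search. It illustrates only the graph reduction idea, not the method of Srinivasan et al. (2018), which additionally trained SVM and RF classifiers; the network itself is invented.

```python
from collections import deque
from typing import Dict, List, Optional


def shortest_path(
    graph: Dict[str, List[str]], source: str, target: str
) -> Optional[List[str]]:
    """Breadth-first search for the path with the fewest fracture hops."""
    previous: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the predecessors to rebuild the path.
            path = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # inlet and outlet are not connected


# Invented fracture network: keys are fractures, values are the fractures
# they intersect. "f4" is a dead end that carries no through-flow.
network = {
    "inlet": ["f1", "f2"],
    "f1": ["inlet", "f3"],
    "f2": ["inlet", "f4", "f5"],
    "f3": ["f1", "f6"],
    "f4": ["f2"],
    "f5": ["f2", "outlet"],
    "f6": ["f3", "outlet"],
    "outlet": ["f5", "f6"],
}
backbone = shortest_path(network, "inlet", "outlet")
```

A flow simulation restricted to the fractures on such a path touches far fewer elements than the full network, which is consistent with the reported accuracy for early breakthrough and the weaker late-time predictions, since slower alternative routes are left out.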

Proppant distribution in a hydraulic fracture is crucial information, as it can be used to optimize MSF design. Maity et al. (2019) identified proppant particles in cored samples using image processing supported by machine learning classification tools. The goal was to understand the proppant distribution after an MSF job. This helps in choosing the locations of new infill wells and the completion spacing, as the proppant distribution indicates the length of the propped fracture and which clusters were propped. Images were taken of particles obtained from a 600 ft cored interval; a dedicated slanted well was drilled to obtain these cores. Using a trained ANN classifier, the particles were divided into proppant, calcite, and others. The following particle attributes were used as inputs: hue, roundness, size, darkness, roughness, translucence, and entropy. K-fold cross-validation was used to optimize the hidden layer size of the ANN, which was benchmarked against other classifiers such as SVM. It was concluded that the proppant is confined within a 30 ft vertical distance in the studied formation. The result was validated against field data using other classification techniques.
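The use of k-fold cross-validation to choose a hidden layer size can be sketched as follows. This is a generic illustration rather than the procedure of Maity et al. (2019): the helper names are hypothetical, the folds are contiguous (in practice the samples would normally be shuffled first), and `dummy_score` is a placeholder for a routine that trains an ANN on the training indices and returns a validation score.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, nearly equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def select_hidden_size(candidates, n_samples, k, train_and_score):
    """Return the candidate with the highest mean validation score over the k folds."""
    mean_scores = {}
    for size in candidates:
        scores = [train_and_score(size, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_scores[size] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


def dummy_score(hidden_size, train_idx, val_idx):
    """Placeholder scorer; a real one would fit a classifier and score it on val_idx."""
    return 1.0 - abs(hidden_size - 8) / 10.0


best_size, scores_by_size = select_hidden_size([4, 8, 16], n_samples=10, k=3, train_and_score=dummy_score)
print(best_size, scores_by_size)
```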

AI is also an active area in hydraulic fracture design optimization, covering aspects such as the number of horizontal wells, number of stages, volume of proppant and fluids, type of chemical additives, and sweet spot identification (Awoleke and Lane 2011; Lolon et al. 2016). Most AI-based models ignore important geological and reservoir properties such as porosity, permeability, saturation, and pressure. These data are challenging to obtain, especially along the horizontal section of the wellbore. Some researchers replaced these data with the location of the well (i.e., coordinates), as the mentioned properties vary spatially (Mishra et al. 2015; Wang and Chen 2019). Wang and Chen (2019) trained machine learning algorithms (RF, SVM, ANN, and AdaBoost) on data from 3160 horizontal wells in the Montney unconventional formation to predict first-year production and optimize the fracture design. Features such as proppant mass, well location, lateral length, treatment fluid size and type, completion type, and number of stages were used for training. Recursive feature elimination with cross-validation (RFECV), with RF as the underlying predictor, was used to find the most significant features. Then, the algorithms were trained on the most important features to predict the production rate of a fractured well. RFECV showed that, for the Montney formation, the most important parameters for enhancing production are the mass of proppant pumped and the location of the well. It was found that using more than four features (proppant mass, latitude, longitude, and TVD) does not improve the correlation coefficient. It was also observed that RF gives the best performance in terms of prediction accuracy. One drawback of the trained model is its lack of reservoir properties such as permeability, porosity, and pressure.

Optimization of hydraulic fracture stages using gradient-free (i.e., AI) methods has been studied by many researchers (Iino et al. 2020; Yu and Sepehrnoori 2013). The objective function is usually the net present value (NPV) or cumulative production. Features such as fracture half-length, spacing, porosity, permeability, the distance between laterals, and fracture conductivity were used for the optimization. Different AI algorithms were tried, such as covariance matrix adaptation evolution strategy (CMA-ES), simultaneous perturbation stochastic approximation (SPSA), genetic algorithm (GA), and non-dominated sorting genetic algorithm (NSGA-II). Rahmanifard and Plaksina (2018) aimed to optimize hydraulic fracture stages in an unconventional gas formation based on cumulative production or NPV using AI-based optimization tools such as GA, Differential Evolution (DE), and Particle Swarm Optimization (PSO). Gradient-based methods are usually used for optimization purposes. However, they can become trapped in local optima, which means that the global optimum may not be found. Also, many functions are not differentiable at certain values or over certain ranges. Hence, this study used AI-based optimization tools that are gradient-free. The authors used the analytical slab model of Wattenbarger et al. (1998) to estimate cumulative gas flow within a certain production period. The objective function is the NPV, which is a function of the cumulative gas production, cumulative water production, and the costs of hydraulic fracturing and waste disposal. The objective is to find the optimum number of hydraulic fractures (NHFs) that maximizes NPV. PSO outperformed the other AI methods, such as DE and GA, as it required far fewer iterations to converge.
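A minimal sketch of gradient-free optimization with PSO is given below. It does not reproduce the model of Rahmanifard and Plaksina (2018): the NPV function is a hypothetical stand-in (saturating revenue minus a linear fracturing cost) rather than the Wattenbarger slab model, the number of fractures is treated as a continuous variable, and the PSO coefficients are common textbook choices.

```python
import math
import random


def toy_npv(n_fractures, revenue_cap=100.0, decline=20.0, cost_per_fracture=1.0):
    """Hypothetical NPV: revenue saturates with fracture count, cost grows linearly."""
    revenue = revenue_cap * (1.0 - math.exp(-n_fractures / decline))
    return revenue - cost_per_fracture * n_fractures


def pso_maximize(objective, lower, upper, n_particles=20, n_iterations=100,
                 inertia=0.7, cognitive=1.5, social=1.5, seed=0):
    """One-dimensional particle swarm optimization; no gradients are required."""
    rng = random.Random(seed)
    positions = [rng.uniform(lower, upper) for _ in range(n_particles)]
    velocities = [0.0] * n_particles
    personal_best = list(positions)
    personal_best_value = [objective(x) for x in positions]
    best_index = max(range(n_particles), key=lambda i: personal_best_value[i])
    global_best = personal_best[best_index]
    global_best_value = personal_best_value[best_index]
    for _ in range(n_iterations):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Velocity blends momentum, pull to own best, and pull to swarm best.
            velocities[i] = (inertia * velocities[i]
                             + cognitive * r1 * (personal_best[i] - positions[i])
                             + social * r2 * (global_best - positions[i]))
            # Keep the particle inside the search bounds.
            positions[i] = min(max(positions[i] + velocities[i], lower), upper)
            value = objective(positions[i])
            if value > personal_best_value[i]:
                personal_best[i] = positions[i]
                personal_best_value[i] = value
                if value > global_best_value:
                    global_best = positions[i]
                    global_best_value = value
    return global_best, global_best_value


best_n, best_npv = pso_maximize(toy_npv, lower=1.0, upper=100.0)
print(round(best_n), best_npv)
```

For this toy objective the optimum can also be found analytically at 20 ln 5 (about 32 fractures), which gives a simple check on the optimizer; a realistic NPV model would generally not offer such a closed form.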

Du et al. (2017) utilized embedded discrete fracture modeling (EDFM) to train an AI-based algorithm to estimate productivity in the Permian Basin. The authors used EDFM for fracture representation in a reservoir simulator, a method that reduces the need for fine grids. The EDFM is composed of two elements, matrix and fracture, which can be represented separately. Mangrove, a commercial software package, was used to generate the hydraulic fracture network. AI was implemented to remove unnecessary fracture complexity that would not contribute to productivity. Using AI methods to reduce the complexity of the fracture network before implementing it in EDFM significantly reduced the simulation time compared with using Mangrove alone, which made sensitivity analysis feasible. However, the simplified fracture structure produced by the AI should be history matched to tune parameters such as reservoir permeability; otherwise, an error of up to 40% could result.

Bhattacharya et al. (2019) used machine learning algorithms to predict production in fractured Marcellus shale. The authors used the data of one well with 28 hydraulic fracture stages in the Marcellus shale to predict the production rate. The data used were petrophysical and geomechanical data (GR, sonic), pressure data (surface, casing, tubing), and fiber optics data such as distributed acoustic sensing (DAS) and distributed temperature sensing (DTS); hydraulic fracturing data and design were not included. Ghahfarokhi et al. (2018) also used DAS and two years of DTS data to estimate production from a Marcellus shale well. Bhattacharya et al. (2019) implemented the following machine learning tools: RF, ANN, and SVM. Feature engineering was used to derive secondary attributes, such as the brittleness index (BI), from the raw data. Collinearity analysis was used to find the most suitable features, which reduced their number from 34 to 18. All models could predict the production rate with good accuracy. However, SVM was less accurate and required more computation time. Including hydraulic fracturing, reservoir, and PVT properties should improve accuracy; the lack of these data is a major limitation of the approach. Figure 6 shows that Poisson's ratio (PR) and brittleness index (BI) were the most important features, while DAS and DTS were not as significant.
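One simple form of collinearity analysis is to drop any feature that is strongly correlated with a feature already retained. The sketch below illustrates this idea only; the threshold, the greedy ordering, and the feature values are assumptions, and Bhattacharya et al. (2019) may have used a different procedure.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def drop_collinear(features, threshold=0.9):
    """Greedily keep a feature only if |r| with every kept feature is below threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up values; the feature names are for illustration only.
features = {
    "youngs_modulus": [1.0, 2.0, 3.0, 4.0, 5.0],
    "brittleness_index": [2.0, 4.0, 6.0, 8.0, 10.5],  # nearly proportional to the first
    "gamma_ray": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(drop_collinear(features))
```

Because the selection is greedy, the result depends on the order in which features are listed, so domain knowledge would normally guide which member of a correlated pair is kept.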

figure 6

Similar concepts were applied to other shale formations such as the Bakken shale. Luo et al. (2019) investigated the possibility of predicting the productivity of horizontally drilled wells in the Bakken shale based on completion and geological parameters. Geology and completion data from 2061 horizontal wells in the Bakken were used. These include vertical depth, amount of proppant, water saturation, porosity, permeability thickness, etc. Spearman correlation, RF, and joint mutual information (JMI) were used for feature selection. Deep learning (ANN) was used as the predictive model based on one-year production data. Based on feature selection, it was found that formation thickness, depth, and amount of proppant are the most important parameters for predicting first-year production. It was also observed that less porous spots require more proppant to increase productivity, which agrees with the physics of unconventional reservoirs. Wang et al. (2011) applied AI to 2780 MSF wells and 139 vertical wells in the Bakken to predict productivity. A deep neural network was used in the study, with k-fold cross-validation to check the predictive ability of the model. The number of hidden layers and neurons was optimized to give the best prediction of 6-month and 18-month production. The model showed that the amount of proppant placed in each stage is the most important parameter in predicting productivity. The trained model gave a small root mean square error (RMSE) when predicting 6-month and 18-month production.

Sweet spot identification in unconventional formations is important because horizontal drilling combined with MSF is expensive and must be justified by good productivity. Also, unconventional formations cover large areas, and hence, finding the right location to complete the well is critical. Tahmasebi et al. (2017) defined sweet spots as locations with high TOC and high fracability index (FI). They fitted multiple linear regression (MLR) to log data from a shale formation to predict TOC and FI. Nonlinear models such as FL, hybrid neural networks (NN)/FL, and GA were also implemented. Stepwise selection was used for variable selection. Mineralogical composition was used to assess fracability, with quartz as the brittle mineral. MLR failed to predict FI, giving an unsatisfactory correlation coefficient of 0.44. The prediction of TOC was better, with a correlation coefficient of around 0.88. Among the nonlinear models, the hybrid machine learning (HML) model (NN + FL) provided better accuracy and removed the weakly correlated variables.

Rastogi and Sharma (2019) used machine learning tools to find the impact of fracturing chemicals on production using one-year production data. Different algorithms were used for feature selection, such as F-regression, decision tree-based regression, recursive feature elimination, etc. The data were obtained from different fracture jobs in the Powder River Basin. Chemical additives were found to be among the top 5 parameters that impact productivity out of 11 selected features.

AI has also been applied to acid fracturing for conductivity prediction. Acid self-props the fracture by generating peaks and valleys that act as conduits for the fluids to follow. Akbari et al. (2017) used 106 experimentally generated data points to develop a conductivity correlation based on GA. The developed correlation was more accurate than the popular correlations for acid fracture conductivity. Eleibide et al. (2018) applied ANN and adaptive network-based fuzzy inference systems to the same data set. The authors showed that the model accuracy improved compared with the model of Akbari et al. Desouky et al. (2020a, 2020b) utilized more than 500 data points to generate a more accurate acid fracture conductivity correlation that considers rock type and etching pattern.

Future and challenges

The use of ML techniques to handle large data sets and to predict parameters in many areas of the oil and gas industry is growing rapidly. The main reason is the large volume of data generated in the everyday activities of the oil and gas industry. To process these data and make them useful, careful data processing and handling has to take place, and ML techniques are a great tool for that. Furthermore, because of the complexity of the relationships between the many factors controlling the productivity of an oil or gas well, ML techniques are widely used to uncover these relationships and build multilayered correlations that relate the different factors. Classical linear/nonlinear regression methods cannot handle this level of complexity as ML models do. Also, the high uncertainty of many oil and gas industry activities is a major concern given their capital-intensive nature; building reliable forecasting and prediction models is necessary to navigate these challenges while optimizing the outcomes.

ML techniques have provided many solutions that help the oil and gas industry thrive. At the same time, these models have many disadvantages that are sometimes ignored or rarely mentioned. One of the main disadvantages of using ML to build a relationship between several parameters is that a high correlation does not necessarily imply causation (“correlation does not imply causation”). A highly correlated model linking several parameters based on the data used should not be taken as an indication that these parameters truly have a cause/effect relationship unless there is a proven physical or scientific relationship between them. Many models developed in the literature fail to address this fact and tend to associate correlation with causation. Another common challenge facing the applicability of ML techniques is the availability and accuracy of the data used to build and test these models. The data have to be accurate in order to produce a useful model. Otherwise, the model developed will never be useful, no matter how high its apparent accuracy. Conducting quality assurance on data collection is highly recommended to avoid this issue.

A common criticism of ML models is that they require a large and diverse data set for training. Any model needs sufficient representative data in order to capture the underlying structure that allows it to generalize to new, similar cases. For instance, an ML model built to predict production from a certain formation would only be applicable to that formation and under the same conditions under which the training data were collected. Generalizing predictive models has to be done with careful consideration of the constraints of these models and the diversity and inclusivity of the data used to build them. This is a major disadvantage of ML techniques, as they tend to be generalized without careful consideration of this limitation.

Surely, ML cannot be used to predict just anything related to the oil and gas industry or to build a correlation between any two or more arbitrary parameters. Before building relationships between the different factors, a scientific and factual explanation of the actual “physical” relationship between these parameters has to be established. Also, using ML to predict and forecast based on historical data has to be done carefully, by ensuring that future conditions are similar to the historical ones. ML tends to be a very useful tool for dealing with big data and for building complex relationships between parameters that linear/nonlinear regression models cannot handle. Many of the correlations that were established by regression analysis of laboratory data are being replaced with correlations developed using ML methods, which are more case specific than general.

Deep learning, which is a subset of ML based on ANN, is very efficient for many tasks, but it is not the solution to every problem, as it faces many challenges. Deep learning algorithms need to be trained with large data sets, and accurate data are not always accessible or available in many areas of the oil and gas industry. Therefore, overfitting is considered the most common problem in ML applications, mainly because of the lack of an appropriate amount of training data. Overtraining can also happen when there is no clear stopping criterion for training: the error keeps decreasing as the model is updated, and the model becomes more complex in order to fit a specific data set. Even when dealing with large data sets, a major challenge is the training cost. In many situations, supercomputers are needed to handle large oil and gas data sets to build and run ML models.
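One common way to give training a clear stopping point is early stopping on a validation loss. The sketch below assumes that a validation loss is recorded after every epoch; the function name, the patience rule, and the loss values are illustrative assumptions rather than a method prescribed by the studies reviewed here.

```python
def early_stopping_epoch(validation_losses, patience=3, min_delta=0.0):
    """Scan per-epoch validation losses and stop after `patience` consecutive
    epochs without improvement. Returns (best_epoch, best_loss) seen before stopping."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stalled; further training risks overfitting
    return best_epoch, best_loss


# Made-up validation losses: improvement stalls after epoch 3.
losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.39]
print(early_stopping_epoch(losses, patience=3))
```

With a patience of 3, training in this example stops before the final, slightly lower loss is reached, which shows the trade-off involved in choosing the patience value.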

The future of ML applications in the oil and gas industry looks promising. With the arrival of the internet of things, the automation of many oil and gas activities, and the high reliance on data, it is possible to minimize risks and enhance productivity by integrating ML algorithms that are continuously trained and improved using the continuous flow of data. With the generation of large volumes of data in the oil and gas industry, petroleum engineers and geoscientists must be exposed to the big data handling techniques being developed in the AI domain. Making the most of the available data is being addressed nowadays and will continue to be the trend in the future. Optimization cannot be reached without the powerful capabilities of AI.

Concluding remarks

Based on the review of the literature and the authors’ work on the applications of AI in petroleum engineering, the following remarks can be made:

References

Author information

Authors and Affiliations

  1. King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia Zeeshan Tariq, Murtada Saleh Aljawad, Amjed Hasan, Mobeen Murtaza, Emad Mohammed, Ammar El-Husseiny, Sulaiman A. Alarifi, Mohamed Mahmoud & Abdulazeez Abdulraheem
  1. Zeeshan Tariq