The impact of big data is seen everywhere today, especially in social media networks, cloud-based solutions for organizations, and enterprise-wide IT systems. As a result, log files have grown exponentially, driven by both user activity and maintenance and operations activity. Storage retention requirements have also increased, pushing log files into the terabyte (TB) and sometimes petabyte (PB) range. At this scale, log file analysis becomes increasingly difficult, so it is essential that an analyst follow a solid log analysis methodology, reinforced with a capable platform, in order to process log files, find artifacts of value, and yield accurate results.
In the past, a log analyst would rely on command line utilities such as grep, awk, and sed. While these utilities can still be used successfully for filtering log files, the sheer volume of modern logs and the complexity of investigations require the analyst to use more advanced log analysis tools. I have found success performing detailed log file modelling and reporting using Python pandas (DataFrames) tools and techniques for big data analysis. However, this approach requires the log analyst to be comfortable with pandas DataFrames and Python programming (or another data science language).
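As a minimal sketch of the kind of filtering and aggregation meant above (the column names and log values here are hypothetical, not from a real system):

```python
import io
import pandas as pd

# Hypothetical web-server log excerpt in CSV form; a real log would be
# read from disk with pd.read_csv("access.csv", parse_dates=["timestamp"]).
raw = io.StringIO(
    "timestamp,client_ip,status,bytes\n"
    "2024-05-01 10:00:01,203.0.113.5,200,512\n"
    "2024-05-01 10:00:02,203.0.113.5,404,128\n"
    "2024-05-01 10:00:03,198.51.100.7,500,64\n"
)
df = pd.read_csv(raw, parse_dates=["timestamp"])

# Log reduction: keep only error responses, then aggregate by client.
errors = df[df["status"] >= 400]
per_client = errors.groupby("client_ip").size()
print(per_client.to_dict())
```

The same filter-then-aggregate pattern scales from a three-row sample to millions of rows, which is where DataFrames pay off over line-oriented utilities.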
To help reduce the impact of a learning curve, I recommend the free Windows-based Microsoft Power BI Desktop application as an alternative log analysis platform. With a user interface and style similar to Microsoft Excel's Power Query, Microsoft Power BI Desktop scales to perform effective log parsing, event filtering, event aggregation, event correlation, log reduction, interactive visualizations, and reporting.
The successful analysis of log files requires the analyst to follow a Log Analysis Methodology. Figure 1 presents a reasonable Log Analysis Methodology to follow when conducting a log file investigation.
Figure 1. Four Stages of Log Analysis
The Log Collection stage involves identifying the devices or sources (e.g., application, system, network device, security appliance) and capturing the log files that must be extracted for analysis. Preferably, the detail and robustness of the log files are decided during the requirements and design phase of the implemented systems, so that they contain relevant events such as:
- authentication successes and failures
- authorization failures
- application related events and errors
- modifications to configurations
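As a sketch, filtering for the relevant event classes listed above can be done with a simple pattern scan (the log lines and event keywords here are hypothetical, not tied to any product's log format):

```python
import re

# Hypothetical application log lines; the keywords are illustrative only.
lines = [
    "2024-05-01 09:12:44 AUTH FAILURE user=alice src=10.0.0.8",
    "2024-05-01 09:12:50 AUTH SUCCESS user=alice src=10.0.0.8",
    "2024-05-01 09:13:02 CONFIG CHANGE key=max_sessions old=50 new=500",
    "2024-05-01 09:13:10 DEBUG heartbeat ok",
]

# Keep only the event classes of interest; drop routine noise like heartbeats.
relevant = re.compile(r"AUTH (SUCCESS|FAILURE)|AUTHZ FAILURE|ERROR|CONFIG CHANGE")
kept = [line for line in lines if relevant.search(line)]
print(len(kept))  # 3
```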
Initial log analysis may be performed on the primary device containing the log files if the appropriate tools are available; otherwise, the logs will have to be exported for analysis on a separate platform. In most cases the analyst will need to export the logs from the primary device, at which time the export log format must be determined (e.g., CSV, JSON) and confirmed to be supported by the secondary analysis platform. Figure 2 presents a list of the import file formats supported by Microsoft Power BI.
During the Collection stage, the analyst should also initiate log reduction techniques (filtering) to ensure the extracted log file contains only information relevant to the case at hand. Besides pulling logs directly from the source, log files may also need to be pulled from separate log collectors or aggregators, such as a SIEM, where source log files are exported on a regular basis for monitoring, analysis, archiving, or storage. Regardless of the source, the analyst must determine the format of the extracted log files and export them while maintaining the integrity of all log data. It is imperative that the log file sources and exported log file format be identified in advance, preferably before an incident occurs.
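Maintaining the integrity of exported logs is commonly done by hashing the file at export time and re-verifying that hash before analysis; a minimal sketch using Python's standard library (the file name and contents are invented for illustration):

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a (potentially huge) log file through SHA-256 in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record the hash at export time; recompute and compare before analysis
# to prove the exported log was not altered in transit or storage.
with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp) / "exported.log"
    export.write_bytes(b"2024-05-01 10:00:01 AUTH FAILURE user=alice\n")
    digest_at_export = sha256_of(export)
print(digest_at_export)
```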
Figure 2. Microsoft Power BI Import Formats
The Log Preparation stage commences after the identified log files have been extracted and imported into a log analysis tool. The analyst should ensure their platform of choice can support big data size investigations. This includes, at a minimum, sufficient memory (RAM), ample disk drive space, effective CPU processing power, and the proper selection of a log analysis application that can scale.
It is during this phase that integrity should be checked; the analyst should perform visual and programmatic inspection to ensure each row (record) and column (field) is complete. First, the analyst must determine the structure of each log record/row. There are many different structures used for log files (e.g., Syslog, LEEF, W3C Extended Log File format, NetFlow, IPFIX, Microsoft Windows Event Viewer format, NCSA Common Log Format). Next, the analyst must select optimized column data types (e.g., Date/Time, Text, Integer). Finally, the log analyst must perform various clean-up procedures to ensure only relevant information is analysed.
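A sketch of the record-structure and data-type steps, using a simplified syslog-style line (the format, sample record, and year handling are illustrative; real RFC 3164/5424, LEEF, or W3C parsers are more involved):

```python
import re
from datetime import datetime

# A simplified syslog-style record (classic syslog omits the year,
# so the analyst must supply it when converting to a Date/Time type).
record = "May  1 10:00:01 webhost sshd[812]: Failed password for root"

pattern = re.compile(
    r"(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<proc>[^\[]+)\[(?P<pid>\d+)\]: (?P<msg>.*)"
)
fields = pattern.match(record).groupdict()

# Select optimized types per column: Integer for pid, Date/Time for ts,
# leaving host/proc/msg as Text.
fields["pid"] = int(fields["pid"])
fields["ts"] = datetime.strptime("2024 " + fields["ts"], "%Y %b %d %H:%M:%S")
print(fields["host"], fields["pid"])  # webhost 812
```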
Below is a summary of steps the analyst should take for effective and efficient log analysis:
- Filter or remove any unrelated rows, columns, and tables to reduce the size of the log files.
- Select the appropriate data types for each field or column. Some data types are searched faster than others.
- Use optimized filtering techniques for conducting pattern searches (e.g., regex, CIDR notation).
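The regex and CIDR pattern-search techniques in the last step can be sketched with Python's standard library (the log lines, path, and subnet are hypothetical):

```python
import ipaddress
import re

# Hypothetical web access log lines: "client_ip method path status".
lines = [
    "10.1.2.3 GET /admin/login 401",
    "192.168.5.9 GET /index.html 200",
    "10.1.9.9 POST /api/upload 500",
]

# Regex pattern search: flag any request touching an /admin path.
admin_hits = [l for l in lines if re.search(r"\s/admin(/|\s)", l)]

# CIDR filtering: keep only traffic from the 10.1.0.0/16 range.
subnet = ipaddress.ip_network("10.1.0.0/16")
in_scope = [l for l in lines if ipaddress.ip_address(l.split()[0]) in subnet]

print(len(admin_hits), len(in_scope))  # 1 2
```

Compiled regexes and pre-parsed network objects like these are what make pattern searches over millions of rows tractable.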
The Log Modeling stage entails the analysis of the information contained within log files based upon the type of events/incidents related to the investigation. Since different investigative models (e.g., threat hunting) exist, it is imperative for the analyst to be knowledgeable about the different types and select the most appropriate model. For example, if the log analyst is performing an intrusion analysis, an investigative model designed to identify intrusions should be used. Figure 3 presents a high-level description of the investigative categories appropriate for an investigation that could include system, application, network, and user content logs.
Figure 3. Log Analysis Investigative Categories
While the figure presents the categories in univariate format, where each can be analysed individually, multivariate analysis can also be performed, as identified in line item 16, by combining two or more categories. Figure 4 presents a simple multivariate comparison using the Temporal Analysis and Size Analysis categories, whereas Figure 5 presents a more complex multivariate comparison using the Source Linkage (client IP addresses) and Destination Linkage (email domains) categories.
Figure 4. Temporal and Size Multivariate Analysis
Figure 5. Source and Destination Linkage Multivariate Analysis
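A temporal-and-size multivariate view like the one in Figure 4 amounts to binning transfer size over time; a minimal sketch (the event tuples are invented):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (timestamp, bytes transferred) events from a log.
events = [
    (datetime(2024, 5, 1, 10, 5), 512),
    (datetime(2024, 5, 1, 10, 40), 2048),
    (datetime(2024, 5, 1, 11, 15), 128),
]

# Bin transfer size by hour: a spike in one bin combines the temporal
# and size dimensions into a single indicator worth investigating.
bytes_per_hour = defaultdict(int)
for ts, size in events:
    bytes_per_hour[ts.replace(minute=0, second=0, microsecond=0)] += size

for hour, total in sorted(bytes_per_hour.items()):
    print(hour.strftime("%H:00"), total)
```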
If a threat hunting model is used to identify TTPs (Tactics, Techniques, and Procedures) or attack trees, the analyst should be able to perform predictive analytics when examining log files. The selected tool should support common data science languages and libraries (e.g., R, Python, TensorFlow, scikit-learn) and machine learning algorithms (e.g., k-nearest neighbours, logistic regression, decision trees, random forests). Finally, if the analysis encompasses multiple log files or tables across similar or disparate sources, the log analysis tool should be able to establish optimized data modelling (cardinality) relationships (e.g., one-to-one, one-to-many, many-to-many) between the log files or tables for cross-correlating events (e.g., based upon the synchronization of time).
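One of the algorithms named above, k-nearest neighbours, can be sketched in plain Python to show the idea behind such predictive classification (the feature values and labels are invented; in practice a library such as scikit-learn would provide the implementation):

```python
import math

# Toy labelled history: (failed_logins_per_min, MB_out_per_min) -> label.
# Entirely illustrative data, not drawn from any real investigation.
history = [
    ((0.1, 1.0), "benign"),
    ((0.2, 2.0), "benign"),
    ((9.0, 80.0), "suspicious"),
    ((12.0, 95.0), "suspicious"),
]

def knn_predict(x, k=3):
    """Classify x by majority vote among its k nearest labelled neighbours."""
    ranked = sorted(history, key=lambda item: math.dist(x, item[0]))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

print(knn_predict((10.0, 90.0)))  # suspicious
```

A new observation is labelled according to the historical events it most resembles, which is the essence of predictive analytics over past log data.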
The Log Report/Presentation stage entails presenting the findings or results obtained during the modelling stage and communicating the conclusions in a clearly comprehensible report. This stage, in many cases, is the most difficult for the analyst because of the inability of tools to process large volumes of log data or generate meaningful tables or graphics. While there are many ways to present findings (e.g., tables, charts, plots, graphs), the analyst must select the most effective combination of text and visualization to communicate findings and results. The next generation of big data science or log analysis tools are incorporating text and visualization tools to effectively communicate findings and results.
In summary, this blog post presents an effective log analysis methodology. The ability of an analyst to collect, prepare, model, and present log analysis findings is critical for any intrusion investigation. Regardless of log file format, the analyst must be able to analyse the log files and draw meaningful conclusions when the evidence resides within them. In addition, if required, the analyst must be able to use the log file data to predict future behaviour or attacks through predictive analytics.