Strategy and Methodology of Big Data Testing

September 11th, 2017

As technology advances and new developments arrive on a regular basis, a large pool of data is being generated as well. This is known as big data: a collection of data sets so large that they cannot be processed using traditional methods of computing. Traditional methods work effectively on structured data stored in rows and columns, but not on data that does not follow any specific structure.

Big data can arrive in varied formats such as images or audio. Its structure and format can vary from one record to the next, and it is typically characterized by volume, velocity and variety.

Volume: Big data arrives in large amounts, generally from many different sources.
Velocity: The data is generated at high speed and has to be processed and handled quickly.
Variety: Big data can come in various formats such as audio, video, email, etc.

Big data testing
The availability of big data is driving demand for big data testing tools, techniques and frameworks. This is because more data brings a higher risk of errors, which might degrade the performance of applications and software.
When conducting big data testing, a tester’s goal differs from that of traditional testing. Big data testing aims at verifying that the data is complete, ensuring accurate data transformation, ensuring high data quality and automating regression testing.
Strategy and methodology of big data testing
Big data testing is typically related to various types of testing such as database testing, infrastructure testing, performance testing and functional testing. Therefore, it is important to have a clear test strategy that enables an easy execution of big data testing.
When executing big data testing, it is important to understand that the concept is more about testing the processing of terabytes of data, which involves the use of commodity clusters and other supporting components.
Big data testing can be typically divided into three major steps that include:

Data staging validation

Also known as the pre-Hadoop stage, big data testing begins with process validation, which helps ensure that the correct data is pushed into the Hadoop Distributed File System (HDFS). The data to be validated is taken from various sources such as RDBMS, weblogs and social media. This source data is then compared with the data pushed into Hadoop in order to verify that the two match.
Some of the common tools that can be used for this step are Talend and Datameer.
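The comparison described above can be sketched in a few lines of Python. This is only a minimal illustration, not how Talend or Datameer work internally: the function names, the sample rows and the idea of fingerprinting each record with a hash are all assumptions made for the example, and the two lists simply stand in for a source extract and its staged copy.

```python
import hashlib


def record_fingerprint(record):
    """Return a stable hash of one record, independent of field order."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate_staging(source_records, staged_records):
    """Compare the source data with the data that was loaded into staging."""
    source = {record_fingerprint(r) for r in source_records}
    staged = {record_fingerprint(r) for r in staged_records}
    return {
        "source_count": len(source_records),
        "staged_count": len(staged_records),
        "missing_in_staging": len(source - staged),
        "unexpected_in_staging": len(staged - source),
    }


# Hypothetical sample data standing in for an RDBMS extract and its HDFS copy.
source_rows = [
    {"id": 1, "user": "ann"},
    {"id": 2, "user": "bob"},
    {"id": 3, "user": "cy"},
]
staged_rows = [
    {"user": "ann", "id": 1},  # same record, different field order
    {"id": 2, "user": "bob"},
]

report = validate_staging(source_rows, staged_rows)
print(report)
```

In this sample one source record never reached staging, so the report flags a count mismatch and one missing record. On real data volumes the same idea would normally run as a distributed job rather than in memory.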

“MapReduce” validation

MapReduce is a programming model that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.
MapReduce validation is the second step of big data testing. Here, a tester checks the validity of the business logic on each individual node and then validates it again after running against multiple nodes. This helps in ensuring that:

The MapReduce process is working flawlessly.
Data aggregation or segregation rules are correctly executed on the data.
Key-value pairs are generated appropriately.
Data is validated after the MapReduce process.
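The checks listed above can be illustrated with a toy example in plain Python. It does not use Hadoop at all: the map and reduce functions, the log format and the way "nodes" are simulated by slicing a list are assumptions made purely for illustration. The point is the test idea, namely that the aggregated result should be the same whether the logic runs on one node or on several.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (page, 1) pair for every log line of the form '<user> <page>'."""
    return [(line.split()[1], 1) for line in lines]


def reduce_phase(pairs):
    """Sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_single_node(lines):
    """Run the whole job on one simulated node."""
    return reduce_phase(map_phase(lines))


def run_multi_node(lines, nodes):
    """Simulate several nodes: each one maps and pre-aggregates its own split,
    then the partial results are merged by a final reduce."""
    partials = []
    for i in range(nodes):
        split = lines[i::nodes]
        partials.extend(reduce_phase(map_phase(split)).items())
    return reduce_phase(partials)


# Hypothetical weblog lines used as test input.
log_lines = ["ann /home", "bob /cart", "ann /cart", "cy /home", "bob /home"]

pairs = map_phase(log_lines)
single = run_single_node(log_lines)
multi = run_multi_node(log_lines, nodes=3)

# Key-value pairs are well formed, and the aggregation does not depend on
# how many nodes the data was spread across.
assert all(key.startswith("/") and value == 1 for key, value in pairs)
assert single == multi
print(single)
```

A real test would submit the job to a cluster and compare its output against an independently computed expectation, but the structure of the check stays the same.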

Output Validation

On successfully executing the first two steps, the final step of the process is output validation. This stage includes generating files that are ready to be moved to an Enterprise Data Warehouse (EDW) or any other system based on the specific requirements.
Output validation phase includes the following steps:

Validating that the transformation rules are correctly applied.
Validating the data integrity as well as successful loading of data into the target system.
Ensuring that there is no data corruption by comparing the target data with the HDFS data.
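As a rough sketch of these output checks, the Python below re-applies a transformation rule to the HDFS-side records and compares the result with what landed in the target system. The rule itself (uppercasing a country code and converting cents to whole currency units), the field names and the sample records are all hypothetical; a real project would substitute its own rules and read from the actual systems.

```python
def transform(record):
    """Hypothetical transformation rule: uppercase the country code and
    convert the amount from cents to whole currency units."""
    return {
        "id": record["id"],
        "country": record["country"].upper(),
        "amount": record["amount_cents"] / 100,
    }


def validate_output(hdfs_records, target_records):
    """Compare target data with what the rules should produce from HDFS data."""
    expected = {r["id"]: transform(r) for r in hdfs_records}
    actual = {r["id"]: r for r in target_records}
    shared = expected.keys() & actual.keys()
    return {
        "missing_ids": sorted(expected.keys() - actual.keys()),
        "unexpected_ids": sorted(actual.keys() - expected.keys()),
        "mismatched_ids": sorted(k for k in shared if expected[k] != actual[k]),
    }


# Hypothetical records standing in for HDFS output and the loaded EDW rows.
hdfs_rows = [
    {"id": 1, "country": "us", "amount_cents": 1250},
    {"id": 2, "country": "de", "amount_cents": 800},
    {"id": 3, "country": "in", "amount_cents": 4000},
]
target_rows = [
    {"id": 1, "country": "US", "amount": 12.5},
    {"id": 2, "country": "de", "amount": 8.0},  # rule not applied to country
]

result = validate_output(hdfs_rows, target_rows)
print(result)
```

Here record 3 was never loaded and record 2 was loaded without the country rule applied, so the first is reported as missing and the second as mismatched.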

Architectural & Performance testing
Big data testing involves testing of a large volume of data, which also makes it highly resource intensive. Therefore, to ensure higher accuracy and success of the project, it is important to conduct architectural testing.

It is important to remember that an improperly designed system may degrade the software’s performance and prevent it from meeting specific requirements. This, in turn, creates a need for performance and failover testing.
Performance testing checks a system for aspects such as the time taken to complete a job, memory utilization and similar system metrics. The purpose of a failover test, on the other hand, is to verify that data processing continues without a flaw when data nodes fail.
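Both ideas can be sketched on a small scale in Python. The example below is an assumption-laden toy rather than a cluster test: the replica layout, the node names and the helper functions are invented for illustration, and the metrics come from the standard library modules time and tracemalloc instead of real cluster monitoring.

```python
import time
import tracemalloc


def measure_job(job, *args):
    """Run job(*args) and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        result = job(*args)
        elapsed = time.perf_counter() - started
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def process_blocks(replicas, failed_nodes=frozenset()):
    """Sum every block, reading each one from the first replica that is alive."""
    total = 0
    for block_id, copies in replicas.items():
        live = [values for node, values in copies.items() if node not in failed_nodes]
        if not live:
            raise RuntimeError(f"block {block_id} has no live replica")
        total += sum(live[0])
    return total


# Hypothetical layout: every block is stored on two simulated data nodes.
replicas = {
    "block-0": {"node-a": [1, 2, 3], "node-b": [1, 2, 3]},
    "block-1": {"node-b": [4, 5], "node-c": [4, 5]},
}

# Performance test: record the job's time and peak memory.
baseline, seconds, peak_bytes = measure_job(process_blocks, replicas)

# Failover test: the result must not change when one data node is down.
after_failure, _, _ = measure_job(process_blocks, replicas, frozenset({"node-b"}))

assert after_failure == baseline
print(f"result={baseline} time={seconds:.6f}s peak={peak_bytes} bytes")
```

The failover check passes here because each block still has one live replica after node-b is removed. Taking down both holders of a block raises an error instead, which is the kind of data loss a failover test is meant to surface.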
Conclusion
Big data testing clearly has its own set of challenges, such as the need for technical expertise to conduct automation testing and timing issues in real-time big data testing. Even so, it has numerous advantages over traditional database testing, such as the ability to check both structured and unstructured data.
However, a company should never rely on a single approach for testing its data. With the ability to conduct testing in multiple ways, it gets easier for companies to deliver results quickly.