8.A) Explain Pig architecture for scripts data flow and processing. – 10 Marks
Answer:
Here’s a breakdown of the concepts related to executing Pig scripts and the execution flow through different components:
Ways to Execute Pig Scripts:
- Grunt Shell: Pig's interactive shell, where Pig Latin statements are typed and executed one at a time.
- Script File: Pig Latin commands written in a script file and submitted to the Pig server for execution.
- Embedded Script: user-defined functions (UDFs) are created for operations not available among Pig's built-in operators. UDFs can be written in other programming languages (such as Java) and are registered and invoked from the Pig Latin script.
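As a sketch of the script-file mode, a minimal word-count style script (the file name and field names below are illustrative, not from the source) could be saved as `script.pig` and submitted with `pig script.pig`, or its statements typed one by one at the Grunt shell:

```pig
-- illustrative example; 'input.txt' and the aliases are assumed
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
STORE counts INTO 'output';
```

For the embedded/UDF case, a jar containing Java UDFs would be made available with `REGISTER myudfs.jar;` (a hypothetical jar name) and the function then called like any built-in operator.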
Execution Flow in Pig:
- Parser:
- Receives the Pig script submitted through the Grunt shell or the Pig server.
- The parser checks for type errors and syntax errors.
- It generates a Directed Acyclic Graph (DAG), which represents the logical plan of the script.
- Acyclic means there are no cycles; data flows in a single direction from one node to another.
- Nodes represent logical operators, and edges represent data flow between operators.
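For instance, in the hypothetical script below (all names are illustrative), each relational statement becomes a node in the DAG — LOAD → FILTER → GROUP → FOREACH → STORE — with edges showing the one-way data flow between operators:

```pig
-- hypothetical script; relation and field names are assumed
emps   = LOAD 'employees' AS (name:chararray, dept:chararray, salary:int);
senior = FILTER emps BY salary > 50000;     -- FILTER node, fed by LOAD
bydept = GROUP senior BY dept;              -- GROUP node, fed by FILTER
avgs   = FOREACH bydept GENERATE group, AVG(senior.salary);
STORE avgs INTO 'dept_averages';            -- sink node of the DAG
```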
- Optimizer:
- The DAG is sent to the logical optimizer, where optimization activities are performed automatically to reduce data flow and improve efficiency.
- Some optimization strategies include:
- PushUpFilter: Pushes filter conditions earlier in the execution to reduce the number of records early.
- PushDownForEachFlatten: Delays FLATTEN operations (which can multiply records, like a cross product) so that fewer records flow through the intermediate stages.
- ColumnPruner: Removes unused or unnecessary columns from records, optimizing data size.
- MapKeyPruner: Omits unused map keys to reduce record size.
- Limit Optimizer: If a LIMIT operation appears immediately after a load or sort, the limit is pushed into that operator so the number of records is reduced early, avoiding unnecessary processing.
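To illustrate PushUpFilter with a hypothetical example: when a filter on one join input is written after the join, the optimizer can apply that filter before the join, so far fewer records participate in the expensive join step:

```pig
-- as written in the script (names are illustrative):
joined = JOIN orders BY cust_id, customers BY id;
recent = FILTER joined BY orders::year == 2023;

-- after PushUpFilter, the plan behaves as if it were written:
recent_orders = FILTER orders BY year == 2023;
joined        = JOIN recent_orders BY cust_id, customers BY id;
```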
- Compiler:
- After optimization, the compiler generates a series of MapReduce jobs that correspond to the logical plan.
- It prepares the jobs for execution.
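One way to inspect what the compiler produces is Pig's standard EXPLAIN operator, which prints the logical, physical, and MapReduce plans for a relation (the script below is an illustrative sketch):

```pig
-- illustrative relation; EXPLAIN prints the plans instead of running the job
emps = LOAD 'employees' AS (name:chararray, salary:int);
high = FILTER emps BY salary > 50000;
EXPLAIN high;
```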
- Execution Engine:
- The MapReduce jobs are submitted for execution by the execution engine.
- The jobs run on the Hadoop cluster, perform the computations, and produce the final output.
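Note that execution is lazy: the execution engine submits jobs only when an output statement such as DUMP or STORE is reached, as in this small sketch (names are illustrative):

```pig
emps = LOAD 'employees' AS (name:chararray, salary:int);
high = FILTER emps BY salary > 50000;   -- no job runs yet; only the plan is built
DUMP high;                              -- triggers compilation and job submission
```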
Flow Summary:
- The script is parsed to check for errors.
- The DAG is generated and optimized by applying various techniques to reduce the data flow.
- The compiler generates MapReduce jobs for the optimized plan.
- The execution engine runs the jobs, and the result is produced.