Suppose an instruction takes four cycles to execute in a nonpipelined CPU: one cycle to fetch the instruction, one cycle to decode the instruction, one cycle to perform the ALU operation, and one cycle to store the result. In a CPU with a four-stage pipeline, that instruction still takes four cycles to execute, so how can we say the pipeline speeds up the execution of the program?