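The PCF function I posted is the same as yours, except that it uses tex2Dlod instead of tex2D. The difference does not matter for depth map sampling, because a depth map has no mip levels you would want to pick between, but tex2Dlod takes an explicit mip level instead of relying on screen-space derivatives, and that allows the graphics card to use dynamic branching. In this case that means a huge speed benefit: without it, the PCF sampling is done on every depth map for every pixel, instead of only on the one depth map that pixel actually needs.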
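To make that concrete, here is a rough sketch of what I mean. It is not your exact shader: the names (ShadowMap0 to ShadowMap2, ShadowTexelSize, SplitDepths, ShadowTerm), the three cascades, and the 2x2 kernel are all assumptions for illustration, and depth bias is left out. It assumes a ps_3_0 target, since tex2Dlod and real dynamic branching need Shader Model 3.

```hlsl
// Hypothetical resources; substitute the names from your own shader.
sampler2D ShadowMap0;
sampler2D ShadowMap1;
sampler2D ShadowMap2;

float2 ShadowTexelSize; // assumed to be 1.0 / shadow map resolution
float3 SplitDepths;     // assumed view-space far distance of each split

// 2x2 PCF. tex2Dlod reads the mip level from the .w component of the
// coordinate, so no derivatives are needed and the call is legal
// inside a dynamic branch.
float PCF(sampler2D shadowMap, float2 uv, float receiverDepth)
{
    float lit = 0.0f;

    [unroll]
    for (int y = 0; y < 2; y++)
    {
        [unroll]
        for (int x = 0; x < 2; x++)
        {
            float2 offset = (float2(x, y) - 0.5f) * ShadowTexelSize;
            float occluderDepth =
                tex2Dlod(shadowMap, float4(uv + offset, 0.0f, 0.0f)).r;
            lit += (receiverDepth <= occluderDepth) ? 1.0f : 0.0f;
        }
    }

    return lit * 0.25f; // average of the four comparisons
}

// Picks the one depth map this pixel falls into and filters only that one.
// posSplitN is the pixel position projected into split N's shadow map
// space (xy = texture coordinate, z = depth).
float ShadowTerm(float viewDepth,
                 float3 posSplit0, float3 posSplit1, float3 posSplit2)
{
    float shadow;

    [branch]
    if (viewDepth < SplitDepths.x)
    {
        shadow = PCF(ShadowMap0, posSplit0.xy, posSplit0.z);
    }
    else
    {
        [branch]
        if (viewDepth < SplitDepths.y)
        {
            shadow = PCF(ShadowMap1, posSplit1.xy, posSplit1.z);
        }
        else
        {
            shadow = PCF(ShadowMap2, posSplit2.xy, posSplit2.z);
        }
    }

    return shadow;
}
```

If you swap tex2Dlod back to tex2D in PCF, the compiler can no longer keep the samples inside the branches, because gradient-based sampling is not allowed in dynamic flow control. It will either flatten the branches, so all three maps get filtered for every pixel, or refuse to compile with the [branch] attribute in place.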