Energy-efficient computation is critical for increasing performance in power-limited systems. Floating-point performance is of particular interest because of its importance in scientific computing, graphics, and multimedia processing. For floating-point applications that have large amounts of data parallelism, one should optimize the throughput/mm² given a power density constraint. We present a method for creating a trade-off curve that can be used to estimate the maximum floating-point performance given a set of area and power constraints. These throughput-optimized designs turn out to be different from latency-optimized ones and more energy efficient. Looking at floating-point multiply-add units and ignoring register and memory overheads, we find that in a 90nm CMOS technology at 1W/mm², one can achieve a performance of 27GFlops/mm² single-precision and 7.5GFlops/mm² double-precision. Adding register file overheads reduces the throughput by less than 50% if the compute intensity is high. Since the energy of the basic gates is no longer scaling rapidly, maintaining constant power density with scaling requires moving the overall FP architecture to a lower energy/performance point using lower supply voltage, shallower pipelines, and more relaxed gate sizing. A 1W/mm² design at 90nm is a "high-energy" design, so scaling it to a lower-energy design in 45nm still yields a 7× performance gain, while a more balanced 0.1W/mm² design only speeds up by 3.5× when scaled to 45nm. Performance scaling below 45nm rapidly decreases, with a projected improvement of only 2-3× for both power densities when scaling to a 22nm technology.
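As a rough illustration of what the 90nm figures imply, the sketch below divides the quoted power density by the quoted throughput density to get an energy per floating-point operation, and applies the "less than 50%" register file penalty as a lower bound on throughput. The function and variable names are hypothetical and introduced only for this example. The calculation uses nothing beyond the numbers stated above and is not a reconstruction of the trade-off-curve method itself.

```python
# Figures quoted in the abstract (90nm CMOS, multiply-add units only,
# register and memory overheads ignored).
POWER_DENSITY_W_PER_MM2 = 1.0
THROUGHPUT_GFLOPS_PER_MM2 = {"single": 27.0, "double": 7.5}

# The abstract states the register file costs "less than 50%" of throughput
# at high compute intensity; 0.5 is therefore a worst-case bound, not a
# measured value.
MAX_REGISTER_FILE_LOSS = 0.5


def energy_per_flop_pj(power_density_w_per_mm2, throughput_gflops_per_mm2):
    """Energy per FLOP in picojoules: (W/mm^2) / (FLOP/s/mm^2) = J/FLOP."""
    joules = power_density_w_per_mm2 / (throughput_gflops_per_mm2 * 1e9)
    return joules * 1e12


def throughput_floor_with_register_file(throughput_gflops_per_mm2):
    """Lower bound on throughput density once register file overhead is added."""
    return throughput_gflops_per_mm2 * (1.0 - MAX_REGISTER_FILE_LOSS)


if __name__ == "__main__":
    for precision, gflops in THROUGHPUT_GFLOPS_PER_MM2.items():
        pj = energy_per_flop_pj(POWER_DENSITY_W_PER_MM2, gflops)
        floor = throughput_floor_with_register_file(gflops)
        print(f"{precision}: {pj:.1f} pJ/FLOP, "
              f"more than {floor:.2f} GFlops/mm2 with register file")
```

Under these assumptions the single-precision point works out to roughly 37 pJ per operation and the double-precision point to roughly 133 pJ, which gives a sense of why a 1W/mm² design is described as "high-energy".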