The discrete Fourier transform or DFT transforms a block of complex samples representing signal intensity from time 0 to time N to a block of samples representing frequency intensity and phase from frequency 0 to frequency N/2. The discrete Fouier transform is for sampled data (i.e. digital audio). N is the window or block size. If you choose N large (e.g. as long as a whole track) you'll get detailed frequency information for the whole track but you won't have a clue where in the track those components exist. Using numerous shorter blocks, you can pinpoint where in time the various frequencies are occurring (e.g. when which notes are being played) but in using the shorter block you only get a coarse idea about what frequencies are present (e.g. can't distinguish C from C#).
To overcome this, frequency analysis applications will often use the longer blocks and instead of placing them one after another, the blocks will be overlapped. This sort of gives you the best of both worlds. It does require more processing to do the transform - if your overlap is 50%, you are computing FFTs for twice as many samples.
When you edit audio on a workstation you create clicks or other artifacts at the edit points. We have this same problem when we edit audio to do FFTs. In editing, we address this by cross fading at the edit points. And that's exactly what we do with FFTs. We apply an envelope to the audio data in each block (fade it up at the beginning, fade it back down at the end) before performing the FFT. Mathemiticians call these envelopes "windows". There are many shapes of windows because there are many compromises to be made when you're slicing and dicing like this.