Post details: Linux performance bug - zero links fsync()

02/24/06

Permalink 05:24:38 pm, Categories: Linux kernel, 862 words   English (US)

Linux performance bug - zero links fsync()

Linux (as of kernel version 2.6.15.4) has the following filesystem performance bug:
fsync() and fdatasync() write to the HDD (loading the hardware uselessly) even when the file has zero links. See below for a test case and for my idea of how this could be used to increase system performance if the bug did not exist.


The purpose of the fsync() and fdatasync() syscalls is to reliably save file data on disk for later use. A file with zero links (that is, a deleted file) continues to exist for as long as some program holds it open. In that case it makes no sense to save its data to disk, because the file cannot be opened again later and its saved data cannot be read back (except by undeletion tools).
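The zero-link state described above can be observed directly from user space. The following is only a small sketch: the helper name `links_after_unlink` and the file name passed to it are made up for illustration. It creates a file, unlinks it while keeping the descriptor open, writes to it anyway, and reports the link count that fstat() returns.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create `path`, unlink it while keeping it open,
 * and return the link count reported by fstat() (-1 on any error). */
long links_after_unlink(const char *path)
{
    struct stat st;
    long links = -1;
    int fd = open(path, O_CREAT | O_EXCL | O_RDWR, S_IRUSR | S_IWUSR);

    if (fd == -1)
        return -1;
    if (unlink(path) == 0 &&        /* the name is gone ...               */
        write(fd, "x", 1) == 1 &&   /* ... but the file is still writable */
        fstat(fd, &st) == 0)
        links = (long)st.st_nlink;
    close(fd);                      /* only now is the file really freed  */
    return links;
}
```

On a local Linux filesystem a call such as `links_after_unlink("zero_links_demo.tmp")` should return 0. Some network filesystems keep a hidden name around and may report a different count.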

One more case (besides undeletion) has come to my mind in which the data of a file with zero links may need to be synced: software suspend (swsusp). But in that case all data in memory is effectively synced to disk anyway, so this does not create a need for fsync() to be anything other than a no-op for files with zero links.

I wrote a test case for this in the hope that the latest version of Linux would be clever enough not to write files with zero links to disk. (My hope was in vain.) If it were, I would do the following to improve the performance of my computer:

I would create a big sparse file:

$ dd if=/dev/zero of=TMP bs=1024 count=1 seek=10000000

Then I would create a filesystem inside this file:

$ mke2fs TMP

Then I would mount this filesystem on /tmp:

$ mount TMP /tmp -o loop,async,noatime

And lastly I would remove this file (make it zero-link):

$ rm -f TMP


I would put all this in a script executed at boot time:

dd if=/dev/zero of=TMP bs=1024 count=1 seek=10000000
mke2fs TMP
mount TMP /tmp -o loop,async,noatime
rm -f TMP

If Linux were clever enough not to sync unlinked files, then after this it would never write the data in the file TMP (that is, my /tmp filesystem) back to the disk, except when memory ran short for buffering.

That is, this file (and the /tmp filesystem inside it) would effectively occupy no space on the disk and would live in I/O buffer memory. It would be written to the disk automatically only when memory ran short, just like swap.

So I would have a much more efficient /tmp than my current subdirectory of the root filesystem, because the system would stop bothering the disk to save temporary files, which by definition do not need to be saved.
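The scheme starts from a sparse file: the dd command seeks past 10,000,000 blocks of 1024 bytes and writes a single block, so the file is about 10 GB long while almost none of it is allocated on disk. The C sketch below shows the same effect on a smaller scale. The helper name `sparse_file_stats` is hypothetical, and the conversion of st_blocks to bytes assumes the 512-byte units used on Linux.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a sparse file of `size` bytes at `path`,
 * store its logical size and its allocated size (both in bytes), and
 * remove the file again. Returns 0 on success, -1 on error. */
int sparse_file_stats(const char *path, off_t size,
                      off_t *logical, off_t *allocated)
{
    struct stat st;
    int ok = -1;
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, S_IRUSR | S_IWUSR);

    if (fd == -1)
        return -1;
    /* Seek to the last byte and write it, much as dd does with seek=. */
    if (size > 0 &&
        lseek(fd, size - 1, SEEK_SET) != (off_t)-1 &&
        write(fd, "", 1) == 1 &&
        fstat(fd, &st) == 0) {
        *logical = st.st_size;
        /* Assumption: st_blocks counts 512-byte units, as on Linux. */
        *allocated = (off_t)st.st_blocks * 512;
        ok = 0;
    }
    close(fd);
    unlink(path);
    return ok;
}
```

On a filesystem that supports holes, the allocated size reported for a 10 MiB file created this way is typically a few kilobytes at most.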

Well, why do I not simply use tmpfs instead? In current Linux tmpfs has a serious bug: it cannot be larger than half of the physical memory (without swap) without the risk of hanging the system.

Finally, below is a C test program which shows that fsync() and fdatasync() in the current version of Linux are not no-ops for files with zero links:

/*
 * This program demonstrates a performance bug in Linux (up to 2.6.15.4).
 * Running this program (in a writable directory) makes the HDD work,
 * as shown by the red activity light and the characteristic sound of the HDD.
 * (I do not recommend running it for more than a few seconds, as I suspect
 * that running it for a long time may damage your HDD.)
 *
 * So Linux is not optimized for the case of syncing a file which has zero
 * links. Such syncing DOES NOT MAKE SENSE, because syncing is intended for
 * later reading from the HDD, which cannot happen in this case (except for
 * software suspend, but in that case syncing effectively happens anyway).
 *
 * Would somebody please write a patch that makes fsync() and fdatasync()
 * no-ops for files with zero links?
 *
 * SYNOPSIS:
 *   zerosync
 *     use fsync()
 *   zerosync d
 *     use fdatasync()
 */

/* Ask for the POSIX declarations, including fdatasync(). */
#define _POSIX_C_SOURCE 200809L

#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

#define FILENAME "TEST.tmp"

int main(int argc, char *argv[])
{
    int fd;
    int use_datasync;
    const unsigned char arr[2] = {0x00, 0xff};
    unsigned ind;

    use_datasync = argc > 1 && argv[1][0] == 'd';
    fputs(use_datasync ? "using fdatasync()\n" : "using fsync()\n", stderr);
    fd = open(FILENAME, O_CREAT|O_EXCL|O_WRONLY|O_SYNC, S_IRUSR|S_IWUSR);
    if (fd == -1) {
        perror("open " FILENAME);
        return 1;
    }
    /* system("chattr +S " FILENAME); */
    unlink(FILENAME);   /* the file now has zero links but stays open */

    /* Loop until interrupted, alternating the byte so the data changes. */
    for (ind = 0; ; ind = 1 - ind) {
        lseek(fd, 0, SEEK_SET);
        if (write(fd, arr + ind, 1) != 1) {
            perror("write");
            break;
        }
        if (use_datasync)
            fdatasync(fd);
        else
            fsync(fd);
    }
    close(fd);          /* reached only if write() fails */
    return 1;
}
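
The test program runs until interrupted and relies on watching the disk activity light. A bounded variant that measures elapsed time instead might look like the sketch below. The helper name `time_fsync_loop`, the use of clock_gettime(), and the omission of O_SYNC (so that only the cost of fsync() itself is measured) are all assumptions of this sketch, not part of the original test.

```c
#define _POSIX_C_SOURCE 200809L

#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: rewrite one byte and fsync() it `iterations` times.
 * When `unlinked` is nonzero the file is unlinked before the loop starts.
 * Returns the elapsed time in seconds, or -1.0 on error. */
double time_fsync_loop(const char *path, int iterations, int unlinked)
{
    const unsigned char arr[2] = {0x00, 0xff};
    struct timespec t0, t1;
    int i, failed = 0;
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, S_IRUSR | S_IWUSR);

    if (fd == -1)
        return -1.0;
    if (unlinked)
        unlink(path);               /* zero links, descriptor still open */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iterations && !failed; i++) {
        if (lseek(fd, 0, SEEK_SET) == (off_t)-1 ||
            write(fd, arr + (i & 1), 1) != 1 ||
            fsync(fd) == -1)
            failed = 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    if (!unlinked)
        unlink(path);               /* clean up the linked variant */
    if (failed)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling it twice with the same iteration count, once with `unlinked` set to 0 and once set to 1, gives two timings to compare. If the kernel skipped the sync for zero-link files, the second figure would be expected to be much smaller.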

Would somebody please write a patch that makes Linux not sync files with zero links? It would be a great performance enhancement, especially useful for a /tmp filesystem.

Comments, Trackbacks, Pingbacks:

Comment from: Victor [Member] · http://portonvictor.org
I have posted this idea to the mailing lists. Stephen C. Tweedie from Red Hat has pointed out a mistake (an unwanted change of semantics) in it:
Correct. And the semantics *will* change with this patch, but in a subtle way. Ext3 happens to guarantee that after fsync(), *all* metadata for a file --- including directory metadata --- are synchronized to disk. So if you unlink an open file and then fsync() it, you are guaranteed that the unlink has been committed to disk. This is not, strictly speaking, a behavior required by POSIX; but it's still useful, and would be broken if we disabled fsync() for files with i_nlink==0.
Below is my response to his note, in which I correct the idea to eliminate this mistake:
OK, Stephen, you have pointed out where following my idea would really
significantly change the semantics, which it should not do.

So fsync() (but not fdatasync()) should indeed have an effect on an inode with
zero links, but _only the first time_. Precisely:

1. Every fd should have an associated boolean flag "no_links_committed"
(to save a bit of memory it could instead be implemented e.g. as storing -1
(minus one) rather than 0 as the count of links in the fd data structure).

2. When a file is unlinked and the number of links becomes zero,
no_links_committed should be in the reset state (or zero should be written as
the count of links in the fd data structure).

3. When fsync() (but not fdatasync(), which is simpler) is called on a file:
- If the number of links is above 0, proceed as usual.
- If the number of links is zero:
  * If no_links_committed is false, do the directory synchronization
    (as mentioned by Stephen) but no other synchronization, and
    then set no_links_committed to true (or the number of links to -1 for
    a slightly more efficient implementation).
  * If no_links_committed is true, do nothing.
So my corrected suggestion for a change to the Linux kernel is:
  • fdatasync() should have no effect on files with zero links (nor on files with -1 links, see below).
  • The fd (file descriptor) or inode data structure that holds the number of links of a file should allow the additional value -1 (minus one) as the number of links.
  • When a file with one link is unlinked, proceed as usual (decrease the number of links to zero).
  • When fsync() is called on a file whose count of links is zero:
    1. synchronize to disk the directories which contained this file (from which it was removed), but not the file itself;
    2. decrement the count of links of this fd, making it equal to -1.
  • When fsync() is called on an fd with -1 links, do nothing. (There is no need to synchronize this inode to disk, as it is not a real user-visible file, and the directories from which it was removed are already synchronized.)
So would somebody please make a patch for the Linux kernel? It would be a significant speedup.
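The rules proposed in this comment amount to a small state machine over the link count. The following user-space C sketch only models the decisions. It is not kernel code, and every name in it (`toy_file`, `toy_fsync`, and so on) is invented for illustration.

```c
/* What a sync call would do under the proposal (illustrative names). */
enum sync_action { SYNC_AS_USUAL, SYNC_DIRS_ONLY, SYNC_NOTHING };

/* Toy stand-in for the per-file link count. The extra value -1 means
 * "zero links, and the unlink itself is already committed to disk". */
struct toy_file {
    int nlink;
};

void toy_unlink(struct toy_file *f)
{
    if (f->nlink > 0)
        f->nlink--;                 /* proceed as usual */
}

enum sync_action toy_fsync(struct toy_file *f)
{
    if (f->nlink > 0)
        return SYNC_AS_USUAL;       /* file is still reachable by name */
    if (f->nlink == 0) {
        f->nlink = -1;              /* remember the unlink is committed */
        return SYNC_DIRS_ONLY;      /* directories only, not file data  */
    }
    return SYNC_NOTHING;            /* nlink == -1: nothing left to do  */
}

enum sync_action toy_fdatasync(const struct toy_file *f)
{
    return f->nlink > 0 ? SYNC_AS_USUAL : SYNC_NOTHING;
}
```

Starting from a file with one link, the sequence toy_unlink(), toy_fsync(), toy_fsync() yields SYNC_DIRS_ONLY and then SYNC_NOTHING, which is the "only the first time" behaviour that preserves the ext3 guarantee Stephen described.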
Permalink 02/28/06 @ 14:16
Comment from: Victor [Member] · http://portonvictor.org
Kernel hackers who implement my idea should also remember that it needs to be implemented not only for the external (user-visible) syscalls fsync() and fdatasync() but also for all similar kernel internals which perform data synchronization:
  • implementation of sync();
  • implementation of loop devices attached to disk files;
  • etc.
When you implement it in the Linux kernel, please leave a note here telling the kernel version in which it was implemented.
Permalink 02/28/06 @ 14:42
Comment from: Victor [Member] · http://portonvictor.org
Well, actually in any reasonable OS (I am not sure about the real implementation in Linux), fsync() would be implemented like this:
int fsync(int fd)
{
    fdatasync(fd);
    fsyncmetadata(fd);
    return ...;
}
or:
int fsync(int fd)
{
    fsyncmetadata(fd);
    fdatasync(fd);
    return ...;
}
(return values are omitted for simplicity), where fsyncmetadata() is a hypothetical function which would do the rest of the job of fsync(), apart from what fdatasync() does. With this in mind, the complex scheme above can be simplified:
  • fdatasync() should proceed as usual for files with more than zero links.
  • fdatasync() should do nothing for files with zero links.
  • fsync() should be implemented as usual, by calling fdatasync().
This will sync file metadata when needed and never sync file data (content) for files with zero links. This is exactly what we need.
Permalink 02/28/06 @ 14:57
Comment from: Victor [Member] · http://portonvictor.org
BTW, this applies to any POSIX-compliant OS, not just to Linux.

Other OSes (FreeBSD, Solaris, and all the other Unixes) should be checked to see whether they pass this test without unnecessary disk load.
Permalink 02/28/06 @ 15:25
Comment from: Victor [Member] · http://portonvictor.org
Some keywords for this topic: disk performance test, disk I/O performance test, disk input/output performance test, HDD performance test, HDD I/O performance test, HDD input/output performance test, disk speed test, disk I/O speed test, disk input/output speed test, OS performance test, operating system performance test.
Permalink 02/28/06 @ 15:30
Comment from: Victor [Member] · http://portonvictor.org
I could add a summary of the thoughts on this page as an official comment on the POSIX standard!

Otherwise it is too easy for an OS developer to miss (forget, or not think of) this small but important performance optimization.

Even Linux, which the whole world cares about, did not have it as of 2006. It is a simple idea, but hard to think of and easy to miss.
Permalink 03/01/06 @ 00:23
Comment from: Theodore Tso [Visitor] · http://thunk.org/tytso
Can you name real-life examples of programs that stupidly call fsync() on an unlinked file descriptor? Even in the case of a library which doesn't know better, I still can't think of a real-life library which would do so. Most of the time real-life code calls fsync() because they have a very specific, real-world reason why they feel they need to do so.

In fact, I just did a "nm -Dou /usr/lib/*.so | grep fsync", and (a) there aren't that many, and (b) I could pretty much identify why all of them would be calling fsync(), and none of them do it randomly or frivolously on a file descriptor just for yuks, or would be likely in a real world to be calling fsync() on a file descriptor for a tmp file that has been unlinked.

In general, calling fsync() for no reason is stupid, since it trashes your performance, linked or unlinked file. So usually it is done only in special circumstances or in libraries, by direct request by the calling application when it has a very good reason to do so.

That's not to say that your recommendation to unlink tmp files isn't a good idea. It is a good idea, and it's also not old news. Most applications that need temp files will indeed unlink and keep a file descriptor to it, but they do so just so they don't have to worry about cleaning up the file and deleting it in case the program dies or is killed in the middle of its operation.

But calling this a major performance bug, when in fact it only shows up in contrived test programs, and not in real life, seems to be stretching things immensely.
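The unlink-and-keep-the-descriptor idiom mentioned in this comment can be sketched in C as follows. The helper name `open_anonymous_temp` and the "/tmp/scratch-XXXXXX" template are assumptions made for illustration; mkstemp() and unlink() are the standard POSIX calls.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: create a temporary file, unlink it immediately,
 * and return the still-open read/write descriptor (-1 on error). The
 * file disappears by itself when the descriptor is closed or the
 * process dies, so no cleanup code is needed. */
int open_anonymous_temp(void)
{
    char name[] = "/tmp/scratch-XXXXXX";   /* assumed template path */
    int fd = mkstemp(name);

    if (fd == -1)
        return -1;
    if (unlink(name) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Data written through the returned descriptor can be read back after an lseek() to offset 0, exactly as with a named file.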
Permalink 03/04/06 @ 06:46
