Linux (as of kernel version 2.6.15.4) has the following filesystem performance bug:
fsync() and fdatasync() write to the HDD (loading the hardware for nothing) even when the file has zero links. See below for a test case and for my idea of how we could increase system performance if this bug did not exist.
The purpose of the fsync() and fdatasync() syscalls is to reliably save file data on disk for later use. A file with zero links (that is, a deleted file) does continue to exist for as long as some program holds it open. In that case it makes no sense to save its data to disk, because the file cannot be opened again later and its saved data cannot be read (except by undeletion tools).
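To make the "zero links but still open" state concrete, here is a minimal sketch in C. The helper name links_after_unlink() is hypothetical, and mkstemp() is used only to get a unique scratch file. It unlinks a freshly created file and then asks fstat() what link count the still-open descriptor reports.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a unique file from `tmpl` (a mkstemp() template, modified in
 * place), unlink it, write one byte through the still-open descriptor,
 * and return the link count reported by fstat(), or -1 on error.
 * The function name is made up for this illustration.
 */
long links_after_unlink(char *tmpl)
{
    struct stat st;
    long nlink = -1;
    int fd;

    assert(tmpl != NULL);
    fd = mkstemp(tmpl);               /* creates and opens the file */
    if (fd < 0)
        return -1;
    if (unlink(tmpl) == 0 &&          /* remove the only directory entry */
        write(fd, "x", 1) == 1 &&     /* the descriptor is still usable */
        fstat(fd, &st) == 0)
        nlink = (long)st.st_nlink;
    close(fd);                        /* only now is the file really gone */
    return nlink;
}
```

On an ordinary local Linux filesystem I would expect this to return 0: the file has no name any more, yet it can still be written to until the descriptor is closed.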
Well, one more case (besides undeletion) has come to my mind in which the data of a file with zero links may need to be synced: software suspend (swsusp). But in that case all data in memory is effectively synced to disk anyway. So this case, too, gives no reason for fsync() not to be a no-op for files with zero links.
I wrote a test case for this in the hope that the latest version of Linux is clever enough not to write files with zero links to disk. (My hope proved false.) If it were, I would do the following to improve the performance of my computer:
I would create a big sparse file:
$ dd if=/dev/zero of=TMP bs=1024 count=1 seek=10000000
Then I would create a filesystem inside this file:
$ mke2fs TMP
Then I would mount this filesystem on /tmp:
$ mount TMP /tmp -o loop,async,noatime
And lastly I would remove this file (make it zero-link):
$ rm -f TMP
I would put all this in a script to be executed at boot time:
dd if=/dev/zero of=TMP bs=1024 count=1 seek=10000000
mke2fs TMP
mount TMP /tmp -o loop,async,noatime
rm -f TMP
If Linux were clever enough not to sync unlinked files, then after this it would never write the data in the file TMP (that is, my /tmp filesystem) back to the disk, except when memory runs short for buffering.
That is, this file (and the /tmp filesystem) would effectively live not on the disk but in the I/O buffer memory, being written to the disk automatically only when memory runs short, just like swap.
So I would have a much more efficient /tmp than my current subdirectory of the root filesystem, because the system would stop bothering the disk to save temporary files, which by definition do not need to be saved.
Well, why do I not simply use tmpfs instead? In current Linux, tmpfs has a serious bug: it cannot be larger than half of the physical memory (not counting swap) without the risk of hanging the system.
Finally, below is a C test program which demonstrates that fsync() and fdatasync() on the current version of Linux are not no-ops for files with zero links:
/*
 * This program demonstrates a performance bug in Linux (up to 2.6.15.4).
 * Running this program (in a writable directory) causes the HDD to work,
 * as shown by the shining red light and the specific sound the HDD makes.
 * (I do not recommend running it for more than several seconds, as I
 * suspect that running it for a long time may damage your HDD.)
 *
 * So Linux is not optimized for the case of syncing a file which has zero
 * links, syncing which DOES NOT MAKE SENSE, because syncing is intended for
 * later reading from the HDD, which cannot happen in this case (except for
 * software suspend, but in that case syncing effectively happens anyway).
 *
 * Somebody please write a patch which makes fsync() and fdatasync() no-ops
 * for files with zero links.
 *
 * SYNOPSIS:
 *   zerosync
 *       use fsync()
 *   zerosync d
 *       use fdatasync()
 */
#define _POSIX_C_SOURCE 200112L  /* request the fdatasync() declaration */
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

#define FILENAME "TEST.tmp"

int main(int argc, char *argv[])
{
    int fd;
    int use_datasync;
    const unsigned char arr[2] = {0x00, 0xff};
    unsigned char ind;

    use_datasync = argc > 1 && argv[1][0] == 'd';
    fputs(use_datasync ? "using fdatasync()\n" : "using fsync()\n", stderr);

    fd = open(FILENAME, O_CREAT|O_EXCL|O_WRONLY|O_SYNC, S_IRUSR|S_IWUSR);
    if (fd < 0) {
        perror("open " FILENAME);
        return 1;
    }
    /* system("chattr +S " FILENAME); */
    unlink(FILENAME);  /* the file now has zero links but stays open */

    /* Loop until interrupted: flip the first byte, then sync. */
    for (ind = 0; ; ind = 1 - ind) {
        lseek(fd, 0, SEEK_SET);
        if (write(fd, arr + ind, 1) != 1) {
            perror("write");
            break;
        }
        if (use_datasync)
            fdatasync(fd);
        else
            fsync(fd);
    }
    close(fd);
    return 1;  /* reached only if write() failed */
}
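To put a number on the cost instead of judging by the disk light, a bounded variant of the same loop can be timed. The following is only a sketch under my own assumptions: the helper name time_zero_link_syncs() is hypothetical, it uses mkstemp() instead of the fixed TEST.tmp name, and it does not open the file with O_SYNC, so the time measured comes from the explicit sync calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/*
 * Time `rounds` cycles of "rewrite the first byte, then sync" on a file
 * that has already been unlinked. `tmpl` is a mkstemp() template and is
 * modified in place. Returns elapsed seconds, or -1.0 on any error.
 * The function name is made up for this illustration.
 */
double time_zero_link_syncs(char *tmpl, int rounds, int use_datasync)
{
    const unsigned char bytes[2] = {0x00, 0xff};
    struct timespec t0, t1;
    double elapsed = -1.0;
    int fd, i;

    assert(tmpl != NULL && rounds >= 0);
    fd = mkstemp(tmpl);
    if (fd < 0)
        return -1.0;
    if (unlink(tmpl) != 0 || clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        goto out;
    for (i = 0; i < rounds; i++) {
        if (pwrite(fd, &bytes[i & 1], 1, 0) != 1)
            goto out;
        if ((use_datasync ? fdatasync(fd) : fsync(fd)) != 0)
            goto out;
    }
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        goto out;
    elapsed = (double)(t1.tv_sec - t0.tv_sec)
            + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
out:
    close(fd);
    return elapsed;
}
```

Calling it with, say, 100 rounds on a disk-backed directory and again on a memory-backed one should show how much of the time goes into hardware writes. The actual figures depend on the filesystem and the drive.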
Somebody please write the patch which would make Linux not sync files with zero links. It would be a great performance enhancement, especially useful for the /tmp filesystem.
Correct. And the semantics *will* change with this patch, but in a subtle way. Ext3 happens to guarantee that after fsync(), *all* metadata for a file --- including directory metadata --- are synchronized to disk. So if you unlink an open file and then fsync() it, you are guaranteed that the unlink has been committed to disk. This is not, strictly speaking, a behavior required by POSIX; but it's still useful, and would be broken if we disabled fsync() for files with i_nlink==0.
Below is my response to his note, in which I correct the above to eliminate this mistake:
OK, Stephen, you have pointed out where following my idea would really significantly change the semantics, which it should not do. So fsync() (but not fdatasync()) should indeed have an effect on an inode with zero links, but _only the first time_. Precisely:
1. With every fd should be associated a boolean flag "no_links_committed" (to save a bit of memory it could instead be implemented e.g. by storing -1 (minus one) as the count of links in the fd data structure instead of 0).
2. When a file is unlinked and the number of links becomes zero, no_links_committed should be in the reset state (or zero should be written as the count of links in the fd data structure).
3. When fsync() (but not fdatasync(), which is simpler) is called on a file:
   - If the number of links is above 0, proceed as usual.
   - If the number of links is zero:
     * If no_links_committed is false, do the directory synchronization (as mentioned by Stephen) but no other synchronization, and then set no_links_committed to true (or the number of links to -1 for a slightly more efficient implementation).
     * If no_links_committed is true, do nothing.
So my corrected suggestion for a change to the Linux kernel is:
- fdatasync() should have no effect on files with zero links (nor on files with -1 links, see below).
- If fsync() is called on a file whose count of links equals 0, then synchronize the directories (as described above), do no other synchronization, and set the count of links to -1.
- If fsync() is called on a fd with -1 links, do nothing. (There is no need to synchronize this inode to disk, as it is not a real user-visible file, and the directories from which it was removed are already synchronized.)
This should apply not only to fsync() and fdatasync() but also to all similar kernel internals which accomplish data synchronization:
- sync();
Suppose fsync() were implemented like this:
int fsync(int fd)
{
    fdatasync(fd);
    fsyncmetadata();
    return ...;
}
or:
int fsync(int fd)
{
    fsyncmetadata();
    fdatasync(fd);
    return ...;
}
(For simplicity, return values are omitted.)
Here fsyncmetadata() is a hypothetical function which would do the rest of the job of fsync(), apart from what fdatasync() does.
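For comparison, the first-time-only rule from my response above can be modelled in a few lines of user-space C. This is only a sketch of the decision logic, not kernel code: the type and function names (file_model, model_fsync() and so on) are hypothetical, and the link count doubles as the "no_links_committed" flag by using -1, as suggested above.

```c
#include <assert.h>
#include <stddef.h>

/* What a sync call would have to do in this model. */
enum sync_action {
    SYNC_FULL,         /* proceed as usual */
    SYNC_UNLINK_ONLY,  /* commit the directory change, nothing else */
    SYNC_NOTHING       /* no-op */
};

/* nlink > 0: linked; 0: unlinked, not yet committed; -1: unlinked, committed */
struct file_model {
    int nlink;
};

/* Dropping the last link leaves the count at 0, i.e. "not yet committed". */
void model_unlink(struct file_model *f)
{
    assert(f != NULL);
    if (f->nlink > 0)
        f->nlink--;
}

/* fsync(): full sync while linked, one directory commit after the last
 * unlink, and nothing from then on. */
enum sync_action model_fsync(struct file_model *f)
{
    assert(f != NULL);
    if (f->nlink > 0)
        return SYNC_FULL;
    if (f->nlink == 0) {
        f->nlink = -1;  /* remember that the unlink is now committed */
        return SYNC_UNLINK_ONLY;
    }
    return SYNC_NOTHING;
}

/* fdatasync(): never has an effect on a file without links. */
enum sync_action model_fdatasync(const struct file_model *f)
{
    assert(f != NULL);
    return f->nlink > 0 ? SYNC_FULL : SYNC_NOTHING;
}
```

Starting from a file with one link, model_unlink() followed by two model_fsync() calls yields SYNC_UNLINK_ONLY and then SYNC_NOTHING, which is exactly the "only the first time" behaviour.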
With this note, the complex scheme above can be simplified:
- fdatasync() should proceed as usual for files with more than zero links.
- fdatasync() should do nothing for files with zero links.
- fsync() should be implemented as usual through calling fdatasync()
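Until the kernel behaves this way, a program can approximate the fdatasync() part of the simplified scheme by itself. The sketch below assumes that the link count reported by fstat() is a good enough test. The wrapper name fdatasync_if_linked() is hypothetical. Note that it only avoids the explicit sync request and does not stop the kernel from writing the data back on its own schedule.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Call fdatasync() only if the file behind `fd` still has a name.
 * Returns 0 if the data was synced, 1 if the sync was skipped because
 * the link count is zero, and -1 on error (errno is set by the failing
 * call). The function name is made up for this illustration.
 */
int fdatasync_if_linked(int fd)
{
    struct stat st;

    assert(fd >= 0);
    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_nlink == 0)
        return 1;  /* nobody can open this file later: skip the sync */
    return fdatasync(fd) == 0 ? 0 : -1;
}
```

A program that keeps scratch data in unlinked files could call this wrapper wherever it calls fdatasync() today. Files that still have a name are synced exactly as before.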