Polling files on NFS shared directories

Recently we were implementing a distributed application in which processes read startup information from a shared file stored in an NFS directory. Unfortunately, the approach did not work as expected, and we ran into strange situations. In the end, we learned that polling NFS shared directories can be quite tricky.

The original approach

Let me briefly explain the algorithm initially intended for the application. There is one master process and N slave processes running on different machines. All of them are started at nearly the same time, but the slave processes must first wait for a file with instructions that is created and written by the master process.

Of course, there are more appropriate ways for the master process to share information and to signal startup to the N slave processes. However, since our application was a prototype, we decided to keep it as simple as possible at first.

The slave processes keep polling the shared directory to check if the file was created:

FILE * file = NULL;
while (file == NULL) {
  sleep(1);
  file = fopen(filename, "r");
  if (file == NULL) {
    // Check for serious errno (omitted...)
    continue;
  }
  // Read content
  // If content is not complete, fclose(file) and
  // set file = NULL to force reading again
}
fclose(file);

What happened

All slave processes started trying to read the same file, polling for it in the shared directory. Some succeeded as soon as the file was written. Strangely, some other slave processes never noticed the existence of the new file: for them, fopen kept failing with “no such file or directory”.

Even worse, after we renamed the file created by the master process to another random name, some slave processes were still able to find and open the file using the original name! We wondered: how can fopen successfully handle a path that is invalid?

How we debugged

Wondering why some processes could not find the recently created file, we decided to list the directory from within the polling loop, to make sure that the file was visible and accessible (with the right permissions) to the process.

FILE * file = NULL;
while (file == NULL) {
  sleep(1);
  system("ls -la");
  file = fopen(filename, "r");
  if (file == NULL) {
    // Check for serious errno (omitted...)
    continue;
  }
  // Read content
  // If content is not complete, fclose(file) and
  // set file = NULL to force reading again
}
fclose(file);

And surprise! As soon as the file is created, all processes are able to find it and read its content.
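The same check can also be made from inside the process, without shelling out to ls. The following is only a sketch and not the code we actually ran: dir_contains is a hypothetical helper name, and it assumes that a plain opendir/readdir pass is a fair stand-in for what ls does. Since it opens the directory just like ls, it may well have the same cache-refreshing side effect.

```c
#include <dirent.h>   /* opendir, readdir, closedir */
#include <string.h>   /* strcmp */

/* Hypothetical helper: returns 1 if `name` appears in a fresh listing of
 * `dirpath`, 0 if it does not, and -1 if the directory cannot be opened. */
int dir_contains(const char *dirpath, const char *name)
{
    DIR *dir = opendir(dirpath);
    if (dir == NULL) {
        return -1;
    }

    int found = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (strcmp(entry->d_name, name) == 0) {
            found = 1;
            break;
        }
    }

    closedir(dir);
    return found;
}
```

Logging the result of dir_contains(".", filename) next to the errno reported by fopen would show whether the directory listing and the open call disagree about the file.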

What causes the error

By accident, we discovered that, for NFS directories, the NFS client on each machine keeps a cache of the entries that belong to the directory. On subsequent calls that open files in the same directory, this cached list is not refreshed within a reasonable amount of time.

The cache contains a table with file name and file reference (e.g. inode). When a new file is created in the directory, fopen still looks up the outdated cache and erroneously claims that the file does not exist.

Likewise, after the file is renamed, fopen can still locate the previous name in the cache, since the table is not in sync with the real file system. The associated file reference (inode) is still valid, which explains why fopen was able to open the file.

What we propose as a solution

Someone has to force a new listing of the directory. It might be any other process (such as “ls -la” launched through the “system” library function), or the process itself. Therefore, we decided to have the process open the directory itself, just to force a refresh of the cached entries.

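FILE * file = NULL;
while (file == NULL) {
  sleep(1);

  // Force refresh before each attempt by opening the directory.
  // get_current_dir_name is a GNU extension (needs _GNU_SOURCE).
  char * wd = get_current_dir_name();
  if (wd != NULL) {
    DIR * dir = opendir(wd);
    if (dir != NULL) closedir(dir);
    free(wd);
  }

  file = fopen(filename, "r");
  if (file == NULL) {
    // Check for serious errno (omitted...)
    continue;
  }
  // Read content
  // If content is not complete, fclose(file) and
  // set file = NULL to force reading again
}
fclose(file);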
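The workaround can also be packaged as a small wrapper used in place of fopen. What follows is a minimal sketch under one loud assumption: that opening and closing the parent directory is enough to make the NFS client revalidate its cached entries, which is what we observed but have not confirmed against reference documentation. The name nfs_fopen is only illustrative.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup */
#include <dirent.h>   /* opendir, closedir */
#include <libgen.h>   /* dirname */
#include <stdio.h>    /* fopen, FILE */
#include <stdlib.h>   /* free */
#include <string.h>   /* strdup */

/* Open `path` like fopen(), but first open and close its parent directory.
 * Assumption: on the NFS client in use, opening the directory triggers a
 * refresh of the cached directory entries. */
FILE *nfs_fopen(const char *path, const char *mode)
{
    /* dirname() may modify its argument, so work on a copy. */
    char *copy = strdup(path);
    if (copy != NULL) {
        DIR *dir = opendir(dirname(copy));
        if (dir != NULL) {
            closedir(dir);
        }
        free(copy);
    }

    /* Whether or not the refresh worked, let fopen() report the result. */
    return fopen(path, mode);
}
```

In the polling loop, file = nfs_fopen(filename, "r") would then replace both the explicit refresh and the plain fopen call. Unlike the snippet above, the wrapper refreshes the directory that contains the file rather than the current working directory, so it also covers paths outside the working directory.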

Conclusion

Of course, this issue has to be investigated further and in more depth. The results presented in this article are merely empirical, and reference documentation still needs to be checked. But it is clear that polling shared remote resources must be handled with care, even in simple approaches.

3 Responses to Polling files on NFS shared directories

  1. Griffin says:

    I’m currently experiencing this exact same problem (your original code is effectively identical to mine). I’m about to implement your suggested fix, and I’m confident it will work as I already noticed that simply issuing an “ls” from a shell was enough to refresh the directory cache and “kick start” the fopen() call into seeing the file in question.

    I was just curious if this is still the solution you are utilizing, or if you came across any more elegant fixes?

    • Daniel Ferber says:

      Hi Griffin, sorry for the delay in answering you. I am still using the same solution. Unfortunately, I have not yet had time to investigate this issue further to find a more elegant solution.

  2. Ken says:

    Thread necromancy, I know…maybe if anyone sees this it will save some headaches.

    I’ve had the same issue. The workaround is to force a request back to the server, which open() will do for files and opendir() will do for directories. That’s why system(“ls”) works in that situation, since `ls` has to open the directory. The same issue also exists when trying to rely on `stat` to get accurate file sizes – since stat doesn’t actually open() the file it succumbs to the problem.

    My solution was to make a wrapper called nfs_fopen() and use it in place of fopen(). It’s pretty simple: before calling/returning fopen(), just opendir() and closedir() the directory name of the path.
