Polling files on NFS shared directories

August 6, 2008

Recently we implemented a distributed application in which processes read startup information from a shared file stored in an NFS directory. Unfortunately, the approach did not work as expected, and we ran into some strange situations. In the end, we learned that polling NFS shared directories can be quite tricky.

The original approach

Let me briefly explain the algorithm initially intended for the application. There is one master process and N slave processes running on different machines. All of them are started at nearly the same time, but the slave processes must first wait for a file with instructions that is created and written by the master process.

Of course, there are more appropriate ways for the master process to share information and signal startup to the N slave processes. However, since our application was a prototype, we decided to keep it as simple as possible at first.

The slave processes keep polling the shared directory to check if the file was created:

FILE * file = NULL;
while (file == NULL) {
  sleep(1);
  file = fopen(filename, "r");
  // Check for serious errno (omitted...)
  if (file == NULL) continue; // not there yet, poll again
  // Read content
  // If content is not complete, call fclose(file)
  // and set file = NULL to force reading again
}
fclose(file);
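The "content is not complete" case in the loop above exists because a slave may open the file while the master is still writing it. One common way to reduce that window, which we did not use in the prototype, is for the master to write to a temporary name and rename it once it is complete. The sketch below assumes both paths are in the same directory; publish_file and its parameters are hypothetical names, and this does nothing about the caching problem described later.

```c
#include <stdio.h>

/* Hypothetical helper for the master: write content to tmp_path, then
 * rename it to final_path, so that the final name only ever refers to
 * a completely written file. Returns 0 on success, -1 on failure. */
static int publish_file(const char * tmp_path, const char * final_path,
                        const char * content) {
  FILE * file = fopen(tmp_path, "w");
  if (file == NULL) return -1;

  int failed = (fputs(content, file) == EOF);
  if (fclose(file) != 0) failed = 1;  /* fclose flushes buffered data */

  if (failed || rename(tmp_path, final_path) != 0) {
    remove(tmp_path);                 /* do not leave a partial file behind */
    return -1;
  }
  return 0;
}
```

The slaves would keep polling for the final name only, for example publish_file("startup.tmp", "startup.txt", instructions) on the master side.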

What happened

All slave processes started trying to read the same file, polling for it in the shared directory. Some succeeded as soon as the file was written. But, strangely, other slave processes never noticed the existence of the new file. For them, the error message from fopen was always “no such file or directory”.
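The "check for serious errno" step that the snippets omit could look roughly like the sketch below. It treats "no such file or directory" (ENOENT) as the normal "not there yet" case. Treating ESTALE (stale NFS file handle) as retryable is our assumption, not something we verified. The names should_retry_open and try_open are hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the poll loop should simply retry
 * after fopen() failed with error code err, 0 if it should give up. */
static int should_retry_open(int err) {
  switch (err) {
    case ENOENT:  /* "No such file or directory": not visible yet */
    case ESTALE:  /* stale NFS file handle (assumed retryable here) */
      return 1;
    default:
      return 0;
  }
}

/* Try once to open filename for reading. Sets *give_up to 1 and reports
 * the error on stderr if the failure is not one we expect while polling. */
static FILE * try_open(const char * filename, int * give_up) {
  FILE * file = fopen(filename, "r");
  int err = errno;  /* save errno before any other library call */
  *give_up = 0;
  if (file == NULL && !should_retry_open(err)) {
    fprintf(stderr, "fopen(%s): %s\n", filename, strerror(err));
    *give_up = 1;
  }
  return file;
}
```

In the polling loop, a NULL result with give_up still 0 means "sleep and try again", while give_up set to 1 means the loop should stop.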

Even worse, after we renamed the file created by the master process to some other random name, some slave processes were still able to find and open the file using the original name! We wondered: how can fopen successfully handle a path that is no longer valid?

How we debugged

Wondering why some processes could not find the recently created file, we decided to list the directory from within the polling loop, to make sure that the file was visible and accessible (with the right permissions) to the process.

FILE * file = NULL;
while (file == NULL) {
  sleep(1);
  system("ls -la");
  file = fopen(filename, "r");
  // Check for serious errno (omitted...)
  if (file == NULL) continue; // not found yet, poll again
  // Read content
  // If content is not complete, call fclose(file)
  // and set file = NULL to force reading again
}
fclose(file);

And surprise! Now, as soon as the file was created, all processes were able to find it and read its content.

What causes the error

By accident, we discovered that, for NFS directories, the NFS client on each machine keeps a cache of the files that belong to the directory. On subsequent calls to open files in the same directory, this list of files is not refreshed within a reasonable amount of time.

The cache contains a table that maps each file name to a file reference (e.g. an inode). When a new file is created in the directory, fopen still looks up the outdated cache and erroneously claims that the file does not exist.

Likewise, after the file is renamed, fopen can still locate the previous name in the cache, since the table is not in sync with the real file system. The associated file reference (inode) is still valid, which explains why fopen was able to open the file.

What we propose as a solution

Someone has to force a new listing of the directory. It can be any other process (such as “ls -la” launched through the “system” library call) or the polling process itself. Therefore, we decided to have the process open the directory itself on every iteration, just to force a refresh of the cached file list.

FILE * file = NULL;
while (file == NULL) {
  sleep(1);

  // Force refresh before every attempt
  // (get_current_dir_name() needs _GNU_SOURCE).
  char * wd = get_current_dir_name();
  DIR * dir = (wd != NULL) ? opendir(wd) : NULL;
  if (dir != NULL) closedir(dir);
  free(wd);

  file = fopen(filename, "r");
  // Check for serious errno (omitted...)
  if (file == NULL) continue; // not found yet, poll again
  // Read content
  // If content is not complete, call fclose(file)
  // and set file = NULL to force reading again
}
fclose(file);
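For reference, here is a self-contained sketch of the same idea with a bounded number of attempts. It goes a bit further than our snippet: it actually reads the directory entries instead of only opening the directory. We have not checked which of the two is really required, and whether a listing refreshes the cache at all is only our empirical observation. The names refresh_directory and wait_for_file, and their parameters, are hypothetical.

```c
#include <dirent.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: list dirpath once, discarding the entries.
 * Returns 0 on success, -1 if the directory cannot be opened. */
static int refresh_directory(const char * dirpath) {
  DIR * dir = opendir(dirpath);
  if (dir == NULL) return -1;
  while (readdir(dir) != NULL) {
    /* entries are read only for the side effect of listing */
  }
  closedir(dir);
  return 0;
}

/* Poll for filename inside dirpath, listing the directory before each
 * attempt. Returns an open stream, or NULL after max_attempts tries. */
static FILE * wait_for_file(const char * dirpath, const char * filename,
                            int max_attempts, unsigned int delay_seconds) {
  char path[4096];
  int n = snprintf(path, sizeof path, "%s/%s", dirpath, filename);
  if (n < 0 || n >= (int) sizeof path) return NULL;  /* path too long */

  for (int attempt = 0; attempt < max_attempts; attempt++) {
    refresh_directory(dirpath);
    FILE * file = fopen(path, "r");
    if (file != NULL) return file;
    sleep(delay_seconds);
  }
  return NULL;
}
```

A slave would call something like wait_for_file(".", filename, 60, 1) and treat a NULL result as a startup failure instead of polling forever.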

Conclusion

Of course, this issue has to be investigated further and in more depth. The results presented in this article are merely empirical, and reference documentation still needs to be checked. But it is clear that polling shared remote resources must be done with care, even in simple approaches.


2 Comments

  • 1. Griffin  |  December 8, 2008 at 5:54 pm

    I’m currently experiencing this exact same problem (your original code is effectively identical to mine). I’m about to implement your suggested fix, and I’m confident it will work as I already noticed that simply issuing an “ls” from a shell was enough to refresh the directory cache and “kick start” the fopen() call into seeing the file in question.

    I was just curious if this is still the solution you are utilizing, or if you came across any more elegant fixes?

    • 2. Daniel Ferber  |  December 11, 2008 at 7:08 pm

      Hi Griffin, sorry for the delay in answering you. I am still using the same solution. Unfortunately, I have not yet had time to investigate this issue further to find a more elegant solution.

Disclaimer

This is the technical weblog of Daniel Felix Ferber. The postings on this site are his own and don’t necessarily represent the positions, strategies or opinions of IBM, Stefanini IT Solutions or Petrobras.
