Java: How fast are decorated HashMaps and TreeMaps?

Recently, I have written about how fast HashMaps and TreeMaps are on Java. However, more results are to be commented. This article discusses how the most common map decoratos affect map implementations.

Note: This article is still a draft that will be reviewed.

(more…)

Add comment June 30, 2009

Java: How fast are HashMaps and TreeMaps?

This article discusses how the choice of map implementation might affect execution speed of a Java programs. Empirical results are presented. HashMaps (specially IdentityHashMaps) were significantly faster than TreeMaps for nearly all situations, a result that is compatible to theory. The migration to an alternative third party implementation from Apache Commons did not show any perceptible speed advantages.

Introduction

A Map expresses a relationship from keys to values. Their typical use is storing data (the value object) to be efficiently looked up by an index (the key object). Another use is extending an object (the key object) with further data (the value object) without creating dependencies between the key and value objects. More information may be found on [1] [2] [3].

The hash table is an example of Map implementation. The value objects are stored within fixed size list or array and the position (index) of each value object is calculated based on its properties (according to a hash function). A good choice  parameters allows constant average cost to look up elements. That means that time required to look up element is not expected to increase as more elements are added to the table. Also, insertions and removals or key-value elements are constant average cost. References found on [4] provide further information about hash maps.

Another example of Map are search trees. Each key-value element is stored as a note inside hierarchy organized as a tree. The key objects are required to be “comparable”, that means one can decide if a key is greater, less or equal than another key. The key-value elements are placed  inside the tree in a way that transversing the tree will visit the keys according to their ordering. On a well balanced tree with well chosen ordering, key-value elements can be looked up , inserted and removed with average cost that increases like log(number of elements). This means that time required for each operation increases as more elements are added, but the rate of the increase declines. The explanation on [5] provide further information about search trees.

Although hash tables are clearly more efficient than search trees, the latter provides ordering as an additional feature that might be important depending on the application. Also, search trees allow looking up a range o keys while hash tables are limited to single key look ups.

Java 6 itself has two major  implementations for maps: HashMap, TreeMap. The former is based on a hash table, the latter on a red-black tree, that is a special search tree. There are also backward compatible maps (for example Hashtable) and maps for specific purposes (WeakHashMap, LinkedHashMap, IdentityHashMap). Java 6 also offers “decorators” that add functionality to existing Maps, like turning them into unmodifiable, thread safe or type checked maps.

There are other organizations that provide alternative implementations. One example is Apache CommonsJava Collections Framework with maps that were expected to run faster or that are intended for specific purposes. Most map functionality from Apache Commons was already embodied into Java API itself during the past Java releases

How execution speed was measured

All test were evaluated on a 16GM RAM, 8 dual-core AMD Opteron(tm) 8220 processors with Sun Java Server 6. However, only thread was dedicated for the test, while other operating system tasks were scheduled to other processors.

Each test was performed 20 times to obtain average values and consists of two steps: Insertions to measure time required to populate the Map and lookups to measure the time to retrieve entries from the Map. Tests were executed for different sizes of Maps, varying from 10,000 to 200,000 elements. For a give size, all 20 tests were performed in exactly the same order of insertions and look ups.

The keys used by insertion and lookup steps were chosen from a fixed set of random objects (the domain) which size was adjusted according to the intended size for the map. This adjustment guaranteed that after populating the entire map, only about 50% of the keys were used in the map and that about 30% of the look ups will fail because the key was not previously added to the map. All choices (objects to be inserted and to be looked up) were calculated prior to measuring time, avoiding that these operation would interfere the measures.

Comparing map implementations

The first test case intended to assure that the implementations behave as expected in theory. The results are shown in the following two graphs. The former summarizes the average time measures to populate the maps. And the second, average time measures to look up elements from the map. The y-axis is average execution time measured in milliseconds, the x-axis the is number (thousands) of elements inserted into the table.

Java Single Key Maps - Insertion

Java Single Key Maps - Insertion

Java Single Key Maps - Look up

Java Single Key Maps - Look up

Both graphs  suggest that the hash maps are significantly faster that tree maps. Both for insertion and look up, total execution time is very predictable and varies linearly to the size of the table for hash map implementations, suggesting that the both insert and look up operations are nearly constant time, as expected in theory. IdentityHashMap has shown the fasts execution most time, because its implementation does not rely on the hashCode() method. But the speed gain worths only while looking up objects.

Execution time for tree map implementation shows a curve with slowly  increasing slope. The larger the table, the slower each individual operation, as expected in theory. However, starting at table sizes from 115k, the tree map implementation lost its predictable behavior, while keeping increasing slope tendency. The cause of the execution time variance was not yet investigated.

Conclusion: For small tables, there is noticeable difference among map implementations. TreeMaps are more interesting since they provide interesting features like preserving key order and querying key ranges. If the purpose is associating an object to another on large tables, then IdentityHashMap is the most indicated solution.

Comparing with third party map implementation

The second test case investigated if there is a real speed improvement by replacing the Sun Java native  maps by another third party implementation, in this case, Apache Commons. The next two graphs summarize average  insertion and look up times for both Sun Java and Apache Common map implementations.

Java Single Key Maps - Apache - Insertion

Java Single Key Maps - Apache - Insertion

Java Single Key Maps - Apache - Look up

Java Single Key Maps - Apache - Look up

No significant difference was measured during insertion, although there identity maps performed a bit faster. However, the lookup time graph shows clearly that Java Sun native IdentityHashMap outperforms all other implementations.

Conclusion: There is no motivation to replace Sun Java native implementations. It would imply more dependencies that that do not increase execution speed in return.

Comparing more third party map implementation

Apache commons provides more map implementations for specific purposes. Therefore, their performance was measured to investigate if they are worth or not. Following two graphs summarize average insertion time (continuous lines) and look up times (dotted lies) from Apache Commons, also comparing them with Sun Java native HashMap.

Java Single Key Maps - Apache - Others 1

Java Single Key Maps - Apache - Others 1

Java Single Key Maps - Apache - Others 2

Java Single Key Maps - Apache - Others 2

None of the Apache Commons implementations outperformed Sun Java native HashMap.

Conclusion: Again, ther is no motivation to replace Sun Java native implementations. Possibly, Apacha Commons implementation were faster in the past. But as Sun Java evolved to Java 6, the native Collections framework was enhanced to provide satisfactory performance.

Add comment June 24, 2009

How to correclty calculate days between two days

The standard Java API has no solution to calculate the period between two days, although this is a common request for someone who is doing calendar arithmetic.

Googling around, the are many people who expose their own solutions (see some of them 1, 2, 3, 4).  All them suggest getting difference between the two dates, measured in number of milliseconds, and divide by 1000 * 60 * 60 * 24 (that means, the number of milliseconds in a day typical day).

However, are these results really reliable? I think not. All those that I have read did not yet consider correctly both daylight savings time changes and leap years.

And looking at the forums, I found many people arguing about their solutions, nevertheless, I suspect that all them are wrong.

Fortunately, one may resort on Joda Time,  small and reliable Java package that replaces the Java date and time API and provides tools for all calendar arithmetic. I really recommend getting to know with this package.

Calculating the number of days between two days becomes as simple as (see more):

  Days d = Days.daysBetween(startDate, endDate);
  int days = d.getDays();

And all calendar details and pitfalls are correctly handled by Joda Time. Much metter than relaying on self-, handmade solutions that one cannot easily guarantee to be correct for all situations.

Add comment April 13, 2009

Regex that matches path, filename and extension

I was looking for a regular expression for Python capable to match a string containing a valid path, filename and extension. Finally, I discovered following solution:

^(.*/)?(?:$|(.+?)(?:(\.[^.]*$)|$))

Let me explain how I got this regular expression. Fortunately, Scott Carpenter has written an excellent article about a regular expression to match a filename with extension. Matching the filename extension is not trivial for all possible situations.

(more…)

Add comment April 6, 2009

Skype on Ubuntu 8.10: microphone and internal speaker issues

It has been a long time since my last post… Also, it has been some time I managed live without a PC and Internet. Now I have a PC and Internet again and I took the chance to immediately install Linux!

Updated: I also solved the issue of skype consuming 100% cpu during a call.

Although the Ubuntu Guide has a good description how to install Skype on Ubuntu Intrepid, it can be quite challenging to place a call successfully. Shame on Gnome, Pulse Audio and Skype! I had to overcome following issues:

  • No audio on skype, or Skype complaining that sound device does not work.
  • No microphone input during calls.
  • Call also audible from the internal speaker (even when headset is connected to the green audio jack).
  • Skype consuming 100% cpu during the call and Skype delaying audio over 60 seconds!

Here is the solution that worked for me.

(more…)

7 comments March 18, 2009

Resume scp/rsync file transfer

This is very basic Linux knowledge, but the question has arised too often and I decided to document it.

Problem:

Transfer (or upload/download) a large tree of files and directories from one machine to another using a ssh connection, and being able to resume/continue if the operation is interrupted.

Solution:

Until now, I believe that the best solution is using rsync over ssh, since rsync has a feature to resume interrupted file transfer, even when an entire tree of directories is involved.

Command line:

rsync -vrPtz -e ssh host:/remote_path/* /local_path/

Explained:

-e ssh rsync will use ssh client instead of rsh
-z compress file transfer
-t preserve time (other attributes as owner or permissions are also possible)
-P resume incomplete file transfer
-r recursive into subdirectories
-v verbose

3 comments November 19, 2008

About Trilead SSH open source project

Some time ago I have written an article about the JSch open source project. However, I soon lost all my motivation when I faced, again, the code complexity required for JSch just to start a execution session, a file copy or a port forwarding with JSch. Nevertheless the completely misunderstood authentication API for JSch let me to a mood even worse. I started looking for alternatives and fortunately I found Trilead SSH, a pure Java implementation of SSH, available with the BSD license.

I have not yet tested Trilead SSH extensively to state that it provides a stable and complete SSH implementation. But results were very promising. Unfortunately, I have not much information about performance of data transfer with Trilead SSH.

I think that the major advantage of Trilead SSH is its very simple and intuitive API, directly targeted to tasks that shall be solved for someone that is using SSH. Everything is made very easy: execute a command, do port forwarding, transfer files. All complexity is done behind the scenes and a single method call uses to be enough. There is no need to care about SSH details as it was required for JSch.

In special, I liked the Trilead SSH API resposible for authentication that is very straightforward and matches very well the steps to log into the remote host. Even better, I was able to implement my own logic to respond to authentication failure exactly to the application needs. In this situation Trilead SSH is simplier and more flexible that JSch.

As a drawback, it is not possible to have low level access to the SSH protocol. However, this seems to be a very improbable requirement.

Depite of the simple API, Trilead SSH claims to implement all major ssh features: sessions (remote command execution and shell access), X11 forwarding, local and remote port forwarding, local stream forwarding, SCP and SFTP. Many options are available for authentication methods and cipher types.

The API is very well documented and includes examples and advices, what is very rare on a typical javadoc.

Conclusion

Replacing JSch by Trilead SSH has been a good choice, although I have not yet evidences for better performance nor stability. But the code become considerable simpler and shorted. And development time was reduces many times.

3 comments November 13, 2008

PCMan: the real and true Gnome file manager

I have modified this post after several people complained that I was misunderstood: Nautilus can, sure, display a tree view of the file system. Sure, but only for “who-knows-how”.

I do not like Gnome desktop environment because of its excessively simplified GUI that leverages all users as if they lack minimal skill to interact with a computer. And what I dislike most on Gnome is even more oversimplified vision of the file system provided by Nautilus on its default configuration.

Please, the converse is not true: I do not love KDE. And even more disappointing was seeing KDE 4 abandoning its advantages over Gnome just to become similar to Gnome (after a while, KDE 4 will not have anymore its nice power-user features, nor the nice simplicity of Gnome).

Specially annoying is the “Spatial Navigation” that opens a new window for each folder that is accessed (although real [wo]men would call it “directories”). As soon you have more than a bunch of directories, navigating from one to other will open more and more windows on your desktop just to drive you crazy.

Sure, Gnome developers will explain that a typical user will not own more than a hand full of files. And, they will store everything on the same folder like “documents” or “images” (or even on the desktop leading to the result of the picture on the right). They

Screenshot from a user that stores all documents on his desktop folder

Screenshot from a user that stores all documents on his desktop folder

will never store their files in a secure separated partition, nor on a file system over the network. Just in the case that the amount of files gets out of control, the user will create one or other subdirectory here or there. Therefore, subdirectory navigation is not necessary and it is reasonable to display the content of all directories at the same time.

By the way, Nautilus also claims to support file system browsing. Just click “browse” instead of “open” is what they say. However, there are some issues to consider. The “browsing” features are not visible by default. And a feature that is not visible, or that is very difficult to find, in practice, is not a feature at all. There is no preference where the user can easily choose which configuration he prefers. Googling around shows very few references about “nautilus tree view”. Asking experienced Gnome users was of nearly no use.

After reading the entire Nautilus documentation and after getting some good advices from people that read this article, I understood how. The user needs to start Nautilus with the –browse parameter on command line, open any folder with the “browse” command, then go back again to original folder. The, the user needs to display the side pane and select to tree visualization on this pane (it took a while how to figure out that I can change the content shown by the side pane).

Finally, one will be able to see the directory hierarchy and be able to jump to sibling directories. It is possible only to go to/from parent and child directories.

Fortunately, I discovered an alternative that allows the real and true file system browsing. It is called PCMan File Manager, and stays a bit hidden on most Linux distros. At the beginning, it provides an appearance familiar to Nautilus, and understands the same pre-defined folders like “documents” or “images”. However, after a short time of use, one will soon discover the power-user features that are available.

PCMan claims to be extremely fast and lightweight. I do not think this matters most, compared to the features it provides. It has a sidebar with useful and organized common locations and mount points. The sidebar may also display a complete tree view of the file system. Tabbed browsing is also supported.

I liked most that each tab has its own tree view. It was really annoying on KDE/Konqueror that the tree view did not always match the focused tab. The tree view is completely absent in Nautilus. Also I like the organized common locations and mount points. Although Konqueror has something similar, it is not as organized and as useful as it is on PCMan.

Also, there are features I still need to get better acquaintance: mounting/unmounting/ejecting media within PCMan. I really dislike having an icon for storage devices on the Gnome/KDE desktop while not being able to see and manage them within Nautilus/Konqueror.

Other nice feature is opening a directory as root, avoiding the need to switch to a terminal or to log-in as a different user.

However, I was not able to open and browse very large directories, like /usr/bin, as promised by PCMan web site.

8 comments October 24, 2008

Fortified compiling prevents building older RPM

Today I had to build and test some older Open MPI RPM packages. However, I was prevented to build them, although the code apparently was correct.

The command:

rpmbuild -ba openmpi-1.2.1.spec

The error message was:

In function 'open',
    inlined from 'orte_abort' at runtime/orte_abort.c:91:
/usr/include/bits/fcntl2.h:51: error: call to '__open_missing_mode' declared with attribute error: open with O_CREAT in second argument needs 3 arguments

The line of source code that causes the error is:

fd = open(abort_file, O_CREAT);

After some research, I discovered that the source code is correct. But nowadays, newer versions of gcc do additional security checking that were not available that the time the code was written.

For security reasons, calling open with O_CREAT and only two parameters is not acceptable anymore.

In order to build the RPM, I would either need to change the SPEC file or the src.rpm package, or find out a build options for rpmbuild that turns off further code protection.

First, I changed the the src.rpm package, chaning the problematic line to:

fd = open(abort_file, O_CREAT, 0666);

Then, I confirmed that it is also possible to pass additional parameters to rpmbuild in order to turn off code protection:

rpmbuild -ba openmpi-1.2.1.spec --define 'configure_options CFLAGS=-D_FORTIFY_SOURCE=0'

References:

  • http://www.redhat.com/archives/fedora-devel-announce/2007-September/msg00015.html
  • http://www.redhat.com/magazine/009jul05/features/execshield/
  • http://www.redhat.com/archives/fedora-tools-list/2004-September/msg00002.html
  • http://wiki.debian.org/Hardening
  • https://wiki.ubuntu.com/CompilerFlags

Add comment October 13, 2008

A poem about Subversion

I found an interesting poem from Peter Kriens about SVN (subversion) that was posted to the Eclipse committer mailing list. Really interesting how frustration can encourage your imagination in others arts besides computer programming.

(more…)

Add comment October 7, 2008

Previous Posts


Disclaimer

This is the technical weblog of Daniel Felix Ferber. The postings on this site are his own and don’t necessarily represent neither IBM’s, Stefanini IT Solutions nor Petrobras positions, strategies or opinions.

My Personal Weblog

This weblog is dedicated for my technical articles written in English. If you are interested in my personal thoughs, or my articles in Portuguese, please visit Daniel Ferbers Weblog.

Blogroll

Feeds

Pages

 

July 2009
M T W T F S S
« Jun    
 12345
6789101112
13141516171819
20212223242526
2728293031  

Archives

Meta