Wednesday, February 24, 2010

Scripting Refactoring -- Overthrowing the GUI

The most basic and common refactoring in any language is Rename. Whether it's Rename Variable, Rename Method, Rename Class, or Rename Namespace/Package, this simple refactoring improves code clarity and, when applied correctly, makes code easier to understand. Because reading code is the most frequent activity a programmer undertakes, Rename is one of the most powerful tools in a developer's arsenal.

Introducing Element Identification

Consider now the Rename refactoring when applied in a batch. In order to apply the Rename refactoring, I first need to be able to unambiguously identify the element being renamed, be it a class, method, variable, etc. Let's entertain for a moment the following notation:

[Member::[Member::[Member::...]]]Member

This notation could be used to identify:

  • a namespace at any level (e.g., MyNamespace, Namespace1::Namespace2)
  • a class within a namespace (e.g., MyNamespace::MyClass, SomeClass)
  • a member of a class (e.g., AClass::someVar, ANamespace::MyClass::a)

Although this syntax resembles the C++ scoping syntax, it is important to note that it is ambiguous. Given the search query Something::someMember, it is impossible to determine whether Something is a class, struct, or namespace. Similarly, someMember could be a class or struct within a namespace, or a member variable within a class or struct.

Overcoming Ambiguity

Since we need to constrain how we select elements, let's extend our syntax a bit by adding an optional { TYPE_LITERAL } constraint to each member:

[Member[{TYPE}]::...]Member[{TYPE}]

More cases can now be handled well:

  • Something{namespace}::someMember{class}
  • Something{class}::someMember{method}
  • Something{method}::someMember{variable}
  • Something{namespace}::Something{class}::someMember{method}
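To make the notation a little more concrete, here is a minimal sketch of how such a query might be split into segments. It is only an illustration: the names Segment and parse_element_path are made up for this example, and it assumes each member is a plain identifier optionally followed by a single {TYPE} constraint.

```python
import re
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    """One member of an element query (hypothetical helper type)."""
    name: str
    kind: Optional[str]  # e.g. "namespace", "class", "method"; None if unconstrained


# Assumption: a member is an identifier optionally followed by {TYPE}.
_SEGMENT = re.compile(r"([A-Za-z_]\w*)(?:\{(\w+)\})?")


def parse_element_path(path: str) -> List[Segment]:
    """Split a query such as Something{class}::someMember{method} into segments."""
    segments = []
    for part in path.split("::"):
        match = _SEGMENT.fullmatch(part)
        if match is None:
            raise ValueError(f"invalid segment: {part!r}")
        segments.append(Segment(match.group(1), match.group(2)))
    return segments


query = parse_element_path("Something{namespace}::Something{class}::someMember{method}")
```

Parsing is the easy part. Resolving each segment against real code is where the constraint earns its keep, since the same name can then be told apart by kind.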

But... we're still ambiguous. Consider the following code:

void doSomething() {
    for (int i=0; i<10; ++i) { /* ... */ }
    /* ... more code here ... */
    for (int i=0; i<15; ++i) { /* ... */ }
}

Staying Useful

In the above code we have two different counter variables with a name of i. Does that mean it's pointless to try to support batch refactorings or to try to constrain elements in a refactoring scripting language? No.

Regular expressions don't allow everything to be queried, yet they're still useful. Even after adding positive or negative lookahead (or lookbehind) assertions, they're still limited. Yet, at the same time, they handle a few more cases and become more generally useful.

The same thing should be able to happen for (most) programming languages. It's possible to create a scripting language, or syntax, that will allow us to easily identify most syntactic elements within a programming language. Perhaps this refactoring scripting language could leverage the BNF that defines the programming language, or perhaps there's even something better. Either way, we're one step closer to overthrowing the GUI and enabling bulk and scriptable refactorings.

Yet more excerpts and discussion from my thesis to come.

Tuesday, February 16, 2010

Enabling Bulk Refactorings

In my last post, I talked about some of the limitations of refactoring IDEs and enumerated three different cases that could better be handled by refactoring IDEs:

  • Vendor Branches - can we make refactorings apply like patches?
  • Universal Language - can we switch between vocabularies or languages?
  • In Concrete - can we overcome some of the small problems presented?

For this post, I'm going to examine the small concrete cases presented in In Concrete.

Bulk Renaming

My first case was to "Rename all my *ElementName classes in a given namespace to *Node where the asterisk represents some wildcard."
Consider over 100 classes like the following, spread out over at least as many files. Each one might represent a different element in an XML or otherwise non-trivial file format:

package com.example.xml.schemas;

public class ComplexTypeElement { /* ... */ }
public class CompoundTypeElement { /* ... */ }
public class SimpleTypeElement { /* ... */ }
public class AnyElement { /* ... */ }

Here's what the current Rename refactoring dialog looks like in Eclipse:

Augmenting the Rename UI

In this case, our goal is to bulk refactor the classes since our naming convention of *ElementName ended up being a poor convention. What we essentially need is a context sensitive, regex-based find-and-replace refactoring engine. So, let's augment the rename dialog with a couple of other elements that, left alone, wouldn't alter the rename behavior:

Now, let me describe a couple of the elements. First, I introduced a checkbox that allows the programmer to specify whether the refactoring should indeed be a wildcard refactoring. Second, I introduced an "Old name" field. This field represents the entire fully qualified class name to which the refactoring should be applied when a wildcard refactoring is being performed. When a standard (non-bulk or wildcard) refactoring is being performed, this field would display the original class name. Third, I added support for backreferences in the "New name" field. By allowing support for backreferences, the new class name can be based on and derived from the old class name, much like might be done with sed.

To handle the second case presented in my previous blog post, I could now use the following in the "Old name" and "New name" input fields, respectively:

Old name: DataNode::(.*)::get.*Instance

New name: DataNode::\1::instance

In this example, the backreference is used to identify the class but not needed for the member function. Like sed or awk, the syntax and structure would take a bit of learning, but it could be learned far faster than it would likely take to rename over 100 different classes manually.
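The behavior of the two fields can be approximated outside the IDE with an ordinary regex engine. The sketch below is an illustration only, not how a refactoring engine would work: it operates on a plain list of fully qualified names rather than on parsed code, bulk_rename is a made-up helper, the member and class names are invented, and the pattern used for the first case is my own assumption since only the second case's field values are given above.

```python
import re
from typing import Iterable, List


def bulk_rename(old_name: str, new_name: str, names: Iterable[str]) -> List[str]:
    """Apply one "Old name" / "New name" pair to each fully qualified name."""
    pattern = re.compile(old_name)
    renamed = []
    for name in names:
        match = pattern.fullmatch(name)
        # Only names matching the whole pattern are renamed; others pass through.
        renamed.append(match.expand(new_name) if match else name)
    return renamed


# Second case: field values as given above; the member names are made up.
members = [
    "DataNode::Parser::getParserInstance",
    "DataNode::Parser::parse",
    "Other::Cache::getCacheInstance",
]
renamed_members = bulk_rename(
    r"DataNode::(.*)::get.*Instance", r"DataNode::\1::instance", members
)

# First case: an assumed pattern for classes in the example package.
classes = [
    "com.example.xml.schemas.ComplexTypeElement",
    "com.example.xml.schemas.AnyElement",
    "com.example.xml.schemas.SchemaParser",  # hypothetical class, should not match
]
renamed_classes = bulk_rename(
    r"(com\.example\.xml\.schemas)\.(\w+)Element", r"\1.\2Node", classes
)
```

Matching against the whole name, rather than searching within it, keeps a pattern from accidentally renaming something that merely contains the old name.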

Conclusion

The above syntax is, of course, not yet sufficient for everything that might need to happen. For example, it might also match DataNode::NestedNamespace::AnotherNestedNamespace::getSomeInstance even though only one namespace should have been considered, but it's a start, and one that I'll elaborate on in later posts.
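The over-matching is easy to reproduce with a plain regex check. In the sketch below, LOOSE is the "Old name" pattern from above, while STRICT is one possible tightening that I am assuming for illustration: it allows exactly one name between DataNode and the member function. The SomeClass name and the matches helper are made up.

```python
import re

LOOSE = r"DataNode::(.*)::get.*Instance"       # pattern from the example above
STRICT = r"DataNode::([^:]+)::get\w*Instance"  # assumed fix: exactly one segment

nested = "DataNode::NestedNamespace::AnotherNestedNamespace::getSomeInstance"
direct = "DataNode::SomeClass::getSomeInstance"  # hypothetical class name


def matches(pattern: str, name: str) -> bool:
    """True if the whole qualified name matches the pattern."""
    return re.fullmatch(pattern, name) is not None


loose_hits = [name for name in (nested, direct) if matches(LOOSE, name)]
strict_hits = [name for name in (nested, direct) if matches(STRICT, name)]
```

Here loose_hits contains both names, while strict_hits contains only the direct one, because [^:]+ cannot span a :: separator.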

The main point, however, is that it is possible to overcome many of the current limitations of refactoring engines. Yes, it will take some work, but it will be worth it in the long run. And, in the above cases at least, the implementation is trivial compared with the work already done to support a single rename refactoring.

Tuesday, February 9, 2010

Limitations of Refactoring IDEs

Refactoring Book

Coupled with every strength is a weakness. Within an IDE, the ability to leverage the utility of a mouse is a strength. When trying to automate the selection of the next four characters, no matter where you may be in a file, requiring the same mouse-driven approach becomes a weakness. And so it is with refactoring tools today. The current mouse-driven approach to identifying parameters to refactorings has a number of weaknesses:

  • Other than through GUI automation tools, such as APIs and macros supported by IDEs, starting the refactoring process cannot be automated.
  • It is impossible to apply the same refactoring to many elements at one time. Rather, each element must be selected one at a time and the refactoring applied.
  • It is difficult or impossible to script refactoring operations.
  • Elements being refactored are often identified by line and column position within a file, which usually changes over time.

Although perhaps unseen and unrecognized, the weaknesses stated above are prevalent. The mainstay of element identification within existing refactoring tools is the mouse location and current cursor position. Although these function to identify the elements to which refactorings should be applied, they are neither appropriate nor reasonable for every situation. The use cases detailed below demonstrate an unmet need in the arena of refactoring tools.

Use Case 1 - Vendor Branches

Background

John is working with a vendor's source code and is maintaining a vendor branch. Unfortunately, the vendor's code has some unpatched bugs and often uses poorly named identifiers, which make it hard to work with.

Problem

John has a set of patches which he applies using the standard vendor branch approach to patching. He has two sets of patches. The first set of patches applies the Rename refactoring to many different elements in order to make the code easier to work with. The second set of patches fixes currently unpatched bugs.

Following the vendor branch patching process, after each vendor update John loads the vendor's source code into the source control system and then applies the first and second set of patches. Unfortunately, the first set of patches often fails to compile because any new references to the variables being renamed are not included in the patch. Each new reference must be manually renamed and the patch updated. Once the first patch has been successfully applied, the second bug-fixing patch usually applies successfully.

Although this approach accomplishes what it needs to, it is less than ideal. With each new vendor release the Rename refactoring must be re-applied to the source code for every newly introduced reference to the variables being renamed.
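The difference between a patch and a rename rule can be shown with a deliberately naive sketch. A patch records the lines that changed in one release, whereas a rule such as "cnt becomes retryCount" can be re-run against whatever the vendor ships next. Everything here is hypothetical: the apply_renames helper, the identifiers, and the sample source. It is also purely textual, so it ignores scope, strings, and comments, all of which a real refactoring engine would have to resolve.

```python
import re
from typing import Dict


def apply_renames(source: str, renames: Dict[str, str]) -> str:
    """Rename whole identifiers in source text according to a rename table."""
    if not renames:
        return source
    # \b restricts matches to whole identifiers, so cnt does not match inside cntMax.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, renames)) + r")\b")
    return pattern.sub(lambda m: renames[m.group(1)], source)


renames = {"cnt": "retryCount", "buf": "responseBuffer"}

# A later vendor release adds a new reference to cnt that an old patch never saw.
new_release = "int cnt = 0; fill(buf); log(cnt); int cntMax = 3;"
patched = apply_renames(new_release, renames)
```

Because the rule is applied to the new release as a whole, the newly introduced log(cnt) reference is renamed along with the original ones, while cntMax is left alone.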

Use Case 2 - Universal Language

Background

Beth is involved in an Open Source project whose primary developers are French. The source code is written and commented entirely in French, making it hard for Beth, who does not speak French. Although the developers have decided to change to English as their universal language, they have not yet made the change.

Beth worked with the developers to understand the meaning of each variable but needs to work with the French code for quite a while before English becomes the universal language.

Problem

Although Beth could maintain a separate English branch of the source code, that would require that all code be modified or committed twice, once for English and once for French. She could convert the source code over once within a separate branch, but each new reference to an existing variable would need to be changed. No good solution exists.

Use Case 3 - In Concrete

I would like to perform the following refactorings, but doing so through a GUI will be painful:

  • Rename all my *ElementName classes in a given namespace to *Node where the asterisk represents some wildcard.
  • Rename all get*Instance member functions to instance for every class within the DataNode namespace.

Perhaps the above seem meaningless, but I've had to do almost those exact same things. After a while, I discovered that some of the class names and methods that I was using weren't descriptive enough. I had to go rename over 50 classes across even more files.

Conclusion

Current refactoring tools are wonderful... for what they were intended for. They make the process of writing code, and cleaning up that code while you're working on it, far easier and more likely to happen than if the process were manual. But the GUI itself is also a weakness, one for which we need an alternative. We need a way to handle these and other use cases. We need a way to back-port refactorings, where possible, without forcing the programmer to take manual steps. We need wildcards and sed-like functionality in our refactoring tools.

Much of the above content comes directly from my thesis. I believe there's a solution and future posts will discuss pieces of that solution. I don't believe my solution is complete or perfect, but I hope to further the work in this area.

Image courtesy seizethedave on Flickr