Tuesday, March 30, 2010

Scripting Refactoring -- Overthrowing the GUI (part 4)

Over the last few blog posts I've mentioned a number of different reasons why it would be nice to be able to script the application of refactorings. This post talks more about one possible scripting language that could be used to script refactorings.

Interesting Cases

Although I could just provide the grammar and some sample input, it wouldn't be very instructive or helpful. Here are some more interesting cases and considerations.

Camel Case

Camel Case is a standard Java convention that uses capital letters to separate words that form an identifier. For example, isTrue, shouldValidate, and eatsHotdogsOrHamburgers are all camel case identifier names. The first letter is typically lower case with the first letter in each subsequent word in the identifier being upper case.

Imagine that Jr. Programmer comes along and creates a class named AbstractSyntaxNode that contains methods called complexNode, simpleNode and 20 more similarly named methods. When Mr. Senior Programmer comes along he immediately notices that in order to follow standard Java programming conventions, these should instead be named getComplexNode, getSimpleNode, and so on. Although it would be painful to manually go through and invoke the rename refactoring on all 22 methods, this is a good case for scripted refactorings. Consider the following rename refactoring script:

rename {
    AbstractSyntaxNode{class}::(.*)Node{method},
    AbstractSyntaxNode::get\1Node;
}

This, however, would result in the following method names:

  • getcomplexNode
  • getsimpleNode
  • ...

In the above case, we need an easy way to tell the substitution that the first letter of the captured text must be converted to upper case. Perl provides a \u escape sequence that capitalizes the following character. Thus, to handle the camel case issue nicely, the substitution engine would need to support similar escape sequences.
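To make the idea concrete, here is a minimal sketch in Python of a substitution step that understands \1-style backreferences plus a Perl-like \u. Python's own re.sub does not support \u in replacement templates, so the sketch expands the template by hand. The function names expand and rename are made up for illustration and are not part of the grammar or any real engine.

```python
import re

# Hypothetical template syntax: \N refers to capture group N, and \u
# upper-cases the first character of whatever text is emitted next.
_TOKEN = re.compile(r"\\(u|\d+)")


def expand(template, match):
    """Expand backreferences and u-escapes in a replacement template."""
    out = []
    upper_next = False

    def emit(text):
        nonlocal upper_next
        if text and upper_next:
            text = text[0].upper() + text[1:]
            upper_next = False
        out.append(text)

    pos = 0
    for token in _TOKEN.finditer(template):
        emit(template[pos:token.start()])   # literal text before the escape
        if token.group(1) == "u":
            upper_next = True               # applies to the next emitted text
        else:
            emit(match.group(int(token.group(1))))
        pos = token.end()
    emit(template[pos:])                    # trailing literal text
    return "".join(out)


def rename(name, pattern, template):
    """Return the new name, or the old one if the pattern does not match."""
    m = re.fullmatch(pattern, name)
    return expand(template, m) if m else name


print(rename("complexNode", r"(.*)Node", r"get\u\1Node"))
print(rename("simpleNode", r"(.*)Node", r"get\1Node"))
```

With the \u escape in place, the second line of the script above would be written as AbstractSyntaxNode::get\u\1Node.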

Absolute and Relative Position

Absolute and relative position markers also make an interesting case. For example, I might want to rename the variable on the second line after the start of the method. Or, I might have my cursor over a variable within an editor that has a keyboard shortcut for invoking a refactoring. In these cases, I have both relative and absolute position information that is likely desirable in a refactoring scripting language.

Assume for a moment that I want to do an Extract Method refactoring of everything between lines 40 and 60, inclusive. I might write the following:

extractMethod {
    40, 60, newMethodName;
}

In this case, there's no need for column information, so using a basic integer for the line information is sufficient. But what if I wanted to inline the function referenced between columns 20 and 30 on line 42? Then I would need to write something like this:

inline {
    42:20, 42:30;
}

As many editors display the line and column information as line:column, the above syntax is fairly familiar. But, how could we go about identifying relative position indicators? Consider the following:

rename {
    @(SomeClass::someMethod{method && definition} + 2), newVariableName;
}

This shows a couple more interesting cases. First, I need to be able to identify the method definition, which I do by restricting the results of my element query to definitions in the scope clause. Second, I need the @ operator to turn the element reference into a position-based reference.
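As a rough sketch of how an engine might represent these positions, the Python below parses the absolute forms (40 and 42:20) and applies a line offset for the relative form. The names Position, parse_position, and offset are hypothetical, and the sketch assumes something else has already resolved the element reference inside @( ... ) to the position of its definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Position:
    line: int
    column: Optional[int] = None  # None means no column was given


def parse_position(text: str) -> Position:
    """Parse '40' or '42:20' into a Position."""
    line, sep, column = text.strip().partition(":")
    return Position(int(line), int(column) if sep else None)


def offset(base: Position, lines: int) -> Position:
    """Resolve a relative reference such as '@(element + 2)'.

    `base` is assumed to be where the element is defined; the result
    is a line-only position `lines` lines below it.
    """
    return Position(base.line + lines)


print(parse_position("40"))
print(parse_position("42:20"))
print(offset(Position(10, 5), 2))
```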

Conclusion

Such a scripting language could easily grow unwieldy, but the Pareto Principle suggests that there must be a fairly "easy" solution that will handle 80% of the cases. What would such a solution look like? It's really hard to say until somebody actually implements one and has some real-world users. Nevertheless, my sample input and ANTLR-based grammar are linked below.

Comments and thoughts? I'd love to hear them! Thanks.

A full set of sample inputs is available on GitHub. Similarly, the full grammar is also available.

Monday, March 8, 2010

Scripting Refactoring -- Overthrowing the GUI (part 3)

I've already mentioned why it might be useful to perform refactorings in bulk. Perhaps we've decided we followed a poor naming convention, or perhaps we've moved some classes into their own namespace, so part of the class name is redundant and can be removed.

Another point of textual element identification is to remove the necessity of the GUI. Take, for example, Emacs and Vim. Both are very capable editors, yet both lack (to the best of my knowledge) complete and capable refactoring support. But if an element can be identified in text, either through line and column information or through an element reference, editors like Vim, Emacs, and TextMate could have macros written for them that call out to a command-line refactoring engine.

Two Syntaxes for Scripting Refactoring

Each refactoring could look like a function call as in the following:

rename(OldNamespace::OldClass, NewNamespace::NewClass);

In the above case, the rename refactoring takes in two different ElementReference parameters. In order to perform many rename refactorings, we'd end up with a list of the above:

rename(OldNamespace::FirstClass, NewNamespace::FirstClass);
rename(OldClass, NewClass);
rename(OldClass::oldMethod, OldClass::newMethod);
rename(globalFunction, newGlobalFunction);
//...

Although the above is readable, it seemed a little repetitive, so I went with a block structure that allowed the refactoring to be specified once, with a list of parameter sequences:

rename {
    // support named parameters
    // and type restriction after name specifier
    oldName = SomeNamespace{namespace}::/(.*)ElementName/,
    newName = SomeNamespace::\1Node;

    // type restriction within regular expression
    /SomeClass{class}::get(.*)Node/,
    SomeClass::\1Node;

    ::memset, ::customMemSet;
}

The block syntax above isn't as repetitive as the first example. To help the case where an editor would invoke the refactoring, line and column information could be used to identify elements as well.
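One way to see that the two syntaxes carry the same information is to model a block as a refactoring name plus a list of parameter sequences, and then expand it back into the function-call form. The Python below is only a sketch of that relationship. RefactoringBlock and to_calls are hypothetical names, and no parsing of the block text is attempted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RefactoringBlock:
    name: str                                   # e.g. "rename"
    parameter_sequences: List[Tuple[str, ...]] = field(default_factory=list)

    def to_calls(self) -> List[str]:
        """Expand the block into one function-call style line per sequence."""
        return [
            f"{self.name}({', '.join(params)});"
            for params in self.parameter_sequences
        ]


block = RefactoringBlock(
    "rename",
    [("OldClass", "NewClass"), ("::memset", "::customMemSet")],
)
for call in block.to_calls():
    print(call)
```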

Conclusion

Although the above is a new syntax, it's built on many different syntaxes that are common to languages. It uses a block syntax for overall structure. Named parameters are supported in many different dynamic languages. It supports regular expressions for matching and backreferences for substitutions. It also has C++ and Java style comments and uses the sometimes-despised semicolon as a terminator.

Yeah, some people aren't going to like the syntax and may avoid it like the plague. But, arguably, there are some very good reasons to make refactorings scriptable for the average programmer, among which are the easy interface it would provide to editors and the ability to do certain refactorings in bulk.

Thursday, March 4, 2010

Scripting Refactoring -- Overthrowing the GUI (part 2)

Last week I introduced a hokey syntax that could be used to identify various elements that were going to be refactored in bulk. The syntax was ambiguous and incapable of expressing anything of even mild complexity. Now for something better.

The proposed syntax doesn't apply to every language. It would be both pointless and painful to attempt such a task. Rather, the following syntax is intended to work with languages like Java, C/C++, and other similarly structured languages.

A Grammar Proposal

Consider the following grammar defined in EBNF, with non-terminals being lower-cased and terminals being upper-cased and left implicit:

query-element: pattern ( "::" pattern )*;

pattern: ( IDENTIFIER | "/" REGEXP-LITERAL "/" ) type-specifier?;

type-specifier: "{" TYPE-LITERAL "}";

Using the above EBNF, arbitrary elements can be queried and their type constrained when necessary. Here are a few examples:

  • All classes named MyClass in any namespace - /.*/::MyClass
  • Any class containing the word Node in the AST namespace - AST::/.*Node.*/
  • The second class containing the word Node in the AST namespace - AST::/.*Node.*/[1]
  • All typedefs named subType contained in functions of the Node class - Node::/.*/::subType{typedef}
  • All variables named other contained in functions of the Node class - Node::/.*/{function}::other{variable}
  • Any function starting with test in the Temp namespace - Temp::/test.*/{function}
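For readers who prefer code to EBNF, here is a small Python sketch that parses a query into its patterns and checks it against a scoped name. It implements only the three productions exactly as written above, and it makes several assumptions of its own: identifiers follow C-style rules, a REGEXP-LITERAL cannot contain a slash, a TYPE-LITERAL is a single word, and a candidate element is given as a list of (name, type) pairs from the outermost scope inward. The names Pattern, parse_query, and matches are made up for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Pattern:
    regex: str                 # identifiers are stored as escaped regexes
    type_name: Optional[str]   # e.g. "class" or "function"; None = any type


# pattern: ( IDENTIFIER | "/" REGEXP-LITERAL "/" ) type-specifier?
_PATTERN = re.compile(
    r"(?:(?P<ident>[A-Za-z_]\w*)|/(?P<regex>[^/]*)/)(?:\{(?P<type>\w+)\})?"
)


def parse_query(text: str) -> List[Pattern]:
    """query-element: pattern ( "::" pattern )*"""
    patterns = []
    pos = 0
    while True:
        m = _PATTERN.match(text, pos)
        if m is None:
            raise ValueError(f"expected a pattern at offset {pos} in {text!r}")
        ident = m.group("ident")
        regex = m.group("regex") if ident is None else re.escape(ident)
        patterns.append(Pattern(regex, m.group("type")))
        pos = m.end()
        if pos == len(text):
            return patterns
        if not text.startswith("::", pos):
            raise ValueError(f"expected '::' at offset {pos} in {text!r}")
        pos += 2


def matches(query: List[Pattern], path: List[Tuple[str, str]]) -> bool:
    """True if every pattern matches the (name, type) pair at its depth."""
    if len(query) != len(path):
        return False
    return all(
        re.fullmatch(p.regex, name) is not None
        and p.type_name in (None, kind)
        for p, (name, kind) in zip(query, path)
    )


query = parse_query("Temp::/test.*/{function}")
print(matches(query, [("Temp", "namespace"), ("testParser", "function")]))
print(matches(query, [("Temp", "namespace"), ("testParser", "class")]))
```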

Admittedly, this query language does not support all the queries that one might want to make, but it should be noted that the above query language can be easily extended. For example, instead of using a TYPE-LITERAL terminal symbol, one could use a type-literal production that allowed for negating conditions, alternations, and more.

Function and operator overloading, as supported by C++, also complicate matters. Even though the fully qualified name of a function may be specified, it does not imply that only a single match exists. Any number of functions might exist. This language could be expanded to allow the types of the parameters to be specified. By specifying the parameter types, any function could be fully resolved. In addition, anonymous namespaces and blocks could be identified by augmenting the language to support empty names and array indexing.

Isn't It Complicated?

Yes, now we're starting to get complicated. But honestly, most of those features wouldn't matter. It's like the saying, "20% of your code handles 80% of the cases." By starting out simple, and only augmenting the grammar when a real, arguable need is established, the grammar could be expanded little by little in useful ways. It really goes back to two principles: KISS and YAGNI. Start simple, design when necessary, refactor to your goal.

Bonus points to the person who finds the grammar / example mismatch above.