Friday, February 27, 2009

A true code format, and its editor

This is post is influenced by my recent preoccupation with code representation (see post on coding in thought, for instance) and inspirations from Subtext and ediola.

Imagine a code editor that allowed you to edit the actual AST of the language instead of text that will be parsed into the AST of the language. It would obviously make it transparent to you that you're editing a tree instead of free form text, except for taking out the inefficiencies. For instance it would allow you to use the normal navigation keys to scroll up and down through the code, but would skip keywords and language constructs automatically because they're are actually tokens and don't need to be treated as English words. When you navigate to a user entered value such as a variable, it would drop into an in-place editor that allowed you to change it, and it would then automatically change all occurrences of that variable because it knows that its a variable.

For the same reason, it would be easily able to provide auto suggestions /completions as it knows the language's syntax. An add-on feature would collect data about imported and declared classes, methods and variables as you edit, and make those available in the auto suggest as well.

The best part of course would be it would have none of the limitations of text - no need to worry anymore about whitespace as it would become exactly what it should be - white space that is merely a visual thing, not something that's stored in the code.

Sidebar: I realized after I wrote this that such an editor could be at a loss with a language like Python, which essentially mandates whitespace. However, I do think the reason Python went the way it did is reconciling to text being the universal code format, and if so, making the best of it, namely to avoid all visual clutter caused by begin/end tokens, and instead rely on whitespace to denote it for both humans and compiler alike. That being the case, there's no reason this editor can mimic the indentation, and provide the output that the python interpreter expects.


How would such code be stored? Why, as the AST itself. The actual storage format (as long its not back to text) is not important, as any number of those are applicable. What is important, though, is that it is in the exact format that the language allows for human thought to be enshrined as code - nothing more (specifically, text formatting warts), nothing less.

Once we have this, we can think of more improvements. How about treating comments as annotations on code, and extracting them out completely? This way, all the issues related to comment obsolescence and non-locality are resolved. We could just tie a comment to a particular piece of code, and wherever the code goes, the comment follows. If the code chunk is removed, broken or altered, it would be trivial to mark the comment as outdated and to be treated with a pinch of salt.

There's another gem hiding in the previous paragraph - that of tying a comment to a particular piece of code. Well what exactly is "a particular piece of code" unless we're talking about some named chunk such as a method? Nothing defined, yet. However, consider this: what if we had a way of identifying every piece of code uniquely - an id for each atomic piece of code? Notice that I've specifically avoided considering the line number as the id because is not just another of those text formatting appendages, but also an unreliable anchor to any particular piece of code as it can easily change with an unrelated edit of the file. I'm thinking more of :
  • a unique id for each statement (or its component) (eg, the 5th statement of the main method - which could well be on the first "line" if it were text)
  • expressions for groups of such statements (eg, first 5 statements of method foo),
  • reference via scope/namespace (eg: 3rd if in Class a's method foo)
- an XPath for code, if you will. Once we have this, refactoring becomes analytical, even expressible as code. And of course, we'll be able to attach all kinds of metadata to chunks of referencible code, including comments.

Update: This from my wife: "3rd 'if' in method a.foo" is as arbitrary as line number. The better idea is to sequence them in order of creation, ie, the first 'if' created in a.foo gets id #1, the second #2 and so forth - for eternity. That way edits will keep the piece of code still identifiable. Ids could be periodically cleaned up to remove unused ones.

Ok, all of the above has been about editing code, not necessarily the format laid down by a language itself. But the more I think about it, formats have been more a limitation laid down by the tools that surround a language rather than the languages themselves. Compilers require classes to be contained in single files. Compilers require directory paths for package name spaces and so forth. There is really no reason a compiler cannot read a pre-parsed version of code as its primary source. By the same token, there is no reason source code cannot be stored as the AST - in version control or outside. The newer source code tools shouldnt have problems dealing with whatever binary format is chosen to store the AST, and diff against them.

And you know the best(est) part? No more tabs vs spaces wars!

No comments: