Friday, February 27, 2009

A true code format, and its editor

This post is influenced by my recent preoccupation with code representation (see the post on coding in thought, for instance) and by inspiration from Subtext and ediola.

Imagine a code editor that allowed you to edit the actual AST of the language instead of text that will be parsed into the AST of the language. It would, of course, hide from you the fact that you're editing a tree instead of free-form text, except for taking out the inefficiencies. For instance, it would allow you to use the normal navigation keys to scroll up and down through the code, but would skip keywords and language constructs automatically because they are actually tokens and don't need to be treated as English words. When you navigate to a user-entered value such as a variable, it would drop into an in-place editor that allowed you to change it, and it would then automatically change all occurrences of that variable because it knows that it's a variable.
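As a rough illustration of the "change all occurrences" idea, here is a minimal sketch using Python's standard ast module (ast.unparse needs Python 3.9 or later). The RenameVariable class and the rename helper are names I made up for this example, and the sketch ignores scoping: it renames the identifier everywhere in the snippet, which a real editor would not do.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename every occurrence of one identifier in a parsed tree."""

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def visit_Name(self, node):
        # Name nodes cover both reads and assignments of a variable.
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node):
        # Function parameters are stored as arg nodes, not Name nodes.
        self.generic_visit(node)
        if node.arg == self.old:
            node.arg = self.new
        return node


def rename(source, old, new):
    """Parse the source, rename on the tree, and render it back as text."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


source = 'def area(w, h):\n    total = w * h\n    print("total is", total)\n    return total\n'
renamed = rename(source, "total", "result")
print(renamed)
```

Because the rename works on the tree, the word "total" inside the string literal is left alone, which is exactly what a text search-and-replace would get wrong.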

For the same reason, it could easily provide auto-suggestions/completions, as it knows the language's syntax. An add-on feature would collect data about imported and declared classes, methods and variables as you edit, and make those available in the auto-suggest as well.

The best part, of course, is that it would have none of the limitations of text - no need to worry anymore about whitespace, as it would become exactly what it should be - white space that is merely a visual thing, not something that's stored in the code.

Sidebar: I realized after I wrote this that such an editor could be at a loss with a language like Python, which essentially mandates whitespace. However, I do think the reason Python went the way it did is that it reconciled itself to text being the universal code format and made the best of it, namely by avoiding all the visual clutter caused by begin/end tokens and instead relying on whitespace to denote blocks for both humans and compiler alike. That being the case, there's no reason this editor cannot mimic the indentation and provide the output that the Python interpreter expects.


How would such code be stored? Why, as the AST itself. The actual storage format (as long as it's not back to text) is not important, as any number of those are applicable. What is important, though, is that it is in the exact format that the language allows for human thought to be enshrined as code - nothing more (specifically, text formatting warts), nothing less.
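To see what "nothing more, nothing less" could mean in practice, here is a small sketch with Python's standard ast module. Two snippets that differ only in formatting parse to the same tree. The canonical helper is a name invented for this example, and ast.dump is used only as a convenient way to compare trees, not as a proposed storage format.

```python
import ast

# Two spellings of the same function that differ only in whitespace.
compact = "def add(a,b): return a+b"
spaced = "def add( a, b ):\n\n        return a  +  b\n"


def canonical(source):
    """Render the parsed tree without line or column information."""
    return ast.dump(ast.parse(source))


same_tree = canonical(compact) == canonical(spaced)
print(same_tree)
```

Once the formatting is gone, the two versions are indistinguishable, so there is nothing left to argue about in a diff.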

Once we have this, we can think of more improvements. How about treating comments as annotations on code, and extracting them out completely? This way, all the issues related to comment obsolescence and non-locality are resolved. We could just tie a comment to a particular piece of code, and wherever the code goes, the comment follows. If the code chunk is removed, broken or altered, it would be trivial to mark the comment as outdated and to be treated with a pinch of salt.
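One way this could work, sketched in Python under my own assumptions: store a fingerprint of the code alongside the comment, and flag the comment as outdated when the fingerprint no longer matches. The Annotation class and the fingerprint function are hypothetical names for this example only.

```python
import ast
import hashlib
from dataclasses import dataclass


def fingerprint(node):
    """Hash the structure of a tree node, ignoring layout and position."""
    return hashlib.sha256(ast.dump(node).encode()).hexdigest()


@dataclass
class Annotation:
    text: str
    code_hash: str  # fingerprint of the code when the comment was written

    def is_stale(self, node):
        # The comment is suspect once the code it describes has changed.
        return fingerprint(node) != self.code_hash


original = ast.parse("total = price * quantity").body[0]
note = Annotation("Excludes tax.", fingerprint(original))

reformatted = ast.parse("total   =   price*quantity").body[0]
edited = ast.parse("total = price * quantity * 1.2").body[0]

stale_after_reformat = note.is_stale(reformatted)
stale_after_edit = note.is_stale(edited)
print(stale_after_reformat, stale_after_edit)
```

Reformatting leaves the comment valid, while a real change to the statement marks it as one to be taken with a pinch of salt.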

There's another gem hiding in the previous paragraph - that of tying a comment to a particular piece of code. Well, what exactly is "a particular piece of code" unless we're talking about some named chunk such as a method? Nothing defined, yet. However, consider this: what if we had a way of identifying every piece of code uniquely - an id for each atomic piece of code? Notice that I've specifically avoided considering the line number as the id, because it is not just another of those text formatting appendages, but also an unreliable anchor to any particular piece of code, as it can easily change with an unrelated edit of the file. I'm thinking more of:
  • a unique id for each statement (or its component) (eg, the 5th statement of the main method - which could well be on the first "line" if it were text)
  • expressions for groups of such statements (eg, first 5 statements of method foo),
  • reference via scope/namespace (eg: 3rd if in Class a's method foo)
- an XPath for code, if you will. Once we have this, refactoring becomes analytical, even expressible as code. And of course, we'll be able to attach all kinds of metadata to chunks of referenceable code, including comments.
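The addressing scheme above can be sketched with Python's ast module. The resolve and find_def functions, and the (scope, kind, index) shape of the address, are assumptions made for this example; the sketch only counts statements directly inside the named scope, not nested ones.

```python
import ast

SOURCE = '''
class A:
    def foo(self, x):
        if x < 0:
            return "negative"
        if x == 0:
            return "zero"
        if x > 100:
            return "large"
        return "other"
'''


def find_def(body, name):
    """Find a class or function with the given name in a list of statements."""
    for node in body:
        if isinstance(node, (ast.ClassDef, ast.FunctionDef)) and node.name == name:
            return node
    raise KeyError(name)


def resolve(tree, scope, kind, index):
    """Resolve e.g. ("A", "foo"), ast.If, 3 to the 3rd `if` directly in A.foo."""
    node = tree
    for name in scope:
        node = find_def(node.body, name)
    matches = [stmt for stmt in node.body if isinstance(stmt, kind)]
    return matches[index - 1]  # addresses are 1-based, as in the prose


tree = ast.parse(SOURCE)
third_if = resolve(tree, ("A", "foo"), ast.If, 3)
condition = ast.unparse(third_if.test)

# "First 3 statements of method foo" is a slice over the same scope.
first_three = find_def(find_def(tree.body, "A").body, "foo").body[:3]
print(condition, len(first_three))
```

Here the address resolves to the `if x > 100` statement no matter which text line it happens to sit on.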

Update: This from my wife: "3rd 'if' in method a.foo" is as arbitrary as line number. The better idea is to sequence them in order of creation, ie, the first 'if' created in a.foo gets id #1, the second #2 and so forth - for eternity. That way edits will keep the piece of code still identifiable. Ids could be periodically cleaned up to remove unused ones.
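A minimal sketch of creation-order ids, in Python. The Scope class and its methods are invented for this example; the point is only that the id comes from a counter that never goes backwards, while the display order is kept separately.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Scope:
    name: str
    statements: dict = field(default_factory=dict)  # id -> code
    order: list = field(default_factory=list)  # ids in display order
    _next_id: count = field(default_factory=lambda: count(1))

    def insert(self, position, code):
        """Add a statement at a display position; its id reflects creation order."""
        new_id = next(self._next_id)
        self.statements[new_id] = code
        self.order.insert(position, new_id)
        return new_id

    def remove(self, stmt_id):
        """Delete a statement; its id is retired, never reused."""
        del self.statements[stmt_id]
        self.order.remove(stmt_id)


foo = Scope("a.foo")
first = foo.insert(0, "if x < 0: ...")
second = foo.insert(1, "if x == 0: ...")
third = foo.insert(0, "if x is None: ...")  # shown first, but created third

foo.remove(first)
fourth = foo.insert(0, "if x > 100: ...")
print(foo.order, foo.statements[second])
```

After these edits, id 2 still points at the `x == 0` check even though statements were added above it and removed around it.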

Ok, all of the above has been about editing code, not necessarily the format laid down by a language itself. But the more I think about it, formats have been more a limitation laid down by the tools that surround a language than by the languages themselves. Compilers often require classes to be contained in single files. Compilers often require directory paths for package namespaces, and so forth. There is really no reason a compiler cannot read a pre-parsed version of code as its primary source. By the same token, there is no reason source code cannot be stored as the AST - in version control or outside. Newer source code tools shouldn't have problems dealing with whatever binary format is chosen to store the AST, or diffing against it.

And you know the best(est) part? No more tabs vs spaces wars!

Sunday, February 08, 2009

Further Ideas for a new language

Some more random additions to the new language idea:

A name: Jack
This is mostly tongue in cheek, but I thought if I ever got around to creating this language, I'd call it Jack. You know, like in Jack of all trades, cos it's trying to be one - with its polyglot leanings. I did consider calling it Poly, but the options for source file name extensions would be .pl or .py or .ply. The first two are taken, and the third doesn't seem that interesting. OTOH, .jak comes out nice.

As an aside, I did also consider calling it V after me, but vetoed it cos it didn't give any indication about the language itself, just self-promotion.

Anyway, Jack it is. And I've not yet checked if there's already a language with that name - hope not. But think of the possibilities with such a name:

- You could tell others off by saying "You don't know Jack"
- Like Java Jars and Ruby Gems, there could be Jack (in-the) boxes :)
- I know I'm gonna call the hello world in Jack "Jackrabbit slims" :))

Enough buffoonery, now for some serious stuff:

It has optimisations
You know how you have to "drop down a few layers" to optimize things? For example, you can do 3D graphics using the Java 3D APIs, or you can use OpenGL, or you could just drop into the video card's API. Well, Jack will have the same capability. If a thing can be done in more than one way, and one way is the composition of multiple functions while the other is a single function that does effectively the same, you will be able to specify that. Instant reuse. Want to use the MFC classes, or OS sys calls, to do something? Go right ahead. And supported in the language too. Sandbox to be specified, but definitely much easier than JNI et al - remember I'm still dreaming :).

The key thing I want to point out is the "supported in the language" bit. We as programmers routinely cut across language/API boundaries. However, each language is in its own sandbox held together by scripts, duct-tape and magic. Why not be explicit about it? Obviously there will be need for sandboxes, but making it explicit enables maintainability. It should be clear how different parts of the application ecosystem actually connect.

Saturday, February 07, 2009

Ideas for a new language

A number of nebulous ideas have been swirling in my mind of late on what's lacking in today's languages. I woke up this morning with the remnants of a dream where I was coding in a language that magically had closed all those gaps, and even improved things a bit. Here's what my conscious self remembers from that dream.

It was based on conditionals
I have been thinking a lot about conditionals lately, partly due to being impressed by Subtext. But closer to work, I've been thinking of a large legacy application and how to refactor it so that chunks of logic that are in the wrong layers can be relocated easily, and how other chunks of logic that are intertwined with each other even though they logically belong to different paths can be extracted, and then combined via configuration using (for lack of a better term) a micro ESB.

And it struck me that the conditional (represented by the plain vanilla if, or an inheritance hierarchy) is the cleave point of "paths". All conditionals that are based on a particular check (eg: is the object of type foo, or is the value equal to x) are essentially in the same path. Therefore code should be organised (or auto-organised) such that conditionals are matched by their paths, and the largest conditional branch naturally gravitates to the top. This is, of course, exactly what Jonathan Edwards' Subtext does, with the additional advantage that it also ensures that there are no gaps in the conditionals - all possible values (or value ranges) have to be filled in. This would be a great thing for a language to have.
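To make the "no gaps in the conditionals" idea concrete, here is a small Python sketch of my own (it is not how Subtext does it). It treats each branch as a half-open integer range and reports the parts of the domain that no branch handles. The find_gaps function and the age example are hypothetical.

```python
def find_gaps(domain, branches):
    """Return the sub-ranges of `domain` not covered by any branch.

    All ranges are half-open (start, end) pairs of integers.
    """
    lo, hi = domain
    gaps = []
    cursor = lo  # everything below the cursor is already accounted for
    for start, end in sorted(branches):
        if start > cursor:
            gaps.append((cursor, min(start, hi)))
        cursor = max(cursor, end)
        if cursor >= hi:
            break
    if cursor < hi:
        gaps.append((cursor, hi))
    return gaps


# Branches for "child", "teen" and "working age"; 18-20 and 65+ are forgotten.
age_branches = [(0, 13), (13, 18), (21, 65)]
gaps = find_gaps((0, 120), age_branches)
print(gaps)  # [(18, 21), (65, 120)]
```

A language with this check built in would refuse to call the conditional complete until both reported ranges were filled in.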

So the language from my dream had this feature. It worked using conditionals, and like Subtext it auto-arranged them - but all in text. That's the dream part, I guess.

Now when I think about it though, I can envision this conditional cleaving happening incrementally - a module can be conditionally complete within itself, but not so wrt other modules in the application.

It had modules/services/components as a first class language construct
Meaning, the "chunks of code" concept that I mentioned above was directly supported by the language. Think COM/XPCOM, or SOA/Web Services - with more accent on the fact that there's a published API backed by an implementation, and less on how they would be discovered, remoted etc. Code by contract, basically.

I don't remember if it had explicit support for constructs such as classes and interfaces that combine to form the components. It wouldn't matter if it did or not, but the key thing was that the component WAS defined, and was the cornerstone of composition.

It was extractably modular
And by this, I mean that chunks of code could be extracted out into modules by the language itself. Not a preprocessor or optimizer - the language itself had operators to extract code, define the API based on the contents of the code, and place it elsewhere. It therefore had meta-programming capabilities, and more importantly, code elements were addressable in a platform- and source file-neutral way. Every unit of execution from the statement to the module was addressable to enable the modularisation operators to work.
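Here is a guess at what "define the API based on the contents of the code" might look like, sketched with Python's ast module. The infer_interface function is a name invented for this example. It only handles straight-line code, and it would wrongly count builtins such as print as inputs.

```python
import ast


def infer_interface(source):
    """Guess the inputs and outputs of a chunk of straight-line code."""
    assigned, inputs = [], []
    for stmt in ast.parse(source).body:
        # Collect reads first, so `x = x + 1` counts x as an input.
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id not in assigned and node.id not in inputs:
                    inputs.append(node.id)
        # Then record the names this statement defines.
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                if node.id not in assigned:
                    assigned.append(node.id)
    return inputs, assigned


chunk = "subtotal = price * quantity\ntotal = subtotal + subtotal * tax_rate\n"
inputs, outputs = infer_interface(chunk)
print(inputs, outputs)
```

Names that are read before the chunk defines them become the parameters of the extracted module, and names the chunk defines become candidates for what it returns.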

It had DRY testing
Defining a module or class implied testing it. The language statically checked for bounds and ranges based on the conditionals. In addition, every run with values could be recorded as a test, adding to the test set.
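The "every run is a test" part can be approximated today with a decorator. This is only a sketch in Python; the recorded decorator, its cases list and its replay method are all made up for this example, as is the discount function.

```python
import functools


def recorded(func):
    """Record each call's arguments and result so it can be replayed later."""
    cases = []

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        cases.append((args, kwargs, result))
        return result

    def replay(candidate=None):
        # Re-run recorded inputs and list the cases whose result has changed.
        target = candidate or func
        return [(a, k, r) for a, k, r in cases if target(*a, **k) != r]

    wrapper.cases = cases
    wrapper.replay = replay
    return wrapper


@recorded
def discount(price, percent):
    return price - price * percent / 100


# Ordinary use of the function quietly builds up the test set.
discount(200, 10)
discount(80, 25)


def broken_discount(price, percent):
    return price - percent


failures = discount.replay(broken_discount)
print(len(discount.cases), len(failures))
```

Replaying the recorded cases against a changed implementation shows which earlier runs no longer give the same answer.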

It supported legacy apps
It had the ability to read old app code, and then show the gaps in logic or discover (if not extract) the true modules. It might have had a compatibility mode, or it might have allowed incomplete (from its perspective) code to run.

It was functional at its core
and used monads for everything - especially the micro ESB.
... but it had a lot of DSL-ish sugar to make it seem more general purpose.

And finally, it was (j)vm based.

Aftermath
After I wrote this far, I got to thinking about what goals a feature set such as this would arise from. Here's what I got:
  • backwards compatibility
  • support multiple programming styles
  • direct support for common use cases in development and deployment