For my diploma project, I chose to build an “advanced text editor”, something along the lines of an IDE. I’m writing it in Ruby. At this point I have a GUI that provides almost everything I need. One feature I thought would be cool for my IDE to have is automatic language detection: you paste some source code into the editor, and it gets highlighted BEFORE you save the file to disk. For this purpose I created the following class:
class LanguageDetector
  def declare_language_arrays
    # declare the language arrays
    @oop = ["ruby","java","c#","c++","scala","php"]
    @scripting = ["ruby","perl","php","python"]
    @all = (@oop+@scripting).uniq
    @text = ["text"]
  end
  def initialize
    declare_language_arrays()
    @score = Hash.new(0)
    @language_map = {
      "public" => @oop,
      "private" => @oop,
      "protected" => @oop,
      "static" => ["java","c","c++","c#"],
      "void" => ["java","c","c++","c#"],
      "main" => ["java","c","c++"],
      "Main" => ["c#"],
      "class" => @oop + ["python"],
      "def" => @scripting - ["perl","php"],
      "begin" => ["ruby","pascal"],
      "end" => ["ruby","pascal"],
      "throw" => @oop,
      "throws" => ["java","c++"],
      "try" => @oop+["python"],
      "catch" => ["java","c++","c#"],
      "except" => ["python"],
      "String" => ["java"],
      "rescue" => ["ruby"],
      "redo" => ["ruby","perl"],
      "next" => ["ruby","perl"],
      "last" => ["ruby","perl"],
      "while" => @oop+["python","perl"],
      "for" => @all,
      "if" => @all,
      "else" => @all,
      "elif" => ["python"],
      "elsif" => ["ruby","perl"],
      "final" => ["java"],
      "del" => ["python"],
      "delete" => ["c++"],
      "free" => ["c"],
      "new" => ["java","c++","c#"],
      "in" => ["python"],
      :default => method(:default_detection)
    }
    # method that detects which language a token belongs to
    # this gets called if a token was not found in the map
    @default = @language_map[:default]
  end
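  def get_tokens(code)
    # return the tokens from the code sent as parameter;
    # split without arguments splits on whitespace and skips leading
    # whitespace, so indented code does not produce an empty first token
    return code.split
  end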
  def get_score
    # return the score hash
    @score
  end
  def get_language(score)
    # process the score hash and return the element with the highest value;
    # should consider case with equal score languages
    max = -1
    language = ""
    # language is key, score is value
    score.keys.each do |key|
      # store the score of the language
      language_score = score[key]
      # if it's bigger, we store it
      if language_score > max
        language = key
        max = language_score
      end
    end
    return language
  end
  # handler for each token
  def process_token(token)
    # obtain the language array for each word
    languages = @language_map[token]
    # if languages array is nil, the token doesn't exist in the map
    if languages.nil?
      # obtain the languages by processing the token with the
      # language detector method
      languages = @default.call(token)
    end
    # compute language score
    languages.each do |language|
      @score[language] += 1
    end
  end
  # detect a language based on the source code sent
  def detect_language(source_code)
    @score.clear
    # split source code into tokens ( should use a lexer here )
    words = get_tokens(source_code)
    # process each token
    words.each do |word|
      process_token(word)
    end
  end
  def default_detection(token)
    if token.start_with?("$")
      return ["perl","ruby"]
    end
    return @text
  end
end
The class is still “very incomplete” (to say the least), but I’ll continue to work on it and improve it. Here is how I envision something like this working: you split the code into tokens (actual tokens, not whitespace-separated chunks as I did here), and you map each token to the languages it can belong to. Each language has a “score” associated with it, which goes up by one for every token that maps to it. When the language detector finishes with the last token, all that remains is to pick the key with the highest score from the score hash. Here is a snippet of how you could use it:
# assumes the class is saved as language_detector.rb next to this script
require_relative "language_detector"
language = LanguageDetector.new
language.detect_language("this is a test")
# this will output text
puts language.get_language(language.get_score)
# because I'm tokenizing on whitespace, I have to put spaces between tokens;
# this will change in a future version
language.detect_language("public static void main ( String [] args )")
# this will output java
puts language.get_language(language.get_score)
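To give an idea of what tokenizing without the extra spaces might look like, here is a minimal sketch built on a regular expression and String#scan. It is not a real lexer and not part of the class above: `TOKEN_PATTERN` and `simple_tokens` are hypothetical names I made up for illustration, and the pattern only knows about identifiers (with an optional `$` or `@` prefix), numbers and single punctuation characters.

```ruby
# Hypothetical helper, not part of LanguageDetector.
# Matches, in order of preference:
#   - an identifier or keyword, optionally prefixed with $ or @
#   - an integer or decimal number
#   - any single character that is neither whitespace nor a word character
TOKEN_PATTERN = /[$@]?[A-Za-z_]\w*|\d+(?:\.\d+)?|[^\s\w]/

def simple_tokens(code)
  # scan returns every non-overlapping match as an array of strings
  code.scan(TOKEN_PATTERN)
end

p simple_tokens("public static void main(String[] args)")
# => ["public", "static", "void", "main", "(", "String", "[", "]", "args", ")"]
```

One thing to watch out for: if tokens like these were fed to process_token, every punctuation character would fall through to default_detection and add a point to "text", so a proper lexer would probably skip or classify punctuation separately.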
I’ll update this class really soon to provide better support for (more) programming languages.
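One gap the comments in get_language already point out is the case where several languages end up with the same score. A minimal sketch of one way to handle it is to return every language that shares the top score and let the caller decide. `top_languages` is a hypothetical helper, not a method of the class above, and the two score hashes are written out by hand from the keyword map rather than produced by running the detector.

```ruby
# Hypothetical helper, not part of LanguageDetector: returns all languages
# that share the highest score, or an empty array for an empty score hash.
def top_languages(score)
  return [] if score.empty?
  best = score.values.max
  score.select { |_language, points| points == best }.keys
end

# Scores worked out by hand for "public static void main ( String [] args )":
# "public" hits every OOP language, "static"/"void"/"main" narrow it down,
# "String" only counts for java, and the four unknown tokens count as text.
JAVA_SCORES = {
  "ruby" => 1, "java" => 5, "c#" => 3, "c++" => 4,
  "scala" => 1, "php" => 1, "c" => 3, "text" => 4
}

# Scores worked out by hand for "$x next": both tokens map to perl and ruby.
TIED_SCORES = { "perl" => 2, "ruby" => 2 }

p top_languages(JAVA_SCORES) # => ["java"]
p top_languages(TIED_SCORES) # => ["perl", "ruby"]
```

With a result like the second one, the editor could fall back to plain text or ask the user, instead of silently picking whichever language happened to be counted first.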