
21 Mar 09

Building an IDE in small steps: language recognition

For my diploma project, I chose to build an “advanced text editor”, something along the lines of an IDE. I’m writing it in Ruby. At this point I have a GUI that provides almost everything I need. One feature I thought would be cool to have is automatic language detection: you paste some source code into the editor, and it highlights it BEFORE you save the file to disk. For this purpose I created the following class:


class LanguageDetector

	def declare_language_arrays
		# declare the language arrays
		@oop = ["ruby","java","c#","c++","scala","php"]
		@scripting = ["ruby","perl","php","python"]
		@all = (@oop+@scripting).uniq
		@text = ["text"]
	end

	def initialize
		declare_language_arrays()
		@score = Hash.new(0)		
		@language_map = {
			"public" => @oop,
			"private" => @oop,
			"protected" => @oop,
			"static" => ["java","c","c++","c#"],
			"void" => ["java","c","c++","c#"],
			"main" => ["java","c","c++"],
			"Main" => ["c#"],
			"class" => @oop + ["python"],
			"def" => @scripting - ["perl","php"],
			"begin" => ["ruby","pascal"],
			"end" => ["ruby","pascal"],
			"throw" => @oop,
			"throws" => ["java"],
			"try" => @oop+["python"],
			"catch" => ["java","c++","c#"],
			"except" => ["python"],
			"String" => ["java"],
			"rescue" => ["ruby"],
			"redo" => ["ruby","perl"],
			"next" => ["ruby","perl"],
			"last" => ["ruby","perl"],
			"while" => @oop+["python","perl"],
			"for" => @all,
			"if" => @all,
			"else" => @all,
			"elif" => ["python"],
			"elsif" => ["ruby","perl"],
			"final" => ["java"],
			"del" => ["python"],
			"delete" => ["c++"],
			"free" => ["c"],
			"new" => ["java","c++","c#"],
			"in" => ["python"],		
			:default => method(:default_detection)
		}
		# method that detects which language a token belongs to
		# this gets called if a token was not found in the map
		@default = @language_map[:default]
	end

	def get_tokens(code)
		# return the tokens from the code sent as parameter
		return code.split(/\s+/)
	end

	def get_score
		# return the score hash		
		@score
	end

	def get_language(score)
		# process the score hash and return the language with the highest score
		# ( "" if the hash is empty ); ties are not handled yet: the first
		# top-scoring language in the hash's iteration order wins
		max = -1
		language = ""
		# language is key, score is value
		score.keys.each do |key|
			# store the score of the language			
			language_score = score[key]
			# if it's bigger, we store it
			if language_score > max
				language = key
				max = language_score
			end
		end
		return language
	end

	# handler for each token
	def process_token(token)
		# obtain the language array for each word
		languages = @language_map[token]
		# if languages array is nil, the token doesn't exist in the map
		if languages.nil?
			# obtain the languages by processing the token with the
			# language detector method
			languages = @default.call(token)			
		end
		# compute language score
		languages.each do |language|
			@score[language] += 1
		end
	end

	# detect a language based on the source code sent
	def detect_language(source_code)
		@score.clear
		# split source code into tokens ( should use a lexer here )		
		words = get_tokens(source_code)
		# process each token
		words.each do |word|
			process_token(word)
		end
	end

	def default_detection(token)
		if token.start_with?("$")
			return ["perl","ruby"]
		end
		return @text
	end

end
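
The comment in get_language notes that equal scores still need handling. One possible way to surface ties, sketched here as a standalone helper, is to return every language that shares the top score and let the caller decide. The name top_languages is hypothetical and not part of the class above, and the sketch assumes Ruby 1.9 or later, where Hash#select returns a Hash.

```ruby
# Hypothetical helper (not part of LanguageDetector): returns all languages
# that share the highest score, sorted alphabetically so the result is stable.
def top_languages(score)
  return [] if score.empty?

  best = score.values.max
  # keep only the entries whose score equals the maximum
  score.select { |_language, value| value == best }.keys.sort
end

score = { "ruby" => 3, "perl" => 3, "java" => 1 }
p top_languages(score)   # prints ["perl", "ruby"]
```

With a result like this, the editor could fall back to plain text, or ask the user, whenever more than one language comes back.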


It’s still very incomplete (to say the least), but I’ll continue to work on it and improve it. Here is how I envision something like this working: you split the code into tokens (actual tokens, not whitespace-separated chunks as I did here) and look up the languages each token can belong to. Each of those languages has a score, which goes up by one for every matching token. When the language detector finishes with the last token, all that remains is to pick the key with the highest score from the score hash. Here is a snippet showing how you could use it:


# assumes language_detector.rb sits next to this script
require_relative "language_detector"

language = LanguageDetector.new
language.detect_language("this is a test")
# this will output text
puts language.get_language(language.get_score)
# because I'm tokenizing on whitespace, I have to put spaces between tokens;
# this will change in a future version
language.detect_language("public static void main ( String [] args )")
# this will output java
puts language.get_language(language.get_score)
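
The need for spaces between tokens, mentioned in the comments above, could be lifted with a regex-based tokenizer. What follows is only a rough sketch, not a real lexer: the names TOKEN_PATTERN and tokenize are hypothetical, and the pattern assumes that identifiers (with an optional leading $ or @ sigil), runs of digits, and single punctuation characters are the only token kinds worth separating.

```ruby
# Hypothetical replacement for get_tokens (an assumption, not the author's code).
# Alternatives, tried in order:
#   [$@]?[A-Za-z_]\w*  an identifier, optionally prefixed with a $ or @ sigil
#   \d+                a run of digits
#   [^\s\w]            any single punctuation character
TOKEN_PATTERN = /[$@]?[A-Za-z_]\w*|\d+|[^\s\w]/

def tokenize(code)
  # String#scan returns every non-overlapping match as an array of strings
  code.scan(TOKEN_PATTERN)
end

p tokenize("public static void main(String[] args)")
# prints ["public", "static", "void", "main", "(", "String", "[", "]", "args", ")"]
```

Punctuation tokens such as "(" would still fall through to default_detection and count toward "text", so they would probably need to be skipped or scored separately before this could replace the whitespace split.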


This class will be updated really soon to provide better support for (more) programming languages.



