Parse and import documentation from XML and DOCX files into LMS

15 Feb 2015 GitHub source code data-wrangling


XML to CSV custom parser. Convert document paragraphs and chapters of technical and law documentation into csv table. ####Code source on GitHub https://github.com/alexbra/xmltocsv

####Applied technologies

Python, ElementTree, BeautifulSoup

###Description There are dozens of law and technical documents and more than 20,000 QUIZ questions stored in xml, html and docx files. The task is parse and convert them into csv files. Further, csv files will be import into learning management system.

####xml files format xml files format

####Parser I used xml.etree.ElementTree Python module to parse xml files and mammoth and BeautifulSoup for docx files. I had a csv file for each document and question file.

  • parts.csv documents structure
book_id book_name name code parent_code type url paragraphs  
book_332 332. Book 2. Chapter 1298385727632_34   folder ../content/12983857.html  
book_332 332. Book   1298385727632_35 1298385727632_34 lesson ../content/12983857.html 1298385727632_36
  • questions.csv testing questions
bookcode partcode qcode qname type is_correct answname bookname partname group
book_332 1298385727632_35 1298385727632_18_2 Name of question multiple 1 answ 332.Book name coll_2008

As a result, I transformed more than 120 documents and 20,000 quiz questions into learning management system. They are ready for learning of 100K+ company workers.