Large data - storage and query

We have a large data set of about 300 million records, which will be updated every 3-6 months. We need to query this data continuously, in real time, to get some information. What are the options: an RDBMS (MySQL), or something else like Hadoop? Which would be better?

Answer1:

300M records is well within the bounds of regular relational databases and live querying should be no problem if you use indexes properly.

Hadoop sounds like overkill unless you really need highly distributed and redundant data, and it will also make it harder to find support when you run into trouble or need help with optimization.
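To illustrate the point about indexes, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL or PostgreSQL. The table name `records`, the column `customer_id`, and the index name are hypothetical and chosen only for this example. It prints the query plan for the same lookup before and after creating the index.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real RDBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, customer_id INTEGER, payload TEXT)"
)
# A small sample; the real table would hold ~300M rows.
conn.executemany(
    "INSERT INTO records (customer_id, payload) VALUES (?, ?)",
    ((i % 1000, f"row-{i}") for i in range(10_000)),
)


def query_plan(connection):
    """Return SQLite's plan description for a lookup by customer_id."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM records WHERE customer_id = ?", (42,)
    ).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_records_customer ON records (customer_id)")
plan_after = query_plan(conn)   # the planner can now search via the index

print(plan_before)
print(plan_after)
```

The exact plan wording differs between SQLite versions and between database engines, but the pattern is the same: check the plan (`EXPLAIN` in MySQL and PostgreSQL) for the queries you actually run, and index the columns they filter on.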

Answer2:

Well, I have a few PostgreSQL databases with some tables with more than 700M records and they are updated all the time.

A query in those tables works very fast (a few milliseconds) and without any problems. Now, my data is pretty simple, and I have indexes on the fields I query.

So I'd say it all depends on what kind of queries you'll be making, and whether you have enough money to spend on fast disks.

Answer3:

As others said, a modern RDBMS can handle such tables, depending on the queries and schema (some optimizations would have to be made). If you have a good key to split the rows by (such as a date column), then partitioning/sharding techniques will help you split the table into several smaller ones.

You can read more on those and other scaling techniques in a question I asked some time ago here - Scaling solutions for MySQL (Replication, Clustering)
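The date-based split mentioned above can be sketched roughly as follows. This is a hand-rolled illustration using Python's sqlite3 module, with hypothetical table names (`events_<year>`) and helper functions invented for the example. MySQL and PostgreSQL both offer native table partitioning, which would normally be preferable to routing rows manually like this.

```python
import sqlite3
from datetime import date


def partition_name(day: date) -> str:
    """Map a date to the (hypothetical) per-year table that stores it."""
    return f"events_{day.year}"


def insert_event(conn, day: date, payload: str) -> None:
    """Route a row to its yearly table, creating the table on first use."""
    table = partition_name(day)
    # The table name is built from an integer year, never from user input.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (day TEXT, payload TEXT)")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (day.isoformat(), payload))


def query_year(conn, year: int):
    """Read one partition only; rows from other years are never touched."""
    table = partition_name(date(year, 1, 1))
    return conn.execute(f"SELECT day, payload FROM {table} ORDER BY day").fetchall()


conn = sqlite3.connect(":memory:")
insert_event(conn, date(2009, 5, 1), "a")
insert_event(conn, date(2010, 1, 15), "b")
insert_event(conn, date(2010, 7, 4), "c")

rows_2010 = query_year(conn, 2010)
print(rows_2010)
```

A query restricted to one year reads only that year's table, which is the benefit partitioning is meant to give: each lookup works against a fraction of the 300M rows.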

Answer4:

300 million records should pose no problems to a top-end RDBMS like Oracle, SQL Server, or DB2. I'm not sure about MySQL, but I'm pretty sure it gets used for some pretty big databases these days.

Answer5:

300 Million does not really count as huge these days :-).

If you are mostly querying, and you know more or less what form the queries will take, then MySQL tables with the appropriate indexes will work just fine.

If you are constantly applying updates at the same time as you are running queries, then choose PostgreSQL, as it has better concurrency handling.

MS SQL Server, Sybase, Oracle and DB2 will all handle these volumes with ease if your company prefers to spend money.

If, on the other hand, you intend to do truly free-format queries on unstructured data, then Hadoop or similar would be a better bet.
